Building Production RAG Systems
Retrieval-Augmented Generation is the standard architecture for AI applications that use private data. But prototypes and production systems have different problems. I built RAG systems in three domains—enterprise memory, consumer Q&A, and multimodal diagnostics—and the hard parts weren't the ones covered in tutorials.
This article covers the architectural decisions, performance patterns, and mistakes from systems handling millions of queries. It also connects implementation choices to cognitive science research that explains why some approaches retrieve better than others.
The same signal pattern that distinguishes a production-grade RAG system from a tutorial also distinguishes content that gets cited by AI search engines from content that doesn't. The 2024 Princeton GEO paper (Aggarwal et al., "GEO: Generative Engine Optimization") found that large language models preferentially surface content that combines named statistics, expert quotes, and outbound references to authoritative sources. This piece is structured that way on purpose: every claim below ties back to a specific system, with numbers.
The Three Systems
Fidelius (Enterprise Memory): A four-network memory system using the HINDSIGHT architecture. It combines pgvector, Neo4j knowledge graphs, and BM25 full-text search with Reciprocal Rank Fusion. Supports multi-tenant isolation, GraphRAG entity extraction, and memory consolidation modeled on cognitive science.
Ask Clay (Consumer Q&A): A board game rules assistant. 90,000+ document chunks, sub-10ms vector search. Pure semantic search with HNSW indexing.
Reparo (Diagnostic Dialog): A plumbing diagnostic system that runs over SMS. No vector search. Uses rule-based retrieval with dynamic context injection and a vision model.
These three systems sit at different points on the complexity spectrum. Each one broke in different ways.
Retrieval Strategy: Hybrid Search
The first design choice is retrieval method. Pure semantic similarity search—the approach in most tutorials—fails in predictable ways.
Why Pure Semantic Search Breaks
Standard RAG tutorials say: embed your documents, embed your query, find nearest neighbors. This works in demos. It breaks in production for three reasons:
Temporal blindness. Vector similarity has no concept of recency. When a user asks "What did we decide about the API?", they want last week's meeting notes, not a similar discussion from 2019.
Entity confusion. "Apple's quarterly earnings" and "apple pie recipe" can produce similar embeddings depending on the model and context. Named entity recognition matters.
Associativity gaps. If Document A mentions "Project Atlas" and Document B mentions "our Q3 initiative" (which is Project Atlas), embedding search won't connect them unless both terms co-occur somewhere in the corpus.
Bartlett's 1932 research on memory reconstruction is relevant here. Human memory doesn't work by passive retrieval—it reconstructs using mental frameworks. RAG systems that treat retrieval as similarity search miss the structural relationships between documents.
Hybrid Retrieval in Fidelius
Fidelius runs four parallel retrieval paths and combines them with Reciprocal Rank Fusion:
    Query → [Semantic Search (pgvector)] →
          → [BM25 Full-Text Search]      → RRF Fusion → Reranking → Results
          → [Knowledge Graph Traversal]  →
          → [Temporal Filtering]         →

Reciprocal Rank Fusion combines results without needing normalized scores:

    score(doc) = Σ_i 1 / (k + rank_i(doc)), where k = 60

Here rank_i(doc) is the document's position in the ranked list from retrieval path i, and the sum runs over every path that returned the document. This is more robust than score averaging because retrieval methods produce scores on incompatible scales.
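As a minimal sketch of the fusion step (not Fidelius's actual code), the formula can be applied to ranked lists of document IDs. The function name and the sample document IDs are illustrative assumptions; ranks are assumed to start at 1.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top result of each list contributes 1 / (k + 1)
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from three of the retrieval paths
semantic = ["doc_a", "doc_b", "doc_c"]
bm25 = ["doc_b", "doc_d", "doc_a"]
graph = ["doc_b", "doc_a"]

fused = reciprocal_rank_fusion([semantic, bm25, graph])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Only ranks enter the calculation, so a cosine similarity of 0.82 and a BM25 score of 14.3 never have to be put on a common scale.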
The system auto-tunes graph weight based on query characteristics:
- High entity density (>30% capitalized words) → favor graph traversal
- Question words detected → moderate graph weight
- Short queries (<5 words) → favor vector search
- Queries ≥5 words → balanced approach
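The rules above could be sketched as a small classifier. This is an illustration under stated assumptions: the profile labels, the question-word list, and the order in which the rules are checked (top to bottom) are hypothetical, and the actual numeric graph weights are not given here, so the sketch returns a label rather than a weight.

```python
# Hypothetical question-word list; the real detector may differ
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how", "which"}


def choose_retrieval_profile(query: str) -> str:
    """Map a query to a retrieval profile using the four heuristics."""
    words = query.split()
    if not words:
        return "balanced"
    # Entity density: share of words that start with a capital letter
    capitalized = sum(1 for word in words if word[0].isupper())
    if capitalized / len(words) > 0.30:
        return "graph_heavy"
    if any(word.lower().strip("?,.!") in QUESTION_WORDS for word in words):
        return "graph_moderate"
    if len(words) < 5:
        return "vector_heavy"
    return "balanced"


print(choose_retrieval_profile("Project Atlas Q3 budget"))
print(choose_retrieval_profile("What did we decide about the API?"))
```

Counting capitalized words is a crude proxy for entity density: a sentence-initial capital counts the same as a proper noun, which is one reason the threshold sits as high as 30%.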
Human memory uses different retrieval strategies for different kinds of questions—episodic recall for personal experiences, semantic memory for facts. The prefrontal cortex coordinates between them. This auto-tuning is a rough analog.
When Simple Wins
Ask Clay uses pure semantic search with HNSW indexing. No hybrid retrieval. This works because:
Domain constraint. Board game rules are self-contained. There's no temporal relevance, no entity confusion across contexts, no relationship traversal needed.
Pre-structured data. Each document chunk is a complete question-answer pair with metadata. The chunking problem is solved during data preparation, not at query time.
Latency requirements. HNSW returns results in 3-10ms. Hybrid retrieval would add complexity without a proportional improvement in answer quality.
The lesson: match retrieval complexity to domain complexity. Structured metadata filtering may be enough. You don't always need GraphRAG.
Chunking
Every RAG tutorial covers chunking—splitting documents into smaller pieces for embedding. The tutorials rarely cover the actual problems.
Chunk Size Trade-offs
Too small (100-200 tokens): context loss. A chunk containing "the defendant" without identifying the defendant is useless for retrieval.
Too large (1000+ tokens): embedding quality degrades. The embedding averages too many concepts and retrieval precision drops.
The production answer depends on domain and retrieval strategy.
Fidelius uses 400-character chunks with zero overlap for general text, AST-aware chunking for code (respecting function boundaries), and clause-level chunking for legal documents.
Ask Clay sidesteps the problem—each JSON entry is a complete, atomic chunk designed during data preparation.
Information Loss at Chunk Boundaries
The harder chunking problem is information loss at boundaries:
    Chunk 1: "...the project was approved by the committee."
    Chunk 2: "They allocated $2.4M for Phase 1..."

"They" in Chunk 2 refers to "the committee" in Chunk 1, but vector search might only retrieve Chunk 2.
Four approaches that work:
- Overlap with deduplication. Include 50-100 tokens of overlap between chunks, then deduplicate at retrieval time.
- Hierarchical chunking. Store both fine-grained chunks (for precision) and paragraph-level chunks (for context). Retrieve fine-grained, expand to paragraph.
- Chunk metadata enrichment. Store surrounding text in metadata:
    {"preceding_context": "...approved by the committee.", "following_context": "They allocated..."}
- Parent document retrieval. Retrieve chunks, then fetch the parent document for full context.
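A minimal sketch of the first approach, overlap with deduplication. It assumes whitespace-separated words stand in for tokens and that each chunk remembers its start offset; the function names and the small window sizes are illustrative, not taken from any of the three systems.

```python
def chunk_with_overlap(tokens, size=200, overlap=50):
    """Split a token list into (start, tokens) windows sharing `overlap` tokens."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append((start, tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


def merge_retrieved(chunks):
    """Join retrieved chunks in document order, dropping tokens already emitted."""
    merged, covered_until = [], 0
    for start, toks in sorted(chunks, key=lambda chunk: chunk[0]):
        end = start + len(toks)
        if end <= covered_until:
            continue  # fully contained in text already emitted
        skip = max(0, covered_until - start)
        merged.extend(toks[skip:])
        covered_until = end
    return merged


text = "the project was approved by the committee. They allocated $2.4M for Phase 1"
chunks = chunk_with_overlap(text.split(), size=8, overlap=3)
print(" ".join(merge_retrieved(chunks)))
```

With the overlap in place, the chunk that starts near "They allocated" also carries "the committee.", so the pronoun keeps its referent even when only that chunk is retrieved. Note that merging non-adjacent chunks simply concatenates them; a real pipeline would likely mark the gap.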
Baddeley's working memory model is a useful frame here. Short-term stores are small (Miller's classic estimate is about seven items, and the phonological loop holds roughly two seconds of speech), but the episodic buffer integrates information across sources. RAG context assembly needs a similar integration step: retrieved chunks must be assembled into coherent context, not just concatenated.
Context Assembly
Retrieval gets more attention, but context assembly—how you arrange retrieved content for the LLM—determines whether the model can use what you found.
The Token Budget Problem
LLMs have finite context windows. Retrieved content competes with system prompts, conversation history, the user query, and output space.
Fidelius enforces token budgets (100-16,000 tokens) with explicit allocation:
    def assemble_context(retrieved_docs, conversation_history, system_prompt, user_query):
        budget = MAX_CONTEXT_TOKENS
        budget -= count_tokens(system_prompt)   # ~500 tokens
        budget -= count_tokens(user_query)      # ~50 tokens
        budget -= RESERVED_OUTPUT_TOKENS        # ~2000 tokens
        # Split remaining budget: up to 30% for history, the rest for retrieval
        history_budget = min(int(budget * 0.3), count_tokens(conversation_history))
        retrieval_budget = budget - history_budget
        return truncate_to_budget(retrieved_docs, retrieval_budget)

Context Ordering
LLMs tend to use information at the start and end of the context more reliably than information in the middle (the "lost in the middle" effect).
Put the most relevant chunk closest to the query (last in the context block). Keep recent conversation messages closer to the current query. Use clear section headers (## Relevant Context, ## Conversation History) to help the model navigate.
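A small sketch of that ordering, assuming each retrieved chunk arrives as a (text, relevance score) pair and the conversation history is a list of messages, oldest first. The function name and the "## Query" header are assumptions made for illustration.

```python
def build_prompt(chunks, history, query):
    """Assemble context so the most relevant chunk ends the context block."""
    # Ascending relevance: the strongest chunk is placed last in its section
    ordered = sorted(chunks, key=lambda chunk: chunk[1])
    parts = ["## Relevant Context"]
    parts += [text for text, _score in ordered]
    parts.append("## Conversation History")
    parts += history  # oldest first, so recent messages sit nearest the query
    parts.append("## Query")
    parts.append(query)
    return "\n\n".join(parts)


prompt = build_prompt(
    chunks=[("chunk-high: setup rules", 0.91), ("chunk-low: scoring variant", 0.42)],
    history=["earlier message", "latest message"],
    query="How many cards does each player start with?",
)
print(prompt)
```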
Ask Clay structures its prompts with explicit sections:
    Game Information
    [title, category, player count, duration]
    Relevant Rules
    [Rule with similarity score]
    [Rule with similarity score]
    Recent Conversation
    [Last 5 messages]
    Instructions
    [Response formatting guidelines]

Dynamic Context Injection
Reparo changes the system prompt based on conversation state:
    def select_system_prompt(context):  # function name is illustrative
        if not context.safety_checked:
            return SAFETY_CHECK_PROMPT
        elif context.is_emergency:
            return EMERGENCY_PROTOCOL_PROMPT
        elif not context.system_type:
            return SYSTEM_CLASSIFICATION_PROMPT
        else:
            return STANDARD_DIAGNOSTIC_PROMPT

An emergency plumbing call needs different retrieved context than a routine diagnostic. The system selects prompts accordingly.
Memory Beyond RAG: The HINDSIGHT Architecture
Standard RAG is stateless—each query retrieves from the same corpus. Production systems need memory that changes over time.
Four Memory Networks
Fidelius implements the HINDSIGHT architecture with four memory networks, based on Tulving's distinction between episodic and semantic memory:
World Memory (Semantic): Objective facts. "Python 3.12 was released in October 2023." No confidence evolution—facts are static until corrected.
Experience Memory (Episodic): First-person action records with timestamps. "User sent quarterly report to Alice on March 15." Filterable by action type, searchable with time constraints.
Opinion Memory (Beliefs): Confidence-weighted preferences that change with use. "User prefers concise emails" with confidence 0.7. Supports reinforcement and decay:
    def reinforce(opinion_id, delta=0.1):
        # current_confidence: stored confidence for opinion_id (lookup omitted)
        new_confidence = min(1.0, current_confidence + delta)

    def decay_unreinforced(grace_period_days=30, decay_rate=0.01):
        # opinions_past_grace_period: opinions not reinforced within grace_period_days
        for opinion in opinions_past_grace_period:
            opinion.confidence = max(0.0, opinion.confidence - decay_rate)

Observation Memory (Patterns): Entity-centric summaries built from the other three networks. Graph-backed with causal relationships: CAUSES, CAUSED_BY, ENABLES, PREVENTS.
This four-network split addresses a real limitation of standard RAG. Pure vector retrieval can't chain related ideas transitively or account for how a user's priorities change over time. A static vector index won't notice that a user stopped caring about Q3 metrics two months ago.
Memory Consolidation
The HINDSIGHT architecture runs periodic offline processing:
    def sleep_cycle():
        # 1. Cluster recent experiences by theme (UMAP + HDBSCAN)
        # 2. Synthesize patterns from clusters
        # 3. Remove duplicates
        # 4. Reduce confidence in unverified opinions
        # 5. Prune outdated information
        ...

This is modeled on human memory consolidation during sleep, where the hippocampus replays recent events and transfers them to long-term storage. The practical result: the memory system gets more focused over time instead of accumulating noise.
Source Monitoring
Every memory in Fidelius is tagged with:
- Source (conversation, document, external API)
- Timestamp (bi-temporal: valid_from, valid_to)
- Confidence level
- Creating application
Johnson's Source Monitoring Framework describes how humans track where information came from. Without source tracking, a RAG system can blend training data with user context and produce plausible but ungrounded answers.
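One way to picture these tags is as fields on every stored record, with validity checked at query time. The sketch below is an assumption about shape, not the Fidelius schema: only `valid_from` and `valid_to` come from the list above, and the class name, remaining field names, and sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class MemoryRecord:
    content: str
    source: str                    # "conversation", "document", or "external_api"
    valid_from: datetime
    valid_to: Optional[datetime]   # None means the fact is still considered valid
    confidence: float
    created_by: str                # the application that created the record


def valid_at(records, moment):
    """Return the records whose validity interval contains `moment`."""
    return [
        record for record in records
        if record.valid_from <= moment
        and (record.valid_to is None or moment < record.valid_to)
    ]


old = MemoryRecord("User tracks Q3 metrics", "conversation",
                   datetime(2024, 1, 1), datetime(2024, 6, 1), 0.8, "chat-app")
new = MemoryRecord("User tracks churn", "document",
                   datetime(2024, 6, 1), None, 0.7, "chat-app")
print([r.content for r in valid_at([old, new], datetime(2024, 7, 1))])
```

Keeping the superseded record with a closed `valid_to`, rather than deleting it, lets the system answer "what did we believe in March?" as well as "what is true now?".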
Multi-Tenancy and Security
Enterprise RAG systems serve multiple users and organizations. Data isolation is a requirement, not a feature.
Row-Level Security
Fidelius enforces tenant isolation at the database level:
    CREATE POLICY tenant_isolation ON documents
        USING (user_id = current_user_id() AND workspace_id = current_workspace_id());

The system verifies RLS is enabled at startup and fails fast if it's disabled. Application-level bugs can't accidentally leak data across tenants.
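A startup check of this kind might look like the sketch below. It assumes a DB-API connection whose driver uses `%s` placeholders (psycopg does) and reads PostgreSQL's `pg_class.relrowsecurity` flag; the function names are hypothetical and the lookup is not schema-qualified, so treat it as a starting point rather than the check Fidelius runs.

```python
def is_rls_enabled(conn, table="documents"):
    """Return True if row-level security is switched on for `table`."""
    cur = conn.cursor()
    # relrowsecurity is set by ALTER TABLE ... ENABLE ROW LEVEL SECURITY
    cur.execute("SELECT relrowsecurity FROM pg_class WHERE relname = %s", (table,))
    row = cur.fetchone()
    return bool(row and row[0])


def assert_rls_enabled(conn, table="documents"):
    """Fail fast at startup instead of serving queries without isolation."""
    if not is_rls_enabled(conn, table):
        raise RuntimeError(f"Row-level security is not enabled on {table!r}")
```

Enabling RLS is not the whole story in PostgreSQL: the table owner bypasses policies unless the table also has FORCE ROW LEVEL SECURITY, so the application should connect as a role that does not own the table or one subject to the forced policy.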
Knowledge Graph Scoping
Neo4j entities are scoped with group IDs:
    group_id = f"workspace_{workspace_id}" if workspace_id else f"user_{user_id}"
    # All MATCH clauses filter by group_id

The OpenClaw Lesson
OpenClaw gained 60,000+ GitHub stars before security researchers found hundreds of misconfigured instances leaking credentials. The cause was architectural: everything stored in local files with no encryption, no access control, and no isolation.
Production RAG systems need:
- Encrypted storage at rest
- Per-user/organization data isolation
- Audit trails for data access
- Granular sharing permissions
- Complete user data deletion
Performance
Embedding Batch Processing
OpenAI recommends batch sizes of 100 for embeddings. Fidelius implements this with exponential backoff:
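    import asyncio
    from openai import APIConnectionError, AsyncOpenAI, RateLimitError
    from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

    client = AsyncOpenAI()  # EMBEDDING_MODEL and BATCH_DELAY are config constants defined elsewhere

    @retry(
        wait=wait_exponential(multiplier=1, min=4, max=60),
        stop=stop_after_attempt(5),
        retry=retry_if_exception_type((RateLimitError, APIConnectionError)),
    )
    async def embed_batch(batch: list[str]) -> list[list[float]]:
        # Retry per batch so one failure does not redo batches that already succeeded
        response = await client.embeddings.create(model=EMBEDDING_MODEL, input=batch)
        return [item.embedding for item in response.data]

    async def create_embeddings(texts: list[str], batch_size: int = 100) -> list[list[float]]:
        embeddings = []
        for i in range(0, len(texts), batch_size):
            embeddings.extend(await embed_batch(texts[i:i + batch_size]))
            await asyncio.sleep(BATCH_DELAY)
        return embeddings

HNSW Index Optimization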
Ask Clay went from 3.84s to 6ms query latency (640x) by fixing one query pattern:
The problem: WHERE clause filtering disabled HNSW index optimization.
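    -- Slow: prevents HNSW optimization
    SELECT * FROM documents
    WHERE content_type = 'rule'
    ORDER BY embedding <-> query_embedding LIMIT 5;

    # Fast: HNSW search first, then filter in application code
    results = vector_search(query_embedding, limit=20)
    filtered = [r for r in results if r.content_type == 'rule'][:5]

Over-fetching (20 candidates for 5 results) leaves room for the filter, though a very selective filter can still return fewer than 5 rows.

Caching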
- Embedding cache (LRU): SHA256 hash of text → embedding. 20-30% hit rate on repeated queries.
- Game metadata cache: LRU with 128 entries. Avoids repeated database lookups.
- Neo4j connection pooling: 50 max connections, 30s timeout, 3600s max lifetime.
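The embedding cache in the first bullet can be sketched with the standard library alone. The class name, the default capacity, and the `embed_fn` callable are assumptions for illustration; any function that maps text to a vector would fit.

```python
import hashlib
from collections import OrderedDict


class EmbeddingCache:
    """LRU cache keyed by the SHA-256 hash of the input text."""

    def __init__(self, embed_fn, max_entries=1024):
        self.embed_fn = embed_fn
        self.max_entries = max_entries
        self._store = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        embedding = self.embed_fn(text)
        self._store[key] = embedding
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict the least recently used entry
        return embedding


# Stand-in embedder for demonstration; a real one would call the embedding API
cache = EmbeddingCache(embed_fn=lambda text: [float(len(text))], max_entries=2)
cache.get("how do I score?")
cache.get("how do I score?")
print(cache.hits, cache.misses)  # 1 1
```

Hashing the text keeps keys a fixed size regardless of query length, and the hit and miss counters make it easy to measure a hit rate like the one quoted above.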
Streaming
For long generations, stream responses with Server-Sent Events to reduce perceived latency:
    async def stream_response():
        async for chunk in llm.stream(prompt):
            yield f"data: {json.dumps({'delta': chunk})}\n\n"

Ask Clay sends only deltas: each event carries new content, not the accumulated text.
Multimodal Retrieval
Reparo shows how vision models integrate with RAG-like systems.
Structured Vision Output
Instead of free-form image descriptions, Reparo uses structured output:
    from typing import Literal
    from pydantic import BaseModel

    class PlumberTicketAnalysis(BaseModel):
        system: str                  # "Water Heater (Gas)"
        category: str                # One of 8 categories
        priority: Literal["emergency", "urgent", "routine"]
        confidence: Literal["high", "medium", "low"]
        visual_evidence: list[str]
        materials_identified: list[str]
        truck_loadout_primary: list[str]
        estimated_time: str

Structured output is queryable: you can filter and rank vision analysis results the same way you handle text retrieval.
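To illustrate what "queryable" buys you, the sketch below filters and ranks analyses by their structured fields. It uses plain dictionaries with the same field names so it runs without pydantic; the `triage` function, the ordering tables, and the sample category values are hypothetical.

```python
PRIORITY_ORDER = {"emergency": 0, "urgent": 1, "routine": 2}
CONFIDENCE_ORDER = {"high": 0, "medium": 1, "low": 2}


def triage(analyses, category=None):
    """Optionally filter by category, then sort by priority and confidence."""
    selected = [a for a in analyses if category is None or a["category"] == category]
    return sorted(
        selected,
        key=lambda a: (PRIORITY_ORDER[a["priority"]], CONFIDENCE_ORDER[a["confidence"]]),
    )


# Sample analyses; the category values are made up for the example
analyses = [
    {"system": "Kitchen Sink", "category": "drain", "priority": "routine", "confidence": "high"},
    {"system": "Water Heater (Gas)", "category": "water_heater", "priority": "emergency", "confidence": "medium"},
    {"system": "Water Heater (Gas)", "category": "water_heater", "priority": "urgent", "confidence": "high"},
]
for analysis in triage(analyses):
    print(analysis["priority"], analysis["system"])
```

None of this would be possible against a free-form description such as "looks like a leaking gas water heater" without a second parsing step.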
Customer Verification
Vision findings require human verification before the system treats them as facts:
    async def confirm_image_finding(finding_index: int, verified: bool, correction: str | None = None):
        # finding: the stored vision finding at finding_index (lookup omitted)
        if verified:
            record_as_confirmed_symptom(finding)
        elif correction:
            record_as_corrected_symptom(correction)

The system tracks whether a finding came from AI analysis or human confirmation, which is another application of source monitoring.
Choosing a RAG Architecture
When designing a RAG system, these questions determine most of the architecture:
1. Retrieval requirements
| Situation | Approach |
|---|---|
| Static corpus, simple queries | Semantic search with HNSW |
| Temporal relevance matters | Add time-weighted retrieval |
| Entity relationships matter | Add knowledge graph |
| Keyword precision matters | Add BM25 hybrid search |
2. How context changes
| Situation | Approach |
|---|---|
| Stateless queries | Standard RAG |
| Conversation memory needed | Add session history |
| Long-term user preferences | Add persistent memory |
| Beliefs that update | Add confidence-weighted opinions with decay |
3. Latency requirements
| Budget | Options |
|---|---|
| <100ms | Pure vector search, aggressive caching |
| <500ms | Add reranking, light hybrid retrieval |
| <2s | Full hybrid retrieval, graph traversal |
| Async OK | Multi-stage pipelines |
4. Security requirements
| Situation | Approach |
|---|---|
| Single tenant | Application-level isolation |
| Multi-tenant | Database-level RLS |
| Regulated industry | Add audit trails, data lineage, deletion |
Conclusion
The main takeaway from building these systems: RAG architecture is cognitive architecture. The patterns that work correspond to how human memory works:
- Hybrid retrieval corresponds to the interplay between episodic and semantic memory
- Memory consolidation corresponds to sleep-based learning
- Source monitoring corresponds to how humans track where they learned something
- Context window constraints correspond to working memory limits
Production RAG is not about building bigger vector indexes. It's about building systems that retrieve the right information, from the right sources, at the right time.