Technical

Building Production RAG Systems

February 3, 2026 · 15 min read

Retrieval-Augmented Generation is the standard architecture for AI applications that use private data. But prototypes and production systems have different problems. I built RAG systems in three domains—enterprise memory, consumer Q&A, and multimodal diagnostics—and the hard parts weren't the ones covered in tutorials.

This article covers the architectural decisions, performance patterns, and mistakes from systems handling millions of queries. It also connects implementation choices to cognitive science research that explains why some approaches retrieve better than others.

The same signal pattern that distinguishes a production-grade RAG system from a tutorial also distinguishes content that gets cited by AI search engines from content that doesn't. The 2024 Princeton GEO paper (Aggarwal et al., "GEO: Generative Engine Optimization") found that large language models preferentially surface content that combines named statistics, expert quotes, and outbound references to authoritative sources. This piece is structured that way on purpose: every claim below ties back to a specific system, with numbers.

The Three Systems

Fidelius (Enterprise Memory): A four-network memory system using the HINDSIGHT architecture. It combines pgvector, Neo4j knowledge graphs, and BM25 full-text search with Reciprocal Rank Fusion. Supports multi-tenant isolation, GraphRAG entity extraction, and memory consolidation modeled on cognitive science.

Ask Clay (Consumer Q&A): A board game rules assistant. 90,000+ document chunks, sub-10ms vector search. Pure semantic search with HNSW indexing.

Reparo (Diagnostic Dialog): A plumbing diagnostic system that runs over SMS. No vector search. Uses rule-based retrieval with dynamic context injection and a vision model.

These three systems sit at different points on the complexity spectrum. Each one broke in different ways.

Retrieval Strategy: Hybrid Search

The first design choice is retrieval method. Pure semantic similarity search—the approach in most tutorials—fails in predictable ways.

Why Pure Semantic Search Breaks

Standard RAG tutorials say: embed your documents, embed your query, find nearest neighbors. This works in demos. It breaks in production for three reasons:

Temporal blindness. Vector similarity has no concept of recency. When a user asks "What did we decide about the API?", they want last week's meeting notes, not a similar discussion from 2019.

Entity confusion. "Apple's quarterly earnings" and "apple pie recipe" can produce similar embeddings depending on the model and context. Named entity recognition matters.

Associativity gaps. If Document A mentions "Project Atlas" and Document B mentions "our Q3 initiative" (which is Project Atlas), embedding search won't connect them unless some document in the corpus uses both names together.

Bartlett's 1932 research on memory reconstruction is relevant here. Human memory doesn't work by passive retrieval—it reconstructs using mental frameworks. RAG systems that treat retrieval as similarity search miss the structural relationships between documents.

Hybrid Retrieval in Fidelius

Fidelius runs four parallel retrieval paths and combines them with Reciprocal Rank Fusion:

Query → [Semantic Search (pgvector)]  →
      → [BM25 Full-Text Search]       → RRF Fusion → Reranking → Results
      → [Knowledge Graph Traversal]   →
      → [Temporal Filtering]          →

Reciprocal Rank Fusion combines results without needing normalized scores:

score(doc) = Σ 1/(k + rank_i(doc))    where k=60

This is more robust than score averaging because retrieval methods produce scores on incompatible scales.
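The formula above can be sketched in a few lines. This is a minimal illustration, not the Fidelius implementation: the function name, the document IDs, and the three example rankings are assumptions made up for the example. Each input list is ordered best-first, and a document missing from a list simply contributes nothing for that retriever.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several best-first lists of document IDs into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based, matching score(doc) = sum of 1 / (k + rank_i(doc))
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical results from three retrieval paths
semantic = ["doc_a", "doc_b", "doc_c"]
bm25 = ["doc_b", "doc_d", "doc_a"]
graph = ["doc_b", "doc_a"]

fused = reciprocal_rank_fusion([semantic, bm25, graph])
```

Only ranks enter the sum, so a BM25 score of 14.2 and a cosine similarity of 0.83 never have to be put on a common scale.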

The system auto-tunes graph weight based on query characteristics:

High entity density (>30% capitalized words) → favor graph traversal
Question words detected → moderate graph weight
Short queries (<5 words) → favor vector search
Queries ≥5 words → balanced approach
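One way to read those rules is as a small heuristic function. The sketch below is an assumption-heavy illustration: the numeric weights, the question-word list, and the precedence (rules checked in the order listed) are not taken from Fidelius and would need tuning against real queries.

```python
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how", "which"}


def graph_weight(query: str) -> float:
    """Return an illustrative graph-traversal weight in [0, 1] for a query."""
    words = query.split()
    if not words:
        return 0.5  # nothing to go on: stay balanced

    # Rough entity-density proxy: share of words that start with a capital letter
    capitalized = sum(1 for word in words if word[0].isupper())
    if capitalized / len(words) > 0.30:
        return 0.7  # favor graph traversal

    if any(word.lower().strip("?.,!") in QUESTION_WORDS for word in words):
        return 0.4  # moderate graph weight

    if len(words) < 5:
        return 0.2  # short query: favor vector search

    return 0.5  # balanced approach
```

Capitalization is a crude stand-in for entity detection (sentence-initial words count too), which is why the threshold sits at 30% and not lower.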

Human memory uses different retrieval strategies for different kinds of questions—episodic recall for personal experiences, semantic memory for facts. The prefrontal cortex coordinates between them. This auto-tuning is a rough analog.

When Simple Wins

Ask Clay uses pure semantic search with HNSW indexing. No hybrid retrieval. This works because:

Domain constraint. Board game rules are self-contained. There's no temporal relevance, no entity confusion across contexts, no relationship traversal needed.

Pre-structured data. Each document chunk is a complete question-answer pair with metadata. The chunking problem is solved during data preparation, not at query time.

Latency requirements. HNSW returns results in 3-10ms. Hybrid retrieval would add complexity without a proportional improvement in answer quality.

The lesson: match retrieval complexity to domain complexity. Structured metadata filtering may be enough. You don't always need GraphRAG.

Chunking

Every RAG tutorial covers chunking—splitting documents into smaller pieces for embedding. The tutorials rarely cover the actual problems.

Chunk Size Trade-offs

Too small (100-200 tokens): context loss. A chunk containing "the defendant" without identifying the defendant is useless for retrieval.

Too large (1000+ tokens): embedding quality degrades. The embedding averages too many concepts and retrieval precision drops.

The production answer depends on domain and retrieval strategy.

Fidelius uses 400-character chunks with zero overlap for general text, AST-aware chunking for code (respecting function boundaries), and clause-level chunking for legal documents.

Ask Clay sidesteps the problem—each JSON entry is a complete, atomic chunk designed during data preparation.

Information Loss at Chunk Boundaries

The harder chunking problem is information loss at boundaries:

Chunk 1: "...the project was approved by the committee."
Chunk 2: "They allocated $2.4M for Phase 1..."

"They" in Chunk 2 refers to "the committee" in Chunk 1, but vector search might only retrieve Chunk 2.

Four approaches that work:

Overlap with deduplication. Include 50-100 tokens of overlap between chunks, then deduplicate at retrieval time.
Hierarchical chunking. Store both fine-grained chunks (for precision) and paragraph-level chunks (for context). Retrieve fine-grained, expand to paragraph.
Chunk metadata enrichment. Store surrounding text in metadata: {"preceding_context": "...approved by the committee.", "following_context": "They allocated..."}.
Parent document retrieval. Retrieve chunks, then fetch the parent document for full context.
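The first approach, overlap with deduplication, can be sketched as follows. This is a simplified illustration that treats a pre-split list of strings as tokens; the function names and default sizes are assumptions, and a real pipeline would use the embedding model's tokenizer.

```python
def chunk_with_overlap(tokens: list[str], size: int = 200, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


def merge_adjacent(left: list[str], right: list[str]) -> list[str]:
    """Join two retrieved chunks, dropping the longest suffix of `left` that is also a prefix of `right`."""
    for n in range(min(len(left), len(right)), 0, -1):
        if left[-n:] == right[:n]:
            return left + right[n:]
    return left + right  # no shared boundary text: plain concatenation


tokens = [str(i) for i in range(10)]
chunks = chunk_with_overlap(tokens, size=4, overlap=2)
merged = merge_adjacent(chunks[0], chunks[1])
```

With overlap in place, a pronoun at the start of one chunk usually has its antecedent inside the same chunk, and `merge_adjacent` keeps the shared text from reaching the model twice when both neighbors are retrieved.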

Baddeley's working memory model is a useful frame here. The phonological loop holds only a few seconds of verbal material, but the episodic buffer integrates information across sources. RAG context assembly needs a similar integration step: retrieved chunks must be assembled into coherent context, not just concatenated.

Context Assembly

Retrieval gets more attention, but context assembly—how you arrange retrieved content for the LLM—determines whether the model can use what you found.

The Token Budget Problem

LLMs have finite context windows. Retrieved content competes with system prompts, conversation history, the user query, and output space.

Fidelius enforces token budgets (100-16,000 tokens) with explicit allocation:

python

def assemble_context(retrieved_docs, conversation_history, system_prompt, user_query):
    budget = MAX_CONTEXT_TOKENS
    budget -= count_tokens(system_prompt)       # ~500 tokens
    budget -= count_tokens(user_query)          # ~50 tokens
    budget -= RESERVED_OUTPUT_TOKENS            # ~2000 tokens
    budget = max(budget, 0)                     # fixed costs may exceed the window

    # History gets at most 30% of what remains; retrieval gets the rest
    history_budget = min(int(budget * 0.3), count_tokens(conversation_history))
    retrieval_budget = budget - history_budget

    return truncate_to_budget(retrieved_docs, retrieval_budget)

Context Ordering

LLMs weight information at the start and end of context more heavily than information in the middle (the "lost in the middle" effect).

Put the most relevant chunk closest to the query (last in the context block). Keep recent conversation messages closer to the current query. Use clear section headers (## Relevant Context, ## Conversation History) to help the model navigate.
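A minimal sketch of that ordering, assuming each retrieved chunk arrives as a (text, relevance score) pair where a higher score means a better match. The function name and the "## Question" header are assumptions for illustration.

```python
def build_prompt(chunks: list[tuple[str, float]], history: list[str], query: str) -> str:
    """Assemble prompt sections so the strongest material sits nearest the query.

    chunks: (text, relevance) pairs, higher relevance = better match.
    history: conversation messages, oldest first.
    """
    # Ascending sort puts the most relevant chunk last in the context block
    ordered = [text for text, _ in sorted(chunks, key=lambda chunk: chunk[1])]
    sections = [
        "## Relevant Context\n" + "\n\n".join(ordered),
        "## Conversation History\n" + "\n".join(history),
        "## Question\n" + query,
    ]
    return "\n\n".join(sections)


prompt = build_prompt(
    chunks=[("best match", 0.91), ("weak match", 0.42), ("fair match", 0.67)],
    history=["user: How many players?", "assistant: Two to four."],
    query="Can we play with five?",
)
```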

Ask Clay structures its prompts with explicit sections:

Game Information
[title, category, player count, duration]

Relevant Rules
[Rule with similarity score]
[Rule with similarity score]

Recent Conversation
[Last 5 messages]

Instructions
[Response formatting guidelines]

Dynamic Context Injection

Reparo changes the system prompt based on conversation state:

python

def select_system_prompt(context):
    # Checks run in priority order: safety first, then emergencies
    if not context.safety_checked:
        return SAFETY_CHECK_PROMPT
    elif context.is_emergency:
        return EMERGENCY_PROTOCOL_PROMPT
    elif not context.system_type:
        return SYSTEM_CLASSIFICATION_PROMPT
    else:
        return STANDARD_DIAGNOSTIC_PROMPT

An emergency plumbing call needs different retrieved context than a routine diagnostic. The system selects prompts accordingly.

Memory Beyond RAG: The HINDSIGHT Architecture

Standard RAG is stateless—each query retrieves from the same corpus. Production systems need memory that changes over time.

Four Memory Networks

Fidelius implements the HINDSIGHT architecture with four memory networks, based on Tulving's distinction between episodic and semantic memory:

World Memory (Semantic): Objective facts. "Python 3.12 was released in October 2023." No confidence evolution—facts are static until corrected.

Experience Memory (Episodic): First-person action records with timestamps. "User sent quarterly report to Alice on March 15." Filterable by action type, searchable with time constraints.

Opinion Memory (Beliefs): Confidence-weighted preferences that change with use. "User prefers concise emails" with confidence 0.7. Supports reinforcement and decay:

python

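from datetime import datetime, timedelta, timezone

def reinforce(opinion, delta=0.1):
    # Each confirming observation raises confidence, capped at 1.0
    opinion.confidence = min(1.0, opinion.confidence + delta)

def decay_unreinforced(opinions, grace_period_days=30, decay_rate=0.01):
    # `last_reinforced` is an assumed timezone-aware timestamp on each opinion
    cutoff = datetime.now(timezone.utc) - timedelta(days=grace_period_days)
    for opinion in opinions:
        if opinion.last_reinforced < cutoff:
            opinion.confidence = max(0.0, opinion.confidence - decay_rate)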

Observation Memory (Patterns): Entity-centric summaries built from the other three networks. Graph-backed with causal relationships: CAUSES, CAUSED_BY, ENABLES, PREVENTS.

This four-network split addresses a real limitation of standard RAG. Pure vector retrieval can't chain related ideas transitively or account for how a user's priorities change over time. A static vector index won't notice that a user stopped caring about Q3 metrics two months ago.

Memory Consolidation

The HINDSIGHT architecture runs periodic offline processing:

python

def sleep_cycle():
    """Periodic offline consolidation pass (outline of the steps only)."""
    # 1. Cluster recent experiences by theme (UMAP + HDBSCAN)
    # 2. Synthesize patterns from clusters
    # 3. Remove duplicates
    # 4. Reduce confidence in unverified opinions
    # 5. Prune outdated information

This is modeled on human memory consolidation during sleep, where the hippocampus replays recent events and transfers them to long-term storage. The practical result: the memory system gets more focused over time instead of accumulating noise.

Source Monitoring

Every memory in Fidelius is tagged with:

Source (conversation, document, external API)
Timestamp (bi-temporal: valid_from, valid_to)
Confidence level
Creating application

Johnson's Source Monitoring Framework describes how humans track where information came from. Without source tracking, a RAG system can blend training data with user context and produce plausible but ungrounded answers.
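As a rough sketch, the tags listed above could be carried on each memory as a small record type. The class name, field names other than valid_from and valid_to, and the example values are assumptions for illustration, not the Fidelius schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Literal, Optional


@dataclass(frozen=True)
class MemoryRecord:
    content: str
    source: Literal["conversation", "document", "external_api"]
    valid_from: datetime
    valid_to: Optional[datetime]  # None means the memory is still considered valid
    confidence: float
    created_by: str  # the application that wrote this memory

    def is_valid_at(self, moment: datetime) -> bool:
        """True if `moment` falls in the half-open interval [valid_from, valid_to)."""
        return self.valid_from <= moment and (self.valid_to is None or moment < self.valid_to)


record = MemoryRecord(
    content="User prefers concise emails",
    source="conversation",
    valid_from=datetime(2025, 1, 1, tzinfo=timezone.utc),
    valid_to=datetime(2025, 6, 1, tzinfo=timezone.utc),
    confidence=0.7,
    created_by="email_assistant",
)
```

Keeping the validity interval on the record lets retrieval answer "what did we believe in March?" without deleting superseded memories.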

Multi-Tenancy and Security

Enterprise RAG systems serve multiple users and organizations. Data isolation is a requirement, not a feature.

Row-Level Security

Fidelius enforces tenant isolation at the database level:

sql

CREATE POLICY tenant_isolation ON documents
    USING (user_id = current_user_id() AND workspace_id = current_workspace_id());

The system verifies RLS is enabled at startup and fails fast if it's disabled. Application-level bugs can't accidentally leak data across tenants.
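A startup check along these lines might look like the sketch below. It assumes a DB-API cursor that uses %s placeholders (as psycopg does) and reads PostgreSQL's pg_class.relrowsecurity flag; the function name and error message are hypothetical.

```python
def assert_rls_enabled(cursor, table: str = "documents") -> None:
    """Raise at startup if row-level security is not enabled on `table`."""
    cursor.execute(
        "SELECT relrowsecurity FROM pg_class WHERE oid = %s::regclass",
        (table,),
    )
    row = cursor.fetchone()
    if row is None or not row[0]:
        # Fail fast: refuse to serve traffic without database-level isolation
        raise RuntimeError(f"Row-level security is not enabled on {table!r}")
```

Note that in PostgreSQL the table owner bypasses policies unless FORCE ROW LEVEL SECURITY is also set, so the application should connect as a non-owner role or the check should cover relforcerowsecurity as well.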

Knowledge Graph Scoping

Neo4j entities are scoped with group IDs:

python

group_id = f"workspace_{workspace_id}" if workspace_id else f"user_{user_id}"
# All MATCH clauses filter by group_id

The OpenClaw Lesson

OpenClaw gained 60,000+ GitHub stars before security researchers found hundreds of misconfigured instances leaking credentials. The cause was architectural: everything stored in local files with no encryption, no access control, and no isolation.

Production RAG systems need:

Encrypted storage at rest
Per-user/organization data isolation
Audit trails for data access
Granular sharing permissions
Complete user data deletion

Performance

Embedding Batch Processing

Fidelius sends embedding requests in batches of 100 texts and retries failed batches with exponential backoff:

python

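import asyncio
from openai import APIConnectionError, AsyncOpenAI, RateLimitError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

client = AsyncOpenAI()

@retry(
    wait=wait_exponential(multiplier=1, min=4, max=60),
    stop=stop_after_attempt(5),
    retry=retry_if_exception_type((RateLimitError, APIConnectionError)),
)
async def embed_batch(batch: list[str]) -> list[list[float]]:
    # Retry per batch so one failure doesn't re-embed earlier batches
    response = await client.embeddings.create(model=EMBEDDING_MODEL, input=batch)
    return [item.embedding for item in response.data]

async def create_embeddings(texts: list[str], batch_size: int = 100) -> list[list[float]]:
    embeddings: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        embeddings.extend(await embed_batch(texts[i:i + batch_size]))
        await asyncio.sleep(BATCH_DELAY)  # configured pause between batches, in seconds
    return embeddings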

HNSW Index Optimization

Ask Clay went from 3.84s to 6ms query latency (640x) by fixing one query pattern:

The problem: WHERE clause filtering disabled HNSW index optimization.

sql

-- Slow: the WHERE filter prevents HNSW optimization
SELECT * FROM documents
WHERE content_type = 'rule'
ORDER BY embedding <-> query_embedding LIMIT 5;

python

# Fast: HNSW search first, filter after.
# Over-fetch (20) so enough rows survive the filter; it can still return fewer than 5.
results = vector_search(query_embedding, limit=20)
filtered = [r for r in results if r.content_type == 'rule'][:5]

Caching

Embedding cache (LRU): SHA256 hash of text → embedding. 20-30% hit rate on repeated queries.
Game metadata cache: LRU with 128 entries. Avoids repeated database lookups.
Neo4j connection pooling: 50 max connections, 30s timeout, 3600s max lifetime.
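The embedding cache can be sketched as a small LRU keyed by the SHA256 hash of the text. This is an illustration under assumptions: the class name, the default capacity, and the injected embed_fn callable are hypothetical, and a production cache would also need to be safe for concurrent use.

```python
import hashlib
from collections import OrderedDict
from typing import Callable


class EmbeddingCache:
    """LRU cache mapping SHA256(text) to its embedding vector."""

    def __init__(self, embed_fn: Callable[[str], list[float]], max_entries: int = 10_000):
        self._embed_fn = embed_fn
        self._max_entries = max_entries
        self._store: OrderedDict[str, list[float]] = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        vector = self._embed_fn(text)
        self._store[key] = vector
        if len(self._store) > self._max_entries:
            self._store.popitem(last=False)  # evict the least recently used entry
        return vector
```

Hashing the text keeps keys a fixed size regardless of query length, and the hit and miss counters make it easy to check whether the cache reaches the hit rate you expect.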

Streaming

For long generations, stream responses with Server-Sent Events to reduce perceived latency:

python

import json

async def stream_response(prompt):
    # One SSE event per delta; the blank line terminates each event
    async for chunk in llm.stream(prompt):
        yield f"data: {json.dumps({'delta': chunk})}\n\n"

Ask Clay sends only deltas—new content per event, not accumulated text.

Multimodal Retrieval

Reparo shows how vision models integrate with RAG-like systems.

Structured Vision Output

Instead of free-form image descriptions, Reparo uses structured output:

python

from typing import Literal
from pydantic import BaseModel

class PlumberTicketAnalysis(BaseModel):
    system: str                                    # "Water Heater (Gas)"
    category: str                                  # One of 8 categories
    priority: Literal["emergency", "urgent", "routine"]
    confidence: Literal["high", "medium", "low"]
    visual_evidence: list[str]
    materials_identified: list[str]
    truck_loadout_primary: list[str]
    estimated_time: str

Structured output is queryable—you can filter and rank vision analysis results the same way you handle text retrieval.

Customer Verification

Vision findings require human verification before the system treats them as facts:

python

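async def confirm_image_finding(finding_index: int, verified: bool, correction: str | None = None):
    # `image_findings` is assumed to hold the pending findings from the vision model
    finding = image_findings[finding_index]
    if verified:
        record_as_confirmed_symptom(finding)
    elif correction:
        record_as_corrected_symptom(correction)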

The system tracks whether a finding came from AI analysis or human confirmation—another application of source monitoring.

Choosing a RAG Architecture

When designing a RAG system, these questions determine most of the architecture:

1. Retrieval requirements

Situation	Approach
Static corpus, simple queries	Semantic search with HNSW
Temporal relevance matters	Add time-weighted retrieval
Entity relationships matter	Add knowledge graph
Keyword precision matters	Add BM25 hybrid search

2. How context changes

Situation	Approach
Stateless queries	Standard RAG
Conversation memory needed	Add session history
Long-term user preferences	Add persistent memory
Beliefs that update	Add confidence-weighted opinions with decay

3. Latency requirements

Budget	Options
<100ms	Pure vector search, aggressive caching
<500ms	Add reranking, light hybrid retrieval
<2s	Full hybrid retrieval, graph traversal
Async OK	Multi-stage pipelines

4. Security requirements

Situation	Approach
Single tenant	Application-level isolation
Multi-tenant	Database-level RLS
Regulated industry	Add audit trails, data lineage, deletion

Conclusion

The main takeaway from building these systems: RAG architecture is cognitive architecture. The patterns that work correspond to how human memory works:

Hybrid retrieval corresponds to the interplay between episodic and semantic memory
Memory consolidation corresponds to sleep-based learning
Source monitoring corresponds to how humans track where they learned something
Context window constraints correspond to working memory limits

Production RAG is not about building bigger vector indexes. It's about building systems that retrieve the right information, from the right sources, at the right time.
