
    Building Production RAG Systems

    February 3, 2026 · 15 min read

    Retrieval-Augmented Generation is the standard architecture for AI applications that use private data. But prototypes and production systems have different problems. I built RAG systems in three domains—enterprise memory, consumer Q&A, and multimodal diagnostics—and the hard parts weren't the ones covered in tutorials.

    This article covers the architectural decisions, performance patterns, and mistakes from systems handling millions of queries. It also connects implementation choices to cognitive science research that explains why some approaches retrieve better than others.

    The same signal pattern that distinguishes a production-grade RAG system from a tutorial also distinguishes content that gets cited by AI search engines from content that doesn't. The 2024 Princeton GEO paper (Aggarwal et al., "GEO: Generative Engine Optimization") found that large language models preferentially surface content that combines named statistics, expert quotes, and outbound references to authoritative sources. This piece is structured that way on purpose: every claim below ties back to a specific system, with numbers.

    The Three Systems

    Fidelius (Enterprise Memory): A four-network memory system using the HINDSIGHT architecture. It combines pgvector, Neo4j knowledge graphs, and BM25 full-text search with Reciprocal Rank Fusion. Supports multi-tenant isolation, GraphRAG entity extraction, and memory consolidation modeled on cognitive science.

    Ask Clay (Consumer Q&A): A board game rules assistant. 90,000+ document chunks, sub-10ms vector search. Pure semantic search with HNSW indexing.

    Reparo (Diagnostic Dialog): A plumbing diagnostic system that runs over SMS. No vector search. Uses rule-based retrieval with dynamic context injection and a vision model.

    These three systems sit at different points on the complexity spectrum. Each one broke in different ways.

    The first design choice is retrieval method. Pure semantic similarity search—the approach in most tutorials—fails in predictable ways.

    Why Pure Semantic Search Breaks

    Standard RAG tutorials say: embed your documents, embed your query, find nearest neighbors. This works in demos. It breaks in production for three reasons:

    Temporal blindness. Vector similarity has no concept of recency. When a user asks "What did we decide about the API?", they want last week's meeting notes, not a similar discussion from 2019.

    Entity confusion. "Apple's quarterly earnings" and "apple pie recipe" can produce similar embeddings depending on the model and context. Named entity recognition matters.

    Associativity gaps. If Document A mentions "Project Atlas" and Document B mentions "our Q3 initiative" (which is Project Atlas), embedding search won't connect them unless both terms co-occur somewhere in the corpus.

    Bartlett's 1932 research on memory reconstruction is relevant here. Human memory doesn't work by passive retrieval—it reconstructs using mental frameworks. RAG systems that treat retrieval as similarity search miss the structural relationships between documents.

    Hybrid Retrieval in Fidelius

    Fidelius runs four parallel retrieval paths and combines them with Reciprocal Rank Fusion:

    Query → [Semantic Search (pgvector)]  →
          → [BM25 Full-Text Search]       → RRF Fusion → Reranking → Results
          → [Knowledge Graph Traversal]   →
          → [Temporal Filtering]          →

    Reciprocal Rank Fusion combines results without needing normalized scores:

    score(doc) = Σ 1/(k + rank_i(doc))    where k=60

    This is more robust than score averaging because retrieval methods produce scores on incompatible scales.
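
    The fusion formula above can be sketched in a few lines. This is a minimal illustration, not the Fidelius implementation: the function name, the use of document IDs as strings, and the sample rankings are all assumptions. Ranks are 1-based, and a document that is missing from one retriever's list simply contributes nothing for that retriever.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of document IDs into one list sorted by RRF score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # enumerate from 1 so the top result of each retriever has rank 1
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical results from three retrieval paths
semantic = ["doc_a", "doc_b", "doc_c"]
bm25 = ["doc_b", "doc_d", "doc_a"]
graph = ["doc_b", "doc_a"]

fused = reciprocal_rank_fusion([semantic, bm25, graph])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

    Only rank positions enter the score, so a BM25 score of 14.2 and a cosine similarity of 0.83 never have to be put on the same scale.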

    The system auto-tunes graph weight based on query characteristics:

    • High entity density (>30% capitalized words) → favor graph traversal
    • Question words detected → moderate graph weight
    • Short queries (<5 words) → favor vector search
    • Queries ≥5 words → balanced approach
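
    The rules above could be expressed roughly as follows. This is a sketch under stated assumptions: the article does not give the actual weights, the order in which the rules are checked, or the list of question words, so the numbers (0.7, 0.4, 0.2, 0.5), the rule priority, and `QUESTION_WORDS` are all illustrative.

```python
# Assumed question-word list; the real detector may differ
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how", "which"}


def graph_weight(query: str) -> float:
    """Return an assumed graph-traversal weight in [0, 1] for a query."""
    words = query.split()
    if not words:
        return 0.5  # nothing to analyze, fall back to balanced

    # Share of words that start with a capital letter, as a cheap entity signal
    capitalized = sum(1 for word in words if word[0].isupper())
    if capitalized / len(words) > 0.30:
        return 0.7  # high entity density: favor graph traversal

    if any(word.strip("?,.!").lower() in QUESTION_WORDS for word in words):
        return 0.4  # question words: moderate graph weight

    if len(words) < 5:
        return 0.2  # short query: favor vector search

    return 0.5  # five or more words: balanced


print(graph_weight("Did Acme Corp acquire Project Atlas"))
print(graph_weight("cheap flights"))
```

    Capitalization is a crude proxy for entities (a sentence-initial word counts too), which is one reason to treat this as a tuning signal, not a classifier.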

    Human memory uses different retrieval strategies for different kinds of questions—episodic recall for personal experiences, semantic memory for facts. The prefrontal cortex coordinates between them. This auto-tuning is a rough analog.

    When Simple Wins

    Ask Clay uses pure semantic search with HNSW indexing. No hybrid retrieval. This works because:

    Domain constraint. Board game rules are self-contained. There's no temporal relevance, no entity confusion across contexts, no relationship traversal needed.

    Pre-structured data. Each document chunk is a complete question-answer pair with metadata. The chunking problem is solved during data preparation, not at query time.

    Latency requirements. HNSW returns results in 3-10ms. Hybrid retrieval would add complexity without a proportional improvement in answer quality.

    The lesson: match retrieval complexity to domain complexity. Structured metadata filtering may be enough. You don't always need GraphRAG.

    Chunking

    Every RAG tutorial covers chunking—splitting documents into smaller pieces for embedding. The tutorials rarely cover the actual problems.

    Chunk Size Trade-offs

    Too small (100-200 tokens): context loss. A chunk containing "the defendant" without identifying the defendant is useless for retrieval.

    Too large (1000+ tokens): embedding quality degrades. The embedding averages too many concepts and retrieval precision drops.

    The production answer depends on domain and retrieval strategy.

    Fidelius uses 400-character chunks with zero overlap for general text, AST-aware chunking for code (respecting function boundaries), and clause-level chunking for legal documents.

    Ask Clay sidesteps the problem—each JSON entry is a complete, atomic chunk designed during data preparation.

    Information Loss at Chunk Boundaries

    The harder chunking problem is information loss at boundaries:

    Chunk 1: "...the project was approved by the committee."
    Chunk 2: "They allocated $2.4M for Phase 1..."

    "They" in Chunk 2 refers to "the committee" in Chunk 1, but vector search might only retrieve Chunk 2.

    Four approaches that work:

    1. Overlap with deduplication. Include 50-100 tokens of overlap between chunks, then deduplicate at retrieval time.
    2. Hierarchical chunking. Store both fine-grained chunks (for precision) and paragraph-level chunks (for context). Retrieve fine-grained, expand to paragraph.
    3. Chunk metadata enrichment. Store surrounding text in metadata: {"preceding_context": "...approved by the committee.", "following_context": "They allocated..."}.
    4. Parent document retrieval. Retrieve chunks, then fetch the parent document for full context.
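
    As an illustration of the first approach, here is a minimal sketch of overlapping chunks plus merge-based deduplication at retrieval time. It assumes whitespace-separated words as a stand-in for real tokens, and the function names and chunk dictionary layout are hypothetical.

```python
def chunk_with_overlap(tokens: list[str], size: int = 200, overlap: int = 50) -> list[dict]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        end = min(start + size, len(tokens))
        chunks.append({"start": start, "end": end, "tokens": tokens[start:end]})
        if end == len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


def merge_retrieved(chunks: list[dict]) -> list[list[str]]:
    """Merge retrieved chunks whose spans overlap or touch, so shared text appears once."""
    merged: list[dict] = []
    for chunk in sorted(chunks, key=lambda c: c["start"]):
        if merged and chunk["start"] <= merged[-1]["end"]:
            last = merged[-1]
            if chunk["end"] > last["end"]:
                # Append only the tokens that extend past the previous span
                last["tokens"] = last["tokens"] + chunk["tokens"][last["end"] - chunk["start"]:]
                last["end"] = chunk["end"]
        else:
            merged.append(dict(chunk))
    return [m["tokens"] for m in merged]


text = "the project was approved by the committee. They allocated $2.4M for Phase 1"
tokens = text.split()
chunks = chunk_with_overlap(tokens, size=8, overlap=3)
print(" ".join(chunks[1]["tokens"]))
```

    With a three-token overlap, the second chunk starts at "the committee. They allocated", so the pronoun travels with its referent even when only that chunk is retrieved.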

    Baddeley's working memory model is a useful frame here. The phonological loop has a small capacity (the commonly cited figure of about seven items comes from Miller's span estimate), but the episodic buffer integrates information across sources. RAG context assembly needs a similar integration step: retrieved chunks must be assembled into coherent context, not just concatenated.

    Context Assembly

    Retrieval gets more attention, but context assembly—how you arrange retrieved content for the LLM—determines whether the model can use what you found.

    The Token Budget Problem

    LLMs have finite context windows. Retrieved content competes with system prompts, conversation history, the user query, and output space.

    Fidelius enforces token budgets (100-16,000 tokens) with explicit allocation:

    python
    def assemble_context(retrieved_docs, conversation_history, system_prompt, user_query):
        # count_tokens, truncate_to_budget, and the constants are defined elsewhere
        budget = MAX_CONTEXT_TOKENS
        budget -= count_tokens(system_prompt)       # ~500 tokens
        budget -= count_tokens(user_query)          # ~50 tokens
        budget -= RESERVED_OUTPUT_TOKENS            # ~2000 tokens
        budget = max(budget, 0)                     # never hand out a negative budget
    
        # History gets at most 30% of what remains; retrieval gets the rest
        history_budget = min(int(budget * 0.3), count_tokens(conversation_history))
        retrieval_budget = budget - history_budget
    
        return truncate_to_budget(retrieved_docs, retrieval_budget)

    Context Ordering

    LLMs weight information at the start and end of context more heavily than information in the middle (the "lost in the middle" effect).

    Put the most relevant chunk closest to the query (last in the context block). Keep recent conversation messages closer to the current query. Use clear section headers (## Relevant Context, ## Conversation History) to help the model navigate.
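
    A minimal sketch of that ordering, assuming each retrieved chunk arrives as a (text, similarity score) pair. The function names and the "## Question" header are illustrative assumptions.

```python
def build_context_block(chunks: list[tuple[str, float]]) -> str:
    """Order chunks by ascending relevance so the best match ends up last."""
    ordered = sorted(chunks, key=lambda pair: pair[1])
    return "## Relevant Context\n" + "\n\n".join(text for text, _score in ordered)


def build_prompt(chunks: list[tuple[str, float]], query: str) -> str:
    """Place the query directly after the context block, next to the best chunk."""
    return f"{build_context_block(chunks)}\n\n## Question\n{query}"


prompt = build_prompt(
    [("chunk-high", 0.91), ("chunk-low", 0.42), ("chunk-mid", 0.67)],
    "What did we decide about the API?",
)
print(prompt)
```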

    Ask Clay structures its prompts with explicit sections:

    Game Information
    [title, category, player count, duration]
    
    Relevant Rules
    [Rule with similarity score]
    [Rule with similarity score]
    
    Recent Conversation
    [Last 5 messages]
    
    Instructions
    [Response formatting guidelines]

    Dynamic Context Injection

    Reparo changes the system prompt based on conversation state:

    python
    def select_system_prompt(context):
        # Checks run in priority order: safety, then emergencies, then classification
        if not context.safety_checked:
            return SAFETY_CHECK_PROMPT
        if context.is_emergency:
            return EMERGENCY_PROTOCOL_PROMPT
        if not context.system_type:
            return SYSTEM_CLASSIFICATION_PROMPT
        return STANDARD_DIAGNOSTIC_PROMPT

    An emergency plumbing call needs different retrieved context than a routine diagnostic. The system selects prompts accordingly.

    Memory Beyond RAG: The HINDSIGHT Architecture

    Standard RAG is stateless—each query retrieves from the same corpus. Production systems need memory that changes over time.

    Four Memory Networks

    Fidelius implements the HINDSIGHT architecture with four memory networks, based on Tulving's distinction between episodic and semantic memory:

    World Memory (Semantic): Objective facts. "Python 3.12 was released in October 2023." No confidence evolution—facts are static until corrected.

    Experience Memory (Episodic): First-person action records with timestamps. "User sent quarterly report to Alice on March 15." Filterable by action type, searchable with time constraints.

    Opinion Memory (Beliefs): Confidence-weighted preferences that change with use. "User prefers concise emails" with confidence 0.7. Supports reinforcement and decay:

    python
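    from datetime import timedelta
    
    def reinforce(opinion, delta=0.1):
        # Raise confidence each time the preference is observed again, capped at 1.0
        opinion.confidence = min(1.0, opinion.confidence + delta)
    
    def decay_unreinforced(opinions, now, grace_period_days=30, decay_rate=0.01):
        # last_reinforced is an assumed timestamp field on each opinion
        cutoff = now - timedelta(days=grace_period_days)
        for opinion in opinions:
            if opinion.last_reinforced < cutoff:
                opinion.confidence = max(0.0, opinion.confidence - decay_rate)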

    Observation Memory (Patterns): Entity-centric summaries built from the other three networks. Graph-backed with causal relationships: CAUSES, CAUSED_BY, ENABLES, PREVENTS.

    This four-network split addresses a real limitation of standard RAG. Pure vector retrieval can't chain related ideas transitively or account for how a user's priorities change over time. A static vector index won't notice that a user stopped caring about Q3 metrics two months ago.

    Memory Consolidation

    The HINDSIGHT architecture runs periodic offline processing:

    python
    def sleep_cycle():
        """Periodic offline consolidation pass (outline of the steps only)."""
        # 1. Cluster recent experiences by theme (UMAP + HDBSCAN)
        # 2. Synthesize patterns from clusters
        # 3. Remove duplicates
        # 4. Reduce confidence in unverified opinions
        # 5. Prune outdated information

    This is modeled on human memory consolidation during sleep, where the hippocampus replays recent events and transfers them to long-term storage. The practical result: the memory system gets more focused over time instead of accumulating noise.

    Source Monitoring

    Every memory in Fidelius is tagged with:

    • Source (conversation, document, external API)
    • Timestamp (bi-temporal: valid_from, valid_to)
    • Confidence level
    • Creating application

    Johnson's Source Monitoring Framework describes how humans track where information came from. Without source tracking, a RAG system can blend training data with user context and produce plausible but ungrounded answers.
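
    One way to picture these tags is as a record type with a validity check. This is a sketch, not the Fidelius schema: the class name, field names other than valid_from and valid_to, and the 0.5 confidence threshold are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Literal, Optional


@dataclass(frozen=True)
class MemoryRecord:
    content: str
    source: Literal["conversation", "document", "external_api"]
    valid_from: datetime
    valid_to: Optional[datetime]  # None means the fact is still considered valid
    confidence: float
    created_by: str  # the application that wrote the memory

    def is_valid_at(self, moment: datetime) -> bool:
        """True if the fact held at `moment` (valid_from inclusive, valid_to exclusive)."""
        if moment < self.valid_from:
            return False
        return self.valid_to is None or moment < self.valid_to


def usable_memories(records: list[MemoryRecord], moment: datetime,
                    min_confidence: float = 0.5) -> list[MemoryRecord]:
    """Keep only memories that are valid at `moment` and confident enough to cite."""
    return [r for r in records if r.is_valid_at(moment) and r.confidence >= min_confidence]


def utc(year: int, month: int, day: int) -> datetime:
    return datetime(year, month, day, tzinfo=timezone.utc)


records = [
    MemoryRecord("User tracks Q3 metrics", "conversation", utc(2025, 7, 1), utc(2025, 10, 1), 0.8, "chat"),
    MemoryRecord("User prefers concise emails", "document", utc(2025, 1, 1), None, 0.7, "mail"),
]
current = usable_memories(records, utc(2025, 12, 1))
print([r.content for r in current])
```

    Because every record carries its source and validity window, the prompt builder can cite where a statement came from and drop facts that have expired.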

    Multi-Tenancy and Security

    Enterprise RAG systems serve multiple users and organizations. Data isolation is a requirement, not a feature.

    Row-Level Security

    Fidelius enforces tenant isolation at the database level:

    sql
    CREATE POLICY tenant_isolation ON documents
        USING (user_id = current_user_id() AND workspace_id = current_workspace_id());

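    The system verifies RLS is enabled at startup and fails fast if it's disabled. With isolation enforced in the database, an application-level bug such as a missing WHERE clause can't leak data across tenants, provided the application connects as a role that is subject to the policy. In PostgreSQL, superusers and roles with BYPASSRLS skip RLS, and so does the table owner unless FORCE ROW LEVEL SECURITY is set.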
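
    A startup check along those lines might look like the sketch below. It assumes PostgreSQL and a DB-API style cursor (for example from psycopg) that accepts %s placeholders, and it assumes table names are unique across schemas. The function name and the require_force option are hypothetical; relrowsecurity and relforcerowsecurity are the pg_class columns that record whether RLS is enabled and forced.

```python
def assert_rls_enabled(cursor, tables: list[str], require_force: bool = False) -> None:
    """Raise RuntimeError if any listed table does not have row-level security enabled."""
    for table in tables:
        cursor.execute(
            "SELECT relrowsecurity, relforcerowsecurity FROM pg_class WHERE relname = %s",
            (table,),
        )
        row = cursor.fetchone()
        if row is None:
            raise RuntimeError(f"table {table!r} not found")
        enabled, forced = row
        if not enabled:
            raise RuntimeError(f"row-level security is disabled on {table!r}")
        if require_force and not forced:
            # Without FORCE, the table owner is not subject to the policies
            raise RuntimeError(f"row-level security is not forced on {table!r}")
```

    Calling this once during application startup, before serving any request, turns a silent misconfiguration into a crash that is hard to miss.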

    Knowledge Graph Scoping

    Neo4j entities are scoped with group IDs:

    python
    group_id = f"workspace_{workspace_id}" if workspace_id else f"user_{user_id}"
    # All MATCH clauses filter by group_id

    The OpenClaw Lesson

    OpenClaw gained 60,000+ GitHub stars before security researchers found hundreds of misconfigured instances leaking credentials. The cause was architectural: everything stored in local files with no encryption, no access control, and no isolation.

    Production RAG systems need:

    • Encrypted storage at rest
    • Per-user/organization data isolation
    • Audit trails for data access
    • Granular sharing permissions
    • Complete user data deletion

    Performance

    Embedding Batch Processing

    Fidelius sends embedding requests in batches of 100 texts and retries rate-limit and connection errors with exponential backoff:

    python
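    import asyncio
    from openai import APIConnectionError, AsyncOpenAI, RateLimitError
    from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential
    
    client = AsyncOpenAI()
    
    @retry(wait=wait_exponential(multiplier=1, min=4, max=60), stop=stop_after_attempt(5),
           retry=retry_if_exception_type((RateLimitError, APIConnectionError)))
    async def embed_batch(batch: list[str]) -> list[list[float]]:
        # Retrying per batch avoids re-embedding batches that already succeeded
        response = await client.embeddings.create(model="text-embedding-3-small", input=batch)  # model name is an assumption
        return [item.embedding for item in response.data]
    
    async def create_embeddings(texts: list[str], batch_size: int = 100) -> list[list[float]]:
        embeddings = []
        for i in range(0, len(texts), batch_size):
            embeddings.extend(await embed_batch(texts[i:i + batch_size]))
            await asyncio.sleep(BATCH_DELAY)  # pause between batches; BATCH_DELAY is defined elsewhere
        return embeddings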

    HNSW Index Optimization

    Ask Clay went from 3.84s to 6ms query latency (640x) by fixing one query pattern:

    The problem: WHERE clause filtering disabled HNSW index optimization.

    python
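    # Slow: the WHERE filter prevents HNSW index optimization
    SLOW_QUERY = """
    SELECT * FROM documents
    WHERE content_type = 'rule'
    ORDER BY embedding <-> %(query_embedding)s LIMIT 5;
    """
    
    # Fast: HNSW search first (over-fetch 20), then filter in application code.
    # This can return fewer than 5 rows if few of the 20 nearest neighbors are rules.
    results = vector_search(query_embedding, limit=20)
    filtered = [r for r in results if r.content_type == 'rule'][:5]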

    Caching

    • Embedding cache (LRU): SHA256 hash of text → embedding. 20-30% hit rate on repeated queries.
    • Game metadata cache: LRU with 128 entries. Avoids repeated database lookups.
    • Neo4j connection pooling: 50 max connections, 30s timeout, 3600s max lifetime.
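
    The embedding cache in the first bullet can be sketched as a small LRU keyed by the SHA-256 hash of the text. The class name, the default size of 1,024 entries, and the hit and miss counters are assumptions for illustration; `embed_fn` stands in for whatever function actually calls the embedding model.

```python
import hashlib
from collections import OrderedDict
from typing import Callable


class EmbeddingCache:
    """LRU cache that maps SHA-256(text) to an embedding vector."""

    def __init__(self, embed_fn: Callable[[str], list[float]], max_entries: int = 1024):
        self._embed_fn = embed_fn
        self._max_entries = max_entries
        self._store = OrderedDict()
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str) -> list[float]:
        key = self._key(text)
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        embedding = self._embed_fn(text)
        self._store[key] = embedding
        if len(self._store) > self._max_entries:
            self._store.popitem(last=False)  # evict the least recently used entry
        return embedding


# Stand-in embedding function so the example runs without an API call
def fake_embed(text: str) -> list[float]:
    return [float(len(text))]


cache = EmbeddingCache(fake_embed, max_entries=2)
cache.get("how do I win?")
cache.get("how do I win?")
print(cache.hits, cache.misses)
```

    Hashing the text keeps keys at a fixed size regardless of query length, and the counters make it easy to check whether the hit rate justifies the memory.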

    Streaming

    For long generations, stream responses with Server-Sent Events to reduce perceived latency:

    python
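    import json
    
    async def stream_response(llm, prompt):
        # Each SSE event carries only the new text (a delta), not the accumulated response
        async for chunk in llm.stream(prompt):
            yield f"data: {json.dumps({'delta': chunk})}\n\n"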

    Ask Clay sends only deltas—new content per event, not accumulated text.

    Multimodal Retrieval

    Reparo shows how vision models integrate with RAG-like systems.

    Structured Vision Output

    Instead of free-form image descriptions, Reparo uses structured output:

    python
    from typing import Literal
    
    from pydantic import BaseModel
    
    class PlumberTicketAnalysis(BaseModel):
        system: str                                    # "Water Heater (Gas)"
        category: str                                  # One of 8 categories
        priority: Literal["emergency", "urgent", "routine"]
        confidence: Literal["high", "medium", "low"]
        visual_evidence: list[str]
        materials_identified: list[str]
        truck_loadout_primary: list[str]
        estimated_time: str

    Structured output is queryable—you can filter and rank vision analysis results the same way you handle text retrieval.
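
    As a sketch of what "queryable" can mean in practice, the example below filters analyses by confidence and ranks them by priority. It uses a plain dataclass with three of the fields as a stand-in for the full model, and the ordering tables and the `triage` function are assumptions, not part of Reparo.

```python
from dataclasses import dataclass

# Lower number means more urgent or more confident (assumed orderings)
PRIORITY_ORDER = {"emergency": 0, "urgent": 1, "routine": 2}
CONFIDENCE_ORDER = {"high": 0, "medium": 1, "low": 2}


@dataclass
class TicketSummary:
    system: str
    priority: str    # "emergency", "urgent", or "routine"
    confidence: str  # "high", "medium", or "low"


def triage(tickets: list[TicketSummary], min_confidence: str = "medium") -> list[TicketSummary]:
    """Drop low-confidence analyses, then sort by priority and confidence."""
    threshold = CONFIDENCE_ORDER[min_confidence]
    kept = [t for t in tickets if CONFIDENCE_ORDER[t.confidence] <= threshold]
    return sorted(kept, key=lambda t: (PRIORITY_ORDER[t.priority], CONFIDENCE_ORDER[t.confidence]))


tickets = [
    TicketSummary("Kitchen Faucet", "routine", "high"),
    TicketSummary("Water Heater (Gas)", "emergency", "medium"),
    TicketSummary("Sump Pump", "urgent", "low"),
]
queue = triage(tickets)
print([t.system for t in queue])
```

    A free-form image description would need another model call to answer "which tickets are emergencies?"; with constrained fields it is a list comprehension.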

    Customer Verification

    Vision findings require human verification before the system treats them as facts:

    python
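    async def confirm_image_finding(findings: list, finding_index: int, verified: bool, correction: str | None = None):
        # Look up the AI-generated finding the customer is responding to
        finding = findings[finding_index]
        if verified:
            record_as_confirmed_symptom(finding)
        elif correction:
            record_as_corrected_symptom(correction)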

    The system tracks whether a finding came from AI analysis or human confirmation—another application of source monitoring.

    Choosing a RAG Architecture

    When designing a RAG system, these questions determine most of the architecture:

    1. Retrieval requirements

    Situation | Approach
    Static corpus, simple queries | Semantic search with HNSW
    Temporal relevance matters | Add time-weighted retrieval
    Entity relationships matter | Add knowledge graph
    Keyword precision matters | Add BM25 hybrid search

    2. How context changes

    Situation | Approach
    Stateless queries | Standard RAG
    Conversation memory needed | Add session history
    Long-term user preferences | Add persistent memory
    Beliefs that update | Add confidence-weighted opinions with decay

    3. Latency requirements

    Budget | Options
    <100ms | Pure vector search, aggressive caching
    <500ms | Add reranking, light hybrid retrieval
    <2s | Full hybrid retrieval, graph traversal
    Async OK | Multi-stage pipelines

    4. Security requirements

    Situation | Approach
    Single tenant | Application-level isolation
    Multi-tenant | Database-level RLS
    Regulated industry | Add audit trails, data lineage, deletion

    Conclusion

    The main takeaway from building these systems: RAG architecture is cognitive architecture. The patterns that work correspond to how human memory works:

    • Hybrid retrieval corresponds to the interplay between episodic and semantic memory
    • Memory consolidation corresponds to sleep-based learning
    • Source monitoring corresponds to how humans track where they learned something
    • Context window constraints correspond to working memory limits

    Production RAG is not about building bigger vector indexes. It's about building systems that retrieve the right information, from the right sources, at the right time.
