Everyone has a RAG demo. Almost nobody has RAG in production. The gap between a slick prototype and a reliable enterprise system is where most generative AI initiatives go to die. Here's how to cross that gap.
Why RAG Demos Lie
A RAG prototype is deceptively easy to build. Take a vector database, embed some documents, wire up an LLM, and you've got a chatbot that can answer questions about your data. With LangChain or LlamaIndex it takes an afternoon. The demo looks magical.
Then reality hits. The system hallucinates on edge cases. It retrieves irrelevant chunks that poison the context window. It can't handle documents that were updated last week. Latency spikes to 8 seconds during peak hours. Users lose trust within a week.
The problem isn't RAG itself. It's that production RAG is an entirely different engineering discipline from demo RAG.
The Production RAG Stack
After deploying RAG systems for multiple enterprises, we've converged on an architecture with five critical layers:
1. Document Ingestion & Preprocessing
This is where most teams cut corners, and pay for it later. Raw documents (PDFs, Confluence pages, Slack threads, code repos) need to be parsed, cleaned, and chunked with precision.
Chunking strategy matters enormously. Fixed-size chunks (512 tokens) are simple but break semantic units. We use hierarchical chunking: document → section → paragraph, preserving headers and metadata at each level. This allows retrieval at the right granularity. Sometimes you need a paragraph, sometimes you need an entire section for context.
Every chunk gets enriched with metadata: source document, section title, last modified date, author, access permissions. This metadata is crucial for filtering, freshness ranking, and access control in production.
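A minimal sketch of hierarchical chunking with metadata propagation. The `Chunk` shape and the `(header, body)` input format are illustrative assumptions, not a fixed API; real ingestion would sit behind a document parser.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    level: str                 # "section" or "paragraph"
    metadata: dict = field(default_factory=dict)

def hierarchical_chunks(sections, base_meta):
    """Split a parsed document into section- and paragraph-level chunks,
    carrying the section header and source metadata down to every chunk.
    `sections` is assumed to be a list of (header, body) pairs."""
    chunks = []
    for header, body in sections:
        meta = {**base_meta, "section": header}
        # Section-level chunk keeps the header attached for context
        chunks.append(Chunk(text=f"{header}\n{body}", level="section", metadata=meta))
        # Paragraph-level chunks inherit the same enriched metadata
        for para in body.split("\n\n"):
            if para.strip():
                chunks.append(Chunk(text=para.strip(), level="paragraph", metadata=meta))
    return chunks
```

Because every paragraph chunk still knows its section and source document, retrieval can match at paragraph granularity and then expand to the parent section when more context is needed.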
2. Embedding & Retrieval
Choosing the right embedding model is less important than choosing the right retrieval strategy. We've found that hybrid retrieval, combining dense vector search with sparse keyword search (BM25), consistently outperforms either approach alone.
The pattern: run vector similarity search and BM25 in parallel, then use Reciprocal Rank Fusion (RRF) to merge results. This catches cases where semantic similarity misses exact terminology (common in technical and legal domains) and where keyword search misses paraphrased concepts.
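The fusion step itself is only a few lines. A sketch of RRF with the conventional k = 60 smoothing constant; the inputs are the ranked lists of document IDs returned by the two retrievers:

```python
def rrf_merge(rankings, k=60):
    """Reciprocal Rank Fusion: each document scores the sum of
    1 / (k + rank) over every ranked list it appears in."""
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

Documents ranked well by both retrievers dominate the fused list, while a document found by only one retriever can still surface, which is exactly the behavior that covers both the exact-terminology and paraphrase failure cases.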
For the vector database, we recommend pgvector for teams already on PostgreSQL, Pinecone or Weaviate for managed infrastructure, and Qdrant for teams who want to self-host with maximum control.
3. Context Assembly & Reranking
Retrieval gives you candidate chunks. Context assembly turns them into a coherent prompt. It is the most underinvested layer in a typical RAG system.
Reranking is non-negotiable. After initial retrieval, pass the top 20–30 candidates through a cross-encoder reranker (Cohere Rerank, or a fine-tuned model) to select the final 5–8 chunks. This dramatically improves relevance. We've seen 25–40% accuracy improvements from reranking alone.
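Structurally, reranking reduces to scoring every (query, chunk) pair and keeping the best. A sketch with the scorer injected, since the cross-encoder itself is a model call; the token-overlap scorer below is a toy stand-in for testing, not a real reranker:

```python
def rerank(query, candidates, score_fn, top_k=6):
    """Score each candidate chunk against the query with a
    cross-encoder-style score_fn(query, text) -> float and keep top_k."""
    return sorted(candidates,
                  key=lambda c: score_fn(query, c["text"]),
                  reverse=True)[:top_k]

def overlap_score(query, text):
    """Toy scorer: fraction of query tokens present in the chunk.
    In production this would be a cross-encoder model call
    (e.g. Cohere Rerank or a fine-tuned model)."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / max(len(q), 1)
```

Keeping `score_fn` injectable also makes the pipeline testable offline and lets you swap rerankers without touching retrieval or assembly code.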
Context assembly also means ordering chunks logically (chronological, hierarchical, or by relevance score), deduplicating overlapping content, and truncating to fit the context window while preserving the most important information.
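A sketch of that assembly step, assuming each chunk carries a relevance `score` and the caller supplies `count_tokens`; a real system would hash normalized text for deduplication rather than compare prefixes:

```python
def assemble_context(chunks, max_tokens, count_tokens):
    """Order by relevance, drop near-duplicate chunks, and stop adding
    content once the token budget is spent."""
    seen, selected, used = set(), [], 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        key = chunk["text"][:80]          # crude dedup key (illustrative)
        if key in seen:
            continue
        cost = count_tokens(chunk["text"])
        if used + cost > max_tokens:
            continue                      # keep scanning for smaller chunks
        seen.add(key)
        selected.append(chunk)
        used += cost
    return selected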
4. Generation & Grounding
The LLM call itself needs careful engineering. Three techniques that dramatically reduce hallucination:
Citation enforcement: Instruct the model to cite specific chunks by ID for every claim. If a statement can't be tied to a retrieved chunk, flag it as potentially hallucinated. This turns the LLM from a creative writer into a grounded summarizer.
Confidence scoring: Ask the model to rate its confidence on a 1–5 scale for each response. Route low-confidence answers to human review rather than presenting them to users. This creates a natural quality gate.
Abstention: Explicitly instruct the model to say “I don't have enough information to answer this” rather than guessing. Users prefer honest uncertainty over confident hallucination.
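All three techniques can live in a single system prompt. A sketch of the prompt assembly; the exact wording and the `[chunk-N]` ID scheme are illustrative choices, not a canonical template:

```python
SYSTEM_PROMPT = """Answer using ONLY the provided context chunks.
- Cite the chunk ID, e.g. [chunk-2], after every claim.
- End with a line "Confidence: N/5" rating your confidence from 1 to 5.
- If the context is insufficient, reply exactly:
  "I don't have enough information to answer this."
"""

def build_prompt(question, chunks):
    """Format retrieved chunks with stable IDs so citations are checkable."""
    context = "\n\n".join(
        f"[chunk-{i}] {text}" for i, text in enumerate(chunks, start=1))
    return f"{SYSTEM_PROMPT}\nContext:\n{context}\n\nQuestion: {question}"
```

On the way back, a simple scan for `[chunk-N]` markers and the `Confidence:` line is enough to flag uncited claims and route low-confidence answers to human review.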
5. Evaluation & Monitoring
You cannot improve what you don't measure. Production RAG needs continuous evaluation across three dimensions:
Retrieval quality: Are the right chunks being retrieved? Measure using NDCG, Mean Reciprocal Rank, and retrieval precision. Log every query + retrieved chunks for offline analysis.
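Mean Reciprocal Rank, for instance, needs only the logged rankings plus relevance labels. A sketch, where the input shapes are assumptions: one ranked list of document IDs and one set of relevant IDs per query:

```python
def mean_reciprocal_rank(rankings, relevant):
    """MRR: average of 1/rank of the first relevant document per query.
    A query with no relevant document retrieved contributes 0."""
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(rankings)
```

Computed over the query log on a schedule, a drop in MRR localizes regressions to the retrieval layer before users notice worse answers.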
Generation quality: Are answers faithful to the sources? Automated faithfulness scoring (using a separate LLM as judge) plus human evaluation on a random sample of responses.
End-to-end quality: Are users satisfied? Track thumbs up/down, follow-up question rates, and task completion rates. This is the metric that matters most.
Common Failure Modes
After dozens of RAG deployments, these are the failure modes we see most often:
- Stale data: Documents change but embeddings don't get re-indexed. Implement incremental re-indexing triggered by source system webhooks.
- Wrong granularity: Chunks are too small (missing context) or too large (diluting relevance). Use adaptive chunking based on document structure.
- Missing access control: Users retrieve documents they shouldn't have access to. Enforce permissions at retrieval time using metadata filters.
- Context window waste: Stuffing too many irrelevant chunks into the prompt. Quality over quantity. 5 excellent chunks beat 15 mediocre ones.
- No fallback: When RAG can't answer, the system should gracefully degrade. Route to search, suggest related topics, or connect to a human.
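The access-control point deserves emphasis: permissions must be enforced as a metadata filter inside retrieval, never as a check after generation. A sketch, assuming each chunk's metadata carries an `allowed_groups` list written at ingestion time (the field name is an assumption):

```python
def permitted(chunks, user_groups):
    """Keep only chunks whose allowed_groups intersect the user's groups.
    Chunks with no allowed_groups metadata are treated as restricted,
    so a missing permission tag fails closed."""
    groups = set(user_groups)
    return [c for c in chunks
            if groups & set(c["metadata"].get("allowed_groups", []))]
```

With pgvector this filter becomes a WHERE clause alongside the vector search; managed stores such as Pinecone and Weaviate expose equivalent metadata filters that apply before results are returned.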
Our Production RAG Checklist
Before going live, every RAG system we deploy must pass these gates:
- Hybrid retrieval (vector + keyword) with reranking
- Incremental re-indexing pipeline with < 15 min freshness SLA
- Citation enforcement with source traceability
- Access control enforcement at retrieval layer
- Automated evaluation pipeline (retrieval + generation + e2e)
- Latency budget: < 3 seconds for 95th percentile
- Graceful degradation and abstention on low-confidence answers
- User feedback loop wired into evaluation data
The Bottom Line
RAG is the most practical path to enterprise generative AI, but only if you treat it as an engineering problem, not a science experiment. The difference between a chatbot that impresses in a demo and one that users trust every day is months of infrastructure, evaluation, and iteration.
Start with a narrow scope, invest heavily in retrieval quality, and measure everything. The magic isn't in the LLM. It's in the system around it.