RAG looks easy in a notebook. You chunk some documents, embed them, drop them in a vector store, retrieve the top k, stuff them into a prompt, and the model says something useful. Demo over.
In production, every one of those steps is a place where the system quietly degrades. I have shipped this pattern multiple times and watched the same failure modes appear with new names each time. This essay is the version of the talk I give to teams before they start.
The honest baseline
Before you build anything elaborate, build the dumbest possible version end to end: naive chunking, off-the-shelf embeddings, one vector store, top k retrieval, a single prompt. Get it running on real questions from real users, not the polished examples in your spec.
Then read the outputs. Not skim, read. You will learn more in two hours of this than in two weeks of architecture diagrams. From projects I have seen, almost every team underinvests in this step and ends up tuning components for problems they do not actually have.
Where production RAG breaks
There are five places things go wrong, in rough order of how often I see them.
Retrieval is the bottleneck, not generation
Most teams blame the model. The model is almost always fine. The retrieved context is the problem. If the right passage was not in the top k, no amount of prompt engineering will save you.
Concrete things that help:
- Hybrid retrieval. Pure vector search misses exact terms. Pure keyword search misses paraphrases. A combination, with a reranker on top, beats either alone in almost every domain I have worked in.
- Query rewriting. User questions are often underspecified. Rewriting them, sometimes into multiple variants, before retrieval changes the result quality more than people expect.
- Reranking. A small reranker model run over the top 20 to 50 candidates is one of the highest-leverage additions. Cohere and others publish off-the-shelf rerankers that are good enough to start.
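To make the "combination" in the hybrid retrieval point concrete, here is a minimal sketch of one common way to merge a vector ranking and a keyword ranking before reranking: reciprocal rank fusion. The choice of this fusion method, the function name, the document ids, and the constant `k=60` are illustrative assumptions, not something the approach above prescribes.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default, not a tuned value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from the two indexes for the same query.
vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_c", "doc_a", "doc_d"]

# The fused list is what would be handed to the reranker.
candidates = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(candidates)
```

Documents that both indexes agree on rise to the top, and documents found by only one index still survive as candidates, which is the property you want before a reranker makes the final call.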
Chunking is a load-bearing decision
The chunking strategy you pick at week one will haunt you at month six. Naive fixed-window chunking destroys semantic boundaries. Whole-document chunks blow up your token budget and dilute relevance.
What I usually do: chunk along structural boundaries (headings, paragraphs, code blocks), keep some overlap, and store both the chunk and a reference back to the parent document. At retrieval time you can decide whether to expand back into a wider context window. This costs more storage and is worth it.
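A minimal sketch of that idea, assuming paragraphs separated by blank lines are the only structural boundary. The `Chunk` type, the function name, and the `max_chars` and `overlap_paragraphs` parameters are hypothetical; a real chunker would also respect headings and code blocks and would measure size in tokens rather than characters.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    parent_id: str  # reference back to the source document
    index: int      # position of this chunk within the parent
    text: str


def chunk_by_paragraph(doc_id, text, max_chars=1200, overlap_paragraphs=1):
    """Pack whole paragraphs into chunks, repeating trailing paragraphs as overlap."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        candidate = current + [para]
        if current and len("\n\n".join(candidate)) > max_chars:
            # Close the current chunk, then start the next one with the overlap.
            chunks.append(Chunk(doc_id, len(chunks), "\n\n".join(current)))
            carried = current[-overlap_paragraphs:] if overlap_paragraphs else []
            current = carried + [para]
        else:
            current = candidate
    if current:
        chunks.append(Chunk(doc_id, len(chunks), "\n\n".join(current)))
    return chunks


sample = "A" * 10 + "\n\n" + "B" * 10 + "\n\n" + "C" * 10
chunks = chunk_by_paragraph("doc-1", sample, max_chars=25)
for c in chunks:
    print(c.parent_id, c.index, repr(c.text))
```

Because every chunk carries `parent_id` and `index`, the retrieval layer can later fetch neighbouring chunks or the whole parent when a wider context is needed. Note that in this sketch a single paragraph longer than `max_chars` is kept whole rather than split.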
Evaluation is missing or vibes-based
If you cannot measure your retrieval quality and your generation quality separately, you cannot debug them. I cover this in detail in the LLM Evaluation Framework post, but the short version: you need a curated test set of real questions, ground-truth relevant passages, and a small library of automated and human checks. Without it, every change is a guess.
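As a sketch of what measuring retrieval separately can look like, the snippet below computes recall at k over a small test set, with no generation step involved. The test-set shape (`question`, `relevant_ids`), the passage ids, and the stand-in retriever are all hypothetical; recall at k is one reasonable metric among several, chosen here for illustration.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the ground-truth passages that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def evaluate_retrieval(test_set, retrieve, k=5):
    """Average recall@k over the test set; `retrieve` maps a question to ranked ids."""
    scores = [
        recall_at_k(retrieve(case["question"]), case["relevant_ids"], k)
        for case in test_set
    ]
    return sum(scores) / len(scores)


# Hypothetical curated test set: real questions with known relevant passages.
test_set = [
    {"question": "How do I rotate an API key?", "relevant_ids": ["p1", "p2"]},
    {"question": "What is the refund window?", "relevant_ids": ["p3"]},
]

# Stand-in retriever with canned results; swap in the real retrieval layer.
canned = {
    "How do I rotate an API key?": ["p1", "p9", "p2"],
    "What is the refund window?": ["p3"],
}
mean_recall = evaluate_retrieval(test_set, canned.get, k=2)
print(mean_recall)
```

Running this on every change to chunking, embeddings, or retrieval tells you whether the right passages are reaching the prompt at all, before you spend time judging the generated answers.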
Cost and latency get away from you
In a notebook, calling the API once with 8,000 tokens of context feels free. In production, multiply that by 10,000 daily users and a reranker call on top, and you have a real bill. Latency is the same story. Each step in your pipeline adds tens or hundreds of milliseconds, and users notice once the total response time passes about a second.
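The arithmetic is worth doing explicitly. At 10,000 users sending one 8,000-token request each, the input side alone is 80 million tokens a day. The sketch below turns that into a daily figure; the per-token prices, the one-query-per-user rate, and the 300-token output length are placeholders I made up for illustration, so substitute your provider's actual pricing and your own traffic numbers.

```python
def daily_cost_usd(users, queries_per_user, context_tokens, output_tokens,
                   usd_per_million_input, usd_per_million_output):
    """Rough daily model spend, ignoring reranker and embedding calls."""
    requests = users * queries_per_user
    input_cost = requests * context_tokens / 1_000_000 * usd_per_million_input
    output_cost = requests * output_tokens / 1_000_000 * usd_per_million_output
    return input_cost + output_cost


estimate = daily_cost_usd(
    users=10_000,
    queries_per_user=1,           # assumption: one question per user per day
    context_tokens=8_000,
    output_tokens=300,            # assumption: short answers
    usd_per_million_input=3.00,   # placeholder price, not a real quote
    usd_per_million_output=15.00, # placeholder price, not a real quote
)
print(f"${estimate:,.2f} per day")
```

Halving the context size roughly halves the dominant term, which is why a hard ceiling on context pays for itself quickly.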
A few things that help: cache aggressively at the embedding layer, batch where you can, and treat your token budget as a hard constraint, not a default. The best RAG systems I have built had strict ceilings on context size and got better, not worse, when those ceilings were enforced.
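Two of those ideas fit in a few lines. The sketch below shows a content-addressed embedding cache and a hard context ceiling applied to already-ranked chunks. `EmbeddingCache`, `fit_to_budget`, the in-memory dict, and the whitespace token counter are hypothetical stand-ins; in practice the cache would live in a persistent store and the counter would be your model's real tokenizer.

```python
import hashlib


class EmbeddingCache:
    """Cache embeddings by a hash of the text so repeated text is embedded once."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}
        self.misses = 0

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


def fit_to_budget(ranked_chunks, max_tokens, count_tokens):
    """Keep chunks in ranked order and stop before the budget would be exceeded."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected


# Stand-in embedding function; replace with a call to your embedding model.
cache = EmbeddingCache(lambda text: [float(len(text))])
cache.get("same question")
cache.get("same question")  # served from the cache, no second model call

# Crude whitespace token count, used here only to keep the example self-contained.
context = fit_to_budget(["a b c", "d e", "f g h i"], max_tokens=5,
                        count_tokens=lambda s: len(s.split()))
print(cache.misses, context)
```

Stopping at the first chunk that does not fit, rather than skipping it and trying lower-ranked ones, keeps the prompt strictly in relevance order. That is a design choice; skipping and continuing is also defensible.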
Freshness and consistency
If your data changes, your index has to change with it. This sounds obvious and is the source of a surprising number of bugs. From projects I have seen, a stale index is the most common cause of "the model is wrong" reports that turn out not to be the model at all.
Build a clear pipeline from your source of truth to your index, with monitoring on staleness. If you have a million documents and only the recent ones matter, partition accordingly.
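A minimal sketch of a staleness check, assuming you can obtain a last-modified timestamp per document from the source of truth and a last-indexed timestamp per document from the index. The function name, the two mappings, and the one-hour tolerance are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_stale(source_updated, indexed_at, max_lag=timedelta(hours=1)):
    """Return ids whose index entry is missing or lags the source by more than max_lag."""
    stale = []
    for doc_id, updated in source_updated.items():
        indexed = indexed_at.get(doc_id)
        if indexed is None or updated - indexed > max_lag:
            stale.append(doc_id)
    return sorted(stale)


noon = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

# Hypothetical timestamps: when each document last changed at the source...
source_updated = {
    "doc-1": noon,
    "doc-2": noon,
    "doc-3": noon,
}
# ...and when each document was last written to the index.
indexed_at = {
    "doc-1": noon + timedelta(minutes=5),  # indexed after the change: fresh
    "doc-2": noon - timedelta(hours=3),    # indexed before the change: stale
    # doc-3 was never indexed
}

stale_ids = find_stale(source_updated, indexed_at)
print(stale_ids)
```

Run on a schedule and alert on the count of stale ids, this turns "the index might be behind" into a number you can watch.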
A reference architecture
For most teams, the architecture I would start with looks like:
- Ingestion pipeline. Source data goes through a structural chunker, gets embeddings via your model of choice, and lands in Postgres with pgvector or a managed vector store.
- Retrieval layer. A query goes through optional rewriting, runs in parallel against vector and keyword indexes, and the union goes to a reranker. Top n after reranking goes into the prompt.
- Generation layer. A small set of carefully designed prompts, with structured outputs where possible, calling a frontier model like Claude or GPT.
- Evaluation layer. Automated checks on every change, plus a human review queue for samples of production traffic.
- Observability. Latency, token usage, retrieval hit rate, and a way to replay any production query end to end.
Most of the leverage is in the retrieval, evaluation, and observability layers. Most teams over-invest in ingestion and generation.
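For the observability layer, being able to replay a production query mostly comes down to recording what each stage saw and produced. Here is a minimal sketch of such a record; the `QueryTrace` type and its field names are hypothetical, and where you persist it (logs, a table, an object store) is up to you.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class QueryTrace:
    """Everything needed to replay one query through the pipeline."""
    query: str
    rewritten_queries: list = field(default_factory=list)
    retrieved_ids: list = field(default_factory=list)   # before reranking
    reranked_ids: list = field(default_factory=list)    # what reached the prompt
    prompt_tokens: int = 0
    stage_latency_ms: dict = field(default_factory=dict)

    def to_json(self):
        return json.dumps(asdict(self), sort_keys=True)


trace = QueryTrace(query="What is the refund window?")
trace.rewritten_queries.append("refund policy time limit")
trace.retrieved_ids.extend(["p3", "p7", "p9"])
trace.reranked_ids.extend(["p3", "p9"])
trace.prompt_tokens = 1850
trace.stage_latency_ms["retrieval"] = 42
trace.stage_latency_ms["rerank"] = 95

record = trace.to_json()
print(record)
```

With the retrieved and reranked ids stored per query, retrieval hit rate can be computed offline against the evaluation set, and any bad answer can be traced back to the stage that lost the right passage.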
What not to do
- Do not start with fine-tuning. RAG fixes most problems people think they need fine-tuning for, and a fine-tuned model is much harder to maintain.
- Do not introduce a dedicated vector database before you understand whether Postgres with pgvector is sufficient. For most teams, it is.
- Do not write custom orchestration before you have a reason. The frameworks are not great, but rolling your own usually ends worse.
- Do not skip the evaluation harness. The teams that get burned in production are always the ones who did.
Where to go from here
The next post in this series is the evaluation framework, which is the half of the work people skip until they cannot. If you are starting a RAG project and want a sanity check on the architecture, /start-project is the door. I have shipped this pattern enough times to recognize the bad turns early.
Sri Vardhan
Independent technology studio of one. I help founders and small teams ship serious software without the consultancy overhead.