AI Systems, moderate complexity

RAG Application Blueprint

Reference architecture for retrieval-augmented generation apps covering embedding pipelines, vector search, prompt orchestration, and evaluation.

Fit

When this blueprint fits

And when to walk away from it

When to use this

You need an LLM to answer questions grounded in a corpus of documents (product docs, internal knowledge, customer data) and you cannot fine-tune for every customer. RAG is the right answer when the data changes often and the model has not seen it.

When NOT to use this

If your domain is narrow, stable, and small enough to fit in a system prompt, RAG is overkill. Skip the vector store and inline the relevant context in the prompt.

Architecture

System components

Key building blocks of this architecture, layered from infrastructure up.

01

Document Ingestion

Crawl, parse, and chunk source documents into embedding-ready text with metadata extraction. Most retrieval quality wins live here: a smart chunker that respects document structure beats any reranker. PDFs, HTML, Markdown, and custom formats each need their own parser.
Unstructured.io, PyPDF, Markdown, Beautiful Soup
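As a rough illustration of a chunker that respects document structure, here is a minimal sketch for Markdown input. It starts a new chunk at every heading and packs whole paragraphs up to a size limit. The names `Chunk` and `chunk_markdown` and the `max_chars` default are hypothetical choices for this example, not part of any of the libraries listed above. It does not implement overlap, and a single paragraph longer than `max_chars` is kept whole rather than split.

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^#{1,6}\s+(.+)$")


@dataclass
class Chunk:
    heading: str  # nearest Markdown heading above the chunk ("" if none)
    text: str     # one or more whole paragraphs, joined by blank lines


def chunk_markdown(doc: str, max_chars: int = 800) -> list[Chunk]:
    """Split at headings, then pack whole paragraphs up to max_chars."""
    chunks: list[Chunk] = []
    heading = ""
    paragraphs: list[str] = []  # paragraphs collected for the current chunk
    lines: list[str] = []       # lines of the paragraph currently being read

    def flush() -> None:
        # Emit the current chunk, if it has any content.
        if paragraphs:
            chunks.append(Chunk(heading, "\n\n".join(paragraphs)))
            paragraphs.clear()

    def end_paragraph() -> None:
        # Close the paragraph being read and add it to the current chunk,
        # starting a new chunk first if it would exceed the size limit.
        if not lines:
            return
        para = " ".join(lines)
        lines.clear()
        size = sum(len(p) for p in paragraphs)
        if paragraphs and size + len(para) > max_chars:
            flush()
        paragraphs.append(para)

    for line in doc.splitlines():
        match = HEADING.match(line)
        if match:
            end_paragraph()
            flush()  # a heading always starts a new chunk
            heading = match.group(1).strip()
        elif line.strip():
            lines.append(line.strip())
        else:
            end_paragraph()
    end_paragraph()
    flush()
    return chunks
```

Storing the heading alongside each chunk gives the retrieval layer a piece of metadata to filter or boost on. A production chunker would also need to handle tables, fenced code, and format-specific parsers.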
02

Embedding Pipeline

Generate embeddings with provider abstraction, batching, and incremental updates. Re-embedding the corpus is expensive, so design for incremental updates from day one. Track document versions and re-embed only what changed. See provider comparison.
OpenAI Embeddings, Cohere, Voyage, BGE
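One way to re-embed only what changed is to store a content hash next to each vector and compare it on every sync. The sketch below assumes an in-memory `index` dict and a provider-agnostic `embed` callable that maps a batch of texts to vectors. Both are stand-ins for a real store and a real provider client, and `sync_embeddings` is a hypothetical name.

```python
import hashlib
from typing import Callable

# Assumption: any provider can be wrapped as "list of texts -> list of vectors".
Embedder = Callable[[list[str]], list[list[float]]]


def content_hash(text: str, model: str) -> str:
    """Hash the text together with the model name, so a model change
    marks every document as stale."""
    return hashlib.sha256(f"{model}\n{text}".encode("utf-8")).hexdigest()


def sync_embeddings(
    docs: dict[str, str],
    index: dict[str, dict],
    embed: Embedder,
    model: str,
    batch_size: int = 64,
) -> list[str]:
    """Embed only new or changed documents; return the IDs that were embedded."""
    stale = [
        doc_id
        for doc_id, text in docs.items()
        if index.get(doc_id, {}).get("hash") != content_hash(text, model)
    ]
    # Send stale documents to the provider in batches.
    for start in range(0, len(stale), batch_size):
        batch = stale[start:start + batch_size]
        vectors = embed([docs[doc_id] for doc_id in batch])
        for doc_id, vector in zip(batch, vectors):
            index[doc_id] = {
                "hash": content_hash(docs[doc_id], model),
                "model": model,  # records which model produced the vector
                "vector": vector,
            }
    # Remove vectors for documents that no longer exist in the corpus.
    for doc_id in list(index):
        if doc_id not in docs:
            del index[doc_id]
    return stale
```

Calling `sync_embeddings` twice with an unchanged corpus makes no provider calls on the second run, which is the property that keeps re-embedding cost proportional to the change set.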
03

Vector Store

Store and query vectors with metadata filtering, namespacing, and approximate nearest neighbour search. pgvector is fine up to a few million vectors and removes an operational dependency. Pinecone, Qdrant, or Weaviate scale further with managed options.
pgvector, Pinecone, Qdrant, Weaviate
04

Retrieval Layer

Hybrid search combining dense (semantic) and sparse (keyword) retrieval with reranking. Pure vector search misses exact-match queries (product codes, names); pure keyword search misses semantic intent. Combine both, then rerank the top 50 with a cross-encoder. See the RAG playbook.
BM25, Cohere Rerank, MMR, Hybrid Search
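The text above does not say how to combine the two result lists. One common option is reciprocal rank fusion (RRF), sketched here as an assumption rather than as this blueprint's prescribed method. Each input is a list of document IDs ordered best-first, `k = 60` is the conventional smoothing constant, and `top_n = 50` mirrors the candidate count handed to the cross-encoder, which is not shown.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]],
    k: int = 60,
    top_n: int = 50,
) -> list[str]:
    """Fuse several best-first rankings into one list of document IDs.

    Each document scores 1 / (k + rank) in every ranking it appears in,
    so items ranked well by both retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending; break ties by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:top_n]


# Hypothetical results from a dense retriever and a BM25 retriever.
dense_hits = ["d1", "d2", "d3"]
sparse_hits = ["d3", "d1", "d4"]
candidates = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

RRF works on ranks only, so it avoids having to normalise cosine scores against BM25 scores, which live on different scales.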
05

Generation Layer

LLM call with retrieved context, citation handling, and streaming. Format the context clearly, instruct the model to cite sources, and verify that every citation exists in the retrieved set. Streaming hides latency, which matters when retrieval has added 300 ms to the request.
Anthropic Claude, GPT-4, Vercel AI SDK, Streaming
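The citation check can be sketched without any model call. The example below assumes citations are written as source IDs in square brackets, such as `[doc1]`. That format, the prompt wording, and the function names are illustrative assumptions, not a convention of any SDK listed above.

```python
import re

# Assumption: source IDs use letters, digits, underscores, and hyphens.
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def build_prompt(question: str, passages: dict[str, str]) -> str:
    """Format retrieved passages with their IDs and ask for bracketed citations."""
    context = "\n\n".join(f"[{pid}] {text}" for pid, text in passages.items())
    return (
        "Answer using only the sources below. "
        "Cite each claim with its source ID in square brackets.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}"
    )


def check_citations(
    answer: str, passages: dict[str, str]
) -> tuple[set[str], set[str]]:
    """Return (valid, unknown) citation IDs found in the model's answer."""
    cited = set(CITATION.findall(answer))
    known = set(passages)
    return cited & known, cited - known
```

An answer with a non-empty `unknown` set cites something that was never retrieved, which is a reasonable trigger for a retry, a warning, or stripping the citation. An answer with no citations at all is a separate case worth flagging.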
06

Evaluation Harness

Offline evaluation of retrieval quality (precision, recall) and online evaluation of generation quality (groundedness, helpfulness, citation accuracy). Without evals you cannot tell whether yesterday's prompt change made things better. Golden sets, LLM judges, and human review each have a role.
Ragas, Braintrust, LLM-as-Judge, Golden Sets
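For the offline half, precision and recall at k over a golden set need only a few lines. This is a minimal sketch: `golden` maps each query to the set of document IDs a human marked relevant, and `search` is whatever retriever is under test. Both names are hypothetical, and the sketch assumes the retriever returns unique IDs.

```python
from typing import Callable


def precision_recall_at_k(
    retrieved: list[str], relevant: set[str], k: int
) -> tuple[float, float]:
    """Precision@k and recall@k for one query (retrieved IDs assumed unique)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    precision = hits / k  # fraction of the top k that is relevant
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def evaluate_golden_set(
    golden: dict[str, set[str]],
    search: Callable[[str], list[str]],
    k: int = 5,
) -> tuple[float, float]:
    """Mean precision@k and recall@k across every query in the golden set."""
    if not golden:
        return 0.0, 0.0
    pairs = [
        precision_recall_at_k(search(query), relevant, k)
        for query, relevant in golden.items()
    ]
    count = len(pairs)
    mean_precision = sum(p for p, _ in pairs) / count
    mean_recall = sum(r for _, r in pairs) / count
    return mean_precision, mean_recall
```

Running this before and after a chunking or retrieval change gives a number to compare, which is the point of the harness. Generation metrics such as groundedness need a judge model or human review and are not covered here.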
07

Feedback Loop

Capture user feedback (thumbs up/down, follow-up questions, copy-and-paste signals) and feed it back into prompt tuning and retrieval improvements. The best RAG systems improve weekly because they have a tight feedback loop.
Telemetry, Annotation UI, Feedback API

Planning

Critical considerations

The things I have learned the hard way and would not skip on the next build.

Chunking strategy is the single biggest lever for retrieval quality. Respect document structure (headings, paragraphs, tables), keep chunks small enough to be precise and large enough to be useful, and overlap chunks for boundary continuity.
Always cite sources. Hallucinations are expensive in legal and healthcare, and ungrounded answers undermine trust everywhere else.
Cache embeddings, retrieval results, and generation outputs at every layer. RAG cost compounds quickly without caching: a 30% cache hit rate is the difference between a profitable feature and a runaway bill.
Plan for re-embedding when you change models. The vector space is provider-specific. Track which embedding model produced each vector so you can migrate gracefully.
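The caching and model-tracking points can be combined in the cache key itself. The sketch below is an in-memory illustration with hypothetical names (`LayerCache`, `get_or_compute`). A real deployment would back it with a shared store and add expiry. Including the model name in the key means a model change never serves vectors from the old vector space.

```python
import hashlib
import json
from typing import Any, Callable


class LayerCache:
    """One cache shared by the embedding, retrieval, and generation layers."""

    def __init__(self) -> None:
        self._store: dict[str, Any] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(layer: str, **parts: str) -> str:
        """Build a stable key from the layer name and everything that affects
        the result (model name, query text, prompt version, and so on)."""
        payload = json.dumps({"layer": layer, **parts}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = compute()  # only pay for the expensive call on a miss
        self._store[key] = value
        return value

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Tracking `hit_rate` per layer shows where caching is actually paying for itself, so the figure quoted above can be checked against real traffic.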

Options

Alternative approaches

Where I would consider a different shape entirely, with the trade-offs spelled out.

Alternative 01
Fine-tuning instead of RAG for stable, narrow domains where the data fits in training and rarely changes.
Alternative 02
LlamaIndex for opinionated RAG orchestration with batteries-included document loaders and retrievers.
Alternative 03
Managed services like Vectorize, Mendable, or Glean when you want a complete answer engine without building it.
Alternative 04
Long-context models (Claude 200k, Gemini 1M) for narrow use cases where the entire corpus fits in the prompt window.