AI Systems | moderate complexity

RAG Application Blueprint

Reference architecture for retrieval-augmented generation apps covering embedding pipelines, vector search, prompt orchestration, and evaluation.

Fit

When this blueprint fits

And when to walk away from it

When to use this

You need an LLM to answer questions grounded in a corpus of documents (product docs, internal knowledge, customer data) and you cannot fine-tune for every customer. RAG is the right answer when the data changes often and the model has not seen it.

When NOT to use this

If your domain is narrow, stable, and small enough to fit in a system prompt, RAG is overkill. Skip the vector store and inline the relevant context in the prompt.

Architecture

System components

Key building blocks of this architecture, layered from infrastructure up.

Document Ingestion

Crawl, parse, and chunk source documents into embedding-ready text with metadata extraction. Most retrieval quality wins live here: a smart chunker that respects document structure beats any reranker. PDFs, HTML, Markdown, and custom formats each need their own parser.

Unstructured.io, PyPDF, Markdown, Beautiful Soup
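As a rough illustration of a chunker that respects document structure, here is a minimal sketch that splits Markdown on headings and keeps the heading as metadata. The names `Chunk` and `chunk_markdown` are hypothetical, and the sketch assumes ATX-style headings (`#`, `##`) and does not handle `#` lines inside code fences.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    heading: str  # nearest heading above the text, kept as metadata
    text: str     # body text found under that heading


# Matches ATX headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(doc: str) -> list[Chunk]:
    """Split a Markdown document into one chunk per heading section."""
    chunks: list[Chunk] = []
    heading, lines = "", []
    for line in doc.splitlines():
        match = HEADING.match(line)
        if match:
            # A new heading closes the previous section.
            body = "\n".join(lines).strip()
            if body:
                chunks.append(Chunk(heading, body))
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    # Flush the final section.
    body = "\n".join(lines).strip()
    if body:
        chunks.append(Chunk(heading, body))
    return chunks
```

Other formats (PDF, HTML) would need their own structural signal in place of the heading regex.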

Embedding Pipeline

Generate embeddings with provider abstraction, batching, and incremental updates. Re-embedding the corpus is expensive, so design for incremental updates from day one. Track document versions and re-embed only what changed. See provider comparison.

OpenAI Embeddings, Cohere, Voyage, BGE
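One way to re-embed only what changed is to store a content hash next to each document and compare it on the next run. This is a sketch under that assumption; `content_hash` and `plan_reembedding` are hypothetical names, and where the hashes are stored is left open.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reembedding(
    docs: dict[str, str], stored_hashes: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Compare current documents with the hashes recorded at the last run.

    Returns (to_embed, to_delete): ids that are new or changed, and ids
    whose vectors should be removed because the document no longer exists.
    """
    to_embed = [
        doc_id
        for doc_id, text in docs.items()
        if stored_hashes.get(doc_id) != content_hash(text)
    ]
    to_delete = [doc_id for doc_id in stored_hashes if doc_id not in docs]
    return to_embed, to_delete
```

Unchanged documents never reach the embedding provider, so the cost of a run tracks the size of the change rather than the size of the corpus.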

Vector Store

Store and query vectors with metadata filtering, namespacing, and approximate nearest neighbour search. pgvector is fine up to a few million vectors and removes an operational dependency. Pinecone, Qdrant, or Weaviate scale further with managed options.

pgvector, Pinecone, Qdrant, Weaviate
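To make the query contract concrete, here is a toy in-memory version of "filter by metadata, then rank by similarity". It is an exact brute-force search, not the approximate nearest neighbour index a real store would use, and the `search` function and record layout are assumptions made for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(records, query, k=3, where=None):
    """Return the ids of the k records most similar to the query.

    records: list of (id, vector, metadata) tuples.
    where:   optional dict of metadata fields that must match exactly.
    """
    where = where or {}
    # Apply the metadata filter before ranking.
    candidates = [
        r for r in records
        if all(r[2].get(key) == value for key, value in where.items())
    ]
    ranked = sorted(candidates, key=lambda r: cosine(r[1], query), reverse=True)
    return [r[0] for r in ranked[:k]]
```

A filter such as `where={"tenant": "x"}` is the kind of constraint that namespacing and metadata filtering exist to serve.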

Retrieval Layer

Hybrid search combining dense (semantic) and sparse (keyword) retrieval with reranking. Pure vector search misses exact-match queries (product codes, names); pure keyword search misses semantic intent. Combine both, then rerank the top 50 with a cross-encoder. See the RAG playbook.

BM25, Cohere Rerank, MMR, Hybrid Search
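The blueprint does not prescribe how to combine the dense and sparse result lists. Reciprocal rank fusion is one common option, used here as an assumption: it needs only the rank positions, so the two retrievers' scores never have to be put on the same scale. The function name and the defaults (`k=60`, `top_n=50`) are illustrative, and the cross-encoder rerank that follows is not shown.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=50):
    """Fuse several ranked lists of document ids into one list.

    rankings: list of ranked id lists, best result first in each.
    k:        smoothing constant; larger values flatten rank differences.
    top_n:    how many fused candidates to hand to the reranker.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for deterministic output.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused[:top_n]
```

A document that appears in both lists collects two contributions, which is how an exact-match hit from the keyword side and a semantic hit from the vector side reinforce each other.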

Generation Layer

LLM call with retrieved context, citation handling, and streaming. Format the context clearly, instruct the model to cite sources, and verify that every citation exists in the retrieved set. Streaming hides latency, which matters when retrieval has already added 300ms to the request.

Anthropic Claude, GPT-4, Vercel AI SDK, Streaming
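A minimal sketch of the formatting and verification steps, assuming the model is asked to cite with bracketed numbers such as `[1]`. That citation convention, and the names `format_context` and `invalid_citations`, are assumptions for illustration; the LLM call itself is left out.

```python
import re


def format_context(chunks):
    """Render retrieved chunks as a numbered list the model can cite.

    chunks: list of (source_id, text) tuples in retrieval order.
    """
    return "\n\n".join(
        f"[{i}] ({source_id}) {text}"
        for i, (source_id, text) in enumerate(chunks, start=1)
    )


CITATION = re.compile(r"\[(\d+)\]")


def invalid_citations(answer: str, n_chunks: int) -> list[int]:
    """Return cited numbers that do not correspond to a retrieved chunk."""
    cited = {int(number) for number in CITATION.findall(answer)}
    return sorted(c for c in cited if not 1 <= c <= n_chunks)
```

A non-empty result from `invalid_citations` means the answer cites something that was never retrieved, which is a reasonable trigger for a retry or a warning.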

Evaluation Harness

Offline evaluation of retrieval quality (precision, recall) and online evaluation of generation quality (groundedness, helpfulness, citation accuracy). Without evals you cannot tell whether yesterday's prompt change made things better. Golden sets, LLM judges, and human review each have a role.

Ragas, Braintrust, LLM-as-Judge, Golden Sets
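For the offline half, retrieval precision and recall at a cutoff `k` can be computed per golden-set query with a few lines. This sketch assumes each golden example lists the ids of its relevant documents; the function name is hypothetical.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision and recall of the top-k retrieved ids for one query.

    retrieved: ranked list of document ids returned by the retriever.
    relevant:  set of ids the golden set marks as relevant.
    """
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Averaging these over the whole golden set before and after a change gives a number to compare, which is the point of having the harness.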

Feedback Loop

Capture user feedback (thumbs, follow-up questions, copy-and-paste signals) and surface it back into prompt tuning and retrieval improvements. The best RAG systems improve weekly because they have a tight feedback loop.

Telemetry, Annotation UI, Feedback API

Planning

Critical considerations

The things I have learned the hard way and would not skip on the next build.

Chunking strategy is the single biggest lever for retrieval quality. Respect document structure (headings, paragraphs, tables), keep chunks small enough to be precise and large enough to be useful, and overlap chunks for boundary continuity.
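A sketch of the overlap idea, assuming whitespace-separated words as the unit of size. The window and overlap values are placeholders to tune, not recommendations, and `window_chunks` is a hypothetical name. In practice this would run inside each structural section rather than over the whole document.

```python
def window_chunks(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        # Stop once a window has reached the end of the text.
        if start + size >= len(words):
            break
    return chunks
```

With `size=4` and `overlap=1`, the last word of each chunk is repeated as the first word of the next, so a sentence that straddles a boundary is still seen whole by at least one chunk when the overlap is large enough.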

Always cite sources. Hallucinations are expensive in legal and healthcare, and ungrounded answers undermine trust everywhere else.

Cache embeddings, retrieval results, and generation outputs at every layer. RAG cost compounds quickly without caching: a 30% cache hit rate is the difference between a profitable feature and a runaway bill.
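As one example of caching at a single layer, this sketch wraps an embedding function with a cache keyed by model name and content hash, and counts hits so the hit rate can be measured. `EmbeddingCache` and the `embed_fn` callable are hypothetical; a production cache would live in a shared store rather than a dict.

```python
import hashlib


class EmbeddingCache:
    """In-memory cache in front of an embedding function."""

    def __init__(self, embed_fn, model: str):
        self.embed_fn = embed_fn  # callable: text -> vector (assumed)
        self.model = model
        self.store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, text: str) -> str:
        # Include the model so vectors from different models never collide.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return f"{self.model}:{digest}"

    def embed(self, text: str):
        key = self._key(text)
        if key in self.store:
            self.hits += 1
        else:
            self.misses += 1
            self.store[key] = self.embed_fn(text)
        return self.store[key]

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

The same shape applies to retrieval results and generation outputs, with the query and the retrieved context taking the place of the text in the key.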

Plan for re-embedding when you change models. The vector space is provider-specific. Track which embedding model produced each vector so you can migrate gracefully.
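A minimal sketch of tracking the embedding model per vector, assuming each stored row carries a model label. The `StoredVector` record and both helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StoredVector:
    doc_id: str
    embedding_model: str  # label of the model that produced the vector
    vector: list[float]


def needs_migration(rows: list[StoredVector], target_model: str) -> list[str]:
    """Ids whose vectors were produced by a model other than the target."""
    return [row.doc_id for row in rows if row.embedding_model != target_model]


def searchable(rows: list[StoredVector], query_model: str) -> list[StoredVector]:
    """Rows that can be compared with a query embedded by `query_model`.

    Vectors from different models live in different spaces, so mixing
    them in one similarity search gives meaningless scores.
    """
    return [row for row in rows if row.embedding_model == query_model]
```

During a migration, `needs_migration` gives the backlog and `searchable` keeps queries restricted to vectors from the model that embedded the query.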

Options

Alternative approaches

Where I would consider a different shape entirely, with the trade-offs spelled out.

Alternative 01

Fine-tuning instead of RAG for stable, narrow domains where the data fits in training and rarely changes.

Alternative 02

LlamaIndex for opinionated RAG orchestration with batteries-included document loaders and retrievers.

Alternative 03

Managed services like Vectorize, Mendable, or Glean when you want a complete answer engine without building it.

Alternative 04

Long-context models (Claude 200k, Gemini 1M) for narrow use cases where the entire corpus fits in the prompt window.
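A quick way to sanity-check the long-context option is to estimate the corpus size in tokens before committing to it. This sketch assumes roughly four characters per token, which is only a coarse heuristic for English text; a real check should use the target model's tokenizer. The function name, the reserve for instructions and the answer, and the default window are illustrative.

```python
def fits_in_context(
    docs: list[str],
    context_tokens: int = 200_000,
    reserve_tokens: int = 8_000,
    chars_per_token: float = 4.0,
) -> bool:
    """Rough test of whether a whole corpus fits in one prompt.

    reserve_tokens is held back for instructions, the question,
    and the generated answer.
    """
    estimated_tokens = sum(len(doc) for doc in docs) / chars_per_token
    return estimated_tokens <= context_tokens - reserve_tokens
```

If the estimate lands close to the limit, treat the answer as "no": the heuristic can be off by a wide margin for code, tables, or non-English text.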

Implementation

Related playbooks

Step-by-step guides for the harder parts of this architecture.

Building RAG Applications

Retrieval-augmented generation looks simple in a demo and stays simple until your knowledge base is bigger than a thousand documents, chunks overlap badly, or relevance scores stop making sense. This is my end-to-end RAG playbook: document processing, embedding pipelines, retrieval tuning, prompt design, and the evaluation harness that tells you whether changes are actually improving results.

Shipping AI Features Without the Hype Tax

Most AI features ship as a demo that survives one round of investor questions and then quietly dies in production. This is the discipline that gets AI features past that wall: small scope, real evals, careful rollouts, and instrumentation that catches drift early. The same loop I run when I add AI capabilities to an existing product, on a real timeline with real users.

In practice

Related case studies

Where I have applied this blueprint to real builds and what changed in practice.

AI Document Processing Platform

An AI-powered document processing system that transformed how a legal team handled contract review, due diligence, and compliance.

AI-Powered Enterprise Search

An AI-powered search platform that unifies search across dozens of enterprise systems with natural-language understanding and contextual results.

Thinking

Related insights

Essays where I argue the trade-offs behind the choices in this blueprint.

Building Production RAG Systems

RAG looks simple in demos but is notoriously hard in production. Here's a comprehensive guide to building RAG systems that actually work, based on real deployment experience.

An LLM Evaluation Framework That Works

How to systematically evaluate LLM applications with a practical framework covering automated metrics, human evaluation, and continuous monitoring.
