
"RAG is Dead" - and Why That's Half True

Long context killed naive RAG. It didn't kill retrieval.

April 8, 2026 · 8 min read

Every six months someone declares RAG dead. They're wrong, but the way RAG is implemented in most production systems IS dead. Here's the difference.

Twitter loves a death announcement. "RAG is dead" trends every other month. Most of these takes are wrong. Some are right.

What's actually dead

The 2023-style RAG: chunk a document at 500 tokens, embed every chunk with text-embedding-ada-002, top-5 retrieval into GPT-3.5. That stack is no longer competitive.

It loses to:

  1. Just shoving the whole document into a 1M context window (for small corpora)
  2. Hybrid retrieval with re-ranking and 200K context (for large corpora)
  3. Agentic retrieval where the model issues its own queries (for complex/multi-hop)

If your production RAG pipeline still looks like a 2023 tutorial, it's worth a rebuild.
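Option 2 above usually means fusing a lexical ranking (BM25) with a vector ranking before re-ranking. A minimal, pure-Python sketch of reciprocal rank fusion (RRF) — the two toy ranked lists and doc IDs are illustrative assumptions, not output from a real index:

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: merge several ranked lists of doc IDs.

    Each doc scores sum(1 / (k + rank)) across the lists it appears in;
    k=60 is the commonly used constant.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: a BM25 ranking and an embedding ranking over the same corpus.
bm25_hits = ["doc_a", "doc_c", "doc_b"]
vector_hits = ["doc_b", "doc_a", "doc_d"]
fused = rrf([bm25_hits, vector_hits])  # doc_a first: it ranks high in both lists
```

The appeal of RRF is that it needs no score calibration between the two retrievers — only ranks.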

What's not dead

Retrieval itself. The fundamental insight that you should put relevant tokens in front of the model is more true than ever. What changes is how you decide what's "relevant" and how much you can afford to include.

I now think of retrieval as a budget allocation problem:

  • I have a context budget (1M tokens, or 200K, or 32K - depends on the model)
  • I have a latency budget
  • I have a cost budget (don't forget caching!)
  • Retrieval's job is to maximize relevance per token within those constraints

The 2023 RAG mindset was "retrieve the smallest relevant set." The 2026 mindset is "retrieve the largest relevant set you can afford."
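That framing reduces to a greedy packing problem. A sketch, assuming upstream retrieval has already attached a relevance score and a token count to each chunk (both hypothetical inputs here):

```python
def pack_context(chunks, token_budget):
    """Greedily pack chunks by relevance-per-token until the budget is spent.

    Each chunk is (text, relevance_score, token_count); scores and counts
    are assumed to come from a retriever and a tokenizer upstream.
    """
    ranked = sorted(chunks, key=lambda c: c[1] / c[2], reverse=True)
    picked, used = [], 0
    for text, score, tokens in ranked:
        if used + tokens <= token_budget:
            picked.append(text)
            used += tokens
    return picked, used

# Toy chunks: short high-relevance chunks beat one long mediocre one.
chunks = [
    ("refund policy", 0.9, 200),
    ("full terms of service", 0.7, 5000),
    ("shipping FAQ", 0.4, 300),
]
picked, used = pack_context(chunks, token_budget=1000)
```

"Retrieve the largest relevant set you can afford" is just this loop with a bigger `token_budget`.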

Three patterns I use now

  1. Slop retrieval + long context. For most internal tools, dump 50-100 chunks into Claude Opus 4.7 and let the model do the relevance filtering itself. Cheaper than a re-ranker if you cache.
  2. Agentic search. For multi-hop questions, give the model a search tool and let it run 3-5 queries. This wins decisively on questions like "find the contract clause that contradicts the policy in section 4."
  3. Hierarchical summarization. For corpora bigger than any context window, summarize at multiple levels (paragraph → section → document → corpus) and search at each level. This is how you'd build a 50K-document legal-review system today.
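Pattern 3 can be sketched as a summary tree you descend level by level instead of scanning every leaf. In a real system each `summary` comes from an LLM and the scorer is a retriever; here keyword overlap and a two-branch toy corpus stand in for both:

```python
def overlap(query, text):
    """Crude relevance proxy: shared lowercase words between query and text."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def search(node, query):
    """Descend into the best-matching child at each level; return the leaf text."""
    if not node.get("children"):
        return node["text"]
    best = max(node["children"], key=lambda c: overlap(query, c["summary"]))
    return search(best, query)

# Toy corpus-level node with section-level children and paragraph-level leaves.
corpus = {
    "summary": "all documents",
    "children": [
        {"summary": "employment contracts and hiring policy",
         "children": [{"summary": "termination clauses",
                       "text": "Either party may terminate with 30 days notice."}]},
        {"summary": "vendor agreements and procurement",
         "children": [{"summary": "payment terms",
                       "text": "Invoices are due within 45 days of receipt."}]},
    ],
}

answer = search(corpus, "employment termination notice")
```

The point of the hierarchy is that search cost scales with tree depth, not corpus size — which is what makes 50K documents tractable.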

The mistake everyone makes

People treat RAG as one architecture. It's not. RAG is a family of architectures with different trade-offs. The right one for a customer-support bot is different from the right one for a code-search assistant is different from the right one for a clinical-decision-support tool.

When I advise clients, I always ask: what's your worst-case failure mode? If a wrong answer is "annoying", you can be loose. If a wrong answer is "we get sued," you need citations and confidence scoring and human review.
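For the "we get sued" end of that spectrum, the gate can be explicit in code. A sketch with a hypothetical `confidence` score (from the model or a verifier) and a threshold you'd tune to your failure mode:

```python
def gate(answer, citations, confidence, threshold=0.8):
    """Route uncited or low-confidence answers to human review.

    `confidence` and `threshold` are placeholders — real systems derive the
    score from a verifier model or calibration data, not a magic number.
    """
    if not citations or confidence < threshold:
        return ("human_review", answer)
    return ("auto", answer)
```

A support bot might skip this gate entirely; a clinical tool might set `threshold` so high that most answers go to a human.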

The takeaway

Don't believe the death announcements. Read them as "the simple version of this technique no longer wins." That's almost always true in fast-moving fields. The general technique gets refined; it doesn't disappear.


