Claude Opus 4.7 with 1M Context: What Actually Changes
I rebuilt my code-search pipeline against the new model. Here's what's different.
1M tokens isn't a marketing number - it changes how you architect retrieval. My RAG pipeline got 3x simpler and 2x more accurate. The cost math also shifted in interesting ways.
Anthropic shipped Claude Opus 4.7 with a 1M token context window. I spent a week rebuilding a real RAG pipeline against it, and the lessons surprised me.
The TL;DR
For codebases under ~150K lines, you can stop doing chunked retrieval entirely. Just dump the whole thing in. The accuracy gain over a tuned embedding-search RAG was 31% on my benchmark (a closed corpus of 2,400 questions on a real codebase).
For codebases above that, you still need retrieval - but you can be much sloppier about it. Top-50 retrieval feeding into a 1M context wins over a carefully tuned top-5.
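Here's roughly what the "dump it all in" path looks like. This is a minimal sketch, not my production code: the repo path, the `*.py` file filter, and the model ID are placeholders, and the token budget is a crude chars/4 heuristic rather than a real tokenizer.

```python
from pathlib import Path

import anthropic

MAX_EST_TOKENS = 900_000  # leave headroom under the 1M cap


def load_corpus(repo_root: str) -> str:
    """Concatenate source files into one block, tagged with their paths."""
    parts, total = [], 0
    for path in sorted(Path(repo_root).rglob("*.py")):  # widen the glob for mixed repos
        text = path.read_text(errors="ignore")
        est = len(text) // 4  # crude chars-to-tokens estimate
        if total + est > MAX_EST_TOKENS:
            break  # over budget: you're back in retrieval territory
        total += est
        parts.append(f"<file path={path}>\n{text}\n</file>")
    return "\n".join(parts)


client = anthropic.Anthropic()


def ask(corpus: str, question: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-7",  # placeholder model ID
        max_tokens=2048,
        system=f"Answer questions about this codebase.\n\n{corpus}",
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text
```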
The cost math
This is where it gets interesting. The naive read of "10x context = 10x cost" is wrong because of prompt caching. My typical workflow reuses the same large context across many queries, so caching cuts my marginal cost per query to about 1.7x what it was on Sonnet 3.5 with chunked RAG.
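Concretely, the caching pattern is a single `cache_control` breakpoint at the end of the big static prefix. A minimal sketch with the Anthropic Python SDK follows - the model ID is a placeholder, and `load_corpus` is the hypothetical helper from the earlier sketch:

```python
import anthropic

client = anthropic.Anthropic()
corpus = load_corpus("path/to/repo")  # hypothetical helper from the sketch above


def ask_cached(question: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-7",  # placeholder model ID
        max_tokens=2048,
        system=[
            {"type": "text", "text": "Answer questions about this codebase."},
            {
                "type": "text",
                "text": corpus,
                # Cache breakpoint: everything up to and including this block
                # is the reusable prefix; only the user turn varies per call.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text
```

Every call after the first whose prefix is byte-identical is billed at the much cheaper cache-read rate, which is where my ~1.7x marginal figure comes from.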
For a 24-hour SLA support agent that's mostly answering FAQ-style questions against the same documentation, 1M-context-with-cache is now cheaper than embedding search. That broke my mental model.
What broke
Three things tripped me up:
- Latency floor. First-token latency on a fully loaded 1M context is 4-6 seconds. For interactive UIs (chat), that's painful. I keep a smaller "interactive" path with 200K context for synchronous use and reserve the full 1M for background analysis.
- Truncation behavior. When you're near the cap, the model still generates, but quality degrades silently. I now actively monitor token utilization and alert when crossing 85% (see the first sketch after this list).
- Cache invalidation discipline. A single trailing-whitespace difference invalidates the cache. I had a bug where my prompt template injected a timestamp into the cached prefix; cost went 4x for a day before I caught it. The second sketch below shows the fix.
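For the truncation issue, the monitor is just a token count against the cap before dispatch. A minimal sketch, assuming the SDK's `count_tokens` endpoint and a placeholder alert hook:

```python
import anthropic

client = anthropic.Anthropic()

CONTEXT_CAP = 1_000_000
ALERT_AT = 0.85


def check_utilization(system_prompt: str, messages: list[dict]) -> float:
    count = client.messages.count_tokens(
        model="claude-opus-4-7",  # placeholder model ID
        system=system_prompt,
        messages=messages,
    )
    utilization = count.input_tokens / CONTEXT_CAP
    if utilization > ALERT_AT:
        # Swap in real alerting here; past this point quality
        # degrades without any error coming back from the API.
        print(f"WARN: context at {utilization:.0%} of cap")
    return utilization
```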
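And for the cache bug, the fix is boring discipline: nothing that varies per request may live in the cached prefix. A before/after sketch, with hypothetical helper names:

```python
from datetime import datetime, timezone


def build_prompt_buggy(corpus: str, question: str) -> tuple[str, str]:
    # BAD: the timestamp changes every call, so the "cached" prefix is
    # never byte-identical and every request pays the full input price.
    system = f"Current time: {datetime.now(timezone.utc)}\n\n{corpus}"
    return system, question


def build_prompt_fixed(corpus: str, question: str) -> tuple[str, str]:
    # GOOD: the prefix is stable across calls; anything per-request
    # (timestamps, user IDs, the question) goes after the cache breakpoint.
    system = f"Answer questions about this codebase.\n\n{corpus}"
    user = f"Current time: {datetime.now(timezone.utc)}\n\n{question}"
    return system, user
```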
How I'd architect a new product against this
If I were starting today:
- Tier 1 (sync chat): retrieval with top-20 chunks, 200K context, prompt-cache the system prompt + retrieval scaffolding
- Tier 2 (async analysis): 1M context, full corpus, cache aggressively, queue and stream results
- Tier 3 (multi-turn agent loops): smaller context, but use extended thinking for the hard steps
The era of "one model, one context window" is gone. Plan your routing layer first.
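The routing layer itself can be embarrassingly simple. A minimal sketch of the three tiers above - the request shape and the tier names are mine, not anything standard:

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    SYNC_CHAT = "sync_chat"            # top-20 retrieval, 200K context, cached scaffolding
    ASYNC_ANALYSIS = "async_analysis"  # full corpus, 1M context, cached, queued
    AGENT_LOOP = "agent_loop"          # smaller context, extended thinking on hard steps


@dataclass
class Request:
    interactive: bool  # is a human waiting on the response?
    multi_turn: bool   # agentic loop with tool calls?


def route(req: Request) -> Tier:
    if req.multi_turn:
        return Tier.AGENT_LOOP
    if req.interactive:
        return Tier.SYNC_CHAT  # keeps first-token latency tolerable
    return Tier.ASYNC_ANALYSIS  # queue it, cache it, stream results back
```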
What's next
I'm watching for the gap between "context is large enough for the whole codebase" and "context is large enough for the whole company". Once it's the latter, my entire approach to memory in agents has to change. I think we're 12-18 months away.