AI Summer 2025: What Actually Progressed
Past the hype cycles, what's a working engineer to make of this year?
A mid-2025 stocktake on capabilities, costs, tools, and architectures: what actually moved versus what's marketing noise. The real progress isn't where the headlines are.
Real progress
Coding capability. Claude 4 and the GPT-4 successors are genuinely better at multi-file refactoring than last year's models. The gap between "AI assistant for code" and "AI agent for code" has closed considerably.
Long context that works. Million-token contexts used to degrade well before their advertised limit; now they actually hold up. This changes RAG architecture (covered in another post).
Tool use reliability. Models now produce well-formed tool calls 95%+ of the time, even on edge cases. Production agents are viable.
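"95%+ well-formed" still means you validate before you execute. A minimal sketch of that guard, using a hypothetical tool registry and call format (the `get_weather` schema and the `{"name": ..., "arguments": ...}` shape are illustrative, not any provider's API):

```python
import json

# Hypothetical tool registry: tool name -> required argument fields and types.
# Illustrative only; real deployments would use a proper schema library.
TOOLS = {
    "get_weather": {"city": str, "units": str},
}

def validate_tool_call(raw: str):
    """Parse a model-emitted tool call and check it against the registry.

    Returns (name, args) on success; raises ValueError otherwise, so the
    caller can retry or fall back instead of executing a malformed call.
    """
    call = json.loads(raw)  # raises on malformed JSON
    name, args = call.get("name"), call.get("arguments", {})
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    schema = TOOLS[name]
    missing = set(schema) - set(args)
    if missing:
        raise ValueError(f"missing arguments: {sorted(missing)}")
    for key, expected in schema.items():
        if not isinstance(args[key], expected):
            raise ValueError(f"argument {key!r} should be {expected.__name__}")
    return name, args

# A well-formed call passes; anything else fails fast, before side effects.
name, args = validate_tool_call(
    '{"name": "get_weather", "arguments": {"city": "Oslo", "units": "metric"}}'
)
```

The point is that the remaining few percent of malformed calls become a retry, not a production incident.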
Costs collapsed. Inference cost per token dropped 4-8x year-over-year at frontier capability. A workload that cost $1,000/day in 2024 runs at about $150/day now.
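The arithmetic behind that claim is simple enough to sanity-check yourself. The per-million-token prices and daily token volume below are assumed round numbers chosen to reproduce the $1,000 → $150 example, not quoted rates from any provider:

```python
# Back-of-envelope daily inference cost from token volume and unit price.
def daily_cost(tokens_per_day: int, usd_per_million_tokens: float) -> float:
    return tokens_per_day / 1_000_000 * usd_per_million_tokens

TOKENS = 100_000_000  # hypothetical daily workload: 100M tokens

cost_2024 = daily_cost(TOKENS, 10.0)  # assumed ~$10/Mtok blended in 2024
cost_2025 = daily_cost(TOKENS, 1.5)   # same workload at an assumed ~$1.50/Mtok
```

At those assumed prices the same workload goes from $1,000/day to $150/day, a roughly 6.7x drop, squarely inside the 4-8x range above.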
Stalled or hype
AGI. Talking heads keep predicting it; working engineers know we're nowhere near. We don't have agents that can run autonomously for a week without supervision, and we don't have models that can debug their own code reliably.
Domain-specific small models. Promised every year. Still no compelling case for most teams over "use a frontier model with a good prompt."
Multi-agent systems. Lots of papers, few production deployments. The orchestration overhead doesn't pay off for most use cases yet.
What I'm watching for the rest of 2025
- Whether reasoning models (o1-style, Claude with extended thinking) become cheap enough for routine use
- Open-weights models catching up enough that running locally becomes viable
- Whether "memory" finally gets a standard architecture rather than every team rolling their own
- Cost per intelligence-unit - the right metric to track as the field matures
Practical advice
If you're shipping product:
- Bet on frontier models + good prompts. Don't fine-tune unless you have a hard reason to.
- Assume costs drop 2x in 12 months. Don't over-engineer for token efficiency yet.
- Build provider-agnostic code. Models leapfrog each other.
- Invest in eval harnesses. They survive every model change.
The era of "this will all be solved by GPT-5" thinking is over. The work has shifted to engineering: how do you compose capable-but-imperfect models into reliable products?