
AI Summer 2025: What Actually Progressed

Past the hype cycles, what's a working engineer to make of this year?

June 30, 2025 · 9 min read

Mid-year stocktake on AI capabilities, costs, tools, and architectures. The real progress isn't where the headlines are.


Real progress

Coding capability. Claude 4 and the GPT-4 successors are genuinely better at multi-file refactoring than last year's models. The gap between "AI assistant for code" and "AI agent for code" has closed considerably.

Long context that works. Million-token contexts no longer just degrade gracefully - they actually hold up in practice. This changes RAG architecture (covered in another post).

Tool use reliability. Models now produce well-formed tool calls 95%+ of the time, even on edge cases. Production agents are viable.
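What "well-formed tool calls" buys you in practice is that a thin validation layer catches the remaining failures before anything executes. A minimal sketch, assuming a hypothetical tool registry and a `get_weather` tool that are my own invention, not any particular SDK:

```python
import json

# Hypothetical tool registry: tool name -> required argument names and types.
TOOLS = {
    "get_weather": {"city": str, "units": str},
}

def validate_tool_call(raw: str) -> dict:
    """Parse a model-emitted tool call and check it against the registry.

    Raises ValueError on malformed JSON, unknown tools, or missing/mistyped
    arguments, so the caller can retry or fall back instead of executing garbage.
    """
    call = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    name, args = call["name"], call.get("arguments", {})
    spec = TOOLS.get(name)
    if spec is None:
        raise ValueError(f"unknown tool: {name}")
    for arg, typ in spec.items():
        if arg not in args:
            raise ValueError(f"missing argument: {arg}")
        if not isinstance(args[arg], typ):
            raise ValueError(f"bad type for argument: {arg}")
    return call

# A well-formed call passes; a malformed one raises instead of executing.
ok = validate_tool_call(
    '{"name": "get_weather", "arguments": {"city": "Oslo", "units": "metric"}}'
)
```

With models at 95%+ well-formed, this layer is cheap insurance rather than the main control loop - which is what makes production agents viable.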

Costs collapsed. Inference costs per token dropped 4-8x year-over-year for frontier capability. What was $1000/day in 2024 is now $150/day for the same workload.
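The two figures above are consistent with each other: $1000/day down to $150/day is about a 6.7x reduction, inside the stated 4-8x band. A trivial sanity check:

```python
# Same workload, one year apart (figures from the text above).
cost_2024_per_day = 1000.0
cost_2025_per_day = 150.0

drop_factor = cost_2024_per_day / cost_2025_per_day
assert 4 <= drop_factor <= 8  # within the stated 4-8x range
print(f"{drop_factor:.1f}x cheaper")  # → 6.7x cheaper
```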

Stalled or hype

AGI. Talking heads keep predicting it. Working engineers know we're nowhere near. We don't have agents that can run autonomously for a week without supervision. We don't have models that can debug their own code reliably.

Domain-specific small models. Promised every year. Still no compelling case for most teams over "use a frontier model with a good prompt."

Multi-agent systems. Lots of papers, few production deployments. The orchestration overhead doesn't pay off for most use cases yet.

What I'm watching for the rest of 2025

  • Whether reasoning models (o1-style, Claude with extended thinking) become cheap enough for routine use
  • Open-weights models catching up enough that running locally becomes viable
  • Whether "memory" finally gets a standard architecture rather than every team rolling their own
  • Cost per intelligence-unit - the right metric to track as the field matures
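There is no standard definition of "cost per intelligence-unit" yet; one way I might operationalize it (an assumption on my part, with made-up numbers) is dollars spent per task solved on a fixed eval set, which compares models on price and capability at once:

```python
def cost_per_solved_task(price_per_mtok: float, tokens_used: int, tasks_solved: int) -> float:
    """Dollars spent per eval task solved: a crude 'cost per intelligence-unit'."""
    total_cost = price_per_mtok * tokens_used / 1_000_000
    return total_cost / tasks_solved

# Hypothetical numbers: a cheap model that solves far fewer tasks can
# still lose to a pricier model on this metric.
cheap = cost_per_solved_task(price_per_mtok=0.5, tokens_used=2_000_000, tasks_solved=10)
frontier = cost_per_solved_task(price_per_mtok=5.0, tokens_used=1_000_000, tasks_solved=90)
# cheap    -> $0.100 per solved task
# frontier -> $0.056 per solved task
```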

Practical advice

If you're shipping product:

  • Bet on frontier models + good prompts. Don't fine-tune unless you have a hard reason to.
  • Assume costs drop 2x in 12 months. Don't over-engineer for token efficiency yet.
  • Build provider-agnostic code. Models leapfrog each other.
  • Invest in eval harnesses. They survive every model change.
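The "provider-agnostic" advice can be as simple as routing every model call through one internal interface. A minimal sketch - the names here (`Provider`, `complete`, `FakeProvider`) are my own invention, not any vendor's SDK:

```python
from typing import Protocol

class Provider(Protocol):
    """Minimal internal interface: swap vendors without touching call sites."""
    def complete(self, prompt: str) -> str: ...

class FakeProvider:
    """Stand-in for a real vendor SDK; a real adapter would wrap its client here."""
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"

def summarize(provider: Provider, text: str) -> str:
    # Application code depends only on the Protocol, never on a vendor SDK.
    return provider.complete(f"Summarize: {text}")

result = summarize(FakeProvider(), "long context windows")
```

When models leapfrog each other, switching vendors then means writing one new adapter class, and the same `FakeProvider` pattern doubles as a test stub for the eval harness.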

The era of "this will all be solved by GPT-5" thinking is over. The work has shifted to engineering: how do you compose capable-but-imperfect models into reliable products?


Want to discuss this topic?

I'm always happy to dive deeper. Reach out if you have questions or want to collaborate.

Get in Touch
