AI Agents in Production: What 18 Months Taught Me
Agent demos look magical. Production agents fail in mundane ways.
I've shipped four agent-based features in production since late 2024. Three of them are still running. The one that isn't taught me the most.
Production agents are nothing like the demos. This post is about the boring parts.
What an "agent" actually is in production
For the purposes of this post, an agent is a system where the LLM makes more than one tool call in a loop, optionally re-evaluating its plan between calls. That covers everything from "search the docs and answer" to "open a PR with the fix."
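A minimal version of that loop, sketched in Python. `call_model` and `TOOLS` are stand-ins for whatever model API and tool implementations you use, not a real SDK; note there's no step limit yet, which matters in a moment.

```python
from typing import Any, Callable

# Stand-in: wraps your model API. Returns either
# {"tool": name, "args": {...}} or {"answer": text}.
def call_model(messages: list[dict[str, Any]]) -> dict[str, Any]:
    raise NotImplementedError

# Stand-in: tool names mapped to plain functions, e.g. {"search": search_docs}.
TOOLS: dict[str, Callable[..., str]] = {}

def run_agent(goal: str) -> str:
    messages = [{"role": "user", "content": goal}]
    while True:  # unbounded; see "Hard step limit" below
        decision = call_model(messages)
        if "answer" in decision:  # the model re-evaluated and decided to stop
            return decision["answer"]
        result = TOOLS[decision["tool"]](**decision["args"])
        messages.append({"role": "tool", "content": result})
```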
Three things that worked
1. Bounded tool surface. My most successful agent has access to exactly four tools: search, read_file, write_diff, run_tests. That's it. Adding more tools didn't help; it confused the model.
2. Hard step limit. Every agent loop has a max-step counter. If it hits 20, we kill the run and either escalate to a human or return partial results. Without this, agents find subtle ways to loop forever. (Both this and the verification step below appear in the sketch after this list.)
3. Mandatory verification step. After the agent claims it's done, a separate (cheap) model verifies the claim against the original goal. About 8% of "done" claims fail this check. Catching those before they reach the user saved a lot of trust.
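Here's how the step limit and the verification pass compose, reusing the stand-ins from the sketch above. The 20-step cap mirrors the number above; `verify_done` is a hypothetical wrapper around a cheap model, and retrying after a rejected claim is one possible policy, not the only one.

```python
MAX_STEPS = 20  # the hard cap from point 2

class StepLimitExceeded(Exception):
    """Raised so a supervisor can escalate to a human or return partial results."""

def verify_done(goal: str, claim: str) -> bool:
    # Hypothetical: ask a separate, cheaper model whether `claim`
    # actually satisfies `goal`.
    raise NotImplementedError

def run_agent(goal: str) -> str:
    messages = [{"role": "user", "content": goal}]
    for _ in range(MAX_STEPS):
        decision = call_model(messages)
        if "answer" in decision:
            if verify_done(goal, decision["answer"]):
                return decision["answer"]
            # One policy: push back and let the agent keep working.
            messages.append({"role": "user",
                             "content": "A verifier rejected that answer; keep going."})
            continue
        result = TOOLS[decision["tool"]](**decision["args"])
        messages.append({"role": "tool", "content": result})
    raise StepLimitExceeded(f"no verified answer after {MAX_STEPS} steps")
```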
What killed my fourth agent
It was a customer-support triage agent. Read the ticket, classify, route, draft a response. Worked beautifully in eval, broke in production within a week.
The failure mode: someone wrote in with a complaint about a prior support interaction. The agent classified the complaint as a new ticket, drafted a response, and routed it. Because it had never seen the earlier thread, the draft read like a first contact, and the customer felt unheard. We pulled the agent within three days.
The fix wasn't a better model; it was admitting the agent shouldn't operate on conversational context it couldn't see. We rebuilt with a simple guardrail: if the ticket is part of an existing thread, escalate to a human. The volume of agent-handled tickets dropped 40%, and customer satisfaction went up.
The lesson: agents don't have judgment about when not to act. You have to encode that judgment as guardrails.
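To make that concrete, here's the shape of the guardrail: a dumb rule that runs before the agent does. `Ticket` and `escalate_to_human` are illustrative, not our actual ticketing schema.

```python
from dataclasses import dataclass

@dataclass
class Ticket:
    body: str
    thread_id: str | None  # set when the ticket references an existing conversation

def escalate_to_human(ticket: Ticket) -> str:
    raise NotImplementedError  # hypothetical handoff to the human queue

def triage(ticket: Ticket) -> str:
    # The guardrail is a rule, not a model judgment: anything tied to
    # an existing thread skips the agent entirely.
    if ticket.thread_id is not None:
        return escalate_to_human(ticket)
    return run_agent(f"Classify, route, and draft a reply:\n{ticket.body}")
```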
The architecture I'd recommend
For most teams shipping their first production agent:
- Start with a single-step LLM call, not an agent. Many problems framed as "we need an agent" are actually "we need a smarter prompt with retrieval."
- Add tools one at a time. Measure win/loss after each addition.
- Treat the planner as a separate component: many bugs are planning bugs, not execution bugs. Some teams use a smaller, faster model for execution and a bigger, smarter model for the plan (see the sketch after this list).
- Log everything: tool calls, model responses, intermediate scratchpads. You'll need them.
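A sketch of the planner/executor split with logging at each seam. `plan_model` and `exec_model` are hypothetical wrappers (a bigger model drafts the plan, a smaller one executes steps); the structure is the point.

```python
import json
import logging

log = logging.getLogger("agent")

def plan_model(goal: str) -> list[str]:
    raise NotImplementedError  # hypothetical: big model returns a step list

def exec_model(step: str, context: list[str]) -> str:
    raise NotImplementedError  # hypothetical: fast model + tools runs one step

def run_task(goal: str) -> str:
    plan = plan_model(goal)
    log.info("plan: %s", json.dumps(plan))  # planning bugs surface here
    if not plan:
        raise ValueError("planner returned an empty plan")
    context: list[str] = []
    for step in plan:
        result = exec_model(step, context)
        log.info("step %r -> %r", step, result)  # execution bugs surface here
        context.append(result)
    return context[-1]
```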
Cost reality
A multi-step agent on Claude Opus 4.7 averages 8-12 steps per resolution. At full price, that's $0.30-$1.20 per task. Worth it for high-value tasks (engineering automation, deep research). Not worth it for casual chat.
The economics get much better with prompt caching: the system prompt and tool definitions are cached across steps, and only the new context is billed at the full input rate.
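Back-of-envelope, with placeholder rates and token counts; check your provider's price sheet, and note the 90% cached-read discount is an assumption (cache writes also typically bill slightly above the normal input rate).

```python
# All rates and token counts below are placeholders, not real prices.
INPUT_RATE  = 15 / 1_000_000    # $ per input token
OUTPUT_RATE = 75 / 1_000_000    # $ per output token
CACHED_RATE = INPUT_RATE * 0.1  # assumed 90% discount on cache reads

STEPS         = 10     # middle of the 8-12 range above
SYSTEM_TOKENS = 4_000  # system prompt + tool definitions (cacheable)
NEW_TOKENS    = 1_500  # fresh context per step
OUT_TOKENS    = 400    # model output per step

uncached = STEPS * ((SYSTEM_TOKENS + NEW_TOKENS) * INPUT_RATE
                    + OUT_TOKENS * OUTPUT_RATE)
cached = STEPS * (SYSTEM_TOKENS * CACHED_RATE
                  + NEW_TOKENS * INPUT_RATE
                  + OUT_TOKENS * OUTPUT_RATE)
print(f"uncached ~ ${uncached:.2f}/task, cached ~ ${cached:.2f}/task")
# With these placeholders, caching roughly halves the per-task cost.
```

The exact discount varies by provider, but the shape holds: the longer the cached prefix is relative to the fresh context, the bigger the win.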
What I'm watching
- Multi-agent orchestration patterns (planner agent + worker agents + verifier agent) becoming more standardized
- Memory layers that persist across sessions. This is the gap between a "useful agent" and an "indispensable agent"
- Tool-use APIs standardizing across model providers, which would let us write agents that don't lock into one vendor
We're in the early-mainframe era of agents. Most things still don't work. The things that do are quietly reshaping product surfaces. Pay attention.