The Real Cost of Running Claude in Production
Token economics matter more when you're shipping.
I track every penny of Claude API spend across my projects. Here's the breakdown of what costs what, and where the savings are.
I run Claude in production for several clients. Here's what costs add up.
Cost components
For a typical RAG-style chat product:
- System prompt: 2-4K tokens, mostly cached
- Retrieved context: 5-30K tokens, partially cached
- User message: 50-300 tokens, never cached
- Response: 200-2K tokens, never cached
Without caching, every request pays for the full input. With caching, only the new bits.
The caching multiplier
Anthropic's prompt caching cuts cached-token cost by ~10x. The trick: keep the prefix of your prompt stable. A single timestamp in the cached prefix invalidates the cache.
For a chat product with stable system prompt + stable retrieval scaffolding + variable user messages, caching saves 50-80% on token cost.
My standard cost model
For a customer-support chat agent on Claude Sonnet 4.6:
- 8K-token system prompt (cached)
- 12K-token retrieved context (cached for the session)
- 200-token user message
- 600-token response
Per turn (with cache): ~$0.004. Per 1000 turns: $4.
Compared to a naively-implemented version (no caching): $0.024 per turn. 6x more expensive.
Where the surprises are
- Long context degrades cost. Filling 1M tokens is expensive even with caching. Use it only when it earns its keep.
- Tool use multiplies turns. Each tool call is a round-trip. An agent that uses 5 tools per task is 5x the turns of a single-shot generation.
- Streaming doesn't save cost. Same token count, different delivery.
What I tell clients
Budget for AI cost like you budget for AWS. Track it daily. Set alerts at 50%, 80%, 100% of budget. Make a person responsible for the bill.
Also: token cost will probably halve again in 12 months. Don't over-engineer for token efficiency at the cost of code quality. Sometimes "expensive" is the right answer for now.