What I Learned From My First Production Incident
TD Bank, 2024. Three hours. Customer money frozen. I wrote the postmortem.
An anonymized retrospective on the first big incident I drove. The technical fix mattered less than the communication and rollback discipline.
Some details changed; the lessons are real.
The setup
Mid-2024. We deployed a Spring Boot service that handled a portion of customer transactions. The deploy passed canary. It passed staged rollout. At 100% rollout, transactions started silently failing.
From the moment we hit 100%, customers' transactions were submitted, accepted, and quietly lost. The service returned 200 OK with an empty body. Downstream consumers logged the silent loss as "no data."
We caught it because a customer complained on Twitter.
The first 30 minutes
Most of the first half hour was confusion. Logs looked clean. Metrics looked clean. The error rate was zero (because we were returning 200s). The only signal was a downstream queue depth that had quietly dropped.
I asked: "When did this drop?" The answer: "About when you deployed."
We rolled back at minute 32. Queue depth recovered. Transactions resumed. Lost data was recovered from the upstream system within the SLA.
The technical cause
A serialization bug. A null check that wasn't there in the new code path. The endpoint returned an empty body because the resulting exception was swallowed by an aspect-oriented logging interceptor.
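I'm not going to paste the real code, but the failure pattern is easy to reproduce. Here's a hypothetical sketch (package names, pointcut, and metric of blame all invented) of how an @Around logging aspect that catches and "handles" exceptions turns a server-side failure into a 200 with an empty body:

```java
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

// Illustrative sketch only, not the incident code.
@Aspect
@Component
class RequestLoggingAspect {

    private static final Logger log = LoggerFactory.getLogger(RequestLoggingAspect.class);

    @Around("within(com.example.transactions.api..*)")  // pointcut is invented
    public Object logInvocation(ProceedingJoinPoint pjp) throws Throwable {
        try {
            return pjp.proceed();
        } catch (Throwable t) {
            // The bug pattern: the exception (in our case an NPE from the missing
            // null check) is logged but never rethrown. The controller method
            // effectively returns null, Spring writes no response body, and the
            // client sees 200 OK. Error-rate dashboards stay green.
            log.warn("request failed in {}", pjp.getSignature(), t);
            return null;
        }
    }
}
```

The fix is boring: rethrow, or translate into a proper error response. The interesting part is that nothing in our alerting distinguished "handled" from "swallowed."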
The unit tests passed because they only exercised the happy path with tidy fixtures, never the input shape the service actually received in production.
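To make that concrete, here's a hypothetical, self-contained illustration of the gap (the domain types are inline stand-ins, not our real service):

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class TransactionShapeTest {

    record TransactionRequest(String account, String currency, String promoCode) {}

    // Stand-in for the buggy code path: trims an optional field without a null check.
    static String normalizePromo(TransactionRequest r) {
        return r.promoCode().trim().toLowerCase();
    }

    @Test
    void happyPathFixture() {
        // What we had: every optional field populated, so the missing null check never fires.
        var request = new TransactionRequest("ACME-123", "USD", "SPRING24");
        assertEquals("spring24", normalizePromo(request));
    }

    @Test
    void productionShapeFixture() {
        // What was missing: the shape real traffic had (the optional field absent).
        // Against the original code this test fails with the same NPE that
        // production swallowed and turned into an empty 200.
        var request = new TransactionRequest("ACME-123", "USD", null);
        assertEquals("", normalizePromo(request));
    }
}
```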
What I did right
- I called the rollback. The team had been debugging forward; rollback was the right move.
- I personally wrote a customer-facing email within 90 minutes of the all-clear.
- I led the postmortem. Clear timeline. No blame. Specific action items.
What I did wrong
- I didn't notice the queue depth signal myself. I noticed only when someone else flagged it.
- I let the deploy reach 100% before checking metrics that took two minutes to populate. The right cadence was: deploy to 10%, wait for the metrics, then deploy more (see the sketch after this list).
- I trusted the unit-test coverage. We were at 90%+ line coverage, but the missing 10% included the input shape that actually mattered.
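That cadence is simple enough to write down. This is a hypothetical sketch, not our deployment tooling; the Deployer and Metrics interfaces stand in for whatever your platform exposes, and the stages and soak time are illustrative:

```java
import java.time.Duration;
import java.util.List;

class StagedRollout {

    interface Deployer {
        void setTrafficPercent(int percent);
        void rollBack();
    }

    interface Metrics {
        // Error rate, latency, and (crucially) downstream queue depth.
        boolean looksHealthy();
    }

    static void roll(Deployer deployer, Metrics metrics) throws InterruptedException {
        List<Integer> stages = List.of(10, 25, 50, 100);
        // Our metrics took ~2 minutes to populate, so soak longer than that
        // at every stage before deciding to advance.
        Duration soak = Duration.ofMinutes(5);

        for (int percent : stages) {
            deployer.setTrafficPercent(percent);
            Thread.sleep(soak.toMillis());
            if (!metrics.looksHealthy()) {
                deployer.rollBack();
                throw new IllegalStateException("rollout aborted at " + percent + "%");
            }
        }
    }
}
```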
What changed after
- Mandatory queue-depth alarms, where before we had only error-rate alarms (see the sketch after this list)
- Manual stop-the-world point at 25% before going to 50% on big services
- Production-shape integration tests added to CI
- A real "is the deploy actually safe?" gate, not just "did it not return 500s?"
The lesson
Postmortems don't fix the technical bug. They surface the SYSTEM bug.
The technical bug was a missing null check. The system bug was that we trusted unit tests, deployed too aggressively, and relied on customer reports for first signal.
Fixing technical bugs is fast. Fixing systems is slow. Most teams fix the technical bug and ship more bugs of the same shape next month.
If you're an engineering leader, treat your postmortems as the one mechanism that can get you out of the perpetual incident cycle.