What I Learned From My First Production Incident
TD Bank, 2024. Three hours. Customer money frozen. I wrote the postmortem.
An anonymized retrospective on the first big incident I drove. The technical fix mattered less than the communication and rollback discipline.
Some details changed; the lessons are real.
The setup
Mid-2024. We deployed a Spring Boot service that handled a portion of customer transactions. The deploy passed canary. It passed staged rollout. At 100% rollout, transactions started silently failing.
From the moment we hit 100%, customers' transactions were submitted, accepted, and quietly lost. The service returned 200 OK with an empty body. Downstream consumers logged the silent loss as "no data."
We caught it because a customer complained on Twitter.
The first 30 minutes
Most of the first half hour was confusion. Logs looked clean. Metrics looked clean. The error rate was zero (because we were returning 200s). The only signal was a downstream queue depth that had quietly dropped.
I asked: "When did this drop?" The answer: "About when you deployed."
We rolled back at minute 32. Queue depth recovered. Transactions resumed. Lost data was recovered from the upstream system within the SLA.
The technical cause
A serialization bug. A null check that wasn't there in the new code path. The endpoint returned an empty body because the resulting exception was swallowed by an aspect-oriented logging interceptor.
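I'm not going to paste the real code, but the failure pattern is easy to reproduce. Here's a hypothetical sketch (package names, pointcut, and metric of blame all invented) of how an @Around logging aspect that catches and "handles" exceptions turns a server-side failure into a 200 with an empty body:

```java
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

// Illustrative sketch only, not the incident code.
@Aspect
@Component
class RequestLoggingAspect {

    private static final Logger log = LoggerFactory.getLogger(RequestLoggingAspect.class);

    @Around("within(com.example.transactions.api..*)")  // pointcut is invented
    public Object logInvocation(ProceedingJoinPoint pjp) throws Throwable {
        try {
            return pjp.proceed();
        } catch (Throwable t) {
            // The bug pattern: the exception (in our case an NPE from the missing
            // null check) is logged but never rethrown. The controller method
            // effectively returns null, Spring writes no response body, and the
            // client sees 200 OK. Error-rate dashboards stay green.
            log.warn("request failed in {}", pjp.getSignature(), t);
            return null;
        }
    }
}
```

The fix is boring: rethrow, or translate into a proper error response. The interesting part is that nothing in our alerting distinguished "handled" from "swallowed."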
The unit tests passed because they only exercised the happy path with tidy fixtures, never the input shape the service actually received in production.
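To make that concrete, here's a hypothetical, self-contained illustration of the gap (the domain types are inline stand-ins, not our real service):

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class TransactionShapeTest {

    record TransactionRequest(String account, String currency, String promoCode) {}

    // Stand-in for the buggy code path: trims an optional field without a null check.
    static String normalizePromo(TransactionRequest r) {
        return r.promoCode().trim().toLowerCase();
    }

    @Test
    void happyPathFixture() {
        // What we had: every optional field populated, so the missing null check never fires.
        var request = new TransactionRequest("ACME-123", "USD", "SPRING24");
        assertEquals("spring24", normalizePromo(request));
    }

    @Test
    void productionShapeFixture() {
        // What was missing: the shape real traffic had (the optional field absent).
        // Against the original code this test fails with the same NPE that
        // production swallowed and turned into an empty 200.
        var request = new TransactionRequest("ACME-123", "USD", null);
        assertEquals("", normalizePromo(request));
    }
}
```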
What I did right
- I called the rollback. The team had been debugging forward; rollback was the right move.
- I personally wrote a customer-facing email within 90 minutes of the all-clear.
- I led the postmortem. Clear timeline. No blame. Specific action items.
What I did wrong
- I didn't notice the queue depth signal myself. I noticed only when someone else flagged it.
- I let the deploy reach 100% before checking metrics that took two minutes to populate. The right cadence was: deploy to 10%, wait for the metrics, then deploy more (see the sketch after this list).
- I trusted the unit-test coverage. We were at 90%+ line coverage, but the missing 10% included the input shape that actually mattered.
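That cadence is simple enough to write down. This is a hypothetical sketch, not our deployment tooling; the Deployer and Metrics interfaces stand in for whatever your platform exposes, and the stages and soak time are illustrative:

```java
import java.time.Duration;
import java.util.List;

class StagedRollout {

    interface Deployer {
        void setTrafficPercent(int percent);
        void rollBack();
    }

    interface Metrics {
        // Error rate, latency, and (crucially) downstream queue depth.
        boolean looksHealthy();
    }

    static void roll(Deployer deployer, Metrics metrics) throws InterruptedException {
        List<Integer> stages = List.of(10, 25, 50, 100);
        // Our metrics took ~2 minutes to populate, so soak longer than that
        // at every stage before deciding to advance.
        Duration soak = Duration.ofMinutes(5);

        for (int percent : stages) {
            deployer.setTrafficPercent(percent);
            Thread.sleep(soak.toMillis());
            if (!metrics.looksHealthy()) {
                deployer.rollBack();
                throw new IllegalStateException("rollout aborted at " + percent + "%");
            }
        }
    }
}
```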
What changed after
- Mandatory queue-depth alarms, where before we had only error-rate alarms (see the sketch after this list)
- Manual stop-the-world point at 25% before going to 50% on big services
- Production-shape integration tests added to CI
- A real "is the deploy actually safe?" gate, not just "did it not return 500s?"
The lesson
Postmortems don't fix the technical bug. They surface the SYSTEM bug.
The technical bug was a missing null check. The system bug was that we trusted unit tests, deployed too aggressively, and relied on customer reports for first signal.
Fixing technical bugs is fast. Fixing systems is slow. Most teams fix the technical bug and ship more bugs of the same shape next month.
If you're an engineering leader, treat your postmortems as the one mechanism that can get you out of the perpetual incident cycle.