Building an Incident Response Playbook

When things go wrong (and they will)

Sri Vardhan, November 25, 2023

How to prepare for incidents, respond effectively, and learn from failures. A practical guide to incident management for growing teams.

Every system you operate will fail. The only choice you have is whether you fail well. A good incident response practice does not stop incidents from happening; it stops them from becoming disasters. It also stops your team from burning out under the weight of repeated, poorly handled outages.

This guide is the playbook I help teams build, distilled from years of being on call and leading response for systems where downtime had real consequences.

What makes an incident different

A bug is something broken in the code. An incident is something broken in the system that is affecting users right now. The distinction matters because the response is different. Bugs go through your normal triage. Incidents go through a parallel process focused on stopping the bleeding first, understanding it second.

Define what counts as an incident in writing. Most teams I work with land on three or four severity levels:

  • SEV1: total outage or data loss for a meaningful share of users.
  • SEV2: significant degradation, partial outage, or a feature down.
  • SEV3: minor issues, single-tenant problems, or anything with a known workaround.
  • SEV4: noisy alerts and near misses that are worth investigating but not worth paging anyone for.

Without clear definitions, every alert feels like a five-alarm fire and people stop responding to any of them.
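
Writing the definitions down can go one step further and put them in code, so that tooling and humans agree on what each level means. The sketch below is a minimal illustration, not a standard: the `Impact` fields, the two share thresholds, and the rule that a workaround demotes an incident to SEV3 are all assumptions to replace with your own definitions.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # total outage or data loss for a meaningful share of users
    SEV2 = 2  # significant degradation, partial outage, or a feature down
    SEV3 = 3  # minor issue, single-tenant problem, or a workaround exists
    SEV4 = 4  # noisy alert or near miss; investigate, but page no one


@dataclass
class Impact:
    """Hypothetical description of what responders know so far."""
    data_loss: bool = False
    affected_user_share: float = 0.0  # fraction of users affected, 0.0 to 1.0
    feature_down: bool = False
    workaround_exists: bool = False
    user_visible: bool = True


# Assumed thresholds for illustration only.
TOTAL_OUTAGE_SHARE = 0.5
DEGRADED_SHARE = 0.05


def classify(impact: Impact) -> Severity:
    """Map an impact description to a severity level, most severe first."""
    if not impact.user_visible:
        return Severity.SEV4
    if impact.data_loss or impact.affected_user_share >= TOTAL_OUTAGE_SHARE:
        return Severity.SEV1
    if impact.workaround_exists:
        return Severity.SEV3
    if impact.feature_down or impact.affected_user_share >= DEGRADED_SHARE:
        return Severity.SEV2
    return Severity.SEV3


def should_page(severity: Severity) -> bool:
    """SEV4 is the only level that never pages anyone."""
    return severity != Severity.SEV4
```

The classifier only produces a starting point. Severity is still declared by a person in the channel and can be revised as the picture changes.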

Roles during an incident

The single most useful change you can make is to formalize roles. Without them, six people end up doing the same thing while no one talks to customers.

  • Incident Commander: runs the response, makes calls, owns the timeline. Not necessarily the most technical person; needs to be calm and decisive.
  • Tech Lead: drives the actual investigation and mitigation.
  • Communications Lead: keeps customers, support, and internal stakeholders informed.
  • Scribe: writes everything down in the incident channel as it happens.

For small teams one person may wear two hats, but the IC and Tech Lead should never be the same person in a SEV1. The cognitive load of running the room and debugging the system at once is too high.
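
If your incident tooling records who holds which role, the "two hats, but never IC and Tech Lead in a SEV1" rule is easy to check automatically. This is a minimal sketch under that assumption; `Role` and `validate_roles` are hypothetical names, and severity is passed as a plain integer (1 for SEV1).

```python
from enum import Enum


class Role(Enum):
    INCIDENT_COMMANDER = "Incident Commander"
    TECH_LEAD = "Tech Lead"
    COMMS_LEAD = "Communications Lead"
    SCRIBE = "Scribe"


def validate_roles(assignments: dict[Role, str], severity: int) -> list[str]:
    """Return a list of problems; an empty list means the assignment is fine."""
    problems = []

    # Every role needs a named person, even if one person holds two roles.
    for role in Role:
        if not assignments.get(role):
            problems.append(f"{role.value} is unassigned")

    # In a SEV1 the IC and Tech Lead must be different people.
    ic = assignments.get(Role.INCIDENT_COMMANDER)
    tech_lead = assignments.get(Role.TECH_LEAD)
    if severity == 1 and ic and ic == tech_lead:
        problems.append(
            "SEV1: Incident Commander and Tech Lead must be different people"
        )
    return problems
```

A bot could run this check when roles are announced in the channel and post any problems back, which is cheaper than discovering halfway through a SEV1 that nobody is running the room.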

The first thirty minutes

The first thirty minutes set the tone. A simple checklist helps:

  1. Acknowledge the page. Within five minutes. Even if the response is "I am here, looking now."
  2. Open an incident channel. A dedicated Slack or Teams channel for this incident only.
  3. Declare severity. Out loud, in the channel. It can be revised.
  4. Assign roles. IC, Tech Lead, Comms, Scribe.
  5. Mitigate before you understand. Roll back, fail over, scale up, disable the feature flag. Understanding can wait.
  6. Communicate externally if user impact is real. A status page update within fifteen minutes of confirmed impact is a reasonable target.
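
The two timing targets in the checklist (acknowledge within five minutes, update the status page within fifteen minutes of confirmed impact) can be tracked mechanically. The following is a rough sketch assuming you capture these four timestamps somewhere; the `Timeline` structure and `missed_targets` function are illustrative names, not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

ACK_TARGET = timedelta(minutes=5)
STATUS_PAGE_TARGET = timedelta(minutes=15)


@dataclass
class Timeline:
    paged_at: datetime
    acknowledged_at: Optional[datetime] = None
    impact_confirmed_at: Optional[datetime] = None
    status_page_updated_at: Optional[datetime] = None


def missed_targets(t: Timeline, now: datetime) -> list[str]:
    """List the checklist timing targets that have been missed as of `now`."""
    missed = []

    # If the page is still unacknowledged, measure against the current time.
    ack_time = t.acknowledged_at or now
    if ack_time - t.paged_at > ACK_TARGET:
        missed.append("acknowledge within 5 minutes of the page")

    # The status page clock starts only once user impact is confirmed.
    if t.impact_confirmed_at is not None:
        update_time = t.status_page_updated_at or now
        if update_time - t.impact_confirmed_at > STATUS_PAGE_TARGET:
            missed.append(
                "status page update within 15 minutes of confirmed impact"
            )
    return missed
```

Run during an incident, this works as a nudge for the IC. Run afterwards, it gives the post mortem an objective answer to how quickly the team reacted.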

Mitigation before understanding is the rule that scares engineers most. It feels wrong. But the goal of an incident is to stop user pain, not to write a flawless RCA. There will be time to be curious afterwards.

During the response

Keep the channel disciplined. The Scribe writes a running log: timestamps, actions taken, observations, hypotheses. This becomes the spine of your post mortem and saves hours of reconstruction later.
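
A running log is easier to reuse later if each entry has the same shape. Here is one possible structure, assuming the Scribe (or a chat bot acting for them) records entries as they happen; the `Kind` categories mirror the list above, and every name in the sketch is hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Kind(Enum):
    ACTION = "action"
    OBSERVATION = "observation"
    HYPOTHESIS = "hypothesis"


@dataclass
class Entry:
    at: datetime
    kind: Kind
    author: str
    text: str


@dataclass
class IncidentLog:
    entries: list[Entry] = field(default_factory=list)

    def record(self, kind: Kind, author: str, text: str,
               at: Optional[datetime] = None) -> Entry:
        """Append an entry, stamping it with the current UTC time by default."""
        entry = Entry(at or datetime.now(timezone.utc), kind, author, text)
        self.entries.append(entry)
        return entry

    def render(self) -> str:
        """Produce a chronological timeline, one line per entry."""
        ordered = sorted(self.entries, key=lambda e: e.at)
        return "\n".join(
            f"{e.at:%H:%M:%S} [{e.kind.value}] {e.author}: {e.text}"
            for e in ordered
        )
```

Separating hypotheses from observations pays off in the post mortem, because it shows what the team believed at each point rather than only what turned out to be true.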

Resist the urge to invite the whole company. A crowded incident channel is slower than a focused one. The IC controls who joins and what they do.

Status updates go out at a predictable cadence even if there is no progress. Silence breeds panic. "We are still investigating, next update in 30 minutes" is a complete and useful message.
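
The cadence is simple enough to automate as a reminder. As a small sketch, assuming a 30 minute interval like the example message, the helper below builds a holding update that commits to the next one, and a second function flags when an update is overdue. Both function names are made up for illustration.

```python
from datetime import datetime, timedelta


def holding_update(now: datetime, interval_minutes: int = 30,
                   summary: str = "We are still investigating") -> str:
    """Build a no-progress update that still commits to the next one."""
    next_at = now + timedelta(minutes=interval_minutes)
    return (
        f"{summary}, next update in {interval_minutes} minutes "
        f"(by {next_at:%H:%M})."
    )


def update_overdue(last_update_at: datetime, now: datetime,
                   interval_minutes: int = 30) -> bool:
    """True once more than one interval has passed since the last update."""
    return now - last_update_at > timedelta(minutes=interval_minutes)
```

Stating the clock time of the next update, not only the interval, gives readers who arrive late something concrete to hold the team to.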

After the incident

Resolution is the halfway point. The post mortem is where the learning happens, and where teams either get safer or quietly accumulate dread.

I follow a few rules:

  • Blameless. Always. The point is to fix systems, not to find the engineer who pushed the button.
  • Written within a week. Memory fades fast.
  • Focus on contributing factors, not single causes. Real incidents have layers.
  • Action items have owners and dates. Otherwise they become wishes.
  • Share widely. Other teams learn from your mistakes.
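
Two of these rules lend themselves to a quick automated check: the one-week deadline for the write-up, and owners and dates on action items. The sketch below assumes action items are tracked as simple records; `ActionItem`, `wishes`, `overdue`, and `postmortem_due` are illustrative names rather than features of any tracker.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class ActionItem:
    title: str
    owner: Optional[str] = None
    due: Optional[date] = None
    done: bool = False


def postmortem_due(resolved_on: date) -> date:
    """The write-up is due within a week of resolution."""
    return resolved_on + timedelta(days=7)


def wishes(items: list[ActionItem]) -> list[ActionItem]:
    """Items missing an owner or a date: wishes, not commitments."""
    return [i for i in items if not i.owner or i.due is None]


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open items whose due date has already passed."""
    return [
        i for i in items
        if not i.done and i.due is not None and i.due < today
    ]
```

A post mortem template could refuse to be marked complete while `wishes` returns anything, which moves the rule from good intention to enforced default.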

Google's SRE book is the canonical reference for this style of post mortem and worth reading even if you do not run at Google scale.

Building the muscle

Incident response is a skill, not a document. Run game days. Inject failure on purpose in non-production environments. Rotate the IC role so juniors learn to lead under pressure with a senior shadowing them. Review your alerts every quarter and delete the ones that never fire on real problems.
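
The quarterly alert review is easier with numbers in hand. As a rough sketch, assuming you can export per-alert counts for the quarter and that someone has marked which firings pointed at a real problem, the function below lists deletion candidates. The `AlertStats` fields and the example alert names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AlertStats:
    name: str
    fired: int       # times the alert fired during the quarter
    actionable: int  # firings that pointed at a real problem


def deletion_candidates(stats: list[AlertStats]) -> list[str]:
    """Alerts that never fired on a real problem, noisiest first."""
    never_useful = [s for s in stats if s.actionable == 0]
    never_useful.sort(key=lambda s: s.fired, reverse=True)
    return [s.name for s in never_useful]


quarter = [
    AlertStats("disk-full", fired=4, actionable=3),
    AlertStats("cpu-spike", fired=120, actionable=0),
    AlertStats("legacy-ping", fired=0, actionable=0),
]
print(deletion_candidates(quarter))  # ['cpu-spike', 'legacy-ping']
```

The output is a list to discuss, not to delete blindly. An alert that never fired may still guard against a rare failure, so a human makes the final call.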

If you want help shaping this practice for your team, I cover this kind of work under reliability engineering and platform engineering. You can also see how it plays into broader engineering leadership engagements, or just start a conversation.
