Runbook Hygiene: Why Yours Are Lying to You

#sre #devops #runbooks #operations

Your runbooks are out of date. I don't know your team, but I'd bet money on it. Most teams write runbooks once, in a panic after an outage, and then never touch them again until the next outage proves them wrong.

How runbooks rot

Steps reference deprecated tools. The Grafana dashboard moved, the CLI command was renamed, the bastion host got retired. Nobody updated the runbook because nobody re-ran it during the calm months.

The team that owned the system turned over. Say three of the five engineers who wrote the runbook are gone. The remaining two haven't actually run it in 18 months because the outage type it covers stopped happening.

Half-truths from the start. The original author skipped the steps that were obvious to them, so the new on-call engineer can't reproduce the recovery.

The result: at 2 AM during the actual incident, the runbook sends you down a dead end. Now you're improvising under stress.
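The first kind of rot, steps that point at retired tools, is the easiest to catch mechanically. Here's a minimal sketch, assuming your team keeps a list of retired hostnames and commands. The `RETIRED` entries and the function name are hypothetical placeholders, not anything from a real tool.

```python
import re

# Hypothetical list of retired names mapped to a note about what replaced them.
RETIRED = {
    "bastion-old.example.com": "retired bastion host",
    "oldctl": "renamed to newctl",
}


def find_retired_references(runbook_text: str, retired: dict[str, str]) -> list[tuple[int, str]]:
    """Return (line number, retired name) for every retired name a runbook still mentions."""
    hits = []
    for lineno, line in enumerate(runbook_text.splitlines(), start=1):
        for name in retired:
            # Word boundaries keep "oldctl" from matching inside a longer identifier.
            if re.search(rf"\b{re.escape(name)}\b", line):
                hits.append((lineno, name))
    return hits
```

This only catches names someone remembered to add to the list, so it complements a human dry run rather than replacing one.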

What working runbooks have in common

I've seen exactly two patterns work:

1. Runbooks live with the code. Put them in the repo of the system they document. When the code changes, the runbook update goes in the same PR and gets the same review. Out-of-repo wikis die first because nothing in the day-to-day workflow reminds anyone they exist.

2. Runbooks are reviewed by people who weren't there. Have a junior engineer run through the runbook quarterly on a non-incident day. Every place they get stuck is a real bug in the document. Fix it then. The author will be too close to see the gaps.
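Both patterns can be nudged along by a small CI check. The sketch below assumes runbooks live under `docs/runbooks/` and that you record the date of each dry run somewhere you can load into a dict. The directory, the 90-day interval, and both function names are my assumptions for illustration.

```python
from datetime import date, timedelta

RUNBOOK_DIR = "docs/runbooks/"        # assumed repo layout
REVIEW_INTERVAL = timedelta(days=90)  # assumed stand-in for "quarterly"


def pr_skips_runbooks(changed_files: list[str]) -> bool:
    """True when a PR changes files outside RUNBOOK_DIR but nothing inside it."""
    touches_code = any(not f.startswith(RUNBOOK_DIR) for f in changed_files)
    touches_runbook = any(f.startswith(RUNBOOK_DIR) for f in changed_files)
    return touches_code and not touches_runbook


def overdue_for_dry_run(last_dry_run: dict[str, date], today: date) -> list[str]:
    """Return runbook paths whose last dry run is older than REVIEW_INTERVAL."""
    return sorted(
        path for path, ran_on in last_dry_run.items()
        if today - ran_on > REVIEW_INTERVAL
    )
```

I'd wire the first check up as a warning, not a hard failure, since plenty of code changes don't affect any runbook. The second one is a good source for a quarterly to-do list.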

The three sections that matter

A useful runbook has exactly three sections:

Symptom: how do I know this is the right runbook? Concrete signals, with example screenshots if the signal is visual.
First 5 minutes: what to do RIGHT NOW to stop the bleeding. Not the root cause investigation, just the triage actions.
Investigation: where to look, what queries to run, when and to whom to escalate.

That's it. Anything else (architecture diagrams, history, philosophy) goes in a separate doc that the runbook links to.
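If you want to enforce the three-section shape, a tiny lint is enough. This is a rough sketch that assumes each section starts on its own line with its name, either as plain text like `Symptom: ...` or as a markdown heading. The function name is hypothetical.

```python
REQUIRED_SECTIONS = ("Symptom", "First 5 minutes", "Investigation")


def missing_sections(runbook_text: str) -> list[str]:
    """Return the required section names that no line in the runbook starts with."""
    # Normalize each line: drop surrounding whitespace and any leading '#' heading markers.
    lines = [
        line.strip().lstrip("#").strip().lower()
        for line in runbook_text.splitlines()
    ]
    return [
        name for name in REQUIRED_SECTIONS
        if not any(line.startswith(name.lower()) for line in lines)
    ]
```

It checks presence only. It won't tell you whether the "First 5 minutes" steps actually work, which is what the quarterly dry run is for.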

The cultural part

The hardest part isn't the format. It's getting engineers to write runbooks for systems they think "everyone already knows." Build it into your incident review: every post-mortem either updates an existing runbook or creates a new one. No runbook update, the post-mortem isn't done.

Good runbooks are a sign your team takes incidents seriously. Bad runbooks are a sign your team is winging it.