On-Call, Incident Culture, and Blameless Post-Mortems: The Operating Manual
On-call rotations, incident command, and blameless post-mortems are the load-bearing rituals of a reliability culture.
- On-call is a workload, not a volunteer activity. Design it like one or it eats your senior engineers.
- Incidents need a single Incident Commander whose only job is coordination — not debugging.
- Blameless post-mortems assume engineers acted reasonably given what they knew. Sidney Dekker calls this the New View of human error.
- The artifact that matters is the timeline + contributing factors + action items with owners — not the apology.
- Track MTTR, incident rate per service, and pager load per engineer. Burnout shows up in the pager data months before it shows up in HR.
Reliability is not a tooling problem. It is a culture problem dressed up in dashboards. Teams with the same observability stack and the same SLO templates produce wildly different outcomes because of three rituals: how they run on-call, how they command incidents, and how they learn from them. This is the operator's manual for all three, grounded in two decades of Site Reliability Engineering practice and the human-factors research of Sidney Dekker, James Reason, and Erik Hollnagel.
Why this is a leadership problem, not a tooling one
Google's SRE book popularised error budgets and runbooks, but the harder lesson buried in the same book is that reliability is bounded by the cognitive and emotional health of the on-call engineer at 3am. Pager fatigue, blame culture, and ambiguous incident roles cause more outages than any single deploy. Leadership owns the rotation design, the post-mortem norms, and the staffing model. Tooling vendors do not.
Research by PagerDuty (State of Digital Operations) consistently finds that engineers paged outside business hours more than twice a week are 3–4x more likely to leave the company within 12 months. The attrition cost almost always exceeds the cost of fixing the rotation.
Designing an on-call rotation people can sustain
1. Shift length: Weekly is standard but punishing for parents and for Europeans on US-hour services. Consider a split rotation (weekday primary + weekend primary) or follow-the-sun for global teams. Avoid 24/7 single-engineer rotations with fewer than 6 people: on a weekly rotation, that arithmetic puts each engineer on call more often than one week in six.
2. Primary and secondary: Always staff a secondary. Primary's job is to acknowledge within 5 minutes. Secondary's job is to take over if primary doesn't respond or needs sleep. Without a secondary, primary cannot rest during the shift.
3. Compensation model: Three honest options: paid on-call (per-shift stipend + per-page), comp time (hour-for-hour off after a paged night), or built into base compensation (the salary band is raised to reflect on-call). Pick one and document it. Unpaid, unacknowledged on-call is the most common labour-relations failure in engineering.
4. Page budget: Define a per-shift threshold (e.g., more than 2 actionable pages outside business hours = the rotation has failed). Treat exceeding the budget as a reliability incident in itself.
5. Onboarding: No one goes primary alone before shadowing two rotations and being shadowed for one. Document this. The shadow rotation is non-negotiable.
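The page budget can be checked mechanically at the end of every shift. The following is a minimal sketch, not a vendor integration: the `Page` record, its field names, and the 09:00 to 18:00 weekday definition of business hours are all assumptions made for illustration, and the default budget of 2 simply mirrors the example threshold above.

```python
from dataclasses import dataclass
from datetime import datetime

# Assumed business hours (local time); adjust to your team's definition.
BUSINESS_START_HOUR = 9
BUSINESS_END_HOUR = 18


@dataclass(frozen=True)
class Page:
    """Hypothetical page record; not any paging vendor's schema."""
    fired_at: datetime  # local time of the on-call engineer
    actionable: bool    # False for noise that needed no human action


def is_off_hours(ts: datetime) -> bool:
    """True on weekends or outside the assumed weekday business hours."""
    if ts.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    return not (BUSINESS_START_HOUR <= ts.hour < BUSINESS_END_HOUR)


def budget_exceeded(pages: list[Page], budget: int = 2) -> bool:
    """True when actionable off-hours pages in one shift exceed the budget."""
    count = sum(1 for p in pages if p.actionable and is_off_hours(p.fired_at))
    return count > budget
```

Run it over each completed shift's pages; a `True` result is the trigger for treating the rotation itself as the incident.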
| Team size | Recommended pattern | Risks |
|---|---|---|
| 3–4 engineers | Don't run 24/7. Use a vendor MSP for off-hours or business-hours-only support. | Burnout is mathematical. One person off sick breaks the rotation. |
| 5–7 engineers | Weekly primary + weekly secondary; rotate weekends. | Tight, but workable if page budget is enforced. |
| 8–12 engineers | Weekly primary + secondary; separate weekday/weekend rotations. | Sustainable. Begin tracking pager load per engineer. |
| More than 12 engineers | Follow-the-sun across two or three regions, 8–12 hour shifts. | Coordination overhead grows. Needs strong handoff ritual. |
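The "burnout is mathematical" claim in the table is easy to make concrete. This sketch assumes an evenly shared weekly rotation with two pager-carrying roles (primary and secondary) and ignores holidays, sick leave, and shift swaps; the function name is illustrative.

```python
def weeks_on_call_per_year(team_size: int, roles_per_week: int = 2) -> float:
    """Weeks per year each engineer holds a pager, assuming an even weekly rotation."""
    if team_size < roles_per_week:
        raise ValueError("team is too small to staff every role")
    return 52 * roles_per_week / team_size


for size in (4, 6, 8, 12):
    print(f"{size} engineers: {weeks_on_call_per_year(size):.1f} weeks/year each")
# 4 engineers: 26.0 weeks/year each
# 6 engineers: 17.3 weeks/year each
# 8 engineers: 13.0 weeks/year each
# 12 engineers: 8.7 weeks/year each
```

At four engineers, everyone carries a pager for half the year before a single absence is accounted for, which is why the smallest row recommends not running 24/7 at all.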
The incident command model
Borrowed from FEMA's Incident Command System and adapted by Google SRE, the model assigns three roles during any Sev-2 or higher incident. The point is not titles — it is that one human owns coordination, another owns the technical work, and a third owns external communication. Without role separation, senior engineers do all three badly under pressure.
1. Incident Commander (IC): Owns the timeline, decides when to escalate, decides when the incident is over. Does not debug. Speaks last in any disagreement. Often the most senior person, but seniority isn't required; clarity is.
2. Operations Lead (Ops): Owns the technical investigation and mitigation. Talks to engineers, runs queries, ships rollbacks. Reports state changes to IC, not to the channel.
3. Communications Lead (Comms): Owns the status page, customer emails, and internal updates. Writes in plain language. Updates on a fixed cadence (every 15 or 30 minutes) even when there's nothing new.
If you cannot answer 'Who is IC right now?' in 5 seconds, you do not have an incident response — you have a group chat. Name the IC explicitly in the first message: 'I am IC.'
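Role separation can be enforced by whatever tool tracks the incident. The sketch below is a hypothetical in-memory roster, not a real chat-bot or paging API: it refuses to give one person two roles, supports handing a role over, and answers "Who is IC right now?" or fails loudly when nobody has claimed it.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    IC = "Incident Commander"
    OPS = "Operations Lead"
    COMMS = "Communications Lead"


@dataclass
class IncidentRoster:
    """Hypothetical roster; a real one would live in your chat or paging tool."""
    assignments: dict[Role, str] = field(default_factory=dict)

    def assign(self, role: Role, person: str) -> None:
        """Assign or hand over a role. One person may not hold two roles."""
        for held_role, holder in self.assignments.items():
            if holder == person and held_role is not role:
                raise ValueError(f"{person} already holds {held_role.value}")
        self.assignments[role] = person

    def current_ic(self) -> str:
        """Answer 'Who is IC right now?' or fail loudly if nobody is."""
        try:
            return self.assignments[Role.IC]
        except KeyError:
            raise LookupError("no Incident Commander named") from None

    def unfilled(self) -> list[Role]:
        """Roles nobody has claimed yet, in declaration order."""
        return [r for r in Role if r not in self.assignments]
```

A handoff is just a second `assign(Role.IC, ...)` call, which keeps the answer to the five-second question unambiguous at every point in the incident.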
The New View of human error
Sidney Dekker's The Field Guide to Understanding 'Human Error' draws a hard line between two worldviews. The Old View treats human error as the cause: the engineer pushed the wrong button, fire the engineer, problem solved. The New View treats human error as a symptom of deeper systemic conditions: the engineer pushed the wrong button because the staging environment looked identical to production, the rollback button was three clicks deep, and the runbook hadn't been updated since 2021. The button-pusher is the last person you should be examining.
“Human error is not a cause of failure. Human error is the effect, or symptom, of deeper trouble inside the system.”
Old View:
- Asks 'who did this?'
- Looks for the broken human
- Ends with retraining or PIP
- Generates fear, hidden errors, slow disclosure
- Repeats the same incident in 6 months

New View:
- Asks 'how did this make sense to them at the time?'
- Looks for the conditions that made the error reasonable
- Ends with system changes
- Generates honesty, fast disclosure, deep analysis
- Eliminates a class of incident, not just this one
James Reason's Swiss Cheese Model complements this: incidents happen when holes in multiple defensive layers (monitoring, review, automation, runbooks, training) momentarily align. The engineer who took the final action is one slice. Asking why all the other slices also had holes is the real work.
Running the post-mortem meeting
- Schedule within 5 business days of resolution. Memory decays fast.
- Invite everyone involved in the incident, the service owner, and one stakeholder from a different team (fresh eyes catch system patterns).
- Do NOT invite the CEO or VP unless they were directly in the incident. Their presence collapses honesty.
- Start by reading the timeline aloud. No commentary yet.
- Identify contributing factors — plural. If you have one 'root cause', you stopped looking too early.
- For each contributing factor, ask: 'What signal would have caught this earlier? What made the bad path easier than the safe path?'
- Generate action items with named owners and dates. Vague owners = no follow-through.
- Close the meeting by explicitly thanking the on-call engineer. This is ritual. Skip it and the culture rots quietly.
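The "within 5 business days" rule from the first bullet is worth computing rather than eyeballing, since weekends move the deadline. A minimal sketch, assuming a Monday to Friday working week and ignoring public holidays; the function name is illustrative.

```python
from datetime import date, timedelta


def postmortem_deadline(resolved_on: date, business_days: int = 5) -> date:
    """Latest date for the meeting, counting Monday to Friday only.

    Public holidays are ignored in this sketch.
    """
    current = resolved_on
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


# An incident resolved on Friday 2024-01-05 must be reviewed by Friday 2024-01-12.
print(postmortem_deadline(date(2024, 1, 5)))  # 2024-01-12
```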
The post-mortem document that earns its keep
1. Summary: Two paragraphs. What happened, who was affected, how long, and what the impact was. A non-technical executive should understand it.
2. Timeline: Wall-clock entries from first signal to resolution, including the false starts. Times matter; speculation does not.
3. Impact: Quantified: users affected, revenue lost, SLO budget consumed, customer escalations received. Vague impact = nobody funds the fix.
4. Contributing factors: Multiple. Use the 'and' rule: write factors connected by 'and', not 'but'. 'The deploy was untested AND the staging environment didn't match production AND the rollback was manual.'
5. What went well: Always include this. Detection was fast? IC handoff was clean? Comms updated on cadence? Naming what worked reinforces it.
6. Action items: Each with owner, due date, and priority. Tracked in the normal backlog with a tag. If they are not tracked, they will not ship.
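The six sections above map naturally onto a structured record that can be linted before the document is published. This is a sketch under assumptions: the class and field names are invented for illustration, and the checks encode only the rules stated in the list (more than one contributing factor, a non-empty "what went well", and an owner, due date, and priority on every action item).

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    title: str
    owner: Optional[str] = None     # a named person, not a team alias
    due: Optional[date] = None
    priority: Optional[str] = None


@dataclass
class PostMortem:
    summary: str
    timeline: list[str]
    impact: str
    contributing_factors: list[str]
    went_well: list[str]
    action_items: list[ActionItem] = field(default_factory=list)


def review_problems(pm: PostMortem) -> list[str]:
    """Return human-readable problems; an empty list means ready to publish."""
    problems = []
    if len(pm.contributing_factors) < 2:
        problems.append("list more than one contributing factor")
    if not pm.went_well:
        problems.append("'What went well' is empty")
    if not pm.action_items:
        problems.append("no action items")
    for item in pm.action_items:
        missing = [name for name in ("owner", "due", "priority")
                   if getattr(item, name) is None]
        if missing:
            problems.append(
                f"action item '{item.title}' is missing: {', '.join(missing)}"
            )
    return problems
```

A check like this catches the mechanical failures (ownerless action items, a single "root cause") so the meeting itself can spend its time on analysis.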
Public-facing post-mortems often start with an apology. That's fine externally. Internal post-mortems that lead with apology drift into emotional labour and skip the systemic analysis. Lead with the timeline.
Metrics that detect burnout early
| Metric | How to measure | Healthy | Warning |
|---|---|---|---|
| Pages per engineer per shift | Sum of pages divided by primary shifts | <2 actionable pages/shift | >5/shift or any out-of-hours spike |
| Off-hours page rate | % of pages between 22:00 and 07:00 local | <20% | >40% |
| MTTA (acknowledge) | Median time from page to ack | <5 minutes | >15 minutes (means people are sleeping through pages) |
| Rotation evenness | Std deviation of pages across engineers in a quarter | Low — load is even | One or two engineers absorbing most pages = future attrition |
| Post-mortem completion | % of Sev-2+ with PM published within 5 business days | >90% | <70% means learning is leaking |
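Most of the table can be computed from a plain export of page events. The sketch below assumes a hypothetical `PageEvent` row with local timestamps; the field names are not any paging vendor's schema. Rotation evenness is computed as a population standard deviation over the whole roster, so engineers who received zero pages still count.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from statistics import median, pstdev


@dataclass(frozen=True)
class PageEvent:
    """Hypothetical export row; field names are assumptions."""
    engineer: str
    fired_at: datetime  # local time
    acked_at: datetime


def pages_per_shift(pages: list[PageEvent], primary_shifts: int) -> float:
    """Sum of pages divided by the number of primary shifts covered."""
    return len(pages) / primary_shifts


def off_hours_rate(pages: list[PageEvent]) -> float:
    """Share of pages fired between 22:00 and 07:00 local time."""
    if not pages:
        return 0.0
    night = sum(1 for p in pages if p.fired_at.hour >= 22 or p.fired_at.hour < 7)
    return night / len(pages)


def median_mtta_minutes(pages: list[PageEvent]) -> float:
    """Median minutes from page to acknowledgement; needs at least one page."""
    return median((p.acked_at - p.fired_at).total_seconds() / 60 for p in pages)


def rotation_evenness(pages: list[PageEvent], roster: list[str]) -> float:
    """Standard deviation of page counts across the roster (0.0 = perfectly even)."""
    counts = Counter(p.engineer for p in pages)
    return pstdev(counts[name] for name in roster)
```

Comparing the outputs against the Healthy and Warning columns each quarter is enough to spot one or two engineers absorbing most of the load.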
Anti-patterns that quietly destroy reliability culture
- Naming the engineer who triggered the incident in the public post-mortem. Even with the kindest intent, it teaches everyone to hide errors.
- Closing post-mortems without action items, or with action items that have no owner.
- Treating Sev-3s as 'not worth a post-mortem.' The cheap incidents are where you learn cheaply.
- Running on-call as a 'volunteer' rotation — i.e., the same 3 engineers carry the whole org.
- Compensating on-call with 'we appreciate you' rather than money or time.
- Letting executives attend post-mortems as observers. The room stops being honest.
- Optimising MTTR by skipping the post-mortem step. You're trading short-term metric for long-term debt.