On-Call, Incident Culture, and Blameless Post-Mortems: The Operating Manual

60-Second Summary

On-call is a workload, not a volunteer activity. Design it like one or it eats your senior engineers.
Incidents need a single Incident Commander whose only job is coordination — not debugging.
Blameless post-mortems assume engineers acted reasonably given what they knew. Sidney Dekker calls this the New View of human error.
The artifact that matters is the timeline + contributing factors + action items with owners — not the apology.
Track MTTR, incident rate per service, and pager load per engineer. Burnout shows up in the pager data months before it shows up in HR.

Reliability is not a tooling problem. It is a culture problem dressed up in dashboards. Teams with the same observability stack and the same SLO templates produce wildly different outcomes because of three rituals: how they run on-call, how they command incidents, and how they learn from them. This is the operator's manual for all three, grounded in two decades of Site Reliability Engineering practice and the human-factors research of Sidney Dekker, James Reason, and Erik Hollnagel.

Why this is a leadership problem, not a tooling one

Google's SRE book popularised error budgets and runbooks, but the harder lesson buried in the same book is that reliability is bounded by the cognitive and emotional health of the on-call engineer at 3am. Pager fatigue, blame culture, and ambiguous incident roles cause more outages than any single deploy. Leadership owns the rotation design, the post-mortem norms, and the staffing model. Tooling vendors do not.

The hidden cost of bad on-call

Research by PagerDuty (State of Digital Operations) consistently finds that engineers paged outside business hours more than twice a week are 3–4x more likely to leave the company within 12 months. The attrition cost almost always exceeds the cost of fixing the rotation.

Designing an on-call rotation people can sustain

The five design choices every rotation makes (whether you decide them or not)

1. Shift length
Weekly is standard but punishing for parents and for Europeans covering services that run on US hours. Consider a split rotation (weekday primary + weekend primary) or follow-the-sun for global teams. Avoid 24/7 single-primary rotations on teams smaller than 6 people: with five engineers, each one is primary every fifth week, before counting secondary duty or sick cover.
2. Primary and secondary
Always staff a secondary. Primary's job is to acknowledge within 5 minutes. Secondary's job is to take over if primary doesn't respond or needs sleep. Without a secondary, primary cannot rest during the shift.
3. Compensation model
Three honest options: paid on-call (per-shift stipend plus per-page pay), comp time (hour-for-hour time off after a paged night), or built into base compensation (salary band raised to reflect on-call duty). Pick one and document it. Unpaid, unacknowledged on-call is the most common labour-relations failure in engineering.
4. Page budget
Define a per-shift threshold (e.g., more than 2 actionable pages outside business hours = the rotation has failed). Treat exceeding the budget as a reliability incident in itself.
5. Onboarding
No one goes primary alone before shadowing two rotations and being shadowed for one. Document this. The shadow rotation is non-negotiable.
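
The page budget from item 4 can be checked mechanically at the end of each shift. The sketch below is one possible way to do it, under stated assumptions: the `Page` record and its field names are hypothetical, business hours are taken to be 09:00 to 18:00 on weekdays, and the budget of 2 is the example threshold from the text, not a recommendation from any paging tool.

```python
from dataclasses import dataclass
from datetime import datetime

# Assumed values for illustration; tune them to your own rotation policy.
BUSINESS_START_HOUR = 9
BUSINESS_END_HOUR = 18
PAGE_BUDGET_PER_SHIFT = 2


@dataclass
class Page:
    """Hypothetical page record; field names do not come from a real tool."""
    fired_at: datetime  # responder's local time
    actionable: bool    # False for noise that needed no human action


def is_out_of_hours(ts: datetime) -> bool:
    """True on weekends or outside the assumed business-hours window."""
    if ts.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return True
    return not (BUSINESS_START_HOUR <= ts.hour < BUSINESS_END_HOUR)


def budget_exceeded(pages: list[Page], budget: int = PAGE_BUDGET_PER_SHIFT) -> bool:
    """True when actionable out-of-hours pages exceed the shift budget."""
    count = sum(1 for p in pages if p.actionable and is_out_of_hours(p.fired_at))
    return count > budget
```

If `budget_exceeded` returns `True` for a shift, the text's rule applies: treat it as a reliability incident in its own right and review the alerts that fired.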

Rotation patterns by team size

Team size	Recommended pattern	Risks
3–4 engineers	Don't run 24/7. Use a managed service provider (MSP) for off-hours coverage, or offer business-hours-only support.	Burnout is arithmetic: each engineer is primary every third or fourth week, and one person off sick breaks the rotation.
5–7 engineers	Weekly primary + weekly secondary; rotate weekends.	Tight, but workable if page budget is enforced.
8–12 engineers	Weekly primary + secondary; separate weekday/weekend rotations.	Sustainable. Begin tracking pager load per engineer.
More than 12 engineers	Follow-the-sun across two or three regions, 8–12 hour shifts.	Coordination overhead grows. Needs strong handoff ritual.
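
The arithmetic behind the table is simple enough to sketch. The snippet below assumes an even weekly round-robin in which primary and secondary are drawn from the same pool; the function names are illustrative, and real rotations (holidays, swaps, part-time staff) will be less tidy.

```python
def on_call_fraction(team_size: int, roles_per_week: int = 2) -> float:
    """Fraction of weeks each engineer holds a pager (primary or secondary),
    assuming an even weekly rotation over the whole team."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    return min(1.0, roles_per_week / team_size)


def weeks_between_primary_shifts(team_size: int) -> int:
    """Weeks off primary between two primary shifts in a simple round-robin."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    return team_size - 1


for n in (4, 6, 8, 12):
    print(
        f"{n} engineers: on a pager {on_call_fraction(n):.0%} of weeks, "
        f"{weeks_between_primary_shifts(n)} weeks between primary shifts"
    )
```

Under these assumptions, a four-person team running primary plus secondary has every engineer holding a pager half of all weeks, which is why the smallest row recommends not running 24/7 at all.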

The incident command model

Borrowed from FEMA's Incident Command System and adapted by Google SRE, the model assigns three roles during any Sev-2 or higher incident. The point is not titles — it is that one human owns coordination, another owns the technical work, and a third owns external communication. Without role separation, senior engineers do all three badly under pressure.

The three incident roles

1. Incident Commander (IC)
Owns the timeline, decides when to escalate, decides when the incident is over. Does not debug. Speaks last in any disagreement. Often the most senior person, but seniority isn't required — clarity is.
2. Operations Lead (Ops)
Owns the technical investigation and mitigation. Talks to engineers, runs queries, ships rollbacks. Reports state changes to IC, not to the channel.
3. Communications Lead (Comms)
Owns the status page, customer emails, and internal updates. Writes in plain language. Updates on a fixed cadence (every 15 or 30 minutes) even when there's nothing new.
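
The fixed update cadence for Comms is easy to make explicit. The helper below is a small sketch, not part of any status-page product: both function names are hypothetical, and the cadence is passed in so either the 15- or 30-minute option from the text can be used.

```python
from datetime import datetime, timedelta


def update_schedule(started_at: datetime, cadence_minutes: int, count: int) -> list[datetime]:
    """Return the next `count` update deadlines at a fixed cadence."""
    if cadence_minutes <= 0:
        raise ValueError("cadence_minutes must be positive")
    step = timedelta(minutes=cadence_minutes)
    return [started_at + step * i for i in range(1, count + 1)]


def update_overdue(last_update_at: datetime, now: datetime, cadence_minutes: int) -> bool:
    """True when more than one cadence interval has passed since the last update."""
    return now - last_update_at > timedelta(minutes=cadence_minutes)
```

The deadlines are computed from the incident start rather than from the last update sent, so a late update does not push every later one back.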

The one-line incident rule

If you cannot answer 'Who is IC right now?' in 5 seconds, you do not have an incident response — you have a group chat. Name the IC explicitly in the first message: 'I am IC.'

The New View of human error

Sidney Dekker's Field Guide to Understanding 'Human Error' draws a hard line between two worldviews. The Old View treats human error as the cause: the engineer pushed the wrong button, fire the engineer, problem solved. The New View treats human error as a symptom of deeper systemic conditions: the engineer pushed the wrong button because the staging environment looked identical to production, the rollback button was three clicks deep, and the runbook hadn't been updated since 2021. The button-pusher is the last person you should be examining.

“Human error is not a cause of failure. Human error is the effect, or symptom, of deeper trouble inside the system.”
— Sidney Dekker, The Field Guide to Understanding 'Human Error'

Old View vs. New View of incident analysis

Old View (blame culture)

Asks 'who did this?'
Looks for the broken human
Ends with retraining or PIP
Generates fear, hidden errors, slow disclosure
Repeats the same incident in 6 months

New View (learning culture)

Asks 'how did this make sense to them at the time?'
Looks for the conditions that made the error reasonable
Ends with system changes
Generates honesty, fast disclosure, deep analysis
Eliminates a class of incident, not just this one

James Reason's Swiss Cheese Model complements this: incidents happen when holes in multiple defensive layers (monitoring, review, automation, runbooks, training) momentarily align. The engineer who took the final action is one slice. Asking why all the other slices also had holes is the real work.
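
A rough way to see why the other slices matter is to multiply the chance that each layer misses the same fault. The sketch below is a deliberate simplification: it assumes the layers fail independently, which real systems rarely do, and every probability in it is made up for illustration.

```python
from math import prod


def probability_all_layers_fail(hole_probabilities: dict[str, float]) -> float:
    """Probability that every defensive layer misses the same fault,
    assuming (unrealistically) that the layers fail independently."""
    for layer, p in hole_probabilities.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability for {layer!r} must be between 0 and 1")
    return prod(hole_probabilities.values())


# Made-up numbers, for illustration only.
layers = {
    "monitoring": 0.1,
    "review": 0.2,
    "automation": 0.1,
    "runbooks": 0.3,
    "training": 0.5,
}

baseline = probability_all_layers_fail(layers)

# Tightening a single upstream layer (here, runbooks) by a factor of ten
# cuts the combined probability by the same factor.
improved = probability_all_layers_fail({**layers, "runbooks": 0.03})
```

In this toy model, fixing any one layer reduces the overall risk in proportion, regardless of who took the final action. That is the quantitative version of asking why the other slices had holes.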

Running the post-mortem meeting

Schedule within 5 business days of resolution. Memory decays fast.
Invite everyone involved in the incident, the service owner, and one stakeholder from a different team (fresh eyes catch system patterns).
Do NOT invite the CEO or VP unless they were directly in the incident. Their presence collapses honesty.
Start by reading the timeline aloud. No commentary yet.
Identify contributing factors — plural. If you have one 'root cause', you stopped looking too early.
For each contributing factor, ask: 'What signal would have caught this earlier? What made the bad path easier than the safe path?'
Generate action items with named owners and dates. Vague owners = no follow-through.
Close the meeting by explicitly thanking the on-call engineer. This is ritual. Skip it and the culture rots quietly.
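
The 5-business-day window from the first step can be computed rather than counted on fingers. This is a minimal sketch that treats Monday to Friday as business days and ignores public holidays; the function names are illustrative.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Advance `days` weekdays from `start`, skipping Saturdays and Sundays.
    Public holidays are ignored in this sketch."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current


def postmortem_deadline(resolved_on: date, window: int = 5) -> date:
    """Latest date to hold the post-mortem meeting after resolution."""
    return add_business_days(resolved_on, window)
```

For an incident resolved on a Friday, the deadline falls on the following Friday, because the weekend in between does not count toward the window.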

The post-mortem document that earns its keep

The six sections every post-mortem needs

1. Summary
Two paragraphs. What happened, who was affected, how long, and what the impact was. A non-technical executive should understand it.
2. Timeline
Wall-clock entries from first signal to resolution, including the false starts. Times matter; speculation does not.
3. Impact
Quantified: users affected, revenue lost, SLO budget consumed, customer escalations received. Vague impact = nobody funds the fix.
4. Contributing factors
Multiple. Use the 'and' rule: write factors connected by 'and', not 'but'. 'The deploy was untested AND the staging environment didn't match production AND the rollback was manual.'
5. What went well
Always include this. Detection was fast? IC handoff was clean? Comms updated on cadence? Naming what worked reinforces it.
6. Action items
Each with owner, due date, and priority. Tracked in the normal backlog with a tag. If they are not tracked, they will not ship.
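
The six sections above lend themselves to a simple completeness check before a post-mortem is published. The sketch below is one way to encode them; the `PostMortem` and `ActionItem` classes, their field names, and the `lint` function are all hypothetical and not tied to any document tool.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

# Free-text sections; contributing factors and action items are modelled as lists.
PROSE_SECTIONS = ("summary", "timeline", "impact", "what_went_well")


@dataclass
class ActionItem:
    title: str
    owner: Optional[str] = None
    due: Optional[date] = None
    priority: Optional[str] = None


@dataclass
class PostMortem:
    sections: dict[str, str] = field(default_factory=dict)
    contributing_factors: list[str] = field(default_factory=list)
    action_items: list[ActionItem] = field(default_factory=list)


def lint(pm: PostMortem) -> list[str]:
    """Return a list of human-readable problems; an empty list means complete."""
    problems = []
    for name in PROSE_SECTIONS:
        if not pm.sections.get(name, "").strip():
            problems.append(f"missing section: {name}")
    if len(pm.contributing_factors) < 2:
        problems.append("list more than one contributing factor")
    if not pm.action_items:
        problems.append("no action items")
    for item in pm.action_items:
        missing = [f for f in ("owner", "due", "priority") if not getattr(item, f)]
        if missing:
            problems.append(f"action item {item.title!r} is missing: {', '.join(missing)}")
    return problems
```

A check like this only enforces structure. Whether the contributing factors are the right ones still depends on the discussion in the meeting.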

The 'apology paragraph' trap

Public-facing post-mortems often start with an apology. That's fine externally. Internal post-mortems that lead with apology drift into emotional labour and skip the systemic analysis. Lead with the timeline.

Metrics that detect burnout early

On-call health metrics every engineering leader should review monthly

Metric	How to measure	Healthy	Warning
Pages per engineer per shift	Actionable pages divided by number of primary shifts	<2 actionable pages/shift	>5/shift or any out-of-hours spike
Off-hours page rate	% of pages between 22:00 and 07:00 local	<20%	>40%
Time to acknowledge (MTTA)	Median time from page to acknowledgement	<5 minutes	>15 minutes (suggests people are sleeping through pages)
Rotation evenness	Std deviation of pages across engineers in a quarter	Low — load is even	One or two engineers absorbing most pages = future attrition
Post-mortem completion	% of Sev-2+ with PM published within 5 business days	>90%	<70% means learning is leaking
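
Most of the metrics in this table can be derived from an export of page events. The sketch below shows one possible calculation, assuming a hypothetical `PageEvent` record with responder-local timestamps and a list already filtered to actionable pages; it does not use any paging vendor's API.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from statistics import median, pstdev


@dataclass
class PageEvent:
    """Hypothetical export row; field names are assumptions."""
    engineer: str
    fired_at: datetime  # responder's local time
    acked_at: datetime


def pages_per_shift(pages: list[PageEvent], primary_shifts: int) -> float:
    """Pages divided by the number of primary shifts in the period."""
    return len(pages) / primary_shifts


def off_hours_rate(pages: list[PageEvent]) -> float:
    """Share of pages fired between 22:00 and 07:00 local time."""
    if not pages:
        return 0.0
    off = sum(1 for p in pages if p.fired_at.hour >= 22 or p.fired_at.hour < 7)
    return off / len(pages)


def median_ack_minutes(pages: list[PageEvent]) -> float:
    """Median minutes from page to acknowledgement; needs at least one page."""
    return median([(p.acked_at - p.fired_at).total_seconds() / 60 for p in pages])


def load_spread(pages: list[PageEvent], roster: list[str]) -> float:
    """Standard deviation of page counts across the roster, including
    engineers who received no pages at all."""
    counts = Counter(p.engineer for p in pages)
    return pstdev([counts.get(name, 0) for name in roster])
```

Passing the full roster to `load_spread` matters: engineers who received zero pages are part of the unevenness the metric is meant to expose.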

Anti-patterns that quietly destroy reliability culture

Naming the engineer who triggered the incident in the public post-mortem. Even with the kindest intent, it teaches everyone to hide errors.
Closing post-mortems without action items, or with action items that have no owner.
Treating Sev-3s as 'not worth a post-mortem.' The cheap incidents are where you learn cheaply.
Running on-call as a 'volunteer' rotation — i.e., the same 3 engineers carry the whole org.
Compensating on-call with 'we appreciate you' rather than money or time.
Letting executives attend post-mortems as observers. The room stops being honest.
Optimising MTTR by skipping the post-mortem step. You're trading a short-term metric gain for long-term debt.

Where to read further

References

Google, Site Reliability Engineering (the SRE Book), 'Postmortem Culture: Learning from Failure'
Sidney Dekker, The Field Guide to Understanding 'Human Error'
PagerDuty, Incident Response Documentation
James Reason, Human Error (Cambridge University Press, 1990)

Written by Pawan Joshi. Sources cited inline.

First published 15 May 2026