Calibration Sessions Run Well: The Hidden Operating Layer of Performance Management
Calibration is the meeting that decides whose ratings are real. Done well, it removes manager bias and produces defensible decisions on pay, promotion, and…
- Calibration exists to counter the strongest force in performance reviews — leniency bias.
- The facilitator's job is to surface evidence, not to enforce a distribution.
- Pre-work matters more than the meeting: managers arrive with evidence, not opinions.
- Avoid forced ranking. Use the distribution as a discussion tool, not a quota.
- Document the rationale for changes, not just the final ratings — auditability matters at exit.
Calibration is the least-visible and most-determinative ritual in modern performance management. It is where ratings become real, where promotion decisions firm up, and where bias either gets corrected or codified. Most performance systems collapse not because the reviews were bad but because the calibration was performative — a meeting where managers traded approvals rather than evaluated evidence.
What calibration is actually for
The primary purpose is to counter rater bias, particularly leniency bias, which is one of the most consistent findings in performance-rating research. Left to themselves, managers rate their own reports systematically higher than peer managers would. Calibration creates the cross-manager comparison that surfaces the inconsistency. Done well, it delivers three things:
- 1. Consistency: Two engineers with similar evidence get similar ratings across managers.
- 2. Defensibility: Each rating decision has a documented rationale that would withstand external scrutiny (in disputes, exit packages, or audits).
- 3. Talent intelligence: Cross-team visibility into who the strong performers, hidden contributors, and emerging issues are, as input into succession and growth planning.
The pre-work is the meeting
A calibration session lives or dies on what managers bring to it. Showing up with adjectives ('she's great') wastes everyone's time. Showing up with evidence ('she shipped X under Y conditions, mentored Z, and stretched into A') makes calibration possible.
- Managers complete draft ratings 7–10 days before the session.
- HR partner reviews drafts for completeness and obvious leniency outliers.
- Each report's case is summarised on a single page: evidence, level expectations met, growth areas.
- Facilitator surfaces the cases where calibration is most likely needed — outliers, edge cases, contested ratings.
- Distribution analysis is shared in advance, not revealed in the meeting.
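The HR partner's check for obvious leniency outliers can be sketched roughly as below. This is a minimal illustration rather than a prescribed method: the numeric 1–5 scale, the function name `flag_lenient_managers`, and the 0.5-point threshold are all assumptions introduced here. A flag is only a prompt to look at the evidence, not a verdict on the manager.

```python
from statistics import mean

def flag_lenient_managers(drafts, threshold=0.5):
    """Return managers whose mean draft rating exceeds the mean of all
    other managers' ratings by more than `threshold`.

    drafts: dict mapping manager name -> list of numeric draft ratings.
    threshold: hypothetical cut-off in rating points (an assumption).
    """
    flagged = {}
    for manager, ratings in drafts.items():
        # Compare against peers only, so a manager does not dilute their own gap.
        peer_ratings = [r for m, rs in drafts.items() if m != manager for r in rs]
        if not ratings or not peer_ratings:
            continue  # nothing to compare against
        gap = mean(ratings) - mean(peer_ratings)
        if gap > threshold:
            flagged[manager] = gap
    return flagged

# Illustrative data only: three managers on an assumed 1-5 scale.
drafts = {
    "manager_a": [3, 3, 4, 3],
    "manager_b": [3, 2, 3, 4],
    "manager_c": [5, 4, 5, 4],
}
print(flag_lenient_managers(drafts))  # only manager_c is flagged
```

A high team average can be fully justified, which is why the output feeds the pre-work review rather than adjusting any rating automatically.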
Running the session
- 1. Frame the bar (5 min): Facilitator restates what each rating actually means with concrete examples. Anchors the room.
- 2. Walk the distribution (10 min): Show the proposed rating distribution without naming people. Surfaces obvious leniency without singling out managers.
- 3. Discuss edge cases (60–90% of time): Focus exclusively on the cases where calibration matters: edges between two ratings, outliers, anyone proposed for the top or bottom band.
- 4. Reconcile and decide: Facilitator drives to a decision. Managers can disagree; the facilitator owns the call when consensus doesn't emerge.
- 5. Document and close: For every changed rating, capture the rationale in writing. The artifact matters as much as the final number.
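One way to make the documentation step hard to skip is to record each decision in a structure that rejects a changed rating with no written rationale. The sketch below assumes integer ratings; the class name `CalibrationDecision` and its fields are hypothetical and not taken from any particular HR system.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class CalibrationDecision:
    employee_id: str
    draft_rating: int
    final_rating: int
    rationale: Optional[str] = None
    # "consensus", or "facilitator" when the facilitator owned the call.
    decided_by: str = "consensus"

    def __post_init__(self):
        # A changed rating must carry a non-empty written rationale.
        if self.changed and not (self.rationale and self.rationale.strip()):
            raise ValueError(
                f"{self.employee_id}: rating changed "
                f"{self.draft_rating} -> {self.final_rating} without a rationale"
            )

    @property
    def changed(self) -> bool:
        return self.draft_rating != self.final_rating

# Illustrative records: one moved rating with its reason, one unchanged.
moved = CalibrationDecision(
    "emp-017", draft_rating=4, final_rating=3,
    rationale="Delivered at level expectations; stretch scope not yet shown.",
    decided_by="facilitator",
)
kept = CalibrationDecision("emp-018", draft_rating=3, final_rating=3)
```

Constructing a changed decision without a rationale raises `ValueError`, so the gap shows up in the room rather than six months later.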
The forced-ranking debate
Stack ranking — forcing a fixed distribution (top 20%, middle 70%, bottom 10%) — was popularised by Jack Welch at GE and copied widely in the 1990s and 2000s. It was largely abandoned in the 2010s after research and high-profile failures (Microsoft most prominently) showed it eroded collaboration and produced gaming behaviour. The modern view: distribution as a sanity-check, not a quota.
Forced ranking (quota):
- Mandates fixed % in each rating band
- Forces a 'bottom 10%' regardless of actual performance
- Punishes high-performing teams
- Erodes collaboration; rewards visibility over contribution
- Documented failures: Microsoft, GE in later years
Distribution as guidance (sanity check):
- Suggests expected shape ('roughly bell-shaped overall')
- Triggers conversation when a team deviates dramatically
- Allows justified deviations with evidence
- Preserves collaboration while countering leniency
- Modern default at most mature companies
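The difference between a quota and a sanity check can be made concrete: a sanity check measures how far a team sits from an expected shape and flags it for discussion, but never moves a rating. The sketch below uses total variation distance as the measure. The reference shares in `EXPECTED`, the five-band scale, and the 0.25 tolerance are illustrative assumptions, not recommended values.

```python
from collections import Counter

def distribution_gap(ratings, expected_share):
    """Total variation distance between a team's rating shares and a
    reference shape: 0.0 means identical, 1.0 means no overlap."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    counts = Counter(ratings)
    total = len(ratings)
    bands = set(expected_share) | set(counts)
    return 0.5 * sum(
        abs(counts.get(b, 0) / total - expected_share.get(b, 0.0)) for b in bands
    )

def needs_conversation(ratings, expected_share, tolerance=0.25):
    """True when the deviation is large enough to discuss.
    This only raises a flag; it never alters a rating."""
    return distribution_gap(ratings, expected_share) > tolerance

# Hypothetical reference shape for a five-band scale (roughly bell-shaped).
EXPECTED = {1: 0.05, 2: 0.20, 3: 0.50, 4: 0.20, 5: 0.05}

team = [4, 5, 4, 5, 4, 3, 5, 4]  # illustrative top-heavy team
if needs_conversation(team, EXPECTED):
    print("Deviation is large: ask the manager to walk through the evidence.")
```

A flagged team can keep every rating if the evidence supports it, which is the "justified deviation" case above.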
Common calibration failures
- Manager who advocates loudest wins — facilitator is too passive.
- Facilitator enforces a distribution mechanically — kills evidence-based discussion.
- Edge cases get rushed because the easy ratings consumed the time.
- Senior leader walks in late and overturns decisions without seeing the evidence.
- Documentation is skipped — six months later nobody remembers why a rating moved.
- Calibration outcomes leak — destroys candor for next cycle.
- Same managers, same biases, every cycle: rotating the facilitator and pairing help.
Frequently asked questions
Who should attend a calibration session?
All people-managers within a function or org level, plus an HR business partner as facilitator, plus the senior leader of that function. Cross-team representation matters more than seniority — a calibration with only one manager's reports represented is not calibration.
How long should a calibration session take?
Roughly 5–10 minutes per person being calibrated, after pre-work. A team of 40 reports across 6 managers typically needs a 3–4 hour session, which corresponds to the low end of that range. Split it across two meetings if attention drops.
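The arithmetic behind that estimate is easy to check with a small helper. This is a rough planning sketch: the function name `plan_sessions` and the 120-minute cap per meeting are assumptions made here for illustration, not figures from the guidance above.

```python
import math

def plan_sessions(reports, minutes_per_person, max_meeting_minutes=120):
    """Return (total_minutes, meetings_needed) for a calibration round.

    max_meeting_minutes is a hypothetical cap on how long one
    meeting can run before attention drops.
    """
    total = reports * minutes_per_person
    meetings = math.ceil(total / max_meeting_minutes)
    return total, meetings

for pace in (5, 10):
    total, meetings = plan_sessions(40, pace)
    print(f"{pace} min/person: {total / 60:.1f} hours across {meetings} meeting(s)")
```

At 5 minutes per person, 40 reports come to about 3.3 hours; at 10 minutes the same group needs about 6.7 hours, so the pace of discussion drives the schedule more than headcount alone.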
Should we tell employees their calibrated rating?
Tell them their final rating, yes. Tell them that the calibration process produced that rating, including that other managers reviewed it. Do not share the specific debate or how the rating moved during calibration; that breaks the candor of the session for next cycle.
Can calibration introduce its own bias?
Yes — particularly favouritism toward employees who are visible to multiple managers. Mitigations: structured pre-work, evidence-first discussion, and explicit attention to less-visible contributors. The bias calibration removes (leniency) is much larger than the bias it can introduce.
How does calibration interact with continuous performance management?
Continuous feedback provides the evidence stream that makes calibration possible. Without it, calibration becomes opinion-based. The two are complements, not alternatives.