Interview Scorecards: Writing Them Well
A scorecard turns 'I liked them' into 'they demonstrated X'. Here's how to write one that calibrates a whole loop, reduces bias, and survives legal scrutiny.
Structured interviews predict job performance far better than unstructured ones: the revised meta-analytic estimates in Sackett et al. (2022), which update Schmidt & Hunter (1998), put them at roughly twice the validity. The scorecard is the artifact that makes structure real. Without one, your hiring loop is a vibes-based committee, and the legal exposure is the cherry on top.
Why scorecards matter
- Force the hiring manager to define 'great' before meeting candidates
- Give every interviewer the same target, so feedback is comparable
- Reduce affinity bias by anchoring on evidence, not impression
- Create a defensible record under EEOC/Title VII and UK Equality Act scrutiny
- Make debriefs about evidence, not who spoke last or loudest
Structured interviewing has been validated across decades of meta-analyses. Google's re:Work guide to structured interviewing and the US OPM's structured interview guide both anchor on the same idea: define competencies, ask consistent questions, rate against behavioral anchors.
Anatomy of a scorecard
1. Role context: One-line mission for the role, level, and the team it sits on.
2. Outcomes (3–5): What success in 12 months looks like, measurable wherever possible.
3. Must-have competencies: The 4–6 competencies the loop will actually assess. Not 20.
4. Nice-to-haves: Explicitly separated so they don't sneak into rejection rationale.
5. Anti-signals: Behaviors that should down-vote regardless of other strengths (e.g., punching down in a behavioral story).
6. Loop map: Which interviewer assesses which competency, with the question bank.
7. Rating scale + anchors: A defined scale with behavioral examples per level.
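The anatomy above maps naturally onto a small data structure. The following is a minimal sketch, not a schema from any ATS: the class names `Competency` and `Scorecard`, the field names, and the `problems` helper are all hypothetical, and the checks simply encode the counts given in the list (3–5 outcomes, 4–6 must-have competencies, every competency mapped to an interviewer and given anchors).

```python
from dataclasses import dataclass, field


@dataclass
class Competency:
    name: str
    # Behavioral anchor text per rating level, e.g. {3: "Meets the bar because ..."}
    anchors: dict[int, str] = field(default_factory=dict)


@dataclass
class Scorecard:
    role_context: str                      # one-line mission, level, team
    outcomes: list[str]                    # what success in 12 months looks like
    must_haves: list[Competency]           # the competencies the loop assesses
    nice_to_haves: list[str] = field(default_factory=list)
    anti_signals: list[str] = field(default_factory=list)
    # Loop map: competency name -> interviewers assigned to assess it
    loop_map: dict[str, list[str]] = field(default_factory=dict)

    def problems(self) -> list[str]:
        """List ways this scorecard departs from the anatomy described above."""
        issues = []
        if not 3 <= len(self.outcomes) <= 5:
            issues.append("expected 3-5 outcomes")
        if not 4 <= len(self.must_haves) <= 6:
            issues.append("expected 4-6 must-have competencies")
        for competency in self.must_haves:
            if not self.loop_map.get(competency.name):
                issues.append(f"no interviewer assigned to {competency.name}")
            if not competency.anchors:
                issues.append(f"no behavioral anchors for {competency.name}")
        return issues
```

Running `problems()` before the loop opens gives the hiring manager a quick completeness check; an empty list means the scorecard has every part the anatomy calls for, not that its content is good.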
Defining competencies
A competency is a cluster of observable behavior tied to job performance — not a personality trait. 'Smart' is not a competency. 'Decomposes ambiguous problems into testable hypotheses' is.
Vague traits to avoid:
- Smart
- Cultural fit
- Good communicator
- Self-starter
- Passionate

Observable competencies to use instead:
- Decomposes ambiguous problems into testable hypotheses
- Adapts message to technical vs non-technical audiences
- Names disagreement with peers and resolves it without escalation
- Identifies missing work and starts it without being asked
- Asks clarifying questions before solving
'Cultural fit' deserves its own warning. It is vague, varies by interviewer, and correlates with affinity bias. Replace it with 'values alignment' tied to 2–3 specific written values, each with a behavioral anchor. EEOC guidance treats subjective criteria as a higher-risk hiring signal.
Rating scales that work
A 4-point scale is the sweet spot: it forces a directional call (no neutral middle), is granular enough to distinguish candidates, and is simple enough that interviewers actually use it consistently.
| Rating | Label | Meaning |
|---|---|---|
| 1 | Strong No Hire | Demonstrated the opposite of the competency, or showed an anti-signal. |
| 2 | No Hire | Did not demonstrate the competency at the bar for this level. |
| 3 | Hire | Demonstrated the competency at the bar with concrete evidence. |
| 4 | Strong Hire | Demonstrated the competency well above the bar, with depth. |
5-point scales reliably produce a clump at '3 — Mixed', which carries no decision. Force interviewers to choose a direction; the debrief is where nuance lives.
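If ratings are captured in a tool or spreadsheet export, the 4-point scale can be encoded so that an off-scale or "neutral" value cannot be recorded at all. This is a small sketch under that assumption; `Rating`, `leans_hire`, and `parse_rating` are illustrative names, not part of any vendor's product.

```python
from enum import IntEnum


class Rating(IntEnum):
    STRONG_NO_HIRE = 1
    NO_HIRE = 2
    HIRE = 3
    STRONG_HIRE = 4

    @property
    def leans_hire(self) -> bool:
        # With no neutral midpoint, every rating falls on one side of the bar.
        return self >= Rating.HIRE


def parse_rating(value: int) -> Rating:
    """Convert a submitted number to a Rating, rejecting anything off the scale."""
    try:
        return Rating(value)
    except ValueError:
        raise ValueError(f"rating must be 1-4, got {value!r}") from None
```

Because every member maps to a direction, `parse_rating(3).leans_hire` is `True` and `parse_rating(2).leans_hire` is `False`; there is no value an interviewer can submit that defers the call.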
One scorecard, many interviewers
- Map each competency to one or two interviewers: no gaps, and no competency that every interviewer re-assesses
- Pair each interviewer with 2–3 questions per competency from a shared bank
- Every interviewer must submit ratings + written evidence before seeing others' scores
- Block scorecard visibility until submission (Greenhouse, Ashby, and Lever all support this)
- Debrief is the synthesis — not a re-vote
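The submit-before-you-see rule is straightforward to express in code. The sketch below is a hypothetical in-memory model, not how Greenhouse, Ashby, or Lever implement it: the `BlindFeedback` class and its method names are invented for illustration. It enforces two of the rules above: ratings must come with written evidence, and nobody can read others' feedback until their own is in.

```python
class BlindFeedback:
    """Holds interviewer feedback and hides it from anyone who has not submitted."""

    def __init__(self, interviewers: list[str]):
        self._expected = set(interviewers)
        self._submitted: dict[str, tuple[dict[str, int], str]] = {}

    def submit(self, interviewer: str, ratings: dict[str, int], evidence: str) -> None:
        if interviewer not in self._expected:
            raise KeyError(f"{interviewer} is not on this loop")
        if not evidence.strip():
            raise ValueError("written evidence is required alongside ratings")
        self._submitted[interviewer] = (dict(ratings), evidence)

    def view_others(self, viewer: str) -> dict[str, tuple[dict[str, int], str]]:
        # Visibility is blocked until the viewer's own scorecard is submitted.
        if viewer not in self._submitted:
            raise PermissionError("submit your own scorecard first")
        return {name: fb for name, fb in self._submitted.items() if name != viewer}

    def ready_for_debrief(self) -> bool:
        return set(self._submitted) == self._expected
```

A debrief would only be scheduled once `ready_for_debrief()` returns `True`, which keeps the room from opening with one interviewer still unanchored by a written record.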
Calibration and debrief
1. Pre-read: Hiring manager reads all scorecards before the room opens. Notes patterns.
2. Per-competency walk: For each competency, each interviewer states rating and 1–2 pieces of evidence. No 'feelings' before evidence.
3. Disagreement protocol: When ratings diverge by 2+ points, ask: did we hear different evidence, or interpret the same evidence differently?
4. Decision: Hiring manager makes the call. Recruiter records the rationale tied to the scorecard, not to vibes.
5. Post-mortem after 6 months: Look back at hires vs the bar set in the scorecard. Was the scorecard predictive?
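The disagreement protocol's trigger (ratings that diverge by 2+ points) can be computed during the pre-read so the debrief starts with the right competencies. A minimal sketch, assuming ratings are available as a nested mapping of competency to interviewer to score; the function name and input shape are illustrative.

```python
def divergent_competencies(
    ratings: dict[str, dict[str, int]], threshold: int = 2
) -> dict[str, int]:
    """Return {competency: spread} where max - min rating is at least `threshold`.

    `ratings` maps competency -> {interviewer: rating on the 1-4 scale}.
    Competencies rated by a single interviewer cannot diverge and are skipped.
    """
    flagged = {}
    for competency, by_interviewer in ratings.items():
        scores = list(by_interviewer.values())
        if len(scores) < 2:
            continue
        spread = max(scores) - min(scores)
        if spread >= threshold:
            flagged[competency] = spread
    return flagged
```

The output only says where to look. Whether the interviewers heard different evidence or read the same evidence differently is still a question for the room.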
Worked example
| Competency | Interviewer | Bar | Sample question |
|---|---|---|---|
| System design | Tech lead | Designs a service with explicit trade-offs on consistency vs availability | Design a rate limiter for a public API at 1M RPS. |
| Decomposition | Peer engineer | Breaks an ambiguous problem into testable hypotheses | Walk me through a time you debugged a production issue with no obvious cause. |
| Collaboration | Cross-team partner | Names disagreement, resolves without escalation | Tell me about a technical decision you lost. What happened next? |
| Ownership | Hiring manager | Identifies missing work, starts it without prompting | Describe something you fixed that wasn't your job. |
Common mistakes
- Scorecards written after the loop is designed (should be the other way around)
- Overlapping competencies across interviewers — wastes signal
- No behavioral anchors — every interviewer invents their own bar
- Allowing 'culture fit' as a competency
- Reading other interviewers' scores before submitting your own
- Debriefs that revisit 'gut feel' instead of staying on the scorecard
- Never auditing whether the scorecard predicted on-the-job performance
Sources
- Schmidt & Hunter (1998), Psychological Bulletin, and Sackett et al. (2022), Journal of Applied Psychology: selection validity meta-analyses
- Google re:Work: Structured interviewing
- US OPM: Structured Interviews Guide
- EEOC: Employer best practices
- Greenhouse: Scorecards in practice
- Ashby: Interview kits and scorecards