Game Theory in Performance Calibration: Building Cheat-Resistant Metrics
Goodhart's Law predicts that any single metric you turn into a target will be gamed. The fix is a balanced multi-metric model in which maximising one metric pays off only if the others hold steady. A playbook.
- Goodhart's Law: 'when a measure becomes a target, it ceases to be a good measure'.
- Engineers are exceptionally good at optimising for whatever you measure — including the wrong thing.
- Single-metric performance models always get gamed. Examples: LOC, commit count, story points, ticket close rate.
- Cheat-resistant models use 2–4 balanced metrics where gaming one degrades another (e.g. velocity × code-quality × peer-review depth).
- DORA (DevOps Research and Assessment) showed balanced 4-metric models predict 22% higher delivery performance than single-metric tracking.
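One way to read "gaming one degrades another" is as a multiplicative score. The sketch below is an illustration under stated assumptions, not a formula from DORA or any cited source: `balanced_score` is a hypothetical helper, the metrics are assumed to be already normalised to the range 0 to 1, and the sample values are invented.

```python
from math import prod


def balanced_score(metrics: dict[str, float]) -> float:
    """Geometric mean of metrics that are already normalised to [0, 1].

    Hypothetical helper: because the values are multiplied, a single
    collapsed metric pulls the whole score down, unlike a plain average.
    """
    if not metrics:
        raise ValueError("at least one metric is required")
    for name, value in metrics.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be normalised to [0, 1], got {value}")
    return prod(metrics.values()) ** (1.0 / len(metrics))


# Invented sample data: steady work versus velocity maximised at the
# expense of the other two metrics.
honest = {"velocity": 0.7, "code_quality": 0.7, "review_depth": 0.7}
gamed = {"velocity": 1.0, "code_quality": 0.2, "review_depth": 0.2}

honest_score = balanced_score(honest)
gamed_score = balanced_score(gamed)

# A plain average is shown only for comparison; it punishes gaming less.
gamed_average = sum(gamed.values()) / len(gamed)

print(f"honest: {honest_score:.2f}")
print(f"gamed:  {gamed_score:.2f} (plain average would be {gamed_average:.2f})")
```

With these numbers the gamed profile scores well below the honest one even though its velocity is perfect, which is the property a balanced model is meant to have. How each metric is normalised is a separate design decision that this sketch leaves open.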
A team manager once told engineers their reviews would weight 'commit volume'. Within two weeks, the median commit size dropped 67% and everyone hit their target. Velocity was untouched. The metric was being maximised. The work was not.
Goodhart's Law in plain English
“When a measure becomes a target, it ceases to be a good measure.”
| Metric | Gaming strategy | Real-world result |
|---|---|---|
| Lines of code | Verbose code, no abstraction | Tech debt explosion |
| Commit count | Split every PR into 12 commits | Unreadable history, slow reviews |
| Story points | Inflate estimates | Velocity goes up, output flat |
| Bugs closed | Close own bugs as 'won't fix' | Customer-visible bugs rise |
| Time-to-merge | Skip review, force-push | Quality and safety collapse |
Designing a cheat-resistant model
Three rules from game theory:
- Pair every output metric with a quality metric. Velocity AND incident rate. Tickets closed AND customer-satisfaction.
- Use ratios, not absolutes. Code-review depth (comments per LOC reviewed) beats 'reviews completed'.
- Add a peer signal. Reciprocal review quality, manager 360s, citizenship — humans are harder to game than numbers.
Single metrics and the gaming they invite:
- Deploy frequency only → 'tiny noisy deploys'
- Bug close rate only → 'won't fix'
- Velocity only → 'inflated estimates'
Balanced bundles:
- Deploys × change-failure × MTTR: gaming any one breaks another
- Bugs closed × customer-reported regressions: closing bugs the wrong way pushes regressions up
- Velocity × peer-review depth × escaped defects
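The first two rules can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: `ReviewStats`, `review_depth`, and `is_suspicious_gain` are hypothetical names, the per-100-lines scaling is an arbitrary choice, and the sample figures are invented.

```python
from dataclasses import dataclass


@dataclass
class ReviewStats:
    """Hypothetical per-reviewer totals for one review period."""
    reviews_completed: int
    comments_left: int
    loc_reviewed: int


def review_depth(stats: ReviewStats, per_loc: int = 100) -> float:
    """Comments per `per_loc` lines reviewed (a ratio, not an absolute count)."""
    if stats.loc_reviewed <= 0:
        return 0.0
    return stats.comments_left / stats.loc_reviewed * per_loc


def is_suspicious_gain(output_change: float, quality_change: float) -> bool:
    """Flag an output metric that improved while its paired quality metric got worse.

    Both arguments are relative changes where a positive value means "better".
    """
    return output_change > 0 and quality_change < 0


# Invented sample data: many shallow reviews versus fewer thorough ones.
rubber_stamper = ReviewStats(reviews_completed=40, comments_left=10, loc_reviewed=8000)
careful_reviewer = ReviewStats(reviews_completed=12, comments_left=90, loc_reviewed=3000)

print(f"rubber stamper:   {review_depth(rubber_stamper):.3f} comments per 100 LOC")
print(f"careful reviewer: {review_depth(careful_reviewer):.3f} comments per 100 LOC")

# Velocity up 30% while the paired quality signal fell 15%: worth a closer look.
print(is_suspicious_gain(output_change=0.30, quality_change=-0.15))
```

On 'reviews completed' alone the first reviewer looks more than three times as productive. The ratio reverses that ranking, and the paired check flags a gain that came at the cost of quality rather than rewarding it.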
Three battle-tested metric sets
| Use case | Metric bundle | Why it resists gaming |
|---|---|---|
| Engineering team performance | DORA 4: deploy frequency, lead time, change-failure rate, MTTR | Pushing speed at the expense of quality degrades change-failure rate and MTTR |
| Individual IC performance | Shipped impact + code-review depth + citizenship + peer trust score | Gaming the visible metrics tanks the peer signal |
| Sales performance | Closed revenue + retention 12mo + customer health score | Sandbagging or over-promising breaks retention |
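For the first row, the four DORA metrics can be derived from a simple deployment log. The sketch below rests on simplifying assumptions: the `Deployment` record and its field names are hypothetical, lead time is measured from commit to deploy, and restore time is measured from the failed deploy to recovery. Real pipelines will define these events in their own way.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical deployment record; field names are assumptions."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set only for failed deploys


def hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def dora_metrics(deploys: list[Deployment], window_days: int) -> dict[str, float]:
    """Compute the four DORA metrics over a window of `window_days` days."""
    if not deploys or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")
    failures = [d for d in deploys if d.failed]
    restore_hours = [
        hours(d.restored_at - d.deployed_at)
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time_hours": median(
            hours(d.deployed_at - d.committed_at) for d in deploys
        ),
        "change_failure_rate": len(failures) / len(deploys),
        # 0.0 here means no restored failures were observed in the window.
        "mean_time_to_restore_hours": (
            sum(restore_hours) / len(restore_hours) if restore_hours else 0.0
        ),
    }


# Invented sample data: four deploys in one week, one of which failed
# and was restored two hours later.
t0 = datetime(2024, 1, 1, 9, 0)
day = timedelta(days=1)
sample = [
    Deployment(t0, t0 + timedelta(hours=4)),
    Deployment(t0 + day, t0 + day + timedelta(hours=6), failed=True,
               restored_at=t0 + day + timedelta(hours=8)),
    Deployment(t0 + 2 * day, t0 + 2 * day + timedelta(hours=2)),
    Deployment(t0 + 3 * day, t0 + 3 * day + timedelta(hours=8)),
]

result = dora_metrics(sample, window_days=7)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Reporting all four numbers side by side is what makes the bundle useful: a team that raises `deploys_per_day` by shipping riskier changes should see `change_failure_rate` and `mean_time_to_restore_hours` move against it in the same report.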
Forsgren, Humble & Kim's Accelerate (2018), whose underlying research is updated annually in the DORA State of DevOps reports, showed the 4-metric DORA bundle is statistically valid as a balanced model that resists gaming. Elite-performing organisations score in the top quartile on all four simultaneously, which is very hard to fake.
Takeaways
- Single metrics will always be gamed. Stop pretending otherwise.
- Bundles where gaming one degrades another are cheat-resistant.
- Peer signals are the hardest to game and the cheapest to collect.
Sources
- Goodhart's Law: original lecture, 1975; Strathern, 1997
- Forsgren, Humble & Kim: Accelerate, IT Revolution, 2018
- DORA State of DevOps 2024, Google Cloud DORA