
Game Theory in Performance Calibration: Building Cheat-Resistant Metrics

Goodhart's Law predicts that any single metric you turn into a target will be gamed. The fix is a balanced multi-metric model, where maximising one metric only pays off if the others hold steady. This is a playbook for building one.

Updated 2026-05-21
60-Second Summary
  • Goodhart's Law: 'when a measure becomes a target, it ceases to be a good measure'.
  • Engineers are exceptionally good at optimising for whatever you measure — including the wrong thing.
  • Single-metric performance models always get gamed. Examples: LOC, commit count, story points, ticket close rate.
  • Cheat-resistant models use 2–4 balanced metrics where gaming one degrades another (e.g. velocity × code-quality × peer-review depth).
  • DORA (DevOps Research and Assessment) showed balanced 4-metric models predict 22% higher delivery performance than single-metric tracking.

A team manager once told engineers their reviews would weight 'commit volume'. Within two weeks, the median commit size dropped 67% and everyone hit their target. Velocity was untouched. The metric was being maximised. The work was not.
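The arithmetic behind that anecdote is worth a quick sketch. It assumes the total lines shipped stayed the same and treats the 67% drop as a drop in average commit size (the anecdote reports the median, so this is only an approximation). The function name and the starting figure of 30 commits are hypothetical.

```python
def commits_after_split(commits_before: int, size_drop: float) -> float:
    """Commits needed to ship the same total lines once the average
    commit shrinks by `size_drop` (0.67 means 67% smaller)."""
    if not 0.0 <= size_drop < 1.0:
        raise ValueError("size_drop must be in [0, 1)")
    # Same total output, smaller pieces: count scales by 1 / (1 - drop).
    return commits_before / (1.0 - size_drop)


# Hypothetical engineer who used to make 30 commits per review period.
print(round(commits_after_split(30, 0.67)))  # 91
```

Under these assumptions the commit count roughly triples while the amount of work shipped stays flat, which is what "the metric was being maximised, the work was not" looks like in numbers.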

Goodhart's Law in plain English

When a measure becomes a target, it ceases to be a good measure.
Charles Goodhart, 1975 (Marilyn Strathern's pithy reformulation, 1997)

Metric         | Gaming strategy                 | Real-world result
Lines of code  | Verbose code, no abstraction    | Tech debt explosion
Commit count   | Split every PR into 12 commits  | Unreadable history, slow reviews
Story points   | Inflate estimates               | Velocity goes up, output flat
Bugs closed    | Close own bugs as 'won't fix'   | Customer-visible bugs rise
Time-to-merge  | Skip review, force-push         | Quality and safety collapse

Designing a cheat-resistant model

Three rules from game theory:

  1. Pair every output metric with a quality metric. Velocity AND incident rate. Tickets closed AND customer-satisfaction.
  2. Use ratios, not absolutes. Code-review depth (comments per LOC reviewed) beats 'reviews completed'.
  3. Add a peer signal. Reciprocal review quality, manager 360s, citizenship — humans are harder to game than numbers.
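
Rule 2 can be sketched in a few lines. This is a minimal illustration, not a prescribed formula: the `ReviewStats` fields, the per-100-lines scaling, and the two sample engineers are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ReviewStats:
    """Hypothetical per-engineer review totals for one period."""
    reviews_completed: int
    comments_left: int
    loc_reviewed: int


def review_depth(stats: ReviewStats, per_loc: int = 100) -> float:
    """Comments per `per_loc` lines reviewed; 0.0 if nothing was reviewed."""
    if stats.loc_reviewed == 0:
        return 0.0
    return stats.comments_left * per_loc / stats.loc_reviewed


# Many approvals with almost no feedback vs fewer, more thorough reviews.
rubber_stamper = ReviewStats(reviews_completed=40, comments_left=8, loc_reviewed=8000)
careful_reviewer = ReviewStats(reviews_completed=12, comments_left=90, loc_reviewed=3000)

print(review_depth(rubber_stamper))    # 0.1
print(review_depth(careful_reviewer))  # 3.0
```

On the absolute count ('reviews completed') the rubber-stamper wins 40 to 12. On the ratio, the careful reviewer is far ahead, because approving more code without commenting pushes the ratio down rather than up.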

Single metric vs balanced

Single metric (gameable)
  • Deploy frequency only → 'tiny noisy deploys'
  • Bug close rate only → 'won't fix'
  • Velocity only → 'inflated estimates'

Balanced (cheat-resistant)
  • Deploys × change-failure × MTTR — gaming any one breaks another
  • Bugs closed × customer-reported regressions — closing wrong breaks the other
  • Velocity × peer-review depth × escaped defects
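
One way to make the "×" in these bundles concrete is a multiplicative score. The sketch below uses a geometric mean and assumes every metric has already been normalised to a 0 to 1 scale where higher is better (so change-failure rate and MTTR would be inverted first). The metric names, the normalisation, and the sample values are illustrative assumptions, not a standard scoring scheme.

```python
import math


def balanced_score(normalised: dict[str, float]) -> float:
    """Geometric mean of metrics already scaled to [0, 1], higher = better."""
    if not normalised:
        raise ValueError("need at least one metric")
    for name, value in normalised.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return math.prod(normalised.values()) ** (1 / len(normalised))


def additive_score(normalised: dict[str, float]) -> float:
    """Plain average, shown only for contrast."""
    return sum(normalised.values()) / len(normalised)


# A steady team vs one that maxed deploy frequency and let failures climb.
steady = {"deploy_frequency": 0.7, "change_success": 0.7, "recovery_speed": 0.7}
gamed = {"deploy_frequency": 1.0, "change_success": 0.2, "recovery_speed": 0.9}

print(round(additive_score(steady), 2), round(additive_score(gamed), 2))  # 0.7 0.7
print(round(balanced_score(steady), 2), round(balanced_score(gamed), 2))  # 0.7 0.56
```

A plain average cannot tell the two teams apart, while the multiplicative score drops as soon as one factor collapses. That is the property the bundles above rely on: a gain in one metric cannot buy back a loss in another.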

Three battle-tested metric sets

Use case                      | Metric bundle                                                                       | Why it resists gaming
Engineering team performance  | DORA 4: deploy frequency, lead time, change-failure rate, MTTR                      | Speed up one without quality and another suffers
Individual IC performance     | Shipped impact + code-review depth + citizenship + peer trust score                 | Gaming the visible metrics tanks the peer signal
Sales performance             | Closed revenue + 12-month retention + customer health score                         | Sandbagging or over-promising breaks retention

DORA's finding

Accelerate (Forsgren, Humble & Kim, 2018) and the annual State of DevOps reports that continue the research found that the four DORA metrics do not trade off against each other: teams that ship faster also tend to be more stable. Elite-performing organisations score highly on all four simultaneously, which is very hard to fake by gaming any single one.
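
For teams that want to track the four metrics together, here is a minimal sketch of computing them from deployment records. The `Deployment` record, its field names, and the choice of median lead time and mean restore time are assumptions for illustration. DORA's own surveys define and bucket these measures in their own way.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical deployment record; field names are illustrative."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set only for failed deploys


def dora_metrics(deploys: list[Deployment], period_days: int) -> dict[str, float]:
    """Compute all four metrics over the same window so they are read together."""
    if not deploys:
        raise ValueError("no deployments in period")
    hour = timedelta(hours=1)
    failures = [d for d in deploys if d.failed]
    lead_hours = [(d.deployed_at - d.committed_at) / hour for d in deploys]
    restore_hours = [
        (d.restored_at - d.deployed_at) / hour
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deploys) / period_days,
        "median_lead_time_hours": median(lead_hours),
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_restore_hours": (
            sum(restore_hours) / len(restore_hours) if restore_hours else 0.0
        ),
    }


# Four hypothetical deploys in one week, one of which failed and was fixed in 2h.
t0 = datetime(2026, 5, 1, 9, 0)
day, hr = timedelta(days=1), timedelta(hours=1)
deploys = [
    Deployment(t0, t0 + 4 * hr),
    Deployment(t0 + day, t0 + day + 8 * hr, failed=True, restored_at=t0 + day + 10 * hr),
    Deployment(t0 + 2 * day, t0 + 2 * day + 2 * hr),
    Deployment(t0 + 3 * day, t0 + 3 * day + 6 * hr),
]
print(dora_metrics(deploys, period_days=7))
```

For this sample data the median lead time is 5 hours, the change-failure rate is 0.25, and the mean time to restore is 2 hours. Reporting the four numbers as one record, rather than as separate dashboards, keeps the trade-offs visible: a jump in deploys per day arrives with its failure rate attached.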

Takeaways

  • Single metrics will always be gamed. Stop pretending otherwise.
  • Bundles where gaming one degrades another are cheat-resistant.
  • Peer signals are the hardest to game and the cheapest to collect.
References
Written by Pawan Joshi. Sources cited inline. Last updated 2026-05-21.