Game Theory in Performance Calibration: Building Cheat-Resistant Metrics

60-Second Summary

Goodhart's Law: 'when a measure becomes a target, it ceases to be a good measure'.
Engineers are exceptionally good at optimising for whatever you measure — including the wrong thing.
Single-metric performance models always get gamed. Examples: LOC, commit count, story points, ticket close rate.
Cheat-resistant models use 2–4 balanced metrics where gaming one degrades another (e.g. velocity × code-quality × peer-review depth).
DORA (DevOps Research and Assessment) showed balanced 4-metric models predict 22% higher delivery performance than single-metric tracking.

A team manager once told engineers their reviews would weight 'commit volume'. Within two weeks, the median commit size dropped 67% and everyone hit their target. Velocity was untouched. The metric was being maximised. The work was not.

Goodhart's Law in plain English

“When a measure becomes a target, it ceases to be a good measure.”
Marilyn Strathern's 1997 reformulation of a principle Charles Goodhart first stated in 1975

Metric	Gaming strategy	Real-world result
Lines of code	Verbose code, no abstraction	Tech debt explosion
Commit count	Split every PR into 12 commits	Unreadable history, slow reviews
Story points	Inflate estimates	Velocity goes up, output flat
Bugs closed	Close own bugs as 'won't fix'	Customer-visible bugs rise
Time-to-merge	Skip review, force-push	Quality and safety collapse

Designing a cheat-resistant model

Three rules from game theory:

Pair every output metric with a quality metric. Velocity AND incident rate. Tickets closed AND customer-satisfaction.
Use ratios, not absolutes. Code-review depth (comments per LOC reviewed) beats 'reviews completed'.
Add a peer signal. Reciprocal review quality, manager 360s, citizenship — humans are harder to game than numbers.
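Rule 2 can be sketched in a few lines of Python. This is only an illustration: the ReviewStats fields and the two sample reviewers are hypothetical, and "comments per LOC reviewed" is taken directly from the rule above as the definition of review depth.

```python
from dataclasses import dataclass


@dataclass
class ReviewStats:
    """Hypothetical per-reviewer totals for one review period."""
    reviews_completed: int
    comments_left: int
    loc_reviewed: int


def review_depth(stats: ReviewStats) -> float:
    """Comments per line of code reviewed: a ratio, not an absolute count."""
    if stats.loc_reviewed == 0:
        return 0.0  # nothing reviewed, so no depth to report
    return stats.comments_left / stats.loc_reviewed


# Made-up numbers: one reviewer approves many PRs with few comments,
# the other completes fewer reviews but engages with the code.
rubber_stamper = ReviewStats(reviews_completed=40, comments_left=10, loc_reviewed=8000)
careful_reviewer = ReviewStats(reviews_completed=12, comments_left=90, loc_reviewed=3000)

print(review_depth(rubber_stamper))    # 0.00125
print(review_depth(careful_reviewer))  # 0.03
```

On 'reviews completed' the rubber-stamper wins 40 to 12; on the ratio the order flips. A ratio can still be gamed (for example with trivial comments), which is why rule 3 adds a peer signal on top.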

Single metric vs balanced

Single metric (gameable)

Deploy frequency only → 'tiny noisy deploys'
Bug close rate only → 'won't fix'
Velocity only → 'inflated estimates'

Balanced (cheat-resistant)

Deploys × change-failure × MTTR: gaming any one breaks another
Bugs closed × customer-reported regressions: closing bugs without fixing them pushes regressions up
Velocity × peer-review depth × escaped defects
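One way to read the × in these bundles is as a literal product. The sketch below assumes each metric has already been normalised to a score between 0 and 1 where 1 is best (so a high escaped_defects score means few escaped defects). The normalisation, the metric names, the sample values and the choice of a geometric mean are all assumptions made for illustration, not something the sources prescribe.

```python
import math


def bundle_score(scores: dict[str, float]) -> float:
    """Geometric mean of metric scores normalised to [0, 1], where 1 is best.

    Because the scores are multiplied, a low value on any one metric
    drags the whole bundle down, however high the others are.
    """
    if not scores:
        raise ValueError("at least one metric is required")
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return math.prod(scores.values()) ** (1.0 / len(scores))


# Hypothetical teams: one steady, one that maxed out velocity
# at the cost of shallow reviews and more escaped defects.
steady = {"velocity": 0.7, "review_depth": 0.7, "escaped_defects": 0.7}
gamed = {"velocity": 1.0, "review_depth": 0.3, "escaped_defects": 0.4}

print(round(bundle_score(steady), 3))  # roughly 0.7
print(round(bundle_score(gamed), 3))   # roughly 0.49
```

On velocity alone the gamed team looks better (1.0 against 0.7). On the bundle it falls behind, because the product punishes the metrics that were sacrificed. A plain average would be more forgiving, which is the argument for multiplying rather than adding.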

Three battle-tested metric sets

Use case	Metric bundle	Why it resists gaming
Engineering team performance	DORA 4: deploy frequency, lead time, change-failure rate, MTTR	Pushing speed at the expense of quality shows up in change-failure rate and MTTR
Individual IC performance	Shipped impact + code-review depth + citizenship + peer trust score	Gaming the visible metrics tanks the peer signal
Sales performance	Closed revenue + retention 12mo + customer health score	Sandbagging or over-promising breaks retention

DORA's finding

Forsgren, Humble & Kim's Accelerate (2018), which draws on the annual State of DevOps surveys, found that the four DORA metrics move together: high-performing organisations do well on speed and stability at the same time rather than trading one for the other. Scoring well on all four simultaneously is hard to fake, because gaming one metric tends to show up in another.
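A "strong on all four at once" check can be sketched as follows. Everything here is hypothetical: the field names, the eight sample teams, and the use of quartile cut-offs within a local population are illustrative assumptions, not DORA's published benchmarks or method.

```python
from statistics import quantiles

# True means higher is better; False means lower is better.
HIGHER_IS_BETTER = {
    "deploys_per_week": True,
    "lead_time_hours": False,
    "change_failure_rate": False,
    "mttr_hours": False,
}


def top_quartile_on_all(team: dict, population: list[dict]) -> bool:
    """Return True only if the team is in the best quartile on every metric."""
    for metric, higher_is_better in HIGHER_IS_BETTER.items():
        values = [t[metric] for t in population]
        q1, _, q3 = quantiles(values, n=4)  # three quartile cut points
        if higher_is_better and team[metric] < q3:
            return False
        if not higher_is_better and team[metric] > q1:
            return False
    return True


# Made-up data: (name, deploys/week, lead time h, change-failure rate, MTTR h)
rows = [
    ("balanced", 50, 2, 0.02, 0.5),
    ("deploy_gamer", 60, 3, 0.40, 20),
    ("c", 5, 24, 0.10, 4), ("d", 8, 48, 0.15, 6), ("e", 10, 36, 0.12, 5),
    ("f", 12, 72, 0.20, 8), ("g", 7, 30, 0.18, 3), ("h", 9, 40, 0.14, 7),
]
teams = {
    name: dict(zip(HIGHER_IS_BETTER, metrics)) for name, *metrics in rows
}
population = list(teams.values())

print(top_quartile_on_all(teams["balanced"], population))      # True
print(top_quartile_on_all(teams["deploy_gamer"], population))  # False
```

The deploy_gamer team has the highest deploy frequency in the sample, yet fails the check because its change-failure rate and MTTR are the worst. Requiring all four at once is what makes the bundle hard to game.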

Takeaways

Single metrics will always be gamed. Stop pretending otherwise.
Bundles where gaming one degrades another are cheat-resistant.
Peer signals are the hardest to game and the cheapest to collect.

References

Goodhart's Law — Original Lecture — 1975 / Strathern 1997
Forsgren, Humble & Kim — Accelerate — IT Revolution, 2018
DORA State of DevOps 2024 — Google Cloud DORA