Metrics Mislead!

Posted May 14, 2015 by Bill Wake

Metrics mislead! In complex domains like software development, metrics can't capture everything important. Setting goals for metrics can make things worse.

A Real Case

I was once asked to review a team’s unit tests.

Senior management had pushed hard for 85%+ coverage. The team reached this goal by having half the team work for 10 weeks, backfilling unit tests.

Their typical test case (of hundreds) looked like this:

public void testSomeMethod() {
    try {
        SomeObject obj = new SomeObject(parameters);    // 1
        obj.someMethod(moreParameters);                 // 2
        // No assertions made                           // 3
    } catch (Throwable t) {
        // Ignore any exceptions                        // 4
    }
}

I had to tell management that these tests were worthless.

The major problem comes at markers //3 and //4. //3 is the watchdog that can’t bark: this test has no assertions about the results of calling the method in //1 and //2. Whatever the called method does, it is accepted by this test as the “right” behavior.

Well, not quite whatever; the called code could throw an exception. But if it does that, the catch clause at marker //4 ensures that the exception is caught and ignored.

This test still provides high coverage! Everything called in //1 and //2, plus whatever they call, will be touched.
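To make this concrete, here is a minimal sketch in plain Java, with no test framework. The class HollowTestDemo, its add method, and the deliberate bug are all hypothetical, invented for illustration; none of it comes from the team's code. Both "tests" execute exactly the same line of add, so a coverage tool would score them the same, yet only the second one can ever fail.

```java
class HollowTestDemo {
    // Hypothetical code under test, with a deliberate bug:
    // it subtracts where it should add.
    static int add(int a, int b) {
        return a - b;
    }

    // Same shape as the test above: call the code, assert nothing,
    // and swallow every exception. It cannot fail.
    static boolean hollowTest() {
        try {
            add(2, 3);
        } catch (Throwable t) {
            // Ignore any exceptions
        }
        return true;
    }

    // Executes the same line of add, but checks the result.
    static boolean assertingTest() {
        return add(2, 3) == 5;
    }

    public static void main(String[] args) {
        System.out.println("hollow test passes:    " + hollowTest());    // true
        System.out.println("asserting test passes: " + assertingTest()); // false: the bug is caught
    }
}
```

The coverage number is identical for both; only the assertion turns the call into a check.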

That is the heart of the problem: When a metric isn't a direct measure, we can game it by focusing on the metric instead of the goal.

What about //1 and //2? Isn’t there still some value in calling all the system’s methods? There’s less value there than you might think. Rather than pass in parameters that made for realistic cases, the test authors passed in null or a simple object whenever that would generate sufficient coverage.

So, even if we add asserts and remove try-catch clauses, the examples aren’t representative. The system would only pass nulls around if something were broken, and we care more about realistic situations.
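A small sketch of why the inputs matter, again in plain Java. Everything here (RealisticInputDemo, total, the off-by-one bug) is a made-up example, not the system I reviewed. The null-input test has a real assertion and passes honestly, but it only exercises the guard clause; the bug sits on the path that realistic data takes.

```java
import java.util.Arrays;
import java.util.List;

class RealisticInputDemo {
    // Hypothetical code under test, with a deliberate off-by-one bug:
    // the loop stops before the last element.
    static int total(List<Integer> prices) {
        if (prices == null) {
            return 0;
        }
        int sum = 0;
        for (int i = 0; i < prices.size() - 1; i++) {
            sum += prices.get(i);
        }
        return sum;
    }

    // Asserts something, but only about an input the system should never see.
    static boolean nullInputTest() {
        return total(null) == 0;
    }

    // Uses the kind of data the system actually handles.
    static boolean realisticInputTest() {
        return total(Arrays.asList(5, 10, 20)) == 35;
    }

    public static void main(String[] args) {
        System.out.println("null-input test passes:      " + nullInputTest());      // true
        System.out.println("realistic-input test passes: " + realisticInputTest()); // false
    }
}
```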

We were warned about problems like this. Goodhart’s Law, from 1975, says:

“Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.”

or more succinctly:

“When a measure becomes a target, it ceases to be a good measure.”

The Logic of Metrics

Why do we want metrics?

We’ll look at code coverage, but the argument is similar for most measures.

Suppose I claim to have a complete set of tests for your system, but when you run those tests only 5% of the lines of code in your system are ever executed.

Not plausible, right? (Unless you have a lot of dead code!) How could those supposedly complete tests find a defect in the other 95% of the code?

We’re making this argument:

If the tests have low coverage, then they’re an incomplete set of tests.

This is logically equivalent to:

If the tests are complete, then they have high coverage.

But this is not equivalent to:

If the tests have high coverage, then they’re a complete set of tests.
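The three statements can be checked mechanically with a truth table. This is only a sketch; the class and method names are my own labels for the sentences above. It treats "low coverage" and "incomplete" as two booleans and evaluates each statement for all four combinations.

```java
class CoverageLogicDemo {
    // Material implication: "if p then q".
    static boolean implies(boolean p, boolean q) {
        return !p || q;
    }

    // "If the tests have low coverage, then they're an incomplete set of tests."
    static boolean original(boolean low, boolean incomplete) {
        return implies(low, incomplete);
    }

    // "If the tests are complete, then they have high coverage."
    static boolean contrapositive(boolean low, boolean incomplete) {
        return implies(!incomplete, !low);
    }

    // "If the tests have high coverage, then they're a complete set of tests."
    static boolean highImpliesComplete(boolean low, boolean incomplete) {
        return implies(!low, !incomplete);
    }

    public static void main(String[] args) {
        boolean[] values = {false, true};
        for (boolean low : values) {
            for (boolean incomplete : values) {
                System.out.println("low=" + low + " incomplete=" + incomplete
                        + " original=" + original(low, incomplete)
                        + " contrapositive=" + contrapositive(low, incomplete)
                        + " highImpliesComplete=" + highImpliesComplete(low, incomplete));
            }
        }
    }
}
```

The first two columns agree on every row. The third disagrees on the row with high coverage and incomplete tests, which is exactly the situation in the case above.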

Many metrics share this pattern:

When it comes to metrics, good scores often lie; poor scores rarely do.

Gaming Metrics

Goals for Metrics Distort the Truth

If you use metrics to pressure people, they will respond.

Some of that response may address the underlying causes of poor scores. However, doing things right takes time, especially if you also have to learn what right is.

If you keep increasing the pressure, people will still respond. When they’ve exhausted their ability to easily improve what you want, they often veer into doing things poorly in a way that “shows” well.

This is gaming the metric. (Rob Austin's book Measuring and Managing Performance in Organizations shows that even simple situations are vulnerable to this.)

My favorite example is in a Dilbert cartoon: the boss promises a bonus for bugs found and fixed, and Wally declares, “I’m gonna write me a new minivan...”

Gaming isn’t always intentional. When the team doesn’t have the (intellectual or physical) tools to effectively improve the underlying problems, they may evolve behavior patterns that improve the numbers without improving the code.

Where Does That Leave Us?

Metrics have value as a tool for understanding rather than as a tool for control.

When you think metrics will help:

  • Take a human perspective: look at the code, talk to the team, talk to customers. Numbers and graphs don't tell the whole story.
  • Help teams interpret metrics they see (rather than using metrics as a tool of control by imposing targets).
  • Look for metrics closer to the real value you want. For example, can you measure your customer’s benefits from your software (rather than code coverage)?
  • Don’t focus on a single metric; no single metric can simultaneously tell you how hard your team is working, how valuable the software is to customers, how good the code is, and how maintainable it is. Yet you care about all those things and more.

When you use metrics, use them softly: don’t make them a primary focus. Recognize that setting goals for metrics can actually make things worse.

I'll leave the last words to Tom DeMarco: "At its best, the use of software metrics can inform and guide developers, and help organizations to improve. At its worst, it can do actual harm."

Further Reading