What Should We Measure?

Posted November 17, 2016 by Tim Ottinger

The Agile world is awash in metrics and measures, but most provide little benefit to teams.

What if we could change our set of measurements to support safer software development, continuous improvement, happiness of our project community, and careful craftsmanship?

In short, what are some Modern Agile metrics?

Why We Crave Measurements

Knowledge work is confusing, messy, and a tad unpredictable. It is mostly thinking, learning, experimenting, and retrying.

Programming is a kind of "lossy compression." The thinking that comprises the majority of the effort isn't directly visible in the end work. Only the final, working solution chosen by the developers is present in the code.

While measuring physical work is relatively easy, most of our work is thinking. This confounds attempts to measure degrees of completion. One programmer may have entered several dozen lines of code, but that code may be only 20% finished and have many indirect consequences (which will manifest as errors) that the author cannot see. Another feature may be implemented in a few well-chosen lines of code, so that having a fully-formed thought really was 90% of the effort.

The ephemeral nature of the work leaves us with some real problems:

  • How do we know if we are "on track?"
  • Are we "doing a good job?"
  • Are we looking at cost and time overruns?
  • Is this a good time to intervene, or should we wait?
  • Is this "special variation" or "normal variation?"
  • Are we keeping our promises to stakeholders?

Naturally enough, we turn to the collection and analysis of metrics to help us understand our processes and our progress.


A Quick Warning

We could write thousands of blog posts and stories on the improper and harmful use of metrics. Let's not do that here.

If we Make Safety A Prerequisite then we need to begin with safely handling metrics.

Measuring can, in itself, be hazardous.

The use of metrics is guided by three very important laws:

  • When a measure becomes a target, it ceases to be a good measure (Goodhart's Law).
  • Measures tend to be corrupted/gamed when used for target-setting (Campbell's Law).
  • Monitoring a metric may subtly influence people to maximize that measure (The Observer Effect).

A metric is just an indicator. Just as a high temperature reading on your dash indicates a problem with your engine or cooling system, a metric only lets you know that something might be wrong — forcing that number to change doesn't necessarily fix anything.

You can have too many metrics.

Some people succumb to planning stress when seeing the vast number of metrics which might be meaningful, and they spend an awful lot of time on gathering and analyzing metrics. The human costs of collecting so much data are so high that they skew the results, and may damage the organization's culture.

Relax. Your choice of metrics is not a once-and-for-all event.

  • You should not begin with all the metrics you might need. Start with one or two.
  • Drop a metric when it outlives its usefulness.
  • Don't accumulate a huge set of metrics, lest you frustrate your teams.
  • Consider giving each metric an expiration date on which to evaluate whether you will continue to use it.

Can the measurements we choose help to make people awesome?

At any level of the organization, metrics are best used to help people understand and improve their own work. They are less useful at judging whether other people are doing their jobs, especially when we are measuring people two levels or more above or below us in the organization (AKA: The Law Of The Second Floor).

Measurements can, when used properly, help teams to choose and enact small changes which accumulate into significant improvements over time.

Measuring output quantity tends to prevent improvement. We can feel good about writing dozens of lines of code and hundreds of tests and increasing test coverage by 20%, and forget that our work has not yet improved anyone's life. We can obsess over squeezing out a few percentage points more code, and neglect quality and user value in doing so.

We don't want to make people awesome at measuring, we want to measure how we are making people awesome.

With the warnings out of the way, let's progress to some measurements.

Slack Time:

Without slack time, a team will keep avoiding or working around the same old problems. They will not have a productive breakthrough.

Teams need to spend time focused on improving their process, tools, knowledge, and interactions.

Providing slack time supports our Modern Agile values of Make People Awesome and Experiment and Learn Rapidly.

According to Tom DeMarco:

“It’s possible to make an organization more efficient without making it better. That’s what happens when you drive out slack.”

Does the team invest time toward improving their work system?

There are at least two reasonable ways to collect information about a team's use of slack time:

  • Anonymously survey how much time was spent this week on learning and fixing skill/process problems.
  • Enumerate changes the team has decided to enact and the results of those changes.

For teams using an online tracking tool, it may be possible to create a separate horizontal swim lane for improvements and track them as normal tasks.

Support and permission are crucial in order for teams to feel free to make changes, in line with our Modern Agile value of Make Safety A Prerequisite.

Without the safety of management expecting and enabling slack time, developers and testers will often feel guilty for time not spent directly on producing product.

Management must provide an "umbrella of permission" for their team.

Speed:

Speed measurements can be misused, which is why people are wary of velocity and story points.

Paradoxically, setting targets on development speed often results in slower delivery times due to harmful local optimizations.

Still, there are operations which can benefit from the use of a stopwatch. Measures of machine time, queueing, or waiting can be particularly useful.

  • How long does it take a new idea to go from a green light to positive user feedback?
  • How long does it take a build to complete?
  • How long does it take to recover from an unexpected delay or production "event?"
  • How long does it take to run the tests?
  • How long does it take (on average) to get an answer from a subject matter expert?
  • How much time does the team spend waiting on vendors?
  • How long do finished and approved features sit waiting for a release?
  • How long does a branch live before being merged to the main code line?

Rather than measure everything (see above warning), it might be best to measure green-light-to-completion, and then also measure the wait times that seem to cause the team the most stress.

You will find that most testing and most CI tools will happily report their elapsed time for you, either by default or with a simple tweak to the settings.

Timing human interactions may require either a physical or online ticket system. It is not hard to collect those timings, but it requires additional effort. Teams without slack time tend to be too busy to measure their work.
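As a rough illustration of what to do with such timings once they are collected, here is a minimal sketch in Python. It assumes each ticket has been exported as a pair of timestamps (when the question was asked, when it was answered). The function names `wait_hours` and `summarize_waits` are made up for this example and do not come from any particular ticket system.

```python
from datetime import datetime
from statistics import mean, median


def wait_hours(tickets):
    """Return the elapsed hours for each (asked_at, answered_at) pair."""
    return [(answered - asked).total_seconds() / 3600 for asked, answered in tickets]


def summarize_waits(tickets):
    """Summarize wait times; the median is less distorted by one very slow ticket."""
    hours = wait_hours(tickets)
    return {
        "count": len(hours),
        "mean_hours": mean(hours),
        "median_hours": median(hours),
    }


# Hypothetical sample data: three questions put to a subject matter expert.
sample_tickets = [
    (datetime(2016, 11, 1, 9, 0), datetime(2016, 11, 1, 11, 0)),   # 2 hours
    (datetime(2016, 11, 2, 9, 0), datetime(2016, 11, 2, 13, 0)),   # 4 hours
    (datetime(2016, 11, 3, 9, 0), datetime(2016, 11, 4, 9, 0)),    # 24 hours
]
summary = summarize_waits(sample_tickets)
```

Reporting both the mean and the median is a deliberate choice: a single day-long wait pulls the mean up sharply, while the median shows what a typical wait looks like.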

Measuring elapsed time gives us a baseline on which to Experiment And Learn Rapidly.

Condition of Most-Edited Files

We need to know if the way we manage our code is effective. When we add new features and capabilities, do we improve the design, or do we pollute the design?

Are we perpetually degrading the quality and readability of our code?

You probably don't need to examine any file whose last change was more than 6 months ago. Odds are that code isn't currently accruing new changes, and neither is it the site of a lot of bug fixes. The most important code is the code we visit most.

It is easy with most version control systems to find the most-edited files of the past few months. You can approximate this by counting commits or merges to the main line, or by summing lines-of-code changes.
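For teams using git, one way to approximate this is to count how many commits touched each file. The sketch below is only an illustration: it assumes git is installed and on the PATH, and the function names `count_edits` and `most_edited` are invented for this example.

```python
import subprocess
from collections import Counter


def count_edits(log_text):
    """Count path occurrences in `git log --name-only --pretty=format:` output.

    With that format, the log contains only file paths separated by blank
    lines, so each non-blank line is one commit touching one file.
    """
    counts = Counter()
    for line in log_text.splitlines():
        path = line.strip()
        if path:
            counts[path] += 1
    return counts


def most_edited(repo_dir=".", since="6 months ago", top=10):
    """Return the `top` most frequently committed files since the given date."""
    result = subprocess.run(
        ["git", "log", f"--since={since}", "--name-only", "--pretty=format:"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return count_edits(result.stdout).most_common(top)


# Example call (run inside a git working copy):
#   for path, commits in most_edited(since="6 months ago", top=10):
#       print(commits, path)
```

Counting commits treats a one-line fix and a large rewrite the same. If that matters to you, summing changed lines (for instance from `git log --numstat`) is the other approximation mentioned above.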

You may choose to do a spot-check, and have developers compare the code from 6 months ago to the same files in their current shape. They can usually let you know if the code is more polluted or less polluted now. Is it pleasant to change? Does it make sense?

Such a qualitative review is useful. You may convert it to a numeric qualitative measure if you have each reviewer individually give a numeric rating, and average those ratings. When you see the trend go from 3.7 to 4.2 you will know that either the code is getting better, or the reviews are becoming less strict.

Alternatively, you can run one of the many code quality tools to determine objectively if the code measures up better or worse now. One useful metric is the so-called Change Risk Anti-Patterns (C.R.A.P.) metric, which combines the complexity of the code with its test coverage. Untested, complex code has a high C.R.A.P. rating, whereas well-tested or simple code will have a very low C.R.A.P. rating.
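To make the relationship concrete, here is a small sketch of the formula as it is commonly published for a single method: complexity squared, times the uncovered fraction cubed, plus complexity. This is an illustration, not a replacement for a real tool. The function name `crap_score` is made up for this example, and your tool of choice may compute or round the number differently.

```python
def crap_score(complexity, coverage_pct):
    """Compute a C.R.A.P. score for one method.

    complexity   -- cyclomatic complexity of the method (1 or more)
    coverage_pct -- test coverage of the method, as a percentage (0 to 100)
    """
    if complexity < 1:
        raise ValueError("cyclomatic complexity must be at least 1")
    if not 0 <= coverage_pct <= 100:
        raise ValueError("coverage must be between 0 and 100")
    uncovered = 1.0 - coverage_pct / 100.0
    return complexity ** 2 * uncovered ** 3 + complexity


# A complex method (complexity 10) at three levels of coverage:
untested = crap_score(10, 0)       # 110.0
half_tested = crap_score(10, 50)   # 22.5
fully_tested = crap_score(10, 100) # 10.0
```

Note how the score of a fully tested method falls back to its complexity alone, while simple code stays low even with no tests at all.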

The trend is the crucial thing.

Having a C.R.A.P. number of N is not all that informative. It is more important to know whether we are trending upward or downward: are we becoming better craftspeople, or are we trading away the quality of our code for expediency (which will eventually force us to work much more slowly)?

Keeping the code clean provides safety and makes our developers more awesome by enabling continuous delivery. Poorly managed code makes delivery hazardous and painful.

Note that there is a flaw in measuring only the most-recently changed files. Sometimes there is a file that needs to be improved, but it is such a horrible mess that the entire programming staff avoids touching it for fear of breaking it. Any "officially avoided" code is a great candidate for focused microtesting and refactoring work.

Escaped Defects

Escaped Defects are defects which have made it through the process without being noticed, and have been uncovered in the field. Defects found within our walls are less interesting because they don't inconvenience our users; they merely let us know that our internal systems could be better.

Escaped defects show us where we lack safety in our testing pipeline.

Escaped defects cost us time in customer support, in management, in reputation, and in tracking and prioritizing. This is why it's better to fix them all than to carry them for weeks or months.

The count and trend for escaped defects are both important.
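A minimal sketch of tracking both, assuming each escaped defect is tagged with the release in which it surfaced. The function name `escaped_defect_trend` and the release labels are hypothetical.

```python
from collections import Counter


def escaped_defect_trend(defect_releases, release_order):
    """Return per-release counts and the change from each release to the next.

    defect_releases -- one release label per escaped defect
    release_order   -- all release labels, oldest first
    """
    counts = Counter(defect_releases)
    per_release = [(release, counts.get(release, 0)) for release in release_order]
    changes = [
        later[1] - earlier[1]
        for earlier, later in zip(per_release, per_release[1:])
    ]
    return per_release, changes


# Hypothetical data: six escaped defects across three releases.
reported = ["1.0", "1.0", "1.0", "1.1", "1.1", "1.2"]
per_release, changes = escaped_defect_trend(reported, ["1.0", "1.1", "1.2"])
# per_release is [("1.0", 3), ("1.1", 2), ("1.2", 1)]; changes is [-1, -1]
```

Negative changes mean fewer defects are escaping with each release, which is the direction we want.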

Even more important is the learning that comes from root-cause analysis. Rapid ad-hoc fixing of bugs will reduce the count of open defects, but it does nothing to stop new ones from escaping, so the trend stays high. Truly learning the root cause will change the way we produce software, making it safer to release code to customers.

Who is Happier?

Any internal metrics are going to tell us how we are doing the work, but the real question is how our work changes the world.

Is our work helping to Make People Awesome?

One measure is the Net Promoter Score, which answers the question "how likely are our users to recommend our product or service?" The Net Promoter Score is more important than the number of defects or the time to release or the internal improvements made over the past few weeks.
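For readers who have not computed one before, the usual calculation asks respondents for a rating from 0 to 10, counts 9s and 10s as promoters and 0 through 6 as detractors, and subtracts the percentage of detractors from the percentage of promoters. A minimal sketch, with an invented function name and sample data:

```python
def net_promoter_score(ratings):
    """Compute a Net Promoter Score from 0-10 'would you recommend us?' ratings.

    Promoters rate 9 or 10, detractors rate 0 through 6, and 7s and 8s
    (passives) count only toward the total. The result ranges from -100 to 100.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100.0 * (promoters - detractors) / len(ratings)


# Hypothetical survey: 5 promoters, 2 passives, 3 detractors out of 10.
survey = [10, 9, 8, 7, 6, 3, 10, 9, 5, 10]
score = net_promoter_score(survey)  # 20.0
```

As with the other measures here, a single score says little. The movement from one release to the next is what tells you whether people are becoming happier.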

Likewise, how does our work affect customer service and technical support? How easy is it to manage our software in operations? If we have a separate testing or certification group (for FDA or Security) then how easily can they do setup, test, and teardown?

How about our salespeople? What could we do that makes the system easier to sell, demo, or explain?

If you use chartering for your projects, you will have a roster of your project community. For all the roles identified in that community, how do they see your product? How is it trending? Is it better every release than the release before? Is it becoming ever more useful?

Whether you use Planguage to set measurable goals, or use user surveys and feedback mechanisms, you will be interested in how profitable or helpful people find the use of your software. Otherwise, why bother writing it?

Summing Up

We have discussed a few kinds of metrics that we can use to gauge the effectiveness of a team at pleasing the project community, improving their own processes, and managing their source code.

  • Slack Time
  • Speed of Processes
  • Condition of Most-Edited Files
  • Escaped Defects
  • Net Promoters

By watching the trends on some of these measures, we feel confident you will be able to identify and implement meaningful changes.

Please join us in the comments with any stories supporting or dissenting, so that we may learn and improve rapidly!

Special thanks to Bill Wake for his help pulling this list of metrics together, and to the crew at Industrial Logic and the Modern Agile slack group for helping with suggestions and revisions.