
 

Measuring Developer Productivity

Most metrics for measuring developer productivity, such as lines of code or issues closed, are notoriously ineffective. But the research in the excellent State of Devops report shows that, rather than focusing on local metrics and individual developer performance, it is better to look at overall development and delivery practices. Specifically, there are metrics that predict and reflect a team’s ability to successfully deliver working software into production, including deployment frequency and the mean time to restore service after an incident. This article discusses why some metrics are useless, and takes a closer look at the recommendations in the 2019 State of Devops report.

 

The futility of measuring developer productivity

Can you measure developer productivity? And do you want to?

Up until recently, I would have said no, you can’t, and there are a lot of articles out there saying that you can’t and shouldn’t.

Martin Fowler wrote an (old but still worth reading) article on why we cannot measure the productivity of software development in general. “This is somewhere I think we have to admit to our ignorance.”

This “Do not measure developers” article discusses the futility of some of these metrics: “Metrics are subjective and informational. You can’t make any judgement on individual performances based on metrics.” While this “The myth of developer productivity” article states that “There still doesn’t exist a reliable, objective metric of developer productivity. I posit that this problem is unsolved, and will likely remain unsolved.”

Misleading measures

Here are some of the metrics we could track as a proxy for productivity, and some reasons why we shouldn’t.

  • Hours worked
  • Lines of Code
  • Number of (GitHub) Pull Requests
  • Number of (Jira) tickets raised
  • Number of (Jira) bugs closed
  • Code coverage

Hours worked

Any of these metrics can easily be shown to be a poor measure of productivity, but hours worked may be the most obvious. Hours in the office != meaningful work done. It is all too easy to show up and do nothing useful, a notion sometimes called presenteeism (hbr.org), defined as “on the job but not fully functioning”.

Lines of code

Lines of code is notoriously useless as a measure of productivity. As Fowler mentioned in his aforementioned Cannot Measure Productivity, “Any good developer knows that they can code the same stuff with huge variations in lines of code, furthermore code that’s well designed and factored will be shorter because it eliminates the duplication.”

Issues raised or closed

The number of (Jira) tickets raised and closed can all too easily be gamed, as the obligatory Dilbert comic illustrates.

Even if not gamed, different engineers have different ideas on the necessary granularity of tickets.

Code coverage

Code coverage is one of the metrics that I do think has some merit, but even it is of limited use. A value of 0% speaks volumes (you have zero unit tests), but a high value may tell you very little. I once heard of a large software consultancy mandating a goal of 100% test coverage. The engineers promptly achieved this (they had to, after all), but did so by creating “tests” with no assertions. The target was achieved, but with zero real-world benefit. The number of PRs (or raw commits) is another metric that has some value, but it too can be easily gamed, and different engineers have different styles.
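To make that story concrete, here is a minimal sketch (the function and tests are hypothetical): line coverage counts execution, not verification, so an assertion-free “test” inflates the metric without checking anything.

```python
def apply_discount(price: float, rate: float) -> float:
    # Hypothetical production code under test.
    if rate < 0 or rate > 1:
        raise ValueError("rate must be between 0 and 1")
    return price * (1 - rate)

def test_apply_discount_real():
    # A genuine test: executes the code AND verifies the result.
    assert apply_discount(100.0, 0.2) == 80.0

def test_apply_discount_gamed():
    # Games the coverage metric: every line of apply_discount runs,
    # so coverage tools report it as covered, yet a regression
    # (say, returning price * rate) would sail through unnoticed.
    apply_discount(100.0, 0.2)
```

Both tests produce identical coverage numbers; only the first one can ever fail.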

Common problems with measures

And that brings me to a law I find myself quoting a lot when talking about ways to measure Developer Productivity: Goodhart’s law

“When a measure becomes a target, it ceases to be a good measure”

As soon as you focus on a metric, any team can game the system. Intentionally, inadvertently, for fun, or simply because they think that is what you want them to do. You did ask them to focus on these metrics after all.

And of course, focusing on any of these metrics means you may ignore some of the more important but intangible aspects of team performance. What if an engineer personally contributes nothing to any of the metrics listed above, but spends the whole sprint helping other engineers be more productive, or building long-term efficiency and infrastructure improvements? Does that not count for anything?

To quote the State of Devops 2019 report, a common pitfall of software metrics is that they “pit different functions against each other and result in local optimizations at the cost of overall outcomes.”

“Productivity cannot be captured with a simple metric such as lines of code, story points, or bugs closed; doing so results in unintended consequences that sacrifice the overall goals of the team. For example, teams may refuse to help others because it would negatively impact their velocity, even if their help is important to achieve organizational goals.”

These common metrics make the mistake of focussing on individual and local measurements, rather than team or global ones. So, if these output measurements are not the way to measure performance, what is?

 

 

State of Devops Report

Nicole Forsgren, Jez Humble, and Gene Kim wrote a great book called Accelerate: Building and Scaling High Performing Technology Organizations. It talks about the contributing factors in high-performing teams and organizations, using the data from the annual State of DevOps survey and reports. And if you are pushed for time, jumping straight to the latest 2019 report is a good place to start. (I previously posted an abridged version of the report here.)

The basic question that the State of Devops report tries to answer is: what makes some organizations highly performant?

The report discusses “metrics that provide a high-level systems view of software delivery and performance and predict an organization’s ability to achieve its goals.” These metrics aren’t so much about measuring developer productivity, but instead can be used by an organization or team to “identify the specific capabilities they can use to improve their software delivery performance and ultimately become an elite performer”.

To paraphrase: Rather than focusing on local metrics that try to measure individual developer performance, look at the overall development and delivery practices and focus on metrics that predict and reflect a team’s ability to successfully deliver working software into production.

Five key metrics

So what are the metrics? The researchers first determined four key metrics that differentiate between low, medium, and high performers. The first two, lead time and deployment frequency, relate to throughput. The other two, time to restore and change failure rate, relate to stability.

(Figure: the four key metrics, from the 2019 State of Devops report.)

And one key point to note is that the research reports that speed and stability are possible together. You don’t need to sacrifice one for the other; you don’t need to slow down in order to be stable.

Deployment frequency

AKA release frequency. How often do you release new software to production?

Side note: it is not clear to me whether this metric should include “bug fix” releases. My current thinking is that it should, mainly to simplify things. Determining whether a release is strictly a bug fix release gets tricky, and if you are constantly shipping releases to fix bugs, that will be reflected in the change fail percentage.
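For illustration, deployment frequency is simply a count of releases normalized over an observation window. The helper below is my own sketch, not something defined in the report:

```python
from datetime import date

def deployments_per_week(deploy_dates: list[date]) -> float:
    # Deployments per week over the observed window (inclusive).
    if not deploy_dates:
        return 0.0
    span_days = (max(deploy_dates) - min(deploy_dates)).days + 1
    return len(deploy_dates) / (span_days / 7)

deploys = [date(2020, 1, 6), date(2020, 1, 8),
           date(2020, 1, 10), date(2020, 1, 13)]
print(round(deployments_per_week(deploys), 2))  # 3.5
```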

Lead Time

Delivery lead time is defined as the time from commit to deploy. Or, to put it another way, the time from version control to production.
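As a sketch (my own illustration, not from the report), lead time for each change is just the delta between its commit and deploy timestamps, often summarized with a median to resist outliers:

```python
from datetime import datetime, timedelta
from statistics import median

def lead_times(changes: list[tuple[datetime, datetime]]) -> list[timedelta]:
    # Commit-to-deploy delta for each (commit_ts, deploy_ts) pair.
    return [deploy - commit for commit, deploy in changes]

changes = [
    (datetime(2020, 1, 1, 9), datetime(2020, 1, 1, 17)),  # 8 hours
    (datetime(2020, 1, 2, 9), datetime(2020, 1, 3, 9)),   # 24 hours
    (datetime(2020, 1, 3, 9), datetime(2020, 1, 3, 13)),  # 4 hours
]
print(median(lead_times(changes)))  # 8:00:00
```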

Time to restore service

AKA mean time to restore (MTTR). I have written about MTTR before here.

The report found that high performers had a mean time to recover from downtime that was 170 times faster. If you only fail once a year but that failure has a massive blast radius, and it takes you weeks to recover, your customers will notice.
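A minimal sketch of computing MTTR from incident records (the detected/restored timestamps are invented for illustration):

```python
from datetime import datetime, timedelta

def mean_time_to_restore(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    # Mean of (restored - detected) across all incidents.
    total = sum((restored - detected for detected, restored in incidents),
                timedelta())
    return total / len(incidents)

incidents = [
    (datetime(2020, 3, 1, 10, 0), datetime(2020, 3, 1, 10, 30)),  # 30 min
    (datetime(2020, 3, 5, 14, 0), datetime(2020, 3, 5, 15, 30)),  # 90 min
]
print(mean_time_to_restore(incidents))  # 1:00:00
```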

Change fail percentage

What percentage of the changes you push to production result in any kind of degradation (e.g., lead to service impairment or service outage) and subsequently require remediation (e.g., require a hotfix, rollback, fix forward, patch)? Low performers tend to be in the range of 46-60% of releases having problems. Elite performers are a much more attractive 0-15% range.
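The arithmetic here is simple; a sketch, with the numbers invented for illustration:

```python
def change_fail_percentage(deploys: int, failed: int) -> float:
    # Share of production deployments that required remediation.
    if deploys == 0:
        raise ValueError("no deployments recorded")
    return 100.0 * failed / deploys

pct = change_fail_percentage(deploys=40, failed=3)
print(f"{pct:.1f}%")  # 7.5% -- inside the elite 0-15% range
```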

On a side note, although it is not discussed in the report, I suspect there is a strong link with how many changes you ship in a single release. If you release once per year, that release likely contains many changes: many lines of code updated across many modules. It seems very likely that such a release will have at least one problem that results in some kind of degradation, so you would veer towards all releases having problems (even if most of the changes released were in fact good and healthy). Conversely, if you do frequent releases, each with a small number of changes, each individual release stands a much smaller chance of causing a degradation, simply because you are changing far less.
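That intuition can be made concrete with a toy model of my own (not from the report): if each individual change independently fails with probability p, a release bundling n changes goes bad with probability 1 - (1 - p)^n.

```python
def release_fail_probability(p_change: float, n_changes: int) -> float:
    # P(at least one bad change), assuming independent failures.
    return 1 - (1 - p_change) ** n_changes

# Same 1% per-change failure rate; very different release outcomes.
print(f"{release_fail_probability(0.01, 1):.1%}")    # 1.0%
print(f"{release_fail_probability(0.01, 10):.1%}")   # 9.6%
print(f"{release_fail_probability(0.01, 500):.1%}")  # 99.3%
```

In other words, a yearly big-bang release is almost guaranteed to contain at least one bad change, even when nearly all of its changes are healthy.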

Availability

Finally, in the later versions of the report (since 2018, I believe), an additional metric is identified as being important for operational performance: Availability.

The report defines availability as the “ability for technology teams and organizations to keep promises and assertions about the software they are operating. Notably, availability is about ensuring a product or service is available to and can be accessed by your end users. Availability reflects how well teams define their availability targets, track their current availability, and learn from any outages, making sure their feedback loops are complete.”

I have discussed availability metrics in a previous blog post on SLAs, SLOs and SLIs, but the definitive resource remains the SRE book.
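As a rough sketch of the arithmetic behind availability targets (the error-budget framing comes from the SRE book; the numbers here are invented):

```python
def availability(total_minutes: float, downtime_minutes: float) -> float:
    # Percentage of time the service was up over the period.
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes

def error_budget_minutes(slo_percent: float, total_minutes: float) -> float:
    # Downtime allowed by an SLO over the period.
    return total_minutes * (100.0 - slo_percent) / 100.0

month = 30 * 24 * 60  # a 30-day month, in minutes
print(round(error_budget_minutes(99.9, month), 1))  # 43.2
print(round(availability(month, 20), 3))            # 99.954
```

So a “three nines” (99.9%) SLO allows roughly 43 minutes of downtime per month; 20 minutes of outage leaves you comfortably within budget.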

Other metrics

Note that as well as identifying these 5 key metrics, the report also identifies the capabilities that drive improvement in those metrics, including technical practices (such as change approval processes), cloud adoption, disaster recovery testing, and culture (particularly around psychological safety).

For me personally, it was the section on disaster recovery testing that struck a chord most. In particular, the advice to test in production (I have previously written and talked about Testing in Production), the mention of Chaos Engineering, and emphasis on learning from any failures in production by carrying out a post mortem (aka Root Cause Analysis).

So, I would like to include two additional metrics worth measuring, namely:

Chaos Engineering

Chaos Engineering is defined as

  • “Thoughtful planned experiments designed to reveal the weaknesses in our systems” – Kolton Andrus, Gremlin CEO
  • “Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production” – Principles of Chaos Engineering

I have blogged about it before. I would expect that:

  • “Elite” performers would be routinely and randomly terminating production virtual machine instances and containers, perhaps using tools like Chaos Monkey.
  • “Low” performers are likely not doing any Chaos Engineering at all, and likely little production testing, other than perhaps some basic smoke tests after a production deployment.
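As a toy illustration of the “elite” expectation above (everything here is hypothetical and in-process; real chaos tooling such as Chaos Monkey terminates actual infrastructure):

```python
import random

class Service:
    # Toy redundant service: healthy while any instance survives.
    def __init__(self, instances: int):
        self.alive = set(range(instances))

    def handle_request(self) -> str:
        return "ok" if self.alive else "outage"

def chaos_experiment(service: Service, kills: int, seed: int = 42) -> str:
    # Terminate random instances, then probe the service.
    rng = random.Random(seed)
    for _ in range(kills):
        if service.alive:
            service.alive.remove(rng.choice(sorted(service.alive)))
    return service.handle_request()

# With 3 replicas, losing 1 instance should leave the service healthy.
print(chaos_experiment(Service(instances=3), kills=1))  # ok
```

The point of the real-world practice is the same as this toy: deliberately inject failure with a bounded blast radius, and verify the system still serves requests.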

The interesting thing is that this isn’t exactly what the data shows.

I find it interesting that low performers are more likely to perform simulations that disrupt production; I would have expected elite performers to do more Chaos Engineering-style testing. The only theory I can think of is that perhaps low performers test in production without proper rigor, and without defining suitable blast radiuses.

Still, the report makes it clear that not performing tests in production would be a concern.

Production Failure Learning

Failure happens. This is a foregone conclusion when working with complex systems; a case of when, not if. (See the Blameless PostMortems post by John Allspaw, and Postmortem Culture from the SRE Book.)

From the 2019 State of Devops report, “blameless post-mortems contributes to a learning culture and an organizational culture that optimizes for software and organizational performance outcomes”.

While the State of Devops report does not provide detailed information on the criteria that might differentiate low from elite performers, I suspect that

  • “Elite” performers perform blameless post-mortems for every production outage, and indeed for near misses too, creating remediation action items based on the learnings. And, importantly, following up on those action items.
  • “Low” performers either don’t conduct post-mortems, and hence don’t learn from their outages, or they conduct them in a blameful way that may punish engineers, reduce trust, and increase the likelihood that future incidents will be covered up.

But like I said, this is just conjecture on my part!
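One way to make that follow-up visible is to track it explicitly. A minimal sketch as a data structure (the fields are my own assumption, not a format from the report):

```python
from dataclasses import dataclass, field

@dataclass
class PostMortem:
    incident: str
    blameless: bool = True
    # Each action item is a (description, completed) pair.
    action_items: list[tuple[str, bool]] = field(default_factory=list)

    def follow_up_rate(self) -> float:
        # Fraction of remediation items actually completed.
        if not self.action_items:
            return 0.0
        done = sum(1 for _, completed in self.action_items if completed)
        return done / len(self.action_items)

pm = PostMortem("2020-03-01 checkout outage")
pm.action_items = [("add DB failover alert", True),
                   ("load-test failover path", False)]
print(pm.follow_up_rate())  # 0.5
```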

One thing the analysis in the report does show, however, is that “organizations that create and implement action items based on what they learn from disaster recovery exercises are 1.4 times more likely to be in the elite performing group.”

Conclusion

Why is all of this important? The State of Devops report states that high performing teams are twice as likely to exceed their commercial goals, i.e. twice as likely to exceed profitability, productivity, and market share goals. See this related video, which is well worth watching.

If you want to measure productivity, don’t try to measure individual developer productivity using demonstrably misleading metrics such as lines of code written or bugs closed. Such metrics can result in local improvements at the cost of better team and org outcomes. They can also be very easily gamed.

Instead, the report suggests focussing on overall development and delivery practices; on metrics that predict and reflect a team’s ability to successfully deliver working software into production. Those metrics are:

1. Deployment frequency
2. Lead time
3. Time to restore service
4. Change fail percentage
5. Availability

In addition, there is benefit to tracking

6. Chaos engineering efforts
7. Production failure learning

 

 

References and Further Reading

Books

  • Accelerate: Building and Scaling High Performing Technology Organizations


