Why to avoid Mean Time to Recover (MTTR)

The 2022 Void Report came out in late 2022. It is a recommended read, and I previously summarized it here. This article focuses on one aspect of the report: why mean time to recover (MTTR) is not an appropriate metric for complex software systems.

The takeaways are:

  • Do track time to recover (TTR) for each incident. It can be a useful exercise to think about when an incident started and stopped. That can help when calculating the cost of an incident.
  • Don’t report those times in aggregate, such as MTTR. Systems fail in non-uniform ways and averaging numbers to represent their reliability (or the performance of the supporting teams) is likely to be misleading.
  • Instead, use:
    • Post-incident learning reviews to learn (and share!) everything you can from an incident
    • SLOs to help align technical system metrics with business objectives
    • Sociotechnical incident data
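
To illustrate why an average can mislead, here is a minimal sketch in Python. The incident durations are made-up numbers chosen for illustration, not data from the Void Report: six short incidents and one long outage.

```python
from statistics import mean, median

# Hypothetical time-to-recover values in minutes, one per incident.
# These numbers are invented for illustration only.
ttr_minutes = [12, 15, 9, 20, 14, 11, 480]

mttr = mean(ttr_minutes)        # the aggregate the article warns against
typical = median(ttr_minutes)   # what a "typical" incident looked like
longest = max(ttr_minutes)      # the single outage that dominates the mean
below_mean = sum(1 for t in ttr_minutes if t < mttr)

print(f"MTTR (mean): {mttr:.1f} min")
print(f"Median TTR:  {typical} min")
print(f"Longest TTR: {longest} min")
print(f"Incidents shorter than the MTTR: {below_mean} of {len(ttr_minutes)}")
```

With this made-up data, one long outage pulls the mean far above what most incidents looked like, so the single MTTR figure describes neither the typical incident nor the worst one. That is the sense in which per-incident TTR can be worth recording while the aggregate is not.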

(more…)


How much is your slow lead time costing you?

In a previous blog post, I discussed slow build times and estimated the associated costs. However, the build process is only one part of getting software out the door.

Lead time is the time it takes to go from code committed to code successfully running in production. This includes the build time we covered in the previous blog post, as well as everything else required to get your code into users' hands, such as testing and deployment. This article focuses on the costs of that lead time.

Using the example of a team of 10 engineers, I estimate that the costs of a slow (one week) lead time could be the approximate equivalent of more than 3 engineers, or $400,000 per year. And I think it’s entirely possible that is on the low side since there are other costs that are just difficult to estimate. Imagine how much more you could achieve with 3+ extra engineers on the team.

Charity Majors goes further (discussed below) and suggests that reducing the lead time to hours could save the cost of 5 engineers on such a team. I was initially skeptical of that claim, but after trying out these estimates, I think she may well be more accurate than my possibly over-conservative math.
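
As a rough check on the scale of these figures, the sketch below restates them as shares of team capacity. It uses only the numbers quoted above (a team of 10, about 3 engineers or $400,000 per year, and the 5 engineers from Charity Majors' estimate). The implied cost per engineer is simply derived by division and is an assumption of this sketch, not a figure stated in the article.

```python
# Figures quoted in the text above.
team_size = 10
engineers_lost_estimate = 3        # "more than 3 engineers"
annual_cost_estimate = 400_000     # dollars per year
engineers_saved_majors = 5         # Charity Majors' suggestion

# Share of the team's capacity each estimate represents.
share_estimate = engineers_lost_estimate / team_size
share_majors = engineers_saved_majors / team_size

# Derived, not stated in the article: the per-engineer cost implied by
# treating $400,000 as the cost of exactly 3 engineers.
implied_cost_per_engineer = annual_cost_estimate / engineers_lost_estimate

print(f"Author's estimate: {share_estimate:.0%} of team capacity")
print(f"Majors' estimate:  {share_majors:.0%} of team capacity")
print(f"Implied cost per engineer: ${implied_cost_per_engineer:,.0f} per year")
```

Seen this way, the two estimates differ between roughly a third and a half of the team's capacity, which is why either one makes lead time worth measuring.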

Thanks

A big thank you to my former colleagues Dave Taubler, Abhijit Karpe, Josh Outwater and Steve Mauro for providing feedback and input on this article.

Most of the feedback took issue with some aspect of the estimates, which is fair, but the common theme seemed to be that everyone agreed that there is a very real cost to slow lead times, that it is high, and that using data where you can and estimates where needed is a good way to surface and highlight that cost.

 

(more…)


Book Summary: Accelerate

Accelerate: Building and Scaling High Performing Technology Organizations is a book by Nicole Forsgren, Jez Humble and Gene Kim. It is a follow-on from the State of DevOps Reports that Forsgren and Humble used to publish (and which I wrote about before in Development and delivery practices for team success). I highly recommend buying the book, but here are some chapter summaries covering the highlights.

 

(more…)


Development and delivery practices for team success

Most metrics for measuring developer productivity, such as lines of code or issues closed, are notoriously ineffective. But the research in the excellent State of DevOps report shows that, rather than focusing on local metrics and individual developer performance, it is better to look at overall development and delivery practices. Specifically, there are metrics that predict and reflect a team’s ability to successfully deliver working software into production, including deployment frequency and the mean time to restore service after an incident. This article discusses why some metrics are useless, and takes a closer look at the recommendations in the 2019 State of DevOps report.

(more…)


Report Summary: Accelerate State of DevOps 2019

This is an abridged version of The Accelerate State of DevOps Report 2019; essentially a cut and paste of the most salient parts. The original is about 18,000 words; this is about 2,500 words.

I highly recommend reading the original in its entirety, if you have time, and I’m a big fan of the Accelerate book too. As with all the other summaries I create, this is just a way to help me digest and understand an excellent article.

(more…)
