Why to avoid Mean Time to Recover (MTTR)

The 2022 Void Report came out in late 2022. It is a recommended read, and I previously summarized it here. This article focuses on one aspect of the report: why mean time to recover (MTTR) is not an appropriate metric for complex software systems.

The takeaways are:

  • Do track time to recover (TTR) for each incident. It can be a useful exercise to think about when an incident started and stopped. That can help when calculating the cost of an incident.
  • Don’t report those times in aggregate, such as MTTR. Systems fail in non-uniform ways and averaging numbers to represent their reliability (or the performance of the supporting teams) is likely to be misleading.
  • Instead, use:
    • Post-incident learning reviews to learn (and share!) everything you can from an incident
    • SLOs to help align technical system metrics with business objectives
    • Sociotechnical incident data
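
To make the point about averages concrete, here is a minimal sketch in Python. The incident durations are invented for illustration and do not come from the Void Report; the only assumption is that most incidents are short and one is very long, which is the kind of skew that makes a mean unrepresentative.

```python
from statistics import mean, median

# Hypothetical time-to-recover values, in minutes, for eight incidents.
# These numbers are made up for illustration; they are not real data.
ttr_minutes = [12, 15, 18, 20, 22, 25, 30, 720]

# The aggregate that usually gets reported.
mttr = mean(ttr_minutes)

# The middle of the distribution, for comparison.
median_ttr = median(ttr_minutes)

# Fraction of incidents that recovered faster than the reported mean.
share_below_mean = sum(t < mttr for t in ttr_minutes) / len(ttr_minutes)

print(f"MTTR (mean): {mttr:.2f} minutes")
print(f"Median TTR:  {median_ttr:.2f} minutes")
print(f"Share of incidents faster than the mean: {share_below_mean:.1%}")
```

In this made-up data set a single long outage drags the mean far above what almost every incident looked like, so the MTTR describes neither the typical incident nor the exceptional one. That is why the individual TTR values are worth keeping while the average is not worth reporting.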



Book Summary: Accelerate

Accelerate: Building and Scaling High Performing Technology Organizations is a book by Nicole Forsgren, Jez Humble and Gene Kim. It is a follow-on from the State of DevOps Reports that Forsgren and Humble used to publish (and which I wrote about before in Development and delivery practices for team success). I highly recommend buying the book, but here are chapter summaries covering the highlights.

 



Post Production Debugging

Monitoring and Observing Your App Post Release

Pre-release tests are essential, but the ability to debug, monitor and observe your application suite post-release is what allows you to detect, and quickly fix, the production problems that will inevitably arise.

