Why to avoid Mean Time to Recover (MTTR)

The 2022 VOID Report came out in late 2022. It is a recommended read, and I previously summarized it here. This article focuses on one aspect of the report: why mean time to recover (MTTR) is not an appropriate metric for complex software systems.

The takeaways are:

  • Do track time to recover (TTR) for each incident. It can be a useful exercise to think about when an incident started and stopped. That can help when calculating the cost of an incident.
  • Don’t report those times in aggregate, such as MTTR. Systems fail in non-uniform ways and averaging numbers to represent their reliability (or the performance of the supporting teams) is likely to be misleading.
  • Instead, use:
    • Post-incident learning reviews to learn (and share!) everything you can from an incident
    • SLOs to help align technical system metrics with business objectives
    • Sociotechnical incident data to see how your organization actually responds

Why not to use MTTR – in plain English

Mean Time To Recovery (or restore/resolve/repair) is a measure of the time it takes to recover from a production outage or system failure. It is essentially an average of outage durations.

But duration data is inherently subjective, and can be:

  • Fuzzy on both ends (start/stop)

  • Sometimes automated, often not

  • Sometimes updated, sometimes not

For example, when does any given incident end?

  • When the problematic code is reverted

  • When a temporary fix is made that alleviates the issue

  • When a “permanent” change is made

  • When metrics start reflecting the issue is over

  • When a customer reports the issue is over

etc.

But even if you can agree on a clear definition of start/stop, duration itself does not necessarily convey how reliable or effective a team is at responding to issues. For example, if you have 3 issues one month that were all relatively easy to fix, and 3 issues the next month that were inherently more complex and took longer to fix, is your team really getting worse? It is entirely possible they’re improving and even handled the complex issues better, but MTTR doesn’t reflect that.
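To make that concrete, here is a minimal sketch with made-up durations (a hypothetical example, not data from the report):

```python
# Hypothetical incident durations in minutes (illustrative numbers only).
month_one = [20, 25, 30]     # three simple incidents, quick fixes
month_two = [60, 90, 120]    # three inherently harder incidents, handled well

mttr_month_one = sum(month_one) / len(month_one)   # 25.0 minutes
mttr_month_two = sum(month_two) / len(month_two)   # 90.0 minutes

# MTTR more than tripled, yet that alone says nothing about whether the
# team's response got worse -- the incidents were simply different.
print(f"Month 1 MTTR: {mttr_month_one:.0f} min, Month 2 MTTR: {mttr_month_two:.0f} min")
```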

Time to recover does not convey anything useful about the event’s impact or the effort required to deal with the incident either.

Furthermore, failures in complex systems don’t arrive uniformly over time. Each failure is inherently different, and so it is inadvisable to use a single averaged number to try to represent the reliability of complex sociotechnical systems.

Why not to use MTTR – in math

The VOID report references this Incident Metrics in SRE report from Google, which is a much tougher read but contains the math on why MTTx metrics aren’t useful. The author ran Monte Carlo simulations against real incident data, comparing the unaltered incident durations with a set of incidents whose durations had been intentionally reduced. The conclusion was that the large amount of variance in duration data renders MTTR useless, because improvements were effectively impossible to detect.

The VOID report sought to replicate the research, using a different incident data set, and came to largely the same conclusion.
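I have not reproduced the report’s methodology here, but the following is a minimal sketch in the same spirit. It assumes heavy-tailed (lognormal) incident durations, purely as an illustration, and estimates how often MTTR fails to reveal a genuine 10% improvement:

```python
import numpy as np

rng = np.random.default_rng(42)

def share_of_missed_improvements(trials=10_000, incidents_per_month=10, improvement=0.10):
    """Compare a baseline month of simulated incident durations against a month
    whose durations are genuinely 10% shorter, and count how often the
    'improved' month still shows an equal or worse MTTR."""
    missed = 0
    for _ in range(trials):
        # Lognormal is an assumption made here only because incident
        # durations tend to be highly skewed with a long tail.
        baseline = rng.lognormal(mean=4.0, sigma=1.0, size=incidents_per_month)
        improved = rng.lognormal(mean=4.0, sigma=1.0, size=incidents_per_month) * (1 - improvement)
        if improved.mean() >= baseline.mean():
            missed += 1
    return missed / trials

print(f"Trials where MTTR hides a real 10% improvement: {share_of_missed_improvements():.0%}")
```

With only a handful of highly variable incidents per month, the averaged number can easily move in the wrong direction, which is the same point the report’s simulations make with real data.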

What to use instead of MTTR

It doesn’t matter what your MTTR indicates; you need to investigate your incidents to understand what is happening. The following are suggested alternatives.

Post-Incident Reviews

Post-Incident Reviews are a way to learn from incidents. Consider tracking the degree of participation, sharing, and dissemination of post-incident review information, such as the number of people contributing to, reading, and linking to these reports.

What you learn from incidents can help you make the case for backlog items or improvements that may not have been prioritized in the past.

SLOs

SLOs help align technical system metrics with business objectives, making them a more useful frame for “reliability.”
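As a rough illustration of how an SLO frames reliability, here is a minimal sketch assuming a 99.9% availability target over a 30-day window (the target and measurements are illustrative):

```python
# Illustrative error-budget arithmetic for an availability SLO.
slo_target = 0.999
window_minutes = 30 * 24 * 60                       # 43,200 minutes in the window
error_budget = window_minutes * (1 - slo_target)    # ~43.2 minutes of allowed "bad" time

bad_minutes_so_far = 12        # hypothetical, e.g. measured from failed probes
budget_remaining = error_budget - bad_minutes_so_far

print(f"Error budget: {error_budget:.1f} min, remaining: {budget_remaining:.1f} min")
```

Framed this way, the conversation shifts from “how fast did we recover?” to “how much unreliability can the business tolerate, and how much of that budget have we spent?”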

Sociotechnical Incident Data

Instead of focusing only on technical data to assess how your teams are doing, consider sources of sociotechnical data such as:

  • The number of people involved hands-on in an incident

  • Which tools were used

  • The number of unique teams

  • The number of chat channels

  • Whether there were concurrent incidents

  • What was communicated about the incident outside of the team

Collecting this kind of information allows you to see how your organization actually responds to incidents.
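If you want to capture this alongside your incident records, a minimal sketch might look like the following (field names are illustrative, not a standard schema):

```python
from dataclasses import dataclass, field

@dataclass
class IncidentSociotechnicalData:
    incident_id: str
    responders: list[str] = field(default_factory=list)       # people hands-on in the incident
    teams: set[str] = field(default_factory=set)               # unique teams involved
    tools_used: list[str] = field(default_factory=list)
    chat_channels: list[str] = field(default_factory=list)
    concurrent_incidents: int = 0
    external_comms: list[str] = field(default_factory=list)    # status page posts, customer emails, etc.

# Hypothetical example record
example = IncidentSociotechnicalData(
    incident_id="2022-11-03-checkout-errors",
    responders=["alice", "bob", "carol"],
    teams={"payments", "platform"},
    tools_used=["pager", "dashboards", "feature flags"],
    chat_channels=["#inc-checkout-errors"],
    concurrent_incidents=1,
    external_comms=["status page update"],
)
```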

And also…

Avoid the term Root Cause

The report recommends avoiding terms like Root Cause Analysis.

Language like “root cause” can hinder the ability to look beyond the so-called “single cause”. Complex software failures are never simple. They involve numerous lurking latent failures, and the system is always operating in some form of degradation. A failure results from a specific combination of those latent factors coming together to create an unexpected outcome.

Instead, consider using “Contributing Factors”.

Does Severity level have any meaning?

Severity is typically assigned to an incident on a descending numeric scale from 4 (“low” impact) to 1 (most severe or “critical”). While a discrete scale like this lends itself to clear, easy categorization, severity is plagued by many of the same fuzzy issues as duration: it is subjectively assigned, inconsistently implemented, and sometimes updated but sometimes not.

From a SageSure perspective, I agree with this one less. If we can agree on some broad definitions of severity, I think there is some benefit to assigning a severity level. “It’s an S1” may be a coded message, but it gets people’s attention.

Takeaways

Should you track time to recover for each incident? Yes. It can be a useful exercise to think about when an incident really started and stopped. It can also be an essential ingredient for calculating the cost of an incident.

What is not recommended is reporting those times in aggregate, such as a mean. We work with complex, sociotechnical systems that fail sporadically and in non-uniform ways. Using averaged numbers to represent their reliability (or the performance of the supporting teams) is likely to be misleading.

Instead, conduct post-incident learning reviews to learn (and share!) everything you can from an incident. After all, you have already been forced to make an investment because of the outage, so you might as well learn from it.

Also use SLOs to help align technical system metrics with business objectives, and consider sociotechnical incident data, such as what individuals and teams were involved, what tools they used, and how communication worked.

 
