
2022 VOID Report Summary

The following is a summary of the 2022 VOID report.

The original is ~10,000 words. This is ~1,500.

Introduction

Adrian Cockcroft opens by talking about his work at Netflix: building a resilient system to work in the presence of failures, game day exercises, Chaos Engineering, and publishing incident reviews when things did go wrong. Netflix still had major incidents, which tended to be failures never seen before: a series of unfortunate events conspiring to find a weak spot in the system.

There were however very few relevant published incident reviews from others to help.

That is where the Verica Open Incident Database (VOID) comes in. It reports on incident analysis, using data from over 10,000 incident reports from almost 600 organizations. The goals are to raise awareness and increase understanding of software-based failures to make software & the internet a more resilient and safe place.

Software systems are complex sociotechnical systems. The pressures, mental models, and interactions within these systems only become more apparent when we seek to investigate and learn from their failures.

If you avoid recording and publishing incidents because you want to look good, then you are more likely to eventually have a much bigger failure.

 

Delving Deeper Into Duration

Incident duration data, aka Time to Recover and commonly calculated as the end time minus the start time, is the underlying source of MTTR (Mean Time to Resolve/Recover/Repair/Restore).
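(Not from the report, just my own minimal sketch of that calculation in Python; the incident timestamps are made up.)

```python
from datetime import datetime
from statistics import mean

# Hypothetical incidents: (start, end) timestamps as recorded in an incident tracker.
incidents = [
    (datetime(2022, 3, 1, 9, 0),   datetime(2022, 3, 1, 9, 40)),
    (datetime(2022, 4, 12, 14, 5), datetime(2022, 4, 12, 22, 30)),
    (datetime(2022, 6, 8, 2, 15),  datetime(2022, 6, 8, 2, 35)),
]

# Duration (Time to Recover) = end time minus start time, here in minutes.
durations = [(end - start).total_seconds() / 60 for start, end in incidents]

# MTTR is simply the arithmetic mean of those durations.
mttr = mean(durations)
print(f"MTTR: {mttr:.1f} minutes")
```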

Duration data is:

  • Fuzzy on both ends (start/stop)
  • Sometimes automated, often not
  • Sometimes updated, sometimes not
  • A lagging indicator of what happened in the past in your system(s)
  • Inherently subjective

For example, for any given incident, when did it end?

  • When a temporary fix is made that alleviates the issue
  • When the problematic code is reverted
  • When a “permanent” change is made
  • When metrics start reflecting the issue is over
  • When a customer reports the issue is over

Different teams can use different criteria. The same issues apply to start time. 

The report also talks about an incident where an unrelated config change accidentally fixed a bug before it was detected. How do you measure that?

The distribution of duration data is skewed rather than normally distributed: most incidents are relatively short, with a long tail of much longer ones, instead of the familiar bell curve shape.

This means that central tendency measures like the mean aren’t accurate representations of the underlying data.

(I found more about this in Skewed Distribution: Definition & Examples. Specifically, it seems that incident duration data is right-skewed, which can happen when the data cannot be less than zero but can have unusually large values.

“Income and wealth are classic examples of right skewed distributions. Most people earn a modest amount, but some millionaires and billionaires extend the right tail into very high values. Consequently, reports frequently refer to median incomes because the mean overestimates the most common values.”)
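(Again, not from the report: a small Python sketch on simulated right-skewed durations shows why the mean misleads while the median stays close to the "typical" incident.)

```python
import random
from statistics import mean, median

random.seed(42)

# Simulated incident durations (minutes): lognormal samples are right-skewed,
# i.e. bounded below by zero but with occasional very large values.
durations = [random.lognormvariate(3.0, 1.0) for _ in range(1000)]

print(f"mean:   {mean(durations):6.1f} minutes")    # pulled upward by the long tail
print(f"median: {median(durations):6.1f} minutes")  # closer to the typical incident
```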

 

 

Moving on From MTTR

In part because of the distribution of duration data discussed above, and because “failures” in such systems don’t arrive uniformly over time, the report concludes that MTTR is not an appropriate metric for complex software systems. Each failure is inherently different.

The report references Štěpán Davidovič’s Incident Metrics in SRE: Critically Evaluating MTTR and Friends. Davidovič ran Monte Carlo simulations on incident data, comparing the unaltered incident durations with a set of incidents whose durations had been intentionally reduced by 10%.

The VOID report sought to replicate Davidovič’s approach, using a different incident data set, but came to largely the same conclusion.

The large variance in duration data renders MTTR useless: even that deliberate 10% improvement was effectively impossible to detect.
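(To make the idea concrete, here is my rough reconstruction of that kind of simulation, not Davidovič’s actual methodology or code; all numbers are made up. Resample "periods" of incidents from a high-variance pool, shrink one copy of the pool by 10%, and see how often the improved system actually shows a lower MTTR.)

```python
import random
from statistics import mean

random.seed(7)

# Hypothetical historical incident durations in minutes (high variance, long tail).
history = [random.lognormvariate(3.0, 1.2) for _ in range(300)]
improved = [d * 0.9 for d in history]  # every incident intentionally shortened by 10%

def simulated_mttr(pool, incidents_per_period=20):
    """MTTR for one simulated period, resampling incidents from the pool."""
    return mean(random.choices(pool, k=incidents_per_period))

# How often does the 'improved' system actually show a lower MTTR for a period?
trials = 10_000
wins = sum(simulated_mttr(improved) < simulated_mttr(history) for _ in range(trials))
print(f"Improved MTTR looked better in {wins / trials:.0%} of simulated periods")
```

With variance this large, the "improved" system only looks better a little more than half the time, i.e. the 10% improvement is nearly invisible in MTTR.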

 

Alternatives To MTTR

OK, so MTTR isn’t fit for purpose. What are (better) alternatives?

The report argues: “We never should have used a single averaged number to try to measure or represent the reliability of complex sociotechnical systems”.

It doesn’t matter what your MTTR indicates; you need to investigate your incidents to understand what is happening. You need qualitative incident analysis.

Some alternatives to consider instead of MTTR…

SLOs and Customer Feedback

Service Level Objectives (SLOs) are commitments (or targets) that a service provider makes to ensure they are serving users adequately (and investing in reliability when needed). SLOs help align technical system metrics with business objectives, making them a more useful frame for “reliability.”
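(As a hedged illustration of the SLO framing, not something from the report, a simple availability SLO check might look like this; the target, metric names, and request counts are all made up.)

```python
# Hypothetical 30-day availability SLO: 99.9% of requests succeed.
slo_target = 0.999

# Made-up request counts pulled from monitoring for the last 30 days.
total_requests = 48_200_000
failed_requests = 31_500

availability = 1 - failed_requests / total_requests
error_budget = 1 - slo_target  # fraction of requests allowed to fail
budget_spent = (failed_requests / total_requests) / error_budget

print(f"availability:       {availability:.4%}")  # e.g. 99.9347%
print(f"error budget spent: {budget_spent:.0%}")  # e.g. 65%
```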

 

Sociotechnical Incident Data

Sociotechnical refers to systems that involve a complex interaction between humans, machines and the environmental aspects of the work system (src). 

Our systems are sociotechnical, comprising code, machines, and the humans who build and maintain them. Instead of focusing on technical data to assess how they are doing, consider sources of sociotechnical data such as:

  • The number of people involved hands-on in an incident
  • Which tools were used
  • The number of unique teams
  • The number of chat channels
  • Whether there were concurrent incidents
  • What was communicated about the incident outside of the team

Collecting this kind of information allows you to see how your organization actually responds to incidents. Data about who was involved, their cognitive load, and the tools and technical resources used can provide a more accurate account of how your teams respond to incidents.
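(A minimal sketch of recording these sociotechnical data points alongside the usual technical ones; the field names and schema are my own, not prescribed by the report.)

```python
from dataclasses import dataclass, field

@dataclass
class IncidentRecord:
    """Sociotechnical data captured for a single incident (hypothetical schema)."""
    incident_id: str
    responders: list[str] = field(default_factory=list)      # people hands-on during response
    teams_involved: list[str] = field(default_factory=list)  # unique teams pulled in
    tools_used: list[str] = field(default_factory=list)      # dashboards, runbooks, flags...
    chat_channels: int = 0                                    # channels opened for coordination
    concurrent_incidents: int = 0                             # other incidents open at the same time
    external_comms: list[str] = field(default_factory=list)  # status page posts, customer emails

record = IncidentRecord(
    incident_id="2022-11-03-checkout-latency",
    responders=["ana", "jun", "priya"],
    teams_involved=["payments", "platform"],
    tools_used=["grafana", "feature-flags"],
    chat_channels=2,
    concurrent_incidents=1,
    external_comms=["status page update"],
)
print(f"{len(record.responders)} responders across {len(record.teams_involved)} teams")
```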

 

Post-Incident Review Data

Another way to assess the effectiveness of incident analysis in your organization is to track the degree of participation, sharing, and dissemination of post-incident review information. This can include the number of:

  • People reading write-ups
  • People voluntarily attending post-incident review meetings
  • Links back to write-ups from other documents, code, and diagrams

Near Misses

Another emerging practice is to prioritize learning from near misses, which can provide a useful understanding of gaps in knowledge, misaligned mental models, and other forms of organizational and technical “blind spots”. Information about near misses can help teams invest in changes that avoid similar, and more serious, incidents in the future.

The difficulty can be in deciding what constitutes a near miss.

Companies that can track, analyze, adapt to, and learn from near misses are studying both successes and failures. This provides a much more complete picture of how their systems function.

 

Duration and Severity Aren’t Related

“Severity levels are not objective measures of anything in practice, even if they’re assumed to be so in theory. They are negotiable constructs that provide an illusion of control or understanding, or footholds for people as they attempt to cope with complexity.” – John Allspaw

Severity is typically assigned to an incident on a descending numeric scale from 4 (“low” impact) to 1 (most severe or “critical”). While a discrete scale like this lends itself to clear, easy categorization, severity is plagued by many of the same fuzzy issues as duration. Specifically, it is:

  • Subjectively assigned and inconsistently implemented, even within one team
  • In some cases, a proxy for “customer impact”, or “engineering effort required to fix”, or “urgency”
  • Sometimes automated, often not
  • Sometimes updated over the course of an incident

The report goes on to investigate a potential correlation between duration and severity, and found none.

Companies can have long or short incidents that are very minor, existentially critical, and nearly every combination in between. Not only can duration not tell a team how reliable or effective they are, but it also doesn’t convey anything useful about the event’s impact or the effort required to deal with the incident.
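(If you want to check this against your own incident data, a quick sketch using statistics.correlation from Python 3.10+; the severity/duration pairs below are made up.)

```python
from statistics import correlation

# Made-up incidents: (severity 1=critical ... 4=low, duration in minutes).
incidents = [(1, 22), (3, 480), (2, 35), (4, 600), (1, 910), (3, 15), (2, 260), (4, 8)]

severities = [s for s, _ in incidents]
durations = [d for _, d in incidents]

# Pearson correlation coefficient; values near 0 mean duration tells you
# little about severity (and vice versa).
print(f"correlation: {correlation(severities, durations):+.2f}")
```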

 

Root Cause

“Cause is not something you find. Cause is something you construct. How you construct it and from what evidence, depends on where you look, what you look for, who you talk to, what you have seen before, and likely on who you work for.” – Sidney Dekker

Language like “root cause” can hinder the ability to look beyond the so-called “single cause”. 

The concept of a root cause analysis (RCA) suggests that an incident has a single, specific cause or trigger, without which the incident wouldn’t have happened. However, complex software failures are never simple. They comprise numerous latent failures, and the system is always operating in some form of degradation. An event like a failure results from a specific combination of those latent factors that, together, create an unexpected outcome.

Instead, consider using “Contributing Factors Analysis”.

The language we use matters: it shapes how we think about failures and incidents. The structure of how an event is described can influence how people perceive and recall those events.

What you learn from incidents can help you make the case for backlog/improvements that may not have gotten prioritization in the past.

 

Going Forward

The final conclusion of the report: Resilience saves time.

Taking the time to learn how to better respond to incidents – learning from the people, the processes, and the systems – will make the next incident smoother. And there will always be the next incident. Incidents contain multitudes. They reveal contradictions, assumptions, and systemic pressures intermingled with successes, sources of resilience, and adaptive capacity.

We continue to encourage organizations to go beyond checking a few boxes and saying “It won’t happen again”. Companies that don’t realize the benefit of supporting in-depth incident analysis will eventually fall behind their forward-thinking competitors.

Incidents are inevitable in any organization. The key to success is turning these incidents into learning opportunities. By studying what goes right along with what goes wrong, you can create a process that not only prevents future incidents but also allows your team to learn and grow from the experience. 
