
Book chapter summary: Postmortem Culture, from the SRE Book

I’m really enjoying reading the excellent “SRE Book”. Chapter 15 in particular, “Postmortem Culture: Learning from Failure”, really struck a chord with me. The following is a slightly summarized version of it.

TLDR: Failures are inevitable, especially in distributed systems. To learn from them, document them in postmortems, avoid blame, and share what you learn across your org.

Incidents and outages are inevitable with large-scale, complex, distributed systems. Unless we have some formalized process of learning from these incidents in place, they may recur ad infinitum. Therefore, postmortems are an essential tool for SRE.

A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring. 
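As a rough illustration, the parts of a postmortem listed above could be captured in a small record type. This is only a sketch: the class name `Postmortem`, its field names, and the `missing_sections` helper are made up for this example and do not come from the SRE Book.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Postmortem:
    """Hypothetical record with one field per part of a postmortem."""
    incident: str                                                  # what happened
    impact: str                                                    # who or what was affected
    mitigation_actions: List[str] = field(default_factory=list)   # steps taken to mitigate or resolve
    root_causes: List[str] = field(default_factory=list)          # contributing root cause(s)
    follow_up_actions: List[str] = field(default_factory=list)    # work to prevent recurrence

    def missing_sections(self) -> List[str]:
        """Return the names of sections that are still empty."""
        return [name for name, value in vars(self).items() if not value]


draft = Postmortem(incident="Checkout API returned errors", impact="Some orders failed")
print(draft.missing_sections())
```

A helper like `missing_sections` is one possible way to notice that a draft still lacks root causes or follow-up actions before it goes out for review.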

Creating a Postmortem

The primary goals of writing a postmortem are to ensure that

  • the incident is documented,
  • all contributing root cause(s) are well understood, and, especially,
  • effective preventive actions are put in place to reduce the likelihood and/or impact of recurrence.

Writing a postmortem is not punishment; it is a learning opportunity for the entire company. Postmortems do have a time and effort cost, so choose carefully which incidents to cover. Triggers may include:

  • User impact
  • Data loss
  • On-call engineer intervention (release rollback, rerouting of traffic, etc.)
  • Slow resolution time
  • A monitoring failure (which usually implies manual incident discovery)

Define postmortem criteria in advance so everyone knows when a postmortem is necessary. Beyond these criteria, any stakeholder may request a postmortem for an event.
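One way to define the criteria in advance is to write the triggers down as an explicit check. The sketch below is illustrative only: the `Incident` fields, the `postmortem_triggers` function, and the 60-minute threshold for a "slow" resolution are all assumptions, since the chapter leaves the concrete thresholds to each team.

```python
from dataclasses import dataclass
from typing import List

# Assumed threshold; the source does not say what counts as "slow".
SLOW_RESOLUTION_MINUTES = 60


@dataclass
class Incident:
    """Hypothetical incident facts, one field per trigger listed above."""
    user_impact: bool = False
    data_loss: bool = False
    on_call_intervention: bool = False   # e.g. release rollback, traffic rerouting
    resolution_minutes: int = 0
    found_by_monitoring: bool = True     # False means the incident was found manually
    stakeholder_request: bool = False    # any stakeholder may ask for a postmortem


def postmortem_triggers(incident: Incident) -> List[str]:
    """Return the reasons this incident needs a postmortem (empty if none)."""
    checks = [
        (incident.user_impact, "user impact"),
        (incident.data_loss, "data loss"),
        (incident.on_call_intervention, "on-call engineer intervention"),
        (incident.resolution_minutes > SLOW_RESOLUTION_MINUTES, "slow resolution time"),
        (not incident.found_by_monitoring, "monitoring failure"),
        (incident.stakeholder_request, "stakeholder request"),
    ]
    return [reason for triggered, reason in checks if triggered]


reasons = postmortem_triggers(Incident(data_loss=True, resolution_minutes=90))
print(reasons)  # ['data loss', 'slow resolution time']
```

Returning the list of reasons, rather than a plain yes/no, keeps the decision transparent: anyone can see which agreed criterion was met.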

Blameless postmortems are important, and should focus on identifying the causes of the incident without indicting anyone for inappropriate behavior. Assume that everyone involved had good intentions and did the right thing with the information they had. Otherwise, people will not bring issues to light for fear of punishment.

A “mistake” is an opportunity to strengthen the system. Instead of allocating blame, investigate the systematic reasons why an individual or team had incomplete or incorrect information, so that prevention plans can be put in place. You can’t “fix” people, but you can fix systems.

A postmortem is not written as a formality to be forgotten. Instead it is an opportunity to fix a weakness. While a blameless postmortem doesn’t point fingers, it should call out where and how services can be improved. 

Best Practice: Avoid Blame and Keep It Constructive
Blameless postmortems can be challenging to write, because the postmortem format clearly identifies the actions that led to the incident. Removing blame from a postmortem gives people the confidence to escalate issues without fear. It is also important not to stigmatize frequent production of postmortems by a person or team. An atmosphere of blame risks creating a culture in which incidents and issues are swept under the rug, leading to greater risk for the organization.

Collaborate and Share Knowledge

The postmortem workflow includes collaboration and knowledge-sharing at every stage.

Consider using a template, for example https://landing.google.com/sre/sre-book/chapters/postmortem/

Writing a postmortem also involves formal review and publication. Teams share the first postmortem draft internally and solicit feedback on completeness. Once the initial review is complete, the postmortem is shared more broadly. The goal is to share postmortems with the widest possible audience that would benefit from the knowledge or lessons imparted.

Best Practice: No Postmortem Left Unreviewed

An unreviewed postmortem might as well never have existed. Hold regular review sessions for postmortems, then add each reviewed postmortem to a repository of past incidents. Transparent sharing makes it easier for others to find and learn from them.

Introducing a Postmortem Culture

Introducing a postmortem culture is difficult. While senior management should reinforce a collaborative postmortem culture through active participation in the review and collaboration process, blameless postmortems are ideally the product of engineer self-motivation. 

Consider

  • Showcasing a “Postmortem of the month” to share any interesting and well-written postmortems with the wider organization
  • Making sure that writing effective postmortems is a rewarded and celebrated practice

Best Practice: Visibly Reward People for Doing the Right Thing

For example, if an incident is handled well, averting a much longer and larger-scale outage, and is documented well in a postmortem, publicly praise and/or reward the team members involved.

Conclusion and Ongoing Improvements

A continuous investment in cultivating a postmortem culture should result in fewer outages and foster a better user experience. 
