
 

Book chapter summary: Managing Incidents

This is a slightly abridged version of Chapter 14, “Managing Incidents”, by Andrew Stribblehill, from the excellent “SRE Book”. (Original is 2200 words, this is 1200.)

 

Intro

Effective incident management is key to limiting the disruption caused by an incident and restoring normal business operations as quickly as possible. Plan out your response to potential incidents in advance.

 

An Example of an Unmanaged Incident

Mary, the on-call engineer, gets paged. A service has stopped serving any traffic in one, then two, then three of your five datacenters, overloading the remaining ones until no requests are served from anywhere.

Mary unsuccessfully reverts to the previous release. 
She calls the engineer who wrote most of the code for the failing service.
Colleagues start poking around from their own terminals.

Management demand answers.
The VPs call on their prior engineering experience and make irrelevant but hard-to-refute comments like, “Increase the page size!”

Unbeknown to Mary, the main engineer called a colleague, who deploys a “fix”. Within seconds, the servers restart, pick up the change, and die again.

The Anatomy of an Unmanaged Incident

Everybody in the preceding scenario was doing their job, but a few common hazards caused this incident to spiral out of control.

Sharp Focus on the Technical Problem

The on-call engineer was hired for her technical prowess and focussed on making operational changes to the system to solve the problem. She wasn’t in a position to think about the bigger picture of how to mitigate the problem.

Poor Communication

The on-call engineer was too busy to communicate clearly. Nobody knew what actions their coworkers were taking and other engineers weren’t used effectively.

Freelancing

One engineer was making changes to the system with the best of intentions but didn’t coordinate with coworkers and the changes made a bad situation worse.

Elements of Incident Management Process

Incident management skills and practices exist to channel the energies of enthusiastic individuals. Google’s incident management system is based on the Incident Command System (FEMA.gov).

A well-designed incident management process has the following features.

Separation of Responsibilities

Ensure that everyone knows their role and sticks to it. 

Several distinct roles should be delegated to particular individuals:

Incident Command

The incident commander holds the high-level state about the incident. They structure the incident response task force, assign responsibilities, and hold all positions that they have not delegated.

Operational Work

The Ops lead works with the incident commander by applying operational tools to the task at hand. The operations team should be the only group modifying the system during an incident.

Communication

This person is the public face of the incident response task force and issues periodic updates to the incident response team and stakeholders, often via email and the incident document.

Planning

The planning role supports Ops by dealing with longer-term issues, such as filing bugs, arranging handoffs, and tracking how the system has diverged from the norm so it can be reverted later.
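The rule that the incident commander holds every position they have not delegated can be sketched as a small data structure. This is only an illustration, not anything from the chapter: the names `Role`, `IncidentRoster`, `delegate`, `holder` and `may_modify_system` are hypothetical, and treating the Ops lead as the single person allowed to change the system is a simplification of "the operations team".

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    INCIDENT_COMMAND = "incident command"
    OPERATIONS = "operations"
    COMMUNICATION = "communication"
    PLANNING = "planning"


@dataclass
class IncidentRoster:
    """Hypothetical record of who holds which role during an incident."""
    commander: str
    delegated: dict = field(default_factory=dict)

    def delegate(self, role, person):
        # Command is handed off explicitly, never delegated like other roles.
        if role is Role.INCIDENT_COMMAND:
            raise ValueError("incident command is handed off, not delegated")
        self.delegated[role] = person

    def holder(self, role):
        # Any role that has not been delegated stays with the commander.
        if role is Role.INCIDENT_COMMAND:
            return self.commander
        return self.delegated.get(role, self.commander)

    def may_modify_system(self, person):
        # Simplifying assumption: only the Ops lead changes the system.
        return person == self.holder(Role.OPERATIONS)


roster = IncidentRoster(commander="Sabrina")
roster.delegate(Role.OPERATIONS, "Mary")
print(roster.holder(Role.OPERATIONS))  # Mary
print(roster.holder(Role.PLANNING))    # Sabrina, because it was never delegated
```

A check like `may_modify_system` is one way to make freelancing visible: anyone who is not on the Ops side of the roster should be routing changes through the Ops lead.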

A Recognized Command Post

Everyone needs to know where they can interact with the incident commander, e.g., a dedicated Zoom call or Slack room.

Live Incident State Document

The incident commander must keep a living incident document, which can be messy but must be functional. Keep the most important information at the top. Retain this documentation for postmortem analysis.
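As a rough sketch of "most important information at the top", the document could be modelled as a status header followed by a timeline. The `IncidentDocument` class, its field names, and the choice to list the newest timeline entries first are all assumptions made for illustration; a shared document that several people can edit at once would do the same job.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentDocument:
    """Hypothetical living incident document kept by the incident commander."""
    title: str
    summary: str
    commander: str
    timeline: list = field(default_factory=list)

    def log(self, message, when=None):
        # Record an update; default to the current time in UTC.
        when = when or datetime.now(timezone.utc)
        self.timeline.append((when, message))

    def render(self):
        # Status and command structure go first, the timeline follows.
        lines = [
            f"Incident: {self.title}",
            f"Current status: {self.summary}",
            f"Incident commander: {self.commander}",
            "",
            "Timeline (most recent first):",
        ]
        newest_first = sorted(self.timeline, key=lambda e: e[0], reverse=True)
        for when, message in newest_first:
            lines.append(f"  {when:%Y-%m-%d %H:%M} UTC  {message}")
        return "\n".join(lines)


doc = IncidentDocument(
    title="Service not serving traffic",
    summary="Rollback attempted, did not help",
    commander="Sabrina",
)
doc.log("Rolled back to the previous binary release; no improvement")
print(doc.render())
```

Because every entry is timestamped and nothing is deleted, the same structure can be kept as raw material for the postmortem.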

Clear, Live Handoff

The post of incident commander needs to be handed off clearly and explicitly when necessary.

A Managed Incident

Now let’s examine how this incident might have played out if it were handled using principles of incident management.

Mary, the on-call engineer, gets paged. A service has stopped serving traffic. This is a rapidly growing issue that will benefit from the structure of an incident management framework. Mary asks Sabrina to take command; Sabrina agrees and gets a rundown of what’s occurred, capturing the details in an email to a prearranged mailing list.

When the third alert fires, Sabrina follows up to the email thread with an update, keeping VPs abreast of the high-level status without bogging them down in minutiae. Sabrina asks an external communications representative to start drafting user messaging. She then follows up with Mary to see if they should contact the developer on-call (currently Josephine). Receiving Mary’s approval, Sabrina loops in Josephine.

By the time Josephine logs in, another engineer, Robin, has already volunteered to help out. Sabrina reminds both engineers to prioritize any tasks delegated to them by Mary, and to keep Mary informed of any actions they take. The engineers familiarize themselves with the current situation by reading the incident document.

By now, Mary has tried the old binary release. Robin updates IRC to say that this attempted fix didn’t work. Sabrina pastes this update into the live incident management document.

At 5 p.m., Sabrina starts finding replacement staff to take on the incident and updates the incident document. A brief phone conference takes place so everyone is aware of the current situation and responsibilities are handed off to colleagues in another office.

By the following morning the other office has mitigated the problem and started the postmortem. The team plans structural improvements so problems of this category don’t afflict the team again.

When to Declare an Incident

It is better to declare an incident early and then find a simple fix than to have to spin up the incident management framework hours into a burgeoning problem. If the answer to any of the following questions is yes, it’s an incident:

  • Do you need to involve a second team in fixing the problem?
  • Is the outage visible to customers?
  • Is the issue unsolved even after an hour’s concentrated analysis?
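The three questions above amount to a simple "any of these" check. Here is a minimal sketch, assuming a hypothetical `ProblemAssessment` record and function name; the one-hour threshold comes straight from the list, everything else is illustrative.

```python
from dataclasses import dataclass


@dataclass
class ProblemAssessment:
    """Hypothetical snapshot of what the on-call engineer knows so far."""
    needs_second_team: bool
    visible_to_customers: bool
    hours_of_concentrated_analysis: float
    solved: bool = False


def should_declare_incident(p):
    # Any single criterion is enough to declare an incident.
    unsolved_after_an_hour = (
        not p.solved and p.hours_of_concentrated_analysis >= 1.0
    )
    return (
        p.needs_second_team
        or p.visible_to_customers
        or unsolved_after_an_hour
    )


# A customer-visible outage qualifies even ten minutes in.
print(should_declare_incident(ProblemAssessment(False, True, 0.17)))  # True
```

The check is deliberately biased toward declaring: a "yes" to one question is enough, which matches the advice to declare early rather than hours into the problem.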

Incident management proficiency can atrophy when not in constant use, but the incident management framework can also be applied to other operational changes, such as disaster-recovery testing. Role-playing the response can also help.

In Summary

By formulating an incident management strategy in advance and regularly putting it to use, we were able to reduce our mean time to recovery and provide staff a less stressful way to work on emergent problems. Any organization concerned with reliability would benefit from pursuing a similar strategy.

Best Practices for Incident Management

  • Prioritize: Stop the bleeding, restore service, and preserve the evidence for root-causing.
  • Prepare: Develop and document your incident management procedures in advance.
  • Trust: Give full autonomy within the assigned role to all incident participants.
  • Introspect: Watch for emotional states while responding to an incident. Solicit more support when needed.
  • Consider alternatives: Periodically consider your options and re-evaluate responses.
  • Practice: Use the process routinely so it becomes second nature.
  • Change it around: Take on a different role. Encourage every team member to acquire familiarity with each role.
