Shaun Abram
Technology and Leadership Blog
Book chapter summary: Managing Incidents
This is a slightly abridged version of Chapter 14, “Managing Incidents”, by Andrew Stribblehill, from the excellent “SRE Book”. (The original is 2200 words; this is 1200.)
Tags: sitereliabilityengineering, sre, summary, thesrebook
Beginning with SRE
This post is an introduction to some basic SRE practices we have recently been implementing at my company.
I’ve written before on SRE, including on SRE resources, SLIs, SLOs and SLAs, and Creating an SRE team, but this is a more practical guide to getting started.
Tags: servicelevelagreements, sitereliabilityengineering, sla, sli, slo, sre, thesrebook
SRE Resources
The following is a list of SRE resources I’m finding useful. I will update it as I find more. The good news is that most of the books (including all 3 of the Google SRE books) are available for free download at https://landing.google.com/sre/books.
Tags: seekingsrebook, sitereliabilityengineering, sre, srebooks, sreworkbook, thesrebook
SLI, SLO and SLA
What are SLIs, SLOs and SLAs?
Service Level Indicators (SLIs) are metrics that you choose to measure the health and performance of your services. Service Level Objectives (SLOs) are the desired target for those indicators. Service Level Agreements (SLAs) build on this and include the consequences of not meeting those targets. All are fundamental to Site Reliability Engineering.
In this post, I’ll try to explain each in more detail, show how they relate to each other, and give some examples of each.
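As a rough illustration of how the three fit together, here is a minimal Python sketch. It assumes a simple availability SLI (successful requests divided by total requests); the names `ServiceLevel`, `availability_sli` and `evaluate`, and the 99.9% and 99.5% targets, are made up for this example rather than taken from any of the books.

```python
from dataclasses import dataclass


@dataclass
class ServiceLevel:
    """Hypothetical targets for one service."""
    slo_target: float  # internal objective, e.g. 0.999
    sla_target: float  # contractual threshold, assumed looser than the SLO


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def evaluate(sli: float, level: ServiceLevel) -> str:
    """Compare a measured SLI against the SLO and the SLA."""
    if sli >= level.slo_target:
        return "SLO met"
    if sli >= level.sla_target:
        return "SLO missed, SLA still met"
    return "SLA breached"


level = ServiceLevel(slo_target=0.999, sla_target=0.995)
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
print(evaluate(sli, level))
```

The SLA threshold is deliberately set below the SLO here, so that missing the internal objective serves as a warning before any contractual consequences apply.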
Tags: seekingsrebook, servicelevelagreements, sitereliabilityengineering, sla, SLAs, sli, SLIs, slo, SLOs, sre, thesrebook
Book chapter summary: Postmortem Culture, from the SRE Book
I’m really enjoying reading the excellent “SRE Book”. Chapter 15, “Postmortem Culture: Learning from Failure”, in particular really struck a chord with me. The following is a slightly summarized version of it.
TLDR: Failures are inevitable, especially in distributed systems. To learn from them, document them in postmortems, avoid blame, and share what you learn across your org.
Tags: blameless, postmortems, RCA, rootcauseanalysis, sitereliabilityengineering, sre, summary, thesrebook