
 

Beginning with SRE

This post is an introduction to some basic SRE practices we have been implementing at my company recently.

I’ve written before on SRE, including on SRE resources, SLIs, SLOs and SLAs, and Creating an SRE team, but this is a more practical guide to getting started.

(more…)


SRE Metrics

A very quick post on some of the most commonly used SRE metrics: The Four Golden Metrics, and RED & USE.

(more…)


Creating an SRE team

If you wanted to build an SRE team at your company, how would you go about it? How would you structure it?

(more…)


SRE Resources

The following is a list of SRE resources I’m finding useful. I will update it as I find more. The good news is that most of the books (including all 3 of the Google SRE books) are available for free download at https://landing.google.com/sre/books.

(more…)


eBook Summary: What Is SRE?

“What Is SRE? An Introduction to Site Reliability Engineering” (registration required but free) is an ebook by Kurt Andersen & Craig Sebenik, published by O’Reilly. The following is a summary (abridged copy and paste) of the parts I found most useful, with a few of my own notes. The original is about 9,000 words; this is about 2,000.

 

(more…)


SRE vs DevOps

I’m really enjoying the Seeking SRE book. Chapter 12 covers SRE vs DevOps: a community-sourced compare-and-contrast discussion.

My favorite description is from Thomas Limoncelli, who suggested that:

DevOps engineers focus on the SDLC pipeline with occasional responsibilities for production operations. SREs focus on production operations with occasional responsibilities for the SDLC pipeline.

(more…)


Book chapter summary: Postmortem Culture, from the SRE Book

I’m really enjoying reading the excellent “SRE Book”. Chapter 15, “Postmortem Culture: Learning from Failure”, in particular really struck a chord with me. The following is a slightly summarized version of it.

TLDR: Failures are inevitable, especially in distributed systems. To learn from them, document them in postmortems, avoid blame, and share what you learn across your org.

(more…)


Talk summary: SRE principles by Tori Wieldt @ AWS re:Invent 2018

I caught a talk by Tori Wieldt at the New Relic booth at AWS re:Invent on “SRE principles”. Even though it was a short talk in the expo hall, rather than a formal scheduled one, it had a ton of good SRE material.

(more…)


Why to avoid Mean Time to Recover (MTTR)

The 2022 VOID Report came out in late 2022. It is a recommended read, and I previously summarized it here. This article focuses on one aspect of the report: why mean time to recover (MTTR) is not an appropriate metric for complex software systems.

The takeaways are:

  • Do track time to recover (TTR) for each incident. It can be a useful exercise to think about when an incident started and stopped. That can help when calculating the cost of an incident.
  • Don’t report those times in aggregate, such as MTTR. Systems fail in non-uniform ways and averaging numbers to represent their reliability (or the performance of the supporting teams) is likely to be misleading.
  • Instead, use:
    • Post-incident learning reviews to learn (and share!) everything you can from an incident
    • SLOs to help align technical system metrics with business objectives
    • Sociotechnical incident data, alongside the technical data
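
To make the point about aggregates concrete, here is a minimal sketch in Python. The incident durations are made up for illustration (they are not from the VOID report), and `summarize` is a hypothetical helper. Both sets of incidents have the same mean time to recover, yet they describe very different systems:

```python
from statistics import mean, median

# Hypothetical time-to-recover (TTR) values in minutes, one per incident.
# These numbers are invented purely for illustration.
service_a = [30, 30, 30, 30, 30]   # every incident takes about half an hour
service_b = [5, 5, 5, 5, 130]      # mostly quick fixes, one long outage


def summarize(ttrs):
    """Return a few summary statistics for a list of per-incident TTRs."""
    return {"mean": mean(ttrs), "median": median(ttrs), "max": max(ttrs)}


summary_a = summarize(service_a)
summary_b = summarize(service_b)

print("service A:", summary_a)
print("service B:", summary_b)
```

The two means are identical, so an MTTR dashboard would show no difference between the services, while the median and the maximum show that service B had one incident that behaved nothing like the rest. That is the argument for keeping per-incident TTR and reviewing incidents individually.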

(more…)


Summary: The SPACE of Developer Productivity

The SPACE of Developer Productivity is a 2021 paper by researchers at GitHub, University of Victoria, and Microsoft (including Dr Nicole Forsgren, co-author of Accelerate) that looks into ways to measure and predict productivity for both individuals and teams.

The following is a summary of the paper. The original is ~5400 words. This is ~2000.

(more…)


2022 VOID Report Summary

The following is a summary of the 2022 VOID report.

The original is ~10,000 words. This is ~1500.

(more…)


Book chapter summary: Managing Incidents

This is a slightly abridged version of Chapter 14, “Managing Incidents”, by Andrew Stribblehill, from the excellent “SRE Book”. (The original is 2200 words; this is 1200.)

(more…)


Development and delivery practices for team success

Most metrics for measuring developer productivity, such as lines of code or issues closed, are notoriously ineffective. But the research in the excellent State of DevOps report shows that, rather than focusing on local metrics and individual developer performance, it is better to look at overall development and delivery practices. Specifically, there are metrics that predict and reflect a team’s ability to successfully deliver working software into production, including deployment frequency and the mean time to restore service after an incident. This article discusses why some metrics are useless, and takes a closer look at the recommendations in the 2019 State of DevOps report.

(more…)


Report Summary: Accelerate State of DevOps 2019

This is an abridged version of The Accelerate State of DevOps Report 2019; essentially a cut and paste of the most salient parts. The original is about 18,000 words; this is about 2,500 words.

I highly recommend reading the original in its entirety, if you have time, and I’m a big fan of the Accelerate book too. As with all the other summaries I create, this is just a way to help me digest and understand an excellent article.

(more…)


Is Apdex useful?

I’ve been trying to figure out what SLOs to define for some services recently, and wondering if Apdex is a useful metric. (See my previous post on the difference between SLIs, SLOs and SLAs.)
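
For reference, here is a minimal sketch of how an Apdex score is usually calculated, using the standard definition: requests at or under a target threshold T count as satisfied, requests between T and 4T count as tolerating (worth half), and anything slower counts as frustrated. The sample response times and the 0.5 second threshold are assumptions chosen for illustration, and `apdex` is a hypothetical helper, not part of any monitoring product’s API.

```python
def apdex(response_times, threshold):
    """Apdex = (satisfied + tolerating / 2) / total samples.

    satisfied:  response time <= threshold (T)
    tolerating: T < response time <= 4T
    frustrated: response time > 4T (contributes nothing to the score)
    """
    if not response_times:
        raise ValueError("need at least one response time")
    satisfied = sum(1 for t in response_times if t <= threshold)
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(response_times)


# Hypothetical response times in seconds, with an assumed target T of 0.5s.
samples = [0.2, 0.4, 0.5, 1.0, 1.9, 2.5]
score = apdex(samples, threshold=0.5)
print(f"Apdex(T=0.5s) = {score:.2f}")
```

Here three requests are satisfied, two are tolerating and one is frustrated, so the score is (3 + 2/2) / 6. Note how much the result depends on the choice of T, which is one reason to question how well it works as an SLO.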

(more…)
