
 

Creating an SRE team

If you wanted to build an SRE team at your company, how would you go about it? How would you structure it?

 

Better resources

First, as with most of my posts, there are already many better resources out there that discuss this! See also my post of SRE resources; for resources specific to building an SRE team, I would recommend:

How SRE teams are organized and how to get started (google.com)

This article suggests starting SRE by “allocating some engineering time of multiple folks” to SRE work, before creating dedicated SRE teams, which may generalize in a consulting-type approach or specialize in Infrastructure, Tools or specific Products. It also discusses having SREs embedded in a team with their developer counterparts. The linked “Why SRE Documents Matter” document is worth reading.

SRE Team Lifecycles (google.com)

The “SRE Team Lifecycles” chapter of the information-packed SRE Workbook talks about the principles needed at each stage of your SRE “journey”, from simply defining SLOs through to finding and placing your first SRE. It also discusses approaches to creating SRE teams, including converting an existing team of engineers to SRE, or establishing a horizontal SRE team (“a small team of SREs consults across a number of teams”).

It finishes with how to roll out SRE at a level most of us outside Google likely haven’t had the need for, including rolling out large numbers of SRE teams and how to split them, e.g. by geographic region, product maturity, language, architecture, etc.

This chapter also includes one of my favorite SRE phrases: “SREs must have time to make tomorrow better than today”.

So, You Want to Build an SRE Team (oreilly.com)

Chapter 3 “So, You Want to Build an SRE Team” of the Seeking SRE book states that the chapter is for “leaders in organizations that are not practicing SRE” and helps them decide whether they should be.

One of the key points I loved was that “an organization is compatible with SRE only if it’s driven by facts and data”. The chapter even suggests the novel idea of giving “awards, not punishment, to people who take responsibility for causing an outage, if they follow up with an appropriate incident response and postmortem”. It also contains another great quote about SRE in general: “SRE is about managing your reliability so that it doesn’t get in the way of other business goals. Some amount of downtime must be anticipated, accepted, and even embraced.”

Finally, it finishes with some suggestions for getting started with SRE, including:

    • Think about why you want SRE (“More reliability with fewer people? Greater release velocity?”)
    • Find some influential people in your organization to kickstart SRE
    • Set some SLOs and think about error budgets
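
To make that last suggestion a little more concrete, here is a minimal sketch of the arithmetic behind an error budget. It is my own illustration, not something taken from the book: the function names, the 30-day window and the 99.9% availability target are all assumptions picked for the example.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window.

    slo_target is a fraction such as 0.999 (hypothetical example target).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # After a 10-minute outage, about 77% of that budget is left.
    print(round(budget_remaining(0.999, 10.0), 2))
```

Seeing the budget as a number of minutes is often what turns “some amount of downtime must be accepted” from a slogan into a planning conversation.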

OK, with those excellent resources in mind, here are some of my thoughts on putting together a plan to build an SRE team…

 

Why have an SRE team?

Or perhaps, more generally, your question might be: why SRE?

I think one of the main reasons for adopting an SRE perspective is that “Reliability is the most important feature” (see “SRE: Reaching Beyond Your Walls” from the SRE Workbook). It would be difficult to argue that any individual feature, or even a collection of features, is more important than the service as a whole being available.

Whatever your reasons, you’re likely to start thinking about how you can implement SRE in your org…

 

Types of Site Reliability Engineers

The “What Is SRE?” book breaks SREs into 3 categories:

“SREs can be deployed to focus on infrastructure components,
as short-term consultants for feature-oriented teams, or as
long-term “embedded” teams working with their feature-oriented
counterparts.”

Another way to consider organizing SREs is along the following lines:

1. Horizontal SRE in Engineering

The work of a Horizontal SRE is common across all engineering teams. Examples might include:

  • Defining engineering-wide SRE best practices such as incident handling and RCAs
  • Creating generic dashboard or alert templates for quick and easy adoption across engineering teams
  • Defining canary release procedures
  • Creating generic deployment or production readiness checklists

2. Embedded SRE in Engineering

Unlike a “horizontal” SRE who works across all of engineering, an embedded SRE works on one specific team, and has an intimate understanding of that team’s needs, business context, code and app-specific configuration. Responsibilities may include:

  • Owning that team’s alerts, dashboards and runbooks
  • Focusing on team-specific preventative work, issue resolution, and learnings (e.g. leading those RCAs)
  • Defining and signing off on release-specific checklists, e.g.:
      • Preconditions that need to be met for the release to go ahead
      • Definition of healthy: how do you know this release is working OK?
      • Details on how reverts will work if needed
      • Possible slow-burning issues to watch for. Some issues, e.g. memory leaks, may not be seen immediately.
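
As a rough sketch of what such a release checklist could look like if an embedded SRE wanted to capture it in code, here is one possible shape. Everything in it is hypothetical and for illustration only: the `ReleaseChecklist` class, its field names and the sign-off rule (all preconditions met, plus a non-empty definition of healthy and revert plan) are my assumptions, not an established tool or standard.

```python
from dataclasses import dataclass, field


@dataclass
class ReleaseChecklist:
    """Hypothetical release checklist an embedded SRE signs off on."""
    release: str
    # Precondition name -> whether it has been met.
    preconditions: dict[str, bool] = field(default_factory=dict)
    # How do we know this release is working OK?
    healthy_definition: str = ""
    # How a revert will work if needed.
    revert_plan: str = ""
    # Slow-burning issues to keep watching after release (informational only).
    slow_burn_watchlist: list[str] = field(default_factory=list)

    def blockers(self) -> list[str]:
        """Return the reasons this release cannot be signed off yet."""
        problems = [
            f"precondition not met: {name}"
            for name, met in self.preconditions.items()
            if not met
        ]
        if not self.healthy_definition.strip():
            problems.append("no definition of healthy")
        if not self.revert_plan.strip():
            problems.append("no revert plan")
        return problems

    def signed_off(self) -> bool:
        """A release is signed off only when nothing is blocking it."""
        return not self.blockers()


if __name__ == "__main__":
    checklist = ReleaseChecklist(
        release="example-1.2.3",
        preconditions={"staging tests green": True, "on-call briefed": False},
        healthy_definition="error rate stays within the SLO for 30 minutes",
        slow_burn_watchlist=["memory growth over 24 hours"],
    )
    print(checklist.blockers())
    print(checklist.signed_off())
```

The slow-burn watchlist deliberately does not block sign-off in this sketch; it is there so that whoever is on call knows what to keep an eye on in the days after the release.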

3. SRE in Ops

Similar to the “Horizontal SRE in Engineering” we discussed above, this SRE is horizontal, but across the whole organization. SREs with an “Ops”-heavy perspective may focus on issues and troubleshooting related to networking and datacenter infrastructure, whether colo or cloud-provider based. Specifics may include handling components such as API Gateways, VPCs, VPNs and subnets, as well as operating system and VM level issues. In other words, things beyond “code” that many software engineers take for granted.

 

It is worth pointing out that there is obviously overlap between all 3. For example:

  • All three might research open-source SRE-related tools.
  • Both Ops-SREs and Eng-Horizontal-SREs may be responsible for creating generic dashboards that can be used throughout the company, and may work together on defining canary release procedures. There is likely to be heavy, ongoing interaction between these two groups.
  • Engineering Horizontal SREs may create the templates that the Embedded SREs take and customize. There may even be a continuous rotation between these two groups.
  • Horizontal SREs may periodically act as consulting embedded SREs by embedding themselves in the team that needs them most for a short period of time.

Org structure

I personally think there are benefits to having a dedicated SRE team, with a dedicated manager.

Why not have the embedded SREs report directly to the team Dev Manager, or indeed the broader org Director? I think one reason is that for SREs to be truly effective, they have to work on a team whose sole purpose is SRE: making things better, improving customer experience and all the other things that go with SRE. If an SRE is working on a feature team, they may be pressured to start focusing on features (“We haven’t had any outages this week, can you work on this feature instead of SRE?”). SREs also need to work together to create company-wide SRE standards; since no team (or service) is an island, the absence of shared standards can severely impact the reliability of your own team or service.

To pull several quotes from Chapter 20 “SRE Team Lifecycles” of the SRE Workbook, SREs must have “the ability to regulate their workload” and “time to make tomorrow better than today”, all with “a view toward the holistic health of the system”. I believe all those things are more likely on a team where SRE is the sole focus.

Where to find SREs

Finally, if you have decided to build an SRE team, in whatever form, you are going to need SREs. Where can you find them? One option is to convert internally. Software Engineers, Operations (sys admins, DevOps, or whatever they may be called in your org), architects and even DBAs can all be eligible candidates. A passion for reliability, metric-based decisions and customer experience is key across all of them.

If you wish to recruit people with existing SRE experience, conferences such as SREcon, DevOpsCon, and DevOpsDays may be a good source. Meetups are also an option. For example, I am in the San Francisco Bay Area, which has a San Francisco Reliability Engineering meetup.

These are all in addition to the usual recruitment channels and referrals. Those channels can just be slower than usual for SRE roles, since SRE is still somewhat of a niche field.
