
 

eBook Summary: What Is SRE?

“What Is SRE? An Introduction to Site Reliability Engineering” (free, but registration required) is an ebook by Kurt Andersen and Craig Sebenik, published by O’Reilly. The following is a summary (abridged copy and paste) of the parts I found most useful, with a few of my own notes. The original is about 9,000 words; this is about 2,000.

 

 

Defining SRE

  • S: The original interpretation of the “S” (“Site,” as in “website”) has expanded over time to include “System,” “Service,” “Software,” and even more widely “online Stuff.”
  • R: Reliability or even Resilience
  • E: “E” can stand for the practice (“Engineering”) or the people (“Engineers”)

In general, SREs work across the realm of “Anything” as a Service, whether that is Infrastructure (IaaS), Networking (NaaS), Software (SaaS), or Platforms (PaaS).

The use of service level indicators (SLIs) and service level objectives (SLOs) as meaningful indicia of service health is one of the distinguishing characteristics of SRE practice. SLOs are symptoms of a healthy relationship between the reliability (SRE) team and the feature team, not a mandate. Major areas of expertise can include:

  • Release engineering
  • Change management
  • Monitoring and observability
  • Managing and learning from incidents
  • Self-service automation
  • Troubleshooting
  • Performance
  • The use of deliberate adversity (chaos engineering)

SRE works to sustainably achieve the appropriate level of reliability for services through data-informed production feedback loops.

I would personally phrase the major components of SRE as being:

  • Monitoring and observability (for prevention and early detection)
  • Troubleshooting (for when things do go wrong)
  • Incident management, and conducting RCAs (to learn and reduce the likelihood of repeats)
  • Resilience Engineering (aka Chaos Engineering)
  • (Last but not least) Reduction of toil, and self-service automation

It may also include change management, release engineering (DevOps), and performance tuning.

 

Digging into the Terms in These Definitions

Production Feedback Loops

Feedback loops are about communication between the social and technical aspects of an organization. Inadequate feedback and communication channels lead to scenarios such as the classical divide, and conflict, between feature developers and operations. An SRE’s role includes establishing and maintaining feedback loops from operations to the feature developers. Feedback loops let developers know if services are not working well.

Data-Informed

It is critical that feedback loops be automated in order to scale, and that they rely on data rather than opinion. Continually improving measurements to adequately inform product decisions is one of the benefits of having a standing SRE team.

As noted by Lord Kelvin:

When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science, whatever the matter may be.

Appropriate Levels of Reliability

The suppliers of essential consumer services, like water and electricity, put a significant amount of work into making them “always” available, but in reality those too have frequent outages. Often these are so short that they go unnoticed by end consumers, but occasionally they are so prolonged that the loss of service becomes a headline issue.

In reality, no service is truly always available. The paradox of trying to make a service never have an outage is best captured by this old Chinese phrase:

“A one-foot stick, every day take away half of it, in a myriad ages it will not be exhausted”

This applies directly to reducing outages. If a service has 500 units of outage in a given measurement period, it will take progressively greater efforts to maintain that same cumulative outage count as longer and longer measurement periods are considered.
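As a rough illustration of why each further improvement costs more, the sketch below converts an availability target into a downtime allowance. The 30-day window and the function name `allowed_downtime_minutes` are assumptions made for this example; they are not taken from the ebook.

```python
# Assumption: a 30-day measurement period, expressed in minutes.
MINUTES_PER_30_DAYS = 30 * 24 * 60


def allowed_downtime_minutes(
    availability: float, period_minutes: float = MINUTES_PER_30_DAYS
) -> float:
    """Return how many minutes of outage a target availability (0..1) permits."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_minutes


# Print the allowance for a few common targets.
for target in (0.99, 0.999, 0.9999, 0.99999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.3%} available -> {minutes:.2f} minutes of outage per 30 days")
```

Each extra “nine” shrinks the allowance by a factor of ten, so the remaining outage keeps getting smaller but never reaches zero.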

Determining the appropriate level of availability for a service is a business decision, not a technical one. The usual industry terms in this realm are:

  • Service level indicator (SLI): What you measure and where the measurement is taken
  • Service level objective (SLO): The goal or threshold of acceptable values for the SLI within a given time period

For more on this topic, see Chapter 4 of Site Reliability Engineering and Chapters 2 and 3 of The Site Reliability Workbook, edited by Betsy Beyer et al. (O’Reilly).
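To make the two terms concrete, here is a minimal sketch that assumes the SLI is a simple ratio of good requests to total requests. The `Slo` type, the function names, and the choice to treat an empty window as healthy are all hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A target for a ratio-style SLI over one measurement window (hypothetical type)."""

    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must be good


def sli(good_events: int, total_events: int) -> float:
    """Return the SLI as the fraction of good events in the window."""
    if good_events < 0 or total_events < 0 or good_events > total_events:
        raise ValueError("need 0 <= good_events <= total_events")
    if total_events == 0:
        # Assumption: a window with no traffic counts as fully healthy.
        return 1.0
    return good_events / total_events


def meets_slo(slo: Slo, good_events: int, total_events: int) -> bool:
    """Return True when the measured SLI is at or above the SLO target."""
    return sli(good_events, total_events) >= slo.target


availability = Slo(name="availability", target=0.999)
print(meets_slo(availability, good_events=99_950, total_events=100_000))
```

The SLI is the measurement; the SLO is the threshold it is compared against, and choosing that threshold remains the business decision described above.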

Sustainable

A site needs to have an appropriate target of availability, based on an analysis of the business costs and benefits.

Avoid the need for heroic measures and instead strive for response patterns and system capabilities that do not require extraordinary efforts. This leads to valuing low-noise, actionable alerting, automated response and remediation, and self-service platforms.

“Sustainable” also drives the emphasis on blameless postmortems to learn from the failures that do happen so that the systemic defects that led to a failure can be addressed in both current and future services.

Reliability-Focused Engineering Work

An SRE team should be working on projects that will “make tomorrow better than today”: fixing reliability problems and building supporting tools.

In some cases this may involve building out continuous integration/continuous delivery (CI/CD) pipelines for the organization, but in many cases SREs take that level of automation for granted and are able to focus elsewhere: fixing design and code choices that degrade reliability, or working on monitoring/alerting/observability, capacity modeling/forecasting, load balancing, chaos engineering, or dozens of other areas appropriate to a given organization’s needs.

Organizational Model

Effective and successful SRE teams don’t happen by accident. The discussions and agreements around SLOs take time and effort. In order to make reliability a priority, an organizational commitment is needed to allow SREs to focus on engineering reliability.

Where Did SRE Come From?

Site Reliability Engineering is an outgrowth of the “always-on” world of online services. In that world, the time to detect (TTD) a problem and the time to respond to (TTR) or mitigate (TTM) it become critical measures.
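A minimal sketch of how these measures could be derived from incident timestamps follows. The `Incident` record and its field names are hypothetical, and measuring time to mitigate from the start of user impact (rather than from detection) is an assumption; teams define these intervals differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    """Key timestamps for one incident (hypothetical record layout)."""

    started: datetime    # when the problem began affecting users
    detected: datetime   # when monitoring or a person noticed it
    mitigated: datetime  # when user impact ended


def time_to_detect(incident: Incident) -> timedelta:
    """TTD: how long the problem went unnoticed."""
    return incident.detected - incident.started


def time_to_mitigate(incident: Incident) -> timedelta:
    """TTM: assumed here to run from the start of impact to mitigation."""
    return incident.mitigated - incident.started


def mean_duration(durations: list[timedelta]) -> timedelta:
    """Average a non-empty list of durations."""
    if not durations:
        raise ValueError("need at least one duration")
    return sum(durations, timedelta()) / len(durations)


incident = Incident(
    started=datetime(2024, 1, 1, 12, 0),
    detected=datetime(2024, 1, 1, 12, 5),
    mitigated=datetime(2024, 1, 1, 12, 35),
)
print(time_to_detect(incident), time_to_mitigate(incident))
```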

The term SRE can be traced back to Google, around 2003, when they realized that traditional approaches could not scale to handle the massive growth of online services and began to apply software engineering approaches to the previously manual processes of system operations.

What’s the Relationship Between SRE and DevOps?

One definition of DevOps is:

“the union of people, process, and products to enable continuous delivery of value to our end users.”

And the distinction between DevOps and SRE could be described as:

DevOps focuses on engineering continuous delivery to the point of deployment; SRE focuses on engineering continuous operations at the point of customer consumption.

So, the priority for SRE teams is on the “delivery of value to end users”. Value can’t be delivered if end users can’t rely upon accessing a service, hence the importance of identifying and tracking service reliability. By focusing on value delivery, SRE provides a complement to teams that focus on developer productivity and CI/CD pipelines.

 

 

Understanding the SRE Role

Culture/Capabilities/Configuration

The underlying components in highly innovative companies could be classified into the “three C’s”:

  • Culture: An organisation’s shared values and beliefs
  • Capabilities: The combination of skills, technology, and knowledge
  • Configuration: The structure of the organisation

These same dynamics apply to reliability.

 

Culture

Executive-level support for SRE is essential, but a culture that recognizes and rewards work that enhances reliability provides the environment in which SRE can be viable.

Other important cultural components include:

  • Fostering a learning mindset
  • Always looking for continuous improvement
  • Establishing psychological safety to enable truth telling

Capabilities

For SREs, the ideal team member has a broad understanding of computer system dynamics—especially distributed systems. Effective SREs are able to zoom out to deal with system interrelationships and to zoom in, as needed, to debug the bit-level intricacies of networking or memory usage patterns.

 

Configuration

A strong SRE practice requires reporting structures that allow SREs to be correctly evaluated and rewarded, not just on how quickly features get shipped. SRE success can be tracked across five areas that contribute to reliability:

  • Providing useful monitoring frameworks that empower system understanding
  • Characterizing, measuring, and improving availability and performance
  • Accurately forecasting capacity requirements and improving efficiencies without undue impact on feature deployment velocity
  • Improving velocity through reduction of toil and manual exception handling
  • Effectively handling and learning from incidents

Distinguishing SRE from Other Operational Models

 

SRE for Internal Services

SRE practices can be equally beneficial when applied to internal-facing platforms. SRE is possibly even more relevant in a situation where everything becomes a platform than it is for standalone functionality. User-facing services can’t provide higher reliability than that provided by their critical backend dependencies. Adequate feedback loops help to avoid cascading problems and the so-called “dark” debt within distributed systems.
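The point about backend dependencies can be sketched with a little arithmetic. The example assumes that the service needs every listed dependency to be up and that the dependencies fail independently, which is a simplification; the function name is hypothetical.

```python
import math
from typing import Iterable


def serial_availability(dependency_availabilities: Iterable[float]) -> float:
    """Best-case availability of a service that needs every dependency to be up.

    Assumes independent failures, so the availabilities multiply.
    """
    values = list(dependency_availabilities)
    for value in values:
        if not 0.0 <= value <= 1.0:
            raise ValueError("each availability must be between 0 and 1")
    return math.prod(values)


# Three critical backends, each available 99.9% of the time.
combined = serial_availability([0.999, 0.999, 0.999])
print(f"upper bound for the user-facing service: {combined:.4%}")
```

Under these assumptions the combined figure can never exceed the weakest dependency, and it drops further with every critical dependency that is added.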

 

Implementing SRE

When implementing a new SRE team, the main factor is whether you are starting “fresh”—a “greenfield” project—or taking a “brownfield” approach and migrating an existing team. In either scenario, the amount of cultural change needed can be daunting. Even before a team is formed, one must prioritize the work to be done. A guide for figuring out where to start is the Hierarchy of Reliability…

Hierarchy of Reliability

With pressure to get things working quickly, you may need a way to explain “reliability” in a simple and straightforward way. The Hierarchy of Reliability (based on Maslow’s Hierarchy of Needs from psychology) may help.

Each level is not exclusively dependent on the levels below it but they build on one another. When each level is done well, then the other levels naturally benefit.

For example, your company could have a product without monitoring but nobody would know if half of your customers only saw error pages.

Finally, being aware of the impact that outages and being on call have on the people who work in the team is fundamental.

Starting a New Organization with SRE

If you are solid at every level, then you could say that you have implemented the SRE model. You don’t have to hire people who specifically have the title “SRE” if the team is properly covering all of these bases. However, it is far too often the case that the engineers who are focused on feature development have either less interest or less knowledge when it comes to the details of the SRE role. In either case, the goal is to make sure the hierarchy is as solid as possible at every level.

The SRE is there to enable and empower every other engineer on the team. Each engineer is responsible for deploying their own code and managing their own configurations. The same goes for metrics, monitoring, etc. The SRE is the expert in these various aspects of delivering high-quality software, and is there to help the other engineers, not to implement everything for them.

An SRE needs to remain focused on the overall reliability of the site, not on new features. To assist in this, some companies use a “double reporting model.” Essentially, the SRE works (and sits) with the development team, but they report to a different organization whose mandate is not product features. It is important that SREs continue to focus on reliability and leave the product features up to other teams.

 

Introducing SRE into an Existing Organization

The people leading an SRE change have to demonstrate that the place they are trying to get to is significantly better than where they are.

In a large organization with many teams, one way to accomplish this is to find a development team that is motivated to change and implement a small SRE team (or individual) there. Over time, you can use that success as a positive example to other teams.

Wrap up SRE

SRE is an organizational model for running reliable online services by teams that are chartered to do reliability-focused engineering work.

As a discipline, SREs are devoted to helping an organization sustainably achieve the appropriate level of reliability for its services by implementing and continually improving data-informed production feedback loops to balance availability, performance, and agility.

 
