Beginning with SRE
This post is an introduction to some basic SRE practices we have been implementing at my company recently.
I’ve written before on SRE, including on SRE resources, SLIs, SLOs and SLAs, and Creating an SRE team, but this is a more practical guide to getting started.
Getting started with SRE
OK, so you realize there is benefit to implementing SRE practices. You want to be customer-centric. You want to be aware of things that will impact your customers, and whether your releases are actually making your service better or worse, e.g. has availability been affected?
So, what do you start looking at? In this post, I suggest starting by focussing on:
- Errors
- Runbooks
- SLOs, starting with availability
Errors
TL;DR: Alert on errors, with exclusions where needed. Start with email alerts rather than pages, to avoid pager fatigue.
Get a handle on the errors your service is generating. Understand and minimize the errors. To begin with, spurious or incorrectly logged errors may be the background noise that stops you hearing the true state of your system.
You want to get to the point where any alert you receive for an error is a real issue, one you actually want to be alerted about. To get there, you may need to:
1. Improve your log hygiene
If you plan to alert on every error, you should only log things as errors that you would want to be awoken during the night for. Some examples of things being logged as errors that shouldn’t be are:
- Client errors. For example, if a client passes in an empty value when a valid value was expected, it is unlikely you want to be paged. Instead, consider logging it at INFO level and returning an HTTP 4xx error (see the sketch below).
- Connection issues that your code will retry. Don’t log an error until the retries have failed. No one wants to get paged at 2am, check the logs, see “retry succeeded”, and go back to bed.
- Exceptions. Not all exceptions are bad. Some may be normal, if infrequent, flows that you don’t necessarily need to be alerted about.
I like this quote from Use Logging Levels Consistently which says that “The Error level should only be used when the application really is in trouble. Users are being affected without having a way to work around the issue. Someone must be alerted to fix it immediately, even if it’s in the middle of the night.”
So, the first attempt at alerting on errors will often expose a huge number of errors in your logs, and likely result in a phase of log hygiene improvement.
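As a minimal sketch of the first two points above (assuming SLF4J for logging and Spring’s ResponseEntity; the endpoint and the upstream call are hypothetical), only the genuinely broken cases end up at ERROR:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.http.ResponseEntity;

public class AccountHandler {

    private static final Logger log = LoggerFactory.getLogger(AccountHandler.class);

    // Client error: the caller sent bad input, so log at INFO and return a 4xx,
    // rather than logging an ERROR that would page someone.
    public ResponseEntity<String> getAccount(String accountId) {
        if (accountId == null || accountId.isBlank()) {
            log.info("Rejected request with empty accountId");
            return ResponseEntity.badRequest().body("accountId is required");
        }
        return ResponseEntity.ok("account details for " + accountId);
    }

    // Retryable connection issue: only log an ERROR once the retries are exhausted.
    public void refreshCache(int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                fetchFromUpstream(); // hypothetical call that may fail transiently
                return;
            } catch (Exception e) {
                if (attempt == maxAttempts) {
                    log.error("Cache refresh failed after {} attempts", maxAttempts, e);
                } else {
                    log.warn("Cache refresh attempt {} failed, will retry", attempt, e);
                }
            }
        }
    }

    private void fetchFromUpstream() { /* hypothetical upstream call */ }
}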
2. Permanently exclude “OK” errors from your alerts
Exclude errors being logged by 3rd party libraries which you have no control over and which you know are not a real issue for your service. For example, a library that logs an error for a “file not found” when that is a routine occurrence in your business logic.
3. Temporarily exclude work-in-progress errors from your alerts
That is, exclude errors that you already know about but aren’t dealing with immediately. If you find yourself saying something like “Oh yeah, that’s because of a problem with X and we’re dealing with it in this sprint. You can ignore it for now because Y”, then you probably do not want to be paged for it tonight.
Runbooks
TL;DR: You are now being paged for errors. Document known issues and required actions.
With alerts in place, you need a way to empower those on the support rotation to deal with them. That is where runbooks come into play.
Wikipedia states that “a runbook is a compilation of routine procedures and operations that the system administrator or operator carries out. Typically, a runbook contains procedures to begin, stop, supervise, and debug the system.”
According to AWS, the purpose of a runbook is to “Enable consistent and prompt responses to well understood events by documenting procedures in runbooks.”
Ever been paged at 2am for an error that you don’t understand? It is a scary feeling. Having the error documented in an easily searchable runbook can be a godsend. For example, “If error A occurs, you may need to restart service B. If that fails, contact team C to request assistance”.
Even better than runbooks, though, are automated responses; and having to write up long runbook entries can often bring the realization that there may be a better solution.
Note that runbooks need to live and breathe; to be updated and maintained. Well-written runbooks are a sign of a healthy team and ecosystem.
Define SLOs
An important aspect of Site Reliability Engineering is measuring and improving system metrics. I talk more about metrics at https://www.shaunabram.com/sli-slo-sla/.
Here, I will use the terms SLIs and metrics interchangeably. Ditto for SLOs and targets.
Why set targets (SLOs) at all? I like this quote:
“… it is difficult to do your job well without clearly defining well.
SLOs provide the language we need to define well.”— Theo Schlossnagle, Circonus; Seeking SRE, Chapter 21
Another related quote (which I came across both in the SRE ebook and in “The SRE I Aspire to Be” talk):
“When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind”— Lord Kelvin; Popular Lectures Vol. I, p. 73
Also, consider setting both internal and external SLOs, with the internal being a stricter standard, one that would trigger alerts sooner than the external one that you commit to for users. This is discussed in Chapter 4, Service Level Objectives, of the SRE Book:
Using a tighter internal SLO than the SLO advertised to users gives you room to respond to chronic problems before they become visible externally.
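To make the distinction concrete, here is a minimal sketch (the thresholds and the check itself are hypothetical, not from the book) where the internal SLO acts as the earlier trip-wire:

// Hypothetical thresholds: alert internally before the externally committed SLO is at risk.
public class SloCheck {

    static final double EXTERNAL_SLO = 99.9;  // the availability we commit to for users
    static final double INTERNAL_SLO = 99.95; // stricter internal standard, trips alerts sooner

    static String evaluate(double measuredAvailabilityPercent) {
        if (measuredAvailabilityPercent < EXTERNAL_SLO) {
            return "External SLO breached: treat as an incident";
        } else if (measuredAvailabilityPercent < INTERNAL_SLO) {
            return "Internal SLO at risk: investigate before users notice";
        }
        return "Within SLO";
    }

    public static void main(String[] args) {
        System.out.println(evaluate(99.93)); // below the internal SLO, still above the external one
    }
}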
OK, so you want to define a target (SLO). But to be able to define a target, you first need a way to measure what it currently is.
How do you measure metrics/SLIs?
How do you measure metrics, or SLIs, in general?
From Chapter 4 Service Level Objectives: Collecting Indicators of the SRE Book:
Many indicator metrics are most naturally gathered on the server side, using a monitoring system such as Prometheus, or with periodic log analysis—for instance, HTTP 500 responses as a fraction of all requests. However, some systems should be instrumented with client-side collection, because not measuring behavior at the client can miss a range of problems that affect users but don’t affect server-side metrics. For example, concentrating on the response latency might miss poor user latency. Measuring how long it takes for a page to become usable in the browser is a better proxy for what the user actually experiences.
A more detailed approach is discussed in this talk on “Stop Talking & Listen; Practices for Creating Effective Customer SLOs” from Cindy Quach (SRE @ Google) at QCon 2019. In it, Cindy talks (just before the 13th minute mark) about how to measure SLIs, including
- Server-side log processing
Relatively easy to start but doesn’t cover requests that didn’t reach your service.
- App server metrics
Similar to the above in that it is relatively fast and easy, but doesn’t provide any client-side perspective.
- Front end infrastructure metrics
This means utilizing metrics from your load balancing infrastructure. In Cindy’s words, you probably already have this, so it may be cheap, but it is not great for measuring complex requirements (although I’m not sure what “complex requirements” really means).
- Synthetic clients
Also known as probers. This is where a client sends a fabricated request and monitors the response, similar to black-box monitoring. This has the advantage of measuring all steps of a multi-request user journey, but the downside of drifting into the territory of integration testing (see the sketch after this list).
- Client instrumentation
This means adding observability features to the client that the user is interacting with, and then logging the events back to your serving infrastructure. Cindy highlights this as the most effective approach since it measures user experience accurately, but points out that there can be high variability and it may measure things out of your control.
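As a minimal sketch of the synthetic client approach mentioned above (using only Java’s built-in HTTP client; the URL is a placeholder), a prober sends a fabricated request on a schedule and records whether it succeeded:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Simple prober: sends a fabricated request and checks the response,
// effectively black-box monitoring of the user-facing path.
public class AvailabilityProber {

    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2))
            .build();

    public static boolean probe(String url) {
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                    .timeout(Duration.ofSeconds(5))
                    .GET()
                    .build();
            HttpResponse<String> response = CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
            // Treat any non-5xx response as available; in practice you would
            // report this result to your metrics system rather than just return it.
            return response.statusCode() < 500;
        } catch (Exception e) {
            return false; // timeouts and connection failures count against availability
        }
    }

    public static void main(String[] args) {
        System.out.println("probe succeeded: " + probe("https://myservice.example.com/api/myservice"));
    }
}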
OK, now that we know how to measure metrics/SLIs in general, let’s move on to measuring some specific metrics/SLIs: availability and latency.
Availability
Before we talk about how to measure availability, and then how to set targets (SLOs) for it, let’s talk about what it is…
What is availability
Some definitions from folks smarter than I:
- Whether a system is able to fulfill its intended function (SRE fundamentals from google.com)
- What percentage of the time a service is functioning. This is also referred to as the “uptime”. (Availability, Maintainability, Reliability from blameless.com)
- The system’s ability to respond and avoid downtime – from The “Chaos Engineering” book (which I summarized at shaunabram.com/chaos-engineering-book-summary)
In my post on SLIs, SLOs and SLAs, I talked about availability being the fraction of time that a service is usable, sometimes measured as the percentage of well-formed requests that succeed, e.g. 99.9%.
Availability is an example of a Service Level Indicator, that is, a measure indicating how your service is performing.
An SLO, or a target, for availability might be:
The application will be available 99.95% of the time over any given 24 hour period
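For context, an SLO of 99.95% over a 24-hour period allows roughly 43 seconds of downtime in that window (0.05% of 86,400 seconds is about 43.2 seconds).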
How to measure availability
Here are some suggestions for measuring availability…
HTTP status codes from server-side logs
Under the category of “Server-side log processing”, we can look at the HTTP status codes, from client requests, recorded in our server-side logs.
Cindy makes a point in her talk that uptime != availability. For example, if your service runs on three servers, and one of the servers experiences downtime, it may not affect your users at all. Or if in fact all three servers experience downtime at the same time, your users may also not notice if that is a time that they typically do not use your service. Instead, we should measure availability based on customer usage.
Chapter 2 of the SRE Workbook, Implementing SLOs, talks about calculating what it calls “success rate” (and what I would call availability) as:
Number of successful HTTP requests / total HTTP requests
And that is certainly a useful approach. For example, 99 good responses out of 100 total responses gives you 99% availability.
An example of a Splunk search for such a measurement might be:
host=myservice*
| bin span=1h _time
| eval status=case(like(httpstatus, "2%") OR like(httpstatus, "4%"), "non5xx", like(httpstatus, "5%"), "5xx")
| stats count(eval(status="non5xx")) as goodCount count(eval(status="5xx")) as badCount by _time
| eval availability=round(100*goodCount/(goodCount+badCount), 2)
| table _time availability
HTTP status codes from client-side logs
Under the category of “Front end infrastructure metrics”, we can also measure on the client side. The problem with the server-side approach is that if your server goes down completely, it won’t be reflected in your availability stats at all! So, another approach is to measure outside of the server. Since measuring and gathering data from every client may be difficult, a useful intermediary can be your load balancer. It should have a record of every request and response.
In Splunk, this query would be almost identical to the one above, except that you would limit the data to the load balancer and to your specific API. For example, instead of
host=myservice*
you could use
index=loadbalancer /api/myservice
Measuring availability using health checks
Most servers have some form of health check already built in. For example, support for health checks comes out of the box with Spring Boot. You may well already be polling the health check regularly (and alerting if there is an issue). So one option is to start recording how the server responds to these health checks. You could do this on the server or client side. These might come under the category of “Synthetic clients” since they are fabricated requests.
They do have the downside of really only testing whether the health check is up, not whether the app is actually functioning OK or delivering business value; e.g. all useful endpoints may be down, but the health check still looks OK.
So, one workaround is to have more sophisticated health checks which actually do meaningful checks. For example, if you have a “FileStorageService”, your health check could write and read a test file as a way of verifying that the most basic features are working.
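As a minimal sketch of that idea (assuming Spring Boot Actuator is on the classpath; the FileStorageService and its write/read methods are hypothetical), a custom HealthIndicator might look like:

import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

// A health check that verifies basic read/write functionality,
// rather than just reporting that the process is up.
@Component
public class FileStorageHealthIndicator implements HealthIndicator {

    private final FileStorageService fileStorageService; // hypothetical service from the example above

    public FileStorageHealthIndicator(FileStorageService fileStorageService) {
        this.fileStorageService = fileStorageService;
    }

    @Override
    public Health health() {
        try {
            // Write a small test file and read it back as a basic smoke test
            fileStorageService.write("healthcheck.txt", "ping");
            String contents = fileStorageService.read("healthcheck.txt");
            if ("ping".equals(contents)) {
                return Health.up().build();
            }
            return Health.down().withDetail("reason", "read-back mismatch").build();
        } catch (Exception e) {
            return Health.down(e).build();
        }
    }
}

Spring Boot Actuator will then include this indicator in the /actuator/health response, so anything already polling the health check benefits from the deeper check.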
OK, great, now you have measured current availability. So what should your target availability be?
Setting an availability target
The SRE Book, in particular Chapter 4, Service Level Objectives, says:
“Don’t pick a target based on current performance
While understanding the merits and limits of a system is essential, adopting values without reflection may lock you into supporting a system that requires heroic efforts to meet its targets, and that cannot be improved without significant redesign.”
Still, measuring the current state is a good place to begin, as described above. If it is lower than you expect, it is a great kick in the butt to start improving. But often we found that it was in fact higher than we expected, e.g. over a time period of a few days, it is entirely possible that you may actually see 100% availability. That of course would not make a good (or attainable) target.
Regardless of what your current measured availability is, you should set targets by thinking about how much downtime your system might realistically be expected to handle. Get input from your customers, users and business partners. As this article (atlassian.com) says, “Craft SLAs around customer expectations”.
And when setting a target, consider how your “nines” translate into the amount of time a service would be unavailable (see Wikipedia’s availability percentage calculation); a small sketch after the list below shows the arithmetic.
For example
- Five nines (99.999%) translates into less than a second of downtime per day. That is an incredibly difficult target to achieve.
- Two nines (99%) allows a much more generous 15 minutes or so per day, or almost 4 days of outage per year. Of course, that may have very real customer impact that you cannot afford. So, find a middle ground that works for you.
- Three nines (99.9%) may be a reasonable compromise for an internally facing service, for example. That allows a little under 2 mins (1.44) per day, or 10 minutes per week.
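Here is the arithmetic as a small sketch (plain Java, no particular library), reproducing the numbers above:

// Back-of-the-envelope calculation of the downtime allowed by an availability target.
public class ErrorBudget {

    static double allowedDowntimeSeconds(double availabilityPercent, double periodSeconds) {
        return (1.0 - availabilityPercent / 100.0) * periodSeconds;
    }

    public static void main(String[] args) {
        double day = 24 * 60 * 60; // seconds in a day
        double week = 7 * day;
        double year = 365 * day;

        System.out.printf("99.999%% -> %.2f seconds per day%n", allowedDowntimeSeconds(99.999, day));      // ~0.86 s
        System.out.printf("99.9%%   -> %.2f minutes per day%n", allowedDowntimeSeconds(99.9, day) / 60);   // ~1.44 min
        System.out.printf("99.9%%   -> %.1f minutes per week%n", allowedDowntimeSeconds(99.9, week) / 60); // ~10 min
        System.out.printf("99%%     -> %.1f days per year%n", allowedDowntimeSeconds(99.0, year) / day);   // ~3.7 days
    }
}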
Finally, an interesting anecdote on SLOs. One company would ensure that it would meet, but not exceed, its service level objective as follows:
In any given quarter, if a true failure has not dropped availability below the target, a controlled outage will be synthesized by intentionally taking down the system. In this way, we are able to flush out unreasonable dependencies on Chubby shortly after they are added. Doing so forces service owners to reckon with the reality of distributed systems sooner rather than later.
That seems incredibly advanced. I think most of us are a long way from needing to deliberately break our services!
References & Resources
- The Art of SLOs (google.com)
- How to get into SRE
- The SRE I Aspire to Be (conference video from SREcon19)
- 20 essential books for SRE
See also https://www.shaunabram.com/sre-resources/
Tags: servicelevelagreements, sitereliabilityengineering, sla, sli, slo, sre, thesrebook