SLI, SLO and SLA

What are SLIs, SLOs and SLAs? 

Service Level Indicators (SLIs) are metrics that you choose to measure the health and performance of your services. Service Level Objectives (SLOs) are the desired targets for those indicators. Service Level Agreements (SLAs) build on this and include the consequences of not meeting those targets. All three are fundamental to Site Reliability Engineering.

In this post, I’ll try to explain each in more detail, show how they relate to each other, and give some examples of each.

SLIs

I think of Service Level Indicators (SLIs) as measures of how your service is performing.

Other, likely better, definitions of SLI include:

  • A quantitative measure of some aspect of the level of service that is provided – SRE Book – Ch4
  • A single, measurable metric related to service behavior – Seeking SRE Book – Ch 9
  • A measure of the service level provided by a service provider – Wikipedia
  • Key measurements of the availability of a system – New Relic

Common SLIs include:

  • Latency: how long it takes to respond to a request
  • Error rate: the percentage of all requests received that resulted in errors
  • Throughput: typically measured in requests per second
  • Availability: the fraction of time that a service is usable, sometimes measured as the percentage of well-formed requests that succeed, e.g. 99.9% (100% is neither desirable nor possible)
  • Apdex: a standard for measuring users’ satisfaction with response time

Availability is by far the most common metric, from an SRE perspective.

These measurements are usually aggregated over a period of time, for example per minute or per day.
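To make the SLIs above concrete, here is a minimal Python sketch that computes a few of them from a list of request records. The `Request` type, its field names, and the sample values are my own assumptions for illustration, not part of any monitoring tool. The Apdex function uses the usual scoring, where responses within a threshold T count as satisfied and those within 4T count as tolerating.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one request (names are illustrative)."""
    latency_ms: float
    ok: bool  # True if the request succeeded


def compute_slis(requests, window_seconds):
    """Aggregate basic SLIs over one time window."""
    total = len(requests)
    if total == 0:
        raise ValueError("no requests in this window")
    errors = sum(1 for r in requests if not r.ok)
    return {
        "avg_latency_ms": sum(r.latency_ms for r in requests) / total,
        "error_rate_pct": 100.0 * errors / total,
        "availability_pct": 100.0 * (total - errors) / total,
        "throughput_rps": total / window_seconds,
    }


def apdex(latencies_ms, threshold_ms):
    """Apdex score: (satisfied + tolerating / 2) / total."""
    if not latencies_ms:
        raise ValueError("no latencies to score")
    satisfied = sum(1 for l in latencies_ms if l <= threshold_ms)
    tolerating = sum(
        1 for l in latencies_ms if threshold_ms < l <= 4 * threshold_ms
    )
    return (satisfied + tolerating / 2) / len(latencies_ms)


# Made-up sample: four requests observed over a two-second window.
sample = [
    Request(80, True),
    Request(120, True),
    Request(300, False),
    Request(100, True),
]
slis = compute_slis(sample, window_seconds=2)
print(slis)
print(apdex([r.latency_ms for r in sample], threshold_ms=100))
```

Here availability is measured as the percentage of requests that succeed, which is one of the two definitions mentioned above.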

SLOs

Service Level Objectives are target value(s) for SLIs. While SLIs are measurable metrics, SLOs are what we want those metrics to be.

Other definitions of SLO include:

  • Goals we set for how much availability we expect out of a system – New Relic
  • A means of measuring the performance of the Service Provider – Wikipedia
  • A threshold you establish for your defined SLI – Pivotal

Example SLOs:

  • While latency is an SLI, the SLO might be:
    • Average request latency should be less than 100 milliseconds
    • 99.5% of requests will be completed in 5ms
    • 99% of all requests per day should be served in 200 ms
  • And while availability is an SLI, the SLO might be:
    • The application will be available 99.95% of the time over any given 24 hour period
  • SLOs can cover multiple SLIs too. For example: 
    • 99.99% of requests will succeed, each with under 100 ms latency
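As a rough sketch of how one of these SLOs could be checked, the Python below tests "99% of all requests per day should be served in 200 ms" against a day's worth of latencies. The function names and the sample data are hypothetical, and I have read "served in 200 ms" as "at most 200 ms".

```python
def fraction_within(latencies_ms, threshold_ms):
    """SLI: fraction of requests served within the latency threshold."""
    if not latencies_ms:
        raise ValueError("no requests to evaluate")
    fast = sum(1 for l in latencies_ms if l <= threshold_ms)
    return fast / len(latencies_ms)


def meets_latency_slo(latencies_ms, threshold_ms=200, target=0.99):
    """SLO: the SLI must be at or above the target."""
    return fraction_within(latencies_ms, threshold_ms) >= target


# Made-up days: 5 slow requests out of 1000, then 20 out of 1000.
good_day = [50] * 995 + [500] * 5
bad_day = [50] * 980 + [500] * 20

print(meets_latency_slo(good_day))  # 99.5% within 200 ms
print(meets_latency_slo(bad_day))   # 98.0% within 200 ms
```

The split between the two functions mirrors the split between the terms: the first computes the SLI, the second compares it with the SLO.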

SLAs

I simply think of an SLA as an SLO with consequences. In other words, an SLA is an SLO with a compensatory aspect. To continue one of the examples from above, the SLO might be:

99% of all requests per day should be served in 200 ms

but the SLA might be 

99% of all requests per day should be served in 200 ms, otherwise 25% of the daily subscription fee will be refunded
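The example SLA above could be sketched in Python like this. The function name and the daily fee of 10.0 are assumptions of mine for illustration. The 99% target and 25% refund come from the example.

```python
def daily_refund(fraction_within_200ms, daily_fee,
                 target=0.99, refund_rate=0.25):
    """Refund owed for one day under the example SLA.

    fraction_within_200ms is the measured SLI for that day.
    """
    if fraction_within_200ms >= target:
        return 0.0  # SLA met, nothing owed
    return daily_fee * refund_rate


# Hypothetical daily subscription fee of 10.0 (any currency).
print(daily_refund(0.995, daily_fee=10.0))  # SLA met
print(daily_refund(0.98, daily_fee=10.0))   # SLA missed
```

The only difference from a plain SLO check is the return value: a missed SLO gives you a boolean, a missed SLA gives you a bill.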

Other definitions of SLA include:

  • SLAs are the legal contracts that explains what happens if our system doesn’t meet its SLO – New Relic
  • An SLA is simply an SLO that two or more parties have agreed to. Consider SLAs as “external” multiparty agreements and SLOs as “internal” single-party goals. – Seeking SRE, Chapter 21. The Art and Science of the Service-Level Objective

Example SLAs:

  • If the Monthly Uptime Percentage is less than 95.0%, you will be eligible to receive a Service Credit of 100% – AWS S3 SLA

 

SLOs vs SLAs

In addition to carrying a consequence, your SLA is typically more lenient than your SLO. So, if your service begins to degrade, you will miss your (higher standard) SLO first, which acts as an early warning. If things continue to degrade, you will eventually miss your (lower standard) SLA too. So, again using the examples from above, an SLO might be

99.9% of all requests per day should be served in 200 ms

And the SLA might be 

99% of all requests per day should be served in 200 ms, otherwise 25% of the daily subscription fee will be refunded

In this example, the SLA contains both a lower target, and a consequence. 
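The early-warning idea can be sketched as a simple two-threshold check in Python. The function name and status strings are my own; the 99.9% and 99% targets are the ones from the example.

```python
def classify(fraction_within_200ms, slo_target=0.999, sla_target=0.99):
    """Compare one day's SLI against the stricter SLO and the looser SLA."""
    if fraction_within_200ms >= slo_target:
        return "healthy"
    if fraction_within_200ms >= sla_target:
        return "SLO missed (early warning)"
    return "SLA breached (refund due)"


# Made-up daily measurements, from good to bad.
for measured in (0.9995, 0.995, 0.98):
    print(measured, classify(measured))
```

The gap between the two targets is the room you have to react before a degradation starts costing money.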

Also, note that both SLOs and SLAs should typically be bounded by a time period (although I haven’t always stuck to this in the examples above!). For example, “99.9% of all requests per day should be served in 200 ms” is bounded to a single day.

 

Conclusion

I like this summary from New Relic:

  • SLIs are the key measurements of the availability of a system
  • SLOs are goals we set for how much availability we expect out of a system
  • SLAs are the legal contracts that explains what happens if our system doesn’t meet its SLO

 

And as the Google engineers put it:

SLIs drive SLOs which inform SLAs

Finally, when setting your SLOs and SLAs, remember that the closer you get to (the unattainable) 100% availability, the harder (and more expensive) it becomes to run and maintain your system. It is therefore generally best to commit to the lowest SLO and SLA that your business can handle.
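To get a feel for how quickly each extra "nine" shrinks your margin, here is a small Python sketch that converts an availability target into the downtime it allows. The 30-day period and the function name are assumptions for illustration.

```python
def allowed_downtime_minutes(availability_pct, period_days=30):
    """Minutes of downtime permitted by an availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Over 30 days this comes to roughly 432, 43.2, 4.32 and 0.43 minutes.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.2f} minutes")
```

Each additional nine cuts the allowed downtime by a factor of ten, which is a large part of why the cost climbs so steeply as you approach 100%.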

 
