
SRE Metrics

A very quick post on some of the most commonly used SRE metrics: the four golden metrics, RED, and USE.

4 golden metrics

The 4GMs (the book itself calls them the “four golden signals”) are defined in Chapter 6, “Monitoring Distributed Systems”, of The SRE Book (also discussed at victorops.com and appoptics.com).

They are:

Latency: The time taken to respond to a request. It is often (and most easily) measured from the server side.

Errors: Define what an “error” is for you (e.g. HTTP 5xx responses, logger.error() being called, empty responses) and monitor it. While eradicating all errors is a good goal, filtering known or innocuous errors out of your alerts may be a pragmatic necessity.

Traffic: The number of transactions, or incoming requests, your service receives. What counts as “high” or excessive traffic is likely to vary enormously from service to service.

Saturation: I think of saturation as a measure of utilization of your service. At what point does it become over-utilized, so that performance degrades?
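As a rough illustration, the four metrics above could be summarised for one observation window along these lines. This is only a sketch under stated assumptions: `RequestRecord`, `IGNORED_ERROR_CODES`, `golden_metrics`, the worker-pool notion of saturation, and the sample data are all hypothetical and not taken from the SRE Book. An “error” is arbitrarily defined here as an HTTP 5xx response.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    duration_ms: float    # server-side time taken to respond
    status: int           # HTTP status code
    error_code: str = ""  # hypothetical application-level error tag


# Assumption: known, innocuous errors that should not trigger alerts.
IGNORED_ERROR_CODES = {"CLIENT_DISCONNECTED"}


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of values at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_metrics(records, window_s, busy_workers, total_workers):
    """Summarise one observation window; raises on an empty window."""
    durations = [r.duration_ms for r in records]
    errors = [r for r in records if r.status >= 500]  # one possible definition of "error"
    alertable = [r for r in errors if r.error_code not in IGNORED_ERROR_CODES]
    return {
        "latency_p50_ms": percentile(durations, 50),
        "latency_p90_ms": percentile(durations, 90),
        "errors": len(errors),
        "alertable_errors": len(alertable),
        "traffic_rps": len(records) / window_s,
        # Assumption: saturation approximated as the busy fraction of a worker pool.
        "saturation": busy_workers / total_workers,
    }


# Hypothetical sample: ten requests observed over a 5-second window.
records = [
    RequestRecord(12, 200), RequestRecord(15, 200), RequestRecord(11, 200),
    RequestRecord(14, 200), RequestRecord(13, 404),
    RequestRecord(250, 500, "DB_TIMEOUT"),
    RequestRecord(16, 200), RequestRecord(12, 200), RequestRecord(13, 200),
    RequestRecord(900, 503, "CLIENT_DISCONNECTED"),
]
summary = golden_metrics(records, window_s=5, busy_workers=6, total_workers=8)
print(summary)
```

For this sample window the median latency is 13 ms while the 90th percentile is 250 ms, which is one reason to look at a percentile rather than a single average. Of the two 5xx responses, only one counts towards alerting because the other carries an ignored error code.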

RED and USE metrics

These are closely related to the RED and USE metrics, which are defined as:

  • Rate (the number of requests per second)
  • Errors (the number of those requests that are failing)
  • Duration (the amount of time those requests take)

and

  • Utilization (% time that the resource was busy)
  • Saturation (amount of work resource has to do, often queue length)
  • Errors (count of error events)

respectively.
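The two definitions above can be sketched in code: RED is computed per service from request records, and USE per resource from periodic samples. Everything named here (`Request`, `ResourceSample`, `red`, `use`, and the sample values) is a hypothetical illustration rather than an established API. Using the median for Duration and the mean queue length for Saturation are choices made for this example only.

```python
import statistics
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # how long the request took
    failed: bool       # whether the request counts as failing


@dataclass
class ResourceSample:
    busy: bool          # was the resource busy when sampled?
    queue_length: int   # amount of work waiting on the resource
    error_events: int   # error events since the previous sample


def red(requests, window_s):
    """Rate, Errors, Duration for requests seen in a window of window_s seconds."""
    return {
        "rate_per_s": len(requests) / window_s,
        "errors": sum(r.failed for r in requests),
        "median_duration_s": statistics.median(r.duration_s for r in requests),
    }


def use(samples):
    """Utilization, Saturation, Errors for one resource, from periodic samples."""
    return {
        "utilization_pct": 100 * sum(s.busy for s in samples) / len(samples),
        "avg_queue_length": sum(s.queue_length for s in samples) / len(samples),
        "errors": sum(s.error_events for s in samples),
    }


# Hypothetical data: four requests over a 2-second window, four resource samples.
requests = [
    Request(0.125, False), Request(0.25, False),
    Request(0.5, True), Request(1.0, False),
]
samples = [
    ResourceSample(True, 0, 0), ResourceSample(True, 3, 0),
    ResourceSample(False, 0, 0), ResourceSample(True, 5, 1),
]

red_metrics = red(requests, window_s=2)
use_metrics = use(samples)
print(red_metrics)
print(use_metrics)
```

Both functions raise on empty input, so a real collector would need to decide how to report an idle window.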
