

Book summary: Distributed Systems Observability

“Distributed Systems Observability” is a book by Cindy Sridharan (find her on Twitter and Medium), available as a free download here (registration required). At a little over 30 pages and 8,000 words, it is not a difficult read, and I definitely recommend it.





This post is based almost exclusively on Cindy’s book and is mostly a copy and paste from it. I’ve found that the best way for me to understand something is to copy, paste, summarize, slice, dice and rehash it. If you are interested, I recommend reading Cindy’s book for yourself.

CHAPTER 1 – The Need for Observability

Systems have become more distributed, and in the case of containerization, more ephemeral. This book introduces the idea of observability, explains how it’s different from traditional monitoring and alerting, and why it’s so important when building distributed systems.

What Is Observability?

Observability means bringing better visibility into systems and acknowledging that:

  • No complex system is ever fully healthy
  • Distributed systems are unpredictable. In particular, it’s impossible to predict all the ways a system might fail
  • Failure needs to be embraced at every phase (from design to implementation, testing, deployment, and operation)
  • Ease of debugging is of high importance

Observability is a feature that needs to be built into the design of a system such that:

  • The system can be tested in a realistic manner (including testing in production)
  • A system can be tested for the types of failures that may result in alerts
  • A system can be deployed incrementally and easily rolled back
  • Post-release, a system reports enough about its health and behavior to be easily debugged and understood

CHAPTER 2 – Monitoring and Observability

Observability isn’t a substitute for monitoring; the two are complementary.

Monitoring is a way to report on the overall health of systems, including generating alerts when required. Monitoring should provide a bird’s-eye view by recording and exposing high-level metrics across multiple components (load balancers, caches, queues, databases, and stateless services).

Observability is a superset of monitoring, providing not only that high-level overview but also details on the inner workings to better support troubleshooting and more insights into system health.

Blackbox and Whitebox Monitoring

Traditionally, most alerting was derived from blackbox monitoring: observing a system from the outside—think Nagios-style checks. This can be useful to identify symptoms (e.g., “error rate is up”), but not necessarily the causes (especially in a distributed system).
As a result, blackbox monitoring is slowly falling out of favor and is perhaps best suited to infrastructure that has been outsourced and/or uses third-party software that can be monitored only from the outside.

Whitebox monitoring refers to reporting data from inside a system. Such “insider information” can result in far more meaningful and actionable alerts compared to alerts derived from external pings.
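To make the contrast concrete, here is a minimal sketch in Python. It is not from the book: `CacheService`, `internal_stats` and `blackbox_check` are hypothetical names I made up for illustration. The external probe only learns that a call succeeded, while the data reported from inside the service reveals a poor cache hit ratio.

```python
from dataclasses import dataclass, field


@dataclass
class CacheService:
    """Hypothetical service used only to contrast the two monitoring styles."""
    store: dict = field(default_factory=dict)
    hits: int = 0
    misses: int = 0

    def get(self, key: str) -> str:
        if key in self.store:
            self.hits += 1
        else:
            self.misses += 1
            self.store[key] = f"value-for-{key}"  # simulate a backend fetch
        return self.store[key]

    def internal_stats(self) -> dict:
        """Whitebox view: data reported from inside the service."""
        total = self.hits + self.misses
        return {
            "hits": self.hits,
            "misses": self.misses,
            "hit_ratio": self.hits / total if total else 0.0,
        }


def blackbox_check(service: CacheService) -> bool:
    """Blackbox view: an external probe that only sees whether a call works."""
    try:
        return service.get("probe") is not None
    except Exception:
        return False


service = CacheService()
for key in ["a", "b", "a", "c", "d"]:
    service.get(key)

print(blackbox_check(service))   # the probe only learns that the call succeeded
print(service.internal_stats())  # the inside view exposes the low hit ratio
```

An alert on `hit_ratio` would point much closer to a cause than an alert on the probe failing.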

Best Practices for Alerting

Best practices for alerting can include:

  • Alerts should be accompanied by monitoring that provides the ability to drill down into the problem and gives visibility into impact
  • All alerts need to be actionable

Debugging “Unmonitorable” Failures

Some problems are not predictable and will not be picked up by monitoring. How do we troubleshoot such production incidents? It is often an iterative process that involves the following:

  • Start with a high-level metric
  • Drill down by looking at fine-grained, contextual observations
  • Hypothesize about the cause, and test that hypothesis

Observability data from the various components of the system is required. It is difficult to do such debugging from aggregates, averages, percentiles, historic patterns, or any other forms of data primarily collected for monitoring.
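The three steps above can be sketched in a few lines of Python. The request data, endpoint names and customer names below are invented for illustration. The overall average hides the fact that one endpoint is slow for one customer, and only the per-request context makes that visible.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-request observations: (endpoint, customer, latency in ms).
requests = [
    ("/search", "acme", 40), ("/search", "acme", 45),
    ("/search", "globex", 42), ("/search", "globex", 38),
    ("/checkout", "acme", 50), ("/checkout", "acme", 55),
    ("/checkout", "globex", 900), ("/checkout", "globex", 950),
]

# Step 1: start with a high-level metric.
overall = mean(latency for _, _, latency in requests)

# Step 2: drill down using the context attached to each observation.
by_context = defaultdict(list)
for endpoint, customer, latency in requests:
    by_context[(endpoint, customer)].append(latency)

averages = {ctx: mean(values) for ctx, values in by_context.items()}

# Step 3: pick the worst slice as a hypothesis, to be tested separately.
suspect = max(averages, key=averages.get)

print(f"overall mean latency: {overall} ms")
print(f"slowest slice: {suspect} at {averages[suspect]} ms")
```

If only the overall mean had been stored, the `(endpoint, customer)` breakdown could not be recovered afterwards.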

However, knowing what information to expose and how to examine it still requires a good understanding of the system, as well as higher-level abstractions (such as good visualization tooling) to make sense of all the data.

CHAPTER 3 – Coding and Testing for Observability

Testing has historically been limited mainly to pre-production. Some companies have teams of QA engineers who perform manual or automated tests on the software built by dev teams. Once QA gave the green light, the software was handed over to the operations team to run.

This model is slowly being phased out. Development teams are now responsible for testing and operating their services. As a result, it becomes essential to be able to pick and choose the right testing techniques.

The status quo treats production as sacrosanct. Instead of testing there, it is common to keep non-prod environments as identical to production as possible, and test in those. Such environments, however, are usually a pale imitation of production.

The idea of testing in production, possibly with live traffic, is either seen as alarming, or the remit of operations engineers only. Performing some amount of testing in production requires a change in mindset and a certain appetite for risk. It also requires an overhaul in system design and investment in good release engineering practices and tooling. It involves architecting, coding and testing for failure.

Testing for failure involves acknowledging that certain types of failures can only be surfaced in the production environment.
Testing in production has a certain stigma and negative connotations linked to cowboy programming, insufficient or absent unit and integration testing, as well as a certain recklessness or lack of care for the end-user experience.

Testing in production is by no means a substitute for pre-production testing, nor is it, by any stretch, easy. Being able to successfully and safely test in production requires diligence and rigor, as well as systems designed from the ground up to lend themselves well to this form of testing. For more, see Testing in Production.

Testing in production essentially means proactively “monitoring” the test that’s happening in production.

CHAPTER 4 – The Three Pillars of Observability

The 3 pillars of Observability are Logging, Metrics and Tracing.


Logging is already a common practice, but consider:

  • Structured logging: plain text -> Splunk-friendly -> JSON
  • Event logs can be a great source of debugging information too
  • Consider sampling rather than aggregation
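As an illustration of the JSON end of that structured logging progression, here is a small sketch using Python’s standard `logging` module. `JsonFormatter` and the `context` attribute are my own hypothetical names, not something from the book or a particular logging library.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # "context" is a hypothetical attribute name passed through `extra`.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


stream = io.StringIO()  # stand-in for stdout or a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example's output out of the root logger

logger.info("payment failed", extra={"context": {"order_id": "o-123", "status": 502}})

line = stream.getvalue().strip()
event = json.loads(line)  # the line parses back into fields that can be queried
print(line)
```

Because every field is a key rather than a fragment of free text, the log can be filtered by `order_id` or `status` without writing regular expressions.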


Metrics are a numeric representation of data measured over intervals of time. Examples include averages, minimums, and maximums.
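A rough sketch of what “measured over intervals of time” can look like in practice: the samples and the 10-second window below are assumptions chosen for illustration, and a real metrics system would do this aggregation for you.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: (seconds since start, response time in ms).
samples = [(1, 120), (4, 80), (9, 100), (12, 300), (17, 260), (25, 90)]
WINDOW_SECONDS = 10  # assumed aggregation interval

# Group each sample into the fixed-size window it falls in.
windows = defaultdict(list)
for timestamp, value in samples:
    window_start = (timestamp // WINDOW_SECONDS) * WINDOW_SECONDS
    windows[window_start].append(value)

# Reduce every window to a handful of numbers.
metrics = {
    start: {"min": min(values), "max": max(values), "avg": mean(values)}
    for start, values in sorted(windows.items())
}

for start, summary in metrics.items():
    print(start, summary)
```

Each window is reduced to three numbers regardless of how many requests it contained, which is what makes metrics cheap to store and also why they lose per-request detail.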


A trace is a representation of a series of causally related distributed events that encode the end-to-end request flow through a distributed system.

When a request begins, it’s assigned a unique ID, which is propagated down the request path. Each point can enrich the metadata before passing the ID around to the next hop.

This allows the reconstruction of the flow of execution for troubleshooting and allows us to debug requests spanning multiple services, for example to pinpoint the source of increased latency or resource utilization.
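The ID propagation described above can be sketched as follows. This is a simplified, in-process illustration: `TraceContext` and the three “services” are hypothetical, and in a real distributed system the ID would travel between processes (for example in request headers) rather than as a function argument.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class TraceContext:
    """Carries the request's unique ID and the metadata added by each hop."""
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list = field(default_factory=list)

    def record(self, service: str, **metadata) -> None:
        self.spans.append({"trace_id": self.trace_id, "service": service, **metadata})


# Three hypothetical services; each enriches the context, then passes it on.
def gateway(ctx: TraceContext) -> str:
    ctx.record("gateway", route="/checkout")
    return orders(ctx)


def orders(ctx: TraceContext) -> str:
    ctx.record("orders", order_id="o-123")
    return payments(ctx)


def payments(ctx: TraceContext) -> str:
    ctx.record("payments", latency_ms=850)
    return "ok"


ctx = TraceContext()   # the ID is assigned when the request begins
result = gateway(ctx)

# Reconstruct the flow of execution from the recorded spans.
path = [span["service"] for span in ctx.spans]
print(" -> ".join(path))
```

Since every span shares the same `trace_id`, the hop that recorded the high `latency_ms` can be found by filtering on a single ID.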

Zipkin and Jaeger are two popular open-source distributed tracing solutions.

Logs, metrics, and traces each serve their own unique purpose and are complementary. In unison, they provide maximum visibility into the behavior of distributed systems.

CHAPTER 5 – Conclusion

The goal of an Observability team is not to collect logs, metrics, or traces. It is to build a culture of engineering based on facts and feedback. It is about being data driven during debugging and using the feedback to iterate on and improve the product. Being able to debug and diagnose production issues quickly is critical.

Having good monitoring and alerting may be enough, but sometimes being able to go further and debug needle-in-a-haystack types of problems is needed too.
