Book summary: Chaos Engineering

“Chaos Engineering” is a book from O’Reilly (free download), written by folks from the “Chaos Team” at Netflix. It is a GREAT read for anyone interested in resilience engineering. This post is one of my summaries, essentially a cut and paste of the most salient parts (the original is about 16,000 words; this is about 3,000), with some paraphrasing and merging/rewriting of sections for brevity.


“Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
Principles of Chaos

Distributed systems contain so many interacting components that the number of things that can go wrong is enormous. Hard disks can fail, the network can go down, a sudden surge in customer traffic can overload a functional component—the list goes on. All too often, these events trigger outages, poor performance, and other undesirable behaviors.

While we can never prevent all possible failure modes, we can identify many of the weaknesses in our system. Chaos Engineering is a method of experimentation on infrastructure that brings systemic weaknesses to light.

CHAPTER 1 – Why Do Chaos Engineering?

Chaos Engineering is an approach for learning about how your system behaves by applying a discipline of empirical exploration. Chaos Engineering uses experiments to learn about weaknesses in your system, allowing you to improve resilience.

Examples of inputs for chaos experiments:

  • Simulating the failure of an entire region or datacenter.
  • Injecting latency between services for a select percentage of traffic
  • Function-based chaos (runtime injection): randomly causing functions to throw exceptions.
  • Time travel: forcing system clocks out of sync with each other.
  • Maxing out CPU
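
As a rough illustration of the latency-injection and function-based chaos inputs above, here is a minimal Python sketch. It is not from the book: the `chaos` decorator, the `InjectedFailure` exception, and `fetch_recommendations` are hypothetical names invented for this example, and a real tool would inject faults at the platform level rather than with a decorator.

```python
import functools
import random
import time


class InjectedFailure(Exception):
    """Raised by the chaos wrapper instead of calling the real function."""


def chaos(failure_rate=0.0, latency_rate=0.0, latency_seconds=0.0, rng=None):
    """Wrap a function so that a fraction of calls fail or are delayed.

    All parameter names are assumptions made for this sketch.
    """
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Function-based chaos: throw an exception for some calls.
            if rng.random() < failure_rate:
                raise InjectedFailure(f"chaos: forced failure in {func.__name__}")
            # Latency injection: delay a select percentage of calls.
            if rng.random() < latency_rate:
                time.sleep(latency_seconds)
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Hypothetical service call: roughly 10% of calls raise InjectedFailure.
@chaos(failure_rate=0.1, rng=random.Random(42))
def fetch_recommendations(user_id):
    return [f"title-{user_id}-{n}" for n in range(3)]
```

Callers of `fetch_recommendations` then have to cope with `InjectedFailure`, which is the point: the experiment checks whether the fallback path really works.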

Prerequisites for Chaos Engineering

If you know that your system is not resilient to certain specific events such as service failures or network latency spikes, then you don’t need Chaos Engineering to test that. You need to do the work to fix those issues before applying Chaos Engineering principles.
Chaos Engineering is great for exposing unknown weaknesses in your production system, but if you are certain that a Chaos Engineering experiment will lead to a significant problem with the system, there’s no sense in running that experiment. Fix that weakness first, then come back to Chaos Engineering and it will either uncover other weaknesses that you didn’t know about, or it will give you more confidence that your system is in fact resilient.

Another essential element of Chaos Engineering is a monitoring system that you can use to determine the current state of your system. Without visibility into your system’s behavior, you won’t be able to draw conclusions from your experiments.

Chaos Monkey

Chaos Monkey is Netflix’s tool for dealing with the fact that, with thousands of instances running, it is virtually guaranteed that VMs will fail on a regular basis.

Chaos Monkey pseudo-randomly selects a running instance in production and turns it off. It does this during business hours, and at a much more frequent rate than we typically see instances disappear. By taking a rare and potentially catastrophic event and making it frequent, we give engineers a strong incentive to build their service in such a way that this type of event doesn’t matter. Engineers are forced to handle this type of failure early and often.
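
The selection logic described above can be sketched in a few lines of Python. This is not Chaos Monkey’s actual code: the function names, the 9:00 to 17:00 weekday window, and the instance IDs are all assumptions for illustration, and the step that really terminates the instance is deliberately left out.

```python
import random
from datetime import datetime


def in_business_hours(now, start_hour=9, end_hour=17):
    """True on weekdays between start_hour (inclusive) and end_hour (exclusive).

    The 9-to-17 window is an assumption; pick whatever hours your engineers
    are actually around to respond.
    """
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def pick_victim(instances, now, rng=None):
    """Return one instance ID to turn off, or None outside business hours."""
    rng = rng or random.Random()
    if not instances or not in_business_hours(now):
        return None
    # Sort first so that a seeded rng gives a reproducible choice.
    return rng.choice(sorted(instances))


# Hypothetical instance IDs; a real tool would list them from the cloud API.
victim = pick_victim(["i-aaa", "i-bbb", "i-ccc"], datetime.now())
```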

CHAPTER 2 – Managing Complexity

Software engineers typically optimize for three properties:

  • Performance: Minimization of latency or capacity costs.
  • Availability: The system’s ability to respond and avoid downtime.
  • Fault tolerance: The system’s ability to recover from any undesirable state.

At Netflix, engineers also consider a fourth property:

  • Velocity of feature development: The speed with which engineers can provide new, innovative features to customers.

Complex Systems

In Systems Theory there is something called the “bullwhip effect”. A small perturbation in input starts a self-reinforcing cycle that causes a dramatic swing in output (e.g., taking down your whole app).

Individual behaviors of microservices may be completely rational. Only taken in combination under very specific circumstances do we end up with the undesirable systemic behavior.

The Principles of Chaos

The job of a chaos engineer is not to induce chaos. On the contrary: Chaos Engineering is a discipline, an experimental discipline. How would our system fare if we injected chaos into it?

In this chapter, we walk through the design of basic chaos experiments.

We must understand how our system will behave under different conditions and we do so by running experiments on it. We push and poke on our system and observe what happens. We use a systematic approach in order to maximize the information we can obtain from each experiment.

Chaos Engineering Principles

  • Hypothesize about steady state.
  • Vary real-world events.
  • Run experiments in production.
  • Automate experiments to run continuously.
  • Minimize blast radius.

CHAPTER 3 – Hypothesize about Steady State

“Steady State” refers to a property that the system tends to maintain within a certain range or pattern. With a human body “system”, for example, properties such as temperature and pulse can be used to identify steady state.

We can refer to the normal operation of a system as its steady state. How do you know a service is working? How do you recognize its “steady state”?

One way to know that everything is working is to try using it yourself, e.g., by browsing to your website. But this approach to checking system health is labor-intensive, is not done regularly, and may not reveal all problems.

A better approach is to collect data that provide information about the health of the system. System metrics can help troubleshoot performance problems, but it is business metrics that allow us to answer the important questions, such as whether customers are able to perform critical site functions or whether customers are abandoning the site. SREs, for example, are likely more interested in a drop in key business metrics than an increase in CPU utilization.

Netflix monitors the rate at which customers hit the play button on their video streaming device. They call this metric video-stream starts per second, or SPS.

Characterizing Steady State

Most business metrics fluctuate significantly. One approach is to plot last week’s data on top of the current data in order to spot anomalies. Depending on your domain (e.g., a news website), your metrics might not be very predictable. In these cases, characterizing the steady state behavior of the system will be difficult, but it is a necessary precondition of creating a meaningful hypothesis about it.
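
The week-over-week comparison can also be done numerically rather than by eye. The sketch below is my own illustration, not something from the book: the function name, the 20% tolerance, and the sample numbers are assumptions, and both series are assumed to be sampled at the same times of day.

```python
def week_over_week_anomalies(last_week, current, tolerance=0.2):
    """Return the indexes where `current` deviates from `last_week` by more
    than `tolerance` (relative, so 0.2 means 20%).

    The 20% default is an arbitrary assumption for illustration.
    """
    anomalies = []
    for i, (baseline, value) in enumerate(zip(last_week, current)):
        if baseline == 0:
            # No relative deviation is defined; flag any non-zero value.
            if value != 0:
                anomalies.append(i)
            continue
        if abs(value - baseline) / baseline > tolerance:
            anomalies.append(i)
    return anomalies


# Made-up samples of a business metric at four matching points in time.
last_week = [100, 120, 150, 130]
this_week = [98, 125, 90, 131]
suspicious = week_over_week_anomalies(last_week, this_week)  # [2]
```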

Forming Hypotheses

Whenever you run a chaos experiment, you should have a hypothesis in mind about what you believe the outcome of the experiment will be. It can be tempting to subject your system to different events (for example, increasing amounts of traffic) to “see what happens.” However, without a prior hypothesis, it can be difficult to draw conclusions because you don’t know what to look for in the data.

Once you have your metrics and an understanding of their steady state behavior, you can use them to define the hypotheses for your experiment. Think about how the steady state behavior will change when you inject different types of events into your system.

An example hypothesis may be in the form of “the events we are injecting into the system will not cause the system’s behavior to change from steady state.”

Think about how you will measure the change in steady state behavior. Even when you have your model of steady state behavior, you need to define how you are going to measure deviations from this model. Identifying reasonable thresholds for deviation from normal can be challenging.
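
One simple way to turn “deviation from steady state” into something measurable is a band around the historical mean. The Python sketch below is only one possible choice and is not taken from the book: the band of three standard deviations (`k=3.0`), the function names, and the sample values are all assumptions, and a strongly seasonal metric would need a better model.

```python
import statistics


def steady_state_band(baseline_samples, k=3.0):
    """Return (low, high) bounds: mean plus or minus k standard deviations.

    k=3.0 is an assumed threshold; choosing it well is the hard part.
    Needs at least two baseline samples.
    """
    mean = statistics.fmean(baseline_samples)
    stdev = statistics.stdev(baseline_samples)
    return mean - k * stdev, mean + k * stdev


def hypothesis_holds(baseline_samples, experiment_samples, k=3.0):
    """True if every sample taken during the experiment stays in the band."""
    low, high = steady_state_band(baseline_samples, k)
    return all(low <= sample <= high for sample in experiment_samples)


# Made-up metric readings (e.g., orders per second).
baseline = [100, 102, 98, 101, 99]
during_experiment = [99, 103, 97]
ok = hypothesis_holds(baseline, during_experiment)
```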

CHAPTER 4 – Vary Real-World Events

Every system, from simple to complex, is subject to unpredictable events and conditions if it runs long enough. Examples include increase in load, hardware malfunction, deployment of faulty software, and the introduction of invalid data.

It is not possible to prevent threats to availability, but it is possible to mitigate them. In deciding which events to induce, estimate the frequency and impact of the events and weigh them against the costs and complexity. Netflix for example:

  • Turn off machines because instance termination happens frequently in the wild and turning off a server is cheap and easy.
  • Simulate regional failures even though doing so is costly and complex, because a regional outage has a huge impact on our customers unless we are resilient to it.

We don’t need to enumerate all of the possible events that can change the system, we just need to inject the frequent and impactful ones as well as understand the resulting failure domains.

Only induce events that you expect to be able to handle! Induce real-world events, not just failures and latency. While the examples provided have focused on the software part of systems, humans play a vital role in resiliency and availability. Experimenting on the human-controlled pieces of incident response (and their tools!) will also increase availability.

CHAPTER 5 – Run Experiments in Production

The idea of doing software verification in a production environment is generally met with derision. “We’ll test it in prod” is a form of gallows humor, which translates to “we aren’t going to bother verifying this code properly before we deploy it.”

It is generally better to identify bugs as far away from production as possible, e.g., it is better to find them in unit tests than in integration tests.

When it comes to Chaos Engineering, the strategy is reversed: you want to run your experiments as close to the production environment as possible.

When we do traditional software testing, we’re verifying code correctness. When we run Chaos Engineering experiments, we are interested in the behavior of the overall system. State, input and other people’s systems lead to all sorts of system behaviors that are difficult to foresee.

State and Services

State can take many forms:

  • “stateful services,” such as databases, caches, and durable message queues.
  • Configuration data (e.g., static configuration files or a dynamic configuration service like etcd)

Even in “stateless” services, there is still state in the form of:

  • In-memory data structures that persist across requests and can therefore affect the behavior of subsequent requests.
  • Even the number of virtual machines or containers in an autoscaling group is a form of system state.
  • Network hardware such as switches and routers also contain state.

Eventually, some unexpected state is going to bite you. In order to catch the threats to resiliency, you need to expose experiments to the same state problems that exist in the production environment.

Input in Production

The users of a system never seem to interact with it in the way that you expect them to. The only way to truly build confidence in the system at hand is to experiment with the actual input received by the production environment.

Other People’s Systems

A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.
—Leslie Lamport

We invariably depend on external systems, whose behavior we cannot possibly hope to know. The behavior of other people’s systems will always differ between production and synthetic environments. This reinforces the fact that you want to run experiments in production, the only place where you will have an authentic interaction with those other systems.

Poor Excuses for Not Practicing Chaos

We recognize that in some environments it may be difficult or even impossible to run experiments directly in a production environment but we suspect most users are not working on these kinds of safety-critical systems.

I’m pretty sure it will break!

You should go into an experiment with a reasonably high level of confidence that your system is resilient. One of the main purposes of Chaos Engineering is to identify weaknesses in your system. If you already believe the weaknesses are present, then you should be focusing on improving your system resiliency. Once you believe that the system is resilient, go ahead with chaos experiments.

If it does break, we’re in big trouble!

Even if you have high confidence that your system is resilient, you might be hesitant to run a Chaos Engineering experiment out of fear that the experiment will do too much harm if it does reveal a weakness.
This is a legitimate concern. You can minimize the potential harm in two ways:

  • Make it easy to abort the experiment
  • Minimize the blast radius of the experiment

Get as Close as You Can

Even if you cannot run directly in production, the closer your experimental environment is to production, the more confidence you can have in the results.
It’s better to risk a small amount of harm in order to protect the system from suffering a significant outage in the future.

CHAPTER 6 – Automatically Executing Experiments

Doing things manually and performing one-off experiments are great first steps. Start with a manual approach, handling everything with kid gloves to gain confidence in both the experiment and the system.

But once we have successfully conducted the experiment, the next step is to automate the experiment to run continuously. If experimentation is not automated, it is obsolescent and won’t happen.

Production is in a perpetual state of change. As a result, the confidence in a result decays with time.

We cannot, and should not, ask engineers to sacrifice development velocity to spend time manually running through chaos experiments on a regular basis. Instead, invest in creating tools and platforms for chaos experimentation that lower the bar to creating new chaos experiments and running them automatically.

CHAPTER 7 – Minimize Blast Radius

Each chaos experiment has the potential to cause a production outage. Your responsibility as a chaos engineer is to understand and mitigate risks.

Ensure that it is possible to do an emergency stop of an experiment, should things go wrong, to prevent a crisis. In many ways, our experiments are looking for the unknown and unforeseen repercussions of failure, so the trick is how to shed light on these vulnerabilities without accidentally blowing everything up. This is called “minimizing the blast radius.”

Chaos experiments should take careful, measured risks that build upon each other. This escalation of scope ratchets up confidence in the system without causing unnecessary customer harm.

It is imperative to be able to abort in-process experiments when they cause too much pain. A test that pushes your system into a degraded mode that is a minor annoyance may be acceptable, but when the system becomes unavailable or unusable by your customers, the experiment should be terminated immediately.

Automated termination is highly recommended too, but figuring out how to build a system that can monitor the metric of interest and unwind a chaos experiment in real time is a non-trivial exercise!
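
To make the idea concrete, here is a deliberately small Python sketch of automated termination. It is a toy under stated assumptions, not how Netflix does it: `inject`, `rollback`, and `read_metric` are hypothetical callables you would supply, and real monitoring has to deal with noisy metrics, delays, and partial rollbacks.

```python
import time


def run_with_auto_abort(inject, rollback, read_metric, min_acceptable,
                        checks=10, interval_seconds=1.0, sleep=time.sleep):
    """Run an experiment while polling a metric.

    Returns True if the experiment ran to completion, False if it was
    aborted because the metric fell below `min_acceptable`. Either way,
    `rollback` is always called to unwind the injected event.
    """
    inject()
    try:
        for _ in range(checks):
            if read_metric() < min_acceptable:
                return False  # too much pain: stop early
            sleep(interval_seconds)
        return True
    finally:
        # Runs on completion, on early abort, and on unexpected errors.
        rollback()
```

Putting the rollback in a `finally` block means the experiment is unwound even if the monitoring code itself throws, which is a cheap way to keep the blast radius contained.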

We want to build confidence in the resilience of the system, one small and contained failure at a time.

Chaos In Practice

Teams may be reluctant to implement chaos because of the customer and financial impact, but failures will happen regardless of intention or planning. While running experiments that surface vulnerabilities may cause negative impacts, it is much better to know about them while the team is ready and prepared and can control the extent of the impact than to be caught off guard at 3am by the inevitable, large-scale failure.

CHAPTER 8 – Designing Experiments

Here’s an overview of designing a Chaos Engineering experiment:

  1. Pick a hypothesis
  2. Choose the scope of the experiment
  3. Identify the metrics you’re going to watch
  4. Notify the organization
  5. Run the experiment
  6. Analyze the results
  7. Increase the scope
  8. Automate

1. Pick a Hypothesis

Perhaps you pick a hypothesis based on a fix for a recent outage, or perhaps you’d like to verify that your active-passive database configuration fails over cleanly.

The hypothesis can also be human focused. For example, how well do the on-call engineers know the contingency plan for your main messaging or paging service being down? Running a chaos experiment is a great way to find out.

2. Choose the Scope of the Experiment

Two principles apply here:

  • “run experiments in production”, and
  • “minimize blast radius”

The closer your test is to production, the more you’ll learn from the results but there’s always a risk of doing harm.
To minimize the amount of customer pain, start with a narrowly scoped dry run in a test environment to get a signal and then ratchet up the impact until we achieve the most accurate simulation of the biggest impact we expect our systems to handle. Once you do move to production, you’ll want to start out with experiments that impact the minimal amount of customer traffic.
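
One common way to limit an experiment to a small slice of traffic is to hash a stable identifier into buckets. The Python sketch below is my own illustration and not from the book: the function name, the use of SHA-256, and the 10,000 buckets are assumptions.

```python
import hashlib


def in_experiment(user_id, percent):
    """Deterministically place roughly `percent`% of users in the experiment.

    The same user always gets the same answer, and raising `percent` only
    adds users, so the scope can be ratcheted up gradually.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


# Hypothetical usage: start with 1% of users, widen later.
affected = [uid for uid in ("alice", "bob", "carol") if in_experiment(uid, 1)]
```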

3. Identify the Metrics You’re Going to Watch

Be clear on what metrics you are going to use in your test. If your hypothesis is “if we fail the primary database, then everything should be ok,” you’ll want to have a crisp definition of “ok” before you run the experiment. If you have a clear business metric like “orders per second,” or lower-level metrics like response latency and response error rate, be explicit about what range of values are within tolerance before you run the experiment.

If the experiment has a more serious impact than you expected, you should be prepared to abort early. A firm threshold could look like: 5% or more of the requests are failing to return a response to client devices. This will make it easier for you to know whether you need to hit the big red “stop” button when you’re in the moment.
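
A firm threshold like the one above is easy to encode so nobody has to do arithmetic in the moment. This is a minimal Python sketch; the function and parameter names are hypothetical, and the 5% default simply mirrors the example figure.

```python
def should_abort(total_requests, failed_requests, max_failure_ratio=0.05):
    """True when 5% or more of requests (by default) are failing."""
    if total_requests == 0:
        return False  # no traffic observed yet, so nothing to judge
    return failed_requests / total_requests >= max_failure_ratio
```

For example, `should_abort(1000, 50)` is true, while `should_abort(1000, 49)` is not.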

4. Notify the Organization

When you first start off running chaos experiments in the production environment, you’ll want to inform members of your organization about what you’re doing, why you’re doing it, and (only initially) when you’re doing it.

For the initial run, you might need to coordinate with multiple teams who are interested in the outcome and are nervous about the impact of the experiment. As you gain confidence by doing more experiments and your organization gains confidence in the approach, there will be less of a need to explicitly send out notifications about what is happening.

5. Run the Experiment

Perform the chaos experiment, watch the metrics, halt if required.
Ensure that you have alerting in place for when critical metrics dip below a certain threshold.

6. Analyze the Results

After the experiment is done, use the metrics you’ve collected to test if your hypothesis is correct. Was your system resilient to the real world events you injected? Did anything happen that you didn’t expect?

Many issues exposed by Chaos Engineering experiments will involve interactions among multiple services. Make sure that you feed back the outcome of the experiment to all of the relevant teams so they can mitigate any weaknesses.

7. Increase the Scope

Once you’ve gained some confidence from running smaller-scale experiments, you can ratchet up the scope of the experiment to reveal systemic effects that aren’t noticeable with smaller-scale experiments.

8. Automate

Once you have confidence in manually running your chaos exercises, you’ll get more value out of them by automating them so they run regularly.
