Talk summary: Using Chaos to Build Resilient Systems by Tammy Butow @ QCon2018

“Using Chaos to Build Resilient Systems” was a talk given by Tammy Butow of Gremlin at QCon New York 2018. I really enjoyed the talk, so I wanted to summarize some of the key points of interest to me.


What is Chaos Engineering?

“Thoughtful planned experiments designed to reveal the weaknesses in our systems” – Kolton Andrus, Gremlin CEO

That is, not random, not chaotic, but a very considered approach to testing in production.
Think of Chaos Engineering as a vaccine – “inject something harmful to build an immunity”.

Do Chaos Engineering (or Game Days, or Failure Fridays) as a way to spot issues with monitoring, alerting, and resiliency, to learn new things, or to train people!

If your company is a “rocketship”, it may feel like you are on the outside, trying to fix it as you fly!


  • Google is an expert at designing services where you won’t notice when there is downtime.
    If there is an issue with Google Search for example, results might be slightly less accurate, not quite so up-to-date, or the page won’t show the “last visited” time beside search results.
  • Netflix is another example: if the movie recommendation service is not available, default recommendations may be shown instead.

Testing those graceful degradations is where chaos engineering comes into play.
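To make the fallback idea concrete, here is a minimal sketch of graceful degradation. It is not code from the talk: the function names, the default list and the simulated outage are all hypothetical, and a real service would likely add timeouts and logging.

```python
from typing import Callable, List

# Hypothetical default list shown when personalisation is unavailable.
DEFAULT_RECOMMENDATIONS = ["Popular title A", "Popular title B"]


def recommendations_for(user_id: str, fetch: Callable[[str], List[str]]) -> List[str]:
    """Return personalised recommendations, or the defaults if the fetch fails."""
    try:
        return fetch(user_id)
    except Exception:
        # Degrade gracefully rather than failing the whole page.
        return list(DEFAULT_RECOMMENDATIONS)


def healthy_service(user_id: str) -> List[str]:
    """Stand-in for a working recommendation service."""
    return [f"Pick for {user_id}"]


def failing_service(user_id: str) -> List[str]:
    """Simulated outage: stands in for the fault a chaos experiment would inject."""
    raise ConnectionError("recommendation service unavailable")


print(recommendations_for("u1", healthy_service))
print(recommendations_for("u1", failing_service))
```

A chaos experiment would check the same behaviour against the real dependency rather than a stub.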

Part 1: Laying the foundation

Why run Chaos Engineering Experiments?

The goal is to catch issues before they turn into high severity incidents (e.g. users unable to purchase new products).

  • Are you confident your alerting and metrics are as good as they should be?
  • Are you confident your customers are getting as good an experience as they should be? (If your customers are having pain points, do you even know?)
  • Are you losing money due to downtime and broken features?

How do you run Chaos Engineering Experiments?

  • Form a hypothesis
  • Consider a blast radius (start small and expand)
  • Run the experiment
  • Measure results (and make sure you have baseline metrics!)
  • Find & fix issues or scale
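The steps above can be sketched as a small loop: take a baseline, inject the fault, measure again, always roll back, then compare against the hypothesis. This is only an illustration under assumptions: `run_experiment`, `ExperimentResult` and the toy `state` dictionary are hypothetical names, and the "system" here is a stand-in value rather than a real service.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    baseline: float
    observed: float
    hypothesis_held: bool


def run_experiment(
    measure: Callable[[], float],
    inject: Callable[[], None],
    rollback: Callable[[], None],
    max_allowed_drop: float,
) -> ExperimentResult:
    """Hypothesis: the metric drops by at most `max_allowed_drop` under the fault."""
    baseline = measure()          # baseline metric before the attack
    inject()                      # run the experiment
    try:
        observed = measure()      # measure results during the attack
    finally:
        rollback()                # always undo the fault, even on error
    held = (baseline - observed) <= max_allowed_drop
    return ExperimentResult(baseline, observed, held)


# Toy system: a success rate that the "attack" degrades (hypothetical numbers).
state = {"success_rate": 0.999}


def measure() -> float:
    return state["success_rate"]


def inject() -> None:
    state["success_rate"] = 0.95


def rollback() -> None:
    state["success_rate"] = 0.999


result = run_experiment(measure, inject, rollback, max_allowed_drop=0.01)
print(result)
```

In this toy run the hypothesis does not hold, which is the "find & fix issues" outcome; a passing run would be the cue to widen the blast radius.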

Don’t run before you can walk.

The 3 prerequisites for Chaos Engineering

  1. Monitoring & Observability
  2. On-call & Incident Management
  3. Know your cost of downtime per hour

How to choose a chaos experiment

  1. Identify your top 5 critical systems
  2. Choose 1 system
  3. Whiteboard the system
  4. Select attack: resource/state/network e.g. CPU, latency injection
  5. Determine scope
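One way to record the outcome of those five steps is a small plan object. The sketch below is an assumption about how that might look, not something shown in the talk: `ExperimentPlan`, `AttackCategory`, the system name `checkout-api` and the host counts are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class AttackCategory(Enum):
    RESOURCE = "resource"
    STATE = "state"
    NETWORK = "network"


@dataclass(frozen=True)
class ExperimentPlan:
    system: str                # the one critical system chosen
    category: AttackCategory   # resource / state / network
    attack: str                # e.g. "CPU" or "latency injection"
    hosts_in_scope: int        # scope: how many hosts the attack touches
    total_hosts: int

    def __post_init__(self) -> None:
        # Reject an empty scope or one larger than the fleet.
        if not 0 < self.hosts_in_scope <= self.total_hosts:
            raise ValueError("scope must cover between 1 and total_hosts hosts")

    @property
    def scope_fraction(self) -> float:
        """Share of the fleet affected by the attack."""
        return self.hosts_in_scope / self.total_hosts


# Hypothetical plan: one host out of twenty gets latency injected.
plan = ExperimentPlan("checkout-api", AttackCategory.NETWORK, "latency injection", 1, 20)
print(plan.scope_fraction)
```

Writing the scope down as a number makes it easy to start small and expand deliberately.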

What should we measure?

  • Availability – 500s (HTTP server errors)
  • Service specific KPIs
  • System metrics: CPU, IO, Disk
  • Customer complaints
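As a rough illustration of the availability metric, the sketch below counts HTTP 5xx responses in a list of status codes. Treating "500s" as any status from 500 to 599 is an assumption, and the sample data is made up.

```python
from typing import Iterable


def availability(status_codes: Iterable[int]) -> float:
    """Fraction of requests that did not end in a 5xx server error."""
    codes = list(status_codes)
    if not codes:
        raise ValueError("no requests observed")
    server_errors = sum(1 for code in codes if 500 <= code <= 599)
    return (len(codes) - server_errors) / len(codes)


# Made-up sample: 10 requests, 2 of them server errors (503 and 500).
# The 404 is a client error, so it does not count against availability here.
sample = [200, 200, 503, 200, 404, 500, 200, 200, 200, 200]
print(availability(sample))
```

Comparing this number before and during an experiment gives the baseline-versus-attack view mentioned earlier.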


Part 2: Types of Chaos Engineering

  • Resource Chaos Engineering: Increase CPU, Disk IO & Memory consumption to ensure monitoring is set up to catch problems
  • State Chaos Engineering: Process Chaos (killing or spawning processes), Shutdown Chaos (shutting down a server or a container), Time Travel Chaos (i.e. clock skew)
  • Networking Chaos Engineering: e.g. Blackhole Chaos, DNS Chaos, Packet Loss Chaos
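For a feel of what a resource attack does, here is a deliberately tiny, time-boxed CPU burn. It is a hand-rolled sketch, not how Gremlin or any other tool implements CPU attacks, and `burn_cpu` is a hypothetical name. It keeps one core busy for a fixed duration so you can check whether monitoring notices.

```python
import time


def burn_cpu(seconds: float) -> int:
    """Busy-loop on one core for roughly `seconds`; return the iterations done."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1   # pointless work, just to keep the core busy
    return iterations


# Short burn so the example finishes quickly; a real experiment would run
# longer and on a host you have chosen as part of the blast radius.
started = time.monotonic()
done = burn_cpu(0.05)
elapsed = time.monotonic() - started
print(f"{done} iterations in {elapsed:.3f}s")
```

The fixed deadline acts as a built-in rollback: the attack stops on its own even if nobody intervenes.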


Finally: Complex Outages

We can combine different types of chaos engineering experiments to reproduce complicated outages.

Reproducing outages gives you confidence that you can handle them if/when they happen again.
