Shaun Abram
Technology and Leadership Blog
Talk summary: Using Chaos to Build Resilient Systems by Tammy Butow @ QCon2018
“Using Chaos to Build Resilient Systems” was a talk given by Tammy Butow of Gremlin at QCon New York 2018. I really enjoyed the talk, so I wanted to summarize some of the key points of interest to me.
Introduction
What is Chaos Engineering?
“Thoughtful planned experiments designed to reveal the weaknesses in our systems” – Kolton Andrus, Gremlin CEO
That is, not random, not chaotic, but a very considered approach to testing in production.
Think of Chaos Engineering as a vaccine – “inject something harmful to build an immunity”.
Do Chaos Engineering (or Game Days, or Failure Fridays) as a way to spot issues with monitoring, alerting, resiliency, to learn new things, or to train people!
If your company is a “rocketship”, it may feel like you are on the outside, trying to fix it as you fly!
Examples:
- Google is an expert at designing services where you won’t notice downtime. If there is an issue with Google Search, for example, results might be slightly less accurate or not quite so up-to-date, or the page won’t show the “last visited” time beside search results.
- Netflix is another example: if the movie recommendation service is not available, default recommendations may be shown instead.
Testing those graceful degradations is where chaos engineering comes into play.
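To make the Netflix-style fallback concrete, here is a minimal sketch of graceful degradation. It is my own illustration, not code from the talk: `get_recommendations`, `DEFAULT_RECOMMENDATIONS` and the two stand-in services are hypothetical names, and a real service would choose its own defaults and failure types.

```python
from typing import Callable, List

# Hypothetical fallback list (an assumption for this sketch).
DEFAULT_RECOMMENDATIONS = ["Popular title A", "Popular title B", "Popular title C"]


def get_recommendations(user_id: str, fetch: Callable[[str], List[str]]) -> List[str]:
    """Return personalized recommendations, or defaults if the service is down."""
    try:
        return fetch(user_id)
    except (ConnectionError, TimeoutError):
        # Degrade gracefully: the page still renders, just less personalized.
        return DEFAULT_RECOMMENDATIONS


def healthy_service(user_id: str) -> List[str]:
    """Stand-in for a working recommendation service."""
    return [f"Pick for {user_id}"]


def broken_service(user_id: str) -> List[str]:
    """Stand-in for the failure a chaos experiment would inject."""
    raise ConnectionError("recommendation service unavailable")


if __name__ == "__main__":
    print(get_recommendations("u1", healthy_service))
    print(get_recommendations("u1", broken_service))
```

A chaos experiment would then check that the second call really does return the defaults instead of surfacing an error to the user.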
Part 1: Laying the foundation
Why run Chaos Engineering Experiments?
The goal is to catch issues before they turn into high severity incidents (e.g. users unable to purchase new products).
- Are you confident your alerting and metrics are as good as they should be?
- Are you confident your customers are getting as good an experience as they should be? (If your customers are having pain points, do you even know?)
- Are you losing money due to downtime and broken features?
How do you run Chaos Engineering Experiments?
- Form a hypothesis
- Consider a blast radius (start small and expand)
- Run the experiment
- Measure results (and make sure you have baseline metrics!)
- Find & fix issues or scale
Don’t run before you can walk.
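The five steps above can be sketched as a small experiment runner. This is only an illustration under my own assumptions: `Experiment`, `run_experiment`, and the `measure_error_rate`, `inject` and `rollback` callbacks are hypothetical names, not part of Gremlin or any other tool mentioned in the talk.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    hypothesis: str        # what we expect to stay true
    blast_radius: int      # e.g. number of hosts targeted; start small
    max_error_rate: float  # threshold above which the hypothesis fails


def run_experiment(
    exp: Experiment,
    measure_error_rate: Callable[[], float],
    inject: Callable[[int], None],
    rollback: Callable[[], None],
) -> dict:
    """Measure a baseline, inject the fault, measure again, always roll back."""
    baseline = measure_error_rate()
    inject(exp.blast_radius)
    try:
        observed = measure_error_rate()
    finally:
        rollback()  # stop the attack even if measuring fails
    return {
        "hypothesis": exp.hypothesis,
        "baseline": baseline,
        "observed": observed,
        "held": observed <= exp.max_error_rate,
    }
```

If `held` comes back false, that is the "find & fix issues" step; if it holds, the blast radius can be expanded and the experiment run again.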
The 3 prerequisites for Chaos Engineering
- Monitoring & Observability
- On-call & Incident Management
- Know your cost of downtime per hour
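For the last prerequisite, a rough back-of-the-envelope estimate is often enough to start with. The sketch below is my own simplification, not a formula from the talk: it assumes revenue is spread evenly across the year, and `downtime_cost_per_hour` and `fraction_affected` are hypothetical names.

```python
HOURS_PER_YEAR = 365 * 24


def downtime_cost_per_hour(annual_revenue: float, fraction_affected: float = 1.0) -> float:
    """Estimate revenue lost per hour of downtime.

    Assumes revenue is spread evenly over the year, which ignores peak hours,
    and that `fraction_affected` of revenue depends on the failing system.
    """
    if not 0.0 <= fraction_affected <= 1.0:
        raise ValueError("fraction_affected must be between 0 and 1")
    return annual_revenue / HOURS_PER_YEAR * fraction_affected


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    print(downtime_cost_per_hour(8_760_000, fraction_affected=0.5))
```

Lost revenue is only part of the picture; the estimate leaves out things like engineering time and customer trust.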
How to choose a chaos experiment
- Identify your top 5 critical systems
- Choose 1 system
- Whiteboard the system
- Select attack: resource/state/network e.g. CPU, latency injection
- Determine scope
What should we measure
- Availability – 500s (HTTP 5xx server errors)
- Service specific KPIs
- System metrics: CPU, IO, Disk
- Customer complaints
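As a small example of the availability metric, the sketch below treats any HTTP 5xx response as a failed request and compares a baseline window with a window recorded during an experiment. The `availability` function and the sample status codes are my own assumptions for illustration.

```python
from typing import Iterable


def availability(status_codes: Iterable[int]) -> float:
    """Fraction of requests that did not end in a 5xx server error."""
    codes = list(status_codes)
    if not codes:
        raise ValueError("no requests observed")
    server_errors = sum(1 for code in codes if 500 <= code <= 599)
    return 1 - server_errors / len(codes)


if __name__ == "__main__":
    # Made-up samples: a quiet baseline and a window during an attack.
    baseline = [200] * 98 + [500] * 2
    during_attack = [200] * 90 + [503] * 10
    print(f"baseline:      {availability(baseline):.2%}")
    print(f"during attack: {availability(during_attack):.2%}")
```

Having the baseline number first is what makes the experiment result meaningful.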
Part 2: Types of Chaos Engineering
- Resource Chaos Engineering: Increase CPU, Disk IO & Memory consumption to ensure monitoring is set up to catch problems
- State Chaos Engineering: Process Chaos (killing or spawning processes), Shutdown Chaos (shutting down a server or a container), Time Travel Chaos (i.e. clock skew)
- Networking Chaos Engineering: e.g. Blackhole Chaos, DNS Chaos, Packet Loss Chaos
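To give a feel for the networking category, here is an in-process sketch of latency injection and packet loss wrapped around a function call, plus a simple retrying client. Real tools inject these faults at the network level; `with_network_chaos`, `call_with_retries` and `InjectedPacketLoss` are hypothetical names of my own and not any tool's API.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedPacketLoss(ConnectionError):
    """Raised in place of a real dropped connection."""


def with_network_chaos(
    call: Callable[[], T],
    loss_rate: float,
    latency_s: float,
    rng: random.Random,
) -> Callable[[], T]:
    """Wrap `call` so it is delayed and sometimes fails."""
    def chaotic() -> T:
        time.sleep(latency_s)         # latency injection
        if rng.random() < loss_rate:  # simulated packet loss
            raise InjectedPacketLoss("simulated drop")
        return call()
    return chaotic


def call_with_retries(call: Callable[[], T], attempts: int) -> T:
    """Retry on connection errors; re-raise the last one if all attempts fail."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
    raise AssertionError("unreachable")
```

In this sketch a `loss_rate` of 1.0 behaves like a blackhole, since every call fails, while values in between let you check whether the client's retries cope with intermittent loss.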
Finally: Complex Outages
We can combine different types of chaos engineering experiments to reproduce complicated outages.
Reproducing outages gives you confidence that you can handle them if/when they happen again.