

Talk summary: Chaos Engineering by Adrian Cockcroft @ ChaosConf18

Title: Chaos Engineering – What is it, where did it come from, and where might it be going?

Speaker: Adrian Cockcroft (AWS VP Cloud Architecture Strategy)

Conference: Chaos Conference 2018


The following are some brief notes and slide summaries from Adrian’s keynote at ChaosConf 2018…

What should your system do when something fails?
Saying it shouldn’t fail is not realistic!
There are really two main options:

  • Stop (e.g. if you are moving money, and not sure, you should stop)
  • Carry on with reduced functionality
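The two options can be sketched as a simple fallback pattern. This is my own illustration, not code from the talk; `get_recommendations` and `transfer_money` are hypothetical names. A recommendation feature degrades to a generic default, while a money transfer stops when it is unsure:

```python
# Sketch of the two failure responses from the talk (illustrative names).

DEFAULT_RECOMMENDATIONS = ["top-seller-1", "top-seller-2"]

def fetch_personalised(user_id):
    # Placeholder for a remote call that may fail.
    raise TimeoutError("recommendation service unavailable")

def get_recommendations(user_id):
    """Option 2: carry on with reduced functionality."""
    try:
        return fetch_personalised(user_id)
    except Exception:
        # Degraded mode: generic but still functional.
        return DEFAULT_RECOMMENDATIONS

def transfer_money(amount, confirmed):
    """Option 1: if you are moving money and not sure, you should stop."""
    if not confirmed:
        raise RuntimeError("transfer aborted: state unconfirmed")
    return "transferred"
```

The key design choice is that the safe default depends on the domain: for content, a stale or generic answer beats no answer; for money, no answer beats a possibly wrong one.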

Related post – Memories, Guesses, and Apologies


You can’t legislate against failure; focus on fast detection and response

– Chris Pinkham

(In other words, you can’t write down and protect against everything that might go wrong.)


Synoptic illegibility

Unless you can write a synopsis (exactly what happens), you can’t automate an ad-hoc and messy system. – Sidney Dekker, The Safety Anarchist


What is supposed to happen when part of the system fails?
How is it supposed to recover after the failure goes away?

“Drift into Failure” by Sidney Dekker. Even when everyone does everything right individually, things can still go wrong.


Greenspun’s tenth rule:

Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.

– Philip Greenspun, 1993

That is, developers love to reinvent things from first principles (they like to build things).

Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production. – Principles of Chaos Engineering

Distilled: experiment to ensure that the impact of failure is mitigated.


How do you think of everything that might fail? A taxonomy of failures…

  • Infrastructure
  • Software stack
  • Application
  • Operations

Infrastructure failures:

  • Device failures
  • CPU failures
  • Datacenter failures
  • Internet failures

Software stack failures:

  • Time bombs (e.g. memory leaks)
  • Date bombs (leap year/second, Y2K)
  • End of unix time
  • Expiration (certificates expire)
  • Revocation (license revoked by supplier)
  • Exploits (Security failures e.g. Heartbleed)
  • Language or Runtime bugs (e.g. compiler, JVM, Docker, Hypervisor, Linux)
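Expiration is one of the easier items on this list to defend against proactively. As a sketch (my own, not from the talk; names are illustrative), a renewal check can turn a certificate time bomb into a routine alert well before it becomes an outage:

```python
from datetime import datetime, timedelta

def days_until_expiry(not_after, now=None):
    """Days until a certificate's notAfter timestamp; negative if expired."""
    now = now or datetime.utcnow()
    return (not_after - now) / timedelta(days=1)

def needs_renewal(not_after, threshold_days=30, now=None):
    # Alert while there is still plenty of time to rotate the certificate.
    return days_until_expiry(not_after, now) < threshold_days
```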

Application failures:

  • Configuration
  • Versioning (incompatible mixes)
  • Time & Date bombs
  • Content bombs (e.g. issues with html formatting causing infinite loops)
  • Cascading failures (error handling bugs)
  • Cascading overload (e.g. lock contention)
  • Retry storms (too many retries, bad timeout strategy)
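Retry storms in particular are easy to self-inflict: when a shared dependency recovers, every client retries at once and knocks it over again. A common mitigation (my sketch, not code from the talk) is a bounded number of attempts with capped exponential backoff and full jitter:

```python
import random

def backoff_delays(attempts=4, base=0.1, cap=2.0):
    """Capped exponential backoff with full jitter.

    Each delay is drawn uniformly from [0, min(cap, base * 2**i)], which
    spreads retries out in time so that many clients recovering from the
    same outage don't all retry simultaneously (a retry storm).
    """
    delays = []
    for i in range(attempts):
        delays.append(random.uniform(0, min(cap, base * 2 ** i)))
    return delays
```

The cap bounds the worst-case wait, and the hard limit on attempts ensures a request eventually fails fast instead of retrying forever.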

Operations Failures

  • Poor capacity planning
  • Inadequate incident management
  • Failure to initiate incident
  • Unable to access monitoring dashboards
  • Insufficient observability of systems
  • Incorrect corrective actions


  • Formerly known as Disaster Recovery
  • More recently known as Chaos Engineering
  • We are aiming towards Resilient Critical Systems

Hypothesis testing

  • In production
  • Without causing an issue
  • We think we have a safety margin in this dimension, let’s carefully test to be sure
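That hypothesis-testing loop can be sketched as: verify the steady state, inject a fault inside a limited blast radius, check whether the steady state still holds, and always clean up. This is my own minimal illustration (all names hypothetical), not a framework from the talk:

```python
def run_experiment(steady_state_ok, inject_fault, remove_fault):
    """Minimal chaos-experiment loop: test the steady-state hypothesis
    before, during, and after fault injection; always restore the system."""
    if not steady_state_ok():
        return "aborted: system not healthy before experiment"
    inject_fault()
    try:
        holds = steady_state_ok()  # does the hypothesis survive the fault?
    finally:
        remove_fault()  # always clean up, even if the check raises
    if holds and steady_state_ok():
        return "hypothesis confirmed: safety margin exists"
    return "hypothesis refuted: failure impact not mitigated"
```

The abort-if-unhealthy guard is what makes this safe to run in production: you only experiment when you start from a known-good steady state.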

What we are aiming for:

  • Experienced staff
  • Robust applications
  • Dependable switching fabric
  • Redundant service foundation

As datacenters migrate to the cloud, fragile and manual disaster recovery will be replaced by chaos engineering.

Testing failure mitigation will move from a scary annual experience to automated continuous chaos.


I think my favorite part from the whole talk is where Adrian points out that Observability is not a new term, but was in fact defined way back in 1961 for Control Systems Theory:

A system is observable if the behavior of the entire system can be determined by only looking at its inputs and outputs.

Adrian argues that a Lambda function with no side effects and no state is inherently very observable: you can figure out exactly what it is doing purely by looking at its inputs and outputs. At the other end of the scale you have monoliths, which are typically much less observable.
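That contrast can be illustrated with two toy handlers (my own sketch, with hypothetical names): the stateless one is fully determined by its input, while the stateful one hides internal state that inputs and outputs alone cannot reveal:

```python
def pure_handler(event):
    # Stateless: the output depends only on the input event, so the
    # function is fully observable from its inputs and outputs.
    return {"total": sum(event["items"])}

_call_count = 0

def stateful_handler(event):
    # Hidden state: the output depends on how many times the handler has
    # been called, which cannot be determined from any single input/output.
    global _call_count
    _call_count += 1
    return {"total": sum(event["items"]), "calls": _call_count}
```

Calling `pure_handler` twice with the same event yields the same result; `stateful_handler` does not, which is exactly the 1961 control-theory sense in which it is less observable.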


