

Talk summary: How Complex Systems Fail by Dr Richard Cook @ Velocity 2012

“How Complex Systems Fail” is a talk by Dr Richard Cook at Velocity 2012.

I’ve included a link to the video on YouTube below, and some of my key takeaway points.

When it comes to complex systems, “The surprise is not that there are so many accidents. The surprise is that there are so few.”

Every time there is an outage, “We felt the wings of the angel of death fluttering around our foreheads”. (Or, to paraphrase an old proverb, there but for the grace of the software Gods go I.) In other words, even if the issue was a near miss, or didn’t involve you (or your code!) directly, there can be a sense of “whew, that was a close one”. Sometimes it is simply a wonder that there are so few accidents at all.

Individual systems can have reasonable defaults, but the way that you put systems together can result in catastrophic failure.

Systems research results 1987-2012

  • The real world often surprises
  • Existential threats occur
  • Demand & op-tempo varies
  • Stuff doesn’t work as advertised
  • Novel conditions are common
  • Lots of adapting / tailoring

What is resilience? It is:

  • Learning
  • Monitoring
  • Responding
  • Adapting

What do we want from resilient systems? We want our systems to have the ability to:

  • Withstand transient conditions
  • Recover swiftly & smoothly from failures
  • Prioritize to serve high level goals
  • Recognize and respond to abnormal situations
  • Adapt to change

How do we design for resilience?

  • Support continuous maintenance
  • Reveal the controls
  • Show the lift points
  • Support mental simulation
  • Open objects/methods
  • Get rid of “don’t touch this!” mentality
  • Empower operator learning

