Shaun Abram
Technology and Leadership Blog
Talk summary: SRE principles by Tori Wieldt @ AWS re:Invent 2018
I caught a talk by Tori Wieldt at the New Relic booth at AWS re:Invent on “SRE principles”. Even though it was a short talk in the expo hall, rather than a formal scheduled one, it had a ton of good SRE material.
I have likely misquoted or even misconstrued some of Tori’s points, so you can find her original slides on slideshare, and New Relic have lots of other information on Site Reliability Engineering here.
Tori started by referencing the Google SRE books, “Site Reliability Engineering” and “The Site Reliability Workbook” (I really need to get around to reading these!), before asking how the rest of us mere mortals who don’t work at Google should deal with SRE.
What should SREs do?
Continuously improve the reliability of your application, with the ultimate goal of a good user experience.
(I liked this latter point. For example, a service may be down but the user experience may not be affected if you have good resilience strategies in place such as defaults, caches, failovers etc.)
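To make that concrete, here is a minimal sketch of the "caches and defaults" idea. It is my own illustration rather than anything from Tori's talk, and `get_with_fallback`, `fetch` and the in-memory `_cache` are hypothetical names. The call tries the live service first, falls back to a recent cached value if the service fails, and finally returns a default.

```python
import time
from typing import Callable, Dict, Tuple

# Hypothetical in-memory cache: key -> (value, time it was stored).
_cache: Dict[str, Tuple[str, float]] = {}


def get_with_fallback(
    key: str,
    fetch: Callable[[str], str],
    default: str,
    max_age_s: float = 300.0,
) -> str:
    """Try the live service; on failure use a fresh-enough cached value, else a default."""
    try:
        value = fetch(key)
        _cache[key] = (value, time.monotonic())  # remember the last good answer
        return value
    except Exception:
        cached = _cache.get(key)
        if cached is not None and time.monotonic() - cached[1] <= max_age_s:
            return cached[0]
        return default


if __name__ == "__main__":
    def service_down(key: str) -> str:
        # Stand-in for a dependency that is currently unavailable.
        raise ConnectionError("service down")

    # Nothing cached yet, so the caller gets the default instead of an error.
    print(get_with_fallback("greeting", service_down, default="Hello"))
```

The dependency is down in the example, but the caller still gets a usable answer. A real implementation would want narrower exception handling and a shared cache, but the shape is the same.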
How should SREs help build reliability?
- Check the team has runbooks
- If not, write them!
- If so, review them – are they understandable to someone not familiar with the system, at 2am during an outage?
- (My understanding of a runbook is a document describing what to do in certain, typically undesirable, scenarios. While we do our best to build reliable systems, failure is inevitable, so we should document how to deal with it. VictorOps have some notes on how to create a Minimal Viable Runbook)
- Automate where possible, avoiding toil
- Toil can be defined as repetitive manual work that scales linearly as the service grows
- For example, make sure any infrastructure provisioning is automated, e.g. with Ansible or CloudFormation
- Hold Game Days (something close to my heart that I discussed in Testing in Production – Game Days)
- “Reduce sprawl”, by ensuring the team is using a code pipeline and standardizing on build tools
- Finally, improve monitoring and clean up alerts
Three Spheres of reliability
Tori also talked about the “Three Spheres of reliability”:
- Stability (Operations)
- Reliability (Improvement)
- Engineering (Shift left)
(I actually hadn’t heard the “Shift Left” term before, but after reading about it, I recognize the concept. It basically means testing should be performed earlier when possible, similar to the “Test early and often” philosophy. For example, if a bug is caught in UAT, could it have been caught earlier by an integration test? If it is caught in an integration test, can we create a unit test for it instead? Another way to put it might be that prevention is better than detection.)
Foundations for Reliability
- Reliability is a feature
- Reliability depends on shared understanding
- SRE is a challenging cross-disciplinary practice
(I really liked the “Reliability is a feature” idea. I hadn’t thought of it that way before, but it did occur to me that it is a feature that is not likely to show up in a sprint demo. No one notices when things work just fine, and an app’s lack of reliability is not likely to be apparent in the relatively short and artificial confines of a sprint demo. In other words, getting Product Managers to recognize reliability as a (necessary and indeed essential) feature can be a big challenge for SREs and developers. I actually think part of an SRE’s job is highlighting the need for reliability and advocating for sprint time to improve it. The PMs aren’t the ones who will get paged at 3am for unreliable apps, but they will care when customers start complaining about a flaky user experience. Make sure the team is thinking about reliability before customers are!)
Template for success
1. Determine your goal
2. Establish Roles
3. Focus Areas
Finally, one point that was made at the very start of the talk was that some organizations may even consider occasionally pulling the plug on apps that are otherwise very reliable and available, just so that the developers of the consuming services don’t expect 100% uptime! I love this idea. I guess one way to make this feasible in an organization is to announce that you will do it, and give people a chance to perform some chaos-engineering-style testing against it beforehand. “We’re going to kill it, you better be ready!”
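As a rough sketch of how a consuming team might rehearse for an announced outage like this, the wrapper below makes a dependency call fail some fraction of the time. This is my own illustration, not something shown in the talk; `with_planned_outages`, `PlannedOutage` and `failure_rate` are made-up names, and real chaos engineering tooling does far more than this.

```python
import random
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class PlannedOutage(Exception):
    """Raised when the wrapper deliberately simulates the dependency being down."""


def with_planned_outages(
    call: Callable[[], T],
    failure_rate: float,
    rng: Optional[random.Random] = None,
) -> Callable[[], T]:
    """Wrap a dependency call so that roughly `failure_rate` of invocations fail."""
    rng = rng or random.Random()

    def wrapped() -> T:
        # rng.random() is in [0.0, 1.0): a rate of 1.0 always fails, 0.0 never does.
        if rng.random() < failure_rate:
            raise PlannedOutage("simulated outage of a normally reliable service")
        return call()

    return wrapped


if __name__ == "__main__":
    def lookup_price() -> int:
        # Stand-in for a call to the "very reliable" upstream service.
        return 42

    flaky_lookup = with_planned_outages(lookup_price, failure_rate=0.5)
    for _ in range(5):
        try:
            print("price:", flaky_lookup())
        except PlannedOutage:
            print("upstream unavailable, consumer must cope")
```

Running consumers against a wrapper like this in a test environment, with `failure_rate` set to 1.0 for the full "plug pulled" case, is one cheap way to find out whether they really cope before the real thing happens.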
Tags: aws, newrelic, reinvent, reinvent2018, sitereliabilityengineering, sre, summary, Testing