Shaun Abram
Technology and Leadership Blog
Report Summary: Accelerate State of DevOps 2019
This is an abridged version of The Accelerate State of DevOps Report 2019, essentially a cut and paste of the most salient parts. The original is about 18,000 words; this is about 2,500 words.
I highly recommend reading the original in its entirety, if you have time, and I’m a big fan of the Accelerate book too. As with all the other summaries I create, this is just a way to help me digest and understand an excellent article.
Contents
- 1 Key Findings
- 2 How do we compare?
- 3 How do we improve?
- 4 How do we Transform: What really works?
- 5 Final Thoughts
The Accelerate State of DevOps Report gives an overview of the DevOps industry and offers actionable guidance for organizations to improve their software delivery performance.
Teams can then leverage the findings of the report to identify the specific capabilities they can use to improve their software delivery performance and ultimately become an elite performer.
The research continues to show that the industry-standard Four Key Metrics (https://www.thoughtworks.com/radar/techniques/four-key-metrics) of software development and delivery drive organizational performance and that it is possible to optimize for stability without sacrificing speed.
Key Findings
1) The industry continues to improve, particularly among the elite performers.
2) Delivering software quickly, reliably, and safely is at the heart of technology transformation and organizational performance.
3) The best strategies for scaling DevOps in organizations focus on structural solutions that build community.
4) Cloud continues to be a differentiator for elite performers and drives high performance.
5) Productivity can drive improvements in work/life balance and reductions in burnout, and organizations can make smart investments to support it.
6) There’s a right way to handle the change approval process, and it leads to improvements in speed and stability and reductions in burnout.
How do we compare?
Software delivery and operational (SDO) performance
There are four metrics that provide a high-level systems view of software delivery and performance and predict an organization’s ability to achieve its goals. Last year, we added an additional metric focused on operational capabilities, and found that this measure helps organizations deliver superior outcomes. We call these five measures software delivery and operational (SDO) performance. This helps avoid the common pitfalls of software metrics, which often pit different functions against each other and result in local optimizations at the cost of overall outcomes.
The first four metrics, which capture the effectiveness of the development and delivery process, can be summarized in terms of throughput and stability. We measure the throughput of the software delivery process using lead time (from check-in to release) along with deployment frequency. Stability is measured using time to restore (the time it takes from detecting a user-impacting incident to having it remediated) and change fail rate (a measure of the quality of the release process).
These metrics do not represent a set of trade-offs; the research has consistently shown that speed and stability are outcomes that enable each other.
Throughput
Deployment frequency
How often does your organization deploy code to production or release to its end users?
The elite group reported that it routinely deploys on-demand and performs multiple deployments per day. By comparison, low performers reported deploying 2-12 times per year.
Change lead time
Lead time is how long it takes to go from code committed to that code successfully running in production.
Elite performers report change lead times of less than one day. In contrast, low performers required lead times between one month and six months.
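As a rough illustration (not taken from the report), the two throughput metrics can be computed from a log of changes. The sample data, the function names, the choice of the median as the summary statistic, and the four-day window are all assumptions made for this sketch.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical sample data: (commit time, production deploy time) per change.
changes = [
    (datetime(2019, 9, 2, 9, 0), datetime(2019, 9, 2, 15, 0)),
    (datetime(2019, 9, 3, 10, 0), datetime(2019, 9, 3, 12, 0)),
    (datetime(2019, 9, 3, 11, 0), datetime(2019, 9, 4, 11, 0)),
    (datetime(2019, 9, 5, 8, 0), datetime(2019, 9, 5, 9, 0)),
]


def deployment_frequency(changes, window_days):
    """Average number of production deployments per day over the window."""
    return len(changes) / window_days


def median_lead_time(changes):
    """Median time from commit to successful production deployment."""
    seconds = [(deployed - committed).total_seconds()
               for committed, deployed in changes]
    return timedelta(seconds=median(seconds))


print(deployment_frequency(changes, window_days=4))
print(median_lead_time(changes))
```

With real data, the pairs would come from version control and the deployment pipeline rather than a hard-coded list.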
Stability
Time to restore service
How long does it take to restore service when an incident or defect that impacts users occurs (e.g., an unplanned outage or service impairment)?
The elite group reported time to restore service of less than one hour, while low performers reported between one week and one month.
Change failure rate
What percentage of changes to production result in degraded service and require remediation such as a hotfix or rollback?
Elite performers reported a change failure rate of 7.5%, while low performers reported change failure rates of 53%.
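The two stability metrics can be sketched in the same spirit. This is a minimal example under stated assumptions: the deployment flags, incident timestamps, and function names are hypothetical, and the mean is used for time to restore only for simplicity.

```python
from datetime import datetime, timedelta

# Hypothetical deployment log: True means the change degraded service
# and needed remediation (for example a hotfix or rollback).
deployments_failed = [False, False, True, False, False,
                      False, False, True, False, False]

# Hypothetical incidents: (time detected, time service was restored).
incidents = [
    (datetime(2019, 9, 2, 14, 0), datetime(2019, 9, 2, 14, 30)),
    (datetime(2019, 9, 6, 9, 0), datetime(2019, 9, 6, 10, 30)),
]


def change_failure_rate(deployments_failed):
    """Percentage of production changes that required remediation."""
    return 100 * sum(deployments_failed) / len(deployments_failed)


def mean_time_to_restore(incidents):
    """Average time from detecting an incident to restoring service."""
    total = sum((restored - detected for detected, restored in incidents),
                timedelta())
    return total / len(incidents)


print(change_failure_rate(deployments_failed))
print(mean_time_to_restore(incidents))
```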
Availability
In addition to speed and stability, availability is important for operational performance. Availability represents an ability for technology teams and organizations to keep promises and assertions about the software they are operating. Availability reflects how well teams define their availability targets, track their current availability, and learn from any outages, making sure their feedback loops are complete. The items used to measure availability form a valid and reliable measurement construct.
Better software delivery goes hand-in-hand with higher availability.
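The report measures availability through survey questions, not a formula. One common way to define a target and track against it, though, is an uptime percentage over a fixed window. The sketch below assumes a 30-day window, a 99.9% target, and 36 minutes of downtime, all hypothetical values.

```python
WINDOW_MINUTES = 30 * 24 * 60  # assumed 30-day tracking window


def availability_percent(total_minutes, downtime_minutes):
    """Share of the window during which the service was available."""
    return 100 * (total_minutes - downtime_minutes) / total_minutes


def allowed_downtime_minutes(total_minutes, target_percent):
    """Downtime the target tolerates over the window."""
    return total_minutes * (100 - target_percent) / 100


target = 99.9  # hypothetical availability target, in percent
observed = availability_percent(WINDOW_MINUTES, downtime_minutes=36)

print(f"observed: {observed:.3f}%")
print(f"allowed downtime: {allowed_downtime_minutes(WINDOW_MINUTES, target):.1f} min")
print("target met" if observed >= target else "target missed")
```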
How do we improve?
How do we improve SDO & Organizational Performance?
Start with foundations:
- Basic automation (such as version control and automated testing),
- monitoring,
- clear change approval processes,
- a healthy culture.
Then identify your constraints to plan your path forward.
Focus resources on what is currently holding you back, then iterate: Identify constraints and choose the next target.
Cloud
More organizations are choosing multi-cloud and hybrid cloud solutions, although there is no agreed standard definition for what it means to work in a hybrid or multi-cloud environment. If respondents say they are working in a hybrid environment, then they are.
How you implement Cloud Infrastructure matters
What really matters is how teams implement their cloud services, not just that they are using a cloud technology.
The National Institute of Standards and Technology (NIST) defines five essential characteristics of cloud computing. Elite performers were 24 times more likely to have met all essential cloud characteristics than low performers. The characteristics are:
- On-demand self-service: Consumers can automatically provision computing resources as needed, without human interaction from the provider.
- Broad network access: Capabilities can be accessed through heterogeneous platforms such as mobile phones, tablets, laptops, and workstations.
- Resource pooling: Provider resources are pooled in a multi-tenant model, with physical and virtual resources dynamically assigned on-demand.
- Rapid elasticity: Capabilities can be elastically provisioned and released to rapidly scale outward or inward on demand, appearing to be unlimited.
- Measured service: Cloud systems automatically control, optimize, and report resource use based on the type of service such as storage, processing, bandwidth, and active user accounts.
Cloud Cost
Adopting cloud best practices improves organizations’ visibility into the cost of running their technologies, making them more likely to be able to:
- accurately estimate the cost to operate software
- identify their most operationally expensive applications
- stay under software operation budget
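To make the last two points concrete, here is a small sketch of what that visibility allows once per-application costs are known. The application names, cost figures, and budget are invented for illustration.

```python
# Hypothetical monthly operating cost per application, in dollars.
monthly_cost = {
    "checkout": 4200,
    "search": 7800,
    "reporting": 1500,
    "auth": 900,
}
monthly_budget = 15000  # hypothetical software operation budget


def most_expensive(costs, n=2):
    """Names of the n applications that cost the most to operate."""
    return sorted(costs, key=costs.get, reverse=True)[:n]


def under_budget(costs, budget):
    """True if the total operating cost stays within the budget."""
    return sum(costs.values()) <= budget


print(most_expensive(monthly_cost))
print(under_budget(monthly_cost, monthly_budget))
```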
Technical Practices
Executing for maximum effect
First work to understand the constraints in your current software delivery process with an eye to your short- and long-term outcomes in measurable terms.
Then empower teams to decide how best to accomplish those outcomes. By not having to micromanage detailed execution plans, management can focus on high-level outcomes, allowing their organizations to grow. By focusing on designing and executing short-term outcomes that support the long-term strategy, teams are able to adjust to emergent and unanticipated problems.
While there is no “one size fits all” approach to improvement, we have observed some themes in our work helping organizations adopt DevOps…
Concurrent efforts at team and organization levels
Team-level and organization-level efforts can and should proceed concurrently, as they often support each other.
For example, deployment automation at the team level will have little impact if the team’s code can only be deployed together with that of other teams.
Team-level technical capabilities
Test automation has a significant impact on both CI (continuous integration) and CD (continuous delivery). With automated testing, developers gain confidence that a failure in a test suite denotes an actual failure just as much as a test suite passing successfully means it can be successfully deployed. Deployment automation, trunk-based development, and monitoring all impact CD.
Organization-level technical capabilities
Some capabilities benefit from organization-level coordination and sponsorship. Examples of these kinds of capabilities are those that involve decisions or design that span several teams, such as architecture or policy (e.g., change management).
This year’s research revalidated that a loosely coupled architecture has a positive impact on CD. A loosely coupled architecture is one in which delivery teams can independently test, deploy, and change their systems on demand without depending on other teams for additional support, services, resources, or approvals, and with less back-and-forth communication. This allows teams to quickly deliver value, but it requires orchestration at a higher level.
Architectural approaches that enable this strategy include the use of bounded contexts and APIs, and service-oriented and microservice architectures. Architecture designs that permit testing and deploying services independently help teams achieve higher performance.
Our analysis found that code maintainability positively contributes to successful CD.
We have found that trunk-based development with frequent check-in to trunk and deployment to production is predictive of performance outcomes.
Whether you are working on a closed-source code base or an open source project, short-lived branches; small, readable patches; and automatic testing of changes make everyone more productive.
Disaster Recovery Testing
Tests should be performed using production systems, for two reasons.
- It’s difficult and expensive to create comprehensive reproductions of production systems.
- The types of incidents that bring down production systems are often caused by interactions between components that are operating within apparently normal parameters, which might not be encountered in test environments.
Organizations that conduct disaster recovery tests are more likely to have higher levels of service availability.
Mike Garcia, Vice President of Stability & SRE at Capital One, says (paraphrased):
Delivering quickly on modern cloud technology is not enough. We need to demonstrate resiliency, and more than just the ability to fail over en masse… We need to show automatic resiliency using chaos-testing techniques.
Organizations that work together cross-functionally and cross-organizationally to conduct disaster recovery exercises see improvements in more than just their systems. Because these tests pull together so many teams, the exercises also improve and strengthen the processes and communication surrounding the systems being tested, making them more efficient and effective.
Learning from Disaster Recovery Exercises
Analysis shows that organizations that create and implement action items based on what they learn from disaster recovery exercises are more likely to be in the elite performing group.
Blameless post-mortems are an important aspect to support growth and learning from failure. Conducting blameless post-mortems contributes to a learning culture and an organizational culture that optimizes for software and organizational performance outcomes.
Good follow-up reading is Weathering the Unexpected by Kripa Krishnan with Tom Limoncelli.
Change Management
We recommend that organizations move away from external change approval because of the negative effects on performance. Instead, organizations should “shift left” to peer review-based approval during the development process.
A clear change process is also important. When team members have a clear understanding of the process to get changes approved for implementation, this drives high performance.
Culture of Psychological Safety
Our own research has found that an organizational culture that optimizes for information flow, trust, innovation, and risk-sharing is predictive of performance.
High-performing teams need a culture of trust and psychological safety, meaningful work, and clarity.
How do we improve Productivity?
Productivity is the ability to get complex, time-consuming tasks completed with minimal distractions and interruptions.
Most agree that productivity is important: Productive engineers are able to do their work more efficiently, giving them more time to re-invest into other work, such as documentation, refactoring, or doing more of their core function to deliver additional features or build out additional infrastructure.
But what is productivity, and how should we measure it? Productivity cannot be captured with a simple metric such as lines of code, story points, or bugs closed; doing so results in unintended consequences that sacrifice the overall goals of the team. For example, teams may refuse to help others because it would negatively impact their velocity, even if their help is important to achieve organizational goals.
To use this model, locate the goal you want to improve in the figure, and then identify the capabilities that impact it.
Useful, Easy-to-use Tools
When building complex systems and managing business-critical infrastructure, tools are even more important because the work is more difficult.
The strongest concentration of fully proprietary software is seen in low performers, while the lowest concentration is seen among high and elite performers. Proprietary software may be valuable, but it comes at great cost to maintain and support. It’s no surprise that the highest performers have moved away from this model.
Companies should be thoughtful about which software is strategic and which is merely utility. By addressing their utility needs with commercial solutions and minimizing customization, high performers save their resources for strategic software development efforts.
Automation is truly a sound investment. It allows engineers to spend less time on manual work, thereby freeing up time to spend on other important activities such as new development, refactoring, design work, and documentation.
Internal and External Search
Finding the right information to help solve a problem, debug an error, or find a similar solution quickly and easily can be a key factor in getting work done and maintaining the flow of work. We found that having access to information sources supports productivity. These information sources come in two categories: internal search (e.g. wikis, Jira, GitHub) and external search (e.g. search engines and Stack Overflow).
Technical Debt
Technical debt is what happens when we fail to adequately maintain “immature” code. Technical debt negatively impacts productivity. One approach to reducing technical debt is refactoring.
Culture of Psychological Safety
A culture that values psychological safety, trust, and respect contributes to productivity by letting employees focus on solving problems and getting their work done rather than politics and fighting.
Additional Benefits of Improved Productivity
The benefits to the team and organization from higher productivity are usually obvious: more work gets done, so we deliver more value. But what about benefits to the people doing the work?
Work Recovery
Work recovery is the ability to cope with work stress and detach from work when we are not working, or “leaving work at work”. Productivity has a positive impact on work recovery.
Burnout
Burnout is a combination of exhaustion, cynicism, and inefficacy at work. Good technical practices and improved process (in the form of clear change management) can reduce burnout. We see that the highest performers are half as likely to report feeling burned out.
How do we Transform: What really works?
A DevOps transformation is not a passive phenomenon. This year we sought to identify the most common approaches for spreading DevOps best practices throughout an organization.
Organizations commonly use several approaches to spread DevOps and Agile methods, including:
- Training Center
- Center of Excellence
- Proof of Concept but Stall
- Proof of Concept as a Template
- Proof of Concept as a Seed
- Communities of Practice
- Big Bang
- Bottom-up or Grassroots
High performers favor strategies that create community structures at both low and high levels in the organization. The top two strategies employed are Communities of Practice and Grassroots, followed by Proof of Concept (PoC) as a Template and PoC as a Seed.
Low performers tend to favor Training Centers (also known as DOJOs) and Centers of Excellence (CoE)—strategies that create more silos and isolated expertise.
Final Thoughts
We see continued evidence that DevOps delivers value. It is not a trend, and will eventually be the standard way of software development and operations, offering everyone a better quality of life.