
 

How much is your slow lead time costing you?

In a previous blog post, I discussed slow build times and estimated the associated costs. However, the build process is only one part of getting software out the door.

Lead time is the time it takes to go from code committed to code successfully running in production. It includes the build time we covered in the previous blog post, as well as everything else required to get your code into users’ hands, such as testing and deployments. This article focuses on the costs of that lead time.

Using the example of a team of 10 engineers, I estimate that the costs of a slow (one week) lead time could be the approximate equivalent of more than 3 engineers, or roughly $400,000 per year. And I think it’s entirely possible that is on the low side, since there are other costs that are simply difficult to estimate. Imagine how much more you could achieve with 3+ extra engineers on the team.

Charity Majors goes further (discussed below) and suggests that reducing the lead time to hours could save the cost of 5 engineers on such a team. I was initially skeptical of that claim, but after working through these estimates, I think she may well be more accurate than my possibly over-conservative math.

Thanks

A big thank you to my former colleagues Dave Taubler, Abhijit Karpe, Josh Outwater and Steve Mauro for providing feedback and input on this article.

Most of the feedback took issue with some aspect of the estimates, which is fair. But the common theme was that everyone agreed there is a very real cost to slow lead times, that the cost is high, and that using data where you can and estimates where needed is a good way to surface and highlight it.

 

 

Why Lead Time is important

First, what is Lead time? The excellent Accelerate book (which I summarized here) defines Lead Time as:

“the time it takes to go from code committed to code successfully running in production.”

A related concept is the deployment pipeline, defined in the Continuous Delivery book as:

“A deployment pipeline is an automated implementation of your application’s build, deploy, test, and release process.”

Lead Time can be viewed as a reflection of your deployment pipeline’s efficiency and as such is an incredibly useful metric to track. The Accelerate book states that:

“Shorter product delivery lead times are better since they enable faster feedback on what we are building and allow us to course correct more rapidly. Short lead times are also important when there is a defect or outage and we need to deliver a fix rapidly and with high confidence.”

Lead time is one of the key performance metrics highlighted in the 2019 State of DevOps Report, which shows that long lead times are not at all unusual. Even high performers can have lead times of up to a week, while low performers can have lead times of up to a whopping six months.

Not understanding the true cost of that time could mean that we are ignoring significant costs and inefficiencies in our organizational processes. We need to make those hidden costs visible. Being able to articulate and estimate the cost of long lead times makes it easier to have conversations with Product Managers about the priority of the work. Also, sometimes people hear about CICD and say, “But we don’t need multiple releases a day!” Multiple releases per day is not the goal in itself. Reducing waste, WIP, and the associated costs is.

So, let’s look at the actual cost of slower lead times…

Three ways a slow lead time could be costing you

 

If you have a lead time of, say, a week (meaning that after an MR or PR is merged, it takes a week for that code to actually get to production), how much does that actually cost you? If you were to decrease that lead time from a week to an hour, for example, how much would you save? I will try to answer this by looking at 3 separate ways that slow lead times can cost you:

1) The cost of delay from lost revenue

2) The cost of treading water while waiting to release

3) The cost of context switching back to old work

Finally, we will look at the cost using an approach from Charity Majors.

 

An example team

In all of the following examples, I am going to use an example team with

  • 10 engineers, each shipping 1 feature per week
  • A lead time of 1 week. That is, a delay of one week from our code being completed by an engineer until it is released in production to our customers.
  • An annual cost of $120k for each engineer (taken from payscale.com)
    • = $2,400 per week, $480 per day, $60 per hour
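These per-period rates can be sketched in a few lines of Python. The 50-week year, 5-day week, and 8-hour day are the assumptions implied by the figures above:

```python
# Per-period cost of one engineer, derived from the $120k annual figure.
# Assumes 50 working weeks per year, 5 days per week, 8 hours per day.
ANNUAL_COST = 120_000

cost_per_week = ANNUAL_COST / 50   # $2,400
cost_per_day = cost_per_week / 5   # $480
cost_per_hour = cost_per_day / 8   # $60

print(f"${cost_per_week:,.0f}/week, ${cost_per_day:,.0f}/day, ${cost_per_hour:,.0f}/hour")
```

Swap in your own fully loaded engineer cost; everything downstream scales linearly from it.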

 

1) The cost of delay from lost revenue

TLDR; If we estimate the revenue we expect from a feature, we can estimate the cost of delaying that feature due to a slow lead time. That cost of delay from lost revenue could be as much as $120,000 per year on our example team of 10 engineers with a one week lead time.

Data & Estimates

If we can estimate the revenue a feature is going to generate over a one-year period, then we can estimate the cost of our slow lead time. For example, if a feature will generate $5,000 in revenue in a year, then a one-week lead time is costing you a week of that revenue: $5,000/50 = $100. But how do we estimate the revenue expected from a typical feature? How about we tackle this problem by answering this question:

If we spend $1 in engineering costs to develop a feature, how much money would we need to make from that feature for that $1 investment to make sense?

Assumption: For every $1 we spend on engineering, we need to generate $5 in revenue, or a 5x multiplier. 

(See the reasoning behind this below at How much revenue do we need engineering work to generate?)

 

Calculations

So, a feature that takes one engineer a week to develop costs us $2,400 in engineering costs (based on our approximate engineer costs above).

Therefore we would need to see a return (using our 5x multiplier) of $12,000 revenue per year, or $240 per week. So, by delaying this new feature by a week, we are costing ourselves $240.

Not too big of a deal, right? But if we have a pipeline that delays things by a week, then every feature by every engineer is delayed by a week.

So if we could move to a CICD process where we release features as soon as they are complete, we could save this amount on every feature.

If we have a team of 10 engineers, each shipping 1 feature per week, we would be saving:
10 engineers * 1 feature * 50 weeks * $240 = $120,000
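The calculation above can be sketched like this. The 5x revenue multiplier and one-week lead time are the assumptions already discussed:

```python
# Cost-of-delay sketch: every feature's expected revenue is delayed by
# the one-week lead time. The 5x multiplier is the assumption above.
engineers = 10
weeks_per_year = 50
cost_per_feature = 2_400               # one engineer-week of effort
revenue_multiplier = 5

required_annual_revenue = cost_per_feature * revenue_multiplier   # $12,000
revenue_lost_per_week_of_delay = required_annual_revenue / weeks_per_year  # $240

annual_cost_of_delay = engineers * weeks_per_year * revenue_lost_per_week_of_delay
print(f"${annual_cost_of_delay:,.0f}")  # $120,000
```

Adjust the multiplier and lead time to match your own organization.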

The takeaway

By reducing the lead time by a week on a team of 10 engineers, the savings from the cost of delay from lost revenue alone could be $120K a year. It would be like adding a whole new engineer to the team.

 

2) The cost of treading water while waiting to release

TLDR; The cost of treading water while waiting to release could cost $120,000 per year on our example team of 10 engineers with a one week lead time.

Let’s continue with our example of a lead time of one week. During that week of delay, will the engineer spend any time on the already completed feature? The work remains on project plans, Jira boards, release schedules, etc., until it is released, and so will likely still incur some work. Some places an engineer could spend time “treading water” while waiting for the release include:

  • Release review meetings (or Change Approval Boards).
  • Status reporting, such as standups, or managers asking for updates.
  • Coordinating with QA
  • Manual deploy & release processes
  • Dealing with merge conflicts

Most of this is manual work that could have been automated (such as the testing and release process), or toil that adds little if anything to the equation (admin, bureaucracy, process). We are tying up perfectly good engineers with work of little benefit. We are working to stand still.

How much does this effort actually cost us?

 

Data & Estimates

We would need to know how much time the engineer actually spends on this work. I can’t think of any way to get hard data on this, so a best guess is in order. 10% seems a reasonable guess of how much time gets spent.

Assumption: During a one week delay, an engineer spends 10% (4 hours) of their time on the task already completed last week.

Calculations

= 4 hours * $60 per hour (again, see approximate engineer costs above)

= $240 per feature

So, for our example:

  • if we have a team of 10 engineers,
  • each shipping 1 feature per week
  • each spending 4 hours or $240 on that feature after “finishing” it but before shipping

It would cost:

10 engineers * 1 feature * 50 weeks * $240 = $120k
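As a sketch, using the 10% treading-water guess from above:

```python
# Treading-water sketch: 10% of a 40-hour week (4 hours) spent on each
# already-finished feature while it waits to ship, at $60/hour.
engineers = 10
weeks_per_year = 50
hourly_rate = 60
hours_per_feature = 0.10 * 40   # 4 hours of meetings, status, coordination

annual_treading_water_cost = engineers * weeks_per_year * hours_per_feature * hourly_rate
print(f"${annual_treading_water_cost:,.0f}")  # $120,000
```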

 

The takeaway

The cost of treading water for a week while waiting to release could cost $120,000 per year.

Coincidentally, this cost is again the cost of a full time engineer. Reducing our lead time and shipping as soon as we’re done would again be like adding a whole new engineer to the team.

3) The cost of context switching back to old work

TLDR; Things get more difficult to troubleshoot in production as time passes since we wrote the code. That could cost $150,000 per year on our example team of 10 engineers with a one week lead time.

After a week’s delay between code complete and release, if something goes wrong in production, you need to immerse yourself in the code again. In her “It is time to fulfill the promise of Continuous Delivery” presentation, Charity Majors says that as you finish coding, you have everything you need in your head: your motivation, intent, implementation details, tradeoffs, variable names, etc. But once you finish the code, you lose that information. Fast. In minutes.

The question is, how much longer does it take you to be able to triage, understand and fix the bug than if, for example, you had released the code to production the same day that you finished writing it? Can we estimate that cost?

Data & Estimates

Here is the data we would need:

  1. How many bugs do we typically have in production as a result of a release?
  2. How much time does it take to debug an issue, assuming things are fresh in your mind and there is no context switch required?
  3. How much extra effort does the context switching required by a one-week delay take?

Let’s take those one at a time.

1. How many bugs are we actually dealing with in production as a result of the release?

Most teams can pull data for this. Production bugs are one thing that teams tend to track fairly well, usually tying them to the release that caused them. In fact, if you are tracking one of the other Accelerate metrics, Change Failure Rate (the percentage of your releases that result in degraded service and need remediation such as a hotfix or rollback), you already need to associate production bugs with the release that caused them. And of course, how many bugs there will be depends on the complexity of the release, how many engineers worked on it, how many features went out, etc.

For the purposes of this example, let’s assume that our one engineer working on one feature over a one-week period will introduce a single bug into production. Obviously your mileage will vary, and you can customize this for your own calculations.

Assumption: One bug per feature per release

 

2. How much time does it normally take to debug an issue?

This one is trickier to pull data for. Your engineers are unlikely to track exact start and stop times for working on each bug, never mind tracking the interruptions while working on it, so we need to estimate this one.

This Bug Fixing article mentions that Jeff Sutherland, one of the inventors of Scrum, suggests an estimate of half a day to fix a bug. That seems reasonable, so let’s go with that.

Assumption: Typical bug fix time is 4 hours

 

3. The cost of context switching to fix an old bug

Again, I am not sure how to pull data for this one. Has anyone done studies on how context switching impacts bug fixing? I’d love to hear if they have, but I think we need to estimate for now.

I discussed this with a few colleagues and the consensus seemed to be that after a week, it would take at least 100% (2x) longer to troubleshoot a production bug. (I discuss this estimate a little more below in How much longer does a bug take to fix in production)

Assumption: A bug takes 2x (100%) longer to fix a week after completing the code

 

Calculations

OK, we have some data points, estimates, and assumptions, and we will continue with our hypothetical scenario of 1 engineer spending one week on one feature that generates one production bug.

With no context switching that bug might take 4 hours, but we are assuming a 100% increase due to our lead time of a week, so an extra 4 hours.

So, again, assuming

  • A team of 10 engineers,
  • each shipping 1 feature per week
  • with each feature causing a single bug
  • and each bug requiring an extra 4 hours to troubleshoot because of context lost from a week of delay

10 engineers * 1 feature each with 1 production bug * 50 weeks * 4 additional hours to troubleshoot

= 2,000 hours @ $60 per hour (again, see approximate engineer costs above)

= $120,000

The additional cost of context switching back again

However, this is only one side of the equation! To fix this production bug for the work you completed last week, you not only have the cost of ramping up on it again, you are also incurring a cost from ramping down on the features you are working on this week. Then when you have fixed the production bug, you now need to ramp up again on this week’s work. There is a very real cost associated with that too!

How much? Again, it’s really hard to put a number on that, but I will as usual take a conservative estimate here and say that the cost is just one hour. That is, ramping down and back up again on this week’s feature costs one hour.

The (admittedly questionable) logic is that context switching definitely costs you something, and this seems like an acceptable minimum in the absence of hard data.

So our cost to switch back to our original work after fixing the production bug is:

10 engineers * 1 feature each with 1 production bug * 50 weeks * 1 additional hour to context switch back

= 500 hours @ $60 per hour

= $30k

So, our overall context switch cost is

$120k + $30k = $150k
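Both sides of the context-switching cost can be sketched together, using the per-bug assumptions from above:

```python
# Context-switch sketch: each feature ships one bug; week-old context
# adds 4 extra hours per fix, plus 1 hour to switch back to current work.
engineers = 10
weeks_per_year = 50
bugs_per_feature = 1
extra_debug_hours = 4      # 2x a typical 4-hour fix, due to lost context
switch_back_hours = 1
hourly_rate = 60

bugs_per_year = engineers * weeks_per_year * bugs_per_feature       # 500
ramp_up_cost = bugs_per_year * extra_debug_hours * hourly_rate      # $120k
switch_back_cost = bugs_per_year * switch_back_hours * hourly_rate  # $30k
print(f"${ramp_up_cost + switch_back_cost:,.0f}")  # $150,000
```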

 

The takeaway

After delaying our release, the cost of context switching back to old work could cost $150,000 per year

Once again, this is remarkably close to being the cost of a full time engineer. The savings from avoiding the context switching to work we have long since completed would be like adding a whole new engineer to the team.

Totaling the costs

On a team of 10 engineers with a one week lead time:

  • The cost of delay from lost revenue could cost $120,000 per year
  • The cost of treading water for a week while waiting to release could cost $120,000 per year
  • The cost of context switching back to old work could cost $150,000 per year

The total cost of a one week lead time on our example team of 10 engineers comes to $390,000 per year ($120k + $120k + $150k), roughly $400,000, or the approximate equivalent of more than 3 engineers.

Imagine how much extra work you could do with 3+ engineers freed up!

Are there even more costs?

So far, we’ve covered 3 separate costs that you might incur from a long lead time. But are there others we’re not accounting for here? Definitely.

  • Delays will cause work to start batching up, getting us away from the “limit WIP” mantra of Lean. And as Charity points out, this can make ownership of the resulting problems less clear, requiring more coordination, project management, etc.
  • There are also costs to be saved by automating manual processes. If you have a lead time of a week, it is highly likely it is at least partially due to manual processes. Manual testing is still common. Maybe you even have an offshore team doing it for you overnight. Converting recurring manual tests to automated ones can easily save the time of several engineers each week. (And yes, there may well still be a need for manual testing, such as exploratory testing, but it should be the exception rather than the rule.)
  • There are also costs to be saved by improving your build time. Part of reducing your lead time is often reducing your build time, and as we discussed in the previous post, there are very real savings to be had there too.

However, for some of these costs, I just don’t know how to go about estimating them; they are likely to vary enormously from org to org. And also, wow, have I already spent waaaay more time on this blog post than I intended.

Combine that strong sense that we are still missing costs associated with long lead times with the sense that many of my estimates above are probably low, and my estimate of saving the equivalent cost of 3 engineers per team of 10 by reducing a weekly lead time to hours is probably itself on the low side. Which leads me to…

 

The Charity Majors perspective

I love reading Charity Majors’ writing. She is always entertaining and thought provoking. And on the topic of CICD she is particularly enthusiastic.

In her blog post “How much is your fear of continuous deployment costing you?“, she suggests the following:

If your team has n engineers with a <15 min delivery interval,

it would take twice as many engineers with an interval of hours,

and twice as many more with an interval of days.

So by this logic, with our example team of 10 people working with a lead time of one week, improving our pipeline to deliver within hours or minutes could let us do the same work with 2.5–5 people!

Let’s take the more conservative of those and assume we could do the same work with half the team.
So on my hypothetical team of 10 engineers, that would mean a saving of 5 * $120k = $600k a year.
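Her rule of thumb can be sketched like this. Note that treating our one-week lead time as her “days” tier is my assumption:

```python
# Sketch of Charity Majors' rule of thumb: headcount roughly doubles at
# each slower delivery tier (<15 min -> hours -> days). Treating our
# one-week lead time as the "days" tier is an assumption.
team_size = 10
annual_cost_per_engineer = 120_000

team_needed_at_hours = team_size / 2     # 5 engineers
team_needed_at_minutes = team_size / 4   # 2.5 engineers

# Conservative case: deliver within hours, free up half the team.
annual_saving = (team_size - team_needed_at_hours) * annual_cost_per_engineer
print(f"${annual_saving:,.0f}")  # $600,000
```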

Charity’s methodology is mainly anecdotal and based on the experience of other technical folks. Is my approach any better? I tried to be more detailed, I guess, but in the end she may well be closer to the truth than my assumptions, guesses, and back-of-the-napkin math.

 

Conclusion

So, a quick recap. Based on our example team of 10 engineers with a lead time of one week, we found that:

  • The cost of delay from lost revenue alone could be the equivalent of one engineer
  • The cost of treading water for a week while waiting to release could also be the equivalent of one engineer
  • The cost of context switching back to old work could again also be the equivalent of one engineer

So by improving the deployment pipeline and reducing our lead time to, say, an hour, we could essentially free up the equivalent of 3 engineers. A massive saving.

But Charity Majors goes further and suggests you could free up the equivalent of 5, or even 7.5, engineers.

I’ve tried to put some data and structure around my estimate. Not all of the above analysis is based on hard data; some of it uses estimates and approximations. But I feel this approach is significantly better than my previous vague worries that slow lead times are costing us. Hopefully these approaches can help you have better, more structured conversations with your Product Management and leadership teams about the cost of delay for technology-driven initiatives, or Non-Functional Requirements.

“If we have data, let’s look at data. If all we have are opinions, let’s go with mine.”
― Jim Barksdale

 

Support material

The background calculations

How much revenue do we need engineering work to generate?

Above, I asked the question, if we spend $1 in engineering costs to develop a feature, how much money would we need to make from that feature for that $1 investment to make sense?

Can it earn zero revenue? Sure, there are many examples of work that might not directly generate revenue, such as:

  • Rotating keys and credentials
  • Switching to a new service provider
  • Switching to a new framework because of difficulty hiring engineers for an old one
  • Security upgrades
  • Upgrades forced by end-of-life

And those get into the realm of savings, harm reduction, or cost reduction. You could argue that multipliers apply there just as easily. Would you spend $1 of engineering time to save $1 in costs? Probably not. $1 of engineering time to save $5 in costs annually? Probably.

But revenue seems easier to focus on, to quantify, and it keeps things simpler. And, for most companies, on average (or, amortized over time and across engineers) individual releases need to generate revenue, otherwise you may be in dire straits.

So, let’s start with a low estimate: We need to make at least $1, or a factor of 1x. That would cover our engineering time, but engineering is only a fraction of our costs, with other components including research, design, marketing, sales, legal, HR, management, etc. And of course, we need to do more than just cover costs.

I asked a trusted colleague experienced in Product Management, who said that as a bare minimum, we might want to see a return of at least 7x over a 2-year period for the engineering investment to make sense. That is, 3.5x in a one-year period.

This article on the “Laws Of Software Economics” (and its related post, Your Next Developer Costs $1M/Year in Revenue) suggests you should expect a 6x revenue return on development spending.

This article from the Boston Consulting Group on “How Software Companies Can Get More Bang for Their R&D Buck” says that while spending on R&D (which includes software development) may vary from 5% to 50% of annual revenue, the fastest growing companies typically spend more than 20% of their revenues on R&D. And their analysis of 35 publicly traded software companies shows that the median R&D spending among high-growth companies is 26% of revenues. So, that would suggest a return of 4x to 20x on R&D spending.

So, I tried to pick a reasonable, if conservative, estimate and went with a 5x factor.

Sure, these things may vary enormously based on the type and age of your company, your industry, how you generate revenue (licenses, subscriptions, in-app purchases…), etc., so while this estimate is certainly not perfect, it seems reasonable and is unlikely to be an order of magnitude off. The good part is that you can adjust this factor as you see fit in your own calculations.

In the meantime, if you have other suggestions, or better still any data or research around this, I would love to hear.

 

How much longer does a bug take to fix in production?

TLDR; This Facebook Engineering article suggests that it takes 20x longer to fix a bug in production than at peer review. I am making a conservative estimate that it will take 2x longer to troubleshoot a bug released to production one week after the code was written.

I found it surprisingly hard to find hard data on the cost of bug fixes as time passes.

The National Institute of Standards & Technology (NIST) published a 2002 report called “The Economic Impacts of Inadequate Infrastructure for Software Testing” (nist.gov) where (in table 5-1) it states that if a bug found in the requirements-gathering or design stage costs 1x (e.g. $1) to fix, it will cost 30x (e.g. $30) to fix in production. It does, however, clearly state that these numbers are examples only. Nonetheless, that hasn’t stopped many others from reporting on and quoting these numbers as fact.

The Leprechauns of Software Engineering book looks at similar claims around the cost of defects, including the “Software Requirements Engineering” section of a 1976 article in which the author, Boehm, concluded that “it pays off to invest effort in finding requirements errors early and correcting them in, say, 1 man-hour rather than waiting to find the error during operations and having to spend 100 man-hours correcting it”. A 100x increase! However, the Leprechauns book suggests such claims are “not just awkwardly vague, but in fact almost entirely anecdotal.”

Similarly, an “IBM Systems Science Institute report” is widely cited, as are the “Pressman Ratios”, and they result in nice little graphs. But these also seem to have largely been debunked (see the GitHub discussions here and here).

More recently, however, I found this Faster, more efficient systems for finding and fixing regressions post from Facebook Engineering. It suggests that it takes 20x longer to fix a bug in production than it does when “a diff is being reviewed” (which I’m assuming means at the peer review stage).

 

If you are aware of other sources on the real relative costs of bugs in production versus the development phase, please let me know!

In the meantime, I went with a conservative estimate:

I am assuming that after a one week delay from writing the code, it will take you 2x longer to troubleshoot a production bug.

Whether you agree or not, and as with all the calculations here, you can use your own estimate that best fits your team.

 

Terminology

Release process vs deployment pipeline

What is the difference between the terms release process and deployment pipeline? In Continuous Delivery, they say:

A deployment pipeline is, in essence, an automated implementation of your application’s build, deploy, test, and release process.

This suggests that the release process is a subset, and the final piece, of the deployment pipeline, and that a deployment pipeline is fully automated. That works for me.

 

Lead Time vs deployment pipeline

Is Lead Time simply how long it takes your deployment pipeline to complete?

Not necessarily. If you’re shipping code to users, you definitely have a Lead Time, but you might not even have a deployment pipeline, since that implies automation. Maybe you just have a plethora of manual processes stuck together with bash scripts and duct tape. And even if you do have a deployment pipeline, Lead Time starts at commit, but a deployment pipeline usually starts with a build. So, I was deliberately vague above when I said “Lead Time can be viewed as a reflection of your deployment pipeline’s efficiency”. A reflection, rather than a direct measure of it.

 

Deployment vs Release

As I discussed in this Testing in Production post, deployment means pushing, or “installing”, and running the new version of your service’s code on production infrastructure. Your new servers have the new code, but are not yet receiving live production traffic.

Release refers to exposing our already deployed service to live incoming traffic.
