

Testing in Production Presentation – SVCC 2018

The following post is essentially a written version of the Testing in production talk I gave at Silicon Valley Code Camp 2018. You can find the presentation deck here at slideshare.


The talk is based on some of my previous blog posts, including:

Which themselves are based on the great work of industry leaders, including:


I don’t always test…

You may have seen this meme before: the Dos Equis guy saying
“I don’t always test, but when I do, I test in production.”
“Testing in production” has long been kind of a joke: what you’re really saying is that you don’t test anywhere.
And instead you’re just winging it: deploying to production, crossing your fingers, and hoping it all works.

But then I began to look at it differently.
The Dos Equis guy usually says “I don’t always drink beer, but when I do, I drink Dos Equis,”
meaning Dos Equis is the best beer to drink.
So the implication here is not that testing in production is a joke, but that production is actually the BEST place to test. And I increasingly believe that to be the case. Or, at least, that production is an environment we shouldn’t be ignoring for testing. After all, production is the only place your software has an impact on your customers.

But there has been a status quo of production being sacrosanct. Instead of testing there, it is common to keep a non-production environment, such as staging, as identical to production as possible, and test there.
Such environments are usually a pale imitation of production, however.

  • Testing in staging is kind of like testing with mocks: an imitation, but not the real thing.
  • Saying “works in staging” is only one step better than “works on my machine” (Cindy Sridharan)
  • It has been said (Charity Majors) that software being released to production is like a baby’s fourth trimester: software leaves its cozy, artificial staging environment and slams into the real world

But what makes the real world of production so special?

How is Production different?

  • Hardware & Cluster size
  • Data
  • Configuration
  • Traffic
  • Monitoring

Some things we can only test in production

As our architecture becomes more complicated (particularly with microservices), we need to consider all options that allow us to test and deliver working software to our customers, including testing in production.

So should we skip testing in non-prod first? No!

Not a substitute for pre-production testing

Testing in production is by no means a substitute for pre-production testing

I’ve given talks on unit testing, integration testing, mocks, code coverage and continuous integration. And I believe very firmly in all those things. Testing in Production is an addition to all those, not a replacement.

Treat production validation with the respect it deserves.

I mean two things by this

1. Most production testing is really validation only – although there is at least one exception (A/B testing)

2. Respect production!

  • Beware of unwanted side effects
  • Consider testing with SAFE methods e.g. GET, HEAD
  • Consider testing using expected failures of non-SAFE methods, e.g. PUTs that result in 400 errors (these still tell you something)
  • Or at least be able to tell the difference between test data and “real” prod data
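As a sketch of those last points, here is what a safe production probe might look like in Python. It uses only side-effect-free requests, and treats an expected error status (say, a 400 from a deliberately invalid request) as a passing signal; the URLs and status codes are hypothetical.

```python
import urllib.error
import urllib.request


def check_endpoint(url, expected_status=200, method="GET"):
    """Probe a production endpoint using a safe, side-effect-free request."""
    req = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.status == expected_status
    except urllib.error.HTTPError as err:
        # An expected failure (e.g. a 400 from a deliberately malformed
        # request) still tells you the service is up and validating input.
        return err.code == expected_status
```

The same helper can assert that a deliberately malformed request yields a 400, which exercises the service's validation path without mutating any production data.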

In this talk, we’re going to cover some of the different ways we can test in production.
1) We’ll start with Observability, the foundation for any testing in production. Observability = knowing what the heck your app is doing; going beyond just logs and alerting.

2) Around deployment & release times

3) Chaos engineering. Perhaps the most advanced form of production testing, but I would argue it’s actually not that advanced. I talk about what it is, some basic rules for doing it, and how we’ve been starting to use it where I work.


Observability

Observability is the ability to ask new questions of your system without deploying new code; being able to answer questions that you have never thought of before. You can think of it as the next step beyond just monitoring and alerting.

With distributed systems, it can be difficult to know exactly what the software is doing. Observability means bringing better visibility into that software. To have better visibility, we need to acknowledge that:

  • Everything is sometimes broken
  • Something is always broken (No complex system is ever fully healthy; See also Richard Cook’s How Complex Systems Fail)
  • If nothing seems broken  … your monitoring is broken

Distributed systems are unpredictable. In particular, it’s impossible to predict all the ways a system might fail. Failure needs to be embraced at every phase (from design to implementation, testing, deployment, and operation). Ease of debugging is of high importance.

It is often said that there are three pillars of Observability: logs, metrics, and traces

I personally think of Observability as also including tools, monitoring and alerting.

1) Logging:

Logging is typically the most common and most basic approach to getting insights into what your service is doing. I’m guessing you all do it in one form or another. But there are 3 aspects of logging worth considering:

a) Structured logging: plain text -> Splunk-friendly -> JSON

How many people use structured logging? It seems to be increasingly common. The idea behind structured logging is that instead of logging plain text, you log structured information, such as json, to aid diagnostics, analytics and monitoring.

Most logging frameworks already provide useful information with each log statement, for example: timestamp, pid, thread, level, logger name, etc.

The idea is to extend this with attributes (e.g., a request or transaction ID, a client or user ID) that provide further insights into our software, and allow us to answer questions such as:

  • What caused this stack trace?
  • What was the sequence of events leading up to it?
  • Who was the original caller (e.g. a user or a calling service)?
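As an illustration, here is a minimal structured-logging sketch using Python’s standard logging module. The logger name and the request_id/user_id fields are hypothetical examples of the kind of context you might attach to each line.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge any structured context attached via the `extra` argument.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Attach a request ID and user ID so log lines can be correlated later.
logger.info("payment authorized",
            extra={"context": {"request_id": str(uuid.uuid4()), "user_id": 42}})
```

Because every line is JSON, tools like Splunk can index the request_id field directly, making it trivial to pull up all lines for a single transaction.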


b) Event sourcing logs can be a great source of debugging information too

Event Sourcing is a technique where every change to an application’s state is captured as an event object. These event objects themselves are stored, often in a streaming event store such as Kafka or Kinesis. The current application state can be reconstructed at any time by simply replaying the events.

While application state can tell you about the current state of things, we often want to know how things got that way, and event sourcing logs can be a great source of that information. They can often provide a clearer picture of the business-specific events that led to the current state than regular application logs can.
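A toy sketch of the idea, using a hypothetical bank-account event log; a real system would read these events from a store such as Kafka or Kinesis rather than an in-memory list.

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str    # e.g. "deposited" or "withdrawn"
    amount: int


def replay(events):
    """Rebuild the current balance by replaying the full event log."""
    balance = 0
    for event in events:
        if event.kind == "deposited":
            balance += event.amount
        elif event.kind == "withdrawn":
            balance -= event.amount
    return balance


log = [Event("deposited", 100), Event("withdrawn", 30), Event("deposited", 5)]
print(replay(log))  # → 75; the history that produced 75 is kept for debugging
```

When troubleshooting, the event log answers "how did the balance get to 75?" directly, something a snapshot of current state never can.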

c) Consider sampling rather than aggregation

If someone complains about a service being slow, and the owners respond that the averages look healthy and fine, those owners are doing their users a disservice! Being able to identify individual problem transactions can be invaluable for troubleshooting. While storing details of all transactions becomes expensive, sampling transactions is a great compromise, especially if those transactions have noteworthy characteristics; e.g. you’re much more likely to be interested in slow transactions than those meeting expectations.
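A minimal sketch of such a sampling policy; the latency threshold and baseline sample rate are made-up numbers you would tune for your own service.

```python
import random

SLOW_THRESHOLD_MS = 500       # hypothetical: anything slower is "noteworthy"
BASELINE_SAMPLE_RATE = 0.01   # keep 1% of ordinary transactions


def should_record(duration_ms, rng=random.random):
    """Always keep noteworthy (slow) transactions; sample the rest."""
    if duration_ms >= SLOW_THRESHOLD_MS:
        return True
    return rng() < BASELINE_SAMPLE_RATE
```

Every slow transaction is retained for troubleshooting, while the 1% baseline still gives you a representative picture of healthy traffic at a fraction of the storage cost.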

2) Metrics: Time-series metrics, for tracking system stats such as CPU and memory usage, and business stats like the number of logins

3) Tracing: Distributed traceability e.g. Zipkin

4) Alerting: Useful for proactively learning about (typically predictable) issues

5) Tools: e.g. Splunk, New Relic, OverOps, Wavefront, Honeycomb

Observability: the ability to answer questions about our application’s behavior in production. Questions we may have never even thought of before. And with Observability in place, what types of production testing can we do…


Testing around Release time

First, let’s start by defining some terms. For the purposes of this talk, I use the terms
Deployment and Release as follows.

Deployment means pushing, or “installing”, and running the new version of your service’s code on production infrastructure. Your new servers have the new code, but are not yet receiving live production traffic.

Release refers to exposing our already deployed service to live incoming traffic.


The time after deployment, before release (exposing it to customers), is a golden time to do testing to confirm that it is ready to handle production traffic. At my current company, we refer to software that has been deployed but not yet receiving traffic as the Dark Pool. A more common industry standard is Blue/Green deployments, but they are essentially the same.

So what kind of testing can we do at this time?

  • Config Tests
  • Smoke Tests
  • Shadowing
  • Load Tests


Config Tests

As discussed earlier, production is different from non-production environments in almost every way, and one key difference is config. For example, database usernames, passwords, and IP addresses will all be different in the production environment. The Dark Pool is a great place to test this config.

Not testing configuration before the release of code can be the cause of a significant number of outages.

But testing config in isolation can be difficult, and it is usually done in conjunction with the other techniques listed here, including smoke testing, …

Smoke Tests

The most basic smoke test is a health-check. Including a health-check endpoint (e.g. /health) in microservices is a common best practice; it comes out of the box with Spring Boot, for example. A health-check confirms that the app is up and running and can show things like disk, memory, and CPU usage, version numbers, git commit hashes, build time, start time, command line arguments, etc.

Smoke tests can also include basic manual testing like “Can I log into the application”, “Can I see the UI?”, “Can I perform basic functionality, like look at an account balance?”.
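For illustration, here is a minimal health-check endpoint using only Python’s standard library. The fields shown (status, version, uptime, pid) are examples, and the APP_VERSION environment variable is a hypothetical stand-in for however your build stamps versions.

```python
import json
import os
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.time()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        body = json.dumps({
            "status": "UP",
            "version": os.environ.get("APP_VERSION", "dev"),  # hypothetical
            "uptime_seconds": round(time.time() - START_TIME, 1),
            "pid": os.getpid(),
        }).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep per-request logging quiet


# To serve: HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

A Dark Pool smoke test then reduces to hitting /health on each newly deployed instance and checking the status and version fields before any traffic is released to it.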


Shadowing

Shadowing is the technique by which production traffic to any given service is captured and replayed against the newly deployed version of the service.

Shadowing works best for testing stateless services, or for services where any stateful backend can be stubbed out.

A variation of Shadowing is “tap and compare” where you direct a copy of production traffic to a newly deployed, but not yet live, version of a service and compare its results to the currently live version. This is a way to confirm that the new version is working as expected, and returning consistent results.
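A toy tap-and-compare sketch follows. In practice the traffic would be mirrored by a proxy rather than fetched twice, and mismatches would be metered rather than printed; the URLs here are placeholders.

```python
import urllib.request


def fetch(base_url, path):
    """Return (status, body) for a GET against one version of the service."""
    with urllib.request.urlopen(base_url + path, timeout=5) as resp:
        return resp.status, resp.read()


def tap_and_compare(live_url, dark_url, path):
    """Send the same request to the live and dark versions, then diff results."""
    live = fetch(live_url, path)
    dark = fetch(dark_url, path)
    if live != dark:
        # In a real pipeline, mismatches would be logged and counted.
        print(f"MISMATCH on {path}: live={live[0]}, dark={dark[0]}")
        return False
    return True
```

Because only the live response is ever returned to users, the dark version is exercised with real production traffic at zero customer risk.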

Load testing

Load testing is something that you can, and should, do in non-production environments. But, as is the common theme in this post, the production environment has unique, interesting, and useful traits not found in other environments. For example, nowhere else can we test against the exact hardware and load characteristics of production.

Possible approaches:

  • Perform load testing against the actual production hardware, using a load testing tool such as ApacheBench, JMeter (also from Apache), or Gatling to generate load, varying request rate, size, and content as required for your tests.
  • Use in combination with Shadowing (a technique sometimes referred to as load shifting), where we direct production traffic to a cluster smaller than the usual production one, as a way to test capacity limits and establish what resources, such as CPU and memory, may be required in different circumstances.
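To show the shape of the first approach, here is a bare-bones load generator in Python. A real test would use one of the tools above; the request counts and concurrency here are arbitrary illustrations.

```python
import concurrent.futures
import time
import urllib.request


def timed_get(url):
    """Issue one GET and return its latency in seconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=10) as resp:
        resp.read()
    return time.perf_counter() - start


def load_test(url, total_requests=200, concurrency=20):
    """Fire requests concurrently and report p50/p99 latency."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_get, [url] * total_requests))
    p50 = latencies[len(latencies) // 2]
    p99 = latencies[min(len(latencies) - 1, int(len(latencies) * 0.99))]
    return p50, p99
```

Reporting percentiles rather than averages ties back to the earlier point about sampling: the p99 is where users actually feel the pain.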


Again, these types of testing, and production testing in general, are really about validation only.



Release (in this post at least!) refers to exposing our already deployed service to live incoming traffic. What can we do other than just flip the switch and let the traffic come in?

The main option for testing here is doing a Canary Release.

Canary Release

A Canary Release is a technique where a percentage of production traffic is routed to the new release, rather than taking a big-bang approach. The advantage is that you can essentially test the new release in the wild (that is, in production), and if issues are detected, you have only affected a small subset of users. You can monitor and watch for errors, exceptions, and other negative impacts, and proceed to increase traffic to the new canary release only if things look acceptable.
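A sketch of what canary routing logic might look like; hashing the user ID (rather than picking randomly per request) keeps each user pinned to one version, and the 5% starting slice is an arbitrary choice.

```python
import hashlib


def route(user_id, canary_percent=5):
    """Deterministically route a fixed slice of users to the canary release.

    Hashing the user ID keeps each user on the same version for every
    request they make, so they see consistent behavior.
    """
    bucket = int(hashlib.sha256(str(user_id).encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_percent else "stable"
```

If error rates and latency on the canary stay healthy, canary_percent is raised step by step until the new version takes all traffic; if not, it drops back to zero and only that small slice of users was ever affected.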

Internal release

An internal release is similar to a canary release in that you are making the new version of the application available to a limited number of users, but in this case, the users are internal – typically employees of your company. Such users are a great, friendly, and responsive group to test against. You get feedback about the viability of your new release without negatively impacting paying customers.


Post release: Your new code has been deployed. All deploy-phase testing, including config tests, health-checks, and smoke tests, has passed. Your code has been released; the canary didn’t die, and a rollback wasn’t necessary! It is now live, receiving live traffic. Is there any testing we can still do?

The two main types of testing done at this phase are probably the most well known and the most accepted forms of testing in production: Feature Flagging and A/B Testing.

Feature Flags

Feature flagging is a technique for releasing a hidden or disabled feature that can then subsequently be enabled at run time.

Feature flags can be used to disable features until the “noise” of a release has quietened, so you can release your feature to the wild in isolation; or as a technique to eliminate long-running feature branches, letting you merge to main/trunk more often, even for unfinished features.
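A minimal feature-flag sketch; the flag name and the two checkout functions are hypothetical stand-ins, and a real system would read flags from a config service or vendor SDK rather than a global dict.

```python
# A global flag store; real systems use a config service or vendor SDK.
FLAGS = {
    "new_checkout_flow": False,   # merged to main, but still dark in production
}


def is_enabled(flag, default=False):
    return FLAGS.get(flag, default)


def legacy_checkout(cart):
    return f"legacy checkout of {len(cart)} items"


def new_checkout(cart):
    return f"new checkout of {len(cart)} items"


def checkout(cart):
    # The unfinished code path ships to production, safely hidden by the flag.
    if is_enabled("new_checkout_flow"):
        return new_checkout(cart)
    return legacy_checkout(cart)
```

Flipping the flag at run time enables the feature without a deployment, and flipping it back is an instant rollback.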

There are also several vendor offerings that support feature flags, including LaunchDarkly.

A/B Testing

A/B Testing is also a very well known approach, and indeed so common that it is not deemed at all controversial. It is a technique for comparing two versions or flavors of a service to determine which one performs “better” based on some predefined criteria, such as more user clicks. It is experimenting in production at its finest!
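A toy A/B assignment and measurement sketch; the experiment name is hypothetical, and a real system would apply proper statistical significance tests rather than comparing raw click-through rates.

```python
import hashlib
from collections import defaultdict

views = defaultdict(int)
clicks = defaultdict(int)


def assign_variant(user_id, experiment="signup_button_color"):
    """Stable 50/50 split: the same user always sees the same variant."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


def record_view(user_id):
    views[assign_variant(user_id)] += 1


def record_click(user_id):
    clicks[assign_variant(user_id)] += 1


def click_through_rate(variant):
    return clicks[variant] / views[variant] if views[variant] else 0.0
```

Seeding the hash with the experiment name means each experiment gets an independent split, so being in group A for one test says nothing about your group in another.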

And the final form of testing after code has been released is Chaos Engineering…

Chaos Engineering

When talking with engineers, I usually use the term Chaos Engineering, because it sounds cool! When talking with management, I tend to use the term resilience engineering, since it sounds less scary. The terms are synonymous. In the past, terms such as Disaster Recovery and Contingency Planning have been used to describe somewhat similar processes.

Whatever term you use, it basically refers to conducting carefully planned experiments designed to reveal weaknesses in our systems.
In other words, Chaos Engineering is the practice of confirming that your applications work as you expect them to in production.

Despite the name, Chaos Engineering is not about introducing Chaos into your system! Instead it is about identifying any chaos already there, so that you can remediate.

For example, if you have a pool of servers, and you have designed your app to fail over seamlessly if one of those servers goes down, a good Chaos Engineering experiment is to actually kill one of those servers in production to confirm that all works as expected.
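A toy version of that experiment; the "servers" here are in-memory stand-ins, but the shape is the same: inject the failure, then verify the failover hypothesis holds.

```python
import random


class Pool:
    """A toy pool of servers with client-side failover."""

    def __init__(self, servers):
        self.up = set(servers)

    def kill(self, server):
        # The chaos injection: take one server out of the pool.
        self.up.discard(server)

    def handle(self, request):
        if not self.up:
            raise RuntimeError("total outage")
        return f"{random.choice(sorted(self.up))} handled request {request}"


pool = Pool(["app-1", "app-2"])
pool.kill("app-1")
# Hypothesis: traffic fails over seamlessly to the surviving server.
assert all(pool.handle(i).startswith("app-2") for i in range(10))
```

If the assertion fails, the experiment has done its job: it exposed chaos that was already latent in the system, before a real outage did.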

Game Days

What are Game Days?
If Chaos Engineering is the theory, Game days are the practice; the execution.
Game days are where you start with Chaos engineering.
Game days are “An exercise where we place systems under stress to learn and improve resilience”
Systems can be technology, people, process.
They are like fire drills – an opportunity to practice a potentially dangerous scenario in a safer environment.

Regardless of what Game Day exercises you may plan to do, even just getting the team together to discuss resilience can be a very productive exercise. If you ask a team how to make an application more resilient, you may get blank looks and shoulder shrugs. But ask a team to think of ways an app can or has broken, and suddenly the conversation can quickly become lively and animated! And each way an app can be broken is an opportunity to introduce resilience, and to plan game day exercises around it.


OK, so I’ve sold you on testing in production, and Chaos Engineering sounds cool!  How do you start? What should your first Game Day look like? I’m going to run through what is partly a template, or step-by-step guide, for Game Days, but also a case study, or example, of how we’ve started with Game Days on my team.

A step by step guide

1. Hypothesis

To start with, what are we trying to test? Pick a hypothesis.

Typically in Chaos Engineering experiments, the hypothesis is that if I do X (take out a server, simulate a region failure), everything should be OK.

But we need to be specific about how we measure that things are “OK”. And a big part of defining OK is defining “Steady State”.

Steady state is essentially the set of key metrics for you to monitor as part of your test. It could be things like:

  • Response times remain in an acceptable range
  • Loan applications remain constant
  • Netflix uses a metric measuring users clicking start on a show or movie, which they call Starts Per Second, or SPS.

If you don’t define steady state, how do you know whether your test is working or not? How do you know if you are breaking things?
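One way to make steady state concrete is to encode it as explicit bounds that a test can check; the metric names and ranges below are made up for illustration.

```python
# Steady state expressed as explicit per-metric bounds.
STEADY_STATE = {
    "p99_response_ms": (0, 800),
    "loan_applications_per_min": (40, 200),
}


def steady_state_violations(metrics):
    """Return the metrics outside their bounds; an empty list means all is OK."""
    violations = []
    for name, (low, high) in STEADY_STATE.items():
        value = metrics.get(name)
        if value is None or not (low <= value <= high):
            violations.append(name)
    return violations
```

An experiment can poll this check throughout its run and abort the moment the list becomes non-empty, turning a fuzzy "things look OK" into a mechanical decision.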

With a hypothesis in mind, and a way to test it, first think about blast radius…


2. Blast radius

The blast radius refers to how much damage can be done by the experiment.

If you take out a server, and everything is in fact NOT OK, how bad might things be? Try to ensure that you limit the possible damage. For example, if your hypothesis is that

When Foo service is running in a pool of 2 servers, and one of those servers dies, CPU and memory utilization should increase on the remaining servers, but response times remain unaffected

That is a fine thing to test

But if you have 10 services depending on that service (even in non-prod), and you’re wrong that response times will be unaffected, you may cause 10 other services to have problems.

So a way to limit the blast radius in that test would be to test using a pool of Foo Service that only one other service relies on; hopefully a service that you also control and that is closely monitored as part of the test.

Another way to minimize possible damage is to make sure that you have the equivalent of a big red Stop Test button!

If your metrics aren’t looking good, have the ability to abort the test immediately.

Remember: our goal here is to build confidence in the resilience of the system, and we do that with controlled, contained experiments that grow in scope incrementally.

3. Run the experiment

Figure out the best way to test your hypothesis

If you plan to take out a server, how do you do it?

ssh in and kill -9? Orderly shutdown? Have Ops do it for you? Do you simulate failure by using bogus IP addresses, or simply removing a server from a VIP pool?

And again, stop if metrics or alerts indicate unexpected problems.


4. Analyze the results

Were your expectations correct?

Did your system handle things correctly?

Did you spot issues with your alerts or metrics that should be improved before any future tests?


5. Increase scope

The idea is to start small

1 service, in non-prod, and gradually expand to prod.

And the goal should be prod. Prod is where it’s at!

That being said, don’t get carried away! We’re easing into testing in prod.
Let’s not ruin it for everyone.


We have talked about testing in production: no longer a joke, and instead increasingly viewed as a best practice. It is not a replacement for the essential and high-value non-prod testing we do, but an addition.

Observability: Testing in production, and indeed in all envs, requires being able to understand what our applications do. Conventional logs, monitoring, and alerting are all good, but Observability is about more than that. It’s about the ability to answer complex questions about our apps at run time. Questions we may not have even thought of before, like: Why is my app slow? Is it me or a downstream service? Where is all my memory being used? We can use metrics, tracing, and any tools at our disposal so that we can see what’s going on when things go wrong. Or better still, to proactively spot problems in advance.

And with Observability in place, we can actually start to test in production!

Testing at Release: We ran through different types of testing we can do before, at and after a release, including

  • After deployment (Config, smoke, load, shadow)
  • At release time: Canary and internal release
  • After release: Feature flags and A/B testing

Finally, even when everything is up and running in prod, customers are using it, and all looks good, there is still more testing we can do…

Chaos Engineering

Not introducing chaos, but exposing the already present chaos!

Carefully planned experiments designed to reveal weaknesses in our systems

