

Testing in Production

Note that I gave a talk on this blog post in Dec ’18, if you prefer to watch that:


“Testing in production” used to be a joke. The implication was that by claiming to test in production, you didn’t really test anywhere, and instead just winged it: deploying to production and hoping that it all worked. Times have changed, however, and testing in production is becoming accepted as a best practice.

First, my usual disclaimer. As with most of my posts, this one is derivative. It is mostly based on this great “Testing in Production, the safe way” post by Cindy Sridharan, as well as Charity Majors’s “Shipping software should not be scary”, but I have also liberally taken from many other great sources listed at the end. In some cases I may have simply copied and pasted parts (though I have tried not to). In the worst cases I have bastardized the sources into points the original authors likely never intended. I hope they can forgive me – imitation is the greatest form of flattery, and I’ve found the best way for me to understand a subject area is to copy, paste, summarize, slice, dice and rehash…


The reality is that we all inevitably do test in production already, so perhaps there is now just more of a recognition of that, and a growing desire to do it in better ways. Along with pre-production testing such as unit, integration and end-to-end testing, production testing is just another arrow in our quiver.

Why test in production?

The sole purpose of software is to have a positive impact on your customers, and the only place your software has an impact on your customers is production.

So should we skip testing in non-prod? No! Testing in dev/demo/stage (or whatever your various flavors of non-prod are called) is good and indeed essential. It’s just that no matter how hard you try to make your non-prod environments like production, they will never be production. Production will always be different in many ways, including, to name but a few:

  • Hardware
  • Cluster size (e.g., a single machine per service in staging)
  • Configuration (this is a big one; more later)
  • Data (stage is either totally artificial, or scrubbed)
  • Monitoring (staging rarely has the same level of monitoring as prod)
  • Traffic (staging likely not receiving real prod traffic)
  • IP address

Even the most detailed staging environment can’t mimic production, and trying to make the two as similar as possible becomes increasingly difficult and expensive, with rapidly diminishing returns. The only thing that is really like production is production. So test there.

And what are the ways we can test in production?

This post breaks it down into 3 phases.

  1. Deploy
  2. Release
  3. Post release

Testing after Deployment

Deployment means “installing” and running the new version of your service’s code on production infrastructure. The time between deployment and release (exposing it to customers) is a golden time to do testing to confirm that it is ready to handle production traffic. At my current company, we refer to software that has been deployed but is not yet receiving traffic as the Dark Pool. A more common industry term is Blue/Green deployment, but the two are essentially the same.

Whatever you call it, being able to have your software in the real production environment but without being exposed to users brings great opportunities for testing.

A quick aside: if production testing is so great, why not test software ONLY in production? Why not skip non-prod completely? There are many reasons. To start with, most engineers do not have access to production, but the main reason is side effects. We do not want to adversely impact customers, metrics, or reports. We must be careful that any tests run in production do not change the environment in any meaningful way. For example, database writes should ideally be avoided, or at least be clearly identified as test writes. Most of the “deploy” testing strategies described below work best when used with stateless services, or against “safe” REST endpoints. Safe methods, such as HEAD and GET, do not modify resources; they are used only for retrieval.
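One way to enforce the “safe methods only” rule is to build it into whatever client you use for production tests. The following is a minimal sketch in Python, not a definitive implementation: the function names are hypothetical, and the set of safe methods (GET, HEAD, OPTIONS, TRACE) is the one defined by the HTTP specification.

```python
# Methods the HTTP specification defines as "safe" (read-only semantics).
SAFE_METHODS = frozenset({"GET", "HEAD", "OPTIONS", "TRACE"})


def is_safe(method):
    """Return True if the HTTP method is defined as safe."""
    return method.upper() in SAFE_METHODS


def guard_test_request(method, url):
    """Hypothetical guard: refuse to build an unsafe request in a prod test."""
    if not is_safe(method):
        raise ValueError(f"{method} is not a safe method; refusing to call {url}")
    return method.upper(), url
```

A test harness could call guard_test_request before every request, so that a stray POST or DELETE fails in the test itself rather than changing production data. Note that a safe method only promises read-only semantics; a badly behaved endpoint can still have side effects behind a GET.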

What tests can we carry out in this post-deploy, pre-release state?

  • Smoke tests
  • Config Tests
  • Shadowing
  • Load tests

Smoke tests

The most basic smoke test is a health-check. Including a health-check endpoint (e.g. /health) in microservices is a common best practice. It comes out of the box with Spring Boot, for example. Contents can vary from merely showing a short text (“Up!”) to including things like disk, memory and CPU usage, version numbers, git commit hashes, build time, start time, command line arguments etc.
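To make the idea concrete, here is a minimal sketch of a /health endpoint using only the Python standard library. It is an illustration rather than how Spring Boot does it; the version and commit values, the port, and the serve function are all assumptions made up for the example.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.time()
VERSION = "1.2.3"       # hypothetical; normally injected at build time
GIT_COMMIT = "abc1234"  # hypothetical; normally injected at build time


def health_payload():
    """Build the JSON-serializable health report."""
    return {
        "status": "UP",
        "version": VERSION,
        "commit": GIT_COMMIT,
        "uptime_seconds": round(time.time() - START_TIME, 1),
    }


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /health is served; everything else is a 404.
        if self.path != "/health":
            self.send_error(404)
            return
        body = json.dumps(health_payload()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Run the health endpoint until interrupted (port is an assumption)."""
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```

Calling serve() and then requesting /health on that port returns the JSON report. A smoke test only needs to check for a 200 response and, ideally, that the reported version matches the one just deployed.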

Smoke tests can also include basic manual testing like “Can I see the UI?”, “Can I log into the application?”, “Can I perform basic functionality, like look at an account balance?”. Again, beware of unintended side effects here. Ensure that any metrics you use in production, such as # of logins, particularly those that may be published outside the company, exclude such testing activity.
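Excluding test activity from a metric can be as simple as filtering on a known list of test accounts. This sketch assumes login events are dictionaries with a "user" field and that smoke tests always use dedicated accounts; both the event shape and the account names are hypothetical.

```python
# Hypothetical accounts reserved for smoke testing in production.
TEST_ACCOUNTS = frozenset({"smoke-test-1", "smoke-test-2"})


def count_real_logins(login_events):
    """Count logins, ignoring those made by known test accounts."""
    return sum(1 for event in login_events if event["user"] not in TEST_ACCOUNTS)
```

The same filter has to be applied everywhere the metric is computed, which is a good argument for tagging test traffic at the source rather than cleaning it up in each report.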

Config Tests

As discussed, production is different from non-production environments in almost every way, and one key difference is config. For example, database usernames, passwords, and IP addresses will all be different in the production environment. The Dark Pool is a great place to test this config.

Failing to test configuration before releasing code is the cause of a significant number of outages.

But testing config in isolation can be difficult, and is usually done in conjunction with the other techniques listed here, including smoke testing, …

Whatever technique you use, any tests that utilize your production specific config before your new services receive live traffic can greatly minimize risk.
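As an illustration, a config test can check that the production-specific settings are present and that the database they point to is at least reachable. This is a sketch under stated assumptions: the key names are hypothetical, and a successful TCP connection only shows that the host and port are right, not that the credentials work.

```python
import socket

# Hypothetical keys that the service expects in its production config.
REQUIRED_KEYS = ("db_host", "db_port", "db_user", "db_password")


def missing_keys(config):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_config(config):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [f"missing or empty: {key}" for key in missing_keys(config)]
    # Only attempt the connection once all required values are present.
    if not problems and not can_connect(config["db_host"], int(config["db_port"])):
        problems.append(f"cannot reach {config['db_host']}:{config['db_port']}")
    return problems
```

Running check_config against the real production config while the service is still in the Dark Pool surfaces a wrong hostname or a forgotten setting before any customer traffic depends on it.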


Shadowing

Shadowing, also known as Dark Traffic Testing or Mirroring, is a technique where, in parallel to the current live service receiving live traffic, a newly deployed version of the service also receives some portion of that traffic. This can happen in real time, where a copy of the live traffic is simultaneously sent to the new service, or in a record & playback fashion. It can also involve an entire copy of the incoming production traffic, or some sampled subset of it. The former will likely require a similar hardware capacity as production, and so may be difficult and expensive. However, such a duplicated capacity setup is often available

As discussed before, beware of triggering unintended state changes to the test service’s databases and upstream services.
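The core of real-time shadowing with sampling can be sketched in a few lines. This is a simplified illustration, not a production design: the handlers are plain Python callables standing in for the live and newly deployed services, all names are hypothetical, and a real setup would usually mirror traffic asynchronously at the proxy or load balancer so the shadow call cannot slow down the live response.

```python
import random


def handle_with_shadow(request, live_handler, shadow_handler,
                       sample_rate, mismatches, rng=random.random):
    """Serve from the live handler and mirror a sample to the shadow.

    The caller always gets the live response. Shadow failures and
    differing responses are recorded in `mismatches`, never raised.
    """
    response = live_handler(request)
    if rng() < sample_rate:
        try:
            shadow_response = shadow_handler(request)
            if shadow_response != response:
                mismatches.append((request, response, shadow_response))
        except Exception as exc:
            # A crash in the new version must not affect the live request.
            mismatches.append((request, response, repr(exc)))
    return response
```

The important property is that the user only ever sees the live response. The new version's output is compared and logged, then thrown away.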

Load tests

Load testing is something that you can, and should, do in non-production environments. But, as is the common theme in this post, the production environment has unique, interesting and useful traits not found in other environments. For example, nowhere else can we test against the exact hardware and load characteristics of production.

We can perform load testing against the actual production hardware by using a load testing tool such as ApacheBench, JMeter (also from Apache) or Gatling to generate load, varying request rate, size, and content as required for your tests.

Or, instead of generating artificial traffic, we can use a technique sometimes referred to as load shifting, where we direct production traffic to a cluster smaller than the usual production one, as a way to test capacity limits and establish what resources, such as CPU and memory, may be required in different circumstances.
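For a sense of what a load generator does under the hood, here is a minimal sketch using only the Python standard library. It is no substitute for the tools named above: send_request is a hypothetical callable that performs one request, and the 95th percentile is computed with the simple nearest-rank method.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(send_request):
    """Run one request; return (succeeded, elapsed_seconds)."""
    start = time.perf_counter()
    try:
        send_request()
        ok = True
    except Exception:
        ok = False
    return ok, time.perf_counter() - start


def run_load(send_request, total_requests=100, concurrency=10):
    """Fire total_requests calls from a thread pool and summarize them."""
    if total_requests < 1:
        raise ValueError("total_requests must be at least 1")
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: timed_call(send_request),
                                range(total_requests)))
    latencies = sorted(elapsed for _, elapsed in results)
    errors = sum(1 for ok, _ in results if not ok)
    # Nearest-rank 95th percentile.
    p95_index = math.ceil(0.95 * len(latencies)) - 1
    return {
        "requests": total_requests,
        "errors": errors,
        "p95_seconds": latencies[p95_index],
    }
```

In practice send_request would wrap an HTTP call to the deployed-but-unreleased service, and you would increase concurrency step by step while watching the error count and latency.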

Testing in Production - Release

Release (in this post at least!) refers to exposing our already deployed service to live incoming traffic. What can we do other than just flip the switch and let the traffic come in?

The main option for testing here is doing a Canary Release.

Canary Release

A Canary Release, aka a phased or incremental rollout, is a technique where a small subset of production traffic is routed to the new release, rather than a big bang approach. The advantage is that you can essentially test the new release in the wild (that is, in production) and if issues are detected, you have only affected a small subset of users. You can monitor and watch for errors, exceptions and other negative impacts and proceed to increase the traffic to the new canary release only if things look acceptable.

The downsides are that you need a way to actually direct some percentage of the traffic to the canary (with an appropriate selection strategy, e.g. random, geographic, demographic) and you need to be able to support having two versions of your application running simultaneously (although this is almost required with microservices).
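One common way to direct a percentage of traffic is to hash a stable identifier into a bucket. The sketch below assumes each request carries a user ID and that routing happens in application code; in practice this usually lives in a load balancer or service mesh. The function names are hypothetical.

```python
import hashlib


def bucket(user_id):
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def route(user_id, canary_percent):
    """Return "canary" for roughly canary_percent of users, else "stable"."""
    return "canary" if bucket(user_id) < canary_percent else "stable"
```

Because the bucket depends only on the user ID, a given user stays on the same version between requests, and raising canary_percent from, say, 5 to 25 only ever moves users from stable to canary, never back.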

Testing in Production - Post Release

Post release testing, in this post at least, refers to the period of testing in the hours and days after a release has gone live. The code may be live (the canary didn’t die, and a roll back wasn’t necessary) and receiving all the live traffic, but we are not done with testing yet. There are two particular testing strategies often used at this phase of proceedings: Feature Flagging and A/B Testing.

Feature Flagging

Probably the most well known and the most accepted form of testing in production, feature flagging is a technique for releasing a hidden or disabled feature that can then subsequently be enabled at run time.

Using feature flags is one approach to eliminating long running feature branches. Code can be merged earlier and more frequently, rather than living on a long running feature branch while waiting for the feature to be finished.

A common usage I have found for feature flags is to roll a release out with a feature disabled, and let the “noise” of the release reduce. If there are problems immediately after the release, you can be (more) sure that they are not related to your disabled feature. This can avoid having any post-release issues falsely attributed to you, your team and your code. Then when things have settled, you can enable your feature. Any issues that then surface are very likely to be directly related to your new feature, but can be quickly dealt with by immediately disabling the feature again.
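In its simplest form, a feature flag is just a named boolean checked at run time. The sketch below is a hypothetical in-memory version to show the shape of the idea; the flag name and the render_balance function are made up, and a real system would load flags from a config service or a vendor SDK so they can be changed without a redeploy.

```python
class FeatureFlags:
    """A tiny in-memory flag store; unknown flags count as disabled."""

    def __init__(self, defaults):
        self._flags = dict(defaults)

    def is_enabled(self, name):
        return self._flags.get(name, False)

    def set(self, name, enabled):
        self._flags[name] = bool(enabled)


# The new feature ships with the release, but switched off.
flags = FeatureFlags({"new_balance_view": False})


def render_balance(balance):
    """Render an account balance, using the new format only when flagged on."""
    if flags.is_enabled("new_balance_view"):
        return f"Balance: ${balance:,.2f}"
    return f"Balance: {balance}"
```

Once the release noise has settled, calling flags.set("new_balance_view", True) turns the feature on, and setting it back to False turns it off again just as quickly if problems appear.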

There are many techniques and approaches for implementing feature flags, many of them covered in this detailed blog post by Pete Hodgson, formerly of ThoughtWorks. There are also several vendor offerings, including LaunchDarkly.

A/B Testing

A/B Testing is also a very well known approach, and indeed so common that it is not deemed at all controversial. It is a technique for comparing two versions or flavors of a service to determine which one performs “better” based on some predefined criteria, such as more user clicks. It is experimenting in production at its finest!
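Deciding which version is “better” usually means comparing a rate between the two groups and checking that the difference is not just noise. The sketch below uses a standard two-proportion z-test on click counts; it is one possible approach rather than anything prescribed here, and the function names are hypothetical.

```python
import math


def conversion_rate(clicks, visitors):
    """Fraction of visitors who clicked."""
    return clicks / visitors


def z_score(clicks_a, visitors_a, clicks_b, visitors_b):
    """Two-proportion z-statistic for B's click rate versus A's."""
    rate_a = conversion_rate(clicks_a, visitors_a)
    rate_b = conversion_rate(clicks_b, visitors_b)
    pooled = (clicks_a + clicks_b) / (visitors_a + visitors_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    return (rate_b - rate_a) / std_err
```

As a rule of thumb, an absolute z-score above about 1.96 corresponds to a difference that is significant at the 5% level in a two-sided test. The criterion and the sample size should be fixed before the experiment starts, not after looking at the results.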


Sources and References

  • Feature Toggles by Pete Hodgson:

