

Blog post summary: Automating safe, hands-off deployments at AWS

AWS’s Clare Liguori wrote an excellent blog post on Automating safe, hands-off deployments.

This is a summary (1,700 words, vs 5,300 in the original) and mostly just a copy & paste of highlights. I have also skipped some of the sections that are at scales larger than most folks deal with (e.g. global releases across 26 regions!).


How often do you deploy to production? If you are doing major releases infrequently, then even the small fixes in between can take hours of carefully shepherding each of those deployments, watching logs and metrics to see if a rollback may be needed.

The better alternative is automatically deploying to production multiple times a day, using continuous deployment pipelines, where everything post code review & approval is automated.

Continuous deployments at AWS

Deployment pipelines allow quick and safe deploys that free up developer time from manual work on deployments. Pipelines consist of tests and deployment safety checks that prevent customer-impacting defects from reaching production, or limit their impact if they do.

Adopt continuous delivery as a way to automate and standardize how software is deployed and to reduce the time it takes for changes to reach production. Incrementally build improvements to your release process over time. Identify deployment risks and mitigate them through safety automation.


A typical continuous delivery pipeline has four major phases—source, build, test, and production.

Source and Build

Pipeline sources

Pipelines at Amazon automatically validate and safely deploy any type of source change to production, not only changes to application code. A typical microservice might have pipelines for:

  • Application code
  • Infrastructure code (IaC)
  • OS patches
  • Configuration/feature flags
  • Website static assets

All of these pipelines have similarities, e.g. all have safety mechanisms, like automatic rollback.

Having multiple pipelines ensures that problems in one area don’t block other pipelines. For example, issues with app code changes won’t block infrastructure code changes from reaching production in the infrastructure pipeline.

Code review

Consider having your pipeline enforce the requirement that all commits on the mainline branch must be code reviewed and approved by another team member.
With fully automated pipelines, the code review is the last manual review and approval that a code change receives from an engineer before being deployed to production, so this is a critical step.

Code reviewers evaluate the code’s correctness and also whether the change can be safely deployed to production.
Teams may define custom checklists, which include checking for things such as:

  • Sufficient test coverage
  • Sufficient instrumentation for monitoring
  • Rollback mechanisms

Build and unit tests

In the build stage, the code is compiled and unit tested. You may also use linters (a linter is a tool that uses static analysis to find errors and bugs in source code, enforce code standards, and identify potential security or performance problems; an example in the Java world is SpotBugs).

Typically, unit tests mock (simulate) all their API calls to dependencies, such as other services. Interactions with “live” non-mocked dependencies are tested later in the pipeline in integration tests. Compared to integration tests, unit tests with mocked dependencies are able to exercise edge cases like unexpected errors returned from API calls and ensure graceful error handling in the code.
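As a minimal sketch of this mocking idea (all names here are hypothetical, not from the original post), a unit test can force a dependency’s API call to fail and assert that the calling code handles the error gracefully:

```python
from unittest import mock

class PaymentError(Exception):
    """Hypothetical error raised by a downstream payments dependency."""

def charge_with_fallback(client, amount):
    # Call the (possibly mocked) dependency; degrade gracefully on failure
    # instead of letting the error propagate to the customer.
    try:
        return client.charge(amount)
    except PaymentError:
        return {"status": "deferred", "amount": amount}

# Simulate the dependency returning an unexpected error -- an edge case
# that would be hard to trigger reliably against a live service.
client = mock.Mock()
client.charge.side_effect = PaymentError("upstream timeout")

result = charge_with_fallback(client, 42)
assert result == {"status": "deferred", "amount": 42}
```

The point is that the mock lets the test exercise the failure path deterministically, which a live dependency rarely allows.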

Finally, when the build is complete, the compiled code is packaged and signed.

Integration tests

Integration tests help us to automatically use a service just like customers do as part of the pipeline, exercising the full stack end-to-end by calling real APIs on running servers. The aim of integration testing is to catch any unexpected or incorrect behavior of the service before deploying to production.

Integration tests may also run:

  • Both positive and negative test cases
  • Fuzz tests (generate many possible API inputs)
  • Load tests
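To illustrate the fuzz-testing idea (a toy sketch, not AWS’s tooling), a test can throw many random inputs at an API surface and require that it never fails in an undocumented way:

```python
import random
import string

def parse_user_id(raw):
    """Hypothetical API input validator: accepts 'user-<digits>' only."""
    if raw.startswith("user-") and raw[5:].isdigit():
        return int(raw[5:])
    raise ValueError("malformed user id")

# Fuzz: generate many random inputs and require that the API either
# returns a value or raises its documented ValueError -- never an
# unexpected exception type such as IndexError.
rng = random.Random(0)
for _ in range(1000):
    raw = "".join(rng.choices(string.printable, k=rng.randint(0, 12)))
    try:
        parse_user_id(raw)
    except ValueError:
        pass  # documented rejection of bad input is acceptable
```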

(On a side note, I have seen the term unit tests used for tests of small units of code where essentially all other dependencies are mocked; integration tests for tests of larger units of code, for example how several classes interact with each other, but still in a “runs with the network cable unplugged” manner; and end-to-end, or regression, tests for more exhaustive and comprehensive integration tests that, rather than mocking other services, run against live versions of them (e.g. alpha or beta, demo or stage, or whatever your nomenclature of choice is). However, I know terminology here is far from standardized.)


Test deployments

Test deployments in pre-production environments

Before deploying to production, the pipeline deploys and validates changes in multiple pre-production environments, for example, alpha, beta, and gamma.
Alpha and beta validate that the latest code functions as expected by running functional API tests and end-to-end integration tests.
Gamma validates that the code is both functional and that it can be safely deployed to production. Gamma is as production-like as possible, including the same deployment configuration, the same monitoring and alarms, and the same continuous canary testing as production. Gamma is also deployed in multiple AWS Regions to catch any potential impact from regional differences.

Backward compatibility and one-box testing

Before deploying to production, we need to ensure that the latest code is backward-compatible and can be safely deployed alongside the current code. Consider deploying the latest code to a single virtual machine. This one-box deployment leaves the rest of the gamma environment deployed with the current code for some period of time, such as 30 minutes or one hour. Traffic doesn’t have to be specially driven to the one box but can just be added to the same load balancer. The one-box deployment monitors canary test success rates and service metrics to detect any impact from the deployment or from having a “mixed” fleet deployed side by side.
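A one-box bake check could be sketched like this (thresholds and names are my own illustrative assumptions, not from the original post): promote only if canary success stays high and faults stay low while the new code runs alongside the current fleet.

```python
def one_box_verdict(canary_success_rate, fault_rate,
                    min_success=0.99, max_faults=0.01):
    """Hypothetical one-box bake check: decide whether to promote the
    change to the rest of the fleet or roll it back, based on metrics
    observed during the bake period (e.g. 30 minutes)."""
    if canary_success_rate < min_success or fault_rate > max_faults:
        return "rollback"
    return "promote"

assert one_box_verdict(0.999, 0.002) == "promote"
assert one_box_verdict(0.95, 0.002) == "rollback"   # canaries failing
assert one_box_verdict(0.999, 0.05) == "rollback"   # fault rate too high
```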

Some pipelines also run integration tests again in a separate backward-compatibility stage we call zeta, in which each microservice calls only production endpoints, testing that changes going to production are compatible with the code currently deployed in production across multiple microservices.

Production deployments

Our #1 objective for production deployments at AWS is to prevent negative impact, especially multi-Availability-Zone or multi-Region impact. To limit the scope of automatic deployments, we split the production phase into individual Regions, individual Availability Zones, or even a service’s individual internal shards (called cells).

Staggered deployments

Each team needs to balance the safety of small-scoped deployments with the speed at which we can deliver changes to all customers. Deploying changes to all Regions and AZs through the pipeline one at a time has the lowest risk of causing broad impact, but it could take weeks. We have found that grouping deployments into “waves” of increasing size helps us achieve a good balance between deployment risk and speed.

The first two waves in the pipeline build the most confidence in the change: The first wave deploys to a Region with low traffic, one AZ at a time. The second wave then deploys to a Region with a high number of requests, again one AZ at a time, where it is highly likely that customers will exercise all the new code paths and where we get good validation of the changes. Next, we can deploy to more Regions in increasingly large waves.
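The wave idea can be sketched as a small planner (a toy illustration with made-up Region names; the original post does not prescribe these sizes) that groups targets into waves that grow geometrically, so early waves build confidence while later waves deliver quickly:

```python
def plan_waves(regions, first_wave_size=1, growth=2):
    """Hypothetical wave planner: split a deployment target list into
    waves of increasing size (1, 2, 4, ...)."""
    waves, i, size = [], 0, first_wave_size
    while i < len(regions):
        waves.append(regions[i:i + size])
        i += size
        size *= growth
    return waves

waves = plan_waves(["us-west-2", "us-east-1", "eu-west-1",
                    "ap-southeast-2", "sa-east-1", "eu-central-1",
                    "ap-northeast-1"])
# First wave is a single low-traffic Region; later waves grow:
# [['us-west-2'], ['us-east-1', 'eu-west-1'],
#  ['ap-southeast-2', 'sa-east-1', 'eu-central-1', 'ap-northeast-1']]
```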

One-box and rolling deployments

Deployments to each production wave start with a one-box stage. The prod one-box deployment minimizes the potential impact of changes on the wave by limiting the one box to a small share of traffic, e.g. at most ten percent of overall requests. If the change causes a negative impact in the one box, the pipeline automatically rolls back the change and stops the rollout.

After the one-box stage, most teams use rolling deployments to deploy to the wave’s main production fleet.


Metrics monitoring and auto-rollback

Automated deployments typically don’t have a developer actively checking the metrics and manually rolling back. Instead, the deployment system actively monitors the deployment to determine if it needs to automatically roll back. A rollback switches the environment back to what was previously deployed; deployment packages are immutable.

Each microservice typically has alarms that trigger on thresholds for the metrics that impact the service’s customers (like fault rates and high latency) and on system health metrics (like CPU utilization). A high-severity alarm is used to page the on-call engineer and to automatically roll back the service if a deployment is in progress. Often, the rollback is already in progress by the time the on-call engineer has been paged and starts engaging.

Our deployment system can also detect and automatically roll back on anomalies in common metrics emitted by our internal web service framework. Examples of this are if the request count suddenly drops to zero, or if the latency or number of faults becomes much higher than normal.
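A simplified anomaly check along these lines (the ratio limit and metric names are my own illustrative assumptions, not the internal framework’s) compares post-deployment metrics against a pre-deployment baseline:

```python
def should_roll_back(baseline, current, ratio_limit=3.0):
    """Hypothetical anomaly check: flag a deployment when request count
    suddenly drops to zero, or when latency or faults exceed a multiple
    of the pre-deployment baseline."""
    # Requests dropping to zero usually means the new code broke intake.
    if current["requests"] == 0 and baseline["requests"] > 0:
        return True
    # Latency or fault counts far above baseline also trigger rollback.
    for metric in ("latency_ms", "faults"):
        if baseline[metric] > 0 and current[metric] > ratio_limit * baseline[metric]:
            return True
    return False

baseline = {"requests": 1000, "latency_ms": 20, "faults": 2}
assert should_roll_back(baseline, {"requests": 0, "latency_ms": 20, "faults": 0})
assert should_roll_back(baseline, {"requests": 900, "latency_ms": 90, "faults": 1})
assert not should_roll_back(baseline, {"requests": 950, "latency_ms": 22, "faults": 3})
```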

Alarm and time window blockers

The pipeline should prevent automatic deployments to production when there is a higher risk of causing a negative impact. This can include using a set of “blockers” that evaluate deployment risk, e.g., don’t deploy when there is an ongoing issue in the target environment. Before starting a new deployment, the pipeline should check for app-specific and organization-wide open alarms to determine whether there are any active issues. If any alarm is in the alarm state, the pipeline prevents the change from moving forward. Developers also need to be able to override these deployment blockers when a change must be deployed to prod to recover from a high-severity issue.

The pipeline can also be configured with a set of time windows that define when a deployment is allowed to start. You want windows that are as large as possible but that don’t conflict with other high-risk activities (e.g. certain batch processes) and that avoid off-hours, since engaging the on-call engineer takes longer then.
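Both kinds of blocker can be sketched as a single gate function (a toy model with invented names; real systems would query an alarm service and a calendar):

```python
from datetime import datetime

def deployment_allowed(open_alarms, now, windows, override=False):
    """Hypothetical deployment blocker: refuse to start a deployment
    while any alarm is open or outside allowed time windows, unless a
    developer explicitly overrides (e.g. to ship an urgent fix)."""
    if override:
        return True            # manual override for high-severity recovery
    if open_alarms:
        return False           # active issue in the target environment
    # Allowed only inside a configured window, e.g. business hours.
    return any(start <= now.hour < end for start, end in windows)

windows = [(9, 17)]  # 09:00-17:00
assert not deployment_allowed(["HighFaultRate"], datetime(2024, 5, 6, 10), windows)
assert deployment_allowed([], datetime(2024, 5, 6, 10), windows)
assert not deployment_allowed([], datetime(2024, 5, 6, 22), windows)
assert deployment_allowed(["HighFaultRate"], datetime(2024, 5, 6, 22), windows,
                          override=True)
```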

Pipelines as code

We discussed above how one microservice may have many pipelines (for app code, infra code, OS patches, etc.), each with several stages, regions/AZs, alarms, etc. This translates into a lot of configuration. Consider practicing “pipelines as code”. Ideally you can even model pipelines using inheritance.
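A minimal sketch of modeling pipelines with inheritance (class and stage names are illustrative, not AWS’s internal tooling): shared safety defaults live in a base class, and each pipeline type only overrides what differs.

```python
class BasePipeline:
    """Hypothetical pipelines-as-code base: safety defaults that every
    pipeline inherits, so teams can't accidentally omit them."""
    stages = ["build", "alpha", "beta", "gamma", "prod"]
    auto_rollback = True
    one_box_first = True

class AppCodePipeline(BasePipeline):
    # Application code additionally runs integration tests.
    integration_tests = True

class OsPatchPipeline(BasePipeline):
    # OS patches skip the functional alpha/beta environments.
    stages = ["build", "gamma", "prod"]

p = OsPatchPipeline()
assert p.auto_rollback and p.one_box_first   # inherited safety defaults
assert "alpha" not in p.stages               # specialized per pipeline type
```

The design benefit is that a safety improvement added to the base (say, a new blocker) propagates to every pipeline automatically.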


Balance deployment safety against deployment speed. Minimize the amount of time developers spend worrying about deployments. Building automated deployment safety into the release process by using extensive pre-production testing, automatic rollbacks, and staggered production deployments helps minimize the potential impact on production caused by deployments. This means that developers don’t need to actively watch deployments to production.

Further reading
