

AWS Well-Architected Framework – Abridged

This is an abridged version of the AWS Well-Architected Framework. It is essentially a cut and paste of the most salient parts (the original is about 18,000 words; this is about 4,000).


The AWS Well-Architected Framework helps you learn architectural best practices for designing and operating reliable, secure, efficient, and cost-effective systems in the cloud. It is based on five pillars:

  • Operational Excellence,
  • Security,
  • Reliability,
  • Performance Efficiency, and
  • Cost Optimization.

(Think CORPS: Cost Optimization, Operational Excellence, Reliability, Performance Efficiency, Security.)

Operational Excellence: The ability to run, monitor and improve systems to deliver business value.

Security: Protect information, systems, and assets while delivering business value through risk assessments and mitigation strategies.

Reliability: Recover from disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions such as misconfigurations or transient network issues.

Performance Efficiency: Use computing resources efficiently, and to maintain that efficiency as demand changes.

Cost Optimization: Avoid or eliminate unneeded cost or suboptimal resources.

When architecting solutions, you make trade-offs between pillars based on your business context, e.g., reducing cost at the expense of reliability in dev, or optimizing for reliability at increased cost in prod.
Note, however, that security and operational excellence are generally not traded off against the other pillars.

On Architecture

Amazon prefers to distribute capabilities into teams rather than having a centralized team with that capability. Every team is expected to have the capability to create architectures and to follow best practices.

General Design Principles

The Well-Architected Framework identifies a set of general design principles to facilitate good design in the cloud:

  • Stop guessing your capacity needs: Scale up and down automatically.
  • Test systems at production scale: Create a production-scale test environment on demand in a cost-effective way.
  • Automate to make architectural experimentation easier:
    Automation allows you to create and replicate your systems at low cost and avoid the expense of manual effort.
  • Allow for evolutionary architectures: Architectural decisions shouldn’t be implemented as static, one-time events. In the cloud, the capability to automate and test on demand lowers the risk of impact from design changes. This allows systems to evolve over time.
  • Drive architectures using data: Collect data on the impact of your architectural choices and make fact-based decisions on how to improve.
  • Improve through game days: Test how your architecture and
    processes perform by regularly scheduling game days to simulate events in production. This will help you understand where improvements can be made and can help develop organizational experience in dealing with events.

The Five Pillars of the Well-Architected Framework

Creating a software system is a lot like constructing a building. If the foundation is not solid, structural problems can undermine the integrity and function of the building. Incorporating the five pillars into your architecture will help you produce stable and efficient systems, allowing you to focus on functional requirements.

Operational Excellence

The operational excellence pillar includes the ability to run and monitor systems to deliver business value and to continually improve supporting processes and procedures.

Design Principles

There are six design principles for operational excellence in the cloud:

  1. Perform operations as code: Apply the same engineering discipline to your entire environment that you use for application code. Script your
    operations procedures and automate their execution. By performing operations as code, you limit human error and enable consistent responses to events.
  2. Annotate documentation: Automate the creation of documentation after every build, or automatically annotate hand-crafted documentation. Annotated documentation can be used by people and systems, and as an input to your operations code.
  3. Make frequent, small, reversible changes: Make changes reversible, and avoid affecting customers where possible.
  4. Refine operations procedures frequently
  5. Anticipate failure: Perform “pre-mortem” exercises to identify
    potential sources of failure and test your response procedures.
  6. Learn from all operational failures: Share what is learned across teams
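The first principle above, performing operations as code, can be sketched as a tiny scripted runbook step (a hedged Python sketch; the instance list, health check, and restart hooks are hypothetical stand-ins, not real AWS APIs):

```python
from typing import Callable

def restart_unhealthy(instances: list[dict],
                      is_healthy: Callable[[str], bool],
                      restart: Callable[[str], None]) -> list[str]:
    """Scripted runbook step: restart every instance that fails its
    health check, and return the ids acted on for the audit log."""
    acted = []
    for inst in instances:
        if not is_healthy(inst["id"]):
            restart(inst["id"])
            acted.append(inst["id"])
    return acted
```

Because the procedure is code rather than a wiki page, it runs the same way every time, and its output can feed the operational log.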

Best Practices


Effective preparation is required to drive operational excellence. Design workloads with mechanisms to monitor and gain insight into application, platform, and infrastructure components, as well as customer experience and behavior.
Create mechanisms to validate that workloads, or changes, are ready to be moved into production and supported by operations.

Using AWS CloudFormation enables you to have consistent, templated, sandbox development, test, and production environments with increasing levels of operations control. Data on use of
resources, application programming interfaces (APIs), and network flow logs can be collected using Amazon CloudWatch, AWS CloudTrail, and VPC Flow Logs.



Define expected outcomes and determine how success will be measured. Establish baselines from which improvement or degradation of operations will be identified. Use collected metrics to determine if you are satisfying customer and business needs, and identify areas for improvement.

Communicate the operational status of workloads through dashboards. Determine the root cause of unplanned events and unexpected impacts from planned events. This information will be used to update your procedures to mitigate future occurrence of events.

In AWS, you can generate dashboard views of your metrics collected from workloads and natively from AWS. AWS provides workload insights through logging capabilities including AWS X-Ray, CloudWatch, CloudTrail, and VPC Flow Logs.



Evolution of operations is required to sustain operational excellence. Dedicate work cycles to making continuous incremental improvements.

Include feedback loops within your procedures to rapidly identify areas for improvement and capture learnings from the execution of operations.

Share lessons learned across teams to share the benefits of those lessons.



Security

The security pillar includes the ability to protect information, systems, and assets while delivering business value through risk assessments and mitigation strategies.

Design Principles

There are six design principles for security in the cloud:

  • Implement a strong identity foundation

Implement the principle of least privilege and enforce separation of duties with appropriate authorization. Reduce reliance on long-term credentials.

  • Enable traceability

Monitor, alert, and audit actions and changes to
your environment in real time.

  • Apply security at all layers

Apply a defense-in-depth approach

  • Automate security best practices

Create secure architectures, including the implementation of controls that are defined and managed as code in version-controlled templates.

  • Protect data in transit and at rest

Classify your data into sensitivity levels and use mechanisms, such as encryption and tokenization, where appropriate. Reduce direct human access to data.

  • Prepare for security events

Have an incident management process. Run incident response simulations.

Best Practices

There are five best practice areas for security in the cloud:

  1. Identity and Access Management
  2. Detective Controls
  3. Infrastructure Protection
  4. Data Protection
  5. Incident Response

Identity and Access Management

Identity and access management ensures that only authorized and authenticated users are able to access your resources.
In AWS, privilege management is primarily supported by the AWS Identity and Access Management (IAM) service, which allows you to control user access to AWS services and resources via policies, which assign permissions to a user, group, role, or resource. You also have the ability to require strong password practices and MFA. IAM enables secure access for systems through instance profiles, identity federation, and temporary credentials.
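As an illustration of least privilege, a policy document granting only read access to a single S3 bucket can be built as follows (a sketch using the standard IAM policy grammar; the bucket name and helper function are illustrative):

```python
import json

def read_only_bucket_policy(bucket: str) -> str:
    """Build a least-privilege IAM policy document granting read-only
    access to a single S3 bucket. Actions and ARN formats follow the
    published IAM policy grammar."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{bucket}",       # the bucket itself (ListBucket)
                f"arn:aws:s3:::{bucket}/*",     # objects within it (GetObject)
            ],
        }],
    }
    return json.dumps(policy, indent=2)
```

The resulting JSON can be attached to a user, group, or role; everything not explicitly allowed is denied by default.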

Detective Controls

You can use detective controls to identify a potential security incident. For example:

  • conduct an inventory of assets and their detailed attributes;
  • use internal auditing to ensure that practices meet policies and requirements, and that you have set up the correct automated
    alerting notifications.

These controls are reactive factors that can help identify and understand the scope of anomalous activity.

In AWS, you can implement detective controls by processing logs, events, and monitoring.
CloudTrail logs AWS API calls, CloudWatch provides monitoring of metrics with alarming, and AWS Config provides configuration history.
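A minimal detective control along these lines can be sketched in Python. The event dictionaries below are a simplified stand-in for CloudTrail records, not the real schema:

```python
def flag_no_mfa_logins(events: list[dict]) -> list[str]:
    """Detective control sketch: flag console logins recorded without
    MFA. The flat event shape here is an illustrative simplification
    of CloudTrail's nested record format."""
    flagged = []
    for e in events:
        if (e.get("eventName") == "ConsoleLogin"
                and e.get("mfaUsed") != "Yes"):
            flagged.append(e.get("userName", "unknown"))
    return flagged
```

In practice a rule like this would run over delivered log batches and raise an alert, rather than return a list.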

Log management is important to a well-architected design for reasons ranging from security or forensics to regulatory or legal requirements.


Infrastructure Protection

Infrastructure protection includes control methodologies, such as defense-in-depth and MFA.
You can implement stateful and stateless packet inspection.

You should use Amazon Virtual Private Cloud (Amazon VPC) to create a private, secured, and scalable environment in which you can define your topology—including gateways, routing tables, and public and private subnets.

Enforcing boundary protection, monitoring points of ingress and egress, and comprehensive logging, monitoring, and alerting are all essential to an effective information security plan.

Data Protection

Before architecting any system, foundational practices that influence security should be in place. For example, data classification provides a way to categorize organizational data based on levels of sensitivity, and encryption protects data by rendering it unintelligible to unauthorized access.

As an AWS customer you maintain full control over your data and AWS never initiates the movement of data between Regions.

AWS provides multiple means for encrypting data at rest and in transit, including server-side encryption (SSE) for Amazon S3, and the entire HTTPS encryption and decryption process (generally known as SSL termination) can be handled by ELB.

Incident Response

Your organization should put processes in place to respond to and mitigate the potential impact of security incidents. Putting in place the tools and access ahead of a security incident, then routinely practicing incident response, will make sure the architecture is updated to accommodate timely investigation and recovery.


Reliability

The reliability pillar includes the ability of a system to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions.

Design Principles

There are five design principles for reliability in the cloud:

  1. Test recovery procedures: In the cloud, you can test how your system fails and validate your recovery procedures. You can use automation to simulate different failures or to recreate scenarios that led to failures before.
  2. Automatically recover from failure: By monitoring a system for key performance indicators (KPIs), you can trigger automation when a threshold is breached. With more sophisticated automation, it’s possible to anticipate and remediate failures before they occur.
  3. Scale horizontally to increase aggregate system availability: Replace one large resource with multiple small resources to reduce the impact of a single failure on the overall system. Distribute requests across multiple, smaller resources to ensure that they don’t share a common point of failure.
  4. Stop guessing capacity: A common cause of failure in on-premises systems is resource saturation. In the cloud, you can monitor demand and system utilization, and automate the addition or removal of resources.
  5. Manage change in automation: Changes to your infrastructure should be done using automation.

Best Practices


Before architecting any system, foundational requirements that influence reliability should be in place. For example, you must have sufficient network bandwidth to your data center. These requirements are sometimes neglected because they are beyond a single project’s scope. In an on-premises environment, these requirements can cause long lead times and therefore must be incorporated during initial planning. With AWS, most of these foundational requirements are already incorporated or may be addressed as needed. The cloud is designed to be essentially limitless.

AWS sets service limits (an upper limit on the number of each resource your team can request) to protect you from accidentally over-provisioning resources.

Change Management

Being aware of how change affects a system allows you to plan proactively, and monitoring allows you to quickly identify trends that could lead to capacity issues or SLA breaches.
Using AWS, you can monitor the behavior of a system and automate the response to KPIs, for example, by adding additional servers as a system gains more users. You can control who has permission to make system changes and audit the history of these changes.

When you architect a system to automatically add and remove resources in response to changes in demand, this not only increases reliability but also ensures that business success doesn’t become a burden.

Failure Management

In any system of reasonable complexity it is expected that failures will occur. With AWS, you can take advantage of automation to react to monitoring data. For example, when a particular metric crosses a threshold, you can trigger an automated action to remedy the problem. Also, rather than trying to diagnose and fix a failed resource that is part of your production environment, you can replace it with a new one.

Regularly back up your data and test your backup files to ensure you can recover from both logical and physical errors. A key to managing failure is the frequent and automated testing of systems to cause failure, and then observe how they recover. Actively track KPIs, such as the recovery time objective (RTO) and recovery point objective (RPO), to assess a system’s resiliency (especially under failure-testing scenarios). Tracking KPIs will help you identify and mitigate single points of failure. The objective is to thoroughly test your system-recovery processes so that you are confident that you can recover all your data and continue to serve your customers, even in the face of sustained problems. Your recovery processes should be as well exercised as your normal production processes.
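Tracking the RPO KPI can be as simple as measuring the age of the newest successful backup (a sketch; the timestamps would come from your own backup tooling, and the target is an assumed figure):

```python
from datetime import datetime, timedelta

def worst_case_rpo(backup_times: list[datetime],
                   now: datetime) -> timedelta:
    """Worst-case data-loss window right now: the time elapsed since
    the most recent successful backup. Compare against the RPO target
    to decide whether the system meets its objective."""
    if not backup_times:
        raise ValueError("no successful backups recorded")
    return now - max(backup_times)
```

An analogous measurement during a game day, from failure injection to full recovery, gives the observed RTO.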


Performance Efficiency

The performance efficiency pillar includes the ability to use computing resources efficiently to meet system requirements and to maintain that efficiency as demand changes and technologies evolve.

Design Principles

There are five design principles for performance efficiency in the cloud:

  1. Democratize advanced technologies: Rather than having your IT team learn how to host and run a new technology, they can simply consume it as a service. In the cloud, technologies that require hard-to-acquire expertise become services that your team can consume while focusing on product development.
  2. Go global in minutes: Easily deploy your system in multiple Regions around the world with just a few clicks.
  3. Use serverless architectures: Remove the need for you to run and maintain servers to carry out traditional compute activities. For example, storage services can act as static websites.
  4. Experiment more often: With virtual and automatable resources, you can quickly carry out comparative testing using different types of instances, storage, or configurations.
  5. Mechanical sympathy: Use the technology approach that aligns best to what you are trying to achieve. For example, consider data access patterns when selecting database or storage approaches.

Best Practices



How do you select the best performing architecture?

Well-architected systems use multiple solutions to improve performance, selected using a data-driven approach e.g., through benchmarking or load testing. Your architecture will likely combine a number of different architectural approaches (for example, event-driven, ETL, or pipeline). In the following sections we look at the four main resource types that you should consider (compute, storage, database, and network).



How do you select your compute solution?

In AWS, compute is available in three forms: instances, containers, and functions.

  • Instances (virtualized servers)

In the cloud, you can experiment with different virtual server instances, which come in different
families and sizes, with a wide variety of capabilities, including
solid-state drives (SSDs) and graphics processing units (GPUs).

  • Containers

Containers allow you to run an application and its dependencies in resource-isolated processes.

  • Functions

Functions abstract the execution environment from the code you want to execute. For example, AWS Lambda allows you to execute code without provisioning or running an instance.

The optimal architecture may use different compute solutions for various components and take advantage of the elasticity mechanisms available to ensure sufficient capacity to sustain performance as demand changes.


How do you select your storage solution?

The optimal storage solution for a particular system will vary based on the kind of access method (block, file, or object), patterns of access (random or sequential), throughput required, frequency of access (online, offline, archival), frequency of update (WORM, dynamic), and availability and durability constraints.

When you select a storage solution, ensuring that it aligns with your access patterns will be critical to achieving the performance you want.


How do you select your database solution?

The optimal database solution can vary based on requirements for availability, consistency, partition tolerance, latency, durability and scalability.

  • Amazon RDS provides a scalable, relational database
  • Amazon DynamoDB is a NoSQL database that
    provides single-digit millisecond latency at any scale.
  • Amazon Redshift is a petabyte-scale data warehouse

Database approach (RDBMS, NoSQL, etc.) is often an area that is chosen according to organizational defaults rather than through a data-driven approach. It is critical to consider access patterns, and whether non-database solutions could solve the problem more
efficiently (e.g., a search engine or data warehouse).


How do you configure your networking solution?

Network solutions vary based on latency, throughput requirements etc.
AWS offers product features:

  • to optimize network traffic: e.g., instance types with high network performance, Amazon EBS-optimized instances, Amazon S3
    transfer acceleration, and dynamic content delivery with Amazon CloudFront
  • to reduce network distance: e.g., Amazon Route 53 latency-based routing, Amazon VPC endpoints, and AWS Direct Connect.

Consider location. With AWS, you can choose to place resources close to where they will be used to reduce distance. By taking advantage of Regions, placement groups, and edge locations you can significantly improve performance.


How do you ensure that you continue to have the most appropriate resource type as new resource types and features are launched?
When architecting solutions, there is a finite set of options that you can choose from. However, over time new technologies and approaches become available that could improve the performance.

Understanding where your architecture is performance-constrained will allow you to look out for releases that could alleviate that constraint.


How do you monitor your resources post-launch?

You need to monitor your architecture so that you can remediate any issues before your customers are aware. Monitoring metrics should be used to raise alarms when thresholds are breached.

Amazon CloudWatch provides the ability to monitor and send notification alarms, and you can automatically trigger actions through Amazon Kinesis, Amazon SQS, and AWS Lambda.

Ensuring that you do not miss things (false negatives), or are overwhelmed with false positives, is key to having an effective monitoring solution.
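One common way to balance false negatives against false positives is an "M out of N datapoints" evaluation, similar to how CloudWatch alarms can be configured. A sketch:

```python
def alarm_state(datapoints: list[float], threshold: float,
                m: int, n: int) -> bool:
    """Alarm only when at least m of the last n datapoints breach the
    threshold. A one-off spike (1 of n) stays quiet, suppressing false
    positives, while a sustained breach still alarms."""
    window = datapoints[-n:]
    breaches = sum(1 for d in window if d > threshold)
    return breaches >= m
```

Tuning m and n trades alarm latency against noise: higher m means fewer spurious pages but a slower alarm.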

Automated triggers avoid human error and can reduce the time to fix problems. Plan for game days where you can conduct simulations in the production environment to test your alarm solution and ensure that it correctly recognizes issues.


How do you use tradeoffs to improve performance?

When you architect solutions, think about tradeoffs so you can select an optimal approach. Depending on your situation you could trade consistency, durability, and space versus time or latency to deliver higher performance.

Note however that tradeoffs can increase the complexity of your architecture and require load testing to ensure that a measurable benefit is obtained.

Cost Optimization

The cost optimization pillar includes the ability to avoid or eliminate
unneeded cost or suboptimal resources.

Design Principles

There are five design principles for cost optimization in the cloud:

  1. Adopt a consumption model: Don’t forecast; pay only for the computing resources that you consume, and increase or decrease usage as needed.
  2. Measure overall efficiency: Measure the business output and the costs associated with delivering it, to understand the gains you make from increasing output and reducing costs.
  3. Stop spending money on data center operations: (hmm, biased?!)
  4. Analyze and attribute expenditure: Identify the usage and cost of systems to measure ROI and optimize.
  5. Use managed services to reduce cost of ownership: Remove the operational burden of maintaining servers.

Best Practices

Cost-Effective Resources


Are you considering cost when you select AWS services for your solution?
Have you sized your resources and selected the appropriate pricing model to meet your cost targets?

Using the appropriate instances and resources for your system is key to cost savings.

For example:

  • A reporting process might take five hours to run on a smaller server but one hour to run on a larger server that is twice as expensive. Same outcome, but the smaller one will incur more
    cost over time.
  • Rather than maintaining servers to deliver email, you can use a service that charges on a per-message basis.
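The reporting-server example works out as follows (the hourly rates are assumed for illustration; only the 2x price ratio and the 5-hour vs 1-hour runtimes come from the text):

```python
def job_cost(hourly_rate: float, hours: float) -> float:
    """Cost of one run of the reporting job, billed per hour."""
    return hourly_rate * hours

# Assumed rates: the larger server costs twice as much per hour.
small = job_cost(hourly_rate=0.10, hours=5)  # 0.50 per run
large = job_cost(hourly_rate=0.20, hours=1)  # 0.20 per run
```

Despite its higher hourly rate, the larger server is cheaper per run, so the smaller one incurs more cost over time.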

AWS offers a variety of flexible and cost-effective pricing options to acquire EC2 instances:

  • On-Demand Instances: pay for compute capacity by the hour; no commitments required.
  • Reserved Instances: reserve capacity with savings of up to 75% off On-Demand pricing.
  • Spot Instances: bid on unused EC2 capacity at significant discounts. (appropriate where the system can tolerate individual servers
    going down)

Matching Supply and Demand


How do you make sure your capacity matches but does not substantially exceed what you need?

Optimally matching supply to demand delivers the lowest cost for a system, but you also need sufficient extra supply to allow for provisioning time and individual resource failures.

A well-architected system will use the most cost-effective resources, and managed services (e.g. SES rather than your own email server) to reduce costs.

Auto Scaling and demand, buffer, and time-based approaches allow you to add and remove resources as needed.
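A demand-based approach can be sketched as a function that sizes the fleet to current load plus fixed headroom (a sketch only; the headroom fraction and per-instance capacity are illustrative parameters, not an Auto Scaling API):

```python
import math

def desired_capacity(current_load: float, per_instance_capacity: float,
                     headroom: float = 0.2, minimum: int = 1) -> int:
    """Size the fleet to current demand plus headroom, which covers
    provisioning time and individual resource failures. Never scale
    below the configured minimum."""
    needed = current_load * (1 + headroom) / per_instance_capacity
    return max(minimum, math.ceil(needed))
```

A scheduler would evaluate this periodically against a load metric and adjust the group's desired count up or down.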


Expenditure Awareness


Do you consider data-transfer charges when designing your architecture?
How are you monitoring usage and spending?
Do you decommission resources that you no longer need or stop resources that are temporarily not needed?

The ease of use and virtually unlimited on-demand capacity may require a new way of thinking about expenditures.
The capability to attribute resource costs to the individual business or product owners drives efficient usage behavior, helps reduce waste and allows you to understand which products are truly profitable.

You can use cost allocation tags to categorize and track your AWS costs and set up billing alerts to notify you of predicted overspending.
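Attributing spend with cost allocation tags amounts to a roll-up of billing line items by tag value (a sketch; the flat line-item shape is a simplified stand-in for real billing report data):

```python
from collections import defaultdict

def cost_by_tag(line_items: list[dict], tag_key: str) -> dict:
    """Roll up billing line items by a cost allocation tag so spend
    can be attributed to business or product owners. Untagged spend
    is grouped under 'untagged' so it is not silently lost."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += item["cost"]
    return dict(totals)
```

A large "untagged" bucket is itself a useful signal that tagging discipline needs attention.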

Optimizing Over Time


How do you manage and/or consider the adoption of new services?

It is a best practice to review your existing architectural decisions to ensure they continue to be the most cost-effective.
Be aggressive in decommissioning resources, entire services, and systems that you no longer require.
Be aware of new managed services that could help save you money.

The Review Process

The review of architectures should be a light-weight process that is a conversation and not an audit. The purpose of reviewing an architecture is to identify any critical issues that might need
addressing or areas that could be improved. The outcome of the review is a set of actions that should improve the experience of a customer using the workload.

Ideally, each team member takes responsibility for the quality of the architecture. Reviews should be applied at multiple points in an application's lifecycle, including early in the design phase to avoid one-way doors that are difficult to change, and again before the go-live date.



Conclusion

The AWS Well-Architected Framework provides architectural best practices across the five pillars for designing and operating reliable, secure, efficient, and cost-effective systems in the cloud. Using the Framework in your architecture will help you produce stable and efficient systems, allowing you to focus on your functional requirements.
