5 Pillars of Well-Architected Framework

Creating a software system is a lot like constructing a building. If the foundation is not solid, structural problems can undermine the integrity and function of the building. In this article, we're going to talk about the design principles we can follow to build future-proof, large-scale software. The concepts come from the AWS Well-Architected Framework whitepaper, which describes architectural best practices for designing and operating reliable, secure, efficient, and cost-effective systems in the cloud. It provides a way to consistently measure your architectures against best practices and identify areas for improvement. I'll try to summarize the whitepaper here.

So let's first quickly sum up the Guiding Design Principles:

  • Stop guessing capacity needs: Scale up & down as required
  • Automate everything: Automated systems ensure consistency & reliability
  • Test at scale: Test an accurate replica of production on demand
  • Adapt & Evolve: Adapt the architecture as needed to meet new challenges
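
To make "stop guessing capacity needs" a little more concrete, here is a minimal sketch of a scaling rule that picks an instance count from observed load. The function name, the 60% CPU target, and the instance bounds are assumptions made for illustration; they are not taken from the whitepaper.

```python
def desired_instances(current: int, cpu_percent: int, target_percent: int = 60,
                      min_instances: int = 1, max_instances: int = 20) -> int:
    """Return the instance count that would bring average CPU near the target.

    All thresholds here are illustrative assumptions, not AWS defaults.
    """
    if current < 1 or target_percent < 1:
        raise ValueError("current and target_percent must be positive")
    # Integer ceiling division: round up so we never under-provision.
    wanted = (current * cpu_percent + target_percent - 1) // target_percent
    # Clamp to the configured floor and ceiling.
    return max(min_instances, min(max_instances, wanted))
```

With these assumed defaults, four instances running at 90% CPU would scale out to six, and ten instances idling at 15% would scale in to three.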

The framework is based on 5 pillars:

  1. Operational Excellence
  2. Cost Optimization
  3. Reliability
  4. Performance Efficiency
  5. Security

Operational Excellence

The main emphasis of this pillar is: Does your architecture work ? Will it continue to ? Let's look at the pillar-specific principles:

  • All operations are code
  • Documentation is updated automatically
  • Make smaller changes you can roll back
  • Iterate...a lot
  • Expect things to go sideways
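
"Make smaller changes you can roll back" can be sketched in a few lines. This is only an illustration under simple assumptions: configuration is a plain dictionary, and `is_healthy` is a hypothetical check supplied by the caller.

```python
from typing import Callable, Dict

Config = Dict[str, str]


def apply_with_rollback(live: Config, change: Config,
                        is_healthy: Callable[[Config], bool]) -> Config:
    """Apply a small change; keep it only if the health check passes."""
    candidate = {**live, **change}  # the live config is never mutated
    if is_healthy(candidate):
        return candidate
    # Things went sideways: fall back to a copy of the known-good config.
    return dict(live)
```

Because the change is small and the previous state is kept intact, rolling back is just a matter of returning the old configuration.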

Cost Optimization

Emphasis: Spend only what you have to

Pillar-specific principles:

  • Consumption based pricing
  • Measure efficiency constantly
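
One simple way to "measure efficiency constantly" is to track a unit cost, such as spend per thousand requests. The sketch below assumes you already have a total spend figure and a request count for the same period; the function name and the numbers in the usage note are made up.

```python
def cost_per_thousand_requests(total_cost: float, requests: int) -> float:
    """Return the spend attributable to every 1,000 requests served."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return total_cost * 1000 / requests
```

For example, 50.0 of spend across 200,000 requests works out to roughly 0.25 per thousand requests. Watching this number over time shows whether the system is getting cheaper or more expensive per unit of work.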


Reliability

Emphasis: Will this system work consistently & recover quickly ?

Pillar-specific principles:

  • Recover from issues automatically
  • Scale horizontally first for resilience
  • Reduce idle resources
  • Manage change through automation
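
"Recover from issues automatically" often starts with something as small as retrying a failed call. Here is a minimal sketch of retries with exponential backoff; the attempt count and base delay are assumptions, and a real system would usually catch narrower exception types.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(operation: Callable[[], T], attempts: int = 4,
                      base_delay: float = 0.1,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run operation, retrying on failure with exponentially growing waits."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 0.1s, 0.2s, 0.4s, ...
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the waiting can be replaced in tests.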

Performance Efficiency

Emphasis: Remove bottlenecks, reduce waste

Pillar-specific principles:

  • Reduce latency
  • Serverless


Security

Emphasis: Does this system work only as intended ?

Pillar-specific principles:

  • Automate security tasks
  • Encrypt data in transit and at rest
  • Know who did what when
  • Identities have the least privileges required
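
Two of these principles, least privilege and "know who did what when", fit together naturally. The sketch below uses a deny-by-default permission table plus an audit trail; the role names, action strings, and in-memory log are hypothetical and only meant to show the idea.

```python
from datetime import datetime, timezone
from typing import Dict, List, Set, Tuple

# Hypothetical roles: each one gets only the actions it actually needs.
PERMISSIONS: Dict[str, Set[str]] = {
    "report-reader": {"reports:read"},
    "report-admin": {"reports:read", "reports:write"},
}

# (timestamp, identity, action, allowed)
audit_log: List[Tuple[str, str, str, bool]] = []


def authorize(identity: str, role: str, action: str) -> bool:
    """Allow an action only if the role explicitly grants it, and record the attempt."""
    allowed = action in PERMISSIONS.get(role, set())  # unknown roles get nothing
    timestamp = datetime.now(timezone.utc).isoformat()
    audit_log.append((timestamp, identity, action, allowed))
    return allowed
```

Denied attempts are logged as well as successful ones, since both matter when reconstructing who did what.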

Operational Excellence In Depth

Operational excellence is the ability to run systems and gain insights into their operations in order to deliver business value, and to continuously improve supporting processes and procedures.

The 3 Phases of Operational Excellence

Prepare-Prioritize: Prioritize to align with business priorities

  • What is the business goal ?
  • What are the critical pieces needed to meet that goal ?
  • Any compliance restrictions/requirements ?
  • Dependencies between services ?

Design your architecture to support business priorities

  • Is the design observable ?
  • Are your logs & observations actionable ?
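
One way to make logs actionable is to emit them as structured events rather than free text, so they can be filtered and alerted on. This is a minimal sketch using Python's standard `logging` and `json` modules; the event name and fields in the usage note are invented.

```python
import json
import logging


def log_event(logger: logging.Logger, event: str, **fields: object) -> str:
    """Log one event as a JSON line and return the line that was logged."""
    line = json.dumps({"event": event, **fields}, sort_keys=True)
    logger.info(line)
    return line
```

A call such as `log_event(logging.getLogger("orders"), "checkout_failed", order_id="A-17", reason="payment_timeout")` produces a line that a log tool can query by `event` or `reason` instead of by pattern matching on prose.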

Is your workload ready to go live ?

  • Are your processes consistent ?
  • Is operational code properly managed ?
  • Are tests in place ?
  • Have you anticipated failure ?
  • Ensure your workload is actually working

Operate: Shit happens. Be ready.

  • Anticipate planned & unplanned events
  • Respond in code
  • Connect observations with 3rd party tools as needed


Evolve

  • Learn from success & failure
  • Post-event, have runbooks changed ?
  • Test assumptions
  • Experiment early and often to find better solutions


Cost Optimization In Depth

  • Use the appropriate resources & configurations
  • Provision to current needs with an eye to the future
  • Right-size to the smallest resource that meets needs
  • Use data to choose purchase options
  • Optimize by geography
  • Optimize data transfer
  • Know how much you're spending and where
  • Continuously work to maximize value delivered
  • Align utilization with requirements
  • Report and validate findings
  • Evaluate new services for value
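
Right-sizing can be expressed as a small selection problem: among the options that meet the requirement, pick the cheapest. The sketch below is an illustration only; the `InstanceOption` type, the option names, and the prices in the usage note are made up and do not correspond to real instance types.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class InstanceOption:
    name: str
    vcpus: int
    memory_gib: int
    hourly_cost: float


def right_size(options: Iterable[InstanceOption], needed_vcpus: int,
               needed_memory_gib: int) -> Optional[InstanceOption]:
    """Return the cheapest option that satisfies both requirements, or None."""
    fitting = [o for o in options
               if o.vcpus >= needed_vcpus and o.memory_gib >= needed_memory_gib]
    return min(fitting, key=lambda o: o.hourly_cost, default=None)
```

Given a hypothetical catalog of small (2 vCPU, 4 GiB), medium (4 vCPU, 16 GiB), and large (8 vCPU, 32 GiB) options, a workload that needs 3 vCPUs and 8 GiB would land on medium rather than large.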

Awareness of spend is key to maximizing value


Reliability In Depth

Reliability is the ability of a system to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions.

  • Scale horizontally first for resilience
  • Reduce idle resources
  • Manage change through automation

Limits: Understand default & requested resource limits

Networking: Understand topology, bandwidth & latency

Availability: Ensure your application is ready for business use

  • Can users access your application ?
  • Can you deploy without issues ?
  • Can you push issues to planned downtime ?
  • Can your application withstand partial outages ?
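
A quick back-of-the-envelope calculation helps when reasoning about availability. The sketch below assumes component failures are independent, which is a simplification: components that all must work multiply their availabilities, while redundant copies only fail if every copy fails.

```python
from math import prod
from typing import Iterable


def serial_availability(components: Iterable[float]) -> float:
    """Availability when every component must be up (a chain of dependencies)."""
    return prod(components)


def redundant_availability(single: float, copies: int) -> float:
    """Availability when any one of several identical copies is enough."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1 - (1 - single) ** copies
```

Under that independence assumption, chaining two 99.9% services gives about 99.8%, while running two redundant 99% copies gives about 99.99%. This is why horizontal redundancy helps a system ride out the loss of a single component.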

Performance Efficiency In Depth


Selection

  • Is this the optimal solution for this workload ?
  • What type of compute best suits ?
  • Which data store is ideal for this workload ?
  • Does your network design complement compute & data store choices ?


Review

  • Continuously ensure choices work for your workload
  • Is infrastructure stored as code ?
  • Are deployments simple & automated ?
  • Can benchmarks be taken automatically ?


Monitoring

  • Use active & passive monitoring where appropriate
  • Understand the five phases of monitoring (Generation, Aggregation, Real-time Processing, Storage, Analysis)
  • Create actionable metrics
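
The five monitoring phases listed above can be walked through with a toy example. Everything here is an assumption made for illustration: the sample data, the service names, the 250 ms threshold, and the in-memory "storage" list.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

Sample = Tuple[str, float]  # (service name, latency in milliseconds)


def aggregate(samples: List[Sample]) -> Dict[str, float]:
    """Aggregation: collapse raw samples into one average latency per service."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for service, latency_ms in samples:
        grouped[service].append(latency_ms)
    return {service: mean(values) for service, values in grouped.items()}


def breaches(averages: Dict[str, float], threshold_ms: float) -> List[str]:
    """Real-time processing: name the services whose average exceeds the threshold."""
    return sorted(s for s, avg in averages.items() if avg > threshold_ms)


# Generation: raw measurements emitted by the workload (made-up numbers).
samples = [("checkout", 120.0), ("checkout", 480.0),
           ("search", 40.0), ("search", 60.0)]
averages = aggregate(samples)
alerts = breaches(averages, threshold_ms=250.0)
history = [averages]                           # Storage: keep results for later
worst = max(history[-1], key=history[-1].get)  # Analysis: slowest service so far
```

The metric is actionable because it names a specific service and compares it against an explicit threshold, rather than reporting a raw stream of numbers.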

Trade-offs: You can't have it all

Did you find this article valuable?

Support Rajan Prasad by becoming a sponsor. Any amount is appreciated!