General Design Principles
These principles apply across all pillars and form the foundation of a well-architected system in the cloud:
- Stop guessing your capacity needs: Eliminate guesswork when provisioning resources. You can use auto-scaling to match supply with demand dynamically, reducing costs and improving user experience.
- Test systems at production scale: In the cloud, you can create a production-scale test environment on-demand, complete your testing, and then decommission it. This allows for accurate testing without the high cost of maintaining a permanent test environment.
- Automate to make architectural experimentation easier: Automation allows you to create and replicate systems quickly and reliably. This makes it easier to experiment with different instance types, storage options, or configurations to see what works best.
- Allow for evolutionary architectures: In a traditional environment, architectural decisions are often made once and are hard to change. In the cloud, the ability to automate and test changes on demand lowers the risk of changing a design, so your architecture can evolve as business needs change.
- Drive architectures using data: Collect data on how your architectural choices impact business goals. Use this data to inform your decisions and select the right architectures and resource configurations.
- Improve through game days: Regularly schedule "game days" to simulate failure events or high-load scenarios. This is a crucial practice for testing how your systems and teams respond to problems, helping you identify areas for improvement before a real event occurs.
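As an illustration of matching supply with demand, the following is a minimal sketch of a proportional scaling rule. The function name, the metric, and the bounds are assumptions made for this example; it is not the documented algorithm of any particular auto-scaling service.

```python
import math


def desired_capacity(current_instances: int, observed_metric: float,
                     target_metric: float, min_instances: int = 1,
                     max_instances: int = 20) -> int:
    """Return an instance count that would bring the per-instance metric back to target.

    Hypothetical helper: scales the fleet in proportion to how far the observed
    average metric (for example CPU utilization) is from the target value.
    """
    if current_instances < 1 or target_metric <= 0:
        raise ValueError("current_instances must be >= 1 and target_metric > 0")
    raw = math.ceil(current_instances * observed_metric / target_metric)
    # Clamp to the configured bounds so a metric spike cannot scale without limit.
    return max(min_instances, min(max_instances, raw))


# Four instances averaging 90% CPU against a 60% target need six instances.
print(desired_capacity(4, 90.0, 60.0))  # → 6
```

The clamp to a minimum and maximum is a deliberate design choice: it keeps a faulty metric from removing all capacity or from driving up cost unchecked.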
The Six Pillars & Their Design Principles
1. Operational Excellence Pillar
This pillar focuses on running and monitoring systems to deliver business value and continuously improving supporting processes and procedures.
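- Perform operations as code: Define your entire workload (applications and infrastructure) as code and update it with code. Script your operations procedures and trigger them automatically in response to events, which limits human error and makes responses consistent.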
- Make frequent, small, reversible changes: Design systems to allow for small, incremental changes that can be easily reversed if they fail. This reduces the scope and impact of potential failures.
- Refine operations procedures frequently: Continuously evolve your operational procedures. As you use them, look for opportunities to improve and automate.
- Anticipate failure: Proactively identify potential sources of failure. Test your failure and response procedures to ensure they are effective.
- Learn from all operational failures: Treat every operational failure as a learning opportunity. Conduct a thorough post-mortem to understand the root cause and implement improvements to prevent recurrence.
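The idea of small, reversible changes can be sketched as a function that applies one change, checks health, and falls back to the previous state if the check fails. Everything here (the `Config` type, `apply_with_rollback`, and the health check) is hypothetical and stands in for whatever deployment tooling you actually use.

```python
from typing import Callable, Dict

Config = Dict[str, str]


def apply_with_rollback(current: Config, change: Config,
                        is_healthy: Callable[[Config], bool]) -> Config:
    """Apply a small change; keep it only if the health check passes.

    Returns the new configuration on success, or a copy of the previous
    configuration when the health check fails (the rollback path).
    """
    candidate = {**current, **change}  # the original dict is never mutated
    if is_healthy(candidate):
        return candidate
    return dict(current)


def example_health_check(config: Config) -> bool:
    # Assumption for the example: one specific image tag is known to be bad.
    return config["image"] != "app:1.5-broken"


live = {"image": "app:1.4", "replicas": "3"}
print(apply_with_rollback(live, {"image": "app:1.5"}, example_health_check))
print(apply_with_rollback(live, {"image": "app:1.5-broken"}, example_health_check))
```

Because each change touches only one key, a failed change is cheap to reverse and easy to reason about in a post-mortem.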
2. Security Pillar
This pillar focuses on protecting information, systems, and assets while delivering business value through risk assessments and mitigation strategies.
- Implement a strong identity foundation: Use the principle of least privilege and enforce separation of duties. Centralize identity management and aim to eliminate reliance on long-term static credentials.
- Enable traceability: Monitor, alert, and audit all actions and changes in your environment in real time. Integrate logs and metrics with systems that can automatically investigate and take action.
- Apply security at all layers: Instead of just focusing on the edge of your network, apply a defense-in-depth approach with security controls at all layers (e.g., network, VPC, subnet, load balancer, instance, OS, application).
- Automate security best practices: Automate security mechanisms to improve your ability to scale securely and rapidly. This includes things like security patching, vulnerability scanning, and configuration management.
- Protect data in transit and at rest: Classify your data into sensitivity levels and use mechanisms like encryption, tokenization, and access control where appropriate.
- Keep people away from data: Create mechanisms and tools to reduce or eliminate the need for direct human access to data. This minimizes the risk of mishandling or unauthorized modification.
- Prepare for security events: Have an incident management process that aligns with your organizational requirements. Run incident response simulations and use automation to speed up detection, investigation, and recovery.
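Least privilege and automated security checks can be combined in a small lint step. The sketch below assumes an IAM-style JSON policy document (a `Statement` list with `Effect`, `Action`, and `Resource` keys) and flags `Allow` statements that use a bare `*`. The function name and the sample policy are made up for illustration, and the check is deliberately narrow: it does not catch service-level wildcards such as `s3:*`, conditions, or `NotAction`.

```python
from typing import Any, Dict, List


def _as_list(value: Any) -> List[Any]:
    """Policy fields may hold a single value or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def find_overly_broad_statements(policy: Dict[str, Any]) -> List[int]:
    """Return indexes of Allow statements with a bare "*" Action or Resource."""
    flagged = []
    for index, statement in enumerate(_as_list(policy.get("Statement", []))):
        if statement.get("Effect") != "Allow":
            continue
        actions = _as_list(statement.get("Action", []))
        resources = _as_list(statement.get("Resource", []))
        if "*" in actions or "*" in resources:
            flagged.append(index)
    return flagged


# Hypothetical policy: the first statement is scoped, the second is not.
sample_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::example-bucket/*"},
        {"Effect": "Allow", "Action": "*", "Resource": "*"},
    ],
}
print(find_overly_broad_statements(sample_policy))  # → [1]
```

A check like this could run in a deployment pipeline so that overly broad policies are caught before they reach production.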
3. Reliability Pillar
This pillar focuses on ensuring a workload performs its intended function correctly and consistently when it’s expected to. A resilient workload can recover from infrastructure or service disruptions.
- Test recovery procedures: Use automation to simulate different failures or to re-create scenarios that led to past failures. This helps you understand the impact of failures and validates your recovery procedures.
- Automatically recover from failure: By monitoring a workload for key performance indicators (KPIs), you can trigger automated recovery processes when a threshold is breached, reducing manual effort and recovery time.
- Scale horizontally to increase aggregate workload availability: Replace one large resource with multiple smaller resources so that a single failure affects only a fraction of the workload. Distribute requests across the smaller resources and make sure they do not share a common point of failure.
- Stop guessing capacity: Monitor demand and workload utilization, and automate the addition or removal of resources to maintain the optimal level.
- Manage change in automation: Use automation for all changes to infrastructure. This ensures changes are made consistently and can be tracked and reviewed.
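The availability benefit of horizontal scaling can be estimated with simple arithmetic. The sketch below rests on two simplifying assumptions that real systems only approximate: resources fail independently, and the workload stays available as long as at least one resource is healthy (that is, the survivors can absorb the load). Under those assumptions, n resources with individual availability a give an aggregate availability of 1 - (1 - a)^n.

```python
def aggregate_availability(per_resource: float, count: int) -> float:
    """Availability of `count` redundant resources, assuming independent failures.

    The workload is treated as down only when every resource is down at once.
    """
    if not 0.0 <= per_resource <= 1.0:
        raise ValueError("per_resource must be between 0 and 1")
    if count < 1:
        raise ValueError("count must be at least 1")
    return 1.0 - (1.0 - per_resource) ** count


# Each resource is assumed to be 99% available in this example.
for n in (1, 2, 3):
    print(n, f"{aggregate_availability(0.99, n):.6f}")
```

Correlated failures (a shared Availability Zone, a bad deployment, a common dependency) break the independence assumption, which is why the principle also calls for avoiding a shared point of failure.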
4. Performance Efficiency Pillar
This pillar focuses on using computing resources efficiently to meet system requirements and maintaining that efficiency as demand changes and technologies evolve.
- Democratize advanced technologies: Consume advanced technologies such as AI/ML or data analytics as services instead of building and operating them yourself, so your team can focus on product development rather than resource provisioning.
- Go global in minutes: Easily deploy your system in multiple AWS Regions around the world to provide lower latency and a better experience for your customers at a minimal cost.
- Use serverless architectures: Serverless architectures remove the need for you to run and maintain servers for traditional compute activities. The managed service handles scaling and availability, which lowers your operational burden.
- Experiment more often: With virtual and automatable resources, you can perform comparative testing using different types of instances, storage, or configurations with ease.
- Consider mechanical sympathy: Understand how cloud services are consumed and how your workload uses them. Select the technology approach that aligns best with your workload goals, such as choosing the database or storage type that fits your data access patterns.
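Comparative testing only pays off if the results feed a decision. As a minimal sketch, suppose you have benchmarked several candidate configurations and want the cheapest one that still meets a latency objective. The configuration names, latencies, and prices below are invented for the example, as are the `BenchmarkResult` type and the helper function.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BenchmarkResult:
    name: str
    p99_latency_ms: float  # measured under the same test load for every candidate
    hourly_cost: float


def cheapest_meeting_target(results: List[BenchmarkResult],
                            max_p99_ms: float) -> Optional[BenchmarkResult]:
    """Return the lowest-cost candidate whose p99 latency meets the objective."""
    eligible = [r for r in results if r.p99_latency_ms <= max_p99_ms]
    # default=None covers the case where no candidate meets the objective.
    return min(eligible, key=lambda r: r.hourly_cost, default=None)


# Hypothetical measurements from three test runs.
measurements = [
    BenchmarkResult("config-a", p99_latency_ms=180.0, hourly_cost=0.10),
    BenchmarkResult("config-b", p99_latency_ms=95.0, hourly_cost=0.17),
    BenchmarkResult("config-c", p99_latency_ms=60.0, hourly_cost=0.34),
]
choice = cheapest_meeting_target(measurements, max_p99_ms=100.0)
print(choice.name if choice else "no candidate meets the objective")  # → config-b
```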
5. Cost Optimization Pillar
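This pillar focuses on running systems that deliver business value at the lowest price point by avoiding or eliminating unneeded costs.
- Implement Cloud Financial Management: Treat financial management as a capability to build, just like security or operations. Dedicate time and resources to the knowledge, programs, and processes needed to realize business value in the cloud and become cost efficient.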
- Adopt a consumption model: Pay only for the computing resources you consume and increase or decrease usage depending on business requirements, rather than provisioning for peak capacity.
- Measure overall efficiency: Measure the business output of the workload and the costs associated with delivering it. Use this data to understand the gains you make from increasing output and reducing cost.
- Stop spending money on data center operations: AWS handles the heavy lifting of racking, stacking, and powering servers, which allows you to focus on your customers and business projects rather than on IT infrastructure.
- Analyze and attribute expenditure: The cloud makes it easier to accurately identify the usage and cost of systems, which allows for transparent attribution of IT costs to individual workload owners.
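Attribution and efficiency measurement both reduce to small calculations once cost data is tagged. The sketch below assumes billing line items are available as `(owner_tag, cost)` pairs and that business output is a simple count such as orders processed. The function names, owner tags, and figures are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def attribute_costs(line_items: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sum cost per owner tag; items without a tag are grouped as "untagged"."""
    totals: Dict[str, float] = defaultdict(float)
    for owner, cost in line_items:
        totals[owner or "untagged"] += cost
    return dict(totals)


def cost_per_unit(total_cost: float, units_of_output: int) -> float:
    """Cost of delivering one unit of business output (for example one order)."""
    if units_of_output <= 0:
        raise ValueError("units_of_output must be positive")
    return total_cost / units_of_output


# Hypothetical monthly line items, tagged by the owning workload.
items = [("checkout", 120.0), ("search", 80.0), ("checkout", 30.0), ("", 10.0)]
by_owner = attribute_costs(items)
print(by_owner)
print(cost_per_unit(by_owner["checkout"], units_of_output=3000))
```

Tracking the "untagged" bucket separately makes gaps in tagging visible, and watching cost per unit over time shows whether a change improved efficiency or only shifted spend.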
6. Sustainability Pillar
This pillar focuses on minimizing the environmental impacts of running cloud workloads.
- Understand your impact: Establish performance indicators, evaluate the downstream and upstream impact of your workloads, and model the expected impact to set goals for improvement.
- Establish sustainability goals: Set long-term goals for sustainability, such as reducing the compute and storage resources required per transaction.
- Maximize utilization: Right-size workloads so that provisioned resources run at high utilization, and use auto-scaling to adjust to demand. This avoids overprovisioning and reduces the energy wasted on idle capacity.
- Anticipate and adopt new, more efficient hardware and software offerings: Continuously keep pace with the latest, most efficient technologies from AWS to improve the sustainability of your workloads.
- Use managed services: Sharing services across a broad customer base helps maximize resource utilization, which reduces the amount of infrastructure needed to support your workloads.