Key Disaster Recovery Metrics
Two fundamental metrics define any DR strategy:
- Recovery Time Objective (RTO): The maximum acceptable delay between the interruption of service and the restoration of service. This metric defines "how long" you can afford to be down.
- Recovery Point Objective (RPO): The maximum acceptable amount of time since the last data recovery point. This metric defines the maximum amount of data loss you can tolerate, measured in time (e.g., an RPO of 1 hour means that up to one hour of the most recent data may be lost).
The choice of a DR strategy is a trade-off between cost and complexity versus RTO and RPO. Lower RTO and RPO (faster recovery, less data loss) require more complex and expensive solutions.
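As a rough illustration of how the two metrics are measured, the sketch below computes downtime and the data-loss window for a single incident and compares them with the objectives. The timestamps, targets, and function names are hypothetical and chosen only for the example.

```python
from datetime import datetime, timedelta


def measure_recovery(last_recovery_point, outage_start, service_restored):
    """Return (downtime, data_loss_window) for one incident."""
    # Downtime is what RTO constrains: interruption until restoration.
    downtime = service_restored - outage_start
    # The data-loss window is what RPO constrains: changes made after the
    # last recovery point and before the outage cannot be restored.
    data_loss_window = outage_start - last_recovery_point
    return downtime, data_loss_window


def meets_objectives(downtime, data_loss_window, rto, rpo):
    """True only if both the RTO and the RPO were met."""
    return downtime <= rto and data_loss_window <= rpo


# Hypothetical incident: last backup at 11:30, outage at 12:00, restored at 15:00.
last_backup = datetime(2024, 1, 1, 11, 30)
outage = datetime(2024, 1, 1, 12, 0)
restored = datetime(2024, 1, 1, 15, 0)

downtime, loss = measure_recovery(last_backup, outage, restored)
print(downtime, loss)
print(meets_objectives(downtime, loss, rto=timedelta(hours=4), rpo=timedelta(hours=1)))
```

Under these assumed values the incident meets a 4-hour RTO and a 1-hour RPO, but it would miss a 1-hour RTO.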
Four Common Disaster Recovery Strategies on AWS
These strategies are presented in order of increasing cost and complexity, and decreasing RTO and RPO.
1. Backup and Restore
This is the simplest and lowest-cost DR strategy. It involves regularly backing up your data and infrastructure definitions to a different AWS Region. In a disaster, you initiate a restore process in the DR region.
- How it works:
- Data is backed up to Amazon S3 or captured as Amazon EBS snapshots, with copies kept in the DR region. AWS Backup can be used to automate and manage these backups centrally.
- Infrastructure can be defined as code using AWS CloudFormation or Terraform.
- In a DR event, you deploy the infrastructure from your templates and restore the data from backups.
- RTO: High (typically hours to a day). It is dominated by the time it takes to provision infrastructure and restore large data volumes.
- RPO: High (typically hours). It is bounded by the interval between backups, so more frequent or continuous backups reduce it.
- Best for: Non-critical systems, development/test environments, or applications that can tolerate significant downtime and some data loss.
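The relationship between backup schedule, data volume, and the resulting RTO and RPO can be estimated with simple arithmetic. The following is a minimal sketch under a simplified model; the function names, parameters, and figures are assumptions for illustration, not measurements of any AWS service.

```python
def worst_case_rpo_hours(backup_interval_hours, backup_duration_hours=0.0):
    """Worst-case data-loss window for periodic backups.

    Simplified model: a backup captures the state at its start time and is
    only usable once it completes. A disaster just before the next backup
    completes loses one full interval plus the backup duration.
    """
    return backup_interval_hours + backup_duration_hours


def estimated_rto_hours(provision_hours, data_gb, restore_gb_per_hour,
                        validation_hours=0.0):
    """Estimated recovery time: provision, restore data, then validate."""
    if restore_gb_per_hour <= 0:
        raise ValueError("restore_gb_per_hour must be positive")
    restore_hours = data_gb / restore_gb_per_hour
    return provision_hours + restore_hours + validation_hours


# Hypothetical workload: daily backups that take 1 hour, 2000 GB of data,
# an assumed restore throughput of 500 GB/hour.
print(worst_case_rpo_hours(24, 1))
print(estimated_rto_hours(0.5, 2000, 500, validation_hours=0.5))
```

With these assumed numbers, the worst-case data loss is 25 hours and the estimated recovery time is 5 hours, which shows why this strategy suits workloads that tolerate long outages.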
2. Pilot Light
In this strategy, a minimal version of the core infrastructure is always running in the DR region. This "pilot light" includes critical components like databases, which are kept up to date. The application servers and other components are kept switched off or not deployed, and are only started or provisioned during a disaster.
- How it works:
- A copy of your core infrastructure (e.g., database servers and the supporting network configuration) runs in the DR region, while application servers stay switched off or undeployed until needed.
- Data is replicated from the primary region to the DR region's database.
- During a DR event, you scale up the infrastructure by provisioning and starting the full set of application servers. DNS is updated to point traffic to the DR region.
- RTO: Lower than Backup and Restore (typically tens of minutes), because the data is already in place and only the application tier has to be provisioned and started.
- RPO: Lower than Backup and Restore (typically seconds to minutes).
- Best for: Core business applications where a recovery time of under an hour is desirable but the cost of a fully duplicated environment is prohibitive.
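The DNS update step of a failover can be sketched as a function that builds the change request. The dictionary below follows the `ChangeBatch` shape accepted by the Route 53 `ChangeResourceRecordSets` API; the sketch only builds the payload and does not call AWS. The record name, the DR endpoint, and the function name are hypothetical.

```python
import json


def build_dns_failover_change(record_name, dr_endpoint, ttl_seconds=60):
    """Build a Route 53 ChangeBatch that repoints record_name at the DR endpoint."""
    return {
        "Comment": "DR failover: route traffic to the DR region",
        "Changes": [
            {
                # UPSERT creates the record if missing, otherwise replaces it.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    # A short TTL limits how long resolvers cache the old target.
                    "TTL": ttl_seconds,
                    "ResourceRecords": [{"Value": dr_endpoint}],
                },
            }
        ],
    }


# Hypothetical names; replace with your own record and DR load balancer.
change_batch = build_dns_failover_change("app.example.com.", "dr-lb.example.net.")
print(json.dumps(change_batch, indent=2))
```

With boto3, a payload like this would be passed as the `ChangeBatch` argument of the Route 53 client's `change_resource_record_sets` call, together with the `HostedZoneId` of the zone that owns the record.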
3. Warm Standby
The Warm Standby strategy involves maintaining a scaled-down but fully functional copy of your production environment in another region. This environment runs 24/7 but operates at a reduced capacity.
- How it works:
- A complete, but scaled-down, version of your infrastructure runs in the DR region. For example, you might run fewer application servers or smaller instance types.
- Data is actively replicated to the DR region.
- During a DR event, you "scale up" the DR environment to handle production load and update DNS to route all traffic there. The failover process can be automated.
- RTO: Low (minutes).
- RPO: Very low (typically seconds). Cross-region replication is usually asynchronous, so a small amount of in-flight data can still be lost.
- Best for: Business-critical systems that require high availability and minimal downtime.
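The "scale up" step amounts to a capacity calculation: how many instances must be added to the standby fleet before it can absorb production load. A minimal sketch, assuming a hypothetical per-instance throughput and a fixed headroom factor:

```python
import math


def instances_to_add(peak_requests_per_sec, requests_per_instance,
                     running_instances, headroom=0.2):
    """Number of instances to add so the standby fleet can serve peak load.

    headroom is extra capacity kept above the expected peak (0.2 = 20%).
    """
    if requests_per_instance <= 0:
        raise ValueError("requests_per_instance must be positive")
    required = math.ceil(peak_requests_per_sec * (1 + headroom) / requests_per_instance)
    # Never return a negative number if the standby is already large enough.
    return max(0, required - running_instances)


# Hypothetical figures: 3000 req/s at peak, 250 req/s per instance,
# and a warm standby currently running 3 instances.
print(instances_to_add(3000, 250, running_instances=3))
```

In practice the result would feed the desired capacity of an Auto Scaling group or a similar mechanism; the time that scaling takes is a large part of this strategy's RTO.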
4. Multi-Site Active-Active
This is the most advanced and expensive DR strategy, providing near-zero downtime and data loss. In this model, you run your workload simultaneously in multiple active regions. Traffic is distributed among all regions.
- How it works:
- Your workload is fully deployed and active in two or more AWS Regions.
- A routing service (like Amazon Route 53) distributes traffic between the regions. Various routing policies can be used (e.g., latency-based, weighted).
- If one region becomes unavailable, traffic is automatically directed to the other healthy region(s) with no manual intervention.
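- RTO: Near-zero. Failover is close to seamless for users, although the actual delay depends on how quickly health checks detect the failure and on DNS TTLs.
- RPO: Near-zero. Data is replicated continuously between regions, typically asynchronously with low lag (often around a second or less); synchronous cross-region replication is less common because of the write latency it adds.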
- Best for: Mission-critical applications with global user bases that can tolerate almost no downtime or data loss.
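The effect of weighted routing combined with health checks can be modeled in a few lines: each healthy region receives traffic in proportion to its weight, and an unhealthy region's share is redistributed to the rest. This is a simplified model of the behavior described above, not the routing service's actual implementation; the region weights and function name are assumptions.

```python
def traffic_shares(weights, healthy):
    """Return the fraction of traffic each healthy region receives.

    weights: mapping of region name -> non-negative routing weight
    healthy: mapping of region name -> bool health-check result
    """
    # Only healthy regions with a positive weight are eligible for traffic.
    active = {r: w for r, w in weights.items() if healthy.get(r, False) and w > 0}
    total = sum(active.values())
    if total == 0:
        # Simplification: with no healthy region this sketch routes nowhere.
        # A real routing service may behave differently in this case.
        return {}
    return {region: weight / total for region, weight in active.items()}


weights = {"us-east-1": 50, "eu-west-1": 50}

# Normal operation: both regions share the load.
print(traffic_shares(weights, {"us-east-1": True, "eu-west-1": True}))

# us-east-1 fails its health check: all traffic shifts to eu-west-1.
print(traffic_shares(weights, {"us-east-1": False, "eu-west-1": True}))
```

The second call shows why each region in an active-active design needs enough capacity, or fast enough scaling, to absorb the full load when another region drops out.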