Disaster Recovery

https://aws.amazon.com/disaster-recovery

https://aws.amazon.com/solutions/databases/disaster-recovery

Disaster Recovery of Workloads on AWS: Recovery in the Cloud - https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-workloads-on-aws.html

Plan for Disaster Recovery (DR) - https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/plan-for-disaster-recovery-dr.html

RPO and RTO

Disaster Recovery (DR) objectives - https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/disaster-recovery-dr-objectives.html

REL13-BP01 Define recovery objectives for downtime and data loss - https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_planning_for_recovery_objective_defined_recovery.html

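RTO: recovery time objective. The maximum acceptable downtime, that is, how long it may take to restore the system after a disruption.

RPO: recovery point objective. The maximum acceptable data loss, measured as the time between the last recovery point (for example, the latest backup) and the disruption.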
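
To make the two definitions concrete, here is a minimal sketch that checks one incident against its objectives. The class name, function name, timestamps, and the one-hour and 15-minute targets are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable window of lost data


def evaluate_incident(objectives, last_recovery_point, failure_time, restored_time):
    """Return (downtime, data_loss, rto_met, rpo_met) for one incident."""
    downtime = restored_time - failure_time          # compared against the RTO
    data_loss = failure_time - last_recovery_point   # compared against the RPO
    return downtime, data_loss, downtime <= objectives.rto, data_loss <= objectives.rpo


# Hypothetical targets: 1 hour of downtime, 15 minutes of data loss.
objectives = RecoveryObjectives(rto=timedelta(hours=1), rpo=timedelta(minutes=15))
downtime, data_loss, rto_met, rpo_met = evaluate_incident(
    objectives,
    last_recovery_point=datetime(2024, 1, 1, 11, 50),
    failure_time=datetime(2024, 1, 1, 12, 0),
    restored_time=datetime(2024, 1, 1, 13, 30),
)
print(downtime, rto_met)   # 1:30:00 False
print(data_loss, rpo_met)  # 0:10:00 True
```

In this made-up incident the backup was recent enough to meet the RPO, but the restore took longer than the RTO allows.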

Strategies

Disaster recovery options in the cloud - https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-options-in-the-cloud.html

Live data means the data stores and databases are up-to-date (or nearly up-to-date) with the active Region and ready to service read operations.

  • Active/passive: only one site gets traffic. When a disaster occurs, the requests are routed to the passive site.
  • Active/active: both sites get traffic.

| Backup and restore | Pilot light | Warm standby | Multi site active/active |
| --- | --- | --- | --- |
| Data backed up | Data live | Data live | Data live |
| Active/passive | Active/passive | Active/passive | Active/active |
| $ (lowest cost) | $$ | $$$ | $$$$ (highest cost) |
| RTO of hours (highest) | RTO of 10s of minutes (e.g. 30 min) | RTO of minutes | Lowest RTO |
| No infrastructure deployed | Services are idle or shut off | Minimum services running | All infrastructure running at full capacity, duplicated |
| Can't serve traffic immediately | Can't serve traffic immediately | Can serve minimal traffic immediately | Can serve production traffic |
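
The comparison above can be read as a selection rule: pick the cheapest strategy that still meets the requirements. The sketch below encodes only what the comparison states (relative cost, backed-up versus live data, and what traffic the recovery site can serve immediately). The names `Strategy` and `cheapest_strategy` and the 0/1/2 capacity scale are illustrative assumptions, not AWS terminology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    cost_rank: int           # 1 = "$" (lowest cost) ... 4 = "$$$$" (highest cost)
    live_data: bool          # False means data is only backed up
    immediate_capacity: int  # 0 = none, 1 = minimal traffic, 2 = production traffic


STRATEGIES = [
    Strategy("Backup and restore", 1, False, 0),
    Strategy("Pilot light", 2, True, 0),
    Strategy("Warm standby", 3, True, 1),
    Strategy("Multi site active/active", 4, True, 2),
]


def cheapest_strategy(need_live_data, min_immediate_capacity):
    """Return the lowest-cost strategy that satisfies both requirements."""
    candidates = [
        s for s in STRATEGIES
        if (s.live_data or not need_live_data)
        and s.immediate_capacity >= min_immediate_capacity
    ]
    # Raises ValueError if min_immediate_capacity is above 2 (no strategy qualifies).
    return min(candidates, key=lambda s: s.cost_rank)


print(cheapest_strategy(need_live_data=False, min_immediate_capacity=0).name)  # Backup and restore
print(cheapest_strategy(need_live_data=True, min_immediate_capacity=0).name)   # Pilot light
print(cheapest_strategy(need_live_data=True, min_immediate_capacity=1).name)   # Warm standby
```

A real decision would start from the RTO and RPO targets. This sketch leaves them out because the comparison gives only rough orders of magnitude for them.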

Backup and restore

Backups are copied to another Region so that they survive the loss of the primary Region.

After a disaster, the infrastructure is rebuilt manually in the recovery Region, which takes hours.

Pilot light

Basic infrastructure elements, such as Elastic Load Balancing and EC2 Auto Scaling, are in place, but services are "shut off" (i.e. no EC2 instances are running).

The site cannot serve requests until failover is triggered and the instances are started.
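
One way to "trigger" a pilot light site is to raise the capacity of an Auto Scaling group that has been kept at zero instances. The sketch below assumes that setup. The function name, the group name `dr-web-asg`, and the sizes are hypothetical, and `RecordingClient` is a stand-in so the example runs without AWS access. The call it makes, `update_auto_scaling_group`, is a boto3 Auto Scaling client method.

```python
def activate_pilot_light(autoscaling_client, group_name, desired, max_size):
    """Scale an idle Auto Scaling group up so the recovery site can serve requests."""
    autoscaling_client.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        MinSize=desired,
        MaxSize=max_size,
        DesiredCapacity=desired,
    )


class RecordingClient:
    """Stand-in for boto3.client("autoscaling"); records calls instead of sending them."""

    def __init__(self):
        self.calls = []

    def update_auto_scaling_group(self, **kwargs):
        self.calls.append(kwargs)


client = RecordingClient()
activate_pilot_light(client, "dr-web-asg", desired=4, max_size=8)
print(client.calls[0]["DesiredCapacity"])  # 4
```

Against a real account, the same function would take `boto3.client("autoscaling", region_name=...)` for the recovery Region in place of the stand-in.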

Warm standby

Maintains a minimum deployment that can handle requests, but only at reduced capacity: it cannot handle production-level traffic. Only a few EC2 instances are running.

On failover, the infrastructure must scale up before it can take the full production load.

Multi site active/active

Both sites serve traffic.
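
As a rough illustration of how both sites can serve traffic, the sketch below picks a site for each request by weight and skips any site marked unhealthy. The function name, Region names, and 50/50 weights are assumptions. In practice a DNS or load-balancing service makes this decision, not application code.

```python
import random


def pick_site(weights, healthy, rng=random):
    """Choose a site for one request, weighted among healthy sites only."""
    candidates = {site: w for site, w in weights.items() if site in healthy and w > 0}
    if not candidates:
        raise RuntimeError("no healthy site available")
    sites = list(candidates)
    return rng.choices(sites, weights=[candidates[s] for s in sites], k=1)[0]


weights = {"us-east-1": 50, "us-west-2": 50}  # hypothetical even split
rng = random.Random(0)  # seeded so the run is repeatable

# Normal operation: requests land on both sites.
normal = {pick_site(weights, {"us-east-1", "us-west-2"}, rng) for _ in range(100)}

# One site is down: every request goes to the surviving site.
degraded = {pick_site(weights, {"us-west-2"}, rng) for _ in range(100)}
print(degraded)  # {'us-west-2'}
```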

Application Recovery Controller

Move traffic across Regions or away from Availability Zones.
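
For Region failover, Application Recovery Controller exposes routing controls, which are on/off switches that determine where traffic goes. The sketch below flips two of them in a single batch call. The function name and ARNs are placeholders, and `RecordingClient` is a stand-in so the example runs without AWS access. The call itself, `update_routing_control_states`, is a method of the boto3 `route53-recovery-cluster` client.

```python
def fail_over(cluster_client, primary_control_arn, standby_control_arn):
    """Turn the standby routing control On and the primary Off in one batch call."""
    cluster_client.update_routing_control_states(
        UpdateRoutingControlStateEntries=[
            {"RoutingControlArn": standby_control_arn, "RoutingControlState": "On"},
            {"RoutingControlArn": primary_control_arn, "RoutingControlState": "Off"},
        ]
    )


class RecordingClient:
    """Stand-in for boto3.client("route53-recovery-cluster"); records calls only."""

    def __init__(self):
        self.calls = []

    def update_routing_control_states(self, **kwargs):
        self.calls.append(kwargs)


client = RecordingClient()
fail_over(
    client,
    primary_control_arn="arn:placeholder:primary",  # placeholder, not a real ARN
    standby_control_arn="arn:placeholder:standby",  # placeholder, not a real ARN
)
entries = client.calls[0]["UpdateRoutingControlStateEntries"]
print([e["RoutingControlState"] for e in entries])  # ['On', 'Off']
```

A real client must be created against one of the cluster's own endpoints, and safety rules configured on the control panel can reject the update.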

https://aws.amazon.com/application-recovery-controller

Announcement - https://aws.amazon.com/blogs/aws/amazon-route-53-application-recovery-controller

Part 2 (multi-Region stack) - https://aws.amazon.com/blogs/networking-and-content-delivery/building-highly-resilient-applications-using-amazon-route-53-application-recovery-controller-part-2-multi-region-stack/

Guidance for Cross Region Failover & Graceful Failback on AWS - https://aws.amazon.com/solutions/guidance/cross-region-failover-and-graceful-failback-on-aws