Disaster Recovery: The Art of Sleeping at Night

Rantideb Howlader · 10 min read

Introduction: The 3 AM Phone Call

Every DevOps engineer knows the sound. The PagerDuty alert. It breaks through your sleep like a siren. You look at your phone. "CRITICAL: Database Unresponsive." You open your laptop. The AWS Console won't load. The entire region is down.

Your CTO calls you. "How long until we are back online?" "How much data did we lose?"

If you don't know the answer to those two questions, you are in trouble. If the answer is "I don't know," you might be looking for a new job.

Disaster Recovery (DR) is not about technology. It's about Paranoia. It is about assuming that everything will break. The disk will fail. The datacenter will flood. The intern will run rm -rf / on the production database.

In this guide, we aren't going to talk about "Cloud Resilience" in abstract terms. We are going to talk about Survival. We will define RTO and RPO (the only two numbers that matter). We will build a bulletproof backup strategy. And we will learn how to replicate your entire company to a different continent in 15 minutes.


Part 1: The Two Numbers (RTO & RPO)

Forget the jargon. Here is what they mean.

1. RPO (Recovery Point Objective) -> "How much data can we lose?"

  • Scenario: The database crashed at 12:00 PM.
  • You restore from the backup taken at 11:00 AM.
  • You lost 1 hour of data.
  • Your RPO was 1 hour.

If your boss says "We need Zero Data Loss," tell them: "That costs 10x more money." Be realistic. Is 5 minutes acceptable? Is 24 hours acceptable?

2. RTO (Recovery Time Objective) -> "How long are we down?"

  • Scenario: The site crashed at 12:00 PM.
  • It took you 30 minutes to figure it out, and 30 minutes to restore the server.
  • Site came back at 1:00 PM.
  • Your RTO was 1 hour.

The Trade-off:

  • Fast RTO + Zero RPO = Very Expensive (Active/Active Multi-Region).
  • Slow RTO + High RPO = Very Cheap (Backup & Restore).

Part 2: The Four Strategies

AWS defines 4 levels of DR. Think of them like Insurance Policies.

Level 1: Backup & Restore (The Cheapest)

  • What: You take snapshots of your data. You put them in S3.
  • Disaster: You manually launch new servers, install software, and download the backup.
  • RTO: Hours/Days.
  • Cost: $

Level 2: Pilot Light (The Smart Choice)

  • What: You have a copy of your environment in another region, but it is turned off.
  • The Database is replicating (Active/Passive), but the Web Servers are Stopped.
  • Disaster: You turn on the web servers (see the sketch after this list). They connect to the database.
  • RTO: Minutes (10-30).
  • Cost: $$ (paying for storage + a minimal DB).
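
The "turn it on" step can be a few lines of boto3. A minimal sketch, assuming the standby web servers already exist in us-west-2, sit in the stopped state, and carry a hypothetical DR=PilotLight tag:

```python
import boto3

# Assumption: pilot-light web servers already exist in us-west-2,
# are stopped, and carry a (hypothetical) tag DR=PilotLight.
ec2 = boto3.client("ec2", region_name="us-west-2")

# Find the stopped standby instances by tag.
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:DR", "Values": ["PilotLight"]},
        {"Name": "instance-state-name", "Values": ["stopped"]},
    ]
)["Reservations"]

instance_ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]

# "Turn on the web servers": start them and wait until they are running.
if instance_ids:
    ec2.start_instances(InstanceIds=instance_ids)
    ec2.get_waiter("instance_running").wait(InstanceIds=instance_ids)
```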

Level 3: Warm Standby (The "Ready to Go")

  • What: You have a scaled-down version of Prod running in another region.
  • Prod has 10 servers. DR has 2 servers (always running).
  • Disaster: You scale the DR group from 2 to 10 (see the sketch after this list).
  • RTO: Minutes (Scaling time).
  • Cost: $$$
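
Here the failover step is essentially one API call. A sketch, assuming the DR region runs a hypothetical Auto Scaling group named web-dr idling at 2 instances:

```python
import boto3

# Assumption: the DR region has an Auto Scaling group (hypothetical name
# "web-dr") idling at 2 instances, mirroring the 10-instance primary.
asg = boto3.client("autoscaling", region_name="us-west-2")

# Disaster declared: scale the standby fleet up to production size.
asg.update_auto_scaling_group(
    AutoScalingGroupName="web-dr",
    MinSize=10,
    MaxSize=10,
    DesiredCapacity=10,
)
```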

Level 4: Multi-Site Active/Active (The Unicorn)

  • What: Both regions take traffic, split 50/50.
  • Disaster: Route53 sends 100% to the healthy region.
  • RTO: Near Zero.
  • Cost: $$$$ (Complex synchronization).
  • Warning: Most people who try this fail because of database conflicts.

Part 3: The "S3 Lock" (Ransomware Protection)

Hackers don't just encrypt your database. They delete your backups first. Then they demand Bitcoin.

If your backups are in a writable S3 bucket, they are gone.

The Fix: Object Lock (WORM - Write Once Read Many). Enable S3 Object Lock in Compliance Mode. "This backup cannot be deleted by anyone, not even the Root User, for 365 days." If a hacker gets your Root password, they still cannot delete the file. AWS literally blocks the API call.

This is your "Get Out of Jail Free" card.
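
Here is roughly what that looks like with boto3. The bucket name, key, and local file path are placeholders, and note that Object Lock can only be enabled when the bucket is created:

```python
import boto3
from datetime import datetime, timedelta, timezone

s3 = boto3.client("s3", region_name="us-east-1")

# Object Lock can only be enabled at bucket creation time.
# (Bucket name is a placeholder; S3 bucket names are globally unique.)
s3.create_bucket(Bucket="example-backup-vault", ObjectLockEnabledForBucket=True)

# Default rule: every new object is locked in COMPLIANCE mode for 365 days.
# In COMPLIANCE mode, nobody (not even the root user) can shorten or remove the lock.
s3.put_object_lock_configuration(
    Bucket="example-backup-vault",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 365}},
    },
)

# Retention can also be set explicitly per object at upload time.
s3.put_object(
    Bucket="example-backup-vault",
    Key="db-backups/2025-01-01.dump",
    Body=open("backup.dump", "rb"),  # placeholder local backup file
    ObjectLockMode="COMPLIANCE",
    ObjectLockRetainUntilDate=datetime.now(timezone.utc) + timedelta(days=365),
)
```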


Part 4: Database Survival (RDS)

Your database is your most important asset. The web servers are disposable. The data is not.

Automated Backups

Enable RDS Automated Backups. AWS takes a snapshot every day and saves transaction logs every 5 minutes. This gives you Point-in-Time Recovery. "Restore the database to the state it was in at 4:13 PM yesterday."
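
A sketch of that restore with boto3, assuming a hypothetical instance named prod-db with automated backups enabled. The restore creates a new instance; it does not rewind the existing one:

```python
import boto3
from datetime import datetime, timezone

rds = boto3.client("rds", region_name="us-east-1")

# Assumption: "prod-db" has automated backups enabled, so RDS has the daily
# snapshot plus the 5-minute transaction logs and can rebuild the instance
# at an arbitrary timestamp inside the retention window.
rds.restore_db_instance_to_point_in_time(
    SourceDBInstanceIdentifier="prod-db",
    TargetDBInstanceIdentifier="prod-db-restored",
    RestoreTime=datetime(2025, 1, 14, 16, 13, tzinfo=timezone.utc),  # "4:13 PM yesterday"
    DBInstanceClass="db.r5.2xlarge",
)
# This creates a NEW instance; you then repoint the app at its endpoint.
```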

Cross-Region Read Replicas

  1. Create a "Read Replica" in us-west-2 (if Prod is us-east-1).
  2. Data replicates asynchronously (lag is typically a few seconds).
  3. Disaster: Click "Promote Read Replica".
  4. The Replica becomes a Standalone Master Database.
  5. Point your apps to the new endpoint.
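
Scripted, the same flow might look like this (the identifiers and account ID are placeholders):

```python
import boto3

# Steps 1-2: create the replica in the DR region (run against us-west-2).
west = boto3.client("rds", region_name="us-west-2")
west.create_db_instance_read_replica(
    DBInstanceIdentifier="prod-db-replica",
    # ARN of the primary in us-east-1 (placeholder account ID).
    SourceDBInstanceIdentifier="arn:aws:rds:us-east-1:123456789012:db:prod-db",
    SourceRegion="us-east-1",
)

# Steps 3-4: when disaster strikes, promote the replica into a standalone master.
west.promote_read_replica(DBInstanceIdentifier="prod-db-replica")

# Step 5: grab the new endpoint and point your apps at it.
endpoint = west.describe_db_instances(DBInstanceIdentifier="prod-db-replica")[
    "DBInstances"
][0]["Endpoint"]["Address"]
print(endpoint)
```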

Part 5: Infrastructure as Code (The Blueprint)

If us-east-1 is gone, can you recreate your VPC? Your Security Groups? Your IAM Roles? If you are doing "ClickOps" in the console, the answer is No.

You need Terraform. Your entire infrastructure must be defined in code. To recover to a new region:

  1. cd infrastructure/
  2. terraform apply -var="region=us-west-2"
  3. Wait 10 minutes.
  4. Done.

If you don't have IaC, your RTO is "Weeks".


Part 6: Route53 (The Traffic Cop)

How do you tell users to go to the new region? You don't send an email saying "Please use new-site.com".

You use Route53 DNS Failover.

  1. Create a "Health Check" that pings your endpoint (api.example.com/health).
  2. Create a Primary Record pointing to East. Associate it with the Health Check.
  3. Create a Secondary Record pointing to West.

If the health check fails (e.g., 3 consecutive failures on a 10-second interval): Route53 stops answering with the Primary record and returns the Secondary instead. Users are routed to West. Note: DNS caching (TTL) means it can still take a few minutes for every client to move over.
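
A rough boto3 sketch of that setup; the hosted zone ID and the two load balancer DNS names are placeholders:

```python
import boto3

r53 = boto3.client("route53")

# 1. Health check that probes the primary region's endpoint.
health_check_id = r53.create_health_check(
    CallerReference="dr-primary-check-1",
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "api.example.com",
        "ResourcePath": "/health",
        "RequestInterval": 10,   # fast checks: every 10 seconds
        "FailureThreshold": 3,   # 3 consecutive failures = unhealthy
    },
)["HealthCheck"]["Id"]

# 2 & 3. Primary record (East, tied to the health check) and Secondary (West).
r53.change_resource_record_sets(
    HostedZoneId="Z0000000000000",  # placeholder hosted zone
    ChangeBatch={
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.example.com",
                    "Type": "CNAME",
                    "SetIdentifier": "primary-east",
                    "Failover": "PRIMARY",
                    "TTL": 60,
                    "HealthCheckId": health_check_id,
                    "ResourceRecords": [{"Value": "east-alb.example.com"}],
                },
            },
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.example.com",
                    "Type": "CNAME",
                    "SetIdentifier": "secondary-west",
                    "Failover": "SECONDARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "west-alb.example.com"}],
                },
            },
        ]
    },
)
```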


Part 7: Chaos Engineering (Testing Fate)

You have a plan. Great. Have you tested it?

Start small:

  1. Go to the Development environment.
  2. Reboot the database while the app is running.
  3. Does the app reconnect automatically? Or does it crash?
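
Step 2 can be a single API call. A sketch against a hypothetical dev instance; ForceFailover only applies if the instance is Multi-AZ:

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Kick the development database over while the app is running.
# ForceFailover=True simulates a real failover instead of a polite restart
# (it requires a Multi-AZ instance).
rds.reboot_db_instance(
    DBInstanceIdentifier="dev-db",  # placeholder identifier
    ForceFailover=True,
)

# Now watch the app: does it reconnect on its own, or does it need a human?
```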

Game Day: Once a quarter, schedule a Game Day. Simulate a failure. "The East Region is down. Activate the plan." Time it. If it takes 4 hours, and your boss expects 1 hour... you have work to do.


Part 8: Aurora Global Database (The Speed of Light)

Standard "Cross-Region Read Replicas" (MySQL/Postgres) use logical replication. The database engine has to process the binary log. Lag can be seconds or minutes.

Aurora Global Database is different. It replicates at the Storage Layer. The compute nodes don't do the work. The disk (Aurora Storage) replicates blocks directly to the remote region's disk.

  • Replication Lag: < 1 second (typically 150ms).
  • Performance Impact: Near zero.

The Write Forwarding Trick: In a typical Master/Slave setup, you can only write to the Master (East). If a user is in West, they have to send the request to East (high latency). With Aurora Global, the Read Replica in West can accept the Write, forward it to East internally, and receive the ack. The app treats the Replica as a Master.
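
A sketch of wiring that up with boto3, assuming an existing Aurora MySQL cluster called prod-aurora in us-east-1 (ARNs and the account ID are placeholders). You would still add reader instances to the secondary cluster with create_db_instance:

```python
import boto3

# Promote the existing us-east-1 cluster to be the primary of a global database.
east = boto3.client("rds", region_name="us-east-1")
east.create_global_cluster(
    GlobalClusterIdentifier="prod-global",
    SourceDBClusterIdentifier="arn:aws:rds:us-east-1:123456789012:cluster:prod-aurora",
)

# Add a secondary cluster in us-west-2. Storage-level replication starts
# automatically; EnableGlobalWriteForwarding lets this cluster accept writes
# and forward them to the primary.
west = boto3.client("rds", region_name="us-west-2")
west.create_db_cluster(
    DBClusterIdentifier="prod-aurora-west",
    Engine="aurora-mysql",
    GlobalClusterIdentifier="prod-global",
    EnableGlobalWriteForwarding=True,
)
```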


Part 9: Route53 ARC (Application Recovery Controller)

Standard Route53 Checks (Pings) have a flaw. They are "Outside-In". "Can I reach the server?" But maybe the server is reachable, but the database is dead. Or maybe the cache is dead.

ARC is the nuclear option. It introduces "Routing Controls" (On/Off switches). You don't let DNS health checks decide automatically. You decide. You build "Cells" (Silos).

  • Cell 1: East
  • Cell 2: West

You define "Readiness Checks" (Deep audit of capacity). "Does West have enough EC2 capacity to handle East's load right now?" If No, ARC prevents the failover. This stops you from failing over to a region that will immediately crash under the load (The "Death Spiral").


Part 10: AWS Backup (The Central Brain)

We talked about RDS snapshots. But what about EBS? EFS? DynamoDB? S3? Managing backups for 5 services in 5 consoles is a nightmare.

AWS Backup centralizes it. One policy: "Daily at 5 AM. Retain 30 days. Copy to West." Apply this tag: BackupStrategy: Gold. Any resource you tag with Gold automatically gets backed up.
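
A sketch of that one policy with boto3; the vault names, account ID, and IAM role are placeholders:

```python
import boto3

backup = boto3.client("backup", region_name="us-east-1")

# One policy: daily at 5 AM UTC, keep 30 days, copy to the DR region.
plan = backup.create_backup_plan(
    BackupPlan={
        "BackupPlanName": "gold",
        "Rules": [
            {
                "RuleName": "daily-5am",
                "TargetBackupVaultName": "Default",
                "ScheduleExpression": "cron(0 5 * * ? *)",
                "Lifecycle": {"DeleteAfterDays": 30},
                "CopyActions": [
                    {
                        "DestinationBackupVaultArn": "arn:aws:backup:us-west-2:123456789012:backup-vault:Default",
                        "Lifecycle": {"DeleteAfterDays": 30},
                    }
                ],
            }
        ],
    }
)

# Anything tagged BackupStrategy=Gold is swept into the plan automatically.
backup.create_backup_selection(
    BackupPlanId=plan["BackupPlanId"],
    BackupSelection={
        "SelectionName": "gold-tagged-resources",
        "IamRoleArn": "arn:aws:iam::123456789012:role/AWSBackupDefaultServiceRole",
        "ListOfTags": [
            {
                "ConditionType": "STRINGEQUALS",
                "ConditionKey": "BackupStrategy",
                "ConditionValue": "Gold",
            }
        ],
    },
)
```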

Backup Audit Manager: For compliance (SOC2, HIPAA). "Prove to me that every database has a backup." Audit Manager scans your account. If it finds an RDS instance without a backup plan, it flags it as Non-Compliant. This keeps your auditors happy.


Part 11: The Math of "Pilot Light" vs "Warm Standby"

Let's talk money.

Scenario:

  • Primary: 10x m5.2xlarge ($0.38/hr * 10 = $3.80/hr).
  • Database: db.r5.2xlarge ($1.00/hr).
  • Total Primary Cost: ~$3,500/month.

Option A: Pilot Light (OFF)

  • Remote DB: db.t3.small (for replication only). $0.04/hr.
  • Remote App: 0 Servers.
  • Cost: $30/month.
  • RTO: 15 minutes (Boot time).

Option B: Warm Standby (Scaled Down)

  • Remote DB: db.r5.large (Ready to take load). $0.25/hr.
  • Remote App: 2x m5.large. ~$0.20/hr combined.
  • Cost: ~$350/month.
  • RTO: 2 minutes (Scaling time).

Decision: Is saving 13 minutes worth $320/month? For a blog? No. For a Bank? Yes (13 minutes of downtime can cost $1M).
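
Back-of-the-envelope, at roughly 730 hours per month (treating the $0.20/hr for the two m5.large as a combined figure):

```python
HOURS_PER_MONTH = 730

primary = (10 * 0.38 + 1.00) * HOURS_PER_MONTH   # ~ $3,504/month
pilot   = 0.04               * HOURS_PER_MONTH   # ~ $29/month
warm    = (0.25 + 0.20)      * HOURS_PER_MONTH   # ~ $328/month

print(f"Pilot Light: ${pilot:,.0f}/mo, Warm Standby: ${warm:,.0f}/mo, "
      f"delta: ${warm - pilot:,.0f}/mo for ~13 minutes of RTO")
```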


Part 12: FSx for NetApp ONTAP (Enterprise DR)

If you are migrating from On-Prem (VMware/NetApp), you probably use SnapMirror. You can use FSx for NetApp ONTAP in AWS. It supports SnapMirror natively. You can replicate your on-prem SAN directly to AWS FSx. This is one of the fastest ways to get petabytes of data into the cloud for DR.


Part 13: Glossary for the Panic Room

  • Failover: Moving traffic from Primary to Secondary.
  • Failback: Moving traffic back to Primary after it is fixed. (Harder than failover!).
  • Quorum: The minimum number of nodes needed to make a decision.
  • Split Brain: When two databases both think they are the Master. Data corruption ensues. Avoid at all costs.
  • Rehydration: The process of loading data from a backup into a new database.
  • RTO: How long you are down.
  • RPO: How much data you lost.
  • WORM: Write Once Read Many (Object Lock).

Conclusion: Hope is Not a Strategy

"It probably won't happen" is not a strategy. "I think backups are running" is not a strategy.

Disaster Recovery buys you comfort because it removes fear. When you know that you can survive a region failure, you sleep better. The 3 AM phone call is still annoying. But it is no longer terrifying.

Go check your backups. Right now. Are they encrypted? Are they locked? Have you tried to restore one recently?

If the answer is no... you know what to do.
