Breaking Production on Purpose: A Guide to Chaos Engineering
Introduction: The Firefighter
Imagine a firefighter who only reads books about fire. They know the chemistry of combustion. They know the melting point of steel. But they have never held a hose. They have never felt the heat.
On the day of a real fire, they will freeze.
Software Engineering is the same. We design "Highly Available" systems. We draw diagrams with Redundancy. But until you actually pull the plug on a server, you don't know if it works.
Chaos Engineering is the practice of breaking your own system comfortably, under control, to verify that it recovers. It is not about causing pain. It is about building confidence.
I remember my first "Game Day." We turned off the Primary Database. We expected a smooth failover. Instead, the application froze for 10 minutes, then crashed. Why? Because the Java JDBC driver had a default timeout of infinity. It was waiting forever for the dead database to reply.
We fixed one line of config (timeout=5s).
We broke it again.
It recovered in 2 seconds.
That simple test saved us from a catastrophic outage on Black Friday.
In this guide, we are going to learn how to break things safely. We will use AWS FIS (Fault Injection Service) to inject failure into EC2, RDS, and EKS. And we will learn how to run a "Game Day" without getting fired.
Part 1: The Principle (It's Science, Not Anarchy)
Chaos Engineering is not "Let's randomly delete stuff." It is the Scientific Method.
- Hypothesis: "If I stop one EC2 instance, the Auto Scaling Group (ASG) will launch a new one, and users will see no errors."
- Experiment: Stop the instance.
- Observation: Watch the graphs. Did 500 errors spike? Did latency go up?
- Conclusion: It worked. Or it didn't.
If you don't measure it, it's not Science. It's just vandalism.
Part 2: AWS FIS (The Controlled Explosion)
Netflix has "Chaos Monkey." AWS has Fault Injection Service (FIS).
It is a managed service that lets you run "Experiments."
Why use FIS instead of a script? Safety. FIS has a "Stop Condition." You hook it up to a CloudWatch Alarm (e.g., "500 Error Rate > 2%"). If your experiment causes too much damage and the alarm goes off, FIS stops immediately. It rolls back the chaos. This is the "Dead Man's Switch" that prevents you from destroying production.
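For instance, here is a sketch of such a guard-rail alarm in boto3. The load balancer dimension and the threshold are placeholders, and a true percentage-based error rate would need a CloudWatch metric-math expression; this simplified version just alarms on the raw 5xx count.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Guard-rail alarm to use as the FIS Stop Condition.
# Simplification: alarms on the raw ALB 5xx count instead of a 5xx percentage.
cloudwatch.put_metric_alarm(
    AlarmName="chaos-guardrail-5xx",
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/prod-alb/1234567890abcdef"}],  # placeholder
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=50,                      # placeholder: roughly 2% of normal traffic
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```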
Part 3: Experiment 1 - The EC2 Sniper
Hypothesis: "My Load Balancer will detect a dead server and stop sending traffic to it within 5 seconds."
**The Setup (FIS Template)**:
- Action: `aws:ec2:stop-instances`
- Target: Tags `Env=Prod` and `Role=WebServer`.
- Selection Mode: `Count=1` (Pick 1 random server).
- Duration: 5 minutes. (Keep it stopped).
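Here is a hedged sketch of that template using boto3's FIS client. The ARNs are placeholders and the parameter names are my reading of the FIS API, so verify them against the current docs before running it.

```python
import uuid
import boto3

fis = boto3.client("fis")

fis.create_experiment_template(
    clientToken=str(uuid.uuid4()),
    description="Stop one random Prod web server for 5 minutes",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",  # placeholder
    stopConditions=[{
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:chaos-guardrail-5xx",  # placeholder
    }],
    targets={
        "WebServers": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"Env": "Prod", "Role": "WebServer"},
            "selectionMode": "COUNT(1)",          # pick 1 random matching instance
        }
    },
    actions={
        "StopOneInstance": {
            "actionId": "aws:ec2:stop-instances",
            "parameters": {"startInstancesAfterDuration": "PT5M"},  # keep it stopped for 5 minutes
            "targets": {"Instances": "WebServers"},
        }
    },
)
```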
The Reality Check: Run it. Open your website. Mash F5. Do you see a 502 Bad Gateway? If yes, your ALB Health Check is too slow.
- Default: Interval 30s, Threshold 3 failures = 90 seconds of downtime.
- Fix: Interval 5s, Threshold 2 = 10 seconds.
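A quick sketch of that fix with boto3 (the target group ARN is a placeholder; 5 seconds is the minimum interval an ALB target group accepts):

```python
import boto3

elbv2 = boto3.client("elbv2")

# Tighten the ALB health check so a dead instance is ejected in ~10s instead of ~90s.
elbv2.modify_target_group(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/web/abc123",  # placeholder
    HealthCheckIntervalSeconds=5,
    HealthyThresholdCount=2,
    UnhealthyThresholdCount=2,
    HealthCheckTimeoutSeconds=3,
)
```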
Part 4: Experiment 2 - The Latency Injection
Latency is worse than downtime. If a service is DOWN, the load balancer removes it. Fast failure. If a service is SLOW, the load balancer keeps sending traffic. Threads pile up. The database connection pool fills up. The whole system enters a "Brownout."
The Setup:
- Action: `aws:ssm:send-command` -> Run `tc` (Traffic Control) on Linux.
- Command: `tc qdisc add dev eth0 root netem delay 200ms`.
- Target: The Microservice "Payment-API".
The Observation: Watch the "Checkout-API" (which calls Payment-API). Does it handle the slowdown gracefully? Or does it time out and crash?
The Fix: Implement Circuit Breakers (e.g., Resilience4j). If Payment-API is slow, fail fast and show a "Try again later" message instead of hanging the browser.
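Resilience4j is a Java library; as a language-neutral sketch of the same idea, here is a minimal hand-rolled circuit breaker in Python. The thresholds and the call_payment_api function are hypothetical stand-ins for your real dependency.

```python
import time

def call_payment_api(order):
    # Placeholder for the real (possibly slow) HTTP call to Payment-API.
    raise TimeoutError("Payment-API timed out")

class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures, retry after a cooldown."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        # If the breaker is open and the cooldown hasn't expired, fail fast.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None          # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                  # a success closes the breaker again
        return result

payment_breaker = CircuitBreaker()

def checkout(order):
    try:
        return payment_breaker.call(call_payment_api, order)
    except Exception:
        return {"status": "try_again_later"}  # fail fast instead of hanging the browser
```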
Part 5: Experiment 3 - The Availability Zone Failure
This is the big one. AWS claims "Multi-AZ" solves everything. Does it?
The Setup:
Block all network traffic to/from us-east-1a.
(You can simulate this with Network ACLs).
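A rough boto3 sketch of that NACL trick. The ACL ID is a placeholder for the NACL attached to your us-east-1a subnets; rule number 1 takes precedence over the existing allow rules, and you must delete the entries again to end the experiment.

```python
import boto3

ec2 = boto3.client("ec2")

NACL_ID = "acl-0123456789abcdef0"   # placeholder: the NACL on the us-east-1a subnets

# Deny everything in and out; rule number 1 wins over the higher-numbered allow rules.
for egress in (False, True):
    ec2.create_network_acl_entry(
        NetworkAclId=NACL_ID,
        RuleNumber=1,
        Protocol="-1",           # all protocols
        RuleAction="deny",
        Egress=egress,
        CidrBlock="0.0.0.0/0",
    )

# To end the experiment, remove both entries:
# ec2.delete_network_acl_entry(NetworkAclId=NACL_ID, RuleNumber=1, Egress=False)
# ec2.delete_network_acl_entry(NetworkAclId=NACL_ID, RuleNumber=1, Egress=True)
```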
The Failure Mode:
Your app servers in 1b try to talk to 1a. They hang.
Your database Master is in 1a. It stops responding.
RDS detects the failure and fails over to 1b.
The Question:
Does your application know that the DB IP address changed?
Does it respect the DNS TTL?
We found that the JVM caches DNS lookups forever by default.
Even after RDS failed over to a new IP, our app kept trying to talk to the old IP.
Fix: set `networkaddress.cache.ttl=60` in `java.security`.
Part 6: Experiment 4 - Database Chaos (RDS)
FIS can trigger an RDS Reboot or Failover.
The Action: aws:rds:reboot-db-instance with ForceFailover=true.
The Test:
Run a script that writes to the DB every second (a minimal sketch follows below).
Time 00:01: Write OK
Time 00:02: Write OK
Time 00:03: Write FAIL (Connection Refused)
Time 00:04: Write FAIL
Time 00:05: Write OK
Your downtime was 2 seconds. Is 2 seconds acceptable? For a Bank: No. Use RDS Proxy to queue connections during failover. For a Blog: Yes.
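Here is a minimal sketch of that write-loop probe, assuming a PostgreSQL RDS instance, psycopg2, and a hypothetical heartbeat table. The endpoint and credentials are placeholders.

```python
import time
import datetime
import psycopg2  # pip install psycopg2-binary

# Placeholder DSN: point it at your RDS endpoint; connect_timeout keeps each attempt short.
DSN = "host=mydb.abc123.us-east-1.rds.amazonaws.com dbname=app user=chaos password=secret connect_timeout=2"

while True:
    stamp = datetime.datetime.now(datetime.timezone.utc).isoformat(timespec="seconds")
    try:
        # Reconnect every second so we see exactly when the endpoint flips to the standby.
        conn = psycopg2.connect(DSN)
        try:
            with conn, conn.cursor() as cur:          # `with conn` commits the transaction
                cur.execute("INSERT INTO heartbeat (ts) VALUES (%s)", (stamp,))
        finally:
            conn.close()
        print(f"{stamp} Write OK")
    except Exception as exc:
        print(f"{stamp} Write FAIL ({exc.__class__.__name__})")
    time.sleep(1)
```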
Part 7: Game Days (The Human Element)
Chaos Engineering is a Team Sport. Once a month, schedule 2 hours. Invite the Juniors. Invite the CTO. Buy Pizza.
- Select a Master of Ceremonies: They run the FIS experiment. They know what will happen.
- The Team: They don't know what will happen. They go into "On Call" mode.
- Start: The MC breaks something.
- React: The team looks at dashboards. "Why is latency up?" "Why is the queue backing up?"
- Fix: They find the root cause and fix it.
This trains your team's "Muscle Memory." When a real outage happens at 3 AM, they won't panic. They will say: "Oh, this looks like the latency experiment we ran last Tuesday."
Part 8: Spot Instances as Chaos Monkeys
You don't need FIS to kill servers. Just run Spot Instances in production. Spot Instances are reclaimed by AWS with a two-minute warning whenever it needs the capacity back. This effectively runs a "Chaos Monkey" 24/7 for free.
If your system can survive on Spot, it is resilient. If you are afraid to run Spot, your system is fragile. Strategy: Run 50% On-Demand, 50% Spot. This forces you to solve the "Graceful Shutdown" problem.
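Graceful shutdown starts with noticing the two-minute interruption warning. Here is a sketch of a poller against the instance metadata service (IMDSv2); the drain() step is a placeholder for your own connection-draining logic.

```python
import time
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"

def imds_token() -> str:
    # IMDSv2: fetch a session token before reading metadata.
    req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
    )
    return urllib.request.urlopen(req, timeout=2).read().decode()

def interruption_pending(token: str) -> bool:
    # Returns 200 with an action once AWS schedules a stop/terminate; 404 otherwise.
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        urllib.request.urlopen(req, timeout=2)
        return True
    except urllib.error.HTTPError:
        return False

def drain() -> None:
    # Placeholder: deregister from the target group, finish in-flight work, flush buffers.
    print("Interruption notice received: draining...")

if __name__ == "__main__":
    token = imds_token()
    while not interruption_pending(token):
        time.sleep(5)
    drain()
```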
Part 9: Kubernetes Chaos (Chaos Mesh)
In Kubernetes, servers don't matter. Pods matter. AWS FIS is great for EC2. For EKS, use Chaos Mesh (CNCF Project).
Experiments:
- Pod Kill: Randomly kill 1 pod every minute.
- Test: Does the ReplicaSet re-launch it? Does the HPA scale up?
- Network Partition: Isolate "Frontend" namespace from "Backend" namespace.
- Test: Do the frontends show a nice "Maintenance Mode" page? Or do they stacktrace?
- Stress: Consume 100% CPU on a node.
- Test: Does the Scheduler move pods to a healthy node?
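Chaos Mesh drives these experiments declaratively through CRDs like PodChaos. As a rough DIY stand-in (not Chaos Mesh itself), here is a pod-kill sketch using the official Kubernetes Python client; the namespace and the app=web label selector are assumptions.

```python
import random
from kubernetes import client, config  # pip install kubernetes

config.load_kube_config()              # use config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

NAMESPACE = "default"                  # assumption
SELECTOR = "app=web"                   # assumption

pods = v1.list_namespaced_pod(NAMESPACE, label_selector=SELECTOR).items
if pods:
    victim = random.choice(pods)
    v1.delete_namespaced_pod(victim.metadata.name, NAMESPACE)
    print(f"Killed {victim.metadata.name}; now watch whether the ReplicaSet replaces it.")
```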
Part 10: The Game Day Template
Copy-paste this for your first Game Day.
Title: "Simulate Redis Failure" Date: Friday 2 PM. Attendees: Team Lead, 2 Seniors, 2 Juniors.
1. Baseline state:
- Latency: 50ms.
- Error Rate: 0%.
- Redis CPU: 20%.
2. The Injection:
- Action: Block Port 6379 (Security Group).
- Expected Result: Application falls back to Database (slower but works).
3. The Reality:
- Actual Result: Application crashed with `RedisConnectionException`.
- Detection Time: 3 minutes (Too slow).
- Recovery Time: 10 minutes (Manual restart).
4. Action Items:
- Fix the `catch` block in `RedisClient.js` to handle connection errors (see the sketch after this list).
- Add a CloudWatch Alarm for "Redis Connection Errors".
- Retest next week.
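The real fix lives in RedisClient.js; as a language-neutral illustration of the fallback pattern, here is a sketch with redis-py. The host, key names, and the stub DB loader are hypothetical.

```python
import json
import redis  # pip install redis

r = redis.Redis(
    host="my-redis.internal",          # placeholder
    port=6379,
    socket_connect_timeout=0.5,        # fail fast instead of hanging the request
    socket_timeout=0.5,
)

def load_product_from_database(product_id: str) -> dict:
    # Placeholder for the slower-but-reliable database path.
    return {"id": product_id, "source": "database"}

def get_product(product_id: str) -> dict:
    """Cache-aside read that degrades to the database when Redis is unreachable."""
    try:
        cached = r.get(f"product:{product_id}")
        if cached is not None:
            return json.loads(cached)
    except (redis.exceptions.ConnectionError, redis.exceptions.TimeoutError):
        pass  # Redis is down or blocked (port 6379): fall through to the database
    return load_product_from_database(product_id)
```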
Part 11: Anti-Patterns (How Not To Do It)
- Testing in Prod First: Don't. Start in Staging. Earn the right to break Prod.
- Running without Monitoring: If you break it but didn't see it on the graph, you learned nothing.
- Fixing Forward: If the experiment goes wrong, Rollback. Don't try to debug the chaos. Stop the chaos first.
- Blaming Humans: If the Junior didn't realize the DB was down, don't blame them. Blame the Dashboard. Build better alerts.
Part 12: Expert Glossary
- Blast Radius: The amount of damage an experiment can cause. Start with Blast Radius = 1 Server. Expand to 1 AZ.
- Steady State: "Normal" behavior. You need to know what "Normal" looks like before you break it.
- Chaos Monkey: The original tool from Netflix. Kills EC2 instances.
- FIS: Fault Injection Service. AWS native tool.
- Gremlin: Enterprise Chaos tool (GUI based).
- Circuit Breaker: Software pattern that stops trying to call a failing service.
Conclusion: Embrace the Chaos
Systems drift. New code introduces new bugs. New dependencies introduce new latency.
If you don't break your system, it will break itself. And it will choose the worst possible time.
Start with staging.
Start with `echo "chaos"`.
Then graduate to real injection.
Build a system that is Anti-Fragile. A system that gets stronger when you attack it.