The Art of Breaking Things on Purpose
Instead of hoping systems don't fail, what if you broke them on purpose to find the weak spots before the world does? Netflix, Uber, and Google turned Chaos Engineering into a discipline. Here's why it matters, how it works, and why it's one of the most effective ways to build resilient systems.

Back in 2010, Netflix engineers were moving their entire infrastructure to the cloud. They knew AWS machines could fail at any time. Disks would die, networks would choke, regions could vanish without warning. Instead of pretending these failures were rare, they built a tool called Chaos Monkey. Its job was brutally simple: randomly shut down servers in production.
At first glance, it looks reckless. Why would anyone sabotage their own system? But the logic is clear: if a service can't survive one machine disappearing, it has no business running in production. Chaos Monkey was survival training. It was like stress-testing the system in daylight, exposing weaknesses before real customers ever noticed. That simple idea grew into what we now call Chaos Engineering, the practice of deliberately breaking things to build confidence in how systems recover.
Outages Are Normal, Not Rare
We like to imagine outages as black swan events, but reality looks different. At scale, they're routine.
Take Uber, for example. With millions of rides a day, even small inefficiencies in trip-matching or dispatch can snowball into delays and unhappy riders. While the company has shared lessons from reliability incidents, the bigger point is clear: even world-class systems stumble in unpredictable ways, and chaos testing is how they get ahead of it.
Google learned this too. They run their own playbook called DiRT (Disaster Recovery Testing). They simulate earthquakes, fiber cuts, even total data center outages. The point isn't just to break servers, it's to break assumptions. In one drill, teams discovered their backup restore process failed because the login system was hosted in the very region they'd just taken down. No monitoring tool would have caught that.
Amazon had a similar wake-up call during Prime Day. A single internal service failed under peak load, and retries amplified the problem into a full-blown outage. Since then, Amazon has institutionalized Game Days, rehearsal sessions where engineers simulate failures ahead of big events. The idea is simple: if you're going to fail, fail in practice, not in production.
What Chaos Really Teaches Us
Chaos isn't really about servers dying. It's about illusions dying. You learn if your monitoring reflects real user pain or just paints a happy green dashboard. You see whether engineers trust the system enough to sleep at night or live in low-level panic. At Uber, chaos drills revealed that payment failures required too many manual steps for on-call engineers. The fix wasn't new code, it was automation.
Chaos Engineering is less about breaking code and more about breaking assumptions. It's a cultural mirror. It forces teams from SREs to PMs to face fragility head-on. For leads, it's also a team-building tool: who owns which response, where are the knowledge gaps, and how does the team communicate under stress? A good chaos drill reveals not just weak services but weak handoffs. Leaders who pay attention can turn those gaps into training, clearer runbooks, and more confidence on-call.
My Confession (I hope my team reads this one day)

One of our games, built from scratch over four months, was two days away from launch. Everything looked fine on paper: diagrams drawn, long whiteboard discussions, infra planning, and more. But I wasn't convinced. So I did something reckless: I attacked my own servers.
ChatGPT helped me spin up a quick script in a couple of minutes that simulated a WebSocket DDoS attack. The system started buckling almost immediately, and the team went into disarray. We had players dropping connections, lobbies crashing, and services tripping over each other.
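If you're wondering what a script like that looks like, it doesn't take much. Here's a rough sketch of the idea, not the exact script we used; the endpoint, connection count, and hold time are placeholders, and it should only ever point at infrastructure you own:

```python
# pip install websockets
# A minimal WebSocket connection flood: open many sockets and hold them to stress connection handling.
import asyncio
import websockets

TARGET = "wss://staging.example.com/game"  # placeholder endpoint; never aim this at systems you don't own
CONNECTIONS = 500                          # how many concurrent sockets to hold open
HOLD_SECONDS = 60                          # keep each socket alive to exhaust connection slots

async def hold_connection(i: int) -> None:
    try:
        async with websockets.connect(TARGET) as ws:
            await ws.send(f"hello-{i}")
            await asyncio.sleep(HOLD_SECONDS)
    except Exception as exc:
        print(f"connection {i} dropped: {exc}")

async def main() -> None:
    await asyncio.gather(*(hold_connection(i) for i in range(CONNECTIONS)))

asyncio.run(main())
```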
What happened next is what mattered. Instead of panicking, the team came together. They stayed late that evening, working extra hours to contain the problem and understand what had gone wrong. They noted down every point of failure:
- Games failing to reconnect gracefully.
- Cloudflare's rate limiting hadn't been turned on.
- Circuit breakers in the API layer were missing.
- Our monitoring looked green even while users were seeing blank screens.
But wait! The next day, I did it again. This time, I used multiple MacBooks with separate IPs to attack the servers. And again, loopholes surfaced. Queries weren't optimized. Failover logic wasn't clean. Each round of chaos gave the team something new to fix.
I stayed out of the picture and just watched the Slack threads unfold. It was beautiful: the team rallying, sharing quick fixes, arguing over what mattered most, and then patching things together.
And just to be clear, I don't recommend this kind of chaos drill for production apps already serving real users. But in our case, this wasn't a deadline crunch, it was a soft launch. We even delayed the release by a week to look deeper. That extra week of controlled chaos gave us far more confidence than any number of QA checklists ever could.
That story is exactly the spirit of building chaos into a system. We didn't just trust QA; we created our own chaos, found weaknesses, delayed the release, and came out stronger. Reliability isn't just backend hygiene. It's retention. It's trust. It's your brand.
Uber riders don't care how elegant your dispatch algorithm is. They care if the cab shows up. If payments fail, trust evaporates instantly. Same goes for any product. If a live-score app lags during India vs Pakistan, you've lost users forever. Reliability compounds as an advantage. Products that stay up win. The rest lose users one outage at a time.
That's why chaos isn't just an SRE trick. It's a founder's moat.
Going Deeper: Tools and Techniques
Here's what Chaos Engineering looks like in practice.
Fault Injection Methods
- Process killing: The classic Chaos Monkey approach - randomly terminate VM instances, pods, or containers. In Kubernetes, tools like kubectl delete pod or Chaos Mesh can automate this at scale (a minimal pod-killer sketch follows this list).
- Network faults: Introduce latency, packet loss, or partitions using tc with its netem qdisc on Linux. In cloud-native environments, LitmusChaos or Gremlin provide ready-made experiments.
- Resource stress: Spike CPU, memory, or disk I/O. The stress-ng tool is commonly used, while Kubernetes operators like Chaos Mesh can inject load directly on pods.
- Dependency failures: Kill or throttle databases, caches, or queues. For example, pause a Redis node, drop connections to Postgres, or add latency to Kafka.
- Regional failures: Simulate full zone or region outages. Netflix's Chaos Gorilla and Chaos Kong were built for exactly this.
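To make the process-killing idea concrete, here's a minimal sketch using the official Kubernetes Python client. The namespace is an assumption; point it at a staging cluster, never blindly at production:

```python
# pip install kubernetes
import random
from kubernetes import client, config

NAMESPACE = "staging"  # assumed namespace; aim at a test/staging cluster, not prod

config.load_kube_config()   # authenticate using your local kubeconfig
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(NAMESPACE).items
if pods:
    victim = random.choice(pods)
    print(f"Chaos: deleting pod {victim.metadata.name} in {NAMESPACE}")
    v1.delete_namespaced_pod(victim.metadata.name, NAMESPACE)
else:
    print(f"No pods found in {NAMESPACE}")
```

If the deployment behind that pod is healthy, its ReplicaSet should spin up a replacement within seconds. If users notice instead, you've found your first weakness.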
Designing Experiments
- Define the steady state metric. Example: p95 latency under 300 ms, or 99.9% successful payments.
- Formulate a hypothesis: "If one database replica goes down, the app should still serve traffic with <500 ms latency."
- Run the experiment with a limited blast radius: start with one pod, one queue, one AZ.
- Observe metrics, logs, and tracing. Did the hypothesis hold? Where did retries, fallbacks, or autoscaling fail?
- Document findings, patch weaknesses, and iterate.
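A hypothesis is only useful if you actually check it against the steady-state metric. Here's a rough sketch of that check, assuming a hypothetical health endpoint and a 500 ms p95 budget:

```python
# pip install requests
# Probe the service while the fault is injected and verify the hypothesis against the p95 budget.
import statistics
import time
import requests

ENDPOINT = "https://staging.example.com/health"  # hypothetical probe endpoint
SAMPLES = 100
P95_BUDGET_MS = 500  # hypothesis: p95 stays under 500 ms with one replica down

latencies, errors = [], 0
for _ in range(SAMPLES):
    start = time.perf_counter()
    try:
        requests.get(ENDPOINT, timeout=5)
        latencies.append((time.perf_counter() - start) * 1000)
    except requests.RequestException:
        errors += 1
    time.sleep(0.1)

# 95th percentile of the observed samples (infinite if nearly everything failed)
p95 = statistics.quantiles(latencies, n=100)[94] if len(latencies) >= 2 else float("inf")
verdict = "held" if p95 < P95_BUDGET_MS and errors == 0 else "failed"
print(f"p95 = {p95:.0f} ms, errors = {errors} -> hypothesis {verdict}")
```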
Observability Integration
Chaos experiments are only useful if you can measure the outcome. Pairing chaos with observability tools makes the practice actionable:
- Metrics: Prometheus, Datadog, CloudWatch.
- Logs: ELK stack, Loki.
- Tracing: Jaeger, OpenTelemetry.
Without strong observability, chaos drills are just noise.
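As a concrete example, pulling the steady-state metric straight from Prometheus during an experiment can be a single HTTP call. The Prometheus address and histogram metric name below are assumptions; adjust them to whatever your services actually expose:

```python
# pip install requests
import requests

PROM_URL = "http://localhost:9090"  # assumed Prometheus address
# Assumed histogram metric name; replace with your own request-duration histogram.
QUERY = "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]
p95_seconds = float(result[0]["value"][1]) if result else None
print(f"p95 latency during the experiment: {p95_seconds}s")
```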
Safety Mechanisms
- Always add a kill switch. Every chaos tool worth its salt provides an emergency stop (a homegrown version is sketched after this list).
- Schedule experiments during business hours, with on-calls present.
- Communicate to the entire org. Nothing destroys trust like an unannounced chaos test.
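What a kill switch looks like depends on the tool, but even a homegrown chaos script can honor one. A minimal sketch, assuming a hypothetical environment variable and flag file that anyone on-call can flip:

```python
import os
import sys

def chaos_allowed() -> bool:
    """Global kill switch: refuse to run if the env var or flag file says stop."""
    if os.environ.get("CHAOS_DISABLED") == "1":   # hypothetical env var
        return False
    if os.path.exists("/tmp/chaos-stop"):         # hypothetical flag file an on-call can touch
        return False
    return True

if not chaos_allowed():
    print("Kill switch engaged - aborting chaos run.")
    sys.exit(1)

# ...only past this check does the script start injecting faults...
```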
This deeper layer is what turns Chaos Engineering from a concept into an engineering discipline. It's not just about daring to break things - it's about breaking them scientifically, with tooling, safety, and measurement built in.
Final Thoughts
When I first read about Chaos Monkey, I thought it was madness. Now I see it as one of the most responsible moves an engineering team can make. Monitoring tells you what's already broken. Chaos Engineering tells you what will break. And the practice is simple:
- Form a hypothesis. What should happen under failure?
- Run experiments. Kill servers, throttle APIs, add latency.
- Observe reality. Did the system behave as expected?
- Fix the weak spots. Repeat.
The world will break your system anyway. Better it happens on your terms than theirs.