By Sahil Kapoor in Engineering - 24 Jun 2025

What is API Reliability?

Building APIs without a reliability layer is like running a city with no backup generators. The lights stay on until the first power cut.

Most APIs I see are shiny. Great docs, neat REST structure, GraphQL schemas that make you nod. But here is the dirty secret. Reliability is not in the spec. It is in how your system behaves when the spec cannot be followed because something else is on fire. I learned this the hard way.

Reliability Isn't Uptime

Most engineers confuse uptime with reliability. A green dashboard gives false comfort. But uptime just says your server is running. Reliability is a bigger, messier promise: will your system behave in line with user expectations even when it is under stress?

The IPL taught us this distinction brutally. A push notification landed, four hundred percent more users opened the app than usual, Redis slowed, locks jammed, Mongo queued. We were not down, we were late. And late feels worse to users because it is invisible in dashboards but painfully visible on screen.

What Reliability Really Means

Reliability is predictability. A reliable API does not surprise you. It either gives you the data, tells you honestly it cannot, or gives you the last known good state. Anything else is chaos.

A 200 OK is worthless if it takes five seconds. A stale score is more reliable than a blank screen. A clear error with retry instructions is more reliable than a spinner that lies. Reliability is a product choice as much as an engineering one.
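The three honest answers above (fresh data, last known good state, or a clear error with retry instructions) can be sketched in a few lines. This is a minimal illustration, not the code from our live score application: `LastKnownGood`, `Reply`, `max_stale_s` and the two-second retry hint are all hypothetical names and values, and the `fetch` function is assumed to enforce its own timeout and raise when it cannot answer in time.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional, Tuple


@dataclass
class Reply:
    status: int                          # 200 for data, 503 when we honestly cannot answer
    body: Any
    stale: bool = False                  # True when body is the last known good value
    retry_after_s: Optional[int] = None  # would map to an HTTP Retry-After header


class LastKnownGood:
    """Wraps a fetch function and remembers the last successful value per key."""

    def __init__(self, fetch: Callable[[str], Any], max_stale_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch              # assumed to raise on timeout or upstream failure
        self._max_stale_s = max_stale_s  # how old a fallback value may be
        self._clock = clock
        self._last: Dict[str, Tuple[Any, float]] = {}  # key -> (value, stored_at)

    def get(self, key: str) -> Reply:
        try:
            value = self._fetch(key)
        except Exception:
            cached = self._last.get(key)
            if cached is not None and self._clock() - cached[1] <= self._max_stale_s:
                # Upstream is struggling: serve the last good value, marked as stale.
                return Reply(200, cached[0], stale=True)
            # Nothing usable to show: say so clearly and tell the client when to retry.
            return Reply(503, {"error": "temporarily unavailable"}, retry_after_s=2)
        self._last[key] = (value, self._clock())
        return Reply(200, value)
```

The `stale` flag matters as much as the fallback itself: it lets the client show "updated a few seconds ago" instead of pretending the data is live.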

You can read the formal treatment in Reliability Engineering, but the working version most of us need is simple:

Do what the user expects, even when things are noisy.

The IPL Lesson

What made our live score application hard was not the steady state. It was the bursts. Sports runs on drama. A six, a wicket, a milestone - each moment brings a flood of users at once. The provider keeps pushing normally. The chaos is in the fanout.

But this is not unique to sports. HR software faces the same storm during login and logout hours. Between 9 and 10 a.m. every weekday, the gateway sees 1000% more hits on the login endpoint. Around 6 p.m., another wave logs out. The code might be flawless and the spec might be elegant, but if the system cannot handle the synchronised surge, reliability is broken.

Or take WhatsApp. Engineers at Meta have often cited how the sheer volume of “Good morning” images from India turned into a reliability challenge, even making headlines in the New York Post. Millions of people waking up and sending the same kind of media at the same hour was not a bug. It was a cultural pattern. And culture creates traffic spikes more brutal than any load test.

Reliability is about surviving the predictable but painful rhythms of human behavior. A live API has to answer three questions during bursts:

Do you drop requests or queue them?
Do you serve stale data or nothing?
Do you let retries pile up or do you fail fast?

We learned to choose deliberately. Serving the last ball is better than showing nothing. Failing fast is better than retry storms. Shedding a fraction of requests is better than letting everyone wait.
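Shedding and failing fast can be as simple as a cap on in-flight requests. The sketch below is one assumed way to do it, not a description of our production setup: `LoadShedder`, `Overloaded`, `handle` and the 503 mapping are hypothetical names chosen for illustration. Requests over the cap are rejected immediately instead of joining a queue.

```python
import threading
from contextlib import contextmanager
from typing import Any, Callable, Tuple


class Overloaded(Exception):
    """Raised immediately when the server is already at capacity."""


class LoadShedder:
    def __init__(self, max_in_flight: int):
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    @contextmanager
    def admit(self):
        # Admission check: reject right away rather than letting the request wait.
        with self._lock:
            if self._in_flight >= self._max:
                raise Overloaded("at capacity, try again shortly")
            self._in_flight += 1
        try:
            yield
        finally:
            # Always release the slot, even if the handler raised.
            with self._lock:
                self._in_flight -= 1


def handle(shedder: LoadShedder, work: Callable[[], Any]) -> Tuple[int, Any]:
    """Run work if there is capacity, otherwise answer 503 without doing it."""
    try:
        with shedder.admit():
            return 200, work()
    except Overloaded:
        return 503, None
```

The cap is a product decision as much as a technical one: it decides how many users get a fast answer and how many get an honest "not right now".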

Observability and Why Reliability Is Hard

You cannot deliver reliability if you cannot see. This is where observability comes in. Metrics, logs, traces - they are not decoration. They are the flashlight in the dark room. Without them you are debugging blind.

During IPL season, tracing showed us the timeline of every request: provider ingest > Redis > Mongo > fanout. Only then did we see the thundering herd for what it was. Metrics showed Redis command latency spike before Mongo pool saturation. That order mattered. It told us where the bottleneck really lived.
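One common defence against a thundering herd is request coalescing, sometimes called single-flight: when many callers ask for the same key at once, only the first one goes to the backing store and the rest wait for its result. The sketch below is an assumption-laden illustration of that general technique, not what we ran during the IPL. `SingleFlight`, `do` and the `waiters` counter are hypothetical names.

```python
import threading
from typing import Any, Callable, Dict, Optional


class _Call:
    """State shared by everyone waiting on one in-flight fetch."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.value: Any = None
        self.error: Optional[Exception] = None
        self.waiters = 0  # followers that piggybacked on this fetch


class SingleFlight:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._calls: Dict[str, _Call] = {}

    def do(self, key: str, fn: Callable[[], Any]) -> Any:
        with self._lock:
            call = self._calls.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._calls[key] = call
            else:
                call.waiters += 1
        if leader:
            try:
                call.value = fn()          # the only trip to the backing store
            except Exception as exc:
                call.error = exc           # followers see the same failure
            finally:
                with self._lock:
                    del self._calls[key]   # later requests start a fresh fetch
                call.done.set()
        else:
            call.done.wait()               # follower: wait for the leader's result
        if call.error is not None:
            raise call.error
        return call.value
```

A counter like `waiters` is also worth exporting as a metric, because it shows how much of a burst was absorbed before it reached Redis or Mongo.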

For a good primer, read "What is API observability?" The essence is this: observability turns anecdotes into data, and data into action. Without it, reliability is just luck. And even with observability, reliability is hard because every protection has a tradeoff.

Retries add resilience, but also storms. Caching adds speed, but risks staleness. Backpressure protects systems, but makes some users wait. Warming capacity makes bursts smooth, but wastes money on quiet days.
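The retry tradeoff is the easiest one to show in code. A widely used way to keep retries from turning into storms is capped exponential backoff with random jitter, plus a hard limit on attempts. The sketch below assumes that approach; `backoff_delay`, `call_with_retries` and the default values are hypothetical and would need tuning for a real service.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")
_RNG = random.Random()


def backoff_delay(attempt: int, base_s: float = 0.1, cap_s: float = 5.0) -> float:
    """Full jitter: a uniform random delay between 0 and min(cap, base * 2**attempt)."""
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return _RNG.uniform(0.0, ceiling)


def call_with_retries(fn: Callable[[], T], max_attempts: int = 3,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on failure with jittered backoff, up to max_attempts tries."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget spent: fail fast instead of piling on
            # Randomised wait so clients that failed together do not retry together.
            sleep(backoff_delay(attempt))
    raise AssertionError("unreachable")
```

The jitter is the important part. Without it, every client that failed at the same moment comes back at the same moment, and the retry becomes the next spike.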

No system can have it all. Reliability is the art of choosing the right failure mode for your domain. In live cricket, a slightly old score is fine. In fintech, stale data is unacceptable.

Talking Honestly About Failure

What most teams don't do is talk honestly about failure. They design for the happy path, demo the sunny day. Reliability begins when you accept the storm is coming. Your system will choke. The question is how it looks to the user when it does.

On our live score application, we reframed our goal: never leave the user with a blank screen. That one sentence gave us clarity. It meant staleness was acceptable, errors were acceptable, even load shedding was acceptable. But silence was not.

And this lesson is universal. HR platforms know that the morning surge is coming. WhatsApp engineers know those good morning images are coming. Retail apps know Black Friday is coming. Reliability is not about ignoring these patterns. It is about planning for them.

The truth is, every industry has its IPL moment, its login surge, or its cultural flood of traffic. Reliability is not about perfection. It is about being trustworthy when those moments arrive. And trust is built in the bad moments, not the good ones.

We did not build reliability layers because we were smart. We built them because traffic punched us in the face. Each outage left a scar. Each scar taught us what to change. That is why I call reliability scar tissue. It only comes from pain.

Simple APIs are easy. Reliable APIs are ugly, layered, full of compromises. They are born from late nights, angry users, and lessons you cannot skip. But they are also the ones that last.

If you are building APIs for the real world, stop obsessing over the happy path. Think about the storm. Decide your failure modes up front. Add the seatbelts before you floor the throttle. Remember, reliability is not the absence of failure. It is surviving it.
