Batch Processing in Modern Systems
Batching in distributed systems is like a tax strategy. Every database write, every API call, every network request carries overhead you can't avoid. The only question is how many times you pay it. Process a thousand records individually and you pay a thousand times. Batch them and you pay once.
Batch processing is the practice of deferring high-volume, repetitive work until it makes sense to run it all at once. Backups, sorting, filtering - these tasks are compute-heavy enough that running them per transaction is wasteful. Instead, systems accumulate the work, then flush it in one go, typically during low-traffic windows. An ecommerce store doesn't need to process every order the instant it arrives. Collect them all day, run the batch at intervals, and your compute resources stay free for the work that actually needs to happen in real time.
On paper it's obvious. In practice, every layer of your stack implements it differently, fails differently, and hides the failure in a different way. But first - why does batching work at all?
The Coordination Tax
Every operation in a distributed system carries overhead that has nothing to do with the work itself. A single database write triggers a chain of costs: TCP handshake, SSL termination, OS syscalls, memory allocation, lock acquisition, a WAL append, and an fsync to disk. You pay all of that before a single byte of your actual data hits storage.
Do that 1,000 times individually and you pay the full coordination cost 1,000 times. Batch those same 1,000 operations into one request and you pay it once. The CPU instruction cache stays hot, the network packet gets saturated, and the disk does one sequential write instead of 1,000 random seeks. The gap between paying once versus paying a thousand times is where batching lives, and where its complications begin.
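The arithmetic behind that gap is easy to sketch. The overhead and per-record work figures below are illustrative assumptions, not benchmarks of any real system; the point is the shape of the two formulas, not the numbers.

```python
# Illustrative amortization math. Units are microseconds; the specific
# values for overhead and work are assumptions chosen for round numbers.
overhead_us = 1000   # per-request coordination cost: handshake, syscalls, fsync
work_us = 10         # actual per-record work
n = 1000             # records to process

individual = n * (overhead_us + work_us)  # pay the overhead every time
batched = overhead_us + n * work_us       # pay the overhead once

print(individual)  # 1010000
print(batched)     # 11000
```

Under these assumptions the batched path is roughly 90x cheaper, and almost all of the difference is coordination cost, not work.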
Where Batching Shows Up
The Database: The fsync Ceiling
Postgres and MySQL are gated by fsync, the call that forces data from memory to physical disk. Most drives handle a few thousand fsyncs per second at most. That ceiling matters more than most people realize. If you want to understand how Postgres specifically handles WAL writes and fsync behavior under load, this deep dive by Laurenz Albe is worth bookmarking.
-- Sequential: 1 fsync per commit, repeated 1,000 times = 1,000 fsyncs
INSERT INTO events (...) VALUES (...); COMMIT;
-- Batched: 1 fsync for all 1,000 rows
INSERT INTO events (...) VALUES (...), (...), (...); COMMIT;
The catch is that a longer transaction holds its locks longer. Push the batch size too high and you get lock contention, stalled reads, and the application layer stalling while waiting for the transaction to clear. The right batch size isn't something you calculate; it's something you measure against your specific write volume and lock timeout thresholds.
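A chunked insert loop makes the tradeoff concrete. This sketch uses SQLite as a stand-in for Postgres or MySQL since it ships with Python, and `chunk_size=500` is an arbitrary starting point, exactly the knob you would tune against your own lock timeouts:

```python
import sqlite3

def insert_batched(conn, rows, chunk_size=500):
    """Insert rows in chunks, one transaction (one commit/fsync) per chunk.

    chunk_size is the tradeoff dial: larger amortizes the commit cost
    better, smaller keeps lock hold times short.
    """
    cur = conn.cursor()
    for i in range(0, len(rows), chunk_size):
        chunk = rows[i:i + chunk_size]
        with conn:  # context manager commits the whole chunk as one transaction
            cur.executemany("INSERT INTO events (payload) VALUES (?)", chunk)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (payload TEXT)")
insert_batched(conn, [(f"event {i}",) for i in range(1000)])
print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # 1000
```

With 1,000 rows and a chunk size of 500, that is two commits instead of a thousand, while no single transaction holds locks long enough to starve readers.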
Move up the stack and the failure mode changes shape.
The API: Partial Failures Need Granular Responses
Batching HTTP requests eliminates round-trip overhead, but it creates a failure model most API designs aren't built for. When you send 50 items in a single request and item #47 is malformed, what happens to the other 49?
If you return a blanket 500, the client retries all 50. Without knowing which item failed, it has no choice but to retry everything. That one bad record now causes 49 valid records to hammer your database in a loop, write amplification hiding behind a retry mechanism.
The correct design uses HTTP 207 Multi-Status with a response body that maps each item to its own result:
{
  "results": [
    { "id": "item_001", "status": 200 },
    { "id": "item_047", "status": 400, "error": "missing required field: timestamp" },
    { "id": "item_049", "status": 200 }
  ]
}
Now the client knows exactly which item to fix and retry. Design this from day one; retrofitting granular error responses after clients are in production means negotiating breaking changes to a contract that's already live.
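On the client side, the payoff of that response shape is a retry loop that touches only the failed items. A minimal sketch, assuming the per-item response format shown above (the transport itself is out of scope here):

```python
def items_to_retry(batch, response):
    """Given the original batch and a per-item 207-style response body,
    return only the items that failed, so the client never re-sends
    the records that already succeeded."""
    failed_ids = {r["id"] for r in response["results"] if r["status"] >= 400}
    return [item for item in batch if item["id"] in failed_ids]

batch = [{"id": "item_001"}, {"id": "item_047"}, {"id": "item_049"}]
response = {
    "results": [
        {"id": "item_001", "status": 200},
        {"id": "item_047", "status": 400, "error": "missing required field: timestamp"},
        {"id": "item_049", "status": 200},
    ]
}
print(items_to_retry(batch, response))  # [{'id': 'item_047'}]
```

With a blanket 500 this function is impossible to write, which is the whole argument for granular responses.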
The server-side tradeoff is about failure isolation. On the client side, it's about something more visible: battery life, data costs, and latency tolerance.
The Client: Two Paths, One System
A freight company tracks 400 delivery trucks across the country. Their original app pinged the server with GPS coordinates every 3 seconds per truck. The LTE radio on each device stayed in high-power state all day. Batteries died mid-shift. Mobile data costs were significant.
They switched to local batching: store coordinates on-device every 3 seconds, flush to the server once every 60 seconds. The radio wakes up, transmits a full minute of location history as a single payload, and goes back to sleep.
Battery life doubled. Data costs dropped 80%. The tradeoff was visible and honest: the dispatch dashboard now shows positions up to 60 seconds stale. For route monitoring, that was fine. For "is the driver at the customer gate right now," it wasn't, so they kept a separate lightweight ping for that specific case.
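The on-device side of that design is a buffer with a flush window. This is a sketch of the pattern rather than the freight company's actual code; `send` is a placeholder for whatever uplink call the device makes:

```python
import time

class LocationBuffer:
    """Accumulate GPS samples locally, flush them as one payload per window.

    `send` is a hypothetical uplink function injected by the caller;
    one call to it means one radio wake-up.
    """

    def __init__(self, send, flush_interval_s=60):
        self.send = send
        self.flush_interval_s = flush_interval_s
        self.buffer = []
        self.last_flush = time.monotonic()

    def record(self, lat, lon):
        self.buffer.append((time.time(), lat, lon))
        if time.monotonic() - self.last_flush >= self.flush_interval_s:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send(self.buffer)  # a full window of history in one payload
            self.buffer = []
        self.last_flush = time.monotonic()
```

The staleness tradeoff lives entirely in `flush_interval_s`: 60 seconds for route monitoring, and a separate unbatched path for the cases that cannot wait.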
That's what mature batching decisions look like: not replacing granular operations entirely, but running batched and unbatched paths in parallel based on what latency each use case can actually tolerate. The performance case is real. So is the failure tax.
What Can Go Wrong
Poison Pills: When One Bad Message Blocks Everything
A poison pill is a malformed message that crashes the consumer. When processing messages individually, the consumer crashes, restarts, skips the bad message, and carries on. With batching, the consumer pulls 1,000 messages, hits the one corrupted record, crashes, restarts, and pulls the exact same 1,000 messages again. It crashes again. That one bad character in one JSON payload now blocks 999 unrelated legitimate messages from ever being processed.
The fix is a Dead Letter Queue. After N failed attempts, route the offending message to a separate queue for human inspection rather than letting it stall the main consumer indefinitely. AWS has a concise write-up on DLQ configuration that covers retry thresholds and redrive policies worth reading before you implement one. Without a DLQ, a batched queue under a poison pill is a full stop, not a slowdown.
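The routing logic is simple enough to show in full. This is a broker-agnostic sketch - in practice SQS's redrive policy or a Kafka retry topic does this for you - but the shape of the decision is the same everywhere:

```python
MAX_ATTEMPTS = 3  # illustrative retry threshold

def process_batch(messages, handle, dead_letter):
    """Process each message in a batch. After MAX_ATTEMPTS failures,
    route the offender to the dead-letter queue instead of letting it
    block the rest of the batch."""
    attempts = {}
    for msg in messages:
        msg_id = msg["id"]
        while True:
            try:
                handle(msg)
                break
            except Exception:
                attempts[msg_id] = attempts.get(msg_id, 0) + 1
                if attempts[msg_id] >= MAX_ATTEMPTS:
                    dead_letter.append(msg)  # park it for human inspection
                    break
```

The poison pill lands in the DLQ after three tries; the other 999 messages in the batch are processed normally.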
Thundering Herds: The Synchronization Problem
If 1,000 microservice instances all flush their batches on a strict 60-second timer, they will eventually synchronize. Your database absorbs a wall of simultaneous writes every 60 seconds, then silence, then another wall. That load profile is harder on your infrastructure than an equivalent steady stream because connection pools spike, query queues back up, and latency for everything else spikes with them.
The fix is one line:
flush_window = 60s + random(0-5s)
Jitter desynchronizes the fleet. The spike becomes a plateau and the database never sees the wall. This piece from Marc Brooker on exponential backoff and jitter explains the math behind why uniform timers synchronize and randomized ones don't.
Observability: The Quiet Compounding Failure
This one is quieter than the others, but it compounds. When a user reports that their order didn't go through, you pull logs and see: Batch of 500 processed successfully. Their request was in that batch. Whether it succeeded or failed inside the batch isn't in the log unless you explicitly wrote it there, and if you're logging every individual ID inside every batch, you've likely offset a meaningful chunk of the I/O savings you were chasing.
Batching improves throughput and degrades trace granularity at the same time. The answer isn't to avoid batching; it's to decide upfront which IDs get logged, sample aggressively, and use correlation IDs that survive the batch boundary so distributed traces can still reconstruct what happened to a specific request.
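One way to implement that policy: always summarize the batch, always log failures individually with their correlation IDs, and sample successes. The result shape and the 1% sample rate here are assumptions for illustration, not a prescribed format:

```python
import logging
import random

logger = logging.getLogger("batch")

SAMPLE_RATE = 0.01  # log ~1% of individual successes; an illustrative knob

def log_batch_result(batch_id, results):
    """Log a batch summary, every failure, and a sample of successes.

    Each result is assumed to carry a correlation_id that survives the
    batch boundary, so a trace can still find a specific request.
    """
    failures = [r for r in results if r["status"] >= 400]
    logger.info("batch=%s size=%d failed=%d", batch_id, len(results), len(failures))
    for r in failures:
        # failures are always logged individually; they are what you debug
        logger.warning("batch=%s item=%s corr=%s status=%d",
                       batch_id, r["id"], r["correlation_id"], r["status"])
    for r in results:
        if r["status"] < 400 and random.random() < SAMPLE_RATE:
            logger.info("batch=%s item=%s corr=%s ok",
                        batch_id, r["id"], r["correlation_id"])
```

The asymmetry is deliberate: failures are rare and precious, so they get full granularity; successes are plentiful, so a sample is enough to validate the pipeline.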
When It's Worth It
Batch when throughput matters more than latency: background jobs, analytics pipelines, bulk ingestion, mobile sync, message queues. Keep individual operations where you need real-time feedback or where the failure domain needs to stay small enough to debug cleanly.
The engineers who use batching well aren't the ones who understood the performance gain. They're the ones who understood what they were giving up.