Process data in large periodic batches vs continuously as it arrives.
If you are new here: There are two fundamental models for processing data at scale. Batch processing accumulates data over time — hours or days — then runs a job to process the whole dataset at once. Your nightly payroll run, the monthly invoice generation, the weekly ML model retrain — these are batch jobs. Stream processing processes each event within milliseconds of it arriving, keeping outputs continuously up to date. Fraud detection (block the card now, not tomorrow morning), live dashboards (show active users in real-time), IoT alerting (page on-call when temperature exceeds threshold) — these require stream processing. The choice between them isn't always obvious: batch is simpler to reason about and handles complete data; stream is complex but gives you fresh answers. Many real-world systems use both together — the Lambda architecture — running a stream processor for real-time approximate results and a batch job for periodic accurate recalculations.
| Term | Plain meaning |
|---|---|
| Batch job | A program that runs periodically, processes a bounded dataset, and exits |
| Stream processing | Continuously processing events as they arrive from an unbounded stream |
| Kafka | The most popular distributed event streaming platform — a high-throughput message broker |
| Apache Flink | A stateful stream processing framework — handles out-of-order events, windows, state |
| Apache Spark | Originally batch (MapReduce successor); has streaming mode (Spark Streaming / Structured Streaming) |
| Window | In stream processing, a bounded time slice to aggregate events (last 5 minutes, last 1000 events) |
| Watermark | A signal in stream processing that says "events older than T are unlikely to arrive now" — used to close windows |
| Lambda architecture | Run both batch and stream; merge results at serving time |
| Kappa architecture | Stream-only, with the ability to replay historical events to replace batch |
You run an e-commerce platform. Orders come in continuously — thousands per hour. You need to:

- Block fraudulent orders before they are fulfilled
- Send "your order has shipped" emails as shipping events occur
- Show a live dashboard of current sales and active users
- Generate each customer's invoice at the end of the month
These four requirements have fundamentally different freshness needs. The invoice can wait until end-of-month. The fraud detection cannot wait at all. Using the wrong processing model for each means either unnecessary complexity (streaming the monthly invoice) or dangerous staleness (batch-processing fraud detection with a 12-hour lag).
In plain terms: batch and stream are not opposites — they're tools for different freshness requirements. Understanding when each fits is one of the most important data engineering decisions.
Analogy: a bank's check-clearing process vs a real-time card authorization. When you deposit a check, the bank doesn't immediately credit your account — they batch-process checks overnight (the "clearing" cycle). That's acceptable because checks aren't time-critical. But when you tap your card at a store, the authorization happens in 200ms — because a 12-hour delay would mean no one could buy anything.
Batch processing works by accumulating data and then running a job over the full accumulated dataset:
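A minimal sketch of this shape in Python, assuming the day's orders have already been accumulated into a list (the record fields are illustrative):

```python
from datetime import date

# Hypothetical accumulated dataset: one record per order for the day.
orders = [
    {"order_id": 1, "amount": 19.99, "day": date(2024, 3, 1)},
    {"order_id": 2, "amount": 5.00,  "day": date(2024, 3, 1)},
    {"order_id": 3, "amount": 42.50, "day": date(2024, 3, 1)},
]

def nightly_batch_job(records):
    """Process the full bounded dataset in one pass, then exit."""
    total_revenue = sum(r["amount"] for r in records)
    return {"orders": len(records), "revenue": round(total_revenue, 2)}

report = nightly_batch_job(orders)
print(report)  # {'orders': 3, 'revenue': 67.49}
```

The job sees the complete, bounded input at once; there is no running state to maintain between invocations.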
Why batch is simpler: the input is bounded — a finite, complete dataset. The job has access to all the data it needs at runtime. Aggregations, joins, and sorts are straightforward. You can rerun the job with new logic over historical data. There's no notion of "the data might arrive late."
Throughput: batch jobs trade latency for throughput. A Spark job reading 10TB from S3 can achieve gigabytes/second of throughput, processing a massive dataset in minutes. Streaming systems can't match this throughput for historical reprocessing.
In plain terms: batch processing is like doing laundry — you collect dirty clothes all week (accumulate), then wash the full load (process), then fold and put away (output). You wouldn't run the washing machine for each individual sock (stream). But you also wouldn't wait a week to wash your work clothes if you needed them tomorrow.
Tiny example: Spotify's "Wrapped" feature. Every December, Spotify shows you your top artists and songs of the year. This is a massive batch job running over 12 months of listening history for 600+ million users. No stream processing needed — the result only needs to be ready once a year.
Stream processing consumes an unbounded sequence of events and produces outputs continuously:
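The shape of a stream processor can be sketched with a plain Python generator standing in for an unbounded event source (in production this would be a Kafka consumer):

```python
def event_source():
    """Stand-in for an unbounded stream (Kafka topic, sensor feed, ...)."""
    for amount in [10.0, 25.0, 7.5]:
        yield {"type": "order_placed", "amount": amount}

def process_stream(events):
    """Consume events one at a time, keeping the output continuously fresh."""
    running_total = 0.0
    for event in events:
        running_total += event["amount"]
        # In a real system this would update a live dashboard or fast store.
        yield running_total

for total in process_stream(event_source()):
    print(f"revenue so far: {total}")
```

Note the structural difference from batch: the loop never assumes the input ends, and every event immediately updates the output.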
Event time vs processing time: events have two timestamps:

- Event time: when the event actually occurred (e.g., when the user tapped "buy")
- Processing time: when the event reaches the stream processor
Out-of-order events are common — a mobile app might buffer events offline and submit them 5 minutes late. Stream processors use watermarks to handle this: "process all events with event time ≤ T once the watermark reaches T+δ."
Windowing: most stream aggregations aren't over all-time, but over a window of recent events:

- Tumbling window: fixed-size, non-overlapping slices (e.g., every 5 minutes)
- Sliding window: fixed-size, overlapping slices (e.g., the last 5 minutes, recomputed every minute)
- Session window: events grouped until a gap of inactivity closes the session
- Count window: the last N events (e.g., the last 1000 events)
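A toy sketch of tumbling windows closed by a watermark, assuming event timestamps in seconds (the window size and lateness bound are illustrative):

```python
from collections import defaultdict

WINDOW = 300  # 5-minute tumbling windows, in seconds
DELTA = 60    # allowed lateness: watermark trails the max event time by 60s

def tumbling_windows(events):
    """Assign each event time to a 5-minute window; emit a window's count
    once the watermark passes its end, i.e. late arrivals are unlikely."""
    open_windows = defaultdict(int)  # window start -> event count
    watermark = 0
    for ts in events:                # ts = event time in seconds
        open_windows[ts - ts % WINDOW] += 1
        watermark = max(watermark, ts - DELTA)
        for start in sorted(open_windows):
            if start + WINDOW <= watermark:  # no more events expected
                yield (start, open_windows.pop(start))

# Events arrive out of order; the late event at t=295 still lands in the
# first window because the watermark has not yet passed t=300.
events = [10, 120, 310, 295, 620]
print(list(tumbling_windows(events)))  # → [(0, 3)]
```

The last two windows stay open: with an unbounded stream, a window is only final once the watermark says so.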
In plain terms: stream processing is like a cashier processing customers as they arrive — the line never "closes." There's no "end of input." The cashier keeps working, maintaining the running total, handling occasional out-of-order arrivals.
Concrete sketch: Uber's real-time surge pricing. Every ride request and driver location update is published to Kafka. Flink consumers compute supply/demand ratios per geographic cell in 5-second tumbling windows. When demand exceeds supply in a cell, surge pricing is activated. The lag from "demand increases" to "surge price applied" is under 10 seconds. Impossible with batch; straightforward with stream.
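The per-cell computation can be sketched as a keyed tumbling-window aggregation. This toy version processes a finite event list for clarity; the cell names, 5-second window, and surge threshold are illustrative, and a real deployment would run this continuously in Flink over the Kafka stream:

```python
from collections import defaultdict

def surge_cells(events, window=5):
    """Count ride requests and available drivers per (5-second window,
    geographic cell) and flag cells where demand exceeds supply."""
    counts = defaultdict(lambda: [0, 0])  # (window start, cell) -> [requests, drivers]
    for ts, cell, kind in events:         # kind: "request" or "driver"
        key = (ts - ts % window, cell)
        counts[key][0 if kind == "request" else 1] += 1
    return {key: reqs / max(drivers, 1)
            for key, (reqs, drivers) in counts.items()
            if reqs / max(drivers, 1) > 1.0}  # surge: demand/supply ratio > 1

events = [
    (0, "cell_a", "request"), (1, "cell_a", "request"),
    (2, "cell_a", "driver"),  (3, "cell_b", "request"),
    (4, "cell_b", "driver"),  (4, "cell b".replace(" ", "_"), "driver"),
]
print(surge_cells(events))  # cell_a surges: 2 requests vs 1 driver
```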
Batch processing is the right tool when:
Completeness matters more than speed: payroll must be correct — it must process every transaction for the pay period. A streaming system might miss late-arriving transactions. Batch waits until all data is in.
The computation requires the full dataset: machine learning training needs all labeled examples. A recommendation model can't train on "events so far today" — it needs months of historical data. Same for analytics that compare this year vs last year.
You need to reprocess historical data: if you fix a bug in your aggregation logic, stream processing can't retroactively correct past outputs. Batch jobs can be re-run over the same historical input.
Tolerance for hours of latency: business intelligence reports used by executives at Monday morning meetings don't need to reflect events from 2 minutes ago.
High throughput is required: reading and transforming 100TB of raw event logs for a monthly analytics report is most efficient as a batch job — streaming systems aren't optimized for bulk historical scans.
Stream processing is the right tool when:
Stale data causes harm: fraud detection is the canonical example. If you batch-process transactions every 6 hours, a fraudster has 6 hours to make purchases after their card is flagged. A stream processor detects the pattern within seconds and blocks the card.
Users expect real-time results: a live dashboard showing "123 active users right now" must reflect the user who just signed in 500ms ago. A nightly batch update is useless for this.
Actions must happen in response to events: sending a "your order has shipped" email should happen when the shipping event occurs — not in a nightly batch. Triggering a sensor alert when temperature exceeds 85°C cannot wait until midnight.
Volume is too high to store before processing: IoT sensors might produce gigabytes per second. Storing all of it before processing would require enormous storage. Stream processing filters and aggregates at the source.
Many systems need both freshness and accuracy. The Lambda architecture (coined by Nathan Marz) solves this by running both pipelines in parallel:
Speed layer (stream): processes events in real-time, producing approximate but fresh results. Stores in a fast read store (Redis, Cassandra). Results update within seconds. Slightly approximate because late-arriving events might be missed.
Batch layer (batch): periodically reprocesses all historical data with complete accuracy. Overwrites the speed layer results with the correct values. Runs every hour or every day.
Serving layer: merges results from both layers. For recent data, prefer the speed layer's fresh (but possibly approximate) result. For older data, use the batch layer's accurate result.
Example: a ride-sharing analytics dashboard. Speed layer: "total rides in the last 5 minutes" — updated in real-time from the Kafka stream. Batch layer: "total rides yesterday" — exact, computed from the complete dataset. Both show on the same dashboard; the serving layer knows which to use.
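A sketch of the serving layer's merge logic, with dicts standing in for the two stores (the metric names, timestamps, and one-hour batch horizon are illustrative):

```python
def serving_layer_read(metric, window_start, now, batch_store, speed_store,
                       batch_horizon=3600):
    """Merge the two layers: prefer the batch layer's exact result once a
    window is old enough to have been reprocessed; otherwise fall back to
    the speed layer's fresh (possibly approximate) value."""
    key = (metric, window_start)
    if now - window_start > batch_horizon and key in batch_store:
        return batch_store[key]        # exact, from the batch job
    return speed_store.get(key, 0)     # fresh, from the stream

# Hypothetical stores: the batch job has reprocessed everything up to t=0.
batch = {("rides", 0): 10_000}                       # exact count, old window
speed = {("rides", 0): 9_950, ("rides", 7200): 42}   # stream's running counts

print(serving_layer_read("rides", 0, now=7300,
                         batch_store=batch, speed_store=speed))     # 10000
print(serving_layer_read("rides", 7200, now=7300,
                         batch_store=batch, speed_store=speed))     # 42
```

The old window returns the batch layer's exact 10,000 (overriding the stream's slightly-off 9,950); the recent window has no batch result yet, so the speed layer's fresh count wins.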
Kappa architecture (simpler alternative): use only a stream processor, but with the ability to replay historical events. Kafka retains events for days or weeks; reprocessing is a matter of resetting the consumer offset to the beginning. This eliminates the complexity of maintaining two separate pipelines. Works when your stream processor (Flink, Kafka Streams) can handle the full historical volume.
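The replay idea can be illustrated with a toy in-memory log; in real Kafka, the equivalent move is resetting the consumer group's offset to the beginning of the retained topic:

```python
class ReplayableLog:
    """Toy stand-in for a Kafka topic with retention: an append-only event
    list plus a consumer offset that can be reset to replay history."""
    def __init__(self):
        self.events = []
        self.offset = 0

    def append(self, event):
        self.events.append(event)

    def reset(self):
        """The Kappa move: rewind to offset 0 and reprocess everything."""
        self.offset = 0

    def consume(self):
        while self.offset < len(self.events):
            event = self.events[self.offset]
            self.offset += 1
            yield event

log = ReplayableLog()
for amount in [10, 20, 30]:
    log.append({"amount": amount})

v1_total = sum(e["amount"] for e in log.consume())      # original logic: 60
log.reset()                                             # fix a bug, replay
v2_total = sum(e["amount"] * 2 for e in log.consume())  # new logic: 120
print(v1_total, v2_total)
```

Because the log is the source of truth, "fixing a bug in the aggregation" is just deploying new consumer logic and replaying — no second batch pipeline needed.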
| Property | Batch | Stream |
|---|---|---|
| Latency | Hours to days | Milliseconds to seconds |
| Completeness | High — all data available | Lower — late events, ordering challenges |
| Throughput | Very high | Moderate (per-event overhead) |
| Reprocessing | Easy — re-run the job | Hard — must replay event log |
| Complexity | Low — bounded input | High — windows, state, watermarks |
| Best for | Analytics, ML training, invoicing | Fraud, alerting, live dashboards |
The most common mistake is defaulting to stream processing because it sounds more modern, then fighting complexity that wasn't necessary. Ask first: "does this use case actually need results in under a minute?" If the answer is no, batch is almost always simpler and cheaper. Build your data infrastructure batch-first; add streaming only for the specific use cases that genuinely require real-time results. When you do need streaming, Apache Kafka + Apache Flink is the dominant open-source stack; AWS Kinesis + Lambda works well for smaller scale on AWS.
Next: ETL Pipelines — the backbone of data infrastructure: extracting data from sources, transforming it, and loading it into destinations.
Batch processing — data accumulates over time, then a job runs periodically (hourly, nightly) to process the full dataset at once.