Process data in large periodic batches vs continuously as it arrives.
If you are new here: There are two fundamental models for processing data at scale. Batch processing accumulates data over time — hours or days — then runs a job to process the whole dataset at once. Your nightly payroll run, the monthly invoice generation, the weekly ML model retrain — these are batch jobs. Stream processing processes each event within milliseconds of it arriving, keeping outputs continuously up to date. Fraud detection (block the card now, not tomorrow morning), live dashboards (show active users in real-time), IoT alerting (page on-call when temperature exceeds threshold) — these require stream processing. The choice between them isn't always obvious: batch is simpler to reason about and handles complete data; stream is complex but gives you fresh answers. Many real-world systems use both together — the Lambda architecture — running a stream processor for real-time approximate results and a batch job for periodic accurate recalculations.
| Term | Plain meaning |
|---|---|
| Batch job | A program that runs periodically, processes a bounded dataset, and exits |
| Stream processing | Continuously processing events as they arrive from an unbounded stream |
| Kafka | The most popular distributed event streaming platform — a high-throughput message broker |
| Apache Flink | A stateful stream processing framework — handles out-of-order events, windows, state |
| Apache Spark | Originally batch (MapReduce successor); has streaming mode (Spark Streaming / Structured Streaming) |
| Window | In stream processing, a bounded time slice to aggregate events (last 5 minutes, last 1000 events) |
| Watermark | A signal in stream processing that says "events older than T are unlikely to arrive now" — used to close windows |
| Lambda architecture | Run both batch and stream; merge results at serving time |
| Kappa architecture | Stream-only, with the ability to replay historical events to replace batch |
You run an e-commerce platform. Orders come in continuously — thousands per hour. You need to:

- Block fraudulent orders before they are fulfilled
- Send "your order has shipped" emails as shipping events occur
- Show a live dashboard of current sales and active users
- Generate each customer's invoice at the end of the month
These four requirements have fundamentally different freshness needs. The invoice can wait until end-of-month. The fraud detection cannot wait at all. Using the wrong processing model for each means either unnecessary complexity (streaming the monthly invoice) or dangerous staleness (batch-processing fraud detection with a 12-hour lag).
In plain terms: batch and stream are not opposites — they're tools for different freshness requirements. Understanding when each fits is one of the most important data engineering decisions.
Analogy: a bank's check-clearing process vs a real-time card authorization. When you deposit a check, the bank doesn't immediately credit your account — they batch-process checks overnight (the "clearing" cycle). That's acceptable because checks aren't time-critical. But when you tap your card at a store, the authorization happens in 200ms — because a 12-hour delay would mean no one could buy anything.
Batch processing works by accumulating data and then running a job over the full accumulated dataset:
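A minimal sketch of this shape in Python, assuming the day's orders have already been accumulated into a list (the record fields are illustrative):

```python
from datetime import date

# Hypothetical accumulated dataset: one record per order for the day.
orders = [
    {"order_id": 1, "amount": 19.99, "day": date(2024, 3, 1)},
    {"order_id": 2, "amount": 5.00,  "day": date(2024, 3, 1)},
    {"order_id": 3, "amount": 42.50, "day": date(2024, 3, 1)},
]

def nightly_batch_job(records):
    """Process the full bounded dataset in one pass, then exit."""
    total_revenue = sum(r["amount"] for r in records)
    return {"orders": len(records), "revenue": round(total_revenue, 2)}

report = nightly_batch_job(orders)
print(report)  # {'orders': 3, 'revenue': 67.49}
```

The job sees the complete, bounded input at once; there is no running state to maintain between invocations.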
Why batch is simpler: the input is bounded — a finite, complete dataset. The job has access to all the data it needs at runtime. Aggregations, joins, and sorts are straightforward. You can rerun the job with new logic over historical data. There's no notion of "the data might arrive late."
Throughput: batch jobs trade latency for throughput. A Spark job reading 10TB from S3 can achieve gigabytes/second of throughput, processing a massive dataset in minutes. Streaming systems can't match this throughput for historical reprocessing.
In plain terms: batch processing is like doing laundry — you collect dirty clothes all week (accumulate), then wash the full load (process), then fold and put away (output). You wouldn't run the washing machine for each individual sock (stream). But you also wouldn't wait a week to wash your work clothes if you needed them tomorrow.
Tiny example: Spotify's "Wrapped" feature. Every December, Spotify shows you your top artists and songs of the year. This is a massive batch job running over 12 months of listening history for 600+ million users. No stream processing needed — the result only needs to be ready once a year.
Stream processing consumes an unbounded sequence of events and produces outputs continuously:
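The shape of a stream processor can be sketched with a plain Python generator standing in for an unbounded event source (in production this would be a Kafka consumer):

```python
def event_source():
    """Stand-in for an unbounded stream (Kafka topic, sensor feed, ...)."""
    for amount in [10.0, 25.0, 7.5]:
        yield {"type": "order_placed", "amount": amount}

def process_stream(events):
    """Consume events one at a time, keeping the output continuously fresh."""
    running_total = 0.0
    for event in events:
        running_total += event["amount"]
        # In a real system this would update a live dashboard or fast store.
        yield running_total

for total in process_stream(event_source()):
    print(f"revenue so far: {total}")
```

Note the structural difference from batch: the loop never assumes the input ends, and every event immediately updates the output.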
Event time vs processing time: events have two timestamps:

- Event time: when the event actually occurred (e.g., when the user tapped "buy")
- Processing time: when the event reaches the stream processor
Out-of-order events are common — a mobile app might buffer events offline and submit them 5 minutes late. Stream processors use watermarks to handle this: "process all events with event time ≤ T once the watermark reaches T+δ."
Windowing: most stream aggregations aren't over all-time, but over a window of recent events:

- Tumbling window: fixed-size, non-overlapping slices (e.g., every 5 minutes)
- Sliding window: fixed-size, overlapping slices (e.g., the last 5 minutes, recomputed every minute)
- Session window: events grouped until a gap of inactivity closes the session
- Count window: the last N events (e.g., the last 1000 events)
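A toy sketch of tumbling windows closed by a watermark, assuming event timestamps in seconds (the window size and lateness bound are illustrative):

```python
from collections import defaultdict

WINDOW = 300  # 5-minute tumbling windows, in seconds
DELTA = 60    # allowed lateness: watermark trails the max event time by 60s

def tumbling_windows(events):
    """Assign each event time to a 5-minute window; emit a window's count
    once the watermark passes its end, i.e. late arrivals are unlikely."""
    open_windows = defaultdict(int)  # window start -> event count
    watermark = 0
    for ts in events:                # ts = event time in seconds
        open_windows[ts - ts % WINDOW] += 1
        watermark = max(watermark, ts - DELTA)
        for start in sorted(open_windows):
            if start + WINDOW <= watermark:  # no more events expected
                yield (start, open_windows.pop(start))

# Events arrive out of order; the late event at t=295 still lands in the
# first window because the watermark has not yet passed t=300.
events = [10, 120, 310, 295, 620]
print(list(tumbling_windows(events)))  # → [(0, 3)]
```

The last two windows stay open: with an unbounded stream, a window is only final once the watermark says so.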
In plain terms: stream processing is like a cashier processing customers as they arrive — the line never "closes." There's no "end of input." The cashier keeps working, maintaining the running total, handling occasional out-of-order arrivals.
Concrete sketch: Uber's real-time surge pricing. Every ride request and driver location update is published to Kafka. Flink consumers compute supply/demand ratios per geographic cell in 5-second tumbling windows. When demand exceeds supply in a cell, surge pricing is activated. The lag from "demand increases" to "surge price applied" is under 10 seconds. Impossible with batch; straightforward with stream.
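The per-cell computation can be sketched as a keyed tumbling-window aggregation. This toy version processes a finite event list for clarity; the cell names, 5-second window, and surge threshold are illustrative, and a real deployment would run this continuously in Flink over the Kafka stream:

```python
from collections import defaultdict

def surge_cells(events, window=5):
    """Count ride requests and available drivers per (5-second window,
    geographic cell) and flag cells where demand exceeds supply."""
    counts = defaultdict(lambda: [0, 0])  # (window start, cell) -> [requests, drivers]
    for ts, cell, kind in events:         # kind: "request" or "driver"
        key = (ts - ts % window, cell)
        counts[key][0 if kind == "request" else 1] += 1
    return {key: reqs / max(drivers, 1)
            for key, (reqs, drivers) in counts.items()
            if reqs / max(drivers, 1) > 1.0}  # surge: demand/supply ratio > 1

events = [
    (0, "cell_a", "request"), (1, "cell_a", "request"),
    (2, "cell_a", "driver"),  (3, "cell_b", "request"),
    (4, "cell_b", "driver"),  (4, "cell b".replace(" ", "_"), "driver"),
]
print(surge_cells(events))  # cell_a surges: 2 requests vs 1 driver
```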
Batch processing is the right tool when:
Completeness matters more than speed: payroll must be correct — it must process every transaction for the pay period. A streaming system might miss late-arriving transactions. Batch waits until all data is in.
The computation requires the full dataset: machine learning training needs all labeled examples. A recommendation model can't train on "events so far today" — it needs months of historical data. Same for analytics that compare this year vs last year.
You need to reprocess historical data: if you fix a bug in your aggregation logic, stream processing can't retroactively correct past outputs. Batch jobs can be re-run over the same historical input.
Tolerance for hours of latency: business intelligence reports used by executives at Monday morning meetings don't need to reflect events from 2 minutes ago.
High throughput is required: reading and transforming 100TB of raw event logs for a monthly analytics report is most efficient as a batch job — streaming systems aren't optimized for bulk historical scans.
Stream processing is the right tool when:
Stale data causes harm: fraud detection is the canonical example. If you batch-process transactions every 6 hours, a fraudster has 6 hours to make purchases after their card is flagged. A stream processor detects the pattern within seconds and blocks the card.
Users expect real-time results: a live dashboard showing "123 active users right now" must reflect the user who just signed in 500ms ago. A nightly batch update is useless for this.
Actions must happen in response to events: sending a "your order has shipped" email should happen when the shipping event occurs — not in a nightly batch. Triggering a sensor alert when temperature exceeds 85°C cannot wait until midnight.
Volume is too high to store before processing: IoT sensors might produce gigabytes per second. Storing all of it before processing would require enormous storage. Stream processing filters and aggregates at the source.
Many systems need both freshness and accuracy. The Lambda architecture (coined by Nathan Marz) solves this by running both pipelines in parallel:
Speed layer (stream): processes events in real-time, producing approximate but fresh results. Stores in a fast read store (Redis, Cassandra). Results update within seconds. Slightly approximate because late-arriving events might be missed.
Batch layer (batch): periodically reprocesses all historical data with complete accuracy. Overwrites the speed layer results with the correct values. Runs every hour or every day.
Serving layer: merges results from both layers. For recent data, prefer the speed layer's fresh (but possibly approximate) result. For older data, use the batch layer's accurate result.
Example: a ride-sharing analytics dashboard. Speed layer: "total rides in the last 5 minutes" — updated in real-time from the Kafka stream. Batch layer: "total rides yesterday" — exact, computed from the complete dataset. Both show on the same dashboard; the serving layer knows which to use.
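A sketch of the serving layer's merge logic, with dicts standing in for the two stores (the metric names, timestamps, and one-hour batch horizon are illustrative):

```python
def serving_layer_read(metric, window_start, now, batch_store, speed_store,
                       batch_horizon=3600):
    """Merge the two layers: prefer the batch layer's exact result once a
    window is old enough to have been reprocessed; otherwise fall back to
    the speed layer's fresh (possibly approximate) value."""
    key = (metric, window_start)
    if now - window_start > batch_horizon and key in batch_store:
        return batch_store[key]        # exact, from the batch job
    return speed_store.get(key, 0)     # fresh, from the stream

# Hypothetical stores: the batch job has reprocessed everything up to t=0.
batch = {("rides", 0): 10_000}                       # exact count, old window
speed = {("rides", 0): 9_950, ("rides", 7200): 42}   # stream's running counts

print(serving_layer_read("rides", 0, now=7300,
                         batch_store=batch, speed_store=speed))     # 10000
print(serving_layer_read("rides", 7200, now=7300,
                         batch_store=batch, speed_store=speed))     # 42
```

The old window returns the batch layer's exact 10,000 (overriding the stream's slightly-off 9,950); the recent window has no batch result yet, so the speed layer's fresh count wins.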
Kappa architecture (simpler alternative): use only a stream processor, but with the ability to replay historical events. Kafka retains events for days or weeks; reprocessing is a matter of resetting the consumer offset to the beginning. This eliminates the complexity of maintaining two separate pipelines. Works when your stream processor (Flink, Kafka Streams) can handle the full historical volume.
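The replay idea can be illustrated with a toy in-memory log; in real Kafka, the equivalent move is resetting the consumer group's offset to the beginning of the retained topic:

```python
class ReplayableLog:
    """Toy stand-in for a Kafka topic with retention: an append-only event
    list plus a consumer offset that can be reset to replay history."""
    def __init__(self):
        self.events = []
        self.offset = 0

    def append(self, event):
        self.events.append(event)

    def reset(self):
        """The Kappa move: rewind to offset 0 and reprocess everything."""
        self.offset = 0

    def consume(self):
        while self.offset < len(self.events):
            event = self.events[self.offset]
            self.offset += 1
            yield event

log = ReplayableLog()
for amount in [10, 20, 30]:
    log.append({"amount": amount})

v1_total = sum(e["amount"] for e in log.consume())      # original logic: 60
log.reset()                                             # fix a bug, replay
v2_total = sum(e["amount"] * 2 for e in log.consume())  # new logic: 120
print(v1_total, v2_total)
```

Because the log is the source of truth, "fixing a bug in the aggregation" is just deploying new consumer logic and replaying — no second batch pipeline needed.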
| Property | Batch | Stream |
|---|---|---|
| Latency | Hours to days | Milliseconds to seconds |
| Completeness | High — all data available | Lower — late events, ordering challenges |
| Throughput | Very high | Moderate (per-event overhead) |
| Reprocessing | Easy — re-run the job | Hard — must replay event log |
| Complexity | Low — bounded input | High — windows, state, watermarks |
| Best for | Analytics, ML training, invoicing | Fraud, alerting, live dashboards |
The most common mistake is defaulting to stream processing because it sounds more modern, then fighting complexity that wasn't necessary. Ask first: "does this use case actually need results in under a minute?" If the answer is no, batch is almost always simpler and cheaper. Build your data infrastructure batch-first; add streaming only for the specific use cases that genuinely require real-time results. When you do need streaming, Apache Kafka + Apache Flink is the dominant open-source stack; AWS Kinesis + Lambda works well for smaller scale on AWS.
Next: ETL Pipelines — the backbone of data infrastructure: extracting data from sources, transforming it, and loading it into destinations.
Batch processing — data accumulates over time, then a job runs periodically (hourly, nightly) to process the full dataset at once.