
Data Compression

Reduce storage cost and network bandwidth by encoding data more compactly.

If you are new here: Compression rewrites data into fewer bytes by exploiting patterns. Lossless compression returns the exact original bytes, so it works for JSON, code, logs, database pages, and backups. Lossy compression throws away detail humans probably will not notice, which is why it works for photos, audio, and video but not for invoices.

Term | Plain meaning
Codec | Algorithm used to compress and decompress data, such as gzip, zstd, Snappy, JPEG
Compression ratio | How much smaller the result is, like 10 MB down to 2 MB
Lossless | Exact original can be reconstructed
Lossy | Smaller output by discarding information
CPU trade-off | Smaller output usually requires more compute to encode or decode

The Problem

Your API returns 2 MB of JSON to mobile clients. The payload includes repeated field names, repeated status strings, and predictable numbers. Sent raw, it burns bandwidth, increases latency, and costs more on every hop. Compressed with gzip or zstd, it might shrink to 200-400 KB.
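
Code sketch: to see the effect on repetitive JSON, here is a minimal Python example using the standard-library gzip module. The record shape, field names, and record count are made up for illustration, and the exact sizes will differ for real payloads.

```python
import gzip
import json


def build_payload(n_records: int = 2000) -> bytes:
    """Build a JSON body with repeated field names and status strings (made-up data)."""
    statuses = ["active", "pending", "suspended"]
    records = [
        {"user_id": i, "status": statuses[i % 3], "balance_cents": 1000 + i}
        for i in range(n_records)
    ]
    return json.dumps(records).encode("utf-8")


def gzip_report(raw: bytes) -> tuple[int, int]:
    """Return (original_size, compressed_size) after checking the round trip."""
    packed = gzip.compress(raw)
    assert gzip.decompress(packed) == raw  # lossless: the exact bytes come back
    return len(raw), len(packed)


if __name__ == "__main__":
    original, compressed = gzip_report(build_payload())
    print(f"{original} bytes -> {compressed} bytes "
          f"({compressed / original:.1%} of original)")
```

Run it against a sample of your own responses before trusting any ratio; synthetic data like this tends to compress better than production traffic.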

That is the basic bargain: spend CPU to move and store fewer bytes.

In plain terms: compression is worth it when bytes are the bottleneck and the data has patterns. It is a bad deal when the data is already compressed, encrypted, tiny, or when CPU is the bottleneck.

Analogy: Vacuum-packing clothes makes the suitcase smaller, but you spend time squeezing the air out and time unpacking later. For bulky sweaters, it is great. For a brick, there is nothing to squeeze.

Tiny example: A 1 MB JSON response over a slow mobile link might take hundreds of milliseconds just to transfer. If gzip turns it into 120 KB and decompression takes 3 ms in the browser, users win.
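
Worked arithmetic: the sketch below puts numbers on that example. The 20 Mbit/s link speed is an assumption chosen for illustration (the text only says "slow mobile link"), and the model ignores round-trip latency and protocol overhead.

```python
def transfer_ms(size_bytes: int, link_bits_per_second: int) -> float:
    """Milliseconds to push size_bytes through the link (no latency, no overhead)."""
    return size_bytes * 8 * 1000 / link_bits_per_second


LINK_BPS = 20_000_000   # assumed 20 Mbit/s mobile link, not a figure from the text
RAW_BYTES = 1_000_000   # 1 MB JSON response
GZIP_BYTES = 120_000    # 120 KB after gzip
DECOMPRESS_MS = 3.0     # browser decompression cost from the example

raw_ms = transfer_ms(RAW_BYTES, LINK_BPS)
gzip_ms = transfer_ms(GZIP_BYTES, LINK_BPS) + DECOMPRESS_MS

print(f"raw: {raw_ms:.0f} ms, gzip: {gzip_ms:.0f} ms")  # raw: 400 ms, gzip: 51 ms
```

Under these assumptions the compressed response arrives roughly eight times sooner, and the 3 ms of decompression is a rounding error next to the transfer time saved.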

Lossless compression

Lossless codecs preserve the exact original bytes. That requirement matters for structured data: a changed byte in a backup, ledger export, or database page is corruption, not an acceptable approximation.

Common choices have different personalities. gzip is everywhere and safe as a compatibility default. zstd often gives better ratios at good speed. Snappy and LZ4 prioritize speed over maximum shrinkage, which is why they show up in databases and streaming systems.

In plain terms: lossless compression is pattern replacement with a perfect undo button.

Concrete sketch: Logs are excellent candidates because they repeat timestamps, service names, field names, and status strings. Encrypted blobs are terrible candidates because encryption deliberately removes visible patterns.
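
Code sketch: the contrast is easy to reproduce with Python's standard-library zlib. The log lines are synthetic, and os.urandom is only a stand-in for an encrypted blob, on the assumption that both look patternless to a codec.

```python
import os
import zlib


def ratio(data: bytes) -> float:
    """Compressed size divided by original size; lower means better compression."""
    return len(zlib.compress(data)) / len(data)


# Synthetic log lines (made up) with repeated service names, fields, and statuses.
log_lines = [
    f"2024-05-01T12:00:{i % 60:02d}Z service=checkout level=INFO status=200 path=/api/cart"
    for i in range(5000)
]
logs = "\n".join(log_lines).encode("utf-8")

# Random bytes of the same length stand in for encrypted data.
noise = os.urandom(len(logs))

print(f"logs:  {ratio(logs):.3f}")
print(f"noise: {ratio(noise):.3f}")
```

The logs shrink to a small fraction of their size, while the random bytes come out at roughly their original size or slightly larger because of framing overhead.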

Lossy compression

Lossy codecs reduce size by discarding information. A JPEG can blur tiny color differences. A video codec can reuse similar frames. An audio codec can remove sounds most people will not perceive.

That is powerful, but the workload decides whether it is acceptable. A profile photo can lose detail. A signed PDF, source archive, or invoice cannot. Once detail is thrown away, you cannot reconstruct the exact original.

Analogy: Lossless is folding a map smaller. Lossy is redrawing the map with fewer streets because the reader only needs highways.
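
Code sketch: quantization is one simple way to discard detail, and real lossy codecs use far more elaborate versions of the same idea. This toy Python example rounds samples to a coarse grid; the sample values and step size are made up for illustration.

```python
def quantize(samples: list[float], step: float) -> list[int]:
    """Lossy step: map each sample to the index of the nearest multiple of step."""
    return [round(s / step) for s in samples]


def dequantize(codes: list[int], step: float) -> list[float]:
    """Best-effort reconstruction; the rounding error is gone for good."""
    return [c * step for c in codes]


samples = [0.12, 0.13, 0.11, 0.52, 0.49, 0.51]
codes = quantize(samples, step=0.1)       # small integers with long repeats
restored = dequantize(codes, step=0.1)    # close to the input, but not equal

print(codes)                # [1, 1, 1, 5, 5, 5]
print(restored == samples)  # False
```

The codes are cheaper to store and compress well afterwards, but no decoder can tell whether a restored 0.1 started life as 0.11, 0.12, or 0.13.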

CPU vs bytes

Compression lives on a curve. Higher levels usually produce smaller output, but consume more CPU and wall-clock time. Lower levels produce bigger output but keep latency predictable.

In plain terms: the best codec is the one that reduces the actual bottleneck. If network egress is expensive, compress harder. If CPU is maxed and traffic stays inside a fast private network, choose a faster codec or skip compression.

Situation | Better choice
Public JSON API | gzip or zstd, moderate level
Kafka-like event pipeline | Snappy, LZ4, or zstd fast mode
Cold backups | zstd at a higher level, if restore latency is acceptable
Already compressed images | Skip generic compression

Tiny example: A warehouse export might compress overnight at a high zstd level because nobody waits on it. A p99-sensitive API route should use a moderate level so one large response does not steal CPU from live requests.
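
Code sketch: you can measure the curve yourself. This Python example uses the standard-library zlib as a stand-in for whichever codec you actually run (zstd is not assumed to be installed), and the event records are made up. Timings depend on the machine, so treat the output as a relative comparison only.

```python
import json
import time
import zlib


def measure(raw: bytes, level: int) -> tuple[int, float]:
    """Return (compressed_size, seconds) for one zlib compression level."""
    start = time.perf_counter()
    packed = zlib.compress(raw, level)
    elapsed = time.perf_counter() - start
    assert zlib.decompress(packed) == raw  # every level is still lossless
    return len(packed), elapsed


# Made-up event records, roughly like an analytics export.
raw = json.dumps([
    {"event": "page_view", "user": f"user-{i % 500}", "ms": (i * 37) % 1000}
    for i in range(20_000)
]).encode("utf-8")

for level in (1, 6, 9):
    size, seconds = measure(raw, level)
    print(f"level {level}: {size} bytes in {seconds * 1000:.1f} ms")
```

If the higher level saves only a few percent while taking several times longer, a moderate level is likely the better fit for a latency-sensitive route.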

HTTP in practice

Browsers advertise what they understand with Accept-Encoding. Servers respond with Content-Encoding, and the browser decompresses before JavaScript sees the body.

GET /api/feed HTTP/1.1
Accept-Encoding: gzip, br, zstd
 
HTTP/1.1 200 OK
Content-Type: application/json
Content-Encoding: gzip

The operational detail is caching: compressed and uncompressed variants must not be mixed. The origin sends Vary: Accept-Encoding so caches and CDNs store each variant separately, and a client that cannot decode a format never receives bytes it cannot read.
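
Code sketch: the negotiation can be outlined in a few lines of Python. The function names are hypothetical, the server here only knows gzip, and the parser is deliberately simplified (it ignores the * wildcard and identity handling), so treat it as an illustration rather than a complete HTTP implementation.

```python
import gzip


def parse_accept_encoding(header: str) -> dict[str, float]:
    """Parse 'gzip, br;q=0.5' into {'gzip': 1.0, 'br': 0.5} (simplified)."""
    codings: dict[str, float] = {}
    for part in header.split(","):
        name, _, params = part.strip().partition(";")
        if not name:
            continue
        q = 1.0
        params = params.strip()
        if params.lower().startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat an unparseable weight as "not acceptable"
        codings[name.strip().lower()] = q
    return codings


def encode_response(accept_encoding: str, body: bytes) -> tuple[dict[str, str], bytes]:
    """Return (headers, body), gzip-encoded only when the client accepts gzip."""
    # Vary is sent on every response so caches keep the variants apart.
    headers = {"Content-Type": "application/json", "Vary": "Accept-Encoding"}
    if parse_accept_encoding(accept_encoding).get("gzip", 0.0) > 0:
        body = gzip.compress(body)
        headers["Content-Encoding"] = "gzip"
    headers["Content-Length"] = str(len(body))
    return headers, body


headers, wire_body = encode_response("gzip, br, zstd", b'{"items": []}' * 200)
print(headers)
```

In practice a web server or framework middleware usually does this for you; the point is that the encoding choice is driven by the request header and recorded in the response headers.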

Debug tip: compare “wire size” and “decoded size” in browser devtools. A payload can look huge after decompression but still be cheap on the network, or look small in code while costing a lot because compression was disabled by a proxy.

When to skip

Skip compression for data that is already compressed, encrypted, or too small to matter. A 600-byte JSON response may cost more in CPU and headers than it saves. A JPEG, PNG, MP4, .zip, or encrypted backup is already dense.

In plain terms: compression cannot beat randomness. If the bytes look patternless, the codec mostly burns CPU to discover that fact.
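
Code sketch: a skip rule can be written as a small guard. The size threshold, the content-type list, and the 0.9 cutoff below are assumptions to tune from your own measurements, and should_compress is a hypothetical helper, not a library function.

```python
import os
import zlib

MIN_BYTES = 1024  # assumed threshold; tiny bodies rarely repay the CPU
SKIP_TYPES = {"image/jpeg", "image/png", "video/mp4", "application/zip"}


def should_compress(body: bytes, content_type: str, sample_size: int = 4096) -> bool:
    """Cheap heuristic: skip small, known-dense, or patternless payloads."""
    if len(body) < MIN_BYTES:
        return False
    if content_type.split(";")[0].strip().lower() in SKIP_TYPES:
        return False
    # Trial-compress a small prefix at the fastest level to detect dense data.
    sample = body[:sample_size]
    return len(zlib.compress(sample, 1)) < 0.9 * len(sample)


print(should_compress(b'{"status": "ok"}' * 200, "application/json"))   # True
print(should_compress(b'{"status": "ok"}', "application/json"))         # False
print(should_compress(os.urandom(8192), "application/octet-stream"))    # False
```

The trial compression costs a little CPU itself, so it fits best where payload types are unpredictable; for known content types the static list is usually enough.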

Also be careful with secrets in compressed responses. Compression side-channel attacks such as CRIME and BREACH exploit size differences when attacker-controlled text is compressed next to secret text. For ordinary API payloads this is not the daily concern, but auth pages and responses that carry secrets such as CSRF tokens deserve care.
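
Code sketch: the size leak is easy to observe in isolation. This simplified Python example compresses a made-up secret next to attacker-chosen text; the token value and page template are hypothetical, and real attacks recover a secret a few characters at a time rather than in one guess.

```python
import zlib

SECRET = "csrf_token=9f3a7c21e8b4"  # made-up secret sharing a response with attacker input


def response_size(attacker_text: str) -> int:
    """Compressed size of a body containing attacker text and the secret."""
    body = f"<input value='{attacker_text}'><meta content='{SECRET}'>"
    return len(zlib.compress(body.encode("utf-8")))


right = response_size("csrf_token=9f3a7c21e8b4")  # guess matches the secret
wrong = response_size("csrf_token=k2w8xq05ndmt")  # same length, wrong value

print(f"matching guess: {right} bytes, wrong guess: {wrong} bytes")
```

The matching guess compresses smaller because the codec replaces the repeated secret with a back-reference. Common mitigations include keeping secrets out of compressed bodies that also reflect user input, or disabling compression on those responses.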

Trade-offs

You gain | You pay
Lower bandwidth and storage cost | More CPU for encode/decode
Faster transfer over slow links | Possible latency spikes on large payloads
Better cache efficiency | More variants and headers to manage
Smaller logs and warehouse files | Harder debugging if files are not easily readable

Why this matters for you

Compression is one of the cheapest performance wins when used deliberately. Turn it on for text-heavy HTTP, logs, columnar data, and backups. Measure CPU, p95/p99 latency, and bytes saved; do not blindly chase the smallest possible file.

Next: Erasure Coding, a related storage-efficiency technique. Instead of compressing bytes, it cuts the overhead of full replication by rebuilding lost pieces from parity shards.

Diagram (frame 1 of 7): Run-length and dictionary coders exploit repetition. 'aaaaa' becomes a tiny token plus a count, which is why text and logs compress far better than encrypted noise.
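
Code sketch: run-length encoding is the simplest version of that idea. This toy Python encoder is for illustration only; production codecs such as gzip rely on dictionary matching plus entropy coding rather than plain run lengths.

```python
from itertools import groupby


def rle_encode(text: str) -> list[tuple[str, int]]:
    """Collapse each run of identical characters into (character, count)."""
    return [(char, len(list(run))) for char, run in groupby(text)]


def rle_decode(pairs: list[tuple[str, int]]) -> str:
    """Expand (character, count) pairs back into the original string."""
    return "".join(char * count for char, count in pairs)


encoded = rle_encode("aaaaabbbc")
print(encoded)              # [('a', 5), ('b', 3), ('c', 1)]
print(rle_decode(encoded))  # aaaaabbbc
```

On input with no repeated runs, every character becomes its own pair and the output grows, which is the small-scale version of why patternless data does not compress.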