Reduce storage cost and network bandwidth by encoding data more compactly.
If you are new here: Compression rewrites data into fewer bytes by exploiting patterns. Lossless compression returns the exact original bytes, so it works for JSON, code, logs, database pages, and backups. Lossy compression throws away detail humans probably will not notice, which is why it works for photos, audio, and video but not for invoices.
| Term | Plain meaning |
|---|---|
| Codec | Algorithm used to compress and decompress data, such as gzip, zstd, Snappy, JPEG |
| Compression ratio | Original size divided by compressed size; 10 MB down to 2 MB is a 5:1 ratio |
| Lossless | Exact original can be reconstructed |
| Lossy | Smaller output by discarding information |
| CPU trade-off | Smaller output usually requires more compute to encode or decode |
Your API returns 2 MB of JSON to mobile clients. The payload includes repeated field names, repeated status strings, and predictable numbers. Sent raw, it burns bandwidth, increases latency, and costs more on every hop. Compressed with gzip or zstd, it might shrink to 200-400 KB.
That is the basic bargain: spend CPU to move and store fewer bytes.
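As a rough illustration of that bargain, the sketch below compresses a synthetic JSON payload with Python's standard `gzip` module and reports the ratio. The record shape (`order_id`, `status`, `currency`, `amount_cents`) is invented for the example, and real payloads will compress differently.

```python
import gzip
import json

# Hypothetical payload: field names and status strings repeat in every record.
records = [
    {
        "order_id": i,
        "status": "PENDING" if i % 3 == 0 else "SHIPPED",
        "currency": "USD",
        "amount_cents": 1000 + i,
    }
    for i in range(5000)
]
raw = json.dumps(records).encode("utf-8")
compressed = gzip.compress(raw, compresslevel=6)

# Compression ratio = original size / compressed size.
ratio = len(raw) / len(compressed)
print(f"raw={len(raw)} bytes, gzip={len(compressed)} bytes, ratio={ratio:.1f}:1")

# Lossless: decompressing yields the exact original bytes.
restored = gzip.decompress(compressed)
```

The exact numbers depend on the data and the level, so treat the printed ratio as a measurement of this toy payload, not a prediction for a real API.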
In plain terms: compression is worth it when bytes are the bottleneck and the data has patterns. It is a bad deal when the data is already compressed, encrypted, tiny, or when CPU is the bottleneck.
Analogy: Vacuum-packing clothes makes the suitcase smaller, but you spend time squeezing the air out and time unpacking later. For bulky sweaters, it is great. For a brick, there is nothing to squeeze.
Tiny example: A 1 MB JSON response over a slow mobile link might take hundreds of milliseconds just to transfer. If gzip turns it into 120 KB and decompression takes 3 ms in the browser, users win.
Lossless codecs preserve the exact original bytes. That requirement matters for structured data: a changed byte in a backup, ledger export, or database page is corruption, not an acceptable approximation.
Common choices have different personalities. gzip is everywhere and safe as a compatibility default. zstd often gives better ratios at good speed. Snappy and LZ4 prioritize speed over maximum shrinkage, which is why they show up in databases and streaming systems.
In plain terms: lossless compression is pattern replacement with a perfect undo button.
Concrete sketch: Logs are excellent candidates because they repeat timestamps, service names, field names, and status strings. Encrypted blobs are terrible candidates because encryption deliberately removes visible patterns.
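A minimal way to see the difference, assuming a made-up log line and using `os.urandom` as a stand-in for an encrypted blob (good ciphertext is designed to look like random bytes):

```python
import os
import zlib

# Hypothetical log line, repeated verbatim. Real logs vary more than this,
# so expect a less dramatic result in practice.
log_line = b'2024-01-01T00:00:00Z service=checkout level=INFO status=200 msg="order accepted"\n'
logs = log_line * 2000
noise = os.urandom(len(logs))  # patternless bytes of the same length


def compressed_size(data: bytes) -> int:
    """Return the size of data after zlib compression at level 6."""
    return len(zlib.compress(data, 6))


print("logs :", len(logs), "->", compressed_size(logs))
print("noise:", len(noise), "->", compressed_size(noise))

# The "perfect undo button": a round trip returns identical bytes.
round_trip = zlib.decompress(zlib.compress(logs, 6))
```

The repetitive input shrinks to a small fraction of its size, while the random input stays roughly the same size or grows slightly because of framing overhead.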
Lossy codecs reduce size by discarding information. A JPEG can blur tiny color differences. A video codec can store only approximate differences between similar frames. An audio codec can remove sounds most people will not perceive.
That is powerful, but the workload decides whether it is acceptable. A profile photo can lose detail. A signed PDF, source archive, or invoice cannot. Once detail is thrown away, you cannot reconstruct the exact original.
Analogy: Lossless is folding a map smaller. Lossy is redrawing the map with fewer streets because the reader only needs highways.
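Real lossy codecs are far more elaborate, but the core idea of discarding detail can be sketched with plain quantization. The signal, the `STEP` value, and the helper names below are illustrative assumptions, not how JPEG or any audio codec is actually implemented.

```python
import math

# Hypothetical signal: one sine cycle sampled as integers.
samples = [round(10000 * math.sin(2 * math.pi * i / 64)) for i in range(64)]

STEP = 256  # quantization step; a larger step means fewer distinct codes and more error


def quantize(values, step):
    """Lossy encode: keep only which step-sized bucket each value falls in."""
    return [round(v / step) for v in values]


def dequantize(codes, step):
    """Decode: map each bucket index back to a representative value."""
    return [c * step for c in codes]


codes = quantize(samples, STEP)
restored = dequantize(codes, STEP)
max_error = max(abs(a - b) for a, b in zip(samples, restored))
print("max error after round trip:", max_error)
```

The decoded values stay within half a step of the originals, but the exact samples are gone for good, which is why this approach suits a waveform and not an invoice.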
Compression lives on a curve. Higher levels usually produce smaller output, but consume more CPU and wall-clock time. Lower levels produce bigger output but keep latency predictable.
In plain terms: the best codec is the one that reduces the actual bottleneck. If network egress is expensive, compress harder. If CPU is maxed and traffic stays inside a fast private network, choose a faster codec or skip compression.
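One way to see the curve on your own data is to measure size and time at a few levels. This sketch uses Python's `zlib` with a synthetic event payload (the field names and value ranges are made up). Timings vary by machine, so only the relative trend is meaningful.

```python
import json
import random
import time
import zlib

rng = random.Random(0)  # fixed seed so the payload is the same on every run
events = [
    {
        "user_id": rng.randrange(10_000),
        "event": rng.choice(["view", "click", "purchase"]),
        "latency_ms": rng.randrange(5, 500),
    }
    for _ in range(20_000)
]
payload = json.dumps(events).encode("utf-8")


def measure(level: int) -> tuple[int, float]:
    """Compress the payload at the given level; return (size in bytes, elapsed ms)."""
    start = time.perf_counter()
    out = zlib.compress(payload, level)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return len(out), elapsed_ms


results = {level: measure(level) for level in (1, 6, 9)}
for level, (size, ms) in results.items():
    print(f"level {level}: {size} bytes in {ms:.1f} ms")
```

If the highest level saves only a few percent over the middle one while taking noticeably longer, the middle level is usually the better fit for latency-sensitive paths.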
| Situation | Better choice |
|---|---|
| Public JSON API | gzip or zstd, moderate level |
| Kafka-like event pipeline | Snappy, LZ4, or zstd fast mode |
| Cold backups | zstd higher level if restore latency is acceptable |
| Already compressed images | Skip generic compression |
Tiny example: A warehouse export might compress overnight at a high zstd level because nobody waits on it. A p99-sensitive API route should use a moderate level so one large response does not steal CPU from live requests.
Browsers advertise what they understand with Accept-Encoding. Servers respond with Content-Encoding, and the browser decompresses before JavaScript sees the body.
GET /api/feed HTTP/1.1
Accept-Encoding: gzip, br, zstd

HTTP/1.1 200 OK
Content-Type: application/json
Content-Encoding: gzip

The operational detail is caching: compressed and uncompressed variants must not be mixed. Responses should carry Vary: Accept-Encoding so that CDNs and other caches store each encoding separately, and a client that cannot decode a format does not receive bytes it cannot read.
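The negotiation above can be sketched as a small server-side helper. The names `parse_accept_encoding` and `encode_response` are hypothetical. The sketch supports gzip only and ignores the `*` wildcard and `identity` rules that a production server should handle.

```python
import gzip

SUPPORTED = ("gzip",)  # encodings this hypothetical server can produce


def parse_accept_encoding(header: str) -> dict[str, float]:
    """Parse 'gzip, br;q=0.8' into {'gzip': 1.0, 'br': 0.8}."""
    prefs: dict[str, float] = {}
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, params = part.partition(";")
        q = 1.0  # default weight when no q-value is given
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat a malformed weight as "not acceptable"
        prefs[name.strip().lower()] = q
    return prefs


def encode_response(body: bytes, accept_encoding: str) -> tuple[dict[str, str], bytes]:
    """Return (headers, body), gzip-compressed only if the client accepts gzip."""
    prefs = parse_accept_encoding(accept_encoding)
    # Always send Vary so caches keep encoded and plain variants apart.
    headers = {"Vary": "Accept-Encoding"}
    for coding in SUPPORTED:
        if prefs.get(coding, 0.0) > 0:
            headers["Content-Encoding"] = coding
            return headers, gzip.compress(body)
    return headers, body


body = b'{"items": []}' * 200
gz_headers, gz_body = encode_response(body, "gzip, br, zstd")
plain_headers, plain_body = encode_response(body, "br")
```

A client that lists only `br` gets the uncompressed body here, because this sketch cannot produce Brotli. Both responses still carry `Vary: Accept-Encoding`.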
Debug tip: compare “wire size” and “decoded size” in browser devtools. A payload can look huge after decompression but still be cheap on the network, or look small in code while costing a lot because compression was disabled by a proxy.
Skip compression for data that is already compressed, encrypted, or too small to matter. A 600-byte JSON response may cost more in CPU and headers than it saves. A JPEG, PNG, MP4, .zip, or encrypted backup is already dense.
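Those rules translate naturally into a small policy check. The 1024-byte threshold, the media-type list, and the `should_compress` name are assumptions for illustration; pick real values from your own measurements.

```python
MIN_SIZE_BYTES = 1024  # assumed cutoff below which compression is not worth it

# Media types that are already compressed (illustrative, not exhaustive).
ALREADY_DENSE = {"image/jpeg", "image/png", "video/mp4", "application/zip"}


def should_compress(content_type: str, size_bytes: int, encrypted: bool = False) -> bool:
    """Decide whether generic compression is likely to pay off for a payload."""
    if encrypted or size_bytes < MIN_SIZE_BYTES:
        return False
    # Drop parameters such as '; charset=utf-8' before the lookup.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type not in ALREADY_DENSE


print(should_compress("application/json; charset=utf-8", 2_000_000))
print(should_compress("application/json", 600))
print(should_compress("image/jpeg", 500_000))
```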
In plain terms: compression cannot beat randomness. If the bytes look patternless, the codec mostly burns CPU to discover that fact.
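A quick experiment makes the point for already-compressed data: compress a payload once, then compress the result again. The payload below is synthetic (hypothetical field names, seeded random values), so the exact sizes are illustrative only.

```python
import json
import random
import zlib

rng = random.Random(1)  # fixed seed for a repeatable payload
rows = [
    {"sku": rng.randrange(100_000), "qty": rng.randrange(1, 50), "region": rng.choice(["eu", "us", "ap"])}
    for _ in range(20_000)
]
text = json.dumps(rows).encode("utf-8")

once = zlib.compress(text, 9)   # first pass removes the visible patterns
twice = zlib.compress(once, 9)  # second pass has almost nothing left to exploit

print("original:", len(text))
print("once    :", len(once))
print("twice   :", len(twice))
```

The first pass does the useful work. The second pass spends CPU for little or no gain, which is the same reason generic compression rarely helps JPEG, MP4, or .zip files.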
Also be careful with secrets in compressed responses. Compression side-channel attacks such as CRIME and BREACH exploit size differences when attacker-controlled text is compressed next to secret text. For ordinary API payloads this is not the daily concern, but auth pages and cross-origin secrets deserve care.
| You gain | You pay |
|---|---|
| Lower bandwidth and storage cost | More CPU for encode/decode |
| Faster transfer over slow links | Possible latency spikes on large payloads |
| Better cache efficiency | More variants and headers to manage |
| Smaller logs and warehouse files | Harder debugging if files are not easily readable |
Compression is one of the cheapest performance wins when used deliberately. Turn it on for text-heavy HTTP, logs, columnar data, and backups. Measure CPU, p95/p99 latency, and bytes saved; do not blindly chase the smallest possible file.
Next: Erasure Coding is a related storage efficiency technique — instead of compressing bytes, it reduces the overhead of replication by reconstructing lost pieces from parity shards.
Run-length and dictionary coders exploit repetition — ‘aaaaa’ becomes a tiny token plus count; text and logs compress far better than encrypted noise.
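Run-length encoding is simple enough to sketch in a few lines. This is a toy version that stores (character, count) pairs, with hypothetical function names; production codecs combine ideas like this with dictionary matching and entropy coding.

```python
from itertools import groupby


def rle_encode(text: str) -> list[tuple[str, int]]:
    """Collapse each run of identical characters into a (character, count) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]


def rle_decode(pairs: list[tuple[str, int]]) -> str:
    """Expand (character, count) pairs back into the original string."""
    return "".join(ch * count for ch, count in pairs)


encoded = rle_encode("aaaaa")
print(encoded)  # [('a', 5)]
print(rle_decode(encoded))  # aaaaa
```

On input without runs, such as `"abc"`, this encoding emits one pair per character and ends up larger than the input, which mirrors why patternless data does not compress.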