123 · BLOOM FILTER · PROBABILISTIC · SET

Bloom Filter

Probabilistic set membership — fast 'definitely not / maybe' with zero false negatives.

If you are new here: A Bloom filter is a space-efficient probabilistic data structure that answers one question: "is this element in the set?" — very quickly, using remarkably little memory. The catch: it can have false positives (it might say "maybe yes" for an element that's actually not in the set), but it never has false negatives (if it says "definitely not," that's guaranteed correct). This asymmetry — certain "no", uncertain "yes" — turns out to be enormously useful in systems where you want to avoid an expensive operation (a database lookup, a disk read, a network call) for elements that are definitely not in a set. Bloom filters power Google Chrome's Safe Browsing (checking URLs against malicious site lists), Cassandra and RocksDB's SSTable lookup optimization, cache layers (don't cache-miss to the DB if we know the key doesn't exist), and spam filters.

Term	Plain meaning
Bloom filter	A compact bit array + k hash functions for probabilistic set membership
Bit array	An array of m bits, all initialized to 0
Hash function	A function that maps any input to a position in the bit array
False positive	Filter says "maybe in set," but element is not actually in the set
False negative	Filter says "not in set," but element IS in the set — Bloom filters never do this
FPR	False Positive Rate — probability that a non-member is reported as "maybe present"
m	Number of bits in the filter — larger m = lower FPR
k	Number of hash functions — optimal k = (m/n) × ln(2) where n = expected elements

The Problem

You're building a web browser that needs to warn users before they visit malicious websites. Google maintains a list of 500 million malicious URLs. Checking every URL against a remote server would be too slow (a network call per navigation) and reveal your browsing history to Google.

You want to check the list locally, but storing 500 million URLs takes ~50 GB — too much for a browser. And a full lookup in a 50 GB database takes too long for real-time URL checking.

What if you could represent "is this URL in the list?" in just 1 GB (2× smaller) and check any URL in microseconds, with the only downside being a small probability of a false alarm (which triggers a server-side confirmation check)?

That's exactly what Bloom filters do. Chrome's Safe Browsing uses one.

In plain terms: a Bloom filter is a very compact checklist with a trick. It can say "this is definitely not on the list" (skip the expensive check) or "this might be on the list" (do the expensive check to confirm). The "might be" answers cost you a small number of unnecessary expensive checks; the "definitely not" answers save you the expensive check every time — and most URLs are safe.

Analogy: a spell-checker's word list. Instead of storing the full dictionary (400,000 words, 5 MB), you store a bloom filter of the dictionary (40,000 bits, 5 KB). If the filter says "not a word" — highlight it red immediately. If the filter says "maybe a word" — do a slower exact lookup to confirm. Since most user-typed strings are in the dictionary, you skip the slow lookup most of the time; only unknown words (rare) trigger it.

The Bit Array

A Bloom filter is just two things:

1. A bit array of m bits: initialized to all zeros, representing an empty set. For 500M URLs with a 1% FPR target, you'd use about 5 billion bits = 625 MB. Tiny compared to the raw data.

2. k independent hash functions: each function maps any input string to a position between 0 and m-1 in the bit array. Hash functions are fast (SHA256 or MurmurHash computed in microseconds) and their outputs are uniformly distributed across the array.

Common k values: 3 to 10 hash functions. The optimal k that minimizes false positive rate given m and n (number of elements) is k = (m/n) × ln(2). For m=10 bits per element, optimal k ≈ 7.

Memory calculation: at 10 bits per element with k=7 hash functions, a Bloom filter for n=500M URLs uses:

500M × 10 bits = 5 billion bits = 625 MB
vs ~50 GB to store the raw URLs
80× memory reduction

In plain terms: the bit array is like a bingo card where hash functions mark cells. If all cells for a number are marked, it was probably in the "draw." If any cell is unmarked, it was definitely not.

Inserting an Element

Adding element x to the Bloom filter:

Compute all k hash functions: h1(x), h2(x), ..., hk(x)
Set the bit at each position to 1

def add(bloom_filter, bits, x):
    for i in range(k):
        pos = hash_function[i](x) % m
        bits[pos] = 1

Example: adding "spam.com" with k=3 hash functions:

h1("spam.com") → position 3 → set bit 3 = 1
h2("spam.com") → position 11 → set bit 11 = 1
h3("spam.com") → position 7 → set bit 7 = 1

Add 500 million more malicious URLs the same way. Many bits get set; some positions get set by multiple different URLs (that's fine — we never unset bits).

Bits are never reset to 0: once a bit is set to 1, it stays 1 forever. This is why Bloom filters can't support deletions — you can't unset bit 7 for "spam.com" because bit 7 might have also been set by another URL.

Looking Up: "Definitely Not" is Free

Checking if element x might be in the set:

Compute all k hash functions: h1(x), ..., hk(x)
Check the bit at each position
If any bit is 0 → element is definitely not in the set (guaranteed, no false negatives)
If all bits are 1 → element is possibly in the set (may be a false positive)

def might_contain(bits, x):
    for i in range(k):
        pos = hash_function[i](x) % m
        if bits[pos] == 0:
            return False  # DEFINITELY NOT in set
    return True  # MAYBE in set

Why no false negatives? If element x was actually added, we set bits at h1(x), h2(x), ..., hk(x) to 1. Those bits can only ever go from 0→1, never 1→0. So when we look up x later, all those bits will still be 1. If any bit is 0, x was never added — guaranteed.

In plain terms: if you signed the guestbook at the party (added to the filter), your bits are definitely set. If one of your bits is unset, you definitely weren't there. No question.

False Positives: The Acceptable Price

The false positive rate is the probability that a non-member passes the lookup. It happens because the bits set by one element can coincidentally match the hash positions for a different, unrelated element.

Formula: FPR ≈ (1 - e^(-kn/m))^k

Where:

n = number of elements added
m = number of bits
k = number of hash functions

FPR examples for n=1M elements:

m/n (bits per element)	k	FPR
6 bits	4	5.6%
10 bits	7	0.8%
16 bits	11	0.046%
23 bits	16	0.001%

Handling false positives in practice: a false positive just means you do the expensive operation (DB lookup, disk read) unnecessarily. In a URL safety system, a false positive means you send an unnecessary server-side confirmation check. The system is still correct — you never skip a check you should do (no false negatives), you just occasionally do an extra check you didn't need to (false positive).

Filling up: as you add more elements, the FPR increases because more bits get set. When about half the bits are 1, the FPR becomes unacceptably high. The solution: size your filter appropriately for your expected n, or use a Counting Bloom filter (replaces bits with counters, allows approximate deletion).

Tiny example: Cassandra uses Bloom filters per SSTable (sorted string table). Before doing an expensive disk seek into an SSTable to check if a key exists, Cassandra checks the Bloom filter for that SSTable. If the filter says "definitely not" — skip the SSTable entirely. With a 1% FPR, Cassandra skips 99% of unnecessary disk reads for missing keys, dramatically improving read performance.

Real-World Uses

Beyond the examples above:

Google BigTable and LevelDB/RocksDB: each level of the LSM tree has a Bloom filter. When looking up a key, check the filter for each level — if "definitely not," skip that level's disk read. Avoids most unnecessary I/O for lookups of missing keys.

PostgreSQL and MySQL (more recent versions): can use Bloom filters to speed up certain join and index operations.

Browser safe browsing (Chrome, Firefox): daily-updated Bloom filter downloaded to the browser. ~1 MB representing millions of malicious URLs. Most URLs clear the filter immediately; those that don't trigger a confirmation server call.

Distributed systems deduplication: message queues and stream processors use Bloom filters to track which message IDs have been processed, avoiding expensive DB lookups for deduplication.

The Trade-offs

Property	Hash set	Bloom filter
Memory per element	40–60 bytes (hash + pointer)	6–23 bits
Lookup time	O(1)	O(k) — a few nanoseconds
False negatives	None	None (guaranteed)
False positives	None	Yes (tunable)
Deletion	Supported	Not supported (standard)
Best for	Exact membership, small sets	Approximate membership, huge sets

Why this matters for you

Bloom filters are one of the most practically useful probabilistic data structures. Any time you have: (1) a very large set, (2) frequent membership queries, (3) an expensive consequence for a miss — consider a Bloom filter as a "fast pre-check" layer. Canonical implementations exist in every major language. Redis has a built-in Bloom filter module. For most use cases, 10 bits per element with k=7 hash functions (FPR ≈ 1%) is the right starting configuration.

Next: LSM Tree — the write-optimized tree structure used in Cassandra, RocksDB, and LevelDB — and why it's designed for write-heavy workloads.

DIAGRAMDrag nodes · pan · pinch or double-click to zoom

FRAME 1 OF 6

Set membership problem — before an expensive DB lookup, can we quickly check 'definitely not in set' to skip it entirely?