040 · EMBEDDINGS · SIMILARITY · AI

Vector Database

Store and search high-dimensional embeddings for AI and similarity search.

If you are new here: a vector is a long list of numbers (e.g. 768 floats) that represents the meaning of text, images, or audio, produced by an embedding model. A vector database finds the stored vectors closest to a query vector, which is called “nearest neighbor search” in high dimensions.

Step        What happens
1. Ingest   Chunk documents → call embedding API → store vector + metadata
2. Query    Embed user question → find k nearest vectors
3. Use      Pass chunks to an LLM (RAG) or rank search results

The Problem

Traditional search matches keywords. Users ask “how do we scale reads?”, and documents that never contain that exact phrase might still be relevant. Embeddings turn text into vectors so we can search by meaning, using distance in high-dimensional space.

In plain terms: a vector database stores points in meaning-space and answers “what is closest?” fast — the backbone of semantic search and many RAG apps.

Tiny example: The phrases “scale reads” and “handle more traffic” might be close in vector space even though they share no words.

The query becomes a vector; we want the k stored vectors closest by cosine or Euclidean distance — “nearest neighbors” in hundreds or thousands of dimensions.
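
To make the two distance measures concrete, here is a minimal sketch of exact (brute-force) top-k search in plain Python. The three-dimensional vectors and the document ids are hypothetical stand-ins; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between a and b; 0.0 means identical."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k_by_cosine(query, stored, k):
    """Exact search: score every stored vector, keep the k most similar."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in stored.items()]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]

# Hypothetical toy "embeddings" (assumption: invented for illustration).
stored = {
    "scale-reads": [0.9, 0.1, 0.0],
    "handle-traffic": [0.8, 0.2, 0.1],
    "refund-policy": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]
results = top_k_by_cosine(query, stored, k=2)
print(results)
```

Cosine similarity ignores vector length and compares direction only, while Euclidean distance is sensitive to magnitude. If vectors are normalized to unit length, both produce the same ranking.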

Similarity search is useful because “closest” can mean semantically related, not textually identical. It retrieves by meaning-like geometry rather than exact keyword overlap.

Tiny example: “refund policy” can retrieve a document titled “returns and exchanges” because the embedding model placed them near each other.

The score is not truth; it is a ranking signal. You still need thresholds, reranking, metadata filters, and evaluation sets to know whether “near” means useful for your users.

Approximate nearest neighbor (ANN)

Exact brute force over billions of vectors is too slow. ANN indexes trade a little accuracy for much faster search; memory cost depends on the index type (graph indexes such as HNSW add overhead on top of the vectors, while quantized indexes can shrink them). Common structures include HNSW (a graph where similar vectors are linked as neighbors) and IVF (which clusters the space into cells so search only checks nearby cells). You tune recall (what fraction of true nearest neighbors you find) vs latency.

In plain terms: ANN returns very good neighbors quickly, not guaranteed perfect neighbors slowly. You evaluate whether the speed/recall trade is good enough for the product.
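
The recall/latency trade can be seen in a toy IVF-style index. This is a sketch under stated assumptions, not how any particular library implements IVF: the centroids are hand-picked (real systems typically learn them with k-means), the data is two-dimensional, and `nprobe` is the name used here for the number of cells searched.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_top_k(query, vectors, k):
    """Brute force: rank every vector by distance to the query."""
    return sorted(vectors, key=lambda vid: dist(query, vectors[vid]))[:k]

def build_cells(vectors, centroids):
    """Assign each vector id to the cell of its nearest centroid."""
    cells = {i: [] for i in range(len(centroids))}
    for vid, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: dist(vec, centroids[i]))
        cells[nearest].append(vid)
    return cells

def ivf_top_k(query, vectors, centroids, cells, k, nprobe):
    """Search only the nprobe cells whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda i: dist(query, centroids[i]))[:nprobe]
    candidates = [vid for i in probe for vid in cells[i]]
    return sorted(candidates, key=lambda vid: dist(query, vectors[vid]))[:k]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true nearest neighbors that the approximate search found."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# Hypothetical data: six points on a line, three hand-picked centroids.
vectors = {
    "a": (0.0, 0.0), "b": (1.0, 0.0), "c": (4.0, 0.0),
    "d": (6.0, 0.0), "e": (9.0, 0.0), "f": (10.0, 0.0),
}
centroids = [(0.5, 0.0), (5.0, 0.0), (9.5, 0.0)]
cells = build_cells(vectors, centroids)
query = (2.6, 0.0)

exact = exact_top_k(query, vectors, k=2)
fast = ivf_top_k(query, vectors, centroids, cells, k=2, nprobe=1)
wider = ivf_top_k(query, vectors, centroids, cells, k=2, nprobe=2)
print(recall_at_k(fast, exact), recall_at_k(wider, exact))
```

With one cell probed, the search misses the true nearest neighbor because it sits just across a cell boundary; probing two cells recovers it at the cost of scanning more candidates. That is the knob you tune against latency.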

Semantic retrieval

Build support docs, internal wikis, or catalogs where users ask in natural language — the vector index finds chunks that read similarly even if the wording differs.

The unit of retrieval matters. Whole documents may be too large; tiny chunks may lose context. Good RAG systems spend surprising effort on chunk size, metadata, and reranking.

A strong retrieval layer also preserves provenance. The app should know which document, section, tenant, version, and permission boundary produced each chunk before it shows or sends it to a model.
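
One way to keep provenance attached to every chunk is to carry it in the chunk record itself. The sketch below assumes fixed-size character windows with overlap, which is only one of many chunking strategies; the field names, default sizes, and sample values are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    text: str
    doc_id: str
    section: str
    tenant_id: str
    version: int
    start: int  # character offset of the chunk within the source text

def chunk_text(text, doc_id, section, tenant_id, version, size=200, overlap=50):
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(Chunk(text[start:start + size], doc_id, section, tenant_id, version, start))
        if start + size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

# Hypothetical 450-character document.
chunks = chunk_text("x" * 450, doc_id="returns-policy", section="Refunds",
                    tenant_id="acme", version=3)
print([(c.start, len(c.text)) for c in chunks])
```

The overlap lets a sentence that straddles a boundary appear whole in at least one chunk. A real system would likely add permission fields as well and split on sentence or heading boundaries rather than raw character counts.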

RAG pattern

Retrieval-augmented generation: fetch context from the vector DB, paste it into the LLM prompt, and generate an answer. This reduces hallucinations when the retrieved context is relevant and accurate.

Sketch:

User question → embed → top-k chunks from vector DB → prompt + chunks → LLM → answer

The vector database is the memory lookup layer, not the reasoning layer. Bad chunks, stale embeddings, or missing permissions still produce bad answers even if nearest-neighbor search is fast.
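
The “prompt + chunks” step of the sketch could look roughly like this. It assumes the chunks have already been retrieved and covers neither the embedding call nor the LLM call; the prompt wording, the `max_chars` budget, and the sample chunks are hypothetical.

```python
def build_prompt(question, chunks, max_chars=2000):
    """Assemble a prompt from retrieved chunks, keeping source labels for citations."""
    parts = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] ({chunk['source']}) {chunk['text']}"
        if used + len(entry) > max_chars:
            break  # stop before the context budget is exceeded
        parts.append(entry)
        used += len(entry)
    context = "\n".join(parts)
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

# Hypothetical retrieved chunks, already ordered by relevance.
retrieved = [
    {"source": "returns-and-exchanges.md", "text": "Refunds are issued within 14 days."},
    {"source": "shipping.md", "text": "Orders ship in 2 business days."},
]
prompt = build_prompt("What is the refund policy?", retrieved)
print(prompt)
```

Numbering each chunk and keeping its source in the prompt makes it possible to ask the model for citations and to trace a bad answer back to the chunk that caused it.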

RAG works best when retrieval is treated as a product surface: measure answer quality, inspect missed documents, refresh embeddings when content changes, and show citations when users need trust.

Combine filters (tenant id, date range, SKU) with vector ranking — narrow with structured queries, rank with embeddings.

Hybrid search often beats pure vector search because exact constraints still matter. You may need to filter on tenant_id, language, product category, access permissions, or recency before semantic ranking.

In practice, many systems combine BM25 keyword search, structured filters, vector similarity, and a final reranker. Each stage narrows the candidate set so the expensive semantic scoring happens where it helps most.

Hybrid design also protects access control. Apply hard filters like tenant and permissions before ranking so the model never receives a chunk the user should not see.
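
A minimal sketch of filter-then-rank, assuming an in-memory list of records and a dot-product score (which matches cosine ranking only if vectors are normalized). The record layout and the group-based permission check are hypothetical; production stores push these filters into the index query itself.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def search(query_vec, records, tenant_id, user_groups, k):
    """Apply hard filters first, then rank only the survivors by similarity."""
    allowed = [
        r for r in records
        if r["tenant_id"] == tenant_id and r["groups"] & user_groups
    ]
    allowed.sort(key=lambda r: dot(query_vec, r["vector"]), reverse=True)
    return [r["id"] for r in allowed[:k]]

# Hypothetical records: "b" and "c" score highest but must never be returned
# to a support user at tenant "acme".
records = [
    {"id": "a", "tenant_id": "acme", "groups": {"support"}, "vector": [0.9, 0.1]},
    {"id": "b", "tenant_id": "acme", "groups": {"finance"}, "vector": [1.0, 0.0]},
    {"id": "c", "tenant_id": "other", "groups": {"support"}, "vector": [1.0, 0.0]},
    {"id": "d", "tenant_id": "acme", "groups": {"support", "finance"}, "vector": [0.2, 0.8]},
]
hits = search([1.0, 0.0], records, tenant_id="acme", user_groups={"support"}, k=2)
print(hits)
```

Because the filter runs before ranking, a highly similar chunk from another tenant or a restricted group cannot displace permitted results or leak into the prompt.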

Ecosystem

Options include Pinecone, Weaviate, Qdrant, Chroma, and pgvector in Postgres. Choose based on managed vs self-hosted operation, SQL affinity, and latency SLOs.

If vectors are close to transactional data and scale is moderate, pgvector can be enough. If you need large indexes, high QPS, distributed ANN tuning, or managed operations, a dedicated vector store may earn its keep.

Also consider migration cost. Embedding dimensions, ANN parameters, metadata filtering, backup format, and multi-tenant isolation can make a later move more expensive than the first prototype suggests.

Prototype with the simplest store that proves quality, but record the scale assumptions: vector count, QPS, latency target, metadata filters, and reindex time. Those numbers tell you when to graduate.
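
A quick back-of-the-envelope calculation helps when recording those assumptions. The sketch below estimates memory for the raw vectors only, assuming 4-byte float32 values; the one-million vector count is a hypothetical example, and index structures and metadata add to the total.

```python
def raw_vector_bytes(vector_count, dimensions, bytes_per_value=4):
    """Memory for the raw vectors alone (float32 by default), excluding index overhead and metadata."""
    return vector_count * dimensions * bytes_per_value

# Hypothetical corpus: one million chunks embedded at 768 dimensions.
size = raw_vector_bytes(1_000_000, 768)
gib = size / 2**30
print(f"{size} bytes ≈ {gib:.2f} GiB")
```

That works out to about 2.86 GiB before any index overhead, which fits comfortably on one machine. Multiplying the vector count by 100 does not, and that is the kind of threshold worth writing down early.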

Trade-offs

You still own                             The DB does not fix
Chunking, permissions, evals              Bad source documents
Embedding refresh when content changes    Hallucinations if context is wrong

Also plan for re-embedding. Changing models, chunking rules, or source documents means old vectors may become stale or incomparable with new vectors.
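
One possible way to detect stale vectors is to store a content hash, the model name, and the chunking rule alongside each vector and compare them at refresh time. The field names, model names, and chunking label below are hypothetical.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of the source text that was embedded."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_reembedding(record, current_text, current_model, current_chunking):
    """True if the stored vector no longer matches the source text, model, or chunking rules."""
    return (
        record["content_hash"] != content_hash(current_text)
        or record["model"] != current_model
        or record["chunking"] != current_chunking
    )

# Hypothetical metadata stored next to one vector at ingest time.
record = {
    "content_hash": content_hash("Refunds are issued within 14 days."),
    "model": "embed-v1",               # hypothetical model name
    "chunking": "size200-overlap50",   # hypothetical chunking rule label
}

print(needs_reembedding(record, "Refunds are issued within 14 days.", "embed-v1", "size200-overlap50"))
print(needs_reembedding(record, "Refunds are issued within 30 days.", "embed-v1", "size200-overlap50"))
```

A model change is the expensive case: vectors from different models are generally not comparable, so every record needs re-embedding, not just the ones whose text changed.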

Why this matters for you

Vector search is one layer in an AI stack: you still need fresh embeddings, chunking strategy, permissions, and evaluation — the database does not fix bad content.

Next: Full-Text Search is the keyword-search counterpart. Production search systems often combine both: BM25 for exact matches, vector similarity for semantic ranking.

A model maps text or images into a dense vector — nearby vectors mean semantically similar content in embedding space.