
Vector Databases

Bit: A vector database is just a database where you search by "vibes" instead of exact values. "Find me something similar to this" is literally the query.


★ TL;DR

  • What: Databases optimized for storing and searching high-dimensional vectors (embeddings) using similarity metrics
  • Why: The backbone of RAG, semantic search, recommendation systems — any time you need "find similar things"
  • Key point: Traditional DBs search by exact match. Vector DBs search by meaning/similarity using distance metrics.

★ Overview

Definition

A Vector Database stores data as high-dimensional vectors (embeddings) and enables fast approximate nearest-neighbor (ANN) search. When you embed text/images into vectors using an embedding model, a vector DB lets you find the most similar items efficiently.

Scope

Covers vector DB concepts, comparison of major options, and when to use what. For how vector DBs fit into RAG pipelines, see Rag.

Significance

  • Essential component of every RAG system
  • Growing from "GenAI niche tool" to mainstream data infrastructure
  • Understanding internals (indexing algorithms) = deep tech knowledge

Prerequisites

  • Understanding of embeddings (text → dense vector)
  • Basic Rag concepts

★ Deep Dive

How It Works

TRADITIONAL DATABASE:                    VECTOR DATABASE:
  SELECT * FROM docs                       "Find documents similar to
  WHERE title = 'attention'                 this query about attention"

  → Exact match                            → Semantic similarity
  → Returns: only docs titled              → Returns: docs ABOUT attention
    "attention"                              even if word isn't in title

HOW:
  1. Text → [Embedding Model] → Vector [0.12, -0.45, 0.89, ..., 0.33]
                                         (768-3072 dimensions)
  2. Store vector + metadata in Vector DB
  3. Query → Embed → Find nearest vectors → Return results
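The three steps above, sketched end to end with a stand-in embedding function (a toy vocabulary-count vector; a real pipeline would call an embedding model such as text-embedding-3-small):

```python
import numpy as np

# Toy vocabulary-count "embedding" -- a stand-in for a real model,
# purely illustrative
VOCAB = ["attention", "transformer", "retrieval", "context", "lora", "rank"]

def toy_embed(text: str) -> np.ndarray:
    words = text.lower().split()
    vec = np.array([float(words.count(w)) for w in VOCAB])
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Step 2: store vector + metadata side by side
store = [
    {"vector": toy_embed("transformer attention mechanism"),
     "text": "transformer attention mechanism", "meta": {"topic": "arch"}},
    {"vector": toy_embed("retrieval with context"),
     "text": "retrieval with context", "meta": {"topic": "rag"}},
]

# Step 3: embed the query, rank by similarity (unit vectors, so dot
# product == cosine similarity), return the nearest item
query = toy_embed("how does attention work")
best = max(store, key=lambda item: float(item["vector"] @ query))
print(best["text"])  # → transformer attention mechanism
```

A real vector DB does exactly this, plus persistence, metadata filters, and an ANN index so step 3 doesn't scan every stored vector.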

Similarity Metrics

| Metric | Formula | Intuition | Best For |
|---|---|---|---|
| Cosine Similarity | (A · B) / (‖A‖ ‖B‖) | Angle between vectors (ignoring magnitude) | Text embeddings (most common) |
| Euclidean (L2) | ‖A − B‖ | Straight-line distance | Image embeddings |
| Dot Product | A · B | Magnitude-aware similarity | Normalized embeddings |

Cosine Similarity:
  sim(A, B) = (A · B) / (||A|| × ||B||)

  Range: -1 (opposite) to 1 (identical)
  Rule of thumb: > 0.7 = "similar" (the useful threshold varies by embedding model and domain)
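The formula in code (numpy; toy 3-dimensional vectors for readability):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # sim(A, B) = (A · B) / (||A|| * ||B||)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
print(round(cosine_similarity(a, a), 6))      # 1.0  (identical direction)
print(round(cosine_similarity(a, -a), 6))     # -1.0 (opposite direction)
print(round(cosine_similarity(a, 2 * a), 6))  # 1.0  (magnitude is ignored)
```

The last line is the key property for text: a long document and a short query can still score 1.0 if they point the same way.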

Indexing Algorithms (How Fast Search Works)

Brute-force search (compare query against ALL vectors) is O(n). At millions of vectors, this is too slow. ANN (Approximate Nearest Neighbor) algorithms trade tiny accuracy loss for massive speed gain.

| Algorithm | How It Works | Used By | Speed vs Accuracy |
|---|---|---|---|
| HNSW | Hierarchical graph navigation | Qdrant, Weaviate, pgvector | Best accuracy, more memory |
| IVF | Cluster vectors, search nearest clusters only | FAISS, Pinecone | Good balance |
| ScaNN | Quantize + search | Google | Very fast, slight accuracy loss |
| Annoy | Random projection trees | Spotify | Fast build, OK accuracy |

HNSW (Hierarchical Navigable Small World) is the most popular — think of it as:

Layer 3: [  A  --------  B  ]           (few nodes, long-range links)
Layer 2: [  A  --  C  --  B  --  D  ]   (more nodes, medium links)
Layer 1: [  A - E - C - F - B - G - D ] (all nodes, short links)

Search: Start at top layer, navigate to approximate area,
        then refine at lower layers. O(log n) complexity.
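A minimal sketch of the greedy-navigation idea on a single layer (pure numpy; real HNSW implementations add the layer hierarchy, a beam of ef candidates, and link-selection heuristics):

```python
import numpy as np

def greedy_search(neighbors, vectors, entry, query):
    """Walk the graph: from the current node, hop to whichever
    neighbor is closest to the query; stop at a local minimum.
    This is the per-layer step HNSW repeats from the top layer down."""
    current = entry
    dist = lambda i: float(np.linalg.norm(vectors[i] - query))
    while True:
        closer = [n for n in neighbors[current] if dist(n) < dist(current)]
        if not closer:
            return current
        current = min(closer, key=dist)

# Toy layer: 5 points on a line, each linked to its immediate neighbors
vectors = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

# Starting from node 0, the walk converges on the node nearest 3.2
print(greedy_search(neighbors, vectors, entry=0, query=np.array([3.2])))  # → 3
```

The hierarchy exists so this walk takes long hops first (sparse top layers) and short refining hops last, which is what makes the whole search roughly O(log n) instead of a long crawl across one flat graph.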

Major Vector Databases Compared

| Database | Type | Strengths | Weaknesses | Best For |
|---|---|---|---|---|
| Pinecone | Managed (cloud) | Serverless, zero ops, fast start | Cost at scale, vendor lock-in | Startups, prototypes, managed preference |
| Qdrant | Self-host + cloud | Fast (Rust), rich filtering, clean API | Newer, smaller community | Production self-host, performance-critical |
| Weaviate | Self-host + cloud | Hybrid search built in, modules | Heavier resource use | When you need keyword + vector search |
| Chroma | Embedded (in-process) | Simplest setup, great for dev | Not for large-scale production | Prototyping, small datasets, local dev |
| Milvus | Self-host | Massive scale, battle-tested | Complex to operate | Very large datasets (billions of vectors) |
| pgvector | Postgres extension | Uses your existing Postgres | Limited scale, basic features | When you already have Postgres, < 1M docs |
| FAISS | Library (not a DB) | Fastest ANN, library-level control | No persistence, no API, just a library | Research, custom pipelines |

Decision Flowchart

Do you need a vector DB at all?
├── < 10K documents → Just use FAISS/numpy in memory
├── < 100K documents → pgvector (if you have Postgres) or Chroma
├── 100K - 10M documents → Qdrant, Weaviate, or Pinecone
└── > 10M documents → Milvus or Qdrant (clustered)

Do you want managed or self-hosted?
├── Managed (no ops): Pinecone, Qdrant Cloud, Weaviate Cloud
└── Self-hosted (control): Qdrant, Weaviate, Milvus (Docker)

◆ Code & Implementation

Quick Start Examples

# ⚠️ Last tested: 2026-04
# ═══ CHROMA (simplest - great for learning) ═══
import os

import chromadb
from chromadb.utils import embedding_functions

# Create client and collection
client = chromadb.Client()  # in-memory; use chromadb.PersistentClient("./db") to persist
ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],  # an OpenAI key is required for this embedding function
    model_name="text-embedding-3-small",
)
collection = client.create_collection("my_docs", embedding_function=ef)

# Add documents
collection.add(
    documents=["Transformers use attention", "RAG retrieves context", "LoRA is efficient"],
    ids=["doc1", "doc2", "doc3"],
    metadatas=[{"topic": "arch"}, {"topic": "technique"}, {"topic": "training"}]
)

# Query
results = collection.query(query_texts=["How do language models work?"], n_results=2)
print(results["documents"])  # → Most similar docs
# ⚠️ Last tested: 2026-04
# ═══ QDRANT (production-ready) ═══
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct

client = QdrantClient("localhost", port=6333)  # or QdrantClient(":memory:")

# Create collection
client.create_collection(
    collection_name="genai_notes",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE)
)

# Upsert vectors (you'd get these from an embedding model)
client.upsert(
    collection_name="genai_notes",
    points=[
        PointStruct(id=1, vector=[0.1, 0.2, ...], payload={"text": "...", "topic": "rag"}),
        PointStruct(id=2, vector=[0.3, 0.4, ...], payload={"text": "...", "topic": "lora"}),
    ]
)

# Search
results = client.query_points(
    collection_name="genai_notes",
    query=[0.15, 0.25, ...],  # query embedding (same model as the stored vectors)
    limit=5,
)
print(results.points)  # scored matches, most similar first
# ═══ DOCKER: Run Qdrant locally ═══
docker run -p 6333:6333 qdrant/qdrant

# ═══ DOCKER: Run Weaviate locally ═══
docker run -p 8080:8080 semitechnologies/weaviate

◆ Strengths vs Limitations

| ✅ Strengths | ❌ Limitations |
|---|---|
| Semantic search ("find similar" not "find exact") | Approximate — may miss some results |
| Sub-millisecond search at million-scale | Embedding quality determines search quality |
| Rich metadata filtering + vector search | Additional infra to manage |
| Growing ecosystem and tooling | Each DB has different APIs (no standard) |
| Critical for RAG, recommendations, anomaly detection | Memory-intensive (vectors are large) |

◆ Quick Reference

CHOOSING A VECTOR DB:
  Prototyping → Chroma (embedded, zero setup)
  Production (managed) → Pinecone or Qdrant Cloud
  Production (self-host) → Qdrant or Weaviate
  Already have Postgres → pgvector
  Massive scale (billions) → Milvus
  Just need a library → FAISS

KEY PARAMETERS:
  - Distance metric: Cosine (text), L2 (images)
  - Index type: HNSW (best accuracy), IVF (good balance)
  - EF (HNSW): Higher = more accurate, slower search
  - Segment size: Tune for memory vs speed

EMBEDDING DIMENSIONS:
  text-embedding-3-small: 1536
  text-embedding-3-large: 3072
  bge-m3: 1024
  nomic-embed-text: 768

○ Gotchas & Common Mistakes

  • ⚠️ Embedding model matters more than the DB: A bad embedding model with Pinecone will perform worse than a good one with Chroma.
  • ⚠️ Don't forget metadata filtering: Most queries need both vector similarity AND metadata filters (e.g., "similar to X AND category = 'tutorials'").
  • ⚠️ pgvector is good enough for most: Don't adopt a specialized vector DB if pgvector in your existing Postgres handles your scale.
  • ⚠️ Index before you search: Without building an index (HNSW/IVF), searches fall back to brute-force and become slow.
  • ⚠️ Embedding mismatch: The model that embeds documents MUST be the same model that embeds queries. Mixing models = garbage results.
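One cheap guard against the mismatch gotcha is to record the embedding model next to the collection and assert it at query time (illustrative sketch; the function and metadata names are made up):

```python
# Record which model produced the stored vectors, check before every query
collection_meta = {"embedding_model": "text-embedding-3-small", "dim": 1536}

def check_query_model(query_model: str, query_dim: int) -> None:
    """Refuse to run a query embedded with a different model or dimension
    than the one the collection was built with."""
    if query_model != collection_meta["embedding_model"]:
        raise ValueError(
            f"Query embedded with {query_model!r}, but collection was built "
            f"with {collection_meta['embedding_model']!r}"
        )
    if query_dim != collection_meta["dim"]:
        raise ValueError(
            f"Dimension mismatch: {query_dim} != {collection_meta['dim']}"
        )

check_query_model("text-embedding-3-small", 1536)  # passes silently
```

Some DBs catch the dimension half of this for you (a 3072-dim query against a 1536-dim collection errors out), but a same-dimension wrong model fails silently with garbage results, which is why the model-name check matters.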

○ Interview Angles

  • Q: How does approximate nearest neighbor search work?
  • A: ANN algorithms like HNSW build a graph structure where similar vectors are connected. Search enters at the sparse top layer, greedily navigates toward the query vector, and refines the result through the denser lower layers. It's roughly O(log n) vs O(n) for brute force, typically with ~95-99% recall.

  • Q: How would you choose between Pinecone and self-hosting Qdrant?

  • A: Pinecone: zero ops, serverless pricing, fast start. Qdrant self-host: lower cost at scale, data stays on your infra, more control over indexing. Decision factors: team size, data sensitivity, query volume, and operational expertise.

★ Connections

| Relationship | Topics |
|---|---|
| Builds on | Embeddings, Similarity search, Llms Overview |
| Leads to | Rag, Semantic search engines, Recommendation systems |
| Compare with | Traditional databases (SQL), Search engines (Elasticsearch), Knowledge graphs |
| Cross-domain | Information retrieval, Computational geometry (nearest neighbor) |

◆ Production Failure Modes

| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Index staleness | New documents not found in search | Ingestion pipeline lag, no real-time indexing | Streaming ingestion, write-ahead log, refresh intervals |
| Recall vs latency tradeoff | High recall requires seconds-long queries | Exact search too slow, ANN too lossy | Tune HNSW parameters (ef, M), benchmark on your data distribution |
| Embedding model lock-in | Cannot switch embedding models without full re-index | Vectors are model-specific | Abstract the embedding layer, plan for re-indexing, Matryoshka embeddings |
| Metadata filter performance | Filtered queries 10x slower than unfiltered | Poor metadata indexing, post-retrieval filtering | Pre-filtered ANN, composite indexes, partition by metadata |

◆ Hands-On Exercises

Exercise 1: Benchmark Vector DB Performance

Goal: Compare retrieval quality and latency across vector databases
Time: 30 minutes
Steps:
  1. Prepare 10K vectors from a real embedding model
  2. Index them in Chroma and a managed option (Pinecone or Qdrant)
  3. Run 100 queries; measure recall@10 and p95 latency
  4. Repeat with metadata filters and measure the impact
Expected Output: Comparison table with recall, latency, and filtered-query performance


◆ Resources

| Type | Resource | Why |
|---|---|---|
| 🔧 Hands-on | Qdrant Documentation | Excellent open-source vector DB with filtering support |
| 🔧 Hands-on | Pinecone Documentation | Managed vector DB — easiest to start with |
| 📄 Paper | Johnson et al., "Billion-scale similarity search with GPUs" (FAISS, 2017) | Foundational large-scale nearest-neighbor search |
| 📘 Book | "AI Engineering" by Chip Huyen (2025), Ch. 3 | Vector search in the context of RAG systems |

★ Sources

  • Pinecone Learning Center — https://www.pinecone.io/learn/
  • Qdrant documentation — https://qdrant.tech/documentation/
  • Weaviate documentation — https://weaviate.io/developers/weaviate
  • Chroma documentation — https://docs.trychroma.com
  • "HNSW algorithm explained" — https://www.pinecone.io/learn/series/faiss/hnsw/