11  Vector Databases Deep Dive

📍 Chapter Overview

Time: ~90 minutes | Level: Intermediate | Prerequisites: Chapter 5

“A vector database doesn’t search for keywords — it searches for meaning.”

11.1 Why Traditional Databases Fall Short

SQL databases are brilliant at exact matches: WHERE name = 'Lagos'. But ask them “find documents similar in meaning to this sentence” and they struggle. You can’t LIKE '%meaning%'.

Vector databases solve this with Approximate Nearest Neighbour (ANN) search across high-dimensional embedding spaces.
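To make "similar in meaning" concrete, here is a toy sketch (not part of the chapter's code) that ranks documents by cosine similarity between hand-made 3-dimensional vectors. Real embeddings have hundreds or thousands of dimensions, and the documents and query vector here are invented for illustration:

```python
import numpy as np

# Toy 3-dimensional "embeddings" (real models use hundreds or thousands of dims)
docs = {
    "capital of Nigeria": np.array([0.9, 0.1, 0.0]),
    "largest city in Nigeria": np.array([0.8, 0.3, 0.1]),
    "recipe for jollof rice": np.array([0.1, 0.2, 0.9]),
}
query = np.array([0.9, 0.25, 0.05])  # pretend this embeds "Nigerian cities"

def cosine(a, b):
    # Cosine similarity: 1.0 means identical direction, ~0.0 means unrelated
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank documents by similarity to the query, most similar first
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)  # the two city documents outrank the recipe
```

No keyword overlaps with "Nigerian cities", yet the geometry alone puts the recipe last: that is the entire trick a vector database industrialises.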


11.2 How Vector Databases Work

flowchart LR
    A[Your Documents] --> B[Embedding Model]
    B --> C[Vector DB<br/>Store vectors + metadata]
    D[Query Text] --> E[Embedding Model]
    E --> F[Query Vector]
    F --> G[ANN Search<br/>Find nearest vectors]
    C --> G
    G --> H[Top-K Results<br/>Most similar docs]

11.2.1 Key Operations

| Operation | Description |
| --- | --- |
| Insert | Store text + its embedding vector |
| Query | Find K nearest vectors to a query |
| Filter | Combine vector search with metadata filters |
| Update | Replace a vector and its metadata |
| Delete | Remove vectors by ID or filter |

11.3 The Vector Database Landscape

| Database | Type | Best For |
| --- | --- | --- |
| Chroma | Open source, embedded | Development, prototyping |
| Pinecone | Managed cloud | Production at scale |
| Weaviate | Open source / cloud | Multi-modal, hybrid search |
| Qdrant | Open source / cloud | High performance, filtering |
| pgvector | PostgreSQL extension | If you already use Postgres |
| FAISS | Library (Meta) | Research, offline use |

11.4 Working with ChromaDB (Start Here)

import os

import chromadb
from chromadb.utils import embedding_functions

# Initialise Chroma (local, no server needed)
client = chromadb.Client()  # In-memory
# client = chromadb.PersistentClient(path="./chroma_db")  # On disk

# Set up OpenAI embeddings (key read from the environment, never hard-coded)
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small"
)

# Create a collection (like a table); use cosine distance so that
# 1 - distance below is a meaningful similarity score
collection = client.create_collection(
    name="ai_course_notes",
    embedding_function=openai_ef,
    metadata={"hnsw:space": "cosine"}
)

# Insert documents
documents = [
    "LLMs use transformer architecture with attention mechanisms",
    "Embeddings convert text into high-dimensional vectors",
    "RAG combines retrieval with generation for accurate answers",
    "LangChain provides composable components for AI pipelines",
    "Prompt engineering improves AI output quality dramatically"
]

collection.add(
    documents=documents,
    ids=[f"doc_{i}" for i in range(len(documents))],
    metadatas=[{"chapter": i + 1, "topic": "AI Fundamentals"} for i in range(len(documents))]
)

print(f"Stored {collection.count()} documents")

# Query the collection
results = collection.query(
    query_texts=["How do neural networks process language?"],
    n_results=3
)

print("\nTop 3 relevant documents:")
for doc, distance in zip(
    results["documents"][0],
    results["distances"][0]
):
    # Chroma returns distances (lower = closer); with cosine space,
    # 1 - distance is a similarity score in [-1, 1]
    print(f"  Score: {1 - distance:.3f} | {doc}")
The same workflow works from R via reticulate:

library(reticulate)

# Use ChromaDB via Python
chromadb <- import("chromadb")
chroma_ef <- import("chromadb.utils.embedding_functions")

# Create client and collection
client <- chromadb$Client()

openai_ef <- chroma_ef$OpenAIEmbeddingFunction(
  api_key = Sys.getenv("OPENAI_API_KEY"),
  model_name = "text-embedding-3-small"
)

collection <- client$create_collection(
  name = "ai_course_notes",
  embedding_function = openai_ef
)

# Add documents
documents <- c(
  "LLMs use transformer architecture with attention mechanisms",
  "Embeddings convert text into high-dimensional vectors",
  "RAG combines retrieval with generation for accurate answers"
)

collection$add(
  documents = documents,
  ids = paste0("doc_", seq_along(documents) - 1),
  metadatas = lapply(seq_along(documents), function(i) list(chapter = i))
)

# Query
results <- collection$query(
  query_texts = list("How do neural networks process language?"),
  n_results = 3L
)

cat("Top 3 results:\n")
for (i in seq_along(results$documents[[1]])) {
  cat(sprintf("  %s\n", results$documents[[1]][[i]]))
}

11.5 Metadata Filtering

One of the most powerful features — combine semantic search with structured filters:

# Find similar documents, but only from specific chapters
results = collection.query(
    query_texts=["attention mechanisms in transformers"],
    n_results=3,
    where={"chapter": {"$lte": 3}},  # Only chapters 1-3
    include=["documents", "distances", "metadatas"]
)

11.6 Indexing Strategies

Vector databases use specialised index algorithms for fast search:

| Algorithm | Speed | Accuracy | Memory |
| --- | --- | --- | --- |
| HNSW (Hierarchical NSW) | ⚡⚡⚡ | ⭐⭐⭐ | High |
| IVF (Inverted File) | ⚡⚡ | ⭐⭐⭐ | Medium |
| FLAT (Brute Force) | ⚡ | ⭐⭐⭐⭐ | Low |
| PQ (Product Quantisation) | ⚡⚡⚡ | ⭐⭐ | Very Low |

For most applications, HNSW (the default in most databases) is the right choice.
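To see what the approximate indexes are competing against, here is a sketch of FLAT (brute-force) search in NumPy: exact results, but every query must touch every stored vector, which is why graph-based indexes like HNSW win once collections grow. The sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 64))  # the whole "index" is just raw vectors
query = rng.normal(size=64)

# FLAT search: compute the distance to every stored vector, keep the K smallest.
# Cost is O(N * d) per query; HNSW replaces this full scan with a short
# greedy walk through a layered proximity graph.
dists = np.linalg.norm(vectors - query, axis=1)
top_k = np.argsort(dists)[:5]
print(top_k, dists[top_k])
```

At 10,000 vectors this is instant; at hundreds of millions, the full scan per query is the bottleneck ANN indexes exist to avoid, at the cost of occasionally missing a true nearest neighbour.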


11.7 Interactive Simulation: Vector Space Navigator

🎮 Live Simulation

Upload documents and explore them as points in a 2D semantic map. See how queries navigate to the most relevant content.


11.8 Chapter Summary

  • Vector databases store embeddings and enable semantic search
  • ChromaDB is the easiest starting point for development
  • Metadata filtering combines structured and semantic queries
  • HNSW indexing enables fast approximate nearest-neighbour search
  • Production systems typically use Pinecone, Weaviate, or Qdrant

11.9 What’s Next

Next chapter: Build a Semantic Search Engine from scratch using everything you’ve learned.


📚 Further Reading