```mermaid
flowchart LR
A[Your Documents] --> B[Embedding Model]
B --> C[Vector DB<br/>Store vectors + metadata]
D[Query Text] --> E[Embedding Model]
E --> F[Query Vector]
F --> G[ANN Search<br/>Find nearest vectors]
C --> G
G --> H[Top-K Results<br/>Most similar docs]
```
# 11 Vector Databases Deep Dive

> “A vector database doesn’t search for keywords — it searches for meaning.”
## 11.1 Why Traditional Databases Fall Short
SQL databases are brilliant at exact matches: `WHERE name = 'Lagos'`. But ask them to “find documents similar in meaning to this sentence” and they struggle. You can’t `LIKE '%meaning%'` your way to semantics.
Vector databases solve this with Approximate Nearest Neighbour (ANN) search across high-dimensional embedding spaces.
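The core idea can be sketched in a few lines of plain Python: rank documents by the cosine similarity of their embedding vectors instead of by string matching. The four-dimensional vectors below are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" (invented for illustration)
docs = {
    "Lagos is a coastal megacity":    [0.9, 0.1, 0.2, 0.0],
    "Nigeria's largest urban centre": [0.8, 0.2, 0.3, 0.1],
    "Recipe for jollof rice":         [0.1, 0.9, 0.0, 0.4],
}
query = [0.85, 0.15, 0.25, 0.05]  # pretend embedding of "big city on the coast"

# Rank documents by similarity of meaning, not by shared keywords
ranked = sorted(docs, key=lambda d: cosine_similarity(docs[d], query), reverse=True)
print(ranked[0])
```

The two city documents share no keywords with the query, yet they rank highest; that is exactly the behaviour `LIKE '%...%'` cannot provide.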
## 11.2 How Vector Databases Work

### 11.2.1 Key Operations
| Operation | Description |
|---|---|
| Insert | Store text + its embedding vector |
| Query | Find K nearest vectors to a query |
| Filter | Combine vector search with metadata filters |
| Update | Replace a vector and its metadata |
| Delete | Remove vectors by ID or filter |
## 11.3 The Vector Database Landscape
| Database | Type | Best For |
|---|---|---|
| Chroma | Open source, embedded | Development, prototyping |
| Pinecone | Managed cloud | Production at scale |
| Weaviate | Open source / cloud | Multi-modal, hybrid search |
| Qdrant | Open source / cloud | High performance, filtering |
| pgvector | PostgreSQL extension | If you already use Postgres |
| FAISS | Library (Meta) | Research, offline use |
## 11.4 Working with ChromaDB (Start Here)
```python
import os

import chromadb
from chromadb.utils import embedding_functions

# Initialise Chroma (local, no server needed)
client = chromadb.Client()  # In-memory
# client = chromadb.PersistentClient(path="./chroma_db")  # On disk

# Set up OpenAI embeddings
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small"
)

# Create a collection (like a table)
collection = client.create_collection(
    name="ai_course_notes",
    embedding_function=openai_ef
)

# Insert documents
documents = [
    "LLMs use transformer architecture with attention mechanisms",
    "Embeddings convert text into high-dimensional vectors",
    "RAG combines retrieval with generation for accurate answers",
    "LangChain provides composable components for AI pipelines",
    "Prompt engineering improves AI output quality dramatically"
]

collection.add(
    documents=documents,
    ids=[f"doc_{i}" for i in range(len(documents))],
    metadatas=[{"chapter": i + 1, "topic": "AI Fundamentals"} for i in range(len(documents))]
)

print(f"Stored {collection.count()} documents")

# Query the collection
results = collection.query(
    query_texts=["How do neural networks process language?"],
    n_results=3
)

# Chroma returns distances, not similarities: lower = more similar
print("\nTop 3 relevant documents:")
for doc, distance in zip(results["documents"][0], results["distances"][0]):
    print(f"  Distance: {distance:.3f} | {doc}")
```

The same workflow from R, via reticulate:

```r
library(reticulate)

# Use ChromaDB via Python
chromadb <- import("chromadb")
chroma_ef <- import("chromadb.utils.embedding_functions")

# Create client and collection
client <- chromadb$Client()
openai_ef <- chroma_ef$OpenAIEmbeddingFunction(
  api_key = Sys.getenv("OPENAI_API_KEY"),
  model_name = "text-embedding-3-small"
)
collection <- client$create_collection(
  name = "ai_course_notes",
  embedding_function = openai_ef
)

# Add documents
documents <- c(
  "LLMs use transformer architecture with attention mechanisms",
  "Embeddings convert text into high-dimensional vectors",
  "RAG combines retrieval with generation for accurate answers"
)
collection$add(
  documents = documents,
  ids = paste0("doc_", seq_along(documents) - 1),
  metadatas = lapply(seq_along(documents), function(i) list(chapter = i))
)

# Query
results <- collection$query(
  query_texts = list("How do neural networks process language?"),
  n_results = 3L
)
cat("Top 3 results:\n")
for (i in seq_along(results$documents[[1]])) {
  cat(sprintf("  %s\n", results$documents[[1]][[i]]))
}
```

## 11.5 Metadata Filtering
One of the most powerful features of vector databases is combining semantic search with structured metadata filters:
```python
# Find similar documents, but only from specific chapters
results = collection.query(
    query_texts=["attention mechanisms in transformers"],
    n_results=3,
    where={"chapter": {"$lte": 3}},  # Only chapters 1-3
    include=["documents", "distances", "metadatas"]
)
```

## 11.6 Indexing Strategies
Vector databases use specialised index algorithms for fast search:
| Algorithm | Speed | Accuracy | Memory |
|---|---|---|---|
| HNSW (Hierarchical Navigable Small World) | ⚡⚡⚡ | ⭐⭐⭐ | High |
| IVF (Inverted File) | ⚡⚡ | ⭐⭐⭐ | Medium |
| FLAT (Brute Force) | ⚡ | ⭐⭐⭐⭐ | Low |
| PQ (Product Quantisation) | ⚡⚡⚡ | ⭐⭐ | Very Low |
For most applications, HNSW (the default in most databases) is the right choice.
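To make the trade-off concrete, here is what the FLAT (brute-force) strategy does, sketched in NumPy over random stand-in vectors: score every stored vector against the query, then take the exact top K. This is an O(N·d) scan per query, which is precisely the cost that approximate indexes like HNSW exist to avoid.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 64)).astype(np.float32)  # stand-in for stored embeddings
query = rng.normal(size=64).astype(np.float32)

# FLAT: compare the query against every vector (no index, exact results)
distances = np.linalg.norm(vectors - query, axis=1)  # L2 distance to all 10,000 vectors
top_k = np.argsort(distances)[:5]                    # IDs of the 5 exact nearest neighbours
print(top_k)
```

An HNSW index instead walks a layered proximity graph and touches only a small fraction of the vectors. It usually returns the same neighbours in sub-linear time, but "approximate" means it is allowed to miss some, which is what the accuracy column above measures.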
## 11.8 Chapter Summary
- Vector databases store embeddings and enable semantic search
- ChromaDB is the easiest starting point for development
- Metadata filtering combines structured and semantic queries
- HNSW indexing enables fast approximate nearest-neighbour search
- Production systems typically use Pinecone, Weaviate, or Qdrant
## 11.9 What’s Next

In the next chapter, you’ll build a Semantic Search Engine from scratch using everything you’ve learned.