5  Embeddings & Vector Representations

Note📍 Chapter Overview

Time: ~75 minutes | Level: Beginner–Intermediate | Prerequisites: Chapter 4

“To an AI, meaning is a location in space.”

5.1 From Words to Numbers

Computers can’t think about words — they work with numbers. But how do you turn the meaning of “king” minus “man” plus “woman” into a calculation that gives you “queen”?

That’s exactly what embeddings do.

An embedding is a high-dimensional vector (a list of numbers) that represents the semantic meaning of text. Similar meanings cluster near each other in this vector space.

graph LR
    A["'cat'"] --> E[Embedding Model]
    B["'dog'"] --> E
    C["'automobile'"] --> E
    E --> F["[0.23, -0.45, 0.78, ...]"]
    E --> G["[0.21, -0.42, 0.81, ...]"]
    E --> H["[-0.67, 0.12, -0.34, ...]"]

5.2 Visualising Vector Space

Tip💡 Intuition

Imagine a vast 3D space. Words with similar meanings are placed close together. “Python” and “programming” are neighbors. “Python” and “snake” are also nearby (same word, different context). “Joy” and “happiness” almost overlap.

5.2.1 The Famous Word Arithmetic

king - man + woman ≈ queen
Paris - France + Germany ≈ Berlin
doctor - illness + law ≈ lawyer

This isn’t magic — it’s geometry. The relationships between concepts are encoded as directions in vector space.
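The geometry can be made concrete with a toy example. The vectors below are hand-made 4-dimensional stand-ins (not real model output; a real embedding has hundreds or thousands of dimensions), constructed so the first component encodes a gender direction:

```python
import numpy as np

# Hand-made toy "embeddings" -- hypothetical values chosen so that the
# gender direction lives in the first component.
vocab = {
    "king":  np.array([ 0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([-0.9, 0.8, 0.1, 0.0]),
    "man":   np.array([ 0.9, 0.1, 0.0, 0.0]),
    "woman": np.array([-0.9, 0.1, 0.0, 0.0]),
}

def nearest(vec, exclude=()):
    """Return the vocab word whose vector is most cosine-similar to vec."""
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    candidates = {w: v for w, v in vocab.items() if w not in exclude}
    return max(candidates, key=lambda w: cos(candidates[w], vec))

# king - man + woman lands closest to queen
result = nearest(vocab["king"] - vocab["man"] + vocab["woman"],
                 exclude=("king", "man", "woman"))
print(result)  # queen
```

With real embeddings the result vector rarely matches any word exactly, so in practice you look up the nearest neighbour, exactly as `nearest` does here.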


5.3 How Embeddings Are Created

Embedding models are neural networks trained to map text to vectors such that semantically similar text produces similar vectors.

Model                    Dimensions   Use Case
text-embedding-3-small   1,536        General purpose, cost-effective
text-embedding-3-large   3,072        Higher accuracy
sentence-transformers    384–768      Open source, local
nomic-embed-text         768          Open source, high performance

5.4 Creating Embeddings in Code

from openai import OpenAI
import numpy as np

client = OpenAI()

def get_embedding(text: str) -> list[float]:
    """Get embedding vector for a text string."""
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

def cosine_similarity(vec1: list, vec2: list) -> float:
    """Calculate cosine similarity (1 = same direction, 0 = unrelated, -1 = opposite)."""
    v1, v2 = np.array(vec1), np.array(vec2)
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Test it
texts = [
    "Machine learning is a subset of AI",
    "Deep learning uses neural networks",
    "The stock market rose 2% today",
    "Artificial intelligence powers modern chatbots"
]

embeddings = [get_embedding(t) for t in texts]

# Compare all pairs
print("Similarity Matrix:")
for i, t1 in enumerate(texts):
    for j, t2 in enumerate(texts):
        if i < j:
            sim = cosine_similarity(embeddings[i], embeddings[j])
            print(f"  {sim:.3f} | '{t1[:40]}...' vs '{t2[:40]}...'")
The same workflow in R, using httr2:

library(httr2)

get_embedding <- function(text) {
  req <- request("https://api.openai.com/v1/embeddings") |>
    req_headers(
      "Authorization" = paste("Bearer", Sys.getenv("OPENAI_API_KEY")),
      "Content-Type" = "application/json"
    ) |>
    req_body_json(list(
      input = text,
      model = "text-embedding-3-small"
    ))

  resp <- req_perform(req)
  resp_body_json(resp)$data[[1]]$embedding |> unlist()
}

cosine_similarity <- function(v1, v2) {
  sum(v1 * v2) / (sqrt(sum(v1^2)) * sqrt(sum(v2^2)))
}

texts <- c(
  "Machine learning is a subset of AI",
  "Deep learning uses neural networks",
  "The stock market rose 2% today",
  "Artificial intelligence powers modern chatbots"
)

embeddings <- lapply(texts, get_embedding)

# Compute pairwise similarities
for (i in seq_along(texts)) {
  for (j in seq_along(texts)) {
    if (i < j) {
      sim <- cosine_similarity(embeddings[[i]], embeddings[[j]])
      cat(sprintf("%.3f | %.40s vs %.40s\n", sim, texts[i], texts[j]))
    }
  }
}

5.5 Cosine Similarity: Measuring Meaning Distance

The standard way to compare two embeddings is cosine similarity — it measures the angle between two vectors.

graph LR
    A["Cosine Sim = 1.0<br/>↗↗ Same direction<br/>Identical meaning"]
    B["Cosine Sim = 0.0<br/>↗→ Perpendicular<br/>Unrelated meaning"]
    C["Cosine Sim = -1.0<br/>↗↙ Opposite<br/>Opposite meaning"]
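The three cases in the diagram can be verified numerically with simple 2-D vectors (the `cosine_similarity` helper is the same one defined in §5.4, repeated here so the snippet is self-contained):

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two vectors."""
    v1, v2 = np.array(vec1), np.array(vec2)
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

a = [1.0, 0.0]   # reference vector
b = [1.0, 0.0]   # same direction
c = [0.0, 1.0]   # perpendicular
d = [-1.0, 0.0]  # opposite direction

print(cosine_similarity(a, b))  # 1.0
print(cosine_similarity(a, c))  # 0.0
print(cosine_similarity(a, d))  # -1.0
```

Real embedding pairs rarely hit these extremes; unrelated texts typically land somewhere between 0 and the model's baseline similarity, which is why you compare scores relative to each other rather than against fixed thresholds.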


5.6 Interactive Simulation: Embedding Space Explorer

Note🎮 Live Simulation

Explore a 2D visualisation of semantic space — type words, see them plotted by meaning. No code needed.


5.7 Why Embeddings Power Modern AI

Embeddings are the backbone of:

  • Semantic search — find documents by meaning, not keywords
  • Recommendation systems — “users who liked X also liked Y”
  • RAG (Retrieval Augmented Generation) — find relevant context to inject into LLM prompts
  • Anomaly detection — documents far from their expected cluster may be unusual
  • Clustering — automatically group similar documents
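Semantic search, the first item above, is just "embed everything, then rank by cosine similarity". A minimal sketch: the document vectors below are tiny hand-made stand-ins so the example runs offline, but in a real system each would come from an embedding model such as text-embedding-3-small:

```python
import numpy as np

# Hypothetical pre-computed document embeddings (hand-made 3-D vectors
# standing in for real model output).
docs = {
    "Machine learning is a subset of AI": np.array([0.9, 0.1, 0.0]),
    "Deep learning uses neural networks": np.array([0.8, 0.2, 0.1]),
    "The stock market rose 2% today":     np.array([0.0, 0.1, 0.9]),
}

def search(query_vec, docs, top_k=2):
    """Rank documents by cosine similarity to the query embedding."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(docs.items(), key=lambda kv: cos(query_vec, kv[1]),
                    reverse=True)
    return [(text, cos(query_vec, vec)) for text, vec in ranked[:top_k]]

# A query about AI (again a hand-made embedding); the two AI documents
# outrank the stock-market one even though no keywords are compared.
for text, score in search(np.array([0.9, 0.2, 0.0]), docs):
    print(f"{score:.3f}  {text}")
```

At scale the linear scan over `docs` is replaced by an approximate nearest-neighbour index (a vector database), but the ranking principle is the same.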

5.8 Chapter Summary

  • Embeddings convert text into vectors (lists of numbers)
  • Similar meanings produce similar vectors — measurable with cosine similarity
  • Embeddings enable semantic search, RAG, and recommendations
  • Dimensionality varies by model (384–3,072 in the table above); more dimensions can capture more nuance at higher storage and compute cost

5.9 What’s Next

Chapter 6 introduces LangChain — the framework that chains LLMs, embeddings, and tools together into intelligent pipelines.


Note📚 Further Reading