```mermaid
flowchart TD
    A[Your Input Text] --> B[Tokenizer<br/>Text → Numbers]
    B --> C[Embedding Layer<br/>Numbers → Vectors]
    C --> D[Transformer Blocks<br/>🔁 Repeated N times]
    D --> E[Attention Mechanism<br/>What matters to what?]
    E --> F[Feed-Forward Layer<br/>Process & Transform]
    F --> D
    D --> G[Output Layer<br/>Vectors → Probabilities]
    G --> H[Next Token Prediction<br/>Sample from distribution]
    H --> I[Your Response]
```
4 How LLMs Work in Real Time
“To use a tool well, you must understand it. To understand an AI, you must see inside it.”
4.1 The Big Question: What is an LLM Actually Doing?
When you type a message to ChatGPT, Claude, or Gemini and get back a response that feels eerily intelligent — what is actually happening under the hood? Is the machine “thinking”? Does it “know” things?
The short answer: No — and that’s what makes it fascinating.
Large Language Models are, at their core, extraordinarily sophisticated next-token predictors. They don’t think. They don’t know. They predict, with stunning accuracy, what word (technically, what token) should come next given everything that came before.
Let’s unpack that.
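The whole loop can be sketched with a toy stand-in for the model. Here a bigram table (counts of which word follows which) plays the role of the billions of learned transformer weights — a purely illustrative sketch, with a made-up twelve-word corpus:

```python
import random
from collections import Counter, defaultdict

# Toy "training" corpus -- the model only learns which token follows which.
corpus = "the cat sat on the mat and the cat slept on the mat".split()

# A bigram table stands in for real model weights:
# counts[prev][next] = how often `next` followed `prev` in training data.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(token):
    """Return a probability distribution over possible next tokens."""
    followers = counts[token]
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()}

def generate(start, n=5, seed=0):
    """The autoregressive loop: predict, sample, append, repeat."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(n):
        dist = predict_next(out[-1])
        out.append(rng.choices(list(dist), weights=dist.values())[0])
    return " ".join(out)

print(predict_next("the"))  # {'cat': 0.5, 'mat': 0.5}
print(generate("the"))
```

A real LLM does exactly this predict-sample-append loop at inference time; the difference is that its "table" is a neural network conditioned on the entire preceding context, not just the previous word.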
4.2 Tokens: The Atoms of Language
LLMs don’t read text the way you do. They read tokens — chunks of text that may be whole words, parts of words, or punctuation.
4.2.1 Why Does This Matter?
- Cost: API calls are billed per token (input + output)
- Context window: Models have a maximum token limit per conversation
- Reasoning: Some tasks require more tokens to reason through carefully
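Because billing is per token, a rough cost estimate takes only a line of arithmetic. The rates below are hypothetical placeholders (check your provider's current pricing page):

```python
def estimate_cost(input_tokens, output_tokens, in_rate, out_rate):
    """Cost in dollars, given per-million-token rates."""
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# Hypothetical rates: $0.15 per 1M input tokens, $0.60 per 1M output tokens.
cost = estimate_cost(input_tokens=5_000, output_tokens=1_000,
                     in_rate=0.15, out_rate=0.60)
print(f"${cost:.5f}")  # $0.00135
```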
4.2.2 Try It: Tokenize a Sentence
```python
import tiktoken

# GPT-4's tokenizer
enc = tiktoken.encoding_for_model("gpt-4")

text = "Fundamentals of AI is a game-changer for professionals."
tokens = enc.encode(text)

print(f"Text: {text}")
print(f"Token count: {len(tokens)}")
print(f"Tokens: {[enc.decode([t]) for t in tokens]}")
```

```r
# No native R port of tiktoken -- call the Python library via reticulate
library(reticulate)

tiktoken <- import("tiktoken")
enc <- tiktoken$encoding_for_model("gpt-4")

text <- "Fundamentals of AI is a game-changer for professionals."
tokens <- enc$encode(text)

cat("Token count:", length(tokens), "\n")
```

4.3 The Transformer Architecture (No PhD Required)
The engine inside every major LLM is called a Transformer. Introduced by Google in the landmark 2017 paper “Attention Is All You Need”, the Transformer revolutionised how machines process language.
4.3.1 The Attention Mechanism: The Heart of Intelligence
The breakthrough insight of the Transformer is self-attention: the ability for every word in a sentence to “look at” every other word and decide how much relevance each word has to it.
Consider this sentence:
“The bank by the river was steep, unlike the financial bank on Main Street.”
A human instantly knows the first “bank” refers to a riverbank and the second to a financial institution — because we attend to context (“river”, “financial”). The attention mechanism does the same thing mathematically.
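The "mathematically" part is just dot products and a softmax. Here is a minimal sketch of scaled dot-product attention with hand-picked two-dimensional word vectors (real models learn vectors with thousands of dimensions; these tiny ones are purely illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

# Hand-picked 2-D "word vectors" -- illustrative only. Words that belong
# together point in similar directions.
VECTORS = {
    "the":   np.array([0.1, 0.1]),
    "river": np.array([1.0, 0.0]),
    "steep": np.array([0.9, 0.1]),
    "bank":  np.array([0.8, 0.2]),
}

def attention_weights(query_word, context_words):
    """Scaled dot-product attention: how strongly `query_word` attends
    to each context word (dot products, scaled, then softmaxed)."""
    q = VECTORS[query_word]
    scores = np.array([q @ VECTORS[w] for w in context_words]) / np.sqrt(q.size)
    return dict(zip(context_words, softmax(scores)))

print(attention_weights("bank", ["the", "river", "steep"]))
# "bank" attends far more to "river" and "steep" than to "the"
```

In a real transformer, the query, key, and value vectors are themselves produced by learned weight matrices, and this computation runs in parallel across many attention heads — but the core operation is the one above.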
4.4 Training vs. Inference
Understanding this distinction is crucial for using LLMs effectively.
| Phase | What Happens | When |
|---|---|---|
| Pre-training | Model reads trillions of tokens from the internet; learns language patterns | Once (very expensive) |
| Fine-tuning | Model trained on specific task/domain data | Sometimes |
| RLHF | Humans rate outputs; model learns to be helpful & safe | After pre-training |
| Inference | You send a prompt; model predicts tokens | Every time you use it |
When you call the OpenAI API, you are in the inference phase — the model’s weights are frozen, it’s just generating.
4.5 Context Windows and Memory

An LLM has no memory beyond its context window — the fixed number of tokens it can attend to in a single request. Everything you want it to "remember" must fit inside that window.
| Model | Context Window |
|---|---|
| GPT-4o | 128,000 tokens (~96,000 words) |
| Claude 3.5 Sonnet | 200,000 tokens (~150,000 words) |
| Gemini 1.5 Pro | 1,000,000 tokens (~750,000 words) |
| Llama 3.1 | 128,000 tokens |
This is why techniques like RAG (Chapter 11) and LangGraph (Chapter 13) exist — to give LLMs the illusion of long-term memory.
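A quick way to sanity-check whether a document will fit is the common rule of thumb of roughly four characters per English token. This is a rough heuristic sketch — for exact counts, use the model's actual tokenizer (e.g. tiktoken above):

```python
# Back-of-envelope check that a prompt fits in a model's context window.
# Assumes ~4 characters per English token (a rough heuristic only).
CONTEXT_WINDOWS = {               # token limits from the table above
    "gpt-4o": 128_000,
    "claude-3-5-sonnet": 200_000,
    "gemini-1.5-pro": 1_000_000,
}

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fits_in_context(text: str, model: str, reserve_for_output: int = 1_000) -> bool:
    """True if the prompt, plus room reserved for the reply, fits the window."""
    return estimate_tokens(text) + reserve_for_output <= CONTEXT_WINDOWS[model]

report = "word " * 150_000        # ~750,000 characters of input text
print(fits_in_context(report, "gpt-4o"))             # False: ~187k tokens > 128k
print(fits_in_context(report, "claude-3-5-sonnet"))  # True: fits within 200k
```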
4.6 Temperature and Sampling
When an LLM generates the next token, it doesn’t always pick the most likely one. A parameter called temperature controls the randomness.
```mermaid
graph LR
    A["Temperature = 0<br/>🥶 Deterministic<br/>Always picks most likely"]
    B["Temperature = 0.7<br/>😊 Balanced<br/>Creative but coherent"]
    C["Temperature = 2.0<br/>🔥 Random<br/>Unpredictable, creative"]
```
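Numerically, temperature simply divides the model's raw scores (logits) before the softmax: low temperature sharpens the distribution toward the top token, high temperature flattens it. A minimal sketch with made-up logits for three candidate tokens:

```python
import numpy as np

def sample_distribution(logits, temperature):
    """Turn raw model scores into next-token probabilities.

    Temperature divides the logits before softmax: low T sharpens the
    distribution toward the top token; high T flattens it.
    """
    z = np.array(logits, dtype=float) / max(temperature, 1e-8)  # T=0 ~ argmax
    e = np.exp(z - z.max())
    return e / e.sum()

logits = [4.0, 2.0, 1.0]  # hypothetical scores for three candidate tokens
for T in (0.1, 0.7, 2.0):
    print(T, np.round(sample_distribution(logits, T), 3))
```

At T = 0.1 virtually all probability mass lands on the top token; at T = 2.0 the runners-up get a real chance of being sampled — which is exactly the behaviour the API demo below makes visible.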
```python
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
client = OpenAI()

prompt = "The future of AI in business is"

for temp in [0.0, 0.7, 1.5]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=temp,
        max_tokens=50,
    )
    print(f"\nTemperature {temp}:")
    print(response.choices[0].message.content)
```

```r
library(httr2)
library(jsonlite)

make_completion <- function(prompt, temperature) {
  req <- request("https://api.openai.com/v1/chat/completions") |>
    req_headers(
      "Authorization" = paste("Bearer", Sys.getenv("OPENAI_API_KEY")),
      "Content-Type" = "application/json"
    ) |>
    req_body_json(list(
      model = "gpt-4o-mini",
      messages = list(list(role = "user", content = prompt)),
      temperature = temperature,
      max_tokens = 50L
    ))
  resp <- req_perform(req)
  resp_body_json(resp)$choices[[1]]$message$content
}

prompt <- "The future of AI in business is"
for (temp in c(0.0, 0.7, 1.5)) {
  cat(sprintf("\nTemperature %.1f:\n", temp))
  cat(make_completion(prompt, temp), "\n")
}
```

4.7 Interactive Simulation: LLM Token Predictor
4.8 What LLMs Are (and Aren’t)
| LLMs ARE | LLMs ARE NOT |
|---|---|
| Pattern recognisers trained on text | Databases of facts |
| Excellent at language tasks | Calculators (they hallucinate math) |
| Able to reason via chain-of-thought | Conscious or sentient |
| Context-sensitive responders | Connected to the internet (by default) |
| Generalisable to many tasks | Always correct |
4.9 Chapter Summary
In this chapter you learned that:
- LLMs process text as tokens and predict the next token
- The Transformer architecture — with its attention mechanism — enables context understanding
- Training (learning from data) is separate from inference (generating responses)
- Temperature controls creativity vs. determinism
- Context windows define the model’s working memory
4.10 What’s Next
In the next chapter, we dive into Embeddings & Vector Representations — how AI converts meaning into mathematics, enabling machines to understand semantic similarity.