4  How LLMs Work in Real Time

Note📍 Chapter Overview

Time: ~90 minutes | Level: Beginner–Intermediate | Prerequisites: Chapter 3

“To use a tool well, you must understand it. To understand an AI, you must see inside it.”

4.1 The Big Question: What is an LLM Actually Doing?

When you type a message to ChatGPT, Claude, or Gemini and get back a response that feels eerily intelligent — what is actually happening under the hood? Is the machine “thinking”? Does it “know” things?

The short answer: No — and that’s what makes it fascinating.

Large Language Models are, at their core, extraordinarily sophisticated next-token predictors. They don’t think. They don’t know. They predict, with stunning accuracy, what word (technically, what token) should come next given everything that came before.
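To make "next-token prediction" concrete, here is a deliberately tiny sketch: a bigram model that predicts the next word purely from counts of which word followed which in a toy corpus. This is a drastic simplification (a real LLM conditions on the entire context through a Transformer, not just the previous token), but the core loop — look at what came before, output the most probable continuation — is the same idea.

```python
from collections import Counter, defaultdict

# A toy "training corpus"
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count which token follows which -- a bigram "language model"
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(token):
    """Return the most frequent next token given the previous one."""
    return following[token].most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" twice; "mat"/"fish" once each
```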

Let’s unpack that.



4.2 Tokens: The Atoms of Language

LLMs don’t read text the way you do. They read tokens — chunks of text that may be whole words, parts of words, or punctuation.

ImportantKey Concept: Tokens

A token is the basic unit an LLM processes. English text averages ~0.75 words per token. The sentence “I love AI” is typically 3 tokens: I, love, AI.

4.2.1 Why Does This Matter?

  • Cost: API calls are billed per token (input + output)
  • Context window: Models have a maximum token limit per conversation
  • Reasoning: Some tasks require more tokens to reason through carefully
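The cost point is simple arithmetic once you have token counts. The sketch below uses hypothetical per-token prices purely for illustration — real rates vary by provider and model, so check your provider's current pricing page.

```python
# Hypothetical prices for illustration only -- NOT real rates.
# Check your provider's pricing page; rates differ by model.
PRICE_IN_PER_1M = 0.15   # $ per 1M input tokens (assumed)
PRICE_OUT_PER_1M = 0.60  # $ per 1M output tokens (assumed)

def call_cost(input_tokens, output_tokens):
    """Cost of one API call = input tokens * input rate + output tokens * output rate."""
    return (input_tokens * PRICE_IN_PER_1M
            + output_tokens * PRICE_OUT_PER_1M) / 1_000_000

# A 2,000-token prompt with a 500-token reply:
print(f"${call_cost(2_000, 500):.6f}")  # → $0.000600
```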

4.2.2 Try It: Tokenize a Sentence

import tiktoken

# GPT-4's tokenizer
enc = tiktoken.encoding_for_model("gpt-4")

text = "Fundamentals of AI is a game-changer for professionals."
tokens = enc.encode(text)

print(f"Text: {text}")
print(f"Token count: {len(tokens)}")
print(f"Tokens: {[enc.decode([t]) for t in tokens]}")
The same count in R, calling the tiktoken library through reticulate:

# tiktoken has no native R port, so call the Python library via reticulate
library(reticulate)
tiktoken <- import("tiktoken")

enc <- tiktoken$encoding_for_model("gpt-4")
text <- "Fundamentals of AI is a game-changer for professionals."
tokens <- enc$encode(text)

cat("Token count:", length(tokens), "\n")

4.3 The Transformer Architecture (No PhD Required)

The engine inside every major LLM is called a Transformer. Introduced by Google in the landmark 2017 paper “Attention Is All You Need”, the Transformer revolutionised how machines process language.

flowchart TD
    A[Your Input Text] --> B[Tokenizer<br/>Text → Numbers]
    B --> C[Embedding Layer<br/>Numbers → Vectors]
    C --> D[Transformer Blocks<br/>🔁 Repeated N times]
    D --> E[Attention Mechanism<br/>What matters to what?]
    E --> F[Feed-Forward Layer<br/>Process & Transform]
    F --> D
    D --> G[Output Layer<br/>Vectors → Probabilities]
    G --> H[Next Token Prediction<br/>Sample from distribution]
    H --> I[Your Response]

4.3.1 The Attention Mechanism: The Heart of Intelligence

The breakthrough insight of the Transformer is self-attention: the ability for every word in a sentence to “look at” every other word and decide how much relevance each word has to it.

Consider this sentence:

“The bank by the river was steep, unlike the financial bank on Main Street.”

A human instantly knows the first “bank” refers to a riverbank and the second to a financial institution — because we attend to context (“river”, “financial”). The attention mechanism does the same thing mathematically.

Tip💡 Intuition

Think of attention as a spotlight that each word casts across the entire sentence, illuminating the words most relevant to its meaning.
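The spotlight intuition can be written down in a few lines. Below is a minimal, pure-Python sketch of scaled dot-product attention (the formula from "Attention Is All You Need"): each query scores every key, the scores become weights via softmax, and the output is a weighted average of the values. The toy vectors are made up for illustration; real models use hundreds of dimensions and learned projections.

```python
import math

def softmax(xs):
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of vectors."""
    d = len(K[0])
    out = []
    for q in Q:
        # How relevant is each key to this query? (dot product, scaled)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Output = weighted average of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Two toy "tokens": each query matches its own key most strongly
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out = attention(Q, K, V)
```

Note that each output row mixes *both* value vectors, just weighted toward the best-matching one — that blending of context is exactly what lets "bank" pick up meaning from "river".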


4.4 Training vs. Inference

Understanding this distinction is crucial for using LLMs effectively.

Phase by phase:

  • Pre-training: the model reads trillions of tokens from the internet and learns language patterns. Happens once (very expensive).
  • Fine-tuning: the model is trained further on task- or domain-specific data. Happens sometimes.
  • RLHF: humans rate outputs and the model learns to be helpful and safe. Happens after pre-training.
  • Inference: you send a prompt and the model predicts tokens. Happens every time you use the model.

When you call the OpenAI API, you are in the inference phase: the model’s weights are frozen, so it generates text but learns nothing from your prompt.


4.5 Context Windows and Memory

Warning⚠️ The Goldfish Problem

LLMs have no persistent memory between conversations. Each API call is stateless — the model only “knows” what you include in the current context window.
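Concretely, statelessness means your application must resend the prior turns on every call. The sketch below is pure bookkeeping — no real API call — with a fake model function that simply reports how many messages it was shown, to make the mechanics visible.

```python
history = []

def ask(model_call, user_message):
    """Append the new turn and send the FULL history.
    The model sees only what is in this list -- nothing persists server-side."""
    history.append({"role": "user", "content": user_message})
    reply = model_call(history)  # in practice, a chat-completions request
    history.append({"role": "assistant", "content": reply})
    return reply

# Fake "model" for illustration: reports how many messages it received
fake_model = lambda msgs: f"I can see {len(msgs)} message(s)."

print(ask(fake_model, "Hello"))         # I can see 1 message(s).
print(ask(fake_model, "Still there?"))  # I can see 3 message(s).
```

Drop the `history.append` calls and the model would greet every message as if it were the first — that is the goldfish problem in two lines.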

Context window sizes for popular models:

  • GPT-4o: 128,000 tokens (~96,000 words)
  • Claude 3.5 Sonnet: 200,000 tokens (~150,000 words)
  • Gemini 1.5 Pro: 1,000,000 tokens (~750,000 words)
  • Llama 3.1: 128,000 tokens
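A practical consequence: before each call, check that your conversation (plus room for the reply) fits the window. The sketch below uses the rough ~0.75 words-per-token average from earlier in this chapter as an estimate; for exact counts you would use the model's actual tokenizer (e.g. tiktoken).

```python
CONTEXT_LIMIT = 128_000  # tokens, e.g. GPT-4o

def rough_token_count(text):
    """Very rough estimate from the ~0.75 words-per-token English average.
    Use the model's real tokenizer (e.g. tiktoken) for exact counts."""
    return int(len(text.split()) / 0.75)

def fits_in_context(messages, reply_budget=1_000):
    """Will this conversation, plus a reply of reply_budget tokens, fit?"""
    used = sum(rough_token_count(m["content"]) for m in messages)
    return used + reply_budget <= CONTEXT_LIMIT

history = [{"role": "user", "content": "Summarise this report for me."}]
print(fits_in_context(history))  # True -- far under the limit
```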

This is why techniques like RAG (Chapter 11) and LangGraph (Chapter 13) exist — to give LLMs the illusion of long-term memory.


4.6 Temperature and Sampling

When an LLM generates the next token, it doesn’t always pick the most likely one. A parameter called temperature controls the randomness.

graph LR
    A["Temperature = 0<br/>🥶 Deterministic<br/>Always picks most likely"]
    B["Temperature = 0.7<br/>😊 Balanced<br/>Creative but coherent"]
    C["Temperature = 2.0<br/>🔥 Random<br/>Unpredictable, creative"]

from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
client = OpenAI()

prompt = "The future of AI in business is"

for temp in [0.0, 0.7, 1.5]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=temp,
        max_tokens=50
    )
    print(f"\nTemperature {temp}:")
    print(response.choices[0].message.content)
The same experiment in R, calling the API directly with httr2:

library(httr2)

make_completion <- function(prompt, temperature) {
  req <- request("https://api.openai.com/v1/chat/completions") |>
    req_headers(
      "Authorization" = paste("Bearer", Sys.getenv("OPENAI_API_KEY")),
      "Content-Type" = "application/json"
    ) |>
    req_body_json(list(
      model = "gpt-4o-mini",
      messages = list(list(role = "user", content = prompt)),
      temperature = temperature,
      max_tokens = 50L
    ))

  resp <- req_perform(req)
  resp_body_json(resp)$choices[[1]]$message$content
}

prompt <- "The future of AI in business is"
for (temp in c(0.0, 0.7, 1.5)) {
  cat(sprintf("\nTemperature %.1f:\n", temp))
  cat(make_completion(prompt, temp), "\n")
}

4.7 Interactive Simulation: LLM Token Predictor

Note🎮 Live Simulation

The Shiny app below lets you visualise how an LLM “thinks” step by step — showing token probabilities and attention weights. No code required.


4.8 What LLMs Are (and Aren’t)

LLMs ARE:

  • Pattern recognisers trained on text
  • Excellent at language tasks
  • Able to reason via chain-of-thought
  • Context-sensitive responders
  • Generalisable to many tasks

LLMs ARE NOT:

  • Databases of facts
  • Calculators (they hallucinate math)
  • Conscious or sentient
  • Connected to the internet (by default)
  • Always correct

4.9 Chapter Summary

In this chapter you learned that:

  • LLMs process text as tokens and predict the next token
  • The Transformer architecture — with its attention mechanism — enables context understanding
  • Training (learning from data) is separate from inference (generating responses)
  • Temperature controls creativity vs. determinism
  • Context windows define the model’s working memory

4.10 What’s Next

In the next chapter, we dive into Embeddings & Vector Representations — how AI converts meaning into mathematics, enabling machines to understand semantic similarity.


Note📚 Further Reading