```mermaid
flowchart TD
    A[Your Input Text] --> B[Tokenizer<br/>Text → Numbers]
    B --> C[Embedding Layer<br/>Numbers → Vectors]
    C --> D[Transformer Blocks<br/>🔁 Repeated N times]
    D --> E[Attention Mechanism<br/>What matters to what?]
    E --> F[Feed-Forward Layer<br/>Process & Transform]
    F --> D
    D --> G[Output Layer<br/>Vectors → Probabilities]
    G --> H[Next Token Prediction<br/>Sample from distribution]
    H --> I[Your Response]
```
4 How LLMs Work in Real Time
“To use a tool well, you must understand it. To understand an AI, you must see inside it.”
4.1 The Big Question: What is an LLM Actually Doing?
When you type a message to ChatGPT, Claude, or Gemini and get back a response that feels eerily intelligent — what is actually happening under the hood? Is the machine “thinking”? Does it “know” things?
The short answer: No — and that’s what makes it fascinating.
Large Language Models are, at their core, extraordinarily sophisticated next-token predictors. They don’t think. They don’t know. They predict, with stunning accuracy, what word (technically, what token) should come next given everything that came before.
Let’s unpack that.
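The whole loop can be sketched with a toy stand-in for the model. Here a bigram table (counts of which word follows which) plays the role of the billions of learned transformer weights — a purely illustrative sketch, with a made-up twelve-word corpus:

```python
import random
from collections import Counter, defaultdict

# Toy "training" corpus -- the model only learns which token follows which.
corpus = "the cat sat on the mat and the cat slept on the mat".split()

# A bigram table stands in for real model weights:
# counts[prev][next] = how often `next` followed `prev` in training data.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(token):
    """Return a probability distribution over possible next tokens."""
    followers = counts[token]
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()}

def generate(start, n=5, seed=0):
    """The autoregressive loop: predict, sample, append, repeat."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(n):
        dist = predict_next(out[-1])
        out.append(rng.choices(list(dist), weights=dist.values())[0])
    return " ".join(out)

print(predict_next("the"))  # {'cat': 0.5, 'mat': 0.5}
print(generate("the"))
```

A real LLM does exactly this predict-sample-append loop at inference time; the difference is that its "table" is a neural network conditioned on the entire preceding context, not just the previous word.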
4.2 Tokens: The Atoms of Language
LLMs don’t read text the way you do. They read tokens — chunks of text that may be whole words, parts of words, or punctuation.
4.2.1 Why Does This Matter?
- Cost: API calls are billed per token (input + output)
- Context window: Models have a maximum token limit per conversation
- Reasoning: Some tasks require more tokens to reason through carefully
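Because billing is per token, a rough cost estimate takes only a line of arithmetic. The rates below are hypothetical placeholders (check your provider's current pricing page):

```python
def estimate_cost(input_tokens, output_tokens, in_rate, out_rate):
    """Cost in dollars, given per-million-token rates."""
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# Hypothetical rates: $0.15 per 1M input tokens, $0.60 per 1M output tokens.
cost = estimate_cost(input_tokens=5_000, output_tokens=1_000,
                     in_rate=0.15, out_rate=0.60)
print(f"${cost:.5f}")  # $0.00135
```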
4.2.2 Try It: Tokenize a Sentence
```python
import tiktoken

# GPT-4's tokenizer
enc = tiktoken.encoding_for_model("gpt-4")

text = "Fundamentals of AI is a game-changer for professionals."
tokens = enc.encode(text)

print(f"Text: {text}")
print(f"Token count: {len(tokens)}")
print(f"Tokens: {[enc.decode([t]) for t in tokens]}")
```

```r
# No native R port of tiktoken -- call the Python library via reticulate
library(reticulate)

tiktoken <- import("tiktoken")
enc <- tiktoken$encoding_for_model("gpt-4")

text <- "Fundamentals of AI is a game-changer for professionals."
tokens <- enc$encode(text)

cat("Token count:", length(tokens), "\n")
```

4.3 The Transformer Architecture (No PhD Required)
The engine inside every major LLM is called a Transformer. Introduced by Google in the landmark 2017 paper “Attention Is All You Need”, the Transformer revolutionised how machines process language.
4.3.1 The Attention Mechanism: The Heart of Intelligence
The breakthrough insight of the Transformer is self-attention: the ability for every word in a sentence to “look at” every other word and decide how much relevance each word has to it.
Consider this sentence:
“The bank by the river was steep, unlike the financial bank on Main Street.”
A human instantly knows the first “bank” refers to a riverbank and the second to a financial institution — because we attend to context (“river”, “financial”). The attention mechanism does the same thing mathematically.
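The "mathematically" part is just dot products and a softmax. Here is a minimal sketch of scaled dot-product attention with hand-picked two-dimensional word vectors (real models learn vectors with thousands of dimensions; these tiny ones are purely illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

# Hand-picked 2-D "word vectors" -- illustrative only. Words that belong
# together point in similar directions.
VECTORS = {
    "the":   np.array([0.1, 0.1]),
    "river": np.array([1.0, 0.0]),
    "steep": np.array([0.9, 0.1]),
    "bank":  np.array([0.8, 0.2]),
}

def attention_weights(query_word, context_words):
    """Scaled dot-product attention: how strongly `query_word` attends
    to each context word (dot products, scaled, then softmaxed)."""
    q = VECTORS[query_word]
    scores = np.array([q @ VECTORS[w] for w in context_words]) / np.sqrt(q.size)
    return dict(zip(context_words, softmax(scores)))

print(attention_weights("bank", ["the", "river", "steep"]))
# "bank" attends far more to "river" and "steep" than to "the"
```

In a real transformer, the query, key, and value vectors are themselves produced by learned weight matrices, and this computation runs in parallel across many attention heads — but the core operation is the one above.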
4.4 Training vs. Inference
Understanding this distinction is crucial for using LLMs effectively.
| Phase | What Happens | When |
|---|---|---|
| Pre-training | Model reads trillions of tokens from the internet; learns language patterns | Once (very expensive) |
| Fine-tuning | Model trained on specific task/domain data | Sometimes |
| RLHF | Humans rate outputs; model learns to be helpful & safe | After pre-training |
| Inference | You send a prompt; model predicts tokens | Every time you use it |
When you call the OpenAI API, you are in the inference phase — the model’s weights are frozen, it’s just generating.
4.5 Context Windows and Memory

An LLM has no memory beyond its context window — the fixed number of tokens it can attend to in a single request. Everything you want it to "remember" must fit inside that window.
| Model | Context Window |
|---|---|
| GPT-4o | 128,000 tokens (~96,000 words) |
| Claude 3.5 Sonnet | 200,000 tokens (~150,000 words) |
| Gemini 1.5 Pro | 1,000,000 tokens (~750,000 words) |
| Llama 3.1 | 128,000 tokens |
This is why techniques like RAG (Chapter 11) and LangGraph (Chapter 13) exist — to give LLMs the illusion of long-term memory.
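A quick way to sanity-check whether a document will fit is the common rule of thumb of roughly four characters per English token. This is a rough heuristic sketch — for exact counts, use the model's actual tokenizer (e.g. tiktoken above):

```python
# Back-of-envelope check that a prompt fits in a model's context window.
# Assumes ~4 characters per English token (a rough heuristic only).
CONTEXT_WINDOWS = {               # token limits from the table above
    "gpt-4o": 128_000,
    "claude-3-5-sonnet": 200_000,
    "gemini-1.5-pro": 1_000_000,
}

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fits_in_context(text: str, model: str, reserve_for_output: int = 1_000) -> bool:
    """True if the prompt, plus room reserved for the reply, fits the window."""
    return estimate_tokens(text) + reserve_for_output <= CONTEXT_WINDOWS[model]

report = "word " * 150_000        # ~750,000 characters of input text
print(fits_in_context(report, "gpt-4o"))             # False: ~187k tokens > 128k
print(fits_in_context(report, "claude-3-5-sonnet"))  # True: fits within 200k
```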
4.6 Temperature and Sampling
When an LLM generates the next token, it doesn’t always pick the most likely one. A parameter called temperature controls the randomness.
```mermaid
graph LR
    A["Temperature = 0<br/>🥶 Deterministic<br/>Always picks most likely"]
    B["Temperature = 0.7<br/>😊 Balanced<br/>Creative but coherent"]
    C["Temperature = 2.0<br/>🔥 Random<br/>Unpredictable, creative"]
```
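Numerically, temperature simply divides the model's raw scores (logits) before the softmax: low temperature sharpens the distribution toward the top token, high temperature flattens it. A minimal sketch with made-up logits for three candidate tokens:

```python
import numpy as np

def sample_distribution(logits, temperature):
    """Turn raw model scores into next-token probabilities.

    Temperature divides the logits before softmax: low T sharpens the
    distribution toward the top token; high T flattens it.
    """
    z = np.array(logits, dtype=float) / max(temperature, 1e-8)  # T=0 ~ argmax
    e = np.exp(z - z.max())
    return e / e.sum()

logits = [4.0, 2.0, 1.0]  # hypothetical scores for three candidate tokens
for T in (0.1, 0.7, 2.0):
    print(T, np.round(sample_distribution(logits, T), 3))
```

At T = 0.1 virtually all probability mass lands on the top token; at T = 2.0 the runners-up get a real chance of being sampled — which is exactly the behaviour the API demo below makes visible.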
```python
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
client = OpenAI()

prompt = "The future of AI in business is"

for temp in [0.0, 0.7, 1.5]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=temp,
        max_tokens=50,
    )
    print(f"\nTemperature {temp}:")
    print(response.choices[0].message.content)
```

```r
library(httr2)
library(jsonlite)

make_completion <- function(prompt, temperature) {
  req <- request("https://api.openai.com/v1/chat/completions") |>
    req_headers(
      "Authorization" = paste("Bearer", Sys.getenv("OPENAI_API_KEY")),
      "Content-Type" = "application/json"
    ) |>
    req_body_json(list(
      model = "gpt-4o-mini",
      messages = list(list(role = "user", content = prompt)),
      temperature = temperature,
      max_tokens = 50L
    ))
  resp <- req_perform(req)
  resp_body_json(resp)$choices[[1]]$message$content
}

prompt <- "The future of AI in business is"
for (temp in c(0.0, 0.7, 1.5)) {
  cat(sprintf("\nTemperature %.1f:\n", temp))
  cat(make_completion(prompt, temp), "\n")
}
```

4.7 Interactive Simulation: LLM Token Predictor
4.8 What LLMs Are (and Aren’t)
| LLMs ARE | LLMs ARE NOT |
|---|---|
| Pattern recognisers trained on text | Databases of facts |
| Excellent at language tasks | Calculators (they hallucinate math) |
| Able to reason via chain-of-thought | Conscious or sentient |
| Context-sensitive responders | Connected to the internet (by default) |
| Generalisable to many tasks | Always correct |
4.9 Chapter Summary
In this chapter you learned that:
- LLMs process text as tokens and predict the next token
- The Transformer architecture — with its attention mechanism — enables context understanding
- Training (learning from data) is separate from inference (generating responses)
- Temperature controls creativity vs. determinism
- Context windows define the model’s working memory
4.10 What’s Next
In the next chapter, we dive into Embeddings & Vector Representations — how AI converts meaning into mathematics, enabling machines to understand semantic similarity.