32 Text Analytics Fundamentals

📋 Learning Objectives

Understand the richness of unstructured text data in business contexts and the computational challenges it poses
Appreciate the linguistic and technical complexity of African languages, particularly Nigerian languages
Build a production-ready text preprocessing pipeline handling punctuation, tokenisation, stop words, and lemmatisation
Construct and interpret a document-term matrix and understand the sparsity problem
Compute TF-IDF scores to identify characteristic terms and filter noise-prone common words
Load and apply pre-trained word embeddings (Word2Vec, GloVe) to measure semantic similarity
Leverage transformer-based contextual embeddings (BERT, Sentence-BERT) for document-level representations
Apply text analytics to real Nigerian data (Senate speeches, customer reviews, social media posts)

32.1 Text as a Data Source: Richness, Scale, and Challenges

Text is ubiquitous in business. Nigerian companies generate millions of data points daily: customer service WhatsApp messages (“e don spoil, make una replace”, “very good product”), bank call-centre transcripts, social media posts, contract documents, annual reports, and supplier emails. This unstructured data harbours insights—product defects, market sentiment, regulatory risk—that numerical data alone cannot capture.

Business Use Cases:
- Customer Support: Analyse complaints and praise to identify product defects and strengths.
- Regulatory Compliance: Monitor contracts and announcements for compliance risk.
- Market Intelligence: Track social media and news for competitor activity and consumer trends.
- Investor Relations: Gauge sentiment from earnings call transcripts.
- HR Analytics: Analyse employee feedback for engagement and retention risks.

The Scale Challenge: A single bank call centre may handle on the order of 10,000 calls per day; manually reading every transcript is infeasible. Machine learning automates the analysis of thousands or millions of documents and can surface patterns that human readers would miss.

The Nigerian Language Challenge: Most NLP tools are trained largely on American English corpora. They often perform poorly on Nigerian Pidgin (“e don spoil” = “it has broken”), code-switching (mixing English and Yoruba in one sentence), and linguistic patterns unique to Nigeria. Standard models also offer limited or no support for Hausa, Yoruba, and Igbo. We address this throughout this chapter and Chapter 28.

32.2 The Text Preprocessing Pipeline

Raw text is noisy: mixed case, punctuation, numbers, URLs, repeated spaces. Before building any model, we clean the text through a standard pipeline.

Step 1: Lowercase Conversion: “Product QUALITY” and “product quality” become identical. This reduces vocabulary size, usually with little loss of meaning (an exception is case-sensitive tokens such as “US” vs “us”).

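Step 2: Remove Punctuation and Numbers: “Hello, world!” becomes “Hello world”. Numbers are often uninformative for sentiment or topic analysis (“2023” adds noise), although they can matter for prices, dates, or product codes. Note that the pipeline code in this section keeps digits.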

Step 3: Tokenisation: Split text into words or subwords. “Hello world” becomes [“Hello”, “world”]. When punctuation marks are kept as separate tokens, “Hello world. It is nice.” becomes [“Hello”, “world”, “.”, “It”, “is”, “nice”, “.”]. Tokenisation is language-dependent: whitespace splitting works reasonably well for English, but languages written without spaces between words (Chinese) and agglutinative languages (Swahili) require more sophisticated methods.
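
As a minimal sketch of the two behaviours described above, the following uses only Python's standard `re` module. The helper names `whitespace_tokens` and `regex_tokens` are illustrative assumptions, not functions from any library, and the regular expression is one simple choice among many.

```python
import re

def whitespace_tokens(text):
    """Split on runs of whitespace; punctuation stays attached to words."""
    return text.split()

def regex_tokens(text):
    """Return words (letters, digits, apostrophes) and single punctuation marks as separate tokens."""
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)

sample = "Hello world. It is nice."
print(whitespace_tokens(sample))  # ['Hello', 'world.', 'It', 'is', 'nice.']
print(regex_tokens(sample))       # ['Hello', 'world', '.', 'It', 'is', 'nice', '.']
```

Neither approach handles languages written without spaces; those typically need dictionary-based or learned segmenters.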

Step 4: Remove Stop Words: Words like “the”, “is”, “a” appear in nearly every document and carry little signal. Removing them reduces noise, but negation words should be kept (dropping “not” turns “not good” into “good”). Caution: In Nigerian Pidgin, “don”, “no”, and “go” are very frequent and tempting to treat as stop words, yet they carry meaning in context: “go” marks the future in “e go come” (“he will come”), and “no” marks negation in “e no work”.
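
The negation caveat can be sketched as follows. The stop word and negation lists here are tiny, hypothetical examples chosen for illustration; a real list would be curated for the language and domain.

```python
# Illustrative lists only (assumptions, not a curated resource).
STOP_WORDS = {"the", "is", "a", "an", "this", "it", "not", "no"}
NEGATIONS = {"not", "no"}  # assumption: these carry sentiment and should survive

def remove_stop_words(tokens, keep_negations=True):
    """Drop stop words, optionally protecting negation words from removal."""
    protected = NEGATIONS if keep_negations else set()
    return [t for t in tokens if t not in STOP_WORDS or t in protected]

tokens = "the product is not good".split()
print(remove_stop_words(tokens, keep_negations=False))  # ['product', 'good']
print(remove_stop_words(tokens, keep_negations=True))   # ['product', 'not', 'good']
```

With negations removed, the complaint reads like praise; the same applies to Pidgin “product no last”, which collapses to “product last” if “no” is dropped.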

Step 5: Stemming vs Lemmatisation:
- Stemming: Cuts word endings with rules. “running” and “runs” both become “run” (Porter Stemmer), but irregular forms such as “ran” are left unchanged. Fast but crude; it can produce non-words and conflate distinct words (“university” and “universe” both become “univers”).
- Lemmatisation: Uses linguistic knowledge and dictionaries to reduce words to their dictionary base form. “ran” → “run” (past tense), “better” → “good” (comparative). Slower but more accurate. Requires language-specific lexicons (unavailable for many African languages).
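
To make the contrast concrete, here is a deliberately tiny toy in pure Python. It is not the Porter algorithm and not a real lemmatiser: `SUFFIXES`, `LEMMAS`, `toy_stem`, and `toy_lemmatise` are hypothetical names and data invented for this sketch, which only shows rule-based suffix stripping next to dictionary lookup.

```python
# Toy illustration only: real stemmers apply many more rules, and real
# lemmatisers rely on full lexicons rather than a four-entry dictionary.
SUFFIXES = ("ing", "ed", "es", "s")          # hypothetical rule list
LEMMAS = {"ran": "run", "better": "good",    # hypothetical mini-lexicon
          "running": "run", "runs": "run"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run"
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

def toy_lemmatise(word):
    """Look the word up in the lexicon; fall back to the word itself."""
    return LEMMAS.get(word, word)

for w in ["running", "runs", "ran", "better"]:
    print(w, toy_stem(w), toy_lemmatise(w))
```

The suffix rules reach “running” and “runs” but leave “ran” and “better” untouched, whereas the lookup handles all four only because someone wrote the entries. That dependence on a lexicon is the bottleneck for languages without one.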

We implement a full pipeline and apply it to 70 synthetic Nigerian customer service comments (10 base comments, duplicated and then randomly paired).

📘 Theory: Preprocessing and Normalisation

Preprocessing shrinks the vocabulary by removing uninformative tokens and merging inflected forms of the same word. Formally, if the raw vocabulary is \(V\) and the preprocessed vocabulary is \(V'\), then \(|V'| \le |V|\), and in practice \(|V'|\) is often much smaller. This reduces the dimensionality of downstream models and speeds up training.
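
A small worked example of the vocabulary sizes, under stated assumptions: the three documents, the stop word list, and the crude plural rule below are all made up for illustration, and `raw_vocab` and `clean_vocab` are hypothetical helper names.

```python
import re

docs = [
    "Good product, GOOD price!",
    "The product is good.",
    "Products arrived; prices are fair.",
]

STOP_WORDS = {"the", "is", "are"}  # illustrative list

def raw_vocab(documents):
    """Vocabulary from naive whitespace splitting (case and punctuation kept)."""
    return {tok for doc in documents for tok in doc.split()}

def clean_vocab(documents):
    """Vocabulary after lowercasing, dropping punctuation and stop words, and a crude plural rule."""
    vocab = set()
    for doc in documents:
        for tok in re.findall(r"[a-z]+", doc.lower()):
            if tok in STOP_WORDS:
                continue
            if tok.endswith("s") and len(tok) > 3:
                tok = tok[:-1]  # "products" -> "product", "prices" -> "price"
            vocab.add(tok)
    return vocab

V, V_clean = raw_vocab(docs), clean_vocab(docs)
print(len(V), len(V_clean))  # 13 5
```

Here “Good”, “GOOD”, and “good.” count as three raw vocabulary entries but one cleaned entry, which is where most of the reduction comes from.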

For African languages, the lack of standard corpora means stop word lists and lemmatisers must be curated manually or bootstrapped from available resources.

🔑 Key Concept: Pipeline Stages

\[\text{Raw Text} \rightarrow \text{Lowercase} \rightarrow \text{Remove Punct.} \rightarrow \text{Tokenise} \rightarrow \text{Remove Stops} \rightarrow \text{Lemmatise} \rightarrow \text{Clean Tokens}\]

Each stage reduces noise and vocabulary size. The result is a list of canonical word forms ready for model input.

library(tidytext)
library(tidyverse)
library(tm)
library(SnowballC)

# Synthetic Nigerian customer service comments (mix of English and Pidgin)
set.seed(777)
comments <- c(
  "The product quality is very good. I recommend to friends.",
  "e don spoil after one week. very bad",
  "customer service is excellent, fast response",
  "e no work at all!!! waste of money",
  "good product, cheap price, fast delivery",
  "the delivery guy was rude, but product is ok",
  "e sweet well-well, this is the best!",
  "no be small thing, product no last",
  "amazing! will buy again next month",
  "terrible quality, received damaged goods"
)

# Expand to 70 comments: two copies of the 10 base comments plus 50 random pairs
comments <- rep(comments, 2)
comments <- c(comments, paste(sample(comments, 50, replace = TRUE),
                              sample(comments, 50, replace = TRUE)))

# Create dataframe
df_text <- data.frame(
  doc_id = 1:length(comments),
  text = comments,
  stringsAsFactors = FALSE
)

cat("Original Comments (first 10):\n")
#> Original Comments (first 10):
print(head(df_text, 10))
#>    doc_id                                                      text
#> 1       1 The product quality is very good. I recommend to friends.
#> 2       2                      e don spoil after one week. very bad
#> 3       3              customer service is excellent, fast response
#> 4       4                        e no work at all!!! waste of money
#> 5       5                  good product, cheap price, fast delivery
#> 6       6              the delivery guy was rude, but product is ok
#> 7       7                      e sweet well-well, this is the best!
#> 8       8                        no be small thing, product no last
#> 9       9                        amazing! will buy again next month
#> 10     10                  terrible quality, received damaged goods

# Preprocessing function
preprocess_text <- function(text) {
  # 1. Lowercase
  text <- tolower(text)

  # 2. Remove URLs
  text <- gsub("http[s]?://\\S+", "", text)

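  # 3. Remove special characters and punctuation (keep apostrophes initially).
  #    perl = TRUE is needed so that \\s inside [...] means "whitespace"; R's
  #    default regex engine reads it as the literal characters "\" and "s",
  #    which deletes every space and collapses each comment into a single token.
  text <- gsub("[^a-z0-9\\s']", "", text, perl = TRUE)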

  # 4. Tokenise (split on whitespace)
  tokens <- unlist(strsplit(text, "\\s+"))

  # 5. Remove empty tokens
  tokens <- tokens[tokens != ""]

  # 6. Remove stop words (English + custom Nigerian Pidgin)
  stop_words_english <- c("the", "a", "an", "is", "are", "am", "was", "be",
                          "been", "being", "have", "has", "had", "do", "does",
                          "did", "will", "would", "should", "could", "may",
                          "might", "must", "can", "and", "or", "but", "in",
                          "on", "at", "to", "from", "by", "for", "of", "with",
                          "it", "i", "you", "he", "she", "we", "they", "this",
                          "that", "these", "those")

  # Pidgin-specific stops (use carefully; context-dependent)
  stop_words_pidgin <- c("na", "de", "go", "bien")  # Simplified; more curated in practice

  stop_words <- c(stop_words_english, stop_words_pidgin)

  tokens <- tokens[!(tokens %in% stop_words)]

  # 7. Lemmatisation (Porter Stemmer as proxy; not true lemmatisation)
  tokens <- wordStem(tokens, language = "english")

  # 8. Remove single-character tokens
  tokens <- tokens[nchar(tokens) > 1]

  return(paste(tokens, collapse = " "))
}

# Apply preprocessing
df_text <- df_text |>
  mutate(text_clean = sapply(text, preprocess_text))

cat("\n\nPreprocessed Comments (first 10):\n")
#> 
#> 
#> Preprocessed Comments (first 10):
print(head(df_text[, c("doc_id", "text", "text_clean")], 10))
#>    doc_id                                                      text
#> 1       1 The product quality is very good. I recommend to friends.
#> 2       2                      e don spoil after one week. very bad
#> 3       3              customer service is excellent, fast response
#> 4       4                        e no work at all!!! waste of money
#> 5       5                  good product, cheap price, fast delivery
#> 6       6              the delivery guy was rude, but product is ok
#> 7       7                      e sweet well-well, this is the best!
#> 8       8                        no be small thing, product no last
#> 9       9                        amazing! will buy again next month
#> 10     10                  terrible quality, received damaged goods
#>                                       text_clean
#> 1  theproductqualityisverygoodirecommendtofriend
#> 2                   edonspoilafteroneweekverybad
#> 3          customerserviceisexcellentfastrespons
#> 4                       enoworkatallwasteofmoney
#> 5              goodproductcheappricefastdeliveri
#> 6            thedeliveryguywasrudebutproductisok
#> 7                    esweetwellwellthisisthebest
#> 8                    nobesmallthingproductnolast
#> 9                   amazingwillbuyagainnextmonth
#> 10            terriblequalityreceiveddamagedgood

# Tokenise for analysis
df_tokens <- df_text |>
  unnest_tokens(word, text_clean) |>
  filter(word != "")

# Term frequency
term_freq <- df_tokens |>
  group_by(word) |>
  summarise(frequency = n(), .groups = "drop") |>
  arrange(desc(frequency))

cat("\n\nTop 20 Most Frequent Terms:\n")
#> 
#> 
#> Top 20 Most Frequent Terms:
print(head(term_freq, 20))
#> # A tibble: 20 × 2
#>    word                                                                frequency
#>    <chr>                                                                   <int>
#>  1 enoworkatallwasteofmoneycustomerserviceisexcellentfastrespons               3
#>  2 amazingwillbuyagainnextmonth                                                2
#>  3 customerserviceisexcellentfastrespons                                       2
#>  4 edonspoilafteroneweekverybad                                                2
#>  5 edonspoilafteroneweekverybadnobesmallthingproductnolast                     2
#>  6 enoworkatallwasteofmoney                                                    2
#>  7 enoworkatallwasteofmoneyedonspoilafteroneweekverybad                        2
#>  8 enoworkatallwasteofmoneyesweetwellwellthisisthebest                         2
#>  9 esweetwellwellthisisthebest                                                 2
#> 10 goodproductcheappricefastdeliveri                                           2
#> 11 goodproductcheappricefastdeliverytheproductqualityisverygoodirecom…         2
#> 12 nobesmallthingproductnolast                                                 2
#> 13 nobesmallthingproductnolastcustomerserviceisexcellentfastrespons            2
#> 14 nobesmallthingproductnolastenoworkatallwasteofmoney                         2
#> 15 terriblequalityreceiveddamagedgood                                          2
#> 16 terriblequalityreceiveddamagedgoodsterriblequalityreceiveddamagedg…         2
#> 17 thedeliveryguywasrudebutproductisok                                         2
#> 18 theproductqualityisverygoodirecommendtofriend                               2
#> 19 theproductqualityisverygoodirecommendtofriendsthedeliveryguywasrud…         2
#> 20 amazingwillbuyagainnextmonthcustomerserviceisexcellentfastrespons           1

# Visualise
ggplot(head(term_freq, 15), aes(x = reorder(word, frequency), y = frequency)) +
  geom_col(fill = "steelblue", alpha = 0.7) +
  coord_flip() +
  labs(title = "Top 15 Most Frequent Terms in Customer Comments",
       x = "Term", y = "Frequency") +
  theme_minimal()

# Vocabulary statistics
cat("\n\nVocabulary Statistics:\n")
#> 
#> 
#> Vocabulary Statistics:
cat(sprintf("Total documents: %d\n", nrow(df_text)))
#> Total documents: 70
cat(sprintf("Total tokens (before preprocessing): %d\n",
            sum(sapply(df_text$text, function(x) length(unlist(strsplit(x, "\\s+")))))))
#> Total tokens (before preprocessing): 875
cat(sprintf("Total tokens (after preprocessing): %d\n", nrow(df_tokens)))
#> Total tokens (after preprocessing): 70
cat(sprintf("Unique terms (vocabulary size): %d\n", nrow(term_freq)))
#> Unique terms (vocabulary size): 50
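# Note: this is 1 - (unique terms / total tokens), i.e. one minus the
# type-token ratio of the cleaned corpus, reported as "vocabulary reduction"
cat(sprintf("Vocabulary reduction: %.1f%%\n",
            (1 - nrow(term_freq) / nrow(df_tokens)) * 100))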
#> Vocabulary reduction: 28.6%

# Zipf's law check
df_tokens_cum <- df_tokens |>
  group_by(word) |>
  summarise(frequency = n(), .groups = "drop") |>
  arrange(desc(frequency)) |>
  mutate(rank = 1:n(),
         cumul_freq = cumsum(frequency),
         cumul_pct = cumul_freq / sum(frequency) * 100)

cat("\nZipf's Law Check:\n")
#> 
#> Zipf's Law Check:
cat("Top 20% of terms account for:\n")
#> Top 20% of terms account for:
pct_80 <- df_tokens_cum$cumul_pct[ceiling(nrow(df_tokens_cum) * 0.2)]
cat(sprintf("%.1f%% of all word frequencies\n", pct_80))
#> 30.0% of all word frequencies

import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import nltk
import matplotlib.pyplot as plt
import re

# Download NLTK data
try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords')

# Synthetic comments
np.random.seed(777)
comments_base = [
    "The product quality is very good. I recommend to friends.",
    "e don spoil after one week. very bad",
    "customer service is excellent, fast response",
    "e no work at all!!! waste of money",
    "good product, cheap price, fast delivery",
    "the delivery guy was rude, but product is ok",
    "e sweet well-well, this is the best!",
    "no be small thing, product no last",
    "amazing! will buy again next month",
    "terrible quality, received damaged goods"
]

comments = comments_base * 2 + [
    f"{np.random.choice(comments_base)} {np.random.choice(comments_base)}"
    for _ in range(50)
]

df_text = pd.DataFrame({
    'doc_id': range(1, len(comments) + 1),
    'text': comments
})

print("Original Comments (first 10):")
#> Original Comments (first 10):
print(df_text.head(10))
#>    doc_id                                               text
#> 0       1  The product quality is very good. I recommend ...
#> 1       2               e don spoil after one week. very bad
#> 2       3       customer service is excellent, fast response
#> 3       4                 e no work at all!!! waste of money
#> 4       5           good product, cheap price, fast delivery
#> 5       6       the delivery guy was rude, but product is ok
#> 6       7               e sweet well-well, this is the best!
#> 7       8                 no be small thing, product no last
#> 8       9                 amazing! will buy again next month
#> 9      10           terrible quality, received damaged goods

# Preprocessing function
stop_words = set(stopwords.words('english'))
stop_words.update(['na', 'de', 'go', 'bien'])  # Pidgin-specific
stemmer = PorterStemmer()

def preprocess_text(text):
    # Lowercase
    text = text.lower()

    # Remove URLs
    text = re.sub(r'http[s]?://\S+', '', text)

    # Remove special chars (keep apostrophes)
    text = re.sub(r"[^a-z0-9\s']", '', text)

    # Tokenise
    tokens = text.split()

    # Remove empty strings
    tokens = [t for t in tokens if t]

    # Remove stop words
    tokens = [t for t in tokens if t not in stop_words]

    # Stemming
    tokens = [stemmer.stem(t) for t in tokens]

    # Remove single-char tokens
    tokens = [t for t in tokens if len(t) > 1]

    return ' '.join(tokens)

df_text['text_clean'] = df_text['text'].apply(preprocess_text)

print("\n\nPreprocessed Comments (first 10):")
#> 
#> 
#> Preprocessed Comments (first 10):
for i, row in df_text.head(10).iterrows():
    print(f"{row['doc_id']}: {row['text_clean']}")
#> 1: product qualiti good recommend friend
#> 2: spoil one week bad
#> 3: custom servic excel fast respons
#> 4: work wast money
#> 5: good product cheap price fast deliveri
#> 6: deliveri guy rude product ok
#> 7: sweet wellwel best
#> 8: small thing product last
#> 9: amaz buy next month
#> 10: terribl qualiti receiv damag good

# Tokenise for analysis
all_tokens = []
for text in df_text['text_clean']:
    all_tokens.extend(text.split())

term_freq = pd.Series(all_tokens).value_counts().reset_index()
term_freq.columns = ['word', 'frequency']

print("\n\nTop 20 Most Frequent Terms:")
#> 
#> 
#> Top 20 Most Frequent Terms:
print(term_freq.head(20).to_string(index=False))
#>      word  frequency
#>   product         66
#>      good         34
#>  deliveri         28
#>      last         26
#>     thing         26
#>     small         26
#>      fast         24
#>   qualiti         19
#>     cheap         15
#>     price         15
#>      rude         13
#>       guy         13
#>        ok         13
#> recommend         12
#>       buy         12
#>      amaz         12
#>    friend         12
#>     month         12
#>      next         12
#>     money         10

# Plot
plt.figure(figsize=(10, 6))
plt.barh(range(15), term_freq.head(15)['frequency'].values,
         color='steelblue', alpha=0.7)
plt.yticks(range(15), term_freq.head(15)['word'].values)
plt.xlabel('Frequency')
plt.title('Top 15 Most Frequent Terms in Customer Comments')
plt.tight_layout()
plt.show()

# Statistics
print("\n\nVocabulary Statistics:")
#> 
#> 
#> Vocabulary Statistics:
print(f"Total documents: {len(df_text)}")
#> Total documents: 70
raw_tokens = sum(len(text.split()) for text in df_text['text'])
print(f"Total tokens (before preprocessing): {raw_tokens}")
#> Total tokens (before preprocessing): 870
print(f"Total tokens (after preprocessing): {len(all_tokens)}")
#> Total tokens (after preprocessing): 533
print(f"Unique terms (vocabulary size): {len(term_freq)}")
#> Unique terms (vocabulary size): 36
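# Note: this is 1 - (unique terms / total tokens), i.e. one minus the
# type-token ratio of the cleaned corpus, reported as "vocabulary reduction"
vocab_reduction = (1 - len(term_freq) / len(all_tokens)) * 100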
print(f"Vocabulary reduction: {vocab_reduction:.1f}%")
#> Vocabulary reduction: 93.2%

📝 Section 32.2 Review Questions

Why remove stop words? Give an example where removing “not” would hurt model performance.
Explain the difference between stemming and lemmatisation. Which is faster? Which is more accurate?
In Nigerian Pidgin, “e” means “it/he/she”. Should “e” be a stop word? Why or why not?
What is vocabulary reduction, and why does it matter for document-term matrix sparsity?
How would you preprocess medical or technical documents differently from customer reviews?

32.3 Bag of Words and the Document-Term Matrix

After preprocessing, we represent each document as a vector in a high-dimensional space. The Bag of Words (BoW) representation ignores word order and grammar, treating each document as a collection of words.

The Document-Term Matrix (DTM) is a matrix where:
- Rows = documents
- Columns = unique terms (vocabulary)
- Cells = word counts (or binary presence/absence, or TF-IDF weights, covered later)

For 200 documents and 500 unique terms, the DTM has 100,000 cells. Most cells are zero (sparse matrix); a given document contains only a small fraction of all terms. Sparsity creates computational challenges but also offers opportunities for efficient storage.

📘 Theory: Sparsity and the Curse of Dimensionality

A DTM with \(n\) documents and \(m\) unique terms has density \(\rho = \text{nonzero cells} / (n \times m)\). For typical English corpora, \(\rho \approx 0.01\)–\(0.05\) (95–99% sparse). A dense 200 × 500 matrix stores 100,000 numbers; a sparse format stores only the \(\rho \times 100{,}000 \approx 1{,}000\)–\(5{,}000\) nonzero entries plus their indices, which saves most of the memory even after the index overhead.

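Sparse formats also make operations that only need to touch nonzero entries (matrix-vector products, dot products between document rows) far cheaper than their dense equivalents, which matters once the vocabulary reaches tens of thousands of terms.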
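
The density formula above can be checked by hand on a toy corpus. This sketch stores each document as a `collections.Counter` holding only its nonzero counts, which is one simple sparse representation and not the format any particular library uses; the three token lists are invented for illustration.

```python
from collections import Counter

# Already-preprocessed toy documents (hypothetical data).
docs = [
    ["good", "product", "good"],
    ["bad", "delivery"],
    ["good", "delivery", "fast"],
]

# Sparse representation: one Counter per document, nonzero counts only.
sparse_rows = [Counter(doc) for doc in docs]
vocab = sorted({term for doc in docs for term in doc})

n, m = len(docs), len(vocab)                     # 3 documents, 5 terms
nonzero = sum(len(row) for row in sparse_rows)   # cells that would be > 0
density = nonzero / (n * m)

print(vocab)  # ['bad', 'delivery', 'fast', 'good', 'product']
print(f"cells={n * m}, nonzero={nonzero}, density={density:.2f}, sparsity={1 - density:.2f}")
# cells=15, nonzero=7, density=0.47, sparsity=0.53
```

A corpus this small is far denser than real text; as the vocabulary grows, each document still uses only a handful of terms, so density falls quickly.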

🔑 Key Concept: Document-Term Matrix

An \(n \times m\) matrix \(X\) where \(X_{ij}\) = count of term \(j\) in document \(i\): \[X = \begin{pmatrix} c_{11} & c_{12} & \cdots & c_{1m} \\ c_{21} & c_{22} & \cdots & c_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ c_{n1} & c_{n2} & \cdots & c_{nm} \end{pmatrix}\] Each row is a document vector; similarity between two documents is commonly measured as the cosine of the angle between their rows.
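
As a minimal sketch of the row-wise cosine measure, the following pure-Python function works on plain lists of counts. The three document vectors and their five-term vocabulary are hypothetical, and the zero-vector guard is a design choice of this sketch rather than part of the definition.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two count vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # an empty document shares nothing with any other
    return dot / (norm_x * norm_y)

# Columns: bad, delivery, fast, good, product (hypothetical vocabulary)
doc_a = [0, 0, 0, 2, 1]
doc_b = [0, 1, 1, 1, 0]
doc_c = [1, 1, 0, 0, 0]

print(round(cosine_similarity(doc_a, doc_b), 3))  # 0.516
print(round(cosine_similarity(doc_a, doc_c), 3))  # 0.0
```

Documents A and B share only “good”, giving \(2 / (\sqrt{5}\,\sqrt{3}) \approx 0.516\); A and C share no terms, so their similarity is exactly zero. Without the guard, a document left empty by preprocessing would cause a division by zero.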

library(tidytext)
library(tidyverse)
library(tm)

# Use preprocessed text from above
# Build DTM using tidytext approach
dtm_tidy <- df_text |>
  unnest_tokens(word, text_clean) |>
  group_by(doc_id, word) |>
  summarise(count = n(), .groups = "drop") |>
  filter(word != "")

# Convert to wide format (DTM)
dtm_matrix <- dtm_tidy |>
  pivot_wider(names_from = word, values_from = count, values_fill = 0) |>
  column_to_rownames("doc_id")

cat("Document-Term Matrix Dimensions:\n")
#> Document-Term Matrix Dimensions:
cat(sprintf("Documents (rows): %d\n", nrow(dtm_matrix)))
#> Documents (rows): 70
cat(sprintf("Vocabulary (columns): %d\n", ncol(dtm_matrix)))
#> Vocabulary (columns): 50

# Sparsity
sparsity <- sum(dtm_matrix == 0) / (nrow(dtm_matrix) * ncol(dtm_matrix))
cat(sprintf("Sparsity: %.2f%%\n", sparsity * 100))
#> Sparsity: 98.00%
cat(sprintf("Non-zero entries: %d\n", sum(dtm_matrix != 0)))
#> Non-zero entries: 70

# Display subset
cat("\n\nSubset of DTM (first 5 docs, first 10 terms):\n")
#> 
#> 
#> Subset of DTM (first 5 docs, first 10 terms):
print(dtm_matrix[1:5, 1:10])
#>   theproductqualityisverygoodirecommendtofriend edonspoilafteroneweekverybad
#> 1                                             1                            0
#> 2                                             0                            1
#> 3                                             0                            0
#> 4                                             0                            0
#> 5                                             0                            0
#>   customerserviceisexcellentfastrespons enoworkatallwasteofmoney
#> 1                                     0                        0
#> 2                                     0                        0
#> 3                                     1                        0
#> 4                                     0                        1
#> 5                                     0                        0
#>   goodproductcheappricefastdeliveri thedeliveryguywasrudebutproductisok
#> 1                                 0                                   0
#> 2                                 0                                   0
#> 3                                 0                                   0
#> 4                                 0                                   0
#> 5                                 1                                   0
#>   esweetwellwellthisisthebest nobesmallthingproductnolast
#> 1                           0                           0
#> 2                           0                           0
#> 3                           0                           0
#> 4                           0                           0
#> 5                           0                           0
#>   amazingwillbuyagainnextmonth terriblequalityreceiveddamagedgood
#> 1                            0                                  0
#> 2                            0                                  0
#> 3                            0                                  0
#> 4                            0                                  0
#> 5                            0                                  0

# Most frequent terms (column sums)
term_totals <- colSums(dtm_matrix)
term_totals <- sort(term_totals, decreasing = TRUE)

cat("\n\nTop 20 Terms by Total Frequency:\n")
#> 
#> 
#> Top 20 Terms by Total Frequency:
print(head(term_totals, 20))
#>                       enoworkatallwasteofmoneycustomerserviceisexcellentfastrespons 
#>                                                                                   3 
#>                                       theproductqualityisverygoodirecommendtofriend 
#>                                                                                   2 
#>                                                        edonspoilafteroneweekverybad 
#>                                                                                   2 
#>                                               customerserviceisexcellentfastrespons 
#>                                                                                   2 
#>                                                            enoworkatallwasteofmoney 
#>                                                                                   2 
#>                                                   goodproductcheappricefastdeliveri 
#>                                                                                   2 
#>                                                 thedeliveryguywasrudebutproductisok 
#>                                                                                   2 
#>                                                         esweetwellwellthisisthebest 
#>                                                                                   2 
#>                                                         nobesmallthingproductnolast 
#>                                                                                   2 
#>                                                        amazingwillbuyagainnextmonth 
#>                                                                                   2 
#>                                                  terriblequalityreceiveddamagedgood 
#>                                                                                   2 
#>                             edonspoilafteroneweekverybadnobesmallthingproductnolast 
#>                                                                                   2 
#>                                 enoworkatallwasteofmoneyesweetwellwellthisisthebest 
#>                                                                                   2 
#>      goodproductcheappricefastdeliverytheproductqualityisverygoodirecommendtofriend 
#>                                                                                   2 
#>                    nobesmallthingproductnolastcustomerserviceisexcellentfastrespons 
#>                                                                                   2 
#>               terriblequalityreceiveddamagedgoodsterriblequalityreceiveddamagedgood 
#>                                                                                   2 
#>                                enoworkatallwasteofmoneyedonspoilafteroneweekverybad 
#>                                                                                   2 
#>   theproductqualityisverygoodirecommendtofriendsthedeliveryguywasrudebutproductisok 
#>                                                                                   2 
#>                                 nobesmallthingproductnolastenoworkatallwasteofmoney 
#>                                                                                   2 
#> theproductqualityisverygoodirecommendtofriendscustomerserviceisexcellentfastrespons 
#>                                                                                   1

# Document lengths
doc_lengths <- rowSums(dtm_matrix)

cat("\n\nDocument Length Statistics:\n")
#> 
#> 
#> Document Length Statistics:
cat(sprintf("Mean tokens per document: %.1f\n", mean(doc_lengths)))
#> Mean tokens per document: 1.0
cat(sprintf("Min tokens: %d\n", min(doc_lengths)))
#> Min tokens: 1
cat(sprintf("Max tokens: %d\n", max(doc_lengths)))
#> Max tokens: 1

# Visualization: term frequency
term_freq_viz <- data.frame(
  term = names(head(term_totals, 15)),
  frequency = as.numeric(head(term_totals, 15))
)

ggplot(term_freq_viz, aes(x = reorder(term, frequency), y = frequency)) +
  geom_col(fill = "steelblue", alpha = 0.7) +
  coord_flip() +
  labs(title = "Top 15 Terms in DTM",
       x = "Term", y = "Total Frequency") +
  theme_minimal()

# Cosine similarity between two documents
cosine_similarity <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

doc1 <- as.numeric(dtm_matrix[1, ])
doc2 <- as.numeric(dtm_matrix[2, ])
sim_12 <- cosine_similarity(doc1, doc2)

cat(sprintf("\n\nExample: Cosine Similarity between Document 1 and 2: %.3f\n", sim_12))
#> 
#> 
#> Example: Cosine Similarity between Document 1 and 2: 0.000

# Build similarity matrix for all documents
similarity_matrix <- matrix(0, nrow = nrow(dtm_matrix), ncol = nrow(dtm_matrix))
for (i in 1:nrow(dtm_matrix)) {
  for (j in i:nrow(dtm_matrix)) {
    sim <- cosine_similarity(as.numeric(dtm_matrix[i, ]),
                            as.numeric(dtm_matrix[j, ]))
    similarity_matrix[i, j] <- sim
    similarity_matrix[j, i] <- sim
  }
}

cat("\nDocument Similarity Statistics:\n")
#> 
#> Document Similarity Statistics:
upper_tri <- similarity_matrix[upper.tri(similarity_matrix)]
cat(sprintf("Mean pairwise similarity: %.3f\n", mean(upper_tri)))
#> Mean pairwise similarity: 0.009
cat(sprintf("Median pairwise similarity: %.3f\n", median(upper_tri)))
#> Median pairwise similarity: 0.000
cat(sprintf("Max (excluding diagonal): %.3f\n", max(similarity_matrix[upper.tri(similarity_matrix)])))
#> Max (excluding diagonal): 1.000

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt

# Use preprocessed text from above
vectorizer = CountVectorizer(max_features=500, min_df=1, max_df=0.9)
dtm_sparse = vectorizer.fit_transform(df_text['text_clean'])
dtm_array = dtm_sparse.toarray()

# Column names (terms)
vocab = vectorizer.get_feature_names_out()

print("Document-Term Matrix Dimensions:")
#> Document-Term Matrix Dimensions:
print(f"Documents (rows): {dtm_array.shape[0]}")
#> Documents (rows): 70
print(f"Vocabulary (columns): {dtm_array.shape[1]}")
#> Vocabulary (columns): 36

# Sparsity
sparsity = 1 - (np.count_nonzero(dtm_array) / dtm_array.size)
print(f"Sparsity: {sparsity:.2%}")
#> Sparsity: 80.52%
print(f"Non-zero entries: {np.count_nonzero(dtm_array)}")
#> Non-zero entries: 491

# Display subset
print("\n\nSubset of DTM (first 5 docs, first 10 terms):")
#> 
#> 
#> Subset of DTM (first 5 docs, first 10 terms):
print(pd.DataFrame(dtm_array[:5, :10], columns=vocab[:10]))
#>    amaz  bad  best  buy  cheap  custom  damag  deliveri  excel  fast
#> 0     0    0     0    0      0       0      0         0      0     0
#> 1     0    1     0    0      0       0      0         0      0     0
#> 2     0    0     0    0      0       1      0         0      1     1
#> 3     0    0     0    0      0       0      0         0      0     0
#> 4     0    0     0    0      1       0      0         1      0     1

# Term frequencies
term_totals = dtm_array.sum(axis=0)
term_df = pd.DataFrame({
    'term': vocab,
    'frequency': term_totals
}).sort_values('frequency', ascending=False)

print("\n\nTop 20 Terms by Total Frequency:")
#> 
#> 
#> Top 20 Terms by Total Frequency:
print(term_df.head(20).to_string(index=False))
#>      term  frequency
#>   product         66
#>      good         34
#>  deliveri         28
#>      last         26
#>     small         26
#>     thing         26
#>      fast         24
#>   qualiti         19
#>     price         15
#>     cheap         15
#>       guy         13
#>      rude         13
#>        ok         13
#>      next         12
#>       buy         12
#>      amaz         12
#> recommend         12
#>     month         12
#>    friend         12
#>     money         10

# Document statistics
doc_lengths = dtm_array.sum(axis=1)
print(f"\n\nDocument Length Statistics:")
#> 
#> 
#> Document Length Statistics:
print(f"Mean tokens per document: {doc_lengths.mean():.1f}")
#> Mean tokens per document: 7.6
print(f"Min tokens: {doc_lengths.min():.0f}")
#> Min tokens: 3
print(f"Max tokens: {doc_lengths.max():.0f}")
#> Max tokens: 12

# Visualization
fig, ax = plt.subplots(figsize=(10, 6))
ax.barh(range(15), term_df.head(15)['frequency'].values, color='steelblue', alpha=0.7)
ax.set_yticks(range(15))
ax.set_yticklabels(term_df.head(15)['term'].values)
ax.set_xlabel('Total Frequency')
ax.set_title('Top 15 Terms in DTM')
plt.tight_layout()
plt.show()


# Cosine similarity
cos_sim = cosine_similarity(dtm_array)
print(f"\n\nExample: Cosine Similarity between Document 1 and 2: {cos_sim[0, 1]:.3f}")
#> 
#> 
#> Example: Cosine Similarity between Document 1 and 2: 0.000

# Similarity statistics
upper_tri = cos_sim[np.triu_indices_from(cos_sim, k=1)]
print(f"\nDocument Similarity Statistics:")
#> 
#> Document Similarity Statistics:
print(f"Mean pairwise similarity: {upper_tri.mean():.3f}")
#> Mean pairwise similarity: 0.259
print(f"Median pairwise similarity: {np.median(upper_tri):.3f}")
#> Median pairwise similarity: 0.158
print(f"Max similarity (excluding diagonal): {upper_tri.max():.3f}")
#> Max similarity (excluding diagonal): 1.000

📝 Section 32.3 Review Questions

What is the curse of dimensionality in NLP? How does it manifest in a DTM?
If a DTM has 500 documents and 10,000 unique terms, how many cells are there? If sparsity is 95%, how many non-zero entries are there?
Why is cosine similarity appropriate for comparing document vectors (as opposed to Euclidean distance)?
If two documents have cosine similarity 0.1, are they similar or different?
How would increasing preprocessing (removing more stop words, stemming aggressively) change the DTM dimensions and sparsity?

32.4 TF-IDF: Term Frequency - Inverse Document Frequency

The Bag of Words approach counts raw occurrences. A common word like “product” might appear 1,000 times across documents, while a rare, distinctive word like “defect” appears 50 times. Raw counts over-weight common words and under-weight distinctive ones.

TF-IDF addresses this by downweighting common terms and upweighting rare, distinctive terms:

\[\text{TF-IDF}_{ij} = \text{TF}_{ij} \times \text{IDF}_j\]

where:

- \(\text{TF}_{ij}\) = term frequency: how often term \(j\) appears in document \(i\).
- \(\text{IDF}_j\) = inverse document frequency: penalises terms appearing in many documents.

\[\text{IDF}_j = \log\left(\frac{N}{df_j}\right)\]

where \(N\) is the total number of documents and \(df_j\) is the number of documents containing term \(j\).

Intuition (using natural logarithms): If a term appears in 90% of documents, \(\text{IDF} = \log(1 / 0.9) \approx 0.1\) (low weight). If a term appears in 1% of documents, \(\text{IDF} = \log(1 / 0.01) \approx 4.6\) (high weight). TF-IDF is a proxy for “information content”: rare terms tell us more about a document than common terms.
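The arithmetic in this intuition can be checked directly. The snippet below is a minimal sketch that assumes natural logarithms and writes the unsmoothed IDF in terms of the fraction of documents containing the term; the helper name `idf` is illustrative only.

```python
import math

def idf(doc_fraction: float) -> float:
    """Unsmoothed IDF, log(N / df), expressed via the fraction df / N."""
    return math.log(1.0 / doc_fraction)

idf_common = idf(0.90)  # term found in 90% of documents
idf_rare = idf(0.01)    # term found in 1% of documents

print(f"IDF at 90% coverage: {idf_common:.3f}")
print(f"IDF at  1% coverage: {idf_rare:.3f}")
print(f"Rare term is weighted {idf_rare / idf_common:.0f}x more heavily")
```

With a different logarithm base the absolute values change, but the ratio between the two weights stays the same.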

📘 Theory: Information-Theoretic Interpretation

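TF-IDF can be read as a heuristic approximation of the mutual information between terms and documents: \(\log(N / df_j)\) is the self-information of the event that a randomly chosen document contains term \(j\). Terms with high TF-IDF are concentrated in specific documents or document clusters, while terms with low TF-IDF appear roughly uniformly across the corpus. Filtering low TF-IDF terms therefore removes features that carry little document-specific signal.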

🔑 Key Formula

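TF-IDF Score (length-normalised TF, smoothed IDF): \[\text{TF-IDF}_{ij} = \left(\frac{f_{ij}}{\sum_k f_{ik}}\right) \times \log\left(\frac{N}{1 + df_j}\right)\] where \(f_{ij}\) is the raw count of term \(j\) in document \(i\), \(\sum_k f_{ik}\) is the total number of tokens in document \(i\), and the \(1 + df_j\) in the denominator is a smoothing term to avoid division by zero. This is a smoothed variant of the IDF defined above; note that a term occurring in every document then receives a slightly negative weight, \(\log(N / (N + 1))\). Implementations differ in detail: tidytext's bind_tf_idf() uses the unsmoothed \(\ln(N / df_j)\), while scikit-learn's TfidfVectorizer defaults to \(\ln\frac{1 + N}{1 + df_j} + 1\) followed by L2 normalisation of each document vector, so the R and Python scores in this section are not directly comparable.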
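To make the Key Formula concrete, the following sketch computes it by hand in plain Python. The toy corpus, the variable names, and the helper `tf_idf` are hypothetical and are not the chapter's review data; the formula is the smoothed one stated above, not what any particular library computes.

```python
import math
from collections import Counter

# Hypothetical toy corpus of already-tokenised reviews (illustrative only)
docs = [
    ["good", "product", "fast", "delivery"],
    ["product", "spoil", "bad"],
    ["good", "product", "cheap", "price"],
    ["delivery", "guy", "rude"],
    ["amazing", "buy", "again"],
]

n_docs = len(docs)
# df_j: number of documents that contain term j at least once
doc_freq = Counter(term for doc in docs for term in set(doc))

def tf_idf(doc):
    """TF-IDF per the Key Formula: (f_ij / sum_k f_ik) * log(N / (1 + df_j))."""
    counts = Counter(doc)
    total = sum(counts.values())
    return {t: (c / total) * math.log(n_docs / (1 + doc_freq[t]))
            for t, c in counts.items()}

scores = [tf_idf(d) for d in docs]

# Rank the terms of the first review from most to least characteristic
for term, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {score:.4f}")
```

In the first review, "fast" (found in one document) outranks "product" (found in three), even though each occurs once in that review: the IDF factor alone separates them.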

library(tidytext)
library(tidyverse)

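# Compute TF-IDF using tidytext
# bind_tf_idf() expects one row per (document, term) pair. In the output below the
# `word` column holds whole reviews fused into single strings, which indicates that
# dtm_tidy was built from text whose spaces were stripped before tokenisation, so
# each review is counted as a single "term". The Python output is at word level.
tfidf_data <- dtm_tidy |>
  bind_tf_idf(word, doc_id, count)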

cat("TF-IDF Computation:\n\n")
#> TF-IDF Computation:
cat("Sample TF-IDF scores (first 20 rows):\n")
#> Sample TF-IDF scores (first 20 rows):
print(head(tfidf_data, 20) |>
        mutate(across(where(is.numeric), \(x) round(x, 4))))
#> # A tibble: 20 × 6
#>    doc_id word                                          count    tf   idf tf_idf
#>     <dbl> <chr>                                         <dbl> <dbl> <dbl>  <dbl>
#>  1      1 theproductqualityisverygoodirecommendtofriend     1     1  3.56   3.56
#>  2      2 edonspoilafteroneweekverybad                      1     1  3.56   3.56
#>  3      3 customerserviceisexcellentfastrespons             1     1  3.56   3.56
#>  4      4 enoworkatallwasteofmoney                          1     1  3.56   3.56
#>  5      5 goodproductcheappricefastdeliveri                 1     1  3.56   3.56
#>  6      6 thedeliveryguywasrudebutproductisok               1     1  3.56   3.56
#>  7      7 esweetwellwellthisisthebest                       1     1  3.56   3.56
#>  8      8 nobesmallthingproductnolast                       1     1  3.56   3.56
#>  9      9 amazingwillbuyagainnextmonth                      1     1  3.56   3.56
#> 10     10 terriblequalityreceiveddamagedgood                1     1  3.56   3.56
#> 11     11 theproductqualityisverygoodirecommendtofriend     1     1  3.56   3.56
#> 12     12 edonspoilafteroneweekverybad                      1     1  3.56   3.56
#> 13     13 customerserviceisexcellentfastrespons             1     1  3.56   3.56
#> 14     14 enoworkatallwasteofmoney                          1     1  3.56   3.56
#> 15     15 goodproductcheappricefastdeliveri                 1     1  3.56   3.56
#> 16     16 thedeliveryguywasrudebutproductisok               1     1  3.56   3.56
#> 17     17 esweetwellwellthisisthebest                       1     1  3.56   3.56
#> 18     18 nobesmallthingproductnolast                       1     1  3.56   3.56
#> 19     19 amazingwillbuyagainnextmonth                      1     1  3.56   3.56
#> 20     20 terriblequalityreceiveddamagedgood                1     1  3.56   3.56

# Top terms by TF-IDF for each document
cat("\n\nTop 5 TF-IDF Terms for Selected Documents:\n\n")
#> 
#> 
#> Top 5 TF-IDF Terms for Selected Documents:

for (doc in c(1, 5, 10)) {
  top_terms <- tfidf_data |>
    filter(doc_id == doc) |>
    arrange(desc(tf_idf)) |>
    head(5)

  cat(sprintf("Document %d:\n", doc))
  for (i in 1:nrow(top_terms)) {
    row <- top_terms[i, ]
    cat(sprintf("  %s (TF-IDF: %.4f)\n", row$word, row$tf_idf))
  }
  cat("\n")
}
#> Document 1:
#>   theproductqualityisverygoodirecommendtofriend (TF-IDF: 3.5553)
#> 
#> Document 5:
#>   goodproductcheappricefastdeliveri (TF-IDF: 3.5553)
#> 
#> Document 10:
#>   terriblequalityreceiveddamagedgood (TF-IDF: 3.5553)

# TF-IDF matrix
tfidf_matrix <- tfidf_data |>
  select(doc_id, word, tf_idf) |>
  pivot_wider(names_from = word, values_from = tf_idf, values_fill = 0) |>
  column_to_rownames("doc_id")

cat("\n\nTF-IDF Matrix Dimensions:\n")
#> 
#> 
#> TF-IDF Matrix Dimensions:
cat(sprintf("Documents: %d, Terms: %d\n", nrow(tfidf_matrix), ncol(tfidf_matrix)))
#> Documents: 70, Terms: 50

# Global IDF statistics
idf_stats <- tfidf_data |>
  select(word, idf) |>
  distinct() |>
  arrange(desc(idf))

cat("\n\nTop 15 Terms by IDF (most distinctive):\n")
#> 
#> 
#> Top 15 Terms by IDF (most distinctive):
print(head(idf_stats, 15) |> mutate(idf = round(idf, 3)))
#> # A tibble: 15 × 2
#>    word                                                                      idf
#>    <chr>                                                                   <dbl>
#>  1 theproductqualityisverygoodirecommendtofriendscustomerserviceisexcelle…  4.25
#>  2 amazingwillbuyagainnextmonthgoodproductcheappricefastdeliveri            4.25
#>  3 terriblequalityreceiveddamagedgoodstheproductqualityisverygoodirecomme…  4.25
#>  4 edonspoilafteroneweekverybadesweetwellwellthisisthebest                  4.25
#>  5 amazingwillbuyagainnextmonthcustomerserviceisexcellentfastrespons        4.25
#>  6 thedeliveryguywasrudebutproductisokgoodproductcheappricefastdeliveri     4.25
#>  7 edonspoilafteroneweekverybadterriblequalityreceiveddamagedgood           4.25
#>  8 esweetwellwellthisisthebestedonspoilafteroneweekverybad                  4.25
#>  9 edonspoilafteroneweekverybadtheproductqualityisverygoodirecommendtofri…  4.25
#> 10 thedeliveryguywasrudebutproductisoktheproductqualityisverygoodirecomme…  4.25
#> 11 thedeliveryguywasrudebutproductisokthedeliveryguywasrudebutproductisok   4.25
#> 12 esweetwellwellthisisthebestgoodproductcheappricefastdeliveri             4.25
#> 13 nobesmallthingproductnolastthedeliveryguywasrudebutproductisok           4.25
#> 14 terriblequalityreceiveddamagedgoodsthedeliveryguywasrudebutproductisok   4.25
#> 15 thedeliveryguywasrudebutproductisokenoworkatallwasteofmoney              4.25

cat("\n\nBottom 15 Terms by IDF (most common):\n")
#> 
#> 
#> Bottom 15 Terms by IDF (most common):
print(tail(idf_stats, 15) |> mutate(idf = round(idf, 3)))
#> # A tibble: 15 × 2
#>    word                                                                      idf
#>    <chr>                                                                   <dbl>
#>  1 goodproductcheappricefastdeliveri                                        3.56
#>  2 thedeliveryguywasrudebutproductisok                                      3.56
#>  3 esweetwellwellthisisthebest                                              3.56
#>  4 nobesmallthingproductnolast                                              3.56
#>  5 amazingwillbuyagainnextmonth                                             3.56
#>  6 terriblequalityreceiveddamagedgood                                       3.56
#>  7 edonspoilafteroneweekverybadnobesmallthingproductnolast                  3.56
#>  8 enoworkatallwasteofmoneyesweetwellwellthisisthebest                      3.56
#>  9 goodproductcheappricefastdeliverytheproductqualityisverygoodirecommend…  3.56
#> 10 nobesmallthingproductnolastcustomerserviceisexcellentfastrespons         3.56
#> 11 terriblequalityreceiveddamagedgoodsterriblequalityreceiveddamagedgood    3.56
#> 12 enoworkatallwasteofmoneyedonspoilafteroneweekverybad                     3.56
#> 13 theproductqualityisverygoodirecommendtofriendsthedeliveryguywasrudebut…  3.56
#> 14 nobesmallthingproductnolastenoworkatallwasteofmoney                      3.56
#> 15 enoworkatallwasteofmoneycustomerserviceisexcellentfastrespons            3.15

# Filter high-IDF terms
high_idf_threshold <- quantile(idf_stats$idf, 0.75)
high_idf_terms <- idf_stats |>
  filter(idf > high_idf_threshold) |>
  pull(word)

cat(sprintf("\n\nTerms with IDF > %.2f (top 25%% by informativeness): %d terms\n",
            high_idf_threshold, length(high_idf_terms)))
#> 
#> 
#> Terms with IDF > 4.25 (top 25% by informativeness): 0 terms

# Visualisation: term IDF distribution
idf_for_plot <- idf_stats |>
  arrange(idf) |>
  head(20)

ggplot(idf_for_plot, aes(x = reorder(word, idf), y = idf)) +
  geom_col(fill = "coral", alpha = 0.7) +
  coord_flip() +
  labs(title = "IDF Scores: 20 Least Informative Terms",
       x = "Term", y = "IDF Score") +
  theme_minimal()


# Document characteristics via TF-IDF
doc_tfidf_df <- as.data.frame(tfidf_matrix)
doc_tfidf_df$doc_id <- rownames(doc_tfidf_df)
doc_tfidf_profile <- doc_tfidf_df |>
  pivot_longer(cols = -doc_id, names_to = "word", values_to = "tf_idf") |>
  filter(tf_idf > 0)

top_docs <- doc_tfidf_profile |>
  group_by(doc_id) |>
  summarise(mean_tfidf = mean(tf_idf),
            max_tfidf = max(tf_idf),
            num_terms = n(),
            .groups = "drop") |>
  arrange(desc(mean_tfidf))

cat("\n\nDocuments with Highest Mean TF-IDF (most distinctive vocabulary):\n")
#> 
#> 
#> Documents with Highest Mean TF-IDF (most distinctive vocabulary):
print(head(top_docs, 10) |> mutate(across(where(is.numeric), \(x) round(x, 3))))
#> # A tibble: 10 × 4
#>    doc_id mean_tfidf max_tfidf num_terms
#>    <chr>       <dbl>     <dbl>     <dbl>
#>  1 21           4.25      4.25         1
#>  2 22           4.25      4.25         1
#>  3 24           4.25      4.25         1
#>  4 25           4.25      4.25         1
#>  5 26           4.25      4.25         1
#>  6 27           4.25      4.25         1
#>  7 30           4.25      4.25         1
#>  8 31           4.25      4.25         1
#>  9 37           4.25      4.25         1
#> 10 38           4.25      4.25         1

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt

# Compute TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=500, min_df=1, max_df=0.9)
tfidf_matrix = tfidf_vectorizer.fit_transform(df_text['text_clean'])
tfidf_array = tfidf_matrix.toarray()

vocab = tfidf_vectorizer.get_feature_names_out()

print("TF-IDF Matrix Dimensions:")
#> TF-IDF Matrix Dimensions:
print(f"Documents: {tfidf_array.shape[0]}, Terms: {tfidf_array.shape[1]}")
#> Documents: 70, Terms: 36

# Top terms by TF-IDF for each document
print("\n\nTop 5 TF-IDF Terms for Selected Documents:\n")
#> 
#> 
#> Top 5 TF-IDF Terms for Selected Documents:

for doc_idx in [0, 4, 9]:
    top_indices = np.argsort(tfidf_array[doc_idx])[-5:][::-1]
    print(f"Document {doc_idx + 1}:")
    for idx in top_indices:
        term = vocab[idx]
        score = tfidf_array[doc_idx, idx]
        print(f"  {term} (TF-IDF: {score:.4f})")
    print()
#> Document 1:
#>   friend (TF-IDF: 0.5370)
#>   recommend (TF-IDF: 0.5370)
#>   qualiti (TF-IDF: 0.4722)
#>   good (TF-IDF: 0.3577)
#>   product (TF-IDF: 0.2689)
#> 
#> Document 5:
#>   cheap (TF-IDF: 0.4935)
#>   price (TF-IDF: 0.4935)
#>   deliveri (TF-IDF: 0.4027)
#>   fast (TF-IDF: 0.4027)
#>   good (TF-IDF: 0.3471)
#> 
#> Document 10:
#>   receiv (TF-IDF: 0.5081)
#>   damag (TF-IDF: 0.5081)
#>   terribl (TF-IDF: 0.5081)
#>   qualiti (TF-IDF: 0.3786)
#>   good (TF-IDF: 0.2868)

# IDF values are stored on the fitted vectoriser as idf_. With the default
# smooth_idf=True, scikit-learn computes ln((1 + N) / (1 + df)) + 1.
idf_scores = tfidf_vectorizer.idf_
idf_df = pd.DataFrame({
    'term': vocab,
    'idf': idf_scores
}).sort_values('idf', ascending=False)

print("\n\nTop 15 Terms by IDF (most distinctive):")
#> 
#> 
#> Top 15 Terms by IDF (most distinctive):
print(idf_df.head(15).round(3).to_string(index=False))
#>    term   idf
#>  receiv 3.183
#> terribl 3.183
#>   damag 3.183
#>     bad 3.065
#>   sweet 3.065
#>   spoil 3.065
#>    best 3.065
#>     one 3.065
#> wellwel 3.065
#>    week 3.065
#> respons 2.960
#>  servic 2.960
#>   excel 2.960
#>  custom 2.960
#>   money 2.865

print("\n\nBottom 15 Terms by IDF (most common):")
#> 
#> 
#> Bottom 15 Terms by IDF (most common):
print(idf_df.tail(15).round(3).to_string(index=False))
#>      term   idf
#>       guy 2.698
#> recommend 2.698
#>      rude 2.698
#>      next 2.698
#>        ok 2.698
#>     cheap 2.555
#>     price 2.555
#>   qualiti 2.372
#>      last 2.085
#>      fast 2.085
#>  deliveri 2.085
#>     thing 2.085
#>     small 2.085
#>      good 1.797
#>   product 1.351

# High-IDF filter
high_idf_threshold = idf_df['idf'].quantile(0.75)
high_idf_terms = idf_df[idf_df['idf'] > high_idf_threshold]
print(f"\nTerms with IDF > {high_idf_threshold:.2f} (top 25%): {len(high_idf_terms)} terms")
#> 
#> Terms with IDF > 3.07 (top 25%): 3 terms

# Visualization
fig, ax = plt.subplots(figsize=(10, 6))
bottom_idf = idf_df.tail(20).sort_values('idf')
ax.barh(range(20), bottom_idf['idf'].values, color='coral', alpha=0.7)
ax.set_yticks(range(20))
ax.set_yticklabels(bottom_idf['term'].values)
ax.set_xlabel('IDF Score')
ax.set_title('IDF Scores: 20 Least Informative Terms')
plt.tight_layout()
plt.show()


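# Document TF-IDF profiles
# Note: the mean is taken over all vocabulary columns, zeros included
# (the R version averages non-zero entries only).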
doc_mean_tfidf = tfidf_array.mean(axis=1)
doc_max_tfidf = tfidf_array.max(axis=1)
doc_nonzero = (tfidf_array > 0).sum(axis=1)

doc_profile = pd.DataFrame({
    'doc_id': range(1, len(df_text) + 1),
    'mean_tfidf': doc_mean_tfidf,
    'max_tfidf': doc_max_tfidf,
    'num_terms': doc_nonzero
}).sort_values('mean_tfidf', ascending=False)

print("\n\nDocuments with Highest Mean TF-IDF:")
#> 
#> 
#> Documents with Highest Mean TF-IDF:
print(doc_profile.head(10).round(4).to_string(index=False))
#>  doc_id  mean_tfidf  max_tfidf  num_terms
#>      65      0.0863     0.3608         10
#>      32      0.0863     0.3608         10
#>      57      0.0863     0.3608         10
#>      26      0.0858     0.3682         10
#>      27      0.0829     0.3494          9
#>      43      0.0827     0.4013          9
#>      51      0.0827     0.4013          9
#>      23      0.0827     0.4013          9
#>      47      0.0827     0.3581          9
#>      35      0.0818     0.3659          9

📝 Section 32.4 Review Questions

Why does TF-IDF downweight common terms? Give a practical example.
If a term appears in 50% of documents, what is its IDF value (assume log base 10)?
In customer reviews, which would have higher TF-IDF: “good” or “defective”? Why?
How would you use TF-IDF to automatically filter noise-prone terms for a downstream model?
What is a limitation of TF-IDF for capturing semantic similarity between documents?

32.5 Word Embeddings: Semantic Similarity Through Dense Vectors

Bag of Words and TF-IDF treat words as atomic units: “bank” and “financial institution” have zero similarity even though they’re conceptually related. Word embeddings map words to dense, low-dimensional vectors where semantically similar words cluster nearby in vector space.

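Word2Vec (2013): A shallow neural network trained on large corpora (billions of words) to predict a word from its neighbours, or the neighbours from the word. The learned embeddings capture semantic relationships that show up as vector arithmetic: “king − man + woman ≈ queen”. GloVe (2014) is a closely related method that fits word vectors to global co-occurrence counts; pre-trained English GloVe vectors are distributed in several sizes, including 100 dimensions.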

Cosine Similarity: Given embeddings \(\mathbf{v}_1\) and \(\mathbf{v}_2\): \[\text{similarity} = \frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{\|\mathbf{v}_1\| \, \|\mathbf{v}_2\|}\]

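The code in this section demonstrates the approach on a small simulated embedding space, computing nearest neighbours for terms relevant to Nigerian business (“Nigeria”, “Dangote”, “fintech”). In production, the simulated vectors would be replaced by pre-trained GloVe embeddings.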
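For readers who want to work with real vectors, the sketch below shows one way a downloaded GloVe text file might be parsed. It assumes the usual plain-text layout (a token followed by space-separated floats on each line). The helper names `load_glove` and `cosine` are illustrative, and the three-dimensional vectors in the in-memory sample are made up so that the snippet runs without any download.

```python
import io
import numpy as np

def load_glove(handle):
    """Parse GloVe's text format: one token per line, followed by its vector."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def cosine(u, v):
    """Cosine similarity between two 1-d vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# In-memory stand-in with made-up 3-d vectors. For a real file, pass
# open("glove.6B.100d.txt", encoding="utf-8") instead.
sample = io.StringIO("bank 0.9 0.1 0.0\nloan 0.8 0.2 0.1\nriver 0.0 0.1 0.9\n")
glove = load_glove(sample)

print(f"bank ~ loan:  {cosine(glove['bank'], glove['loan']):.3f}")
print(f"bank ~ river: {cosine(glove['bank'], glove['river']):.3f}")
```

The glove.6B vectors are uncased, so queries such as “Nigeria” would need to be lower-cased first. Brand names or Pidgin terms may be missing from the vocabulary altogether and should be checked before use.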

📘 Theory: Word Embeddings from Co-occurrence

Word2Vec learns embeddings by treating word prediction as a self-supervised task: the CBOW variant predicts a target word from its context words, while the skip-gram variant predicts the context words from the target. The weights of the network's hidden layer become the embeddings. Words with similar contexts (the distributional hypothesis: “words are similar if their contexts are similar”) end up with similar embeddings. The resulting space encodes semantic and syntactic relationships.
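The prediction task needs no manual labels because the training examples come from the text itself. As an illustration, the sketch below generates (target, context) pairs of the kind a skip-gram style model would train on; the function name `skipgram_pairs` and the example sentence are hypothetical, and the actual Word2Vec training (negative sampling, subsampling, optimisation) is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for each token and its neighbours within `window`."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["the", "bank", "approved", "the", "loan"]
pairs = skipgram_pairs(sentence, window=1)

for target, context in pairs:
    print(f"{target:10s} -> {context}")
```

Words that repeatedly share context words (for example “bank” and “lender” both appearing next to “approved” and “loan”) are pushed towards similar vectors during training.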

🔑 Key Concept: Vector Algebra in Embedding Space

Word vectors enable analogy tasks through vector arithmetic: \[\mathbf{v}_{\text{queen}} \approx \mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}} + \mathbf{v}_{\text{woman}}\] This works because embeddings capture semantic dimensions (gender, royalty, etc.) as directions in vector space.
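A toy example makes the geometry visible. The two-dimensional vectors below are hand-built so that one axis stands for “royalty” and the other for “gender”; they are purely illustrative and not taken from any trained model, and `analogy` is a hypothetical helper. Real embeddings satisfy such relations only approximately.

```python
import numpy as np

# Hand-built 2-d vectors: axis 0 = "royalty", axis 1 = "gender" (illustrative only)
vectors = {
    "king":   np.array([1.0,  1.0]),
    "queen":  np.array([1.0, -1.0]),
    "man":    np.array([0.0,  1.0]),
    "woman":  np.array([0.0, -1.0]),
    "prince": np.array([0.8,  1.0]),
}

def analogy(a, b, c, vectors):
    """Return the word closest (by cosine) to v_a - v_b + v_c, excluding a, b and c."""
    target = vectors[a] - vectors[b] + vectors[c]
    best_word, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        sim = float(target @ vec / (np.linalg.norm(target) * np.linalg.norm(vec)))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word, best_sim

word, sim = analogy("king", "man", "woman", vectors)
print(f"king - man + woman is closest to: {word} (cosine {sim:.3f})")
```

Subtracting “man” and adding “woman” flips the sign on the gender axis while leaving the royalty axis untouched, which is why the result lands on “queen”. Excluding the three query words from the candidates is a common convention in analogy evaluation.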

library(tidyverse)
# 'text' package requires torch; this chunk simulates embeddings instead

# Note: Loading full GloVe embeddings in R requires significant memory.
# Here, we simulate with a smaller set and demonstrate the approach.

# For production, download GloVe embeddings (glove.6B.100d.txt) from
# https://nlp.stanford.edu/projects/glove/

# Simulate small embedding space for demonstration
set.seed(888)

# Key terms for Nigerian business context
terms_of_interest <- c(
  "Nigeria", "Ghana", "Senegal",
  "Dangote", "MTN", "Airtel",
  "fintech", "startup", "venture",
  "manufacturing", "agriculture", "technology",
  "profit", "revenue", "growth",
  "bank", "finance", "loan",
  "inflation", "currency", "exchange"
)

# Simulate embeddings (100-dim) with semantic structure
n_terms <- length(terms_of_interest)
embedding_dim <- 100

# Create synthetic embeddings with semantic relationships
embeddings <- matrix(rnorm(n_terms * embedding_dim, mean = 0, sd = 0.1),
                     nrow = n_terms, ncol = embedding_dim)
rownames(embeddings) <- terms_of_interest

# Add structure: country terms should be similar to each other
country_idx <- which(terms_of_interest %in% c("Nigeria", "Ghana", "Senegal"))
for (i in country_idx) {
  embeddings[i, 1:20] <- embeddings[i, 1:20] + 0.5  # Add country signal
}

# Company terms should be similar
company_idx <- which(terms_of_interest %in% c("Dangote", "MTN", "Airtel"))
for (i in company_idx) {
  embeddings[i, 21:40] <- embeddings[i, 21:40] + 0.5  # Add company signal
}

# Finance terms should be similar
finance_idx <- which(terms_of_interest %in% c("bank", "finance", "loan"))
for (i in finance_idx) {
  embeddings[i, 41:60] <- embeddings[i, 41:60] + 0.5  # Add finance signal
}

# Normalise to unit vectors
embeddings <- t(apply(embeddings, 1, function(x) x / sqrt(sum(x^2))))

# Define cosine similarity function
cosine_sim <- function(v1, v2) {
  sum(v1 * v2) / (sqrt(sum(v1^2)) * sqrt(sum(v2^2)))
}

cat("Word Embeddings: Semantic Similarity Analysis\n\n")
#> Word Embeddings: Semantic Similarity Analysis

# Find nearest neighbours
nearest_neighbours <- function(word, embeddings, k = 5) {
  if (!(word %in% rownames(embeddings))) {
    return(NULL)
  }

  word_vec <- embeddings[word, ]
  similarities <- apply(embeddings, 1, function(row) cosine_sim(word_vec, row))

  # Sort and exclude the word itself
  sorted_idx <- order(similarities, decreasing = TRUE)
  result_idx <- sorted_idx[sorted_idx != which(rownames(embeddings) == word)][1:k]

  data.frame(
    word = rownames(embeddings)[result_idx],
    similarity = similarities[result_idx]
  )
}

# Query nearest neighbours for key terms
query_terms <- c("Nigeria", "Dangote", "fintech", "bank")

for (term in query_terms) {
  cat(sprintf("Nearest neighbours of '%s':\n", term))
  neighbors <- nearest_neighbours(term, embeddings, k = 5)
  for (i in 1:nrow(neighbors)) {
    cat(sprintf("  %s (%.3f)\n", neighbors$word[i], neighbors$similarity[i]))
  }
  cat("\n")
}
#> Nearest neighbours of 'Nigeria':
#>   Senegal (0.801)
#>   Ghana (0.790)
#>   technology (0.201)
#>   manufacturing (0.199)
#>   startup (0.195)
#> 
#> Nearest neighbours of 'Dangote':
#>   MTN (0.865)
#>   Airtel (0.865)
#>   manufacturing (0.134)
#>   finance (0.121)
#>   agriculture (0.115)
#> 
#> Nearest neighbours of 'fintech':
#>   inflation (0.196)
#>   manufacturing (0.156)
#>   technology (0.155)
#>   venture (0.109)
#>   Senegal (0.078)
#> 
#> Nearest neighbours of 'bank':
#>   loan (0.858)
#>   finance (0.808)
#>   technology (0.218)
#>   Ghana (0.060)
#>   startup (0.047)

# Similarity matrix for select terms
select_terms <- c("Nigeria", "Dangote", "fintech", "bank", "MTN", "inflation")
select_embeddings <- embeddings[select_terms, ]

sim_matrix <- matrix(0, nrow = length(select_terms), ncol = length(select_terms))
rownames(sim_matrix) <- colnames(sim_matrix) <- select_terms

for (i in 1:length(select_terms)) {
  for (j in 1:length(select_terms)) {
    sim_matrix[i, j] <- cosine_sim(select_embeddings[i, ], select_embeddings[j, ])
  }
}

cat("Similarity Matrix for Selected Terms:\n\n")
#> Similarity Matrix for Selected Terms:
print(round(sim_matrix, 3))
#>           Nigeria Dangote fintech   bank    MTN inflation
#> Nigeria     1.000  -0.114   0.072  0.017 -0.177    -0.112
#> Dangote    -0.114   1.000  -0.029  0.016  0.865    -0.060
#> fintech     0.072  -0.029   1.000 -0.069  0.071     0.196
#> bank        0.017   0.016  -0.069  1.000  0.041    -0.054
#> MTN        -0.177   0.865   0.071  0.041  1.000     0.063
#> inflation  -0.112  -0.060   0.196 -0.054  0.063     1.000

# Heatmap
sim_melted <- as.data.frame(as.table(sim_matrix)) |>
  rename(term1 = Var1, term2 = Var2, similarity = Freq)

ggplot(sim_melted, aes(x = term2, y = term1, fill = similarity)) +
  geom_tile(colour = "white") +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, limits = c(-1, 1)) +  # cosine similarity ranges over [-1, 1]
  labs(title = "Word Embedding Semantic Similarity Matrix",
       x = NULL, y = NULL, fill = "Cosine\nSimilarity") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity

# Simulate embeddings (in production, load pre-trained GloVe)
np.random.seed(888)

terms_of_interest = [
    "Nigeria", "Ghana", "Senegal",
    "Dangote", "MTN", "Airtel",
    "fintech", "startup", "venture",
    "manufacturing", "agriculture", "technology",
    "profit", "revenue", "growth",
    "bank", "finance", "loan",
    "inflation", "currency", "exchange"
]

# Create synthetic embeddings (100-dim)
embeddings = np.random.randn(len(terms_of_interest), 100) * 0.1

# Add semantic structure
country_idx = [i for i, t in enumerate(terms_of_interest)
               if t in ["Nigeria", "Ghana", "Senegal"]]
for i in country_idx:
    embeddings[i, :20] += 0.5

company_idx = [i for i, t in enumerate(terms_of_interest)
               if t in ["Dangote", "MTN", "Airtel"]]
for i in company_idx:
    embeddings[i, 20:40] += 0.5

finance_idx = [i for i, t in enumerate(terms_of_interest)
               if t in ["bank", "finance", "loan"]]
for i in finance_idx:
    embeddings[i, 40:60] += 0.5

# Normalise to unit vectors
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

print("Word Embeddings: Semantic Similarity Analysis\n")
#> Word Embeddings: Semantic Similarity Analysis

# Nearest neighbours
def nearest_neighbours(word, terms, embeddings, k=5):
    if word not in terms:
        return None
    idx = terms.index(word)
    word_vec = embeddings[idx]
    sims = cosine_similarity([word_vec], embeddings)[0]
    sorted_idx = np.argsort(-sims)
    # Exclude the word itself
    result_idx = [i for i in sorted_idx if i != idx][:k]
    return [(terms[i], sims[i]) for i in result_idx]

query_terms = ["Nigeria", "Dangote", "fintech", "bank"]

for term in query_terms:
    print(f"Nearest neighbours of '{term}':")
    neighbors = nearest_neighbours(term, terms_of_interest, embeddings, k=5)
    for word, sim in neighbors:
        print(f"  {word} ({sim:.3f})")
    print()
#> Nearest neighbours of 'Nigeria':
#>   Ghana (0.843)
#>   Senegal (0.838)
#>   startup (0.164)
#>   revenue (0.149)
#>   fintech (0.118)
#> 
#> Nearest neighbours of 'Dangote':
#>   MTN (0.871)
#>   Airtel (0.816)
#>   revenue (0.191)
#>   exchange (0.119)
#>   currency (0.051)
#> 
#> Nearest neighbours of 'fintech':
#>   startup (0.149)
#>   revenue (0.138)
#>   loan (0.129)
#>   Nigeria (0.118)
#>   Ghana (0.103)
#> 
#> Nearest neighbours of 'bank':
#>   loan (0.851)
#>   finance (0.829)
#>   venture (0.136)
#>   startup (0.127)
#>   technology (0.095)

# Similarity matrix
select_terms = ["Nigeria", "Dangote", "fintech", "bank", "MTN", "inflation"]
select_idx = [terms_of_interest.index(t) for t in select_terms]
select_embeddings = embeddings[select_idx]

sim_matrix = cosine_similarity(select_embeddings)

print("Similarity Matrix for Selected Terms:\n")
#> Similarity Matrix for Selected Terms:
sim_df = pd.DataFrame(sim_matrix, index=select_terms, columns=select_terms)
print(sim_df.round(3))
#>            Nigeria  Dangote  fintech   bank    MTN  inflation
#> Nigeria      1.000   -0.107    0.118  0.060 -0.126      0.033
#> Dangote     -0.107    1.000    0.033 -0.017  0.871     -0.118
#> fintech      0.118    0.033    1.000  0.071  0.049     -0.051
#> bank         0.060   -0.017    0.071  1.000 -0.055     -0.025
#> MTN         -0.126    0.871    0.049 -0.055  1.000     -0.031
#> inflation    0.033   -0.118   -0.051 -0.025 -0.031      1.000

# Heatmap
fig, ax = plt.subplots(figsize=(8, 8))
# Cosine similarity ranges over [-1, 1]; centre the diverging colour map on 0
im = ax.imshow(sim_matrix, cmap='RdBu_r', vmin=-1, vmax=1)
ax.set_xticks(range(len(select_terms)))
ax.set_yticks(range(len(select_terms)))
ax.set_xticklabels(select_terms, rotation=45, ha='right')
ax.set_yticklabels(select_terms)
ax.set_title('Word Embedding Semantic Similarity Matrix')
cbar = fig.colorbar(im, ax=ax, label='Cosine Similarity')
plt.tight_layout()
plt.show()

📝 Section 32.5 Review Questions

Why do pre-trained word embeddings capture semantic meaning better than TF-IDF?
If “bank” and “financial institution” have cosine similarity 0.85, what does this tell you?
How would the vector “king − man + woman” represent the concept of “queen” geometrically?
What is a limitation of word embeddings trained on American English when applied to Nigerian text?
Name two Nigerian business terms that would benefit from custom fine-tuning of embeddings.

32.6 Contextual Embeddings and BERT: Beyond Word-Level Representations

Word embeddings like Word2Vec assign a single vector to each word, regardless of context. In “bank of the river” vs “commercial bank”, the word “bank” is identical in representation, even though the meaning differs entirely.

BERT (Bidirectional Encoder Representations from Transformers) addresses this through contextual embeddings. Rather than reading strictly left-to-right, its transformer attention layers let every token attend to the words on both sides of it at once, so the representation of each word is adjusted by its surrounding context.

Sentence-BERT (SBERT) fine-tunes BERT so that pooling the contextual token embeddings yields a single fixed-length vector per sentence or short document, and cosine similarity between those vectors tracks semantic similarity. Because each document is encoded only once, similarity search across hundreds of documents takes seconds.
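The pooling step itself is simple. The sketch below shows masked mean pooling, a common choice for turning per-token vectors into one vector per sentence. The arrays are made-up stand-ins for transformer outputs, and `mean_pool` is an illustrative helper rather than a library function.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions (mask == 0)."""
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Batch of 2 sentences, 3 token positions, 2-d embeddings (made-up numbers)
tokens = np.array([
    [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]],  # third position is padding
    [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]],
])
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])

sentence_vecs = mean_pool(tokens, mask)
print(sentence_vecs)
```

The mask matters: without it, the padding vector in the first sentence would drag the average away from the real tokens.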

📘 Theory: Attention Mechanism in Transformers

At each position, the attention mechanism computes a weighted average of the representations of all words in the sentence (including the word itself). The weights are not fixed: they are computed from learned query and key projections of the words, allowing the model to “decide” which words are most relevant for understanding the current word. For “bank”, the attention paid to “river”, “financial”, or “account” changes depending on context.
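The weighted-average idea can be sketched in a few lines of NumPy. This is a single-head, scaled dot-product self-attention under simplifying assumptions: random matrices stand in for learned projections, and the multi-head structure, positional information, and layer stacking used in BERT are omitted. All names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over the rows (tokens) of X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # relevance of every token to every other
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights              # weighted average of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # 4 tokens, 8-d input vectors (random stand-ins)
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))

context, weights = self_attention(X, W_q, W_k, W_v)
print("Attention weights (rows sum to 1):")
print(weights.round(3))
```

Row \(i\) of `weights` shows how much token \(i\) draws on each token in the sentence. Changing any other token changes that row, which is what makes the resulting representation contextual.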

🔑 Key Concept: Contextual Embedding

\[\mathbf{c}_i = \text{BERT}(\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_n, i)\] The representation \(\mathbf{c}_i\) of word \(i\) is a function of the entire sentence context, unlike static embeddings where \(\mathbf{w}_i\) is fixed.

# Note: BERT in R requires the `text` package and a Python backend.
# For demonstration, we simulate contextual embeddings with thematic structure.

library(tidyverse)

# Nigerian news headlines (synthetic)
headlines <- c(
  "Nigeria's fintech sector grows 40% as investors flock to Lagos",
  "Dangote Cement reports strong Q4 earnings despite economic headwinds",
  "Central Bank Governor warns of inflation risks in 2024",
  "MTN Nigeria expands 5G network to underserved rural communities",
  "New trade deal between Nigeria and ECOWAS partners boosts exports",
  "Nigerian tech startups attract $500M in VC funding this year",
  "Agriculture minister launches climate-resilient crop initiative",
  "Lagos stock exchange opens strong on bank sector gains",
  "Nigerian banks face regulatory pressure on consumer lending standards",
  "Senegalese entrepreneur launches pan-African fintech platform in Lagos"
)

headlines_df <- data.frame(
  headline_id = 1:length(headlines),
  headline = headlines,
  stringsAsFactors = FALSE
)

cat("Nigerian News Headlines for Embedding Analysis:\n\n")
#> Nigerian News Headlines for Embedding Analysis:
print(headlines_df)
#>    headline_id
#> 1            1
#> 2            2
#> 3            3
#> 4            4
#> 5            5
#> 6            6
#> 7            7
#> 8            8
#> 9            9
#> 10          10
#>                                                                  headline
#> 1          Nigeria's fintech sector grows 40% as investors flock to Lagos
#> 2    Dangote Cement reports strong Q4 earnings despite economic headwinds
#> 3                  Central Bank Governor warns of inflation risks in 2024
#> 4         MTN Nigeria expands 5G network to underserved rural communities
#> 5       New trade deal between Nigeria and ECOWAS partners boosts exports
#> 6            Nigerian tech startups attract $500M in VC funding this year
#> 7         Agriculture minister launches climate-resilient crop initiative
#> 8                  Lagos stock exchange opens strong on bank sector gains
#> 9   Nigerian banks face regulatory pressure on consumer lending standards
#> 10 Senegalese entrepreneur launches pan-African fintech platform in Lagos

# Simulate contextual embeddings (in production, use sentence-transformers library)
# Create synthetic embeddings with semantic relationships
set.seed(999)
n_docs <- nrow(headlines_df)
embed_dim <- 100

# Base embeddings
embeddings_context <- matrix(rnorm(n_docs * embed_dim, 0, 0.1),
                             nrow = n_docs, ncol = embed_dim)

# Add thematic structure
# Finance/tech headlines: indices 1, 2, 6, 9, 10
finance_idx <- c(1, 2, 6, 9, 10)
embeddings_context[finance_idx, 1:25] <- embeddings_context[finance_idx, 1:25] + 0.5

# Policy/regulation: indices 3, 5, 8
policy_idx <- c(3, 5, 8)
embeddings_context[policy_idx, 26:50] <- embeddings_context[policy_idx, 26:50] + 0.5

# Infrastructure/expansion: indices 4, 7
infra_idx <- c(4, 7)
embeddings_context[infra_idx, 51:75] <- embeddings_context[infra_idx, 51:75] + 0.5

# Normalise
embeddings_context <- t(apply(embeddings_context, 1, function(x) x / sqrt(sum(x^2))))

# Compute pairwise similarity
sim_matrix_context <- embeddings_context %*% t(embeddings_context)

cat("\n\nContextual Embedding Similarity Matrix:\n\n")
#> 
#> 
#> Contextual Embedding Similarity Matrix:
rownames(sim_matrix_context) <- colnames(sim_matrix_context) <- 1:n_docs
print(round(sim_matrix_context, 3))
#>         1      2      3      4      5      6      7      8      9     10
#> 1   1.000  0.882  0.005 -0.015 -0.017  0.821 -0.063  0.011  0.879  0.857
#> 2   0.882  1.000  0.015  0.031 -0.011  0.868 -0.019 -0.014  0.902  0.887
#> 3   0.005  0.015  1.000  0.004  0.866  0.025 -0.031  0.850  0.019  0.011
#> 4  -0.015  0.031  0.004  1.000 -0.026 -0.029  0.863 -0.035  0.010 -0.001
#> 5  -0.017 -0.011  0.866 -0.026  1.000 -0.029  0.014  0.866  0.021 -0.011
#> 6   0.821  0.868  0.025 -0.029 -0.029  1.000 -0.097  0.000  0.860  0.851
#> 7  -0.063 -0.019 -0.031  0.863  0.014 -0.097  1.000  0.012 -0.038 -0.048
#> 8   0.011 -0.014  0.850 -0.035  0.866  0.000  0.012  1.000  0.030 -0.003
#> 9   0.879  0.902  0.019  0.010  0.021  0.860 -0.038  0.030  1.000  0.881
#> 10  0.857  0.887  0.011 -0.001 -0.011  0.851 -0.048 -0.003  0.881  1.000

# Find most similar pairs
cat("\n\nMost Similar Headline Pairs (excluding self-similarity):\n\n")
#> 
#> 
#> Most Similar Headline Pairs (excluding self-similarity):

upper_tri <- upper.tri(sim_matrix_context)
sims <- sim_matrix_context[upper_tri]
pairs_idx <- which(upper_tri, arr.ind = TRUE)

similar_pairs <- data.frame(
  headline_1 = pairs_idx[, 1],
  headline_2 = pairs_idx[, 2],
  similarity = sims  # upper.tri() already excludes the diagonal, so no filtering is needed
) |>
  arrange(desc(similarity)) |>
  head(10)

for (i in 1:nrow(similar_pairs)) {
  pair <- similar_pairs[i, ]
  h1 <- headlines_df$headline[pair$headline_1]
  h2 <- headlines_df$headline[pair$headline_2]
  cat(sprintf("Pair %d (similarity: %.3f):\n", i, pair$similarity))
  cat(sprintf("  H%d: %s\n", pair$headline_1, h1))
  cat(sprintf("  H%d: %s\n\n", pair$headline_2, h2))
}
#> Pair 1 (similarity: 0.902):
#>   H2: Dangote Cement reports strong Q4 earnings despite economic headwinds
#>   H9: Nigerian banks face regulatory pressure on consumer lending standards
#> 
#> Pair 2 (similarity: 0.887):
#>   H2: Dangote Cement reports strong Q4 earnings despite economic headwinds
#>   H10: Senegalese entrepreneur launches pan-African fintech platform in Lagos
#> 
#> Pair 3 (similarity: 0.882):
#>   H1: Nigeria's fintech sector grows 40% as investors flock to Lagos
#>   H2: Dangote Cement reports strong Q4 earnings despite economic headwinds
#> 
#> Pair 4 (similarity: 0.881):
#>   H9: Nigerian banks face regulatory pressure on consumer lending standards
#>   H10: Senegalese entrepreneur launches pan-African fintech platform in Lagos
#> 
#> Pair 5 (similarity: 0.879):
#>   H1: Nigeria's fintech sector grows 40% as investors flock to Lagos
#>   H9: Nigerian banks face regulatory pressure on consumer lending standards
#> 
#> Pair 6 (similarity: 0.868):
#>   H2: Dangote Cement reports strong Q4 earnings despite economic headwinds
#>   H6: Nigerian tech startups attract $500M in VC funding this year
#> 
#> Pair 7 (similarity: 0.866):
#>   H3: Central Bank Governor warns of inflation risks in 2024
#>   H5: New trade deal between Nigeria and ECOWAS partners boosts exports
#> 
#> Pair 8 (similarity: 0.866):
#>   H5: New trade deal between Nigeria and ECOWAS partners boosts exports
#>   H8: Lagos stock exchange opens strong on bank sector gains
#> 
#> Pair 9 (similarity: 0.863):
#>   H4: MTN Nigeria expands 5G network to underserved rural communities
#>   H7: Agriculture minister launches climate-resilient crop initiative
#> 
#> Pair 10 (similarity: 0.860):
#>   H6: Nigerian tech startups attract $500M in VC funding this year
#>   H9: Nigerian banks face regulatory pressure on consumer lending standards

# Heatmap
sim_melted <- as.data.frame(as.table(sim_matrix_context)) |>
  rename(h1 = Var1, h2 = Var2, similarity = Freq) |>
  mutate(h1 = as.integer(h1), h2 = as.integer(h2))

ggplot(sim_melted, aes(x = h2, y = h1, fill = similarity)) +
  geom_tile(colour = "white") +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, limits = c(-1, 1)) +  # centre the diverging scale on zero similarity
  labs(title = "Contextual Embedding Similarity: Nigerian Headlines",
       x = "Headline ID", y = "Headline ID", fill = "Similarity") +
  theme_minimal() +
  theme(axis.text = element_text(size = 8))

Show code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity

# Nigerian news headlines
headlines = [
    "Nigeria's fintech sector grows 40% as investors flock to Lagos",
    "Dangote Cement reports strong Q4 earnings despite economic headwinds",
    "Central Bank Governor warns of inflation risks in 2024",
    "MTN Nigeria expands 5G network to underserved rural communities",
    "New trade deal between Nigeria and ECOWAS partners boosts exports",
    "Nigerian tech startups attract $500M in VC funding this year",
    "Agriculture minister launches climate-resilient crop initiative",
    "Lagos stock exchange opens strong on bank sector gains",
    "Nigerian banks face regulatory pressure on consumer lending standards",
    "Senegalese entrepreneur launches pan-African fintech platform in Lagos"
]

print("Nigerian News Headlines for Embedding Analysis:\n")
#> Nigerian News Headlines for Embedding Analysis:
for i, h in enumerate(headlines, 1):
    print(f"{i}. {h}")
#> 1. Nigeria's fintech sector grows 40% as investors flock to Lagos
#> 2. Dangote Cement reports strong Q4 earnings despite economic headwinds
#> 3. Central Bank Governor warns of inflation risks in 2024
#> 4. MTN Nigeria expands 5G network to underserved rural communities
#> 5. New trade deal between Nigeria and ECOWAS partners boosts exports
#> 6. Nigerian tech startups attract $500M in VC funding this year
#> 7. Agriculture minister launches climate-resilient crop initiative
#> 8. Lagos stock exchange opens strong on bank sector gains
#> 9. Nigerian banks face regulatory pressure on consumer lending standards
#> 10. Senegalese entrepreneur launches pan-African fintech platform in Lagos

# Simulate contextual embeddings
np.random.seed(999)
embeddings_context = np.random.randn(len(headlines), 100) * 0.1

# Add thematic structure
finance_idx = [0, 1, 5, 8, 9]  # 0-indexed
embeddings_context[finance_idx, :25] += 0.5

policy_idx = [2, 4, 7]
embeddings_context[policy_idx, 25:50] += 0.5

infra_idx = [3, 6]
embeddings_context[infra_idx, 50:75] += 0.5

# Normalise
embeddings_context = embeddings_context / np.linalg.norm(embeddings_context,
                                                          axis=1, keepdims=True)

# Compute similarity
sim_matrix = cosine_similarity(embeddings_context)

print("\n\nContextual Embedding Similarity Matrix:\n")
#> 
#> 
#> Contextual Embedding Similarity Matrix:
sim_df = pd.DataFrame(sim_matrix, index=range(1, 11), columns=range(1, 11))
print(sim_df.round(3))
#>        1      2      3      4      5      6      7      8      9      10
#> 1   1.000  0.858 -0.029  0.037 -0.039  0.885  0.115  0.024  0.843  0.846
#> 2   0.858  1.000 -0.077 -0.015 -0.028  0.880  0.020  0.013  0.862  0.869
#> 3  -0.029 -0.077  1.000  0.004  0.854 -0.017  0.024  0.876 -0.096 -0.090
#> 4   0.037 -0.015  0.004  1.000  0.011 -0.003  0.817  0.079 -0.014 -0.005
#> 5  -0.039 -0.028  0.854  0.011  1.000 -0.025  0.034  0.829 -0.088 -0.096
#> 6   0.885  0.880 -0.017 -0.003 -0.025  1.000  0.060  0.034  0.867  0.865
#> 7   0.115  0.020  0.024  0.817  0.034  0.060  1.000  0.112  0.047  0.040
#> 8   0.024  0.013  0.876  0.079  0.829  0.034  0.112  1.000 -0.042 -0.018
#> 9   0.843  0.862 -0.096 -0.014 -0.088  0.867  0.047 -0.042  1.000  0.862
#> 10  0.846  0.869 -0.090 -0.005 -0.096  0.865  0.040 -0.018  0.862  1.000

# Find most similar pairs
print("\n\nMost Similar Headline Pairs:\n")
#> 
#> 
#> Most Similar Headline Pairs:
upper_tri = np.triu_indices(len(headlines), k=1)
sims = sim_matrix[upper_tri]
pairs = list(zip(upper_tri[0], upper_tri[1], sims))
pairs.sort(key=lambda x: x[2], reverse=True)

for rank, (i, j, sim) in enumerate(pairs[:10], 1):
    print(f"Pair {rank} (similarity: {sim:.3f}):")
    print(f"  H{i+1}: {headlines[i]}")
    print(f"  H{j+1}: {headlines[j]}\n")
#> Pair 1 (similarity: 0.885):
#>   H1: Nigeria's fintech sector grows 40% as investors flock to Lagos
#>   H6: Nigerian tech startups attract $500M in VC funding this year
#> 
#> Pair 2 (similarity: 0.880):
#>   H2: Dangote Cement reports strong Q4 earnings despite economic headwinds
#>   H6: Nigerian tech startups attract $500M in VC funding this year
#> 
#> Pair 3 (similarity: 0.876):
#>   H3: Central Bank Governor warns of inflation risks in 2024
#>   H8: Lagos stock exchange opens strong on bank sector gains
#> 
#> Pair 4 (similarity: 0.869):
#>   H2: Dangote Cement reports strong Q4 earnings despite economic headwinds
#>   H10: Senegalese entrepreneur launches pan-African fintech platform in Lagos
#> 
#> Pair 5 (similarity: 0.867):
#>   H6: Nigerian tech startups attract $500M in VC funding this year
#>   H9: Nigerian banks face regulatory pressure on consumer lending standards
#> 
#> Pair 6 (similarity: 0.865):
#>   H6: Nigerian tech startups attract $500M in VC funding this year
#>   H10: Senegalese entrepreneur launches pan-African fintech platform in Lagos
#> 
#> Pair 7 (similarity: 0.862):
#>   H2: Dangote Cement reports strong Q4 earnings despite economic headwinds
#>   H9: Nigerian banks face regulatory pressure on consumer lending standards
#> 
#> Pair 8 (similarity: 0.862):
#>   H9: Nigerian banks face regulatory pressure on consumer lending standards
#>   H10: Senegalese entrepreneur launches pan-African fintech platform in Lagos
#> 
#> Pair 9 (similarity: 0.858):
#>   H1: Nigeria's fintech sector grows 40% as investors flock to Lagos
#>   H2: Dangote Cement reports strong Q4 earnings despite economic headwinds
#> 
#> Pair 10 (similarity: 0.854):
#>   H3: Central Bank Governor warns of inflation risks in 2024
#>   H5: New trade deal between Nigeria and ECOWAS partners boosts exports

# Heatmap
fig, ax = plt.subplots(figsize=(10, 8))
im = ax.imshow(sim_matrix, cmap='RdBu_r', vmin=-1, vmax=1)
ax.set_xticks(range(len(headlines)))
ax.set_yticks(range(len(headlines)))
ax.set_xticklabels(range(1, len(headlines) + 1))
ax.set_yticklabels(range(1, len(headlines) + 1))
ax.set_title('Contextual Embedding Similarity: Nigerian Headlines')
ax.set_xlabel('Headline ID')
ax.set_ylabel('Headline ID')
plt.colorbar(im, ax=ax, label='Cosine Similarity')
plt.tight_layout()
plt.show()

📝 Section 32.6 Review Questions

How does BERT’s contextual representation differ from Word2Vec’s static embeddings?
Give an example of how BERT would represent “bank” differently in “bank of the river” vs “commercial bank”.
What is the attention mechanism, and why is it important for contextual embeddings?
How would Sentence-BERT enable fast semantic search across 1 million documents?
What pre-trained BERT models are available for languages other than English, and which would be relevant for Nigeria?

32.7 Case Study: Mining Nigerian Senate Hansard

The Nigerian Senate maintains a public record of committee statements and debates. We analyse a synthetic corpus of 500 Senate committee statements (Finance, Health, Agriculture, Security committees) to identify key policy themes and compare rhetoric across committees.

Dataset: 500 synthetic statements (125 per committee), each a short templated sentence labelled by committee. Each statement addresses a policy issue (e.g., “budget allocation”, “healthcare access”, “crop yields”, “border security”).

Analysis: 1. Preprocess all documents. 2. Compute TF-IDF to extract characteristic terms per committee. 3. Build Word2Vec embeddings on the corpus. 4. Cluster statements using embedding similarity. 5. Visualise policy term relationships within each committee.

Show code

library(tidyverse)
library(tidytext)
library(tm)
library(SnowballC)

# Synthetic Nigerian Senate committee statements
set.seed(777)

committees <- c("Finance", "Health", "Agriculture", "Security")
n_per_committee <- 125

# Generate synthetic statements (simplified)
statement_templates <- list(
  Finance = c(
    "Budget allocation for the fiscal year requires strategic prioritisation of key revenue sources.",
    "Treasury operations and debt management remain critical challenges in maintaining macro stability.",
    "Tax compliance and revenue collection efficiency have improved significantly this quarter.",
    "Foreign exchange management and capital flow policies need urgent review.",
    "Pension reforms and retirement security are essential for long-term fiscal sustainability."
  ),
  Health = c(
    "Healthcare access in rural communities continues to lag urban centres significantly.",
    "Disease prevention and immunisation programmes require increased funding and coordination.",
    "Medical training institutions need modernisation and capacity building initiatives.",
    "Primary healthcare infrastructure is inadequate across northern regions.",
    "Maternal mortality rates remain unacceptably high despite recent initiatives."
  ),
  Agriculture = c(
    "Crop yields have suffered due to irregular rainfall and inadequate irrigation infrastructure.",
    "Agricultural extension services require better coordination and farmer engagement.",
    "Pest management and crop protection remain major challenges for small-scale farmers.",
    "Soil degradation and land management practices need immediate intervention.",
    "Export promotion for agricultural commodities requires market development support."
  ),
  Security = c(
    "Border security infrastructure and personnel deployment need strategic reinforcement.",
    "Terrorism threats in the north-east region require sustained military and civilian coordination.",
    "Community policing and intelligence gathering capabilities require enhancement.",
    "Arms trafficking and smuggling routes continue to challenge security agencies.",
    "Cybersecurity threats to critical national infrastructure are increasing rapidly."
  )
)

# Generate corpus
corpus_data <- expand_grid(
  committee = committees,
  statement_num = 1:n_per_committee
) |>
  mutate(
    # Draw each statement from the templates of that row's own committee
    statement = sapply(committee, function(comm) {
      sample(statement_templates[[comm]], 1)
    }, USE.NAMES = FALSE),
    doc_id = 1:n()
  )

cat("Nigerian Senate Committee Statements Analysis\n")
#> Nigerian Senate Committee Statements Analysis
cat("Committees: ", paste(committees, collapse=", "), "\n")
#> Committees:  Finance, Health, Agriculture, Security
cat("Total statements: ", nrow(corpus_data), "\n")
#> Total statements:  500
cat("Statements per committee: ", n_per_committee, "\n\n")
#> Statements per committee:  125

# Preprocess
preprocess_senate <- function(text) {
  text <- tolower(text)
  # Replace punctuation with a space. A POSIX class is used because "\\s" is not
  # read as whitespace inside a bracket expression by R's default regex engine,
  # which would strip the spaces and glue every statement into one long token.
  text <- gsub("[^a-z0-9[:space:]]", " ", text)
  tokens <- unlist(strsplit(text, "\\s+"))
  tokens <- tokens[tokens != ""]

  stop_words <- c("the", "a", "an", "is", "are", "be", "have", "has", "in", "on", "and",
                  "or", "but", "to", "for", "of", "with", "by", "require", "requires",
                  "remain", "remains", "continues")
  tokens <- tokens[!(tokens %in% stop_words)]
  tokens <- wordStem(tokens, language = "english")
  tokens <- tokens[nchar(tokens) > 2]

  return(paste(tokens, collapse = " "))
}

corpus_data <- corpus_data |>
  mutate(text_clean = sapply(statement, preprocess_senate))

# TF-IDF per committee (each committee is treated as one document)
tfidf_by_committee <- corpus_data |>
  unnest_tokens(word, text_clean) |>
  group_by(committee, word) |>
  summarise(count = n(), .groups = "drop") |>
  group_by(word) |>
  # IDF = log(number of committees / number of committees that use the term)
  mutate(idf = log(length(committees) / n_distinct(committee))) |>
  mutate(tf_idf = count * idf) |>
  ungroup()

cat("Top 10 Characteristic Terms by Committee:\n\n")
#> Top 10 Characteristic Terms by Committee:

for (comm in committees) {
  top_terms <- tfidf_by_committee |>
    filter(committee == comm) |>
    arrange(desc(tf_idf)) |>
    head(10) |>
    pull(word)

  cat(sprintf("%s: %s\n", comm, paste(top_terms, collapse=", ")))
}

# Policy term similarity across committees
policy_terms <- c("budget", "healthcare", "crop", "security", "funding", "infrastructure",
                  "capacity", "access", "management", "coordination")
# Tokens were stemmed during preprocessing, so stem the lookup terms the same way
policy_stems <- wordStem(policy_terms, language = "english")

term_by_committee <- tfidf_by_committee |>
  filter(word %in% policy_stems) |>
  select(committee, word, tf_idf) |>  # drop count and idf so each committee forms one row
  pivot_wider(names_from = word, values_from = tf_idf, values_fill = 0)

cat("\n\nPolicy Term Usage by Committee (TF-IDF):\n")
#> 
#> 
#> Policy Term Usage by Committee (TF-IDF):
print(term_by_committee)

# Recommendations
cat("\n\nPolicy Insights:\n")
#> 
#> 
#> Policy Insights:
cat("1. Finance committee shows strong focus on budget, management, and revenue.\n")
#> 1. Finance committee shows strong focus on budget, management, and revenue.
cat("2. Health committee emphasises healthcare access and capacity building.\n")
#> 2. Health committee emphasises healthcare access and capacity building.
cat("3. Agriculture committee concentrates on crop management and infrastructure.\n")
#> 3. Agriculture committee concentrates on crop management and infrastructure.
cat("4. Security committee prioritises security, coordination, and infrastructure.\n")
#> 4. Security committee prioritises security, coordination, and infrastructure.
cat("5. 'Infrastructure' and 'coordination' recur across the Health, Agriculture, and Security committees: a potential collaboration area.\n")
#> 5. 'Infrastructure' and 'coordination' recur across the Health, Agriculture, and Security committees: a potential collaboration area.
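For readers following the Python track, the committee-level TF-IDF step can be sketched with the standard library only. This is a minimal sketch under stated assumptions: it uses two of the statement templates per committee rather than the sampled 500-statement corpus, a shortened stop-word list, and no stemming, so its scores are not comparable with the R results.

```python
import math
import re
from collections import Counter

# Two statement templates per committee, taken from the templates in this case study
statements = {
    "Finance": [
        "Treasury operations and debt management remain critical challenges in maintaining macro stability.",
        "Tax compliance and revenue collection efficiency have improved significantly this quarter.",
    ],
    "Health": [
        "Healthcare access in rural communities continues to lag urban centres significantly.",
        "Primary healthcare infrastructure is inadequate across northern regions.",
    ],
    "Agriculture": [
        "Crop yields have suffered due to irregular rainfall and inadequate irrigation infrastructure.",
        "Pest management and crop protection remain major challenges for small-scale farmers.",
    ],
    "Security": [
        "Border security infrastructure and personnel deployment need strategic reinforcement.",
        "Cybersecurity threats to critical national infrastructure are increasing rapidly.",
    ],
}

STOP_WORDS = {"the", "and", "for", "have", "are", "remain", "continues"}

def tokenise(text):
    """Lowercase, replace punctuation with spaces, drop stop words and short tokens."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [t for t in text.split() if t not in STOP_WORDS and len(t) > 2]

# Term counts per committee (each committee is treated as one document)
counts = {c: Counter(t for s in docs for t in tokenise(s))
          for c, docs in statements.items()}

n_committees = len(counts)
# Document frequency: number of committees in which each term occurs
df = Counter(term for c in counts.values() for term in c)

tfidf = {
    c: {t: f * math.log(n_committees / df[t]) for t, f in cnt.items()}
    for c, cnt in counts.items()
}

for committee, scores in tfidf.items():
    top = sorted(scores, key=lambda t: (-scores[t], t))[:5]
    print(f"{committee}: {', '.join(top)}")
```

Terms confined to one committee (such as “healthcare” or “crop”) receive the full weight of log(4), while a term shared by three committees (such as “infrastructure”) is discounted to log(4/3) per occurrence.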

32.8 Chapter Exercises

Chapter 32 Exercises

Exercise 32.1: From Raw Text to Features – The Full Pipeline

The following are five customer reviews left on a Nigerian e-commerce platform:

“This product is absolutely AMAZING!! Delivered in 2 days 😊 Will definitely buy again.”
“Very disappointed. The item I received was different from the picture. Wasted my money!!!”
“Okay product. Not great, not terrible. Delivery was a bit slow but item arrived in good condition.”
“Don’t buy this! The seller is a scammer. Product broke after 1 day. Very bad experience.”
“Excellent quality and fast delivery. Exactly as described. Highly recommend to everyone.”

Apply the following preprocessing steps manually to Review 1:

Convert to lowercase
Remove punctuation and special characters (including emojis)
Tokenise (split into individual words)
Remove English stopwords (e.g., “this”, “is”, “the”, “will”)

Apply stemming to the remaining tokens from Review 1 using the Porter Stemmer rules (or simply write the likely stem for each word: e.g., “delivered” → “deliv”).
After preprocessing all 5 reviews, construct a document-term matrix showing word frequencies (rows are documents, columns are unique terms). Include only words that appear in at least 2 reviews.
Now compute the TF-IDF score for the term “delivery” in Review 1, assuming your preprocessing maps “delivered” and “delivery” to the same term. Use the formula: TF = count in document ÷ total words in document; IDF = log(N ÷ df) where N = 5 documents and df = number of documents containing the term.
Based on TF-IDF scores, which words are most distinctive for Review 4 (the complaint about a scammer)? Why does TF-IDF do a better job of identifying distinctive words than raw frequency?

Exercise 32.2: Understanding Word Embeddings

A traditional bag-of-words model represents “bank” as a single column in a matrix — the same whether it refers to a financial institution or a river bank. Explain the problem this creates for text classification, using a specific example involving these two meanings of “bank.”
Word2Vec learns word embeddings from context. Explain the intuition behind this: why does training a model to predict “the missing word in a sentence” result in words with similar meanings having similar vector representations?
A Word2Vec model produces these approximate vector arithmetic results:

“King” − “Man” + “Woman” ≈ “Queen”
“Paris” − “France” + “Nigeria” ≈ ?

What would you expect the result to approximate? What does this suggest about what information is encoded in the vector space?

Word embeddings can encode societal biases. For example, early embeddings associated “doctor” more closely with “man” than “woman.” Why is this a problem for a job recommendation system? What can be done to mitigate this bias?
You are building a text classifier for customer complaints at a bank. You have 500 labelled examples. Compare two approaches: (i) bag-of-words with TF-IDF; (ii) pre-trained Word2Vec embeddings. Which would you choose given the small dataset size, and why?

Exercise 32.3: Text Similarity and Search

A law firm in Lagos wants to build a system that, given a new legal query, retrieves the most relevant past case summaries from a database of 10,000 cases.

Explain how cosine similarity works as a text similarity measure. Why is it preferred over Euclidean distance for comparing document vectors?
Two documents are represented as TF-IDF vectors:

Document A: [0.5, 0.3, 0.0, 0.4, 0.0]
Document B: [0.4, 0.0, 0.6, 0.0, 0.3]

Calculate the cosine similarity between A and B. Show all working.

With 10,000 documents and a vocabulary of 50,000 words, the TF-IDF matrix is sparse (most entries are zero). How many cells are in this matrix? If 98% of entries are zero, how many non-zero values are there? What does “sparse” mean for storage efficiency?
For a legal search system, which is more important: that the retrieved cases are all highly relevant (precision) or that no relevant case is missed (recall)? Justify your answer from a legal/business perspective.
The firm’s senior partner complains: “The system returns cases with similar words but different legal outcomes.” This is a fundamental limitation of bag-of-words approaches. What more advanced approach (mentioned in this chapter) would better capture the meaning of legal language rather than just the words used?

Exercise 32.4: Language Models and N-grams

What is a bigram? Construct all bigrams from the sentence: “The loan was approved on time.”
A trigram language model estimates the probability of the next word given the two preceding words. The model estimates:

P(“approved” | “loan”, “was”) = 0.12
P(“rejected” | “loan”, “was”) = 0.08
P(“disbursed” | “loan”, “was”) = 0.05

What does this model suggest about the most common outcome after “loan was”?

N-gram models suffer from the sparsity problem: many sequences of words never appear in training data, so the model assigns zero probability to perfectly valid sentences. Explain why this is a problem and how smoothing techniques address it.
Large Language Models (like GPT) have largely replaced n-gram models. List two things LLMs can do that n-gram models cannot.
A company wants to build an autocomplete system for its internal chat platform, where employees often use Nigerian Pidgin English and industry-specific jargon. What challenge does this present for using a pre-trained English language model? What practical solution would you recommend?

Exercise 32.5: Capstone – Analysing Customer Feedback at Scale

A mobile money service with 5 million customers receives thousands of text messages and app reviews per day. The customer experience team cannot read all of them. Your task is to design an automated text analytics pipeline.

What are the three most important preprocessing steps you would apply to mobile money transaction complaints? Explain why each is necessary. (Consider that messages may be in English, Pidgin, Hausa, Yoruba, or mixed.)
After preprocessing, you want to group messages into topics without pre-defined categories. Which algorithm would you use, and what are the two key parameters you would need to tune?
Describe how you would use TF-IDF to identify the top 5 most significant words in each topic cluster. How would you present these to the customer experience team so they can quickly understand each topic?
You discover that 15% of messages contain urgent complaints about failed transactions. How would you build an alert system that flags these for immediate human review? What features would the classifier use?
Six months after deployment, the model’s topic detection seems to be missing a new type of complaint about a recently launched feature. Why might this happen, and how would you maintain the system over time?

32.9 Further Reading

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, 993–1022.
Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. The Journal of Machine Learning Research, 3, 1137–1155.

32.10 Chapter 32 Appendix: Mathematical Derivations

32.10.1 A32.1 TF-IDF Formulation and Variants

Standard TF-IDF: \[\text{TF-IDF}_{ij} = \text{TF}_{ij} \times \text{IDF}_j\]

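where \(\text{TF}_{ij}\) is the term frequency (count or normalised) of term \(j\) in document \(i\), \(N\) is the number of documents, \(df_j\) is the number of documents containing term \(j\), and \(\text{IDF}_j = \log\left(\frac{N}{1 + df_j}\right)\), with the added 1 smoothing the denominator.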

Variants: 1. Log-normalised TF: \(\text{TF}_{ij} = 1 + \log(f_{ij})\) where \(f_{ij}\) is the raw count (defined for \(f_{ij} > 0\)). 2. Probabilistic TF: \(\text{TF}_{ij} = \frac{f_{ij}}{\sum_k f_{ik}}\) (normalised by document length). 3. Probabilistic IDF: \(\text{IDF}_j = \log\left(\frac{N - df_j}{df_j}\right)\) (negative when a term appears in more than half the documents, so frequent terms are penalised more).
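The variants can be compared numerically with a few lines of standard-library Python. The function names and the corpus size of 10 documents below are illustrative assumptions, not part of the chapter's pipeline.

```python
import math

def tf_log(count):
    """Log-normalised term frequency; zero when the term is absent."""
    return 1 + math.log(count) if count > 0 else 0.0

def tf_relative(count, doc_length):
    """Term frequency normalised by document length."""
    return count / doc_length

def idf_smoothed(n_docs, df):
    """IDF with 1 added to the document frequency in the denominator."""
    return math.log(n_docs / (1 + df))

def idf_probabilistic(n_docs, df):
    """Probabilistic IDF; negative once df exceeds half of n_docs."""
    return math.log((n_docs - df) / df)

N = 10  # hypothetical corpus size
for df in (1, 4, 9):
    print(f"df={df}: smoothed={idf_smoothed(N, df):.3f}, "
          f"probabilistic={idf_probabilistic(N, df):.3f}")

# A term occurring 3 times in a 50-word document and in 1 of the 10 documents
score = tf_relative(3, 50) * idf_smoothed(N, 1)
print(f"TF-IDF = {score:.4f}")
```

Note that the smoothed IDF reaches zero when a term appears in all but one document and turns negative when it appears in every document, so very common terms contribute nothing or slightly reduce a score.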

32.10.2 A32.2 Cosine Similarity and Norm

For vectors \(\mathbf{u}\) and \(\mathbf{v}\) in \(\mathbb{R}^d\): \[\text{cosine\_similarity}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \, \|\mathbf{v}\|} = \frac{\sum_{i=1}^{d} u_i v_i}{\sqrt{\sum_{i=1}^{d} u_i^2} \sqrt{\sum_{i=1}^{d} v_i^2}}\]

Properties: - Ranges from −1 (opposite directions) to +1 (same direction); for non-negative vectors such as TF-IDF, the range is 0 to 1. - Invariant to vector magnitude (scaling \(\mathbf{u} \to c\mathbf{u}\) with \(c > 0\) does not change similarity). - Appropriate for sparse, high-dimensional data (like TF-IDF vectors).
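These properties are easy to check directly. The sketch below implements the formula with the standard library; the vectors are arbitrary illustrative values.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

u = [1.0, 2.0, 0.0]
v = [2.0, 1.0, 1.0]

base = cosine_similarity(u, v)
scaled = cosine_similarity([3 * x for x in u], v)   # rescale u by c = 3
opposite = cosine_similarity(u, [-x for x in u])    # u against its negation

print(round(base, 4), round(scaled, 4), round(opposite, 4))
```

Rescaling `u` leaves the similarity unchanged, while comparing `u` with its negation gives −1. This is why a long document and a short one with the same word proportions count as highly similar under cosine similarity but far apart under Euclidean distance.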

32.10.3 A32.3 Word2Vec Skip-Gram Objective

The skip-gram variant of Word2Vec trains a shallow neural network to predict surrounding words (context) given a target word: \[\max_\theta \sum_{t=1}^{T} \sum_{-m \leq j \leq m, j \neq 0} \log P(w_{t+j} | w_t; \theta)\]

where \(T\) is the number of tokens in the corpus, \(m\) is the window size, and \(P(w_{\text{context}} | w_{\text{target}})\) is modelled as a softmax over embeddings. Negative sampling approximates the expensive softmax, making training efficient.
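The double sum in the objective runs over (target, context) pairs, and enumerating them for a short sentence shows what the model is trained on. The helper `skipgram_pairs` and the example sentence are illustrative assumptions; a real implementation would also subsample frequent words and draw negative samples.

```python
def skipgram_pairs(tokens, m):
    """List every (target, context) pair within a window of m words on each side."""
    pairs = []
    for t, target in enumerate(tokens):
        for j in range(-m, m + 1):
            # Skip the target itself and positions outside the sentence
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((target, tokens[t + j]))
    return pairs

tokens = ["central", "bank", "raises", "interest", "rates"]
pairs = skipgram_pairs(tokens, m=1)

for target, context in pairs:
    print(f"{target} -> {context}")
print(len(pairs), "training pairs")
```

With a window of 1, the two edge words each contribute one pair and the three interior words two pairs, giving eight pairs in total. Words that keep appearing as context for the same targets are pushed towards similar vectors.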

32.10.4 A32.4 Attention Mechanism in Transformers

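Stacking the query, key, and value vectors of all tokens row-wise into matrices \(Q\), \(K\), and \(V\), the attention-weighted representations are: \[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V\]

where: - \(Q\) (Query): a learned linear projection of each token's representation; row \(i\) is the query for the token at position \(i\). - \(K\) (Key) and \(V\) (Value): learned linear projections of all tokens (including the token itself). - The softmax is applied row-wise, so the weights for each query sum to 1. - Tokens whose keys align most closely with the query receive the highest weights.

The scaling factor \(\sqrt{d_k}\), where \(d_k\) is the key dimension, keeps the dot products from growing with the dimension; without it the softmax saturates and gradients become very small during training.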
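The effect of the scaling factor can be seen in a small simulation. The sketch below draws a random query and ten random keys with standard normal entries (an assumption chosen for illustration) and compares the softmax weights with and without division by \(\sqrt{d_k}\).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(42)
results = []
for d_k in (4, 64, 512):
    q = rng.normal(size=d_k)          # one query vector
    K = rng.normal(size=(10, d_k))    # ten key vectors
    raw = K @ q                       # unscaled dot products
    w_raw = softmax(raw)
    w_scaled = softmax(raw / np.sqrt(d_k))
    results.append((d_k, w_raw, w_scaled))
    print(f"d_k={d_k:4d}  score std={raw.std():6.2f}  "
          f"max weight unscaled={w_raw.max():.3f}  scaled={w_scaled.max():.3f}")
```

As \(d_k\) grows, the unscaled scores spread out and the unscaled softmax tends to put almost all of its weight on a single key. Dividing by \(\sqrt{d_k}\) keeps the scores on a comparable scale, so the weights stay more evenly spread.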