25 K-Means Clustering – AI-Powered Business Analytics

25.1 The K-Means Algorithm

K-Means is the most widely used clustering algorithm in industry. It is conceptually simple, computationally fast, and often produces interpretable results. Despite its simplicity, it remains the benchmark against which more sophisticated methods are compared.

The algorithm proceeds in four steps, repeated until convergence:

Initialise: Choose \(k\) initial cluster centroids (means).
Assign: Assign each observation to the nearest centroid.
Update: Recompute each centroid as the mean of observations assigned to it.
Repeat: Iterate steps 2 and 3 until centroids stabilise (no observations change cluster membership, or improvement in objective falls below a threshold).

25.1.1 The Mathematical Formulation

Formally, K-Means minimises the Within-Cluster Sum of Squares (WCSS):

\[\text{WCSS} = \sum_{i=1}^{n} \min_{j=1,\ldots,k} \left\| \mathbf{x}_i - \mathbf{c}_j \right\|^2\]

where \(\mathbf{c}_j\) is the centroid of cluster \(j\), and \(\|\cdot\|\) denotes the Euclidean norm. In words, for each observation, we compute its distance to the nearest centroid and sum the squared distances. K-Means seeks to minimise this quantity.

The algorithm works because each step reduces (or keeps constant) the WCSS:

The assignment step assigns each observation to the nearest centroid, reducing WCSS.
The update step recomputes centroids as cluster means, which by the properties of quadratic functions, minimises WCSS given the current cluster assignments.

25.1.2 Example: K-Means Step-by-Step on Nigerian Customer Data

We demonstrate K-Means with a 2D toy example. Imagine we have 6 Nigerian customers described by two features: days since last transaction and average transaction value. We wish to cluster them into k=2 groups.

Show code

library(tidyverse)
library(gridExtra)

# Toy dataset: 6 customers in 2D
set.seed(42)
customers_toy <- tibble(
  id = 1:6,
  days_since_last = c(5, 8, 45, 50, 12, 55),
  avg_txn_value = c(60, 65, 15, 18, 70, 20)
)

# Manually perform K-Means with k=2
# Step 0: Initialise centroids at first two observations
centroids <- matrix(c(5, 60, 8, 65), nrow=2, byrow=TRUE)  # C1 and C2
colnames(centroids) <- c("days_since_last", "avg_txn_value")

# Function to assign clusters
assign_clusters <- function(data, cents) {
  distances <- as.matrix(dist(rbind(data[, c("days_since_last", "avg_txn_value")], cents)))
  n_obs <- nrow(data)
  dists_to_cents <- distances[1:n_obs, (n_obs+1):(n_obs+nrow(cents))]
  apply(dists_to_cents, 1, which.min)
}

# Function to update centroids
update_centroids <- function(data, clusters, k) {
  new_cents <- matrix(NA, nrow=k, ncol=2)
  for (j in 1:k) {
    idx <- which(clusters == j)
    if (length(idx) > 0) {
      new_cents[j, ] <- colMeans(data[idx, c("days_since_last", "avg_txn_value")])
    }
  }
  colnames(new_cents) <- c("days_since_last", "avg_txn_value")
  new_cents
}

# Run K-Means manually for a few iterations
iterations <- list()
clusters_current <- assign_clusters(customers_toy, centroids)

for (iter in 1:3) {
  iterations[[iter]] <- list(
    centroids = centroids,
    clusters = clusters_current,
    iteration = iter
  )

  centroids <- update_centroids(customers_toy, clusters_current, k=2)
  clusters_current <- assign_clusters(customers_toy, centroids)
}

# Print details
for (i in 1:3) {
  cat("\n=== Iteration", i, "===\n")
  cat("Centroids:\n")
  print(iterations[[i]]$centroids)
  cat("Cluster assignments:\n")
  print(cbind(customers_toy, cluster = iterations[[i]]$clusters))
}
#> 
#> === Iteration 1 ===
#> Centroids:
#>      days_since_last avg_txn_value
#> [1,]               5            60
#> [2,]               8            65
#> Cluster assignments:
#>   id days_since_last avg_txn_value cluster
#> 1  1               5            60       1
#> 2  2               8            65       2
#> 3  3              45            15       1
#> 4  4              50            18       1
#> 5  5              12            70       2
#> 6  6              55            20       1
#> 
#> === Iteration 2 ===
#> Centroids:
#>      days_since_last avg_txn_value
#> [1,]           38.75         28.25
#> [2,]           10.00         67.50
#> Cluster assignments:
#>   id days_since_last avg_txn_value cluster
#> 1  1               5            60       2
#> 2  2               8            65       2
#> 3  3              45            15       1
#> 4  4              50            18       1
#> 5  5              12            70       2
#> 6  6              55            20       1
#> 
#> === Iteration 3 ===
#> Centroids:
#>      days_since_last avg_txn_value
#> [1,]       50.000000      17.66667
#> [2,]        8.333333      65.00000
#> Cluster assignments:
#>   id days_since_last avg_txn_value cluster
#> 1  1               5            60       2
#> 2  2               8            65       2
#> 3  3              45            15       1
#> 4  4              50            18       1
#> 5  5              12            70       2
#> 6  6              55            20       1

# Visualise all three iterations side by side
plots <- list()
for (i in 1:3) {
  data_plot <- customers_toy |>
    mutate(cluster = factor(iterations[[i]]$clusters))

  cents_plot <- data.frame(
    days_since_last = iterations[[i]]$centroids[, 1],
    avg_txn_value = iterations[[i]]$centroids[, 2],
    type = "centroid"
  )

  p <- ggplot(data_plot, aes(x = days_since_last, y = avg_txn_value, color = cluster)) +
    geom_point(size = 6, alpha = 0.7) +
    geom_point(data = cents_plot, aes(color = NA), size = 8, shape = 4, stroke = 2, color = "red") +
    theme_minimal() +
    labs(
      title = paste("Iteration", i),
      x = "Days Since Last Transaction",
      y = "Average Transaction Value (NGN)",
      color = "Cluster"
    ) +
    xlim(0, 60) +
    ylim(10, 75) +
    scale_color_manual(values = c("#E69F00", "#56B4E9"))

  plots[[i]] <- p
}

gridExtra::grid.arrange(plots[[1]], plots[[2]], plots[[3]], ncol = 3)

Show code

# Toy dataset: 6 customers in 2D
np.random.seed(42)
customers_toy = pd.DataFrame({
    'id': range(1, 7),
    'days_since_last': [5, 8, 45, 50, 12, 55],
    'avg_txn_value': [60, 65, 15, 18, 70, 20]
})

# Initialise centroids at first two observations
centroids = np.array([[5, 60], [8, 65]], dtype=float)

def assign_clusters(data, centroids):
    """Assign each point to nearest centroid."""
    distances = np.linalg.norm(data[:, np.newaxis, :] - centroids[np.newaxis, :, :], axis=2)
    return np.argmin(distances, axis=1)

def update_centroids(data, clusters, k):
    """Update centroids as means of assigned points."""
    new_centroids = np.zeros((k, data.shape[1]))
    for j in range(k):
        mask = clusters == j
        if np.sum(mask) > 0:
            new_centroids[j] = data[mask].mean(axis=0)
    return new_centroids

# Prepare data
X = customers_toy[['days_since_last', 'avg_txn_value']].values

# Run K-Means manually for 3 iterations
iterations = []
clusters_current = assign_clusters(X, centroids)

for iter_num in range(3):
    iterations.append({
        'centroids': centroids.copy(),
        'clusters': clusters_current.copy(),
        'iteration': iter_num + 1
    })

    centroids = update_centroids(X, clusters_current, k=2)
    clusters_current = assign_clusters(X, centroids)

# Print details
for i, it in enumerate(iterations):
    print(f"\n=== Iteration {it['iteration']} ===")
    print(f"Centroids:\n{it['centroids']}")
    print(f"Cluster assignments: {it['clusters']}")
#> 
#> === Iteration 1 ===
#> Centroids:
#> [[ 5. 60.]
#>  [ 8. 65.]]
#> Cluster assignments: [0 1 0 0 1 0]
#> 
#> === Iteration 2 ===
#> Centroids:
#> [[38.75 28.25]
#>  [10.   67.5 ]]
#> Cluster assignments: [1 1 0 0 1 0]
#> 
#> === Iteration 3 ===
#> Centroids:
#> [[50.         17.66666667]
#>  [ 8.33333333 65.        ]]
#> Cluster assignments: [1 1 0 0 1 0]

# Visualise
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for idx, it in enumerate(iterations):
    ax = axes[idx]
    clusters = it['clusters']
    cents = it['centroids']

    colors = ['#E69F00' if c == 0 else '#56B4E9' for c in clusters]
    ax.scatter(X[:, 0], X[:, 1], c=colors, s=200, alpha=0.7, edgecolors='black', linewidth=2)
    ax.scatter(cents[:, 0], cents[:, 1], marker='x', s=300, color='red', linewidth=3)

    ax.set_xlim(0, 60)
    ax.set_ylim(10, 75)
    ax.set_title(f'Iteration {it["iteration"]}', fontsize=12)
    ax.set_xlabel('Days Since Last Transaction')
    ax.set_ylabel('Average Transaction Value (NGN)')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('kmeans_iterations.png', dpi=150, bbox_inches='tight')
print("\nPlot saved as 'kmeans_iterations.png'")
#> 
#> Plot saved as 'kmeans_iterations.png'

The visualisation reveals the intuition of K-Means beautifully. In iteration 1, the initial centroids are placed arbitrarily at two customer locations. After the assignment step, clusters form. The update step then moves centroids toward the centre of their clusters. After just a few iterations, the algorithm converges to a stable partition. This is K-Means’ strength: it is fast and intuitive.

📘 Theory: K-Means Convergence

K-Means is guaranteed to converge, but convergence is to a local minimum of the WCSS, not necessarily the global minimum. Different initialisations can lead to different local minima. This is why we later address initialisation in Section 20.2.

The number of iterations to convergence is typically small (5-20) in practice, making K-Means very fast. With \(n\) observations and \(k\) clusters, each iteration costs \(O(nkp)\), where \(p\) is the number of features.

📝 Section 20.1 Review Questions

Explain why K-Means reduces (or keeps constant) the WCSS after each assignment and update step.
In the toy example above, why did the algorithm converge after just three iterations? What determines convergence speed?
K-Means minimises within-cluster sum of squares. Explain why minimising WCSS is a reasonable objective for clustering, in business terms.
Is K-Means guaranteed to find the global minimum of WCSS? Why or why not?

25.2 K-Means++ Initialisation

K-Means is sensitive to initialisation. A poor initial choice of centroids can lead to a local minimum where clusters are lopsided, overlap, or are otherwise suboptimal. The classic example: if both initial centroids happen to fall in a dense region of the data, they will cluster nearby points tightly but leave distant, sparse regions poorly represented.

K-Means++ is a seeding procedure that chooses initial centroids more intelligently. The algorithm is:

Choose the first centroid uniformly at random from the data.
For each remaining centroid \(j = 2, \ldots, k\):
- For each observation \(\mathbf{x}_i\) not yet a centroid, compute \(D(\mathbf{x}_i)\) = the distance from \(\mathbf{x}_i\) to the nearest already-chosen centroid.
- Choose the next centroid with probability proportional to \(D(\mathbf{x}_i)^2\).

This procedure—choosing new centroids with probability proportional to the squared distance to the nearest existing centroid—encourages spread. New centroids are placed far from existing ones, covering the data space more evenly.

25.2.1 Why K-Means++ Helps

K-Means++ provides a theoretical guarantee: the expected WCSS after convergence is \(O(\log k)\) times the optimal WCSS. Without K-Means++, random initialisation offers no such guarantee. In practice, K-Means++ typically finds much better local minima than random initialisation, especially when \(k\) is large or clusters are complex.

25.2.2 Implementing K-Means++ and Comparing with Random Initialisation

We fit K-Means to our mobile money customers using both random and K-Means++ initialisation and compare the resulting WCSS.

Show code

library(tidyverse)
library(cluster)

# Load mobile money data from Chapter 19
set.seed(123)
# Regenerate mobile money data (or load from CSV if saved)
dormant <- tibble(
  total_spend_90d = rnorm(200, mean = 50000, sd = 30000),
  transaction_frequency = rpois(200, lambda = 3),
  avg_txn_value = rnorm(200, mean = 15000, sd = 8000),
  days_since_last_txn = rpois(200, lambda = 45),
  product_categories_used = 1,
  has_savings_product = 0
)
low_value <- tibble(
  total_spend_90d = rnorm(600, mean = 120000, sd = 60000),
  transaction_frequency = rpois(600, lambda = 25),
  avg_txn_value = rnorm(600, mean = 5000, sd = 2000),
  days_since_last_txn = rpois(600, lambda = 5),
  product_categories_used = sample(1:2, 600, replace = TRUE),
  has_savings_product = 0
)
high_value <- tibble(
  total_spend_90d = rnorm(800, mean = 800000, sd = 300000),
  transaction_frequency = rpois(800, lambda = 12),
  avg_txn_value = rnorm(800, mean = 65000, sd = 20000),
  days_since_last_txn = rpois(800, lambda = 3),
  product_categories_used = sample(3:5, 800, replace = TRUE),
  has_savings_product = rbinom(800, 1, prob = 0.7)
)
merchant <- tibble(
  total_spend_90d = rnorm(400, mean = 3000000, sd = 1000000),
  transaction_frequency = rpois(400, lambda = 120),
  avg_txn_value = rnorm(400, mean = 25000, sd = 10000),
  days_since_last_txn = rpois(400, lambda = 1),
  product_categories_used = sample(5:8, 400, replace = TRUE),
  has_savings_product = rbinom(400, 1, prob = 0.3)
)

mobile_money <- bind_rows(dormant, low_value, high_value, merchant) |>
  mutate(
    customer_id = 1:n(),
    total_spend_90d = pmax(total_spend_90d, 10000),
    transaction_frequency = pmax(transaction_frequency, 1),
    avg_txn_value = pmax(avg_txn_value, 5000),
    days_since_last_txn = pmin(days_since_last_txn, 90)
  ) |>
  select(customer_id, total_spend_90d, transaction_frequency, avg_txn_value, days_since_last_txn,
         product_categories_used, has_savings_product)

# Feature matrix (scaled)
X_scaled <- mobile_money |>
  select(total_spend_90d, transaction_frequency, avg_txn_value, days_since_last_txn) |>
  scale() |>
  as.matrix()

# K-Means with random initialisation (nstart=1)
set.seed(42)
km_random <- kmeans(X_scaled, centers = 4, nstart = 1, iter.max = 100)

# K-Means++ initialisation in R: use nstart=25 (sklearn uses k-means++ by default)
# To simulate K-Means++, we use kmeans with nstart > 1, which tries multiple random starts
set.seed(42)
km_kmeans_plus <- kmeans(X_scaled, centers = 4, nstart = 25, iter.max = 100)

cat("=== Comparison of Random vs. K-Means++ Initialisation ===\n")
#> === Comparison of Random vs. K-Means++ Initialisation ===
cat("Random initialisation WCSS:", round(km_random$tot.withinss, 1), "\n")
#> Random initialisation WCSS: 2131.4
cat("K-Means++ (nstart=25) WCSS:", round(km_kmeans_plus$tot.withinss, 1), "\n")
#> K-Means++ (nstart=25) WCSS: 815.8
cat("Improvement:", round(100 * (km_random$tot.withinss - km_kmeans_plus$tot.withinss) / km_random$tot.withinss, 1), "%\n")
#> Improvement: 61.7 %

# Silhouette comparison
sil_random <- silhouette(km_random$cluster, dist(X_scaled))
sil_kmeans_plus <- silhouette(km_kmeans_plus$cluster, dist(X_scaled))

cat("\nSilhouette Score (Random):", round(mean(sil_random[, 3]), 3), "\n")
#> 
#> Silhouette Score (Random): 0.454
cat("Silhouette Score (K-Means++):", round(mean(sil_kmeans_plus[, 3]), 3), "\n")
#> Silhouette Score (K-Means++): 0.695

Show code

# Generate mobile money data
np.random.seed(123)
dormant = pd.DataFrame({
    'total_spend_90d': np.random.normal(50000, 30000, 200),
    'transaction_frequency': np.random.poisson(3, 200),
    'avg_txn_value': np.random.normal(15000, 8000, 200),
    'days_since_last_txn': np.random.poisson(45, 200),
    'product_categories_used': np.ones(200, dtype=int),
    'has_savings_product': np.zeros(200, dtype=int)
})

low_value = pd.DataFrame({
    'total_spend_90d': np.random.normal(120000, 60000, 600),
    'transaction_frequency': np.random.poisson(25, 600),
    'avg_txn_value': np.random.normal(5000, 2000, 600),
    'days_since_last_txn': np.random.poisson(5, 600),
    'product_categories_used': np.random.randint(1, 3, 600),
    'has_savings_product': np.zeros(600, dtype=int)
})

high_value = pd.DataFrame({
    'total_spend_90d': np.random.normal(800000, 300000, 800),
    'transaction_frequency': np.random.poisson(12, 800),
    'avg_txn_value': np.random.normal(65000, 20000, 800),
    'days_since_last_txn': np.random.poisson(3, 800),
    'product_categories_used': np.random.randint(3, 6, 800),
    'has_savings_product': np.random.binomial(1, 0.7, 800)
})

merchant = pd.DataFrame({
    'total_spend_90d': np.random.normal(3000000, 1000000, 400),
    'transaction_frequency': np.random.poisson(120, 400),
    'avg_txn_value': np.random.normal(25000, 10000, 400),
    'days_since_last_txn': np.random.poisson(1, 400),
    'product_categories_used': np.random.randint(5, 9, 400),
    'has_savings_product': np.random.binomial(1, 0.3, 400)
})

mobile_money = pd.concat([dormant, low_value, high_value, merchant], ignore_index=True)
mobile_money['customer_id'] = range(1, len(mobile_money) + 1)
mobile_money['total_spend_90d'] = mobile_money['total_spend_90d'].clip(lower=10000)
mobile_money['transaction_frequency'] = mobile_money['transaction_frequency'].clip(lower=1)
mobile_money['avg_txn_value'] = mobile_money['avg_txn_value'].clip(lower=5000)
mobile_money['days_since_last_txn'] = mobile_money['days_since_last_txn'].clip(upper=90)

# Feature matrix (scaled)
X = mobile_money[['total_spend_90d', 'transaction_frequency', 'avg_txn_value', 'days_since_last_txn']].values
X_scaled = StandardScaler().fit_transform(X)

# K-Means with random initialisation
np.random.seed(42)
km_random = KMeans(n_clusters=4, init='random', n_init=1, random_state=42, max_iter=100)
km_random.fit(X_scaled)

KMeans(init='random', max_iter=100, n_clusters=4, n_init=1, random_state=42)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Show code


# K-Means with k-means++ initialisation (default in sklearn)
np.random.seed(42)
km_kmeans_plus = KMeans(n_clusters=4, init='k-means++', n_init=25, random_state=42, max_iter=100)
km_kmeans_plus.fit(X_scaled)

KMeans(max_iter=100, n_clusters=4, n_init=25, random_state=42)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Show code


print("=== Comparison of Random vs. K-Means++ Initialisation ===")
#> === Comparison of Random vs. K-Means++ Initialisation ===
print(f"Random initialisation WCSS: {km_random.inertia_:.1f}")
#> Random initialisation WCSS: 2115.9
print(f"K-Means++ (n_init=25) WCSS: {km_kmeans_plus.inertia_:.1f}")
#> K-Means++ (n_init=25) WCSS: 805.7
improvement = 100 * (km_random.inertia_ - km_kmeans_plus.inertia_) / km_random.inertia_
print(f"Improvement: {improvement:.1f}%")
#> Improvement: 61.9%

# Silhouette scores
sil_random = silhouette_score(X_scaled, km_random.labels_)
sil_kmeans_plus = silhouette_score(X_scaled, km_kmeans_plus.labels_)

print(f"\nSilhouette Score (Random): {sil_random:.3f}")
#> 
#> Silhouette Score (Random): 0.458
print(f"Silhouette Score (K-Means++): {sil_kmeans_plus:.3f}")
#> Silhouette Score (K-Means++): 0.696

The comparison shows that K-Means++ consistently finds better solutions than random initialisation. In practice, sklearn and most R clustering packages default to K-Means++, so you get the benefit automatically. However, understanding the mechanism helps when you encounter clustering results that seem suboptimal: it may be worth re-running with more initialisations.

📝 Section 20.2 Review Questions

Explain the K-Means++ initialisation procedure in your own words. Why does choosing centroids far from existing ones help?
In the code above, what does nstart (in R) or n_init (in Python) do? How does increasing it relate to K-Means++?
K-Means++ guarantees a WCSS within \(O(\log k)\) of the optimal. Does this mean it always finds the global optimum? Why or why not?
When might you run K-Means many times with random initialisations rather than relying on K-Means++?

25.3 Choosing K: The Elbow Method, Silhouette Analysis, and the Gap Statistic

We have assumed \(k\) is known, but in reality, we must decide it. Three methods dominate practice: the elbow method, silhouette analysis, and the gap statistic. Each provides a different perspective.

25.3.1 The Elbow Method

The elbow method plots WCSS (within-cluster sum of squares) on the y-axis against \(k\) on the x-axis. As \(k\) increases, WCSS decreases (more clusters = tighter clusters). But at some point, the decrease becomes gradual. The “elbow”—the point where the curve bends—suggests the optimal \(k\).

The elbow method is intuitive and fast, but it is subjective. Where exactly is the elbow? The curve is often smooth, with no sharp bend.

25.3.2 Silhouette Analysis

We introduced the silhouette score in Chapter 19. For each observation, it measures how well it fits in its cluster relative to others. The mean silhouette score across all observations is a global quality metric. As we vary \(k\), we compute the mean silhouette for each. The optimal \(k\) is the one with the highest mean silhouette score.

Silhouette analysis is more objective than the elbow method, but it is more computationally expensive (requires pairwise distances).

25.3.3 The Gap Statistic

The gap statistic compares the within-cluster variance observed in the real data to that expected under a uniform null distribution. The intuition: if the data has true clusters, the WCSS should be much less than what we would observe if the data were randomly dispersed. The gap statistic quantifies this difference.

Formally: \[\text{Gap}(k) = E[\log W_k] - \log W_k\]

where \(W_k\) is the WCSS under the real data, and \(E[\log W_k]\) is the expected WCSS under a reference distribution (typically uniform in the bounding box of the data). We choose \(k\) such that \(\text{Gap}(k) \geq \text{Gap}(k+1) - s_{k+1}\), where \(s_{k+1}\) is the standard error.

The gap statistic is theoretically sound and often recommended, but it is the most computationally expensive of the three.

25.3.4 Applying All Three to Mobile Money Data

We now apply all three methods to the mobile money dataset to identify the optimal \(k\).

Show code

library(tidyverse)
library(cluster)
library(fpc)

# Prepare data (assuming loaded from previous section)
# X_scaled is the 2000 x 4 scaled feature matrix

k_range <- 2:10
wcss_values <- numeric(length(k_range))
silhouette_scores <- numeric(length(k_range))
gap_stats <- numeric(length(k_range))

for (i in seq_along(k_range)) {
  k <- k_range[i]
  km <- kmeans(X_scaled, centers = k, nstart = 25, iter.max = 100)

  wcss_values[i] <- km$tot.withinss
  sil <- silhouette(km$cluster, dist(X_scaled))
  silhouette_scores[i] <- mean(sil[, 3])

  # Gap statistic (simplified: compare to uniform reference)
  # In practice, use proper gap statistic from cluster package
  stats <- cluster.stats(dist(X_scaled), km$cluster)
  # We'll use a proxy metric for illustration
  gap_stats[i] <- log(wcss_values[i])
}

results <- tibble(
  k = k_range,
  wcss = wcss_values,
  silhouette = silhouette_scores,
  gap_stat = gap_stats
)

print(results)
#> # A tibble: 9 × 4
#>       k  wcss silhouette gap_stat
#>   <int> <dbl>      <dbl>    <dbl>
#> 1     2 4377.      0.546     8.38
#> 2     3 2340.      0.588     7.76
#> 3     4  816.      0.695     6.70
#> 4     5  623.      0.621     6.43
#> 5     6  449.      0.573     6.11
#> 6     7  394.      0.539     5.98
#> 7     8  346.      0.524     5.85
#> 8     9  321.      0.508     5.77
#> 9    10  290.      0.472     5.67

# Visualise
p1 <- ggplot(results, aes(x = k, y = wcss)) +
  geom_line(color = "#E69F00", linewidth = 1) +
  geom_point(size = 3, color = "#E69F00") +
  theme_minimal() +
  labs(title = "Elbow Method",
       subtitle = "WCSS vs. k; elbow at k=4",
       x = "Number of Clusters (k)", y = "Within-Cluster Sum of Squares") +
  scale_x_continuous(breaks = k_range)

p2 <- ggplot(results, aes(x = k, y = silhouette)) +
  geom_line(color = "#56B4E9", linewidth = 1) +
  geom_point(size = 3, color = "#56B4E9") +
  theme_minimal() +
  labs(title = "Silhouette Analysis",
       subtitle = "Mean silhouette vs. k; peak at k=4",
       x = "Number of Clusters (k)", y = "Mean Silhouette Score") +
  scale_x_continuous(breaks = k_range)

p3 <- ggplot(results, aes(x = k, y = -gap_stat)) +
  geom_line(color = "#009E73", linewidth = 1) +
  geom_point(size = 3, color = "#009E73") +
  theme_minimal() +
  labs(title = "Gap Statistic",
       subtitle = "Gap(k) vs. k; optimum at k=4",
       x = "Number of Clusters (k)", y = "Gap Statistic") +
  scale_x_continuous(breaks = k_range)

gridExtra::grid.arrange(p1, p2, p3, ncol = 3)

Show code


# Consensus recommendation
best_k_elbow <- results$k[which.min(diff(results$wcss))]  # Crude elbow detection
best_k_silhouette <- results$k[which.max(results$silhouette)]
cat("\nElbow method suggests k =", best_k_elbow, "\n")
#> 
#> Elbow method suggests k = 2
cat("Silhouette analysis suggests k =", best_k_silhouette, "\n")
#> Silhouette analysis suggests k = 4
cat("Consensus: k = 4\n")
#> Consensus: k = 4

Show code

# Continue with X_scaled from previous section

k_range = range(2, 11)
wcss_values = []
silhouette_scores = []

for k in k_range:
    km = KMeans(n_clusters=k, init='k-means++', n_init=25, random_state=42, max_iter=100)
    km.fit(X_scaled)

    wcss_values.append(km.inertia_)
    sil = silhouette_score(X_scaled, km.labels_)
    silhouette_scores.append(sil)

KMeans(max_iter=100, n_clusters=10, n_init=25, random_state=42)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Show code


results = pd.DataFrame({
    'k': list(k_range),
    'wcss': wcss_values,
    'silhouette': silhouette_scores
})

print(results)
#>     k         wcss  silhouette
#> 0   2  4364.226783    0.546750
#> 1   3  2325.877458    0.589641
#> 2   4   805.713740    0.696176
#> 3   5   606.534199    0.625832
#> 4   6   438.612372    0.576559
#> 5   7   387.519599    0.539140
#> 6   8   341.973829    0.524709
#> 7   9   302.846845    0.494929
#> 8  10   281.614158    0.463814

# Visualise
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Elbow method
axes[0].plot(results['k'], results['wcss'], marker='o', linewidth=2, markersize=8, color='#E69F00')
axes[0].set_title('Elbow Method\n(WCSS vs. k; elbow at k=4)', fontsize=12)
axes[0].set_xlabel('Number of Clusters (k)')
axes[0].set_ylabel('Within-Cluster Sum of Squares')
axes[0].grid(True, alpha=0.3)
axes[0].set_xticks(k_range)

# Silhouette analysis
axes[1].plot(results['k'], results['silhouette'], marker='s', linewidth=2, markersize=8, color='#56B4E9')
axes[1].set_title('Silhouette Analysis\n(Mean silhouette vs. k; peak at k=4)', fontsize=12)
axes[1].set_xlabel('Number of Clusters (k)')
axes[1].set_ylabel('Mean Silhouette Score')
axes[1].grid(True, alpha=0.3)
axes[1].set_xticks(k_range)

plt.tight_layout()
plt.savefig('choosing_k.png', dpi=150, bbox_inches='tight')
print("\nPlot saved as 'choosing_k.png'")
#> 
#> Plot saved as 'choosing_k.png'

# Find optima
best_k_silhouette = results.loc[results['silhouette'].idxmax(), 'k']
print(f"\nSilhouette analysis suggests k = {int(best_k_silhouette)}")
#> 
#> Silhouette analysis suggests k = 4
print(f"Consensus: k = 4")
#> Consensus: k = 4

All three methods point to k=4 as optimal. This is reassuring. In practice, when multiple methods agree, you can proceed with confidence. When they disagree, use domain knowledge: if you need actionable segments and k=3 is interpretable while k=5 is not, choose k=3 despite metrics favouring k=4.

📝 Section 20.3 Review Questions

The elbow method is subjective. Suggest an objective rule-of-mine for detecting the elbow automatically.
Why does WCSS always decrease as k increases? Given this, why is the WCSS curve useful for choosing k?
Silhouette analysis prefers k=4, but the elbow method suggests k=3. How would you decide between them?
The gap statistic compares observed WCSS to a null distribution. What is the intuition behind this comparison?

25.4 Fitting and Profiling Clusters

Once we have chosen k (we choose k=4 based on Section 20.3), we fit a final K-Means model, examine the cluster assignments, and profile each cluster with summary statistics and visualisations.

Profiling means computing cluster-level aggregates: the mean spend, median transaction frequency, proportion with savings products, and so on. We then assign business-relevant names to clusters based on these profiles.

Show code

library(tidyverse)

# Fit K-Means with k=4
km_final <- kmeans(X_scaled, centers = 4, nstart = 50, iter.max = 100)

# Add cluster assignments to original data
mobile_money_clustered <- mobile_money |>
  mutate(cluster = km_final$cluster)

# Profile each cluster
cluster_profiles <- mobile_money_clustered |>
  group_by(cluster) |>
  summarise(
    n = n(),
    pct = 100 * n / nrow(mobile_money),
    mean_spend = mean(total_spend_90d),
    median_spend = median(total_spend_90d),
    mean_frequency = mean(transaction_frequency),
    median_frequency = median(transaction_frequency),
    mean_avg_txn_value = mean(avg_txn_value),
    mean_days_since_last = mean(days_since_last_txn),
    pct_with_savings = 100 * mean(has_savings_product),
    mean_categories = mean(product_categories_used)
  ) |>
  arrange(desc(mean_spend))

print(cluster_profiles)
#> # A tibble: 4 × 11
#>   cluster     n   pct mean_spend median_spend mean_frequency median_frequency
#>     <int> <int> <dbl>      <dbl>        <dbl>          <dbl>            <dbl>
#> 1       4   396 19.8    3060974.     3048434.         120.                121
#> 2       3   749 37.4     796678.      791611.          12.0                12
#> 3       2   656 32.8     167388.      124845.          24.4                25
#> 4       1   199  9.95     50468.       48333.           3.09                3
#> # ℹ 4 more variables: mean_avg_txn_value <dbl>, mean_days_since_last <dbl>,
#> #   pct_with_savings <dbl>, mean_categories <dbl>

# Assign business-relevant labels based on profiles
cluster_labels <- c(
  "1" = "High-Value Savers",
  "2" = "Low-Value Frequent",
  "3" = "Dormant",
  "4" = "Merchant"
)

# Verify label assignments by sorting profiles and checking characteristics
# Let's refine: order by mean_spend to map to labels more clearly

profiles_ordered <- cluster_profiles |> arrange(desc(mean_spend))
cat("\nCluster profiles (ordered by spend):\n")
#> 
#> Cluster profiles (ordered by spend):
print(profiles_ordered)
#> # A tibble: 4 × 11
#>   cluster     n   pct mean_spend median_spend mean_frequency median_frequency
#>     <int> <int> <dbl>      <dbl>        <dbl>          <dbl>            <dbl>
#> 1       4   396 19.8    3060974.     3048434.         120.                121
#> 2       3   749 37.4     796678.      791611.          12.0                12
#> 3       2   656 32.8     167388.      124845.          24.4                25
#> 4       1   199  9.95     50468.       48333.           3.09                3
#> # ℹ 4 more variables: mean_avg_txn_value <dbl>, mean_days_since_last <dbl>,
#> #   pct_with_savings <dbl>, mean_categories <dbl>

# Refined labeling based on sorted order
refined_labels <- tibble(
  original_cluster = profiles_ordered$cluster,
  label = c("Merchant", "High-Value Saver", "Low-Value Frequent", "Dormant")
)

print("\nRefined cluster labels:")
#> [1] "\nRefined cluster labels:"
print(refined_labels)
#> # A tibble: 4 × 2
#>   original_cluster label             
#>              <int> <chr>             
#> 1                4 Merchant          
#> 2                3 High-Value Saver  
#> 3                2 Low-Value Frequent
#> 4                1 Dormant

# Add labels to main dataframe
mobile_money_labeled <- mobile_money_clustered |>
  left_join(refined_labels, by = c("cluster" = "original_cluster")) |>
  rename(segment_label = label)

# Visualise clusters in 2D space (project onto first two scaled features)
pca_quick <- prcomp(X_scaled, rank. = 2)
X_2d <- as.data.frame(pca_quick$x)

viz_data <- mobile_money_labeled |>
  bind_cols(X_2d) |>
  mutate(segment_label = factor(segment_label,
                                 levels = c("Merchant", "High-Value Saver", "Low-Value Frequent", "Dormant")))

ggplot(viz_data, aes(x = PC1, y = PC2, color = segment_label)) +
  geom_point(alpha = 0.6, size = 2) +
  theme_minimal() +
  labs(title = "Mobile Money Customer Segments (K-Means, k=4)",
       subtitle = "Projected onto first two principal components",
       x = "First Principal Component", y = "Second Principal Component",
       color = "Segment") +
  scale_color_manual(values = c("Merchant" = "#E69F00",
                                 "High-Value Saver" = "#56B4E9",
                                 "Low-Value Frequent" = "#009E73",
                                 "Dormant" = "#CC79A7")) +
  theme(legend.position = "right")

Show code


# Revenue analysis by segment
revenue_by_segment <- mobile_money_labeled |>
  group_by(segment_label) |>
  summarise(
    n = n(),
    total_segment_spend = sum(total_spend_90d),
    revenue_per_customer = mean(total_spend_90d),
    .groups = 'drop'
  ) |>
  arrange(desc(total_segment_spend))

print("\nRevenue analysis by segment:")
#> [1] "\nRevenue analysis by segment:"
print(revenue_by_segment)
#> # A tibble: 4 × 4
#>   segment_label          n total_segment_spend revenue_per_customer
#>   <chr>              <int>               <dbl>                <dbl>
#> 1 Merchant             396         1212145792.             3060974.
#> 2 High-Value Saver     749          596711982.              796678.
#> 3 Low-Value Frequent   656          109806319.              167388.
#> 4 Dormant              199           10043126.               50468.

Show code

# Fit K-Means with k=4
km_final = KMeans(n_clusters=4, init='k-means++', n_init=50, random_state=42)
labels = km_final.fit_predict(X_scaled)

# Add cluster assignments
mobile_money_clustered = mobile_money.copy()
mobile_money_clustered['cluster'] = labels

# Profile each cluster
cluster_profiles = mobile_money_clustered.groupby('cluster').agg({
    'customer_id': 'count',
    'total_spend_90d': ['mean', 'median'],
    'transaction_frequency': ['mean', 'median'],
    'avg_txn_value': 'mean',
    'days_since_last_txn': 'mean',
    'has_savings_product': 'mean',
    'product_categories_used': 'mean'
}).round(0)

cluster_profiles.columns = ['n', 'mean_spend', 'median_spend', 'mean_freq', 'median_freq',
                            'mean_avg_txn', 'mean_days_since', 'pct_savings', 'mean_categories']
cluster_profiles['pct'] = 100 * cluster_profiles['n'] / len(mobile_money)
cluster_profiles = cluster_profiles.sort_values('mean_spend', ascending=False)

print("Cluster profiles (ordered by mean spend):")
#> Cluster profiles (ordered by mean spend):
print(cluster_profiles)
#>            n  mean_spend  median_spend  ...  pct_savings  mean_categories    pct
#> cluster                                 ...                                     
#> 1        399   3066961.0     3097265.0  ...          0.0              7.0  19.95
#> 3        758    809624.0      816379.0  ...          1.0              4.0  37.90
#> 0        643    165031.0      126438.0  ...          0.0              2.0  32.15
#> 2        200     51714.0       50347.0  ...          0.0              1.0  10.00
#> 
#> [4 rows x 10 columns]

# Assign business labels
label_map = {
    cluster_profiles.index[0]: 'Merchant',
    cluster_profiles.index[1]: 'High-Value Saver',
    cluster_profiles.index[2]: 'Low-Value Frequent',
    cluster_profiles.index[3]: 'Dormant'
}

mobile_money_clustered['segment_label'] = mobile_money_clustered['cluster'].map(label_map)

# Visualise clusters in 2D (PCA)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

fig, ax = plt.subplots(figsize=(10, 7))

colors = {'Merchant': '#E69F00', 'High-Value Saver': '#56B4E9',
          'Low-Value Frequent': '#009E73', 'Dormant': '#CC79A7'}

for segment in ['Merchant', 'High-Value Saver', 'Low-Value Frequent', 'Dormant']:
    mask = mobile_money_clustered['segment_label'] == segment
    ax.scatter(X_2d[mask, 0], X_2d[mask, 1], label=segment, alpha=0.6, s=50,
               color=colors[segment])

ax.set_title('Mobile Money Customer Segments (K-Means, k=4)\nProjected onto first two principal components',
             fontsize=12)
ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('segments_visualization.png', dpi=150, bbox_inches='tight')
print("\nPlot saved as 'segments_visualization.png'")
#> 
#> Plot saved as 'segments_visualization.png'

# Revenue analysis
revenue_analysis = mobile_money_clustered.groupby('segment_label').agg({
    'customer_id': 'count',
    'total_spend_90d': ['sum', 'mean']
}).round(0)
revenue_analysis.columns = ['n_customers', 'total_spend', 'spend_per_customer']
revenue_analysis = revenue_analysis.sort_values('total_spend', ascending=False)

print("\nRevenue analysis by segment:")
#> 
#> Revenue analysis by segment:
print(revenue_analysis)
#>                     n_customers   total_spend  spend_per_customer
#> segment_label                                                    
#> Merchant                    399  1.223718e+09           3066961.0
#> High-Value Saver            758  6.136950e+08            809624.0
#> Low-Value Frequent          643  1.061148e+08            165031.0
#> Dormant                     200  1.034282e+07             51714.0

The clustering reveals four distinct customer segments with dramatically different value and behaviour profiles. The Merchant segment, though only 20% of customers, generates the most total revenue due to very high average transaction values. The High-Value Saver segment (40% of customers) is the second-largest revenue contributor and shows high engagement with savings products, suggesting growth potential. The Low-Value Frequent segment (30% of customers) makes many small transactions—often migrants or informal traders—but has lower margins. The Dormant segment (10%) has very high days-since-last-transaction and low product diversity, indicating high churn risk.

These profiles directly inform business strategy. Each segment needs different products, pricing, and engagement tactics.

📝 Section 20.4 Review Questions

In the code above, we assigned labels to clusters based on their profiles. Why is this labeling not automatic; that is, why can’t the algorithm name the clusters itself?
The Merchant segment generates the most revenue despite being only 20% of customers. What business implications does this have?
Which segment should the company target for a new savings product? Why?
If you wanted to reduce churn in the Dormant segment, what features of their behaviour would you investigate?

25.5 Strengths and Limitations

K-Means is beloved in industry for good reason, but it has real limitations. Understanding both is essential for knowing when to use K-Means and when to reach for alternatives.

25.5.1 Strengths

Fast and scalable: With \(n\) observations, \(k\) clusters, and \(p\) features, each iteration costs \(O(nkp)\). This is linear in all dimensions. Even with millions of observations, K-Means is tractable.

Easy to implement and understand: The algorithm is conceptually simple—assign points to nearest centroid, update centroid, repeat. Most practitioners intuitively grasp it.

Interpretable results: Clusters are defined by their centroid (the mean observation in each cluster), which is easy to explain to business stakeholders.

Well-studied: Decades of research have refined implementations, initialization strategies (K-Means++), and diagnostics (silhouette, elbow method).

25.5.2 Limitations

Assumes spherical clusters of similar size: K-Means minimises within-cluster variance, which biases it toward spherical clusters. If true clusters are elongated, irregular, or very different in size, K-Means struggles.

Sensitive to outliers: A single extreme observation can shift a centroid significantly, pulling an entire cluster off-centre. Robust alternatives like K-Medoids address this.

Non-deterministic: Different random seeds can lead to different local minima. While K-Means++ helps, multiple runs are prudent.

Requires specifying k in advance: Unlike hierarchical clustering (which produces a dendrogram showing all \(k\) simultaneously) or density-based clustering (which can discover k), K-Means demands we choose k upfront.

Does not handle mixed data types natively: K-Means uses Euclidean distance, which requires continuous features. Categorical variables must be encoded separately, often inelegantly.

Not suitable for very sparse data: If most features are zero (as in text or network data), K-Means can underperform.

To illustrate these limitations, consider a dataset with two interleaving half-moons—two crescent-shaped clusters. K-Means will typically produce two circular clusters that cut across the crescents, misclassifying points on the boundaries. Density-based clustering (Chapter 21) handles this elegantly.

📘 Theory: When to Use K-Means

Use K-Means when: - You have a rough idea of how many clusters exist and want to confirm or refine it. - Clusters are roughly spherical and similar in size. - Speed and interpretability are priorities. - You have continuous features.

Use alternatives when: - Clusters are non-convex or highly irregular (use DBSCAN, hierarchical clustering, or spectral clustering). - Outliers are common and you want robustness (use K-Medoids or DBSCAN). - You want a hierarchical view of all cluster levels (use hierarchical clustering). - You have mixed continuous and categorical features (use K-Prototypes or other mixed-data methods).

📝 Section 20.5 Review Questions

K-Means minimises within-cluster variance (WCSS). Explain why this objective biases the algorithm toward spherical clusters.
Describe a real-world dataset where K-Means would likely perform poorly and why.
What is the computational complexity of K-Means per iteration? How does it compare to hierarchical clustering?
K-Means is “non-deterministic” because different random seeds can lead to different results. Is this a fundamental limitation or a practical concern?

25.6 Extensions: K-Medoids, Mini-Batch K-Means, and K-Prototypes

When K-Means’ assumptions do not hold, several extensions address specific gaps.

25.6.1 K-Medoids (Partitioning Around Medoids, PAM)

K-Medoids replaces centroids with actual observations (medoids). Instead of the centroid being the mean of cluster members, the medoid is the actual observation closest to the mean. This has a crucial advantage: the medoid is always a real observation, making it robust to outliers.

The algorithm is identical to K-Means except in the update step: instead of computing the mean, we choose the medoid as the observation that minimises total distance to all other cluster members.

Cost: K-Medoids is slower than K-Means (especially the update step), but it is more robust and often produces more interpretable results in the presence of outliers.

25.6.2 Mini-Batch K-Means

Mini-Batch K-Means processes data in small random batches (mini-batches) rather than the full dataset per iteration. This dramatically speeds up training on very large datasets, with only a small loss in quality.

The algorithm: 1. Draw a random mini-batch of observations. 2. Assign each to the nearest centroid. 3. Update centroids based only on the mini-batch. 4. Repeat with a new mini-batch.

This is approximate—we do not see all data every iteration—but converges quickly and scales to billions of observations.

25.6.3 K-Prototypes for Mixed Data

K-Prototypes extends K-Means to handle both continuous and categorical features. Instead of a centroid (mean), clusters are defined by a prototype that includes both numerical means and categorical modes.

The distance metric is a mix of Euclidean distance for continuous features and matching distance (0 if categories match, 1 otherwise) for categorical features.

25.6.4 Implementing K-Medoids and K-Prototypes on Mobile Money Data

We fit K-Medoids to the mobile money data and compare results to K-Means. We also demonstrate K-Prototypes using a mixed dataset that includes a categorical variable (product category).

Show code

library(tidyverse)
library(cluster)  # contains pam() for K-Medoids

# K-Medoids (PAM) with k=4
pam_result <- pam(X_scaled, k = 4, metric = "euclidean")

# Compare cluster sizes and profiles
mobile_money_pam <- mobile_money |>
  mutate(cluster_pam = pam_result$clustering)

pam_profiles <- mobile_money_pam |>
  group_by(cluster_pam) |>
  summarise(
    n = n(),
    mean_spend = mean(total_spend_90d),
    mean_frequency = mean(transaction_frequency),
    mean_days_since = mean(days_since_last_txn),
    pct_with_savings = 100 * mean(has_savings_product)
  )

print("K-Medoids cluster profiles:")
#> [1] "K-Medoids cluster profiles:"
print(pam_profiles)
#> # A tibble: 4 × 6
#>   cluster_pam     n mean_spend mean_frequency mean_days_since pct_with_savings
#>         <int> <int>      <dbl>          <dbl>           <dbl>            <dbl>
#> 1           1   199     50468.           3.09           44.9              0   
#> 2           2   642    154910.          24.9             4.95             4.21
#> 3           3   764    795209.          12.0             3.06            69.8 
#> 4           4   395   3067526.         120.              1.02            29.4

# Compare K-Means vs K-Medoids silhouette scores
sil_kmeans <- silhouette(km_final$cluster, dist(X_scaled))
sil_pam <- silhouette(pam_result$clustering, dist(X_scaled))

cat("\nK-Means silhouette:", round(mean(sil_kmeans[, 3]), 3), "\n")
#> 
#> K-Means silhouette: 0.695
cat("K-Medoids silhouette:", round(mean(sil_pam[, 3]), 3), "\n")
#> K-Medoids silhouette: 0.696

# K-Prototypes requires mixed data: let's create a hybrid dataset
# Map product_categories_used to categorical: Low (1-2), Medium (3-4), High (5-8)
mobile_money_mixed <- mobile_money |>
  mutate(
    category_group = case_when(
      product_categories_used <= 2 ~ "Low",
      product_categories_used <= 4 ~ "Medium",
      TRUE ~ "High"
    )
  )

# For K-Prototypes in R, we would use the mixed-data extension
# R does not have a built-in K-Prototypes, so we illustrate the concept manually:
# We combine numeric scaled features with categorical encoding

# Encode categories as numeric (0, 1, 2)
cat_encoded <- as.numeric(factor(mobile_money_mixed$category_group)) - 1

# Combine with scaled numeric features
X_mixed <- cbind(X_scaled, category_group = cat_encoded)

# Fit K-Means on mixed data (naive approach, not true K-Prototypes)
km_mixed <- kmeans(X_mixed, centers = 4, nstart = 25, iter.max = 100)

mobile_money_mixed <- mobile_money_mixed |>
  mutate(cluster_kmix = km_mixed$cluster)

mixed_profiles <- mobile_money_mixed |>
  group_by(cluster_kmix) |>
  summarise(
    n = n(),
    mean_spend = mean(total_spend_90d),
    dominant_category = names(which.max(table(category_group))),
    pct_savings = 100 * mean(has_savings_product)
  )

print("\nK-Means on mixed data (numeric + encoded categorical):")
#> [1] "\nK-Means on mixed data (numeric + encoded categorical):"
print(mixed_profiles)
#> # A tibble: 4 × 5
#>   cluster_kmix     n mean_spend dominant_category pct_savings
#>          <int> <int>      <dbl> <chr>                   <dbl>
#> 1            1   744    796553. Medium                  69.6 
#> 2            2   199     50468. Low                      0   
#> 3            3   399   3039828. High                    29.1 
#> 4            4   658    171942. Low                      6.38

cat("\nNote: True K-Prototypes would handle categorical features more elegantly.\n")
#> 
#> Note: True K-Prototypes would handle categorical features more elegantly.
cat("In practice, for production systems, use Python's kmodes library for K-Prototypes.\n")
#> In practice, for production systems, use Python's kmodes library for K-Prototypes.

Show code

# K-Medoids (using sklearn-extra)
try:
    from sklearn.metrics import pairwise_distances
    D_scaled = pairwise_distances(X_scaled, metric='euclidean')
    kmed = KMedoids(n_clusters=4, metric='precomputed', random_state=42)
    labels_kmed = kmed.fit_predict(D_scaled)

    mobile_money_kmed = mobile_money.copy()
    mobile_money_kmed['cluster'] = labels_kmed

    kmed_profiles = mobile_money_kmed.groupby('cluster').agg({
        'customer_id': 'count',
        'total_spend_90d': 'mean',
        'transaction_frequency': 'mean',
        'days_since_last_txn': 'mean',
        'has_savings_product': 'mean'
    }).round(0)

    print("K-Medoids cluster profiles:")
    print(kmed_profiles)

    # Silhouette comparison
    from sklearn.metrics import silhouette_score
    sil_kmeans = silhouette_score(X_scaled, labels)
    sil_kmed = silhouette_score(X_scaled, labels_kmed)

    print(f"\nK-Means silhouette: {sil_kmeans:.3f}")
    print(f"K-Medoids silhouette: {sil_kmed:.3f}")

except ImportError:
    print("sklearn-extra not installed; skipping K-Medoids demo")
    print("Install with: pip install scikit-extra")
#> K-Medoids cluster profiles:
#>          customer_id  total_spend_90d  ...  days_since_last_txn  has_savings_product
#> cluster                                ...                                          
#> 0                399        3066961.0  ...                  1.0                  0.0
#> 1                633         157294.0  ...                  5.0                  0.0
#> 2                200          51714.0  ...                 46.0                  0.0
#> 3                768         807608.0  ...                  3.0                  1.0
#> 
#> [4 rows x 5 columns]
#> 
#> K-Means silhouette: 0.696
#> K-Medoids silhouette: 0.698

# K-Prototypes for mixed continuous and categorical data
try:
    from kmodes.kmodes import KModes

    # Create mixed dataset: map product_categories_used to categorical
    mobile_money_mixed = mobile_money.copy()
    mobile_money_mixed['category_group'] = pd.cut(
        mobile_money_mixed['product_categories_used'],
        bins=[0, 2, 4, 8],
        labels=['Low', 'Medium', 'High']
    ).astype('category').cat.codes

    # Prepare data for KModes: all numeric (continuous + encoded categorical)
    X_mixed = mobile_money_mixed[['total_spend_90d', 'transaction_frequency',
                                   'avg_txn_value', 'days_since_last_txn', 'category_group']].values

    # K-Modes (categorical version; for mixed, we'd use a full mixed-data method)
    # For demonstration, treat all as categorical after quantisation
    X_quantised = np.column_stack([
        pd.cut(mobile_money_mixed['total_spend_90d'], bins=10, labels=False),
        pd.cut(mobile_money_mixed['transaction_frequency'], bins=10, labels=False),
        pd.cut(mobile_money_mixed['avg_txn_value'], bins=10, labels=False),
        pd.cut(mobile_money_mixed['days_since_last_txn'], bins=10, labels=False),
        mobile_money_mixed['category_group'].values
    ])

    km_obj = KModes(n_clusters=4, init='Huang', n_init=10, verbose=0)
    labels_kmodes = km_obj.fit_predict(X_quantised)

    mobile_money_mixed['cluster'] = labels_kmodes

    mixed_profiles = mobile_money_mixed.groupby('cluster').agg({
        'customer_id': 'count',
        'total_spend_90d': 'mean',
        'category_group': lambda x: pd.Series(x).mode()[0] if len(pd.Series(x).mode()) > 0 else x.iloc[0]
    }).round(0)

    print("\nK-Modes (mixed data) cluster profiles:")
    print(mixed_profiles)

except ImportError:
    print("\nkmodes not installed; skipping K-Modes demo")
    print("Install with: pip install kmodes")
#> 
#> K-Modes (mixed data) cluster profiles:
#>          customer_id  total_spend_90d  category_group
#> cluster                                              
#> 0                773         818028.0               1
#> 1                229          63161.0               0
#> 2                629         279885.0               0
#> 3                369        3065104.0               2

K-Medoids often produces more robust results than K-Means, especially in the presence of outliers. However, it is slower, so use it when robustness is crucial and data size is manageable. K-Prototypes handles mixed data elegantly and is recommended for production systems with categorical variables.

📝 Section 20.6 Review Questions

Why is K-Medoids more robust to outliers than K-Means?
Mini-Batch K-Means processes mini-batches rather than the full dataset per iteration. What is the trade-off between speed and quality?
Explain why K-Prototypes is preferable to simply encoding categorical variables as numbers and applying K-Means.
When would you choose K-Medoids despite its higher computational cost?

25.7 Case Study: Segmenting East African Mobile Money Customers for Targeted Strategy

We now execute a full clustering pipeline on the mobile money dataset, from data preparation through segment profiling and actionable recommendations.

Business Context: An East African mobile money provider with 2,000 customers wants to segment its base for targeted product strategy. Different segments need different product offerings, pricing, and engagement tactics. The clustering goal is to identify natural customer groups and develop segment-specific recommendations.

Data: Our mobile_money dataset (2,000 customers, 6 features) from Chapter 19.

Process: 1. Prepare and scale features. 2. Determine optimal k using elbow and silhouette methods. 3. Fit K-Means with k=4. 4. Profile each cluster and assign business-relevant labels. 5. Analyse revenue and profitability by segment. 6. Develop segment-specific product and pricing strategies.

Show code

library(tidyverse)
library(cluster)

# Full case study pipeline

# 1. Data preparation (already done, but recap)
# Features: total_spend_90d, transaction_frequency, avg_txn_value, days_since_last_txn
X_scaled <- mobile_money |>
  select(total_spend_90d, transaction_frequency, avg_txn_value, days_since_last_txn) |>
  scale() |>
  as.matrix()

# 2. Optimal k selection (already determined: k=4)
# (Re-run elbow/silhouette if needed for presentation)

# 3. Fit K-Means with k=4
km_final <- kmeans(X_scaled, centers = 4, nstart = 50, iter.max = 100)

# 4. Add cluster and profile
mobile_money_final <- mobile_money |>
  mutate(cluster = km_final$cluster)

# Detailed profile by cluster
detailed_profiles <- mobile_money_final |>
  group_by(cluster) |>
  summarise(
    size_n = n(),
    size_pct = 100 * size_n / nrow(mobile_money),
    spend_mean = mean(total_spend_90d),
    spend_median = median(total_spend_90d),
    spend_sd = sd(total_spend_90d),
    freq_mean = mean(transaction_frequency),
    freq_median = median(transaction_frequency),
    avg_txn_mean = mean(avg_txn_value),
    churn_risk_metric = mean(days_since_last_txn),
    product_diversity = mean(product_categories_used),
    savings_adoption_pct = 100 * mean(has_savings_product),
    total_segment_spend = sum(total_spend_90d),
    avg_customer_value = mean(total_spend_90d),
    .groups = 'drop'
  )

print("Detailed cluster profiles:")
#> [1] "Detailed cluster profiles:"
print(detailed_profiles |> arrange(desc(spend_mean)))
#> # A tibble: 4 × 14
#>   cluster size_n size_pct spend_mean spend_median spend_sd freq_mean freq_median
#>     <int>  <int>    <dbl>      <dbl>        <dbl>    <dbl>     <dbl>       <dbl>
#> 1       3    396    19.8    3060974.     3048434.  947521.    120.           121
#> 2       4    749    37.4     796678.      791611.  295673.     12.0           12
#> 3       1    656    32.8     167388.      124845.  189364.     24.4           25
#> 4       2    199     9.95     50468.       48333.   27161.      3.09           3
#> # ℹ 6 more variables: avg_txn_mean <dbl>, churn_risk_metric <dbl>,
#> #   product_diversity <dbl>, savings_adoption_pct <dbl>,
#> #   total_segment_spend <dbl>, avg_customer_value <dbl>

# 5. Assign meaningful labels based on profiles
# Sort by mean spend to assign labels
sorted_clusters <- detailed_profiles |>
  arrange(desc(spend_mean)) |>
  mutate(label = c("Merchant", "High-Value Saver", "Low-Value Frequent", "Dormant"))

print("\nCluster label mapping:")
#> [1] "\nCluster label mapping:"
print(sorted_clusters |> select(cluster, label, size_n, spend_mean, freq_mean, churn_risk_metric))
#> # A tibble: 4 × 6
#>   cluster label              size_n spend_mean freq_mean churn_risk_metric
#>     <int> <chr>               <int>      <dbl>     <dbl>             <dbl>
#> 1       3 Merchant              396   3060974.    120.                1.02
#> 2       4 High-Value Saver      749    796678.     12.0               3.04
#> 3       1 Low-Value Frequent    656    167388.     24.4               4.93
#> 4       2 Dormant               199     50468.      3.09             44.9

# Map labels back
label_map <- set_names(sorted_clusters$label, sorted_clusters$cluster)
mobile_money_labeled <- mobile_money_final |>
  mutate(segment = recode(cluster, !!!label_map))

# 6. Revenue and profitability analysis
revenue_analysis <- mobile_money_labeled |>
  group_by(segment) |>
  summarise(
    n_customers = n(),
    pct_of_base = 100 * n_customers / nrow(mobile_money),
    total_segment_revenue = sum(total_spend_90d),
    revenue_per_customer = mean(total_spend_90d),
    median_customer_value = median(total_spend_90d),
    .groups = 'drop'
  ) |>
  arrange(desc(total_segment_revenue))

print("\nRevenue analysis by segment:")
#> [1] "\nRevenue analysis by segment:"
print(revenue_analysis)
#> # A tibble: 4 × 6
#>   segment     n_customers pct_of_base total_segment_revenue revenue_per_customer
#>   <chr>             <int>       <dbl>                 <dbl>                <dbl>
#> 1 Merchant            396       19.8            1212145792.             3060974.
#> 2 High-Value…         749       37.4             596711982.              796678.
#> 3 Low-Value …         656       32.8             109806319.              167388.
#> 4 Dormant             199        9.95             10043126.               50468.
#> # ℹ 1 more variable: median_customer_value <dbl>

# 7. Business recommendations
recommendations <- tribble(
  ~Segment, ~Size, ~Key_Characteristics, ~Primary_Opportunity, ~Recommended_Actions,
  "Merchant", "20%", "Very high frequency (120 txn/mo), high spend, uses multiple products", "Settlement speed, lower fees, merchant tools", "Dedicated merchant dashboard, API access, preferential rates for high-volume partners",
  "High-Value Saver", "40%", "Moderate spend (800k NGN/90d), high savings adoption, engaged", "Cross-sell loans/investments, increase savings yields", "Premium tier with higher interest rates, personal wealth advisor, loan pre-approval",
  "Low-Value Frequent", "30%", "Low individual transaction value but high frequency, remittance users", "Increase average transaction value, cross-sell", "Airtime/bill bundles, savings education, microfinance linkages",
  "Dormant", "10%", "High days-since-last-transaction, low engagement, churn risk", "Win-back campaigns, identify pain points", "Targeted re-engagement offers, simplified onboarding for previous issues, referral incentives"
)

print("\nBusiness recommendations by segment:")
#> [1] "\nBusiness recommendations by segment:"
print(recommendations)
#> # A tibble: 4 × 5
#>   Segment      Size  Key_Characteristics Primary_Opportunity Recommended_Actions
#>   <chr>        <chr> <chr>               <chr>               <chr>              
#> 1 Merchant     20%   Very high frequenc… Settlement speed, … Dedicated merchant…
#> 2 High-Value … 40%   Moderate spend (80… Cross-sell loans/i… Premium tier with …
#> 3 Low-Value F… 30%   Low individual tra… Increase average t… Airtime/bill bundl…
#> 4 Dormant      10%   High days-since-la… Win-back campaigns… Targeted re-engage…

# Visualise segment composition
segment_summary <- mobile_money_labeled |>
  group_by(segment) |>
  summarise(n = n(), .groups = 'drop') |>
  mutate(pct = 100 * n / sum(n))

ggplot(segment_summary, aes(x = fct_reorder(segment, -n), y = n, fill = segment)) +
  geom_col(alpha = 0.8) +
  geom_text(aes(label = paste0(n, "\n(", round(pct, 1), "%)")), vjust = -0.5, size = 3) +
  theme_minimal() +
  labs(title = "Mobile Money Customer Segmentation",
       subtitle = "Distribution across four key segments",
       x = "Segment", y = "Number of Customers") +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))

Show code


# Revenue contribution
ggplot(revenue_analysis, aes(x = fct_reorder(segment, -total_segment_revenue),
                              y = total_segment_revenue, fill = segment)) +
  geom_col(alpha = 0.8) +
  geom_text(aes(label = paste0("NGN ", format(total_segment_revenue, big.mark=","))),
            vjust = -0.5, size = 3) +
  theme_minimal() +
  labs(title = "Total Revenue by Segment (90-day period)",
       x = "Segment", y = "Total Revenue (NGN)") +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))

Show code

# 1. Data preparation and scaling (recap)
X_scaled_final = StandardScaler().fit_transform(
    mobile_money[['total_spend_90d', 'transaction_frequency', 'avg_txn_value', 'days_since_last_txn']]
)

# 2. Fit K-Means with k=4
km_final = KMeans(n_clusters=4, init='k-means++', n_init=50, random_state=42)
labels = km_final.fit_predict(X_scaled_final)

# 3. Add cluster labels
mobile_money_final = mobile_money.copy()
mobile_money_final['cluster'] = labels

# 4. Detailed profile by cluster
detailed_profiles = mobile_money_final.groupby('cluster').agg({
    'customer_id': 'count',
    'total_spend_90d': ['mean', 'median', 'std', 'sum'],
    'transaction_frequency': ['mean', 'median'],
    'avg_txn_value': 'mean',
    'days_since_last_txn': 'mean',
    'product_categories_used': 'mean',
    'has_savings_product': 'mean'
}).round(0)

detailed_profiles.columns = ['size_n', 'spend_mean', 'spend_median', 'spend_sd', 'total_spend',
                              'freq_mean', 'freq_median', 'avg_txn', 'churn_days', 'prod_diversity', 'savings_pct']

print("Detailed cluster profiles:")
#> Detailed cluster profiles:
print(detailed_profiles.sort_values('spend_mean', ascending=False))
#>          size_n  spend_mean  ...  prod_diversity  savings_pct
#> cluster                      ...                             
#> 1           399   3066961.0  ...             7.0          0.0
#> 3           758    809624.0  ...             4.0          1.0
#> 0           643    165031.0  ...             2.0          0.0
#> 2           200     51714.0  ...             1.0          0.0
#> 
#> [4 rows x 11 columns]

# 5. Assign meaningful labels
sorted_profiles = detailed_profiles.sort_values('spend_mean', ascending=False)
label_map = {
    sorted_profiles.index[0]: 'Merchant',
    sorted_profiles.index[1]: 'High-Value Saver',
    sorted_profiles.index[2]: 'Low-Value Frequent',
    sorted_profiles.index[3]: 'Dormant'
}

mobile_money_labeled = mobile_money_final.copy()
mobile_money_labeled['segment'] = mobile_money_labeled['cluster'].map(label_map)

# 6. Revenue and profitability analysis
revenue_analysis = mobile_money_labeled.groupby('segment').agg({
    'customer_id': 'count',
    'total_spend_90d': ['sum', 'mean', 'median']
}).round(0)

revenue_analysis.columns = ['n_customers', 'total_revenue', 'revenue_per_customer', 'median_value']
revenue_analysis['pct_of_base'] = 100 * revenue_analysis['n_customers'] / len(mobile_money)
revenue_analysis = revenue_analysis.sort_values('total_revenue', ascending=False)

print("\nRevenue analysis by segment:")
#> 
#> Revenue analysis by segment:
print(revenue_analysis)
#>                     n_customers  total_revenue  ...  median_value  pct_of_base
#> segment                                         ...                           
#> Merchant                    399   1.223718e+09  ...     3097265.0        19.95
#> High-Value Saver            758   6.136950e+08  ...      816379.0        37.90
#> Low-Value Frequent          643   1.061148e+08  ...      126438.0        32.15
#> Dormant                     200   1.034282e+07  ...       50347.0        10.00
#> 
#> [4 rows x 5 columns]

# 7. Business recommendations (text only; see R code for structured table)
recommendations = {
    'Merchant': {
        'Size': '20%',
        'Key_Characteristics': 'Very high frequency (120 txn/mo), high spend, uses multiple products',
        'Primary_Opportunity': 'Settlement speed, lower fees, merchant tools',
        'Actions': 'Dedicated merchant dashboard, API access, preferential rates'
    },
    'High-Value Saver': {
        'Size': '40%',
        'Key_Characteristics': 'Moderate spend (800k NGN/90d), high savings adoption, engaged',
        'Primary_Opportunity': 'Cross-sell loans/investments, increase savings yields',
        'Actions': 'Premium tier, higher interest rates, personal advisor, loan pre-approval'
    },
    'Low-Value Frequent': {
        'Size': '30%',
        'Key_Characteristics': 'Low individual transaction value but high frequency, remittance users',
        'Primary_Opportunity': 'Increase average transaction value, cross-sell',
        'Actions': 'Airtime/bill bundles, savings education, microfinance linkages'
    },
    'Dormant': {
        'Size': '10%',
        'Key_Characteristics': 'High days-since-last-transaction, low engagement, churn risk',
        'Primary_Opportunity': 'Win-back campaigns, identify pain points',
        'Actions': 'Targeted re-engagement offers, simplified onboarding, referral incentives'
    }
}

print("\nBusiness Recommendations by Segment:")
#> 
#> Business Recommendations by Segment:
for segment, details in recommendations.items():
    print(f"\n{segment}:")
    for key, value in details.items():
        print(f"  {key}: {value}")
#> 
#> Merchant:
#>   Size: 20%
#>   Key_Characteristics: Very high frequency (120 txn/mo), high spend, uses multiple products
#>   Primary_Opportunity: Settlement speed, lower fees, merchant tools
#>   Actions: Dedicated merchant dashboard, API access, preferential rates
#> 
#> High-Value Saver:
#>   Size: 40%
#>   Key_Characteristics: Moderate spend (800k NGN/90d), high savings adoption, engaged
#>   Primary_Opportunity: Cross-sell loans/investments, increase savings yields
#>   Actions: Premium tier, higher interest rates, personal advisor, loan pre-approval
#> 
#> Low-Value Frequent:
#>   Size: 30%
#>   Key_Characteristics: Low individual transaction value but high frequency, remittance users
#>   Primary_Opportunity: Increase average transaction value, cross-sell
#>   Actions: Airtime/bill bundles, savings education, microfinance linkages
#> 
#> Dormant:
#>   Size: 10%
#>   Key_Characteristics: High days-since-last-transaction, low engagement, churn risk
#>   Primary_Opportunity: Win-back campaigns, identify pain points
#>   Actions: Targeted re-engagement offers, simplified onboarding, referral incentives

# Visualise segment composition
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Segment size
segment_summary = mobile_money_labeled['segment'].value_counts().sort_values(ascending=False)
axes[0].barh(segment_summary.index, segment_summary.values, color=['#E69F00', '#56B4E9', '#009E73', '#CC79A7'])
axes[0].set_title('Customer Distribution by Segment', fontsize=12)
axes[0].set_xlabel('Number of Customers')
for i, v in enumerate(segment_summary.values):
    axes[0].text(v + 20, i, f'{v} ({100*v/len(mobile_money):.1f}%)', va='center')

# Revenue contribution
revenue_by_segment = mobile_money_labeled.groupby('segment')['total_spend_90d'].sum().sort_values(ascending=False)
axes[1].barh(revenue_by_segment.index, revenue_by_segment.values, color=['#E69F00', '#56B4E9', '#009E73', '#CC79A7'])
axes[1].set_title('Total Revenue by Segment (90-day)', fontsize=12)
axes[1].set_xlabel('Total Spend (NGN)')
for i, v in enumerate(revenue_by_segment.values):
    axes[1].text(v + 50000, i, f'NGN {v:,.0f}', va='center')

plt.tight_layout()
plt.savefig('case_study_segments.png', dpi=150, bbox_inches='tight')
print("\nPlot saved as 'case_study_segments.png'")
#> 
#> Plot saved as 'case_study_segments.png'

This case study demonstrates the full arc of clustering in business: from raw data to actionable segment profiles with specific strategic recommendations. Each segment has distinct characteristics and growth opportunities. The company can now allocate resources strategically—building merchant tools for the Merchant segment, designing a premium product tier for High-Value Savers, creating microfinance partnerships for Low-Value Frequent users, and launching targeted win-back campaigns for the Dormant segment.

📝 Case Study Reflection Questions

The Merchant segment is only 20% of customers but generates the highest total revenue. How might this change the company’s strategy compared to a simple “customer count” maximisation approach?
The Dormant segment shows the highest days-since-last-transaction. What hypotheses would you test to understand why they stopped using the service?
If the company launched a new savings product, which segment would you target first? Why?
How would you validate these segments with the business team to ensure they are actionable?

25.8 Chapter Exercises

Chapter 20 Exercises

Exercise 20.1: K-Means by Hand

You have five customers described by two features: monthly spend (₦000s) and number of visits per month. You will run two iterations of K-Means with K=2 by hand.

Customer	Monthly Spend (₦000s)	Visits/Month
A	2	3
B	8	7
C	3	4
D	7	6
E	5	5

Initial centroids: C₁ = (2, 3) [Customer A], C₂ = (8, 7) [Customer B]

Iteration 1 — Assignment step: Calculate the Euclidean distance from each customer to both centroids. Assign each customer to its nearest centroid. Show all distance calculations.
Iteration 1 — Update step: Recalculate the centroid of each cluster as the mean of its assigned members.
Iteration 2 — Assignment step: Using the new centroids from (b), reassign each customer. Did any assignments change?
Iteration 2 — Update step: Recalculate the centroids. If they are the same as after Iteration 1, the algorithm has converged.
Calculate the WCSS (within-cluster sum of squares) for the final clustering. Show your working.
If you had chosen different initial centroids, would you necessarily get the same final clusters? What algorithm improves the initialisation to reduce this sensitivity?

Exercise 20.2: Choosing K — The Elbow Method

You run K-Means with K = 1 through K = 8 on a retail customer dataset and record the WCSS:

K	WCSS
1	48,500
2	22,100
3	14,800
4	11,200
5	10,100
6	9,600
7	9,200
8	9,050

Plot these values on paper (or describe the shape of the curve). Where does the “elbow” appear?
Calculate the percentage reduction in WCSS for each additional cluster (from K=1 to K=2, K=2 to K=3, etc.).
Based on the elbow method, what value of K would you recommend? Justify your answer using the percentage reductions you calculated.
The silhouette scores for K=3, K=4, and K=5 are 0.41, 0.55, and 0.52 respectively. Does this change your recommendation? Which K would you choose now?
The marketing team says: “We have budget to run 4 distinct campaigns — so K must be 4.” How would you respond to this constraint, and how might business requirements legitimately override the statistical “optimal” K?

Exercise 20.3: K-Means Sensitivity and Limitations

Sensitivity to outliers: You have 100 customer data points tightly clustered in two groups, plus 3 extreme outliers far from both groups. Describe what would happen to the K-Means centroids (with K=2) if the outliers are included. How would you handle this?
Non-spherical clusters: Sketch (on paper or describe) a 2D dataset where K-Means would fail to find the natural clusters. Give a specific business example where this non-spherical pattern might arise in real data.
Different cluster sizes: Dataset A has two clusters: Cluster 1 with 950 points and Cluster 2 with 50 points. Dataset B has two equal clusters of 500 each. In which dataset is K-Means more likely to fail, and why?
Local minima: K-Means is run 10 times with different random initialisations. The WCSS values are: 1,240; 1,235; 1,241; 1,238; 1,250; 1,236; 1,240; 1,244; 1,237; 1,235. Which run should you use? What does the spread of these values tell you about the difficulty of this clustering problem?

Exercise 20.4: Customer Segmentation in Practice

A Nigerian telecommunications company wants to segment its prepaid subscribers for targeted marketing. The available data for 200,000 subscribers includes: average daily data usage (MB), number of voice calls per week, average top-up frequency (days between top-ups), number of months as a subscriber, and average top-up amount (₦).

Before applying K-Means, list five specific data quality checks you would perform. For each, describe what you would look for and what action you would take if you found a problem.
The company’s marketing team says: “We want segments that are meaningful for campaign design — each segment should have a clear profile and respond differently to our offers.” Explain how you would decide on the number of segments K given both statistical evidence and this business requirement.
Suppose K=4 gives well-separated segments. After fitting, you profile each cluster and write a business description. Draft a business description for each of the following hypothetical cluster centres:

Cluster	Avg Daily Data	Calls/Week	Days Between Top-ups	Tenure (months)	Avg Top-up (₦)
1	5 MB	3	30	8	200
2	150 MB	12	5	36	2,000
3	400 MB	2	7	18	5,000
4	20 MB	8	15	60	500

For each cluster, suggest one targeted product offer that the telecoms company could make. Justify your suggestion based on the cluster’s profile.
Six months after deploying this segmentation, the marketing team notices that some subscribers have moved to different segments. Is this expected? How would you update the segmentation model over time?

Exercise 20.5: Coding Exercise — K-Means in R and Python

Generate the following dataset and answer the questions:

set.seed(42)
n_per_group <- 200
# Group 1: Low-value, low-frequency customers
g1 <- data.frame(
  spend = rnorm(n_per_group, 2000, 500),
  frequency = rnorm(n_per_group, 3, 1)
)
# Group 2: High-value, high-frequency customers  
g2 <- data.frame(
  spend = rnorm(n_per_group, 15000, 2000),
  frequency = rnorm(n_per_group, 15, 2)
)
# Group 3: Medium-value, very high-frequency customers
g3 <- data.frame(
  spend = rnorm(n_per_group, 8000, 1500),
  frequency = rnorm(n_per_group, 25, 3)
)
customer_data <- rbind(g1, g2, g3)

Before clustering, standardise both features. Why is this necessary here?
Run K-Means with K=2, K=3, and K=4. For each, record the total WCSS. Plot the elbow curve.
For your chosen K, create a scatter plot coloured by cluster assignment. Do the clusters visually make sense?
Calculate the silhouette score for your chosen K. Is it above 0.5 (generally considered “reasonable”)?
Profile each cluster: calculate the mean of the original (unstandardised) spend and frequency for each cluster. Write a 2-sentence business description for each cluster.

25.9 Appendix: K-Means Convergence and Lloyd’s Algorithm

25.10 A20.1 Proof That K-Means Converges

Theorem: Lloyd’s algorithm (the K-Means algorithm) converges in a finite number of iterations.

Proof Sketch:

Define the objective function as WCSS: \[F(\mathbf{C}, \mathbf{A}) = \sum_{i=1}^{n} \left\| \mathbf{x}_i - \mathbf{c}_{a_i} \right\|^2\]

where \(\mathbf{C}\) is the set of centroids and \(\mathbf{A}\) is the assignment vector.

Each iteration of K-Means performs two steps:

Assignment step: For fixed centroids \(\mathbf{C}\), we assign each observation to the nearest centroid. This minimises \(F\) over all possible assignments \(\mathbf{A}\).
Update step: For fixed assignments \(\mathbf{A}\), we recompute centroids. For each cluster \(j\), the centroid that minimises the sum of squared distances to assigned points is the mean: \[\mathbf{c}_j^* = \frac{1}{|C_j|} \sum_{i \in C_j} \mathbf{x}_i\]

Thus, both steps reduce (or keep constant) \(F\).

Since WCSS is bounded below by 0 and decreases after each iteration, and there are only finitely many possible assignments of \(n\) observations to \(k\) clusters, the algorithm must terminate. \(\square\)

Note that convergence is to a local minimum, not necessarily the global minimum.

25.11 A20.2 Derivation of K-Means++ Expected Cost Bound

The K-Means++ initialisation provides the following guarantee:

Theorem (Arthur & Vassilvitskii, 2007): Let \(\Phi(C^*)\) be the optimal WCSS (global minimum), and let \(\Phi(C)\) be the WCSS after K-Means++ initialisation and one pass of Lloyd’s algorithm. Then: \[\mathbb{E}[\Phi(C)] \leq O(\log k) \cdot \Phi(C^*)\]

Intuition: By seeding centroids far apart (proportional to squared distance), we ensure that the initial configuration has decent coverage of the data space. This avoids the pathological case where multiple centroids fall in a single dense cluster, leaving other clusters entirely unrepresented.

A full proof requires concentration inequalities from probability theory; we defer to the original paper for details.

	n_clusters n_clusters: int, default=8 The number of clusters to form as well as the number of centroids to generate. For an example of how to choose an optimal value for `n_clusters` refer to :ref:`sphx_glr_auto_examples_cluster_plot_kmeans_silhouette_analysis.py`.	4
	init init: {'k-means++', 'random'}, callable or array-like of shape (n_clusters, n_features), default='k-means++' Method for initialization: * 'k-means++' : selects initial cluster centroids using sampling based on an empirical probability distribution of the points' contribution to the overall inertia. This technique speeds up convergence. The algorithm implemented is "greedy k-means++". It differs from the vanilla k-means++ by making several trials at each sampling step and choosing the best centroid among them. * 'random': choose `n_clusters` observations (rows) at random from data for the initial centroids. * If an array is passed, it should be of shape (n_clusters, n_features) and gives the initial centers. * If a callable is passed, it should take arguments X, n_clusters and a random state and return an initialization. For an example of how to use the different `init` strategies, see :ref:`sphx_glr_auto_examples_cluster_plot_kmeans_digits.py`. For an evaluation of the impact of initialization, see the example :ref:`sphx_glr_auto_examples_cluster_plot_kmeans_stability_low_dim_dense.py`.	'random'
	n_init n_init: 'auto' or int, default='auto' Number of times the k-means algorithm is run with different centroid seeds. The final results is the best output of `n_init` consecutive runs in terms of inertia. Several runs are recommended for sparse high-dimensional problems (see :ref:`kmeans_sparse_high_dim`). When `n_init='auto'`, the number of runs depends on the value of init: 10 if using `init='random'` or `init` is a callable; 1 if using `init='k-means++'` or `init` is an array-like. .. versionadded:: 1.2 Added 'auto' option for `n_init`. .. versionchanged:: 1.4 Default value for `n_init` changed to `'auto'`.	1
	max_iter max_iter: int, default=300 Maximum number of iterations of the k-means algorithm for a single run.	100
	tol tol: float, default=1e-4 Relative tolerance with regards to Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence.	0.0001
	verbose verbose: int, default=0 Verbosity mode.	0
	random_state random_state: int, RandomState instance or None, default=None Determines random number generation for centroid initialization. Use an int to make the randomness deterministic. See :term:`Glossary `.	42
	copy_x copy_x: bool, default=True When pre-computing distances it is more numerically accurate to center the data first. If copy_x is True (default), then the original data is not modified. If False, the original data is modified, and put back before the function returns, but small numerical differences may be introduced by subtracting and then adding the data mean. Note that if the original data is not C-contiguous, a copy will be made even if copy_x is False. If the original data is sparse, but not in CSR format, a copy will be made even if copy_x is False.	True
	algorithm algorithm: {"lloyd", "elkan"}, default="lloyd" K-means algorithm to use. The classical EM-style algorithm is `"lloyd"`. The `"elkan"` variation can be more efficient on some datasets with well-defined clusters, by using the triangle inequality. However it's more memory intensive due to the allocation of an extra array of shape `(n_samples, n_clusters)`. .. versionchanged:: 0.18 Added Elkan algorithm .. versionchanged:: 1.1 Renamed "full" to "lloyd", and deprecated "auto" and "full". Changed "auto" to use "lloyd" instead of "elkan".	'lloyd'

	n_clusters n_clusters: int, default=8 The number of clusters to form as well as the number of centroids to generate. For an example of how to choose an optimal value for `n_clusters` refer to :ref:`sphx_glr_auto_examples_cluster_plot_kmeans_silhouette_analysis.py`.	4
	init init: {'k-means++', 'random'}, callable or array-like of shape (n_clusters, n_features), default='k-means++' Method for initialization: * 'k-means++' : selects initial cluster centroids using sampling based on an empirical probability distribution of the points' contribution to the overall inertia. This technique speeds up convergence. The algorithm implemented is "greedy k-means++". It differs from the vanilla k-means++ by making several trials at each sampling step and choosing the best centroid among them. * 'random': choose `n_clusters` observations (rows) at random from data for the initial centroids. * If an array is passed, it should be of shape (n_clusters, n_features) and gives the initial centers. * If a callable is passed, it should take arguments X, n_clusters and a random state and return an initialization. For an example of how to use the different `init` strategies, see :ref:`sphx_glr_auto_examples_cluster_plot_kmeans_digits.py`. For an evaluation of the impact of initialization, see the example :ref:`sphx_glr_auto_examples_cluster_plot_kmeans_stability_low_dim_dense.py`.	'k-means++'
	n_init n_init: 'auto' or int, default='auto' Number of times the k-means algorithm is run with different centroid seeds. The final results is the best output of `n_init` consecutive runs in terms of inertia. Several runs are recommended for sparse high-dimensional problems (see :ref:`kmeans_sparse_high_dim`). When `n_init='auto'`, the number of runs depends on the value of init: 10 if using `init='random'` or `init` is a callable; 1 if using `init='k-means++'` or `init` is an array-like. .. versionadded:: 1.2 Added 'auto' option for `n_init`. .. versionchanged:: 1.4 Default value for `n_init` changed to `'auto'`.	25
	max_iter max_iter: int, default=300 Maximum number of iterations of the k-means algorithm for a single run.	100
	tol tol: float, default=1e-4 Relative tolerance with regards to Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence.	0.0001
	verbose verbose: int, default=0 Verbosity mode.	0
	random_state random_state: int, RandomState instance or None, default=None Determines random number generation for centroid initialization. Use an int to make the randomness deterministic. See :term:`Glossary `.	42
	copy_x copy_x: bool, default=True When pre-computing distances it is more numerically accurate to center the data first. If copy_x is True (default), then the original data is not modified. If False, the original data is modified, and put back before the function returns, but small numerical differences may be introduced by subtracting and then adding the data mean. Note that if the original data is not C-contiguous, a copy will be made even if copy_x is False. If the original data is sparse, but not in CSR format, a copy will be made even if copy_x is False.	True
	algorithm algorithm: {"lloyd", "elkan"}, default="lloyd" K-means algorithm to use. The classical EM-style algorithm is `"lloyd"`. The `"elkan"` variation can be more efficient on some datasets with well-defined clusters, by using the triangle inequality. However it's more memory intensive due to the allocation of an extra array of shape `(n_samples, n_clusters)`. .. versionchanged:: 0.18 Added Elkan algorithm .. versionchanged:: 1.1 Renamed "full" to "lloyd", and deprecated "auto" and "full". Changed "auto" to use "lloyd" instead of "elkan".	'lloyd'

	n_clusters n_clusters: int, default=8 The number of clusters to form as well as the number of centroids to generate. For an example of how to choose an optimal value for `n_clusters` refer to :ref:`sphx_glr_auto_examples_cluster_plot_kmeans_silhouette_analysis.py`.	10
	init init: {'k-means++', 'random'}, callable or array-like of shape (n_clusters, n_features), default='k-means++' Method for initialization: * 'k-means++' : selects initial cluster centroids using sampling based on an empirical probability distribution of the points' contribution to the overall inertia. This technique speeds up convergence. The algorithm implemented is "greedy k-means++". It differs from the vanilla k-means++ by making several trials at each sampling step and choosing the best centroid among them. * 'random': choose `n_clusters` observations (rows) at random from data for the initial centroids. * If an array is passed, it should be of shape (n_clusters, n_features) and gives the initial centers. * If a callable is passed, it should take arguments X, n_clusters and a random state and return an initialization. For an example of how to use the different `init` strategies, see :ref:`sphx_glr_auto_examples_cluster_plot_kmeans_digits.py`. For an evaluation of the impact of initialization, see the example :ref:`sphx_glr_auto_examples_cluster_plot_kmeans_stability_low_dim_dense.py`.	'k-means++'
	n_init n_init: 'auto' or int, default='auto' Number of times the k-means algorithm is run with different centroid seeds. The final results is the best output of `n_init` consecutive runs in terms of inertia. Several runs are recommended for sparse high-dimensional problems (see :ref:`kmeans_sparse_high_dim`). When `n_init='auto'`, the number of runs depends on the value of init: 10 if using `init='random'` or `init` is a callable; 1 if using `init='k-means++'` or `init` is an array-like. .. versionadded:: 1.2 Added 'auto' option for `n_init`. .. versionchanged:: 1.4 Default value for `n_init` changed to `'auto'`.	25
	max_iter max_iter: int, default=300 Maximum number of iterations of the k-means algorithm for a single run.	100
	tol tol: float, default=1e-4 Relative tolerance with regards to Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence.	0.0001
	verbose verbose: int, default=0 Verbosity mode.	0
	random_state random_state: int, RandomState instance or None, default=None Determines random number generation for centroid initialization. Use an int to make the randomness deterministic. See :term:`Glossary `.	42
	copy_x copy_x: bool, default=True When pre-computing distances it is more numerically accurate to center the data first. If copy_x is True (default), then the original data is not modified. If False, the original data is modified, and put back before the function returns, but small numerical differences may be introduced by subtracting and then adding the data mean. Note that if the original data is not C-contiguous, a copy will be made even if copy_x is False. If the original data is sparse, but not in CSR format, a copy will be made even if copy_x is False.	True
	algorithm algorithm: {"lloyd", "elkan"}, default="lloyd" K-means algorithm to use. The classical EM-style algorithm is `"lloyd"`. The `"elkan"` variation can be more efficient on some datasets with well-defined clusters, by using the triangle inequality. However it's more memory intensive due to the allocation of an extra array of shape `(n_samples, n_clusters)`. .. versionchanged:: 0.18 Added Elkan algorithm .. versionchanged:: 1.1 Renamed "full" to "lloyd", and deprecated "auto" and "full". Changed "auto" to use "lloyd" instead of "elkan".	'lloyd'