46  Customer Churn Prediction

Note📋 Learning Objectives
  • Understand the economics of churn: why retention is 5–25× cheaper than acquisition
  • Define churn precisely in different business contexts (contractual vs non-contractual, window choice)
  • Engineer predictive features for churn: recency, trend, engagement decay, tenure
  • Apply Kaplan-Meier survival analysis to estimate survival curves and compare groups
  • Build Cox Proportional Hazards models to quantify hazard ratios and test PH assumptions
  • Train and compare classification models (logistic regression, random forest, XGBoost) for churn prediction
  • Design intervention scoring to maximize expected value of retention campaigns

46.1 The Economics of Churn: Why Retention Matters

In nearly every business, retaining a customer is far cheaper than acquiring one. A typical retail bank spends ₦15,000–₦25,000 to acquire a new customer through digital marketing, branch promotions, and onboarding. Yet a retained customer often provides ₦50,000–₦100,000+ in net present value over their lifetime. Telecom operators see even starker ratios: acquiring a new subscriber in Nigeria might cost ₦3,000–₦5,000 via device subsidies and marketing, but the customer acquisition cost is often justified only if the customer stays for 18+ months. A competitor can lure them away with a free handset after 12 months.

This asymmetry drives a fundamental business insight: retention is 5–25× cheaper than acquisition. If you can prevent one customer from churning, you save the acquisition cost of the customer who would have replaced them, plus you retain the revenue and cross-sell opportunity of the existing customer. In mature markets like mobile telecommunications or banking, churn management often drives more bottom-line impact than new customer acquisition because the market is saturated.

Churn Rate is defined as: \[\text{Churn Rate} = \frac{\text{Number of Customers Lost in Period}}{\text{Number of Customers at Start of Period}} \times 100\%\]

For a subscription business with 100,000 subscribers at the start of the month and 2,000 churning during the month, the monthly churn rate is 2%. An annual churn rate of 24% (2% × 12 months if constant) means the company loses and must replace a quarter of its customer base every year. This is expensive.

Revenue Impact: If a customer generates ₦10,000 per month in revenue and churns after 12 months due to a preventable reason, the lost lifetime value is ₦120,000. If you spend ₦15,000 on a retention campaign and save even one in ten customers, the ROI is (₦120,000 × 0.1 − ₦15,000) / ₦15,000 = 80%. That is why churn prediction and prevention is so valuable.

Contractual vs Non-Contractual Churn: In subscription businesses (monthly phone plans, streaming services, SaaS), churn is contractual: the customer’s subscription renews monthly, and they must explicitly cancel. You see churn as a discrete event. In non-contractual businesses (e-commerce, grocery retail, banking), churn is non-contractual: there is no explicit cancellation. You must infer churn from behaviour (e.g., “no purchase in 90 days = churned”). This distinction affects how you define and measure churn.

46.2 Defining and Measuring Churn: The Definition Problem

Churn seems obvious: a customer who leaves. Yet defining churn precisely is harder than it appears, and the definition has major implications for model validity.

The Churn Window Problem: Consider an e-commerce customer. If they haven’t bought in 30 days, are they churned? 60 days? 180 days? The choice is not statistical; it is business-defined. A luxury fashion retailer whose customer visits once every 12 months would define this customer as retained, not churned. A fast-moving consumer goods (FMCG) retailer would call them churned if they haven’t bought in 6 months. The churn window depends on your product’s natural purchase frequency.

Right-Censoring: In survival analysis, right-censoring occurs when a customer is still active at the end of the observation window. If you observe customers for 12 months, and a customer is still active on month 12, you cannot be sure whether they will churn on month 13 or stay for years. This is “right-censored” data: you know they survived at least 12 months, but not the true time to churn. Survival analysis methods (like Kaplan-Meier) properly handle this.

Building the Churn Label: For a classification approach, you need historical data with a known outcome. Typical steps:

  1. Define observation period (e.g., 12 months: months 1–12).
  2. Define outcome period (e.g., 90 days after observation period: months 13–15).
  3. For each customer, collect features during the observation period (spending, frequency, engagement, etc.).
  4. Label: If the customer is active in the outcome period, label = 0 (retained); if inactive, label = 1 (churned).

This labeling scheme ensures features predict future behaviour, not current state.

The Class Imbalance Problem: In a healthy business, churn is rare. If your churn rate is 2%, then 98% of your training data are retained customers. This class imbalance biases many classifiers to simply predict “no churn” for everyone (which achieves 98% accuracy but is useless). You must use techniques like stratified sampling, SMOTE (synthetic oversampling), or cost-weighted loss to address this.

46.3 Feature Engineering for Churn Prediction

Powerful features for churn prediction are based on behavioural signals that indicate engagement and value.

Recency Signals: How recently did the customer transact? - Days since last transaction - Days since last high-value transaction (>5th percentile) - Number of transactions in last 30, 60, 90 days - Active in last 30 days? (0/1 indicator)

Trend Features: Is engagement increasing or decreasing? - Spending trend: \((X_{month12} − X_{month1}) / X_{month1}\) (percent change in spending) - Usage trend: \((F_{month12} − F_{month1}) / F_{month1}\) (percent change in frequency) - Coefficient from a linear time-series regression: \(\beta\) in \(X_t = \alpha + \beta t + \varepsilon_t\)

A negative trend is a strong churn signal. A customer whose usage has declined 50% in the past six months is at high risk.

Engagement Decay: - Is 90-day usage less than prior 90-day usage? (0/1) - Ratio of recent spending to average: \(\frac{\text{Last 30-day spend}}{\text{Average monthly spend over 12 months}}\) - “Quiet period”: months with zero transactions

Tenure: How long has the customer been with you? - Months since acquisition - Customer age cohort (new: <6 months; established: 6–24 months; veteran: >24 months)

Tenure often shows a U-shaped relationship with churn: new customers churn more (haven’t settled in), established customers churn less (sunk costs, habit), and very old customers may churn more again (life changes). Fitting a polynomial (\(\text{tenure} + \text{tenure}^2\)) captures this.

Product and Channel Signals: - Number of products: Do they hold 1 product (risky) or 5+ (sticky)? - Product diversity: Percentage of product categories they use (breadth reduces churn) - Channel preference: Mostly in-branch (older, less sticky) vs app-based (younger, more sticky)? - Cross-product usage: If they use many products, churn falls

Competitive Signals (if available): - Device type: Android vs iPhone (market signal; iPhone users often more loyal) - Did they query competitor offers? (direct churn signal) - Bundle deals taken: Customers on multi-product bundles churn less

46.4 Survival Analysis: Kaplan-Meier Estimator

For contractual churn data (e.g., subscribers with known signup and churn dates), survival analysis is the gold standard. It properly handles right-censoring and computes the probability that a customer survives (remains active) beyond time \(t\).

The Survival Function \(S(t)\) is the probability that a customer survives beyond time \(t\): \[S(t) = P(\text{Time to churn} > t)\]

At \(t=0\), \(S(0) = 1\) (everyone is “alive” at start). As \(t\) increases, \(S(t)\) decreases toward 0 (as customers churn, fewer survive). The shape of \(S(t)\) tells the story: a steep early decline suggests high new-customer churn; a shallow decline suggests a stable, loyal base.

The Kaplan-Meier Estimator is a non-parametric estimate of \(S(t)\). It uses the product-limit formula:

\[\hat{S}(t) = \prod_{t_i \leq t} \left( 1 - \frac{d_i}{n_i} \right)\]

where: - \(t_i\) are the observed churn times - \(d_i\) is the number of customers who churned at time \(t_i\) - \(n_i\) is the number of customers at risk (still active) just before time \(t_i\)

The formula says: at each churn event time, compute the proportion of at-risk customers who survive that time, then multiply all these proportions.

The Log-Rank Test compares survival curves between two groups (e.g., prepaid vs postpaid customers). The test statistic is:

\[Z = \frac{E_1 - O_1}{\sqrt{V}}\]

where: - \(O_1\) is the observed number of churn events in group 1 - \(E_1\) is the expected number under the null hypothesis (no group difference) - \(V\) is the variance

Under the null, \(Z\) follows a standard normal distribution. If \(|Z| > 1.96\), the groups have significantly different survival curves (\(p < 0.05\)).

46.5 The Cox Proportional Hazards Model

While Kaplan-Meier is useful for visualising and comparing survival curves, it does not let you adjust for multiple predictors. The Cox Proportional Hazards (Cox PH) model is a semi-parametric regression method that models the hazard function (instantaneous churn rate) as a function of covariates.

The Hazard Function \(h(t)\) is the instantaneous rate of churning at time \(t\), given the customer has survived to time \(t\). The Cox PH model specifies:

\[h(t; \mathbf{x}) = h_0(t) \times \exp(\beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p)\]

where: - \(h_0(t)\) is the baseline hazard (the hazard for a customer with all covariates = 0) - \(\mathbf{x} = (x_1, \ldots, x_p)\) are the covariates - \(\boldsymbol{\beta}\) are the regression coefficients

Interpretation of Coefficients: If \(\beta_1 = 0.5\), then a one-unit increase in \(x_1\) multiplies the hazard by \(e^{0.5} \approx 1.65\). This is the Hazard Ratio (HR): \[\text{HR} = e^{\beta}\]

An HR > 1 means the covariate increases churn risk; HR < 1 means it decreases churn risk.

Example: If “days since last transaction” has \(\beta = 0.02\), the HR per day is \(e^{0.02} \approx 1.02\). For a 30-day increase, the HR is \(e^{0.02 \times 30} \approx 1.82\). A customer with 30 more days of inactivity has 82% higher hazard of churning, adjusting for other variables.

The Proportional Hazards Assumption: The key assumption is that the HR is constant over time. That is, the effect of a covariate on churn risk doesn’t change over the customer lifecycle. This is often violated (e.g., new customers’ churn risk is more sensitive to engagement signals). Test this via Schoenfeld residuals: plot standardized residuals against time and look for trends. A smooth, zero-centered cloud suggests PH holds; a trend suggests violation.

Fitting the Cox Model: Estimate \(\boldsymbol{\beta}\) by maximizing the partial likelihood (not the full likelihood, since we don’t model \(h_0(t)\) parametrically). Most software (survival in R, lifelines in Python) does this automatically.

46.6 Classification Models for Churn: Logistic Regression to XGBoost

While Cox models are powerful for continuous-time survival data, many churn problems are framed as classification: given features at time \(t\), will the customer churn by time \(t + \Delta t\)? This is exactly a binary classification problem, and any classifier (logistic regression, random forest, XGBoost) can be applied.

Logistic Regression is a baseline. Fit: \[\log\left( \frac{P(\text{churn})}{1 - P(\text{churn})} \right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p\]

Interpret coefficients as log-odds ratios. It is interpretable and often performs well.

Random Forest and Gradient Boosting (XGBoost, LightGBM) are non-linear and capture feature interactions. They typically outperform logistic regression on churn data. The trade-off is interpretability: you need SHAP or permutation importance to understand which features drive predictions.

Threshold Selection: Do not use the default 0.5 threshold. Churn is rare (2% in a healthy business), so a 0.5 threshold will flag too few customers. Instead, choose the threshold to maximize expected business value: \[\text{Expected Value} = P(\text{churn}) \times \text{CLV} \times P(\text{retained | intervention}) - \text{Cost of Intervention}\]

If saving a churning customer is worth ₦100,000 (CLV) and the intervention costs ₦5,000, you should intervene if \(P(\text{churn}) > 5\% / 100 = 0.05\), even though the base churn rate is 2%. Use cost-weighted metrics (precision-recall curves, expected calibration curves) to find the optimal threshold.

46.7 Prescriptive Analytics: Intervention Scoring

Once you have a churn probability model, the next step is intervention scoring: ranking customers by how valuable it would be to prevent their churn. This is not just \(P(\text{churn})\); it is:

\[\text{Intervention Score} = P(\text{churn}) \times \text{CLV} \times P(\text{retained | intervention})\]

Components: - \(P(\text{churn})\): Model prediction - \(\text{CLV}\): Customer Lifetime Value (months expected to remain active × monthly spend), a business metric you must compute separately - \(P(\text{retained | intervention})\): Effectiveness of your intervention. If you call a customer and offer a special rate, what fraction return? This is often estimated from historical A/B tests (30–60% are typical). If you don’t have this, use 0.3 as a conservative estimate.

The score tells you: “This customer is predicted to churn (40%), is worth ₦500,000 over their lifetime, and we can save them 50% of the time with our intervention. The expected value of intervention is 0.40 × ₦500,000 × 0.50 = ₦100,000.”

Rank all customers by intervention score and work down the list until your budget runs out. This ensures you allocate retention spend to the highest-value customers most likely to respond.

Caution📝 Section 41.7 Review Questions
  1. Explain the difference between contractual and non-contractual churn. How does this affect model definition?
  2. What is the Kaplan-Meier estimator, and what does the product-limit formula represent?
  3. If a Cox model shows HR = 1.15 for “days since last transaction”, what does this mean in plain language?
  4. Why should you not use a 0.5 probability threshold for churn prediction in a business context?
  5. Design an intervention scoring system for a bank. What are the key inputs, and how would you estimate them?

46.8 Case Study: Subscriber Churn for a Nigerian Mobile Network Operator

Background: A major Nigerian mobile network operator (here called “NairaTel,” a fictional composite) operates a prepaid subscriber base of 100,000 active customers. The company wants to predict which subscribers will churn in the next 90 days and target them with a retention offer (e.g., 50% discount on data bundles for one month). The churn rate is ~2% per month (24% annualized), which is typical for Nigerian telecom. The company collects 12 months of behavioural data per subscriber.

Data: Columns include: - subscriber_id: Unique ID - acquisition_date: When the subscriber joined - last_recharge_date: Last purchase of credit/data bundle - arpu_12m: Average Revenue Per User over past 12 months (NGN) - data_usage_gb: Total data consumed in 12 months - voice_minutes: Total voice minutes in 12 months - recharge_count_12m: Number of recharge transactions - days_inactive: Days since last activity - churn_90d: Binary label (1 = churned in next 90 days, 0 = retained)

46.8.1 Exploratory Data Analysis and Feature Engineering

Show code
library(tidyverse)
library(survival)
library(ggplot2)

set.seed(8152)

# Synthetic Nigerian telecom subscriber data
n_subscribers <- 10000
subscribers <- tibble(
  subscriber_id = 1:n_subscribers,
  acquisition_date = today() - days(sample(30:1095, n_subscribers, replace = TRUE)),
  arpu_12m = rgamma(n_subscribers, shape = 2, scale = 3000), # Mean ₦6k/month
  data_usage_gb = rgamma(n_subscribers, shape = 2, scale = 5),
  voice_minutes = rgamma(n_subscribers, shape = 2, scale = 100),
  recharge_count_12m = rpois(n_subscribers, lambda = 8)
)

# Generate churn labels (higher ARPU, more recent activity = lower churn)
subscribers <- subscribers |>
  mutate(
    days_since_acquisition = as.numeric(today() - acquisition_date),
    tenure_score = pmin(days_since_acquisition / 365, 1), # 0–1, capped at 1 year
    activity_decay = rnorm(n_subscribers, mean = recharge_count_12m / 12, sd = 2),
    activity_decay = pmax(activity_decay, 0),
    churn_prob = 0.15 - 0.05 * tenure_score - 0.02 * log1p(arpu_12m / 1000) - 0.01 * activity_decay,
    churn_prob = pmin(pmax(churn_prob, 0.01), 0.25), # Bound between 1% and 25%
    churn_90d = as.integer(runif(n_subscribers) < churn_prob),
    days_inactive = sample(1:90, n_subscribers, replace = TRUE) +
                    (1 - churn_90d) * sample(90:180, n_subscribers, replace = TRUE)
  ) |>
  select(subscriber_id, acquisition_date, arpu_12m, data_usage_gb,
         voice_minutes, recharge_count_12m, tenure_score, churn_90d, days_inactive)

# Summary statistics
cat("Churn Rate:", mean(subscribers$churn_90d) * 100, "%\n")
#> Churn Rate: 5.98 %
cat("Mean ARPU (Retained):",
    mean(subscribers$arpu_12m[subscribers$churn_90d == 0]), "\n")
#> Mean ARPU (Retained): 6020.864
cat("Mean ARPU (Churned):",
    mean(subscribers$arpu_12m[subscribers$churn_90d == 1]), "\n")
#> Mean ARPU (Churned): 5192.804

# Visualize churn distribution by key features
p1 <- ggplot(subscribers, aes(x = factor(churn_90d), y = arpu_12m)) +
  geom_boxplot(fill = "steelblue") +
  labs(title = "ARPU by Churn Status", x = "Churned (0=No, 1=Yes)", y = "ARPU (NGN)") +
  theme_minimal()

p2 <- ggplot(subscribers, aes(x = factor(churn_90d), y = recharge_count_12m)) +
  geom_boxplot(fill = "steelblue") +
  labs(title = "Recharge Frequency by Churn Status", x = "Churned (0=No, 1=Yes)", y = "Recharges (12m)") +
  theme_minimal()

p3 <- ggplot(subscribers, aes(x = churn_90d)) +
  geom_bar(aes(fill = factor(churn_90d)), show.legend = FALSE) +
  labs(title = "Churn Distribution", x = "Churned (0=No, 1=Yes)", y = "Count") +
  scale_fill_manual(values = c("0" = "darkgreen", "1" = "darkred")) +
  theme_minimal()

gridExtra::grid.arrange(p1, p2, p3, ncol = 3)

Churn Distribution and Key Feature Distributions
Show code
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import matplotlib.pyplot as plt

np.random.seed(8152)

# Synthetic data
n_subscribers = 10000
acquisition_dates = [datetime.now() - timedelta(days=int(d))
                     for d in np.random.randint(30, 1095, n_subscribers)]
arpu_12m = np.random.gamma(shape=2, scale=3000, size=n_subscribers)
data_usage_gb = np.random.gamma(shape=2, scale=5, size=n_subscribers)
voice_minutes = np.random.gamma(shape=2, scale=100, size=n_subscribers)
recharge_count = np.random.poisson(lam=8, size=n_subscribers)

subscribers_py = pd.DataFrame({
    "subscriber_id": np.arange(1, n_subscribers + 1),
    "acquisition_date": acquisition_dates,
    "arpu_12m": arpu_12m,
    "data_usage_gb": data_usage_gb,
    "voice_minutes": voice_minutes,
    "recharge_count_12m": recharge_count
})

# Compute churn
subscribers_py["days_since_acquisition"] = (
    (pd.Timestamp.now() - pd.to_datetime(subscribers_py["acquisition_date"])).dt.days
)
subscribers_py["tenure_score"] = np.minimum(subscribers_py["days_since_acquisition"] / 365, 1)
subscribers_py["activity_decay"] = np.maximum(
    np.random.normal(subscribers_py["recharge_count_12m"] / 12, 2),
    0
)
subscribers_py["churn_prob"] = (0.15 - 0.05 * subscribers_py["tenure_score"]
                                 - 0.02 * np.log1p(subscribers_py["arpu_12m"] / 1000)
                                 - 0.01 * subscribers_py["activity_decay"])
subscribers_py["churn_prob"] = np.clip(subscribers_py["churn_prob"], 0.01, 0.25)
subscribers_py["churn_90d"] = (np.random.random(n_subscribers) < subscribers_py["churn_prob"]).astype(int)

print(f"Churn Rate: {subscribers_py['churn_90d'].mean() * 100:.2f}%")
#> Churn Rate: 5.83%
print(f"Mean ARPU (Retained): {subscribers_py[subscribers_py['churn_90d']==0]['arpu_12m'].mean():.0f}")
#> Mean ARPU (Retained): 6154
print(f"Mean ARPU (Churned): {subscribers_py[subscribers_py['churn_90d']==1]['arpu_12m'].mean():.0f}")
#> Mean ARPU (Churned): 5472

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
subscribers_py.boxplot(column="arpu_12m", by="churn_90d", ax=axes[0])
axes[0].set_title("ARPU by Churn Status")
subscribers_py.boxplot(column="recharge_count_12m", by="churn_90d", ax=axes[1])
axes[1].set_title("Recharge Frequency by Churn Status")
subscribers_py["churn_90d"].value_counts().plot(kind="bar", ax=axes[2], color=["darkgreen", "darkred"])
axes[2].set_title("Churn Distribution")
plt.tight_layout()
plt.savefig("ch41_eda.png", dpi=150, bbox_inches="tight")
plt.show()

46.8.2 Kaplan-Meier Survival Analysis

Show code
# Convert to survival format: convert churn binary to time-to-event
# Assume subscribers followed for ~90 days; if churn_90d=1, event at day 45; else censored at day 90
subscribers <- subscribers |>
  mutate(
    time_to_churn = ifelse(churn_90d == 1,
                           sample(30:90, n(), replace = TRUE),
                           90),
    event = churn_90d,
    arpu_segment = cut(arpu_12m,
                       breaks = quantile(arpu_12m, c(0, 0.33, 0.67, 1)),
                       labels = c("Low ARPU", "Medium ARPU", "High ARPU"),
                       include.lowest = TRUE)
  )

# Fit Kaplan-Meier
km_fit <- survfit(Surv(time_to_churn, event) ~ arpu_segment, data = subscribers)

# Plot survival curves
ggsurvplot <- function(fit, data, title) {
  # Manual implementation for clarity
  summary_df <- summary(fit)
  plot_data <- data.frame(
    time = summary_df$time,
    surv = summary_df$surv,
    strata = rep(names(fit$strata), times = table(fit$strata))
  )

  ggplot(plot_data, aes(x = time, y = surv, color = strata, group = strata)) +
    geom_step(linewidth = 1) +
    labs(title = title,
         x = "Days", y = "Survival Probability",
         color = "Segment") +
    ylim(0, 1) +
    theme_minimal() +
    theme(legend.position = "right")
}

# Alternative: use simplified plotting
plot_km_summary <- summary(km_fit)
cat("Kaplan-Meier Summary:\n")
#> Kaplan-Meier Summary:
print(plot_km_summary)
#> Call: survfit(formula = Surv(time_to_churn, event) ~ arpu_segment, 
#>     data = subscribers)
#> 
#>                 arpu_segment=Low ARPU 
#>  time n.risk n.event survival  std.err lower 95% CI upper 95% CI
#>    30   3300       5    0.998 0.000677        0.997        1.000
#>    31   3295       5    0.997 0.000957        0.995        0.999
#>    32   3290       6    0.995 0.001209        0.993        0.998
#>    33   3284       6    0.993 0.001417        0.991        0.996
#>    34   3278       3    0.992 0.001509        0.989        0.995
#>    36   3275       3    0.992 0.001597        0.988        0.995
#>    37   3272       3    0.991 0.001679        0.987        0.994
#>    38   3269       5    0.989 0.001808        0.986        0.993
#>    39   3264       4    0.988 0.001905        0.984        0.992
#>    40   3260       4    0.987 0.001997        0.983        0.991
#>    41   3256       5    0.985 0.002105        0.981        0.989
#>    42   3251       2    0.985 0.002147        0.980        0.989
#>    43   3249       2    0.984 0.002188        0.980        0.988
#>    44   3247       4    0.983 0.002268        0.978        0.987
#>    45   3243       4    0.982 0.002345        0.977        0.986
#>    46   3239       4    0.980 0.002419        0.976        0.985
#>    47   3235       2    0.980 0.002455        0.975        0.985
#>    48   3233       5    0.978 0.002543        0.973        0.983
#>    49   3228       2    0.978 0.002577        0.973        0.983
#>    50   3226       6    0.976 0.002677        0.971        0.981
#>    51   3220       8    0.973 0.002805        0.968        0.979
#>    52   3212       4    0.972 0.002866        0.967        0.978
#>    53   3208       1    0.972 0.002881        0.966        0.977
#>    54   3207       2    0.971 0.002911        0.966        0.977
#>    55   3205       2    0.971 0.002940        0.965        0.976
#>    56   3203       1    0.970 0.002955        0.965        0.976
#>    57   3202       3    0.969 0.002998        0.964        0.975
#>    58   3199       5    0.968 0.003069        0.962        0.974
#>    59   3194       8    0.965 0.003179        0.959        0.972
#>    60   3186       2    0.965 0.003206        0.959        0.971
#>    61   3184       5    0.963 0.003272        0.957        0.970
#>    62   3179       3    0.962 0.003310        0.956        0.969
#>    63   3176       3    0.962 0.003349        0.955        0.968
#>    64   3173       5    0.960 0.003411        0.953        0.967
#>    65   3168       5    0.958 0.003472        0.952        0.965
#>    66   3163       6    0.957 0.003544        0.950        0.964
#>    67   3157       7    0.955 0.003626        0.947        0.962
#>    68   3150       2    0.954 0.003649        0.947        0.961
#>    69   3148       3    0.953 0.003683        0.946        0.960
#>    70   3145       2    0.952 0.003706        0.945        0.960
#>    71   3143       7    0.950 0.003783        0.943        0.958
#>    72   3136       5    0.949 0.003837        0.941        0.956
#>    73   3131       6    0.947 0.003901        0.939        0.955
#>    74   3125       3    0.946 0.003932        0.938        0.954
#>    75   3122       1    0.946 0.003943        0.938        0.954
#>    76   3121       4    0.945 0.003984        0.937        0.952
#>    77   3117       6    0.943 0.004045        0.935        0.951
#>    78   3111       4    0.942 0.004085        0.934        0.950
#>    79   3107       2    0.941 0.004105        0.933        0.949
#>    80   3105       3    0.940 0.004134        0.932        0.948
#>    81   3102       3    0.939 0.004163        0.931        0.947
#>    82   3099       6    0.937 0.004221        0.929        0.946
#>    83   3093       4    0.936 0.004259        0.928        0.944
#>    84   3089       5    0.935 0.004305        0.926        0.943
#>    85   3084       4    0.933 0.004342        0.925        0.942
#>    86   3080       5    0.932 0.004388        0.923        0.940
#>    87   3075       3    0.931 0.004415        0.922        0.940
#>    88   3072       2    0.930 0.004433        0.922        0.939
#>    89   3070       2    0.930 0.004450        0.921        0.938
#>    90   3068       3    0.929 0.004477        0.920        0.938
#> 
#>                 arpu_segment=Medium ARPU 
#>  time n.risk n.event survival  std.err lower 95% CI upper 95% CI
#>    30   3400       1    1.000 0.000294        0.999        1.000
#>    31   3399       2    0.999 0.000509        0.998        1.000
#>    32   3397       4    0.998 0.000777        0.996        0.999
#>    33   3393       6    0.996 0.001058        0.994        0.998
#>    34   3387       2    0.996 0.001137        0.993        0.998
#>    35   3385       5    0.994 0.001311        0.992        0.997
#>    36   3380       4    0.993 0.001436        0.990        0.996
#>    37   3376       2    0.992 0.001494        0.989        0.995
#>    38   3374       5    0.991 0.001630        0.988        0.994
#>    39   3369       4    0.990 0.001731        0.986        0.993
#>    40   3365       2    0.989 0.001779        0.986        0.993
#>    41   3363       5    0.988 0.001894        0.984        0.991
#>    42   3358       7    0.986 0.002044        0.982        0.990
#>    43   3351       5    0.984 0.002144        0.980        0.988
#>    44   3346       2    0.984 0.002183        0.979        0.988
#>    45   3344       5    0.982 0.002276        0.978        0.987
#>    46   3339       2    0.981 0.002313        0.977        0.986
#>    48   3337       2    0.981 0.002348        0.976        0.985
#>    49   3335       2    0.980 0.002384        0.976        0.985
#>    51   3333       7    0.978 0.002502        0.973        0.983
#>    52   3326       3    0.977 0.002551        0.972        0.982
#>    53   3323       2    0.977 0.002584        0.972        0.982
#>    54   3321       5    0.975 0.002662        0.970        0.981
#>    55   3316       3    0.974 0.002708        0.969        0.980
#>    56   3313       5    0.973 0.002783        0.968        0.978
#>    57   3308       2    0.972 0.002812        0.967        0.978
#>    58   3306       4    0.971 0.002869        0.966        0.977
#>    59   3302       2    0.971 0.002898        0.965        0.976
#>    60   3300       4    0.969 0.002953        0.964        0.975
#>    61   3296       3    0.969 0.002994        0.963        0.974
#>    62   3293       1    0.968 0.003008        0.962        0.974
#>    63   3292       5    0.967 0.003074        0.961        0.973
#>    64   3287       2    0.966 0.003100        0.960        0.972
#>    65   3285       6    0.964 0.003177        0.958        0.971
#>    66   3279       7    0.962 0.003264        0.956        0.969
#>    67   3272       3    0.961 0.003301        0.955        0.968
#>    68   3269       6    0.960 0.003372        0.953        0.966
#>    69   3263       4    0.959 0.003419        0.952        0.965
#>    70   3259       3    0.958 0.003454        0.951        0.964
#>    71   3256       3    0.957 0.003488        0.950        0.964
#>    72   3253       2    0.956 0.003511        0.949        0.963
#>    73   3251       5    0.955 0.003566        0.948        0.962
#>    74   3246       1    0.954 0.003577        0.947        0.961
#>    75   3245       1    0.954 0.003588        0.947        0.961
#>    76   3244       4    0.953 0.003632        0.946        0.960
#>    77   3240       7    0.951 0.003706        0.944        0.958
#>    78   3233       4    0.950 0.003748        0.942        0.957
#>    79   3229       2    0.949 0.003769        0.942        0.957
#>    80   3227       1    0.949 0.003779        0.941        0.956
#>    82   3226       5    0.947 0.003830        0.940        0.955
#>    83   3221       6    0.946 0.003890        0.938        0.953
#>    84   3215       2    0.945 0.003910        0.937        0.953
#>    85   3213       1    0.945 0.003920        0.937        0.952
#>    86   3212       4    0.944 0.003959        0.936        0.951
#>    87   3208       5    0.942 0.004007        0.934        0.950
#>    88   3203       1    0.942 0.004016        0.934        0.950
#>    89   3202       2    0.941 0.004035        0.933        0.949
#>    90   3200       5    0.940 0.004082        0.932        0.948
#> 
#>                 arpu_segment=High ARPU 
#>  time n.risk n.event survival  std.err lower 95% CI upper 95% CI
#>    30   3300       1    1.000 0.000303        0.999        1.000
#>    31   3299       3    0.999 0.000606        0.998        1.000
#>    32   3296       3    0.998 0.000801        0.996        0.999
#>    34   3293       5    0.996 0.001048        0.994        0.998
#>    36   3288       4    0.995 0.001209        0.993        0.998
#>    37   3284       2    0.995 0.001282        0.992        0.997
#>    38   3282       3    0.994 0.001384        0.991        0.996
#>    39   3279       2    0.993 0.001448        0.990        0.996
#>    40   3277       3    0.992 0.001539        0.989        0.995
#>    41   3274       2    0.992 0.001597        0.988        0.995
#>    42   3272       4    0.990 0.001706        0.987        0.994
#>    43   3268       1    0.990 0.001732        0.987        0.993
#>    44   3267       2    0.989 0.001783        0.986        0.993
#>    45   3265       3    0.988 0.001857        0.985        0.992
#>    46   3262       2    0.988 0.001905        0.984        0.992
#>    47   3260       5    0.986 0.002019        0.982        0.990
#>    48   3255       1    0.986 0.002041        0.982        0.990
#>    49   3254       1    0.986 0.002063        0.982        0.990
#>    50   3253       2    0.985 0.002105        0.981        0.989
#>    51   3251       1    0.985 0.002126        0.981        0.989
#>    52   3250       3    0.984 0.002188        0.980        0.988
#>    54   3247       3    0.983 0.002248        0.979        0.987
#>    55   3244       2    0.982 0.002287        0.978        0.987
#>    56   3242       3    0.982 0.002345        0.977        0.986
#>    57   3239       3    0.981 0.002401        0.976        0.985
#>    58   3236       3    0.980 0.002455        0.975        0.985
#>    59   3233       1    0.979 0.002473        0.975        0.984
#>    60   3232       4    0.978 0.002543        0.973        0.983
#>    61   3228       3    0.977 0.002594        0.972        0.982
#>    62   3225       5    0.976 0.002677        0.971        0.981
#>    63   3220       2    0.975 0.002710        0.970        0.980
#>    64   3218       1    0.975 0.002726        0.970        0.980
#>    65   3217       4    0.974 0.002789        0.968        0.979
#>    66   3213       4    0.972 0.002851        0.967        0.978
#>    67   3209       5    0.971 0.002926        0.965        0.977
#>    68   3204       3    0.970 0.002970        0.964        0.976
#>    69   3201       4    0.969 0.003027        0.963        0.975
#>    70   3197       2    0.968 0.003055        0.962        0.974
#>    71   3195       3    0.967 0.003097        0.961        0.973
#>    72   3192       2    0.967 0.003125        0.961        0.973
#>    73   3190       2    0.966 0.003152        0.960        0.972
#>    74   3188       4    0.965 0.003206        0.959        0.971
#>    75   3184       3    0.964 0.003246        0.958        0.970
#>    76   3181       1    0.964 0.003259        0.957        0.970
#>    77   3180       4    0.962 0.003310        0.956        0.969
#>    79   3176       4    0.961 0.003361        0.955        0.968
#>    80   3172       3    0.960 0.003399        0.954        0.967
#>    81   3169       4    0.959 0.003448        0.952        0.966
#>    82   3165       2    0.958 0.003472        0.952        0.965
#>    83   3163       3    0.958 0.003509        0.951        0.964
#>    85   3160       1    0.957 0.003521        0.950        0.964
#>    86   3159       4    0.956 0.003568        0.949        0.963
#>    87   3155       1    0.956 0.003580        0.949        0.963
#>    88   3154       3    0.955 0.003614        0.948        0.962
#>    89   3151       4    0.954 0.003660        0.946        0.961
#>    90   3147       5    0.952 0.003717        0.945        0.959

# Log-rank test
logrank_test <- survdiff(Surv(time_to_churn, event) ~ arpu_segment, data = subscribers)
cat("\nLog-Rank Test (comparing ARPU segments):\n")
#> 
#> Log-Rank Test (comparing ARPU segments):
cat("Chi-squared statistic:", logrank_test$chisq, "\n")
#> Chi-squared statistic: 16.09961
cat("P-value:", 1 - pchisq(logrank_test$chisq, df = 2), "\n")
#> P-value: 0.0003191646
Show code
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test
import pandas as pd
import matplotlib.pyplot as plt

# Prepare survival data
subscribers_py["time_to_churn"] = np.where(
    subscribers_py["churn_90d"] == 1,
    np.random.randint(30, 91, n_subscribers),
    90
)
subscribers_py["arpu_segment"] = pd.cut(
    subscribers_py["arpu_12m"],
    bins=3,
    labels=["Low ARPU", "Medium ARPU", "High ARPU"]
)

# Fit KM for each segment
kmf = KaplanMeierFitter()
fig, ax = plt.subplots(figsize=(10, 6))

for segment in ["Low ARPU", "Medium ARPU", "High ARPU"]:
    mask = subscribers_py["arpu_segment"] == segment
    kmf.fit(
        durations=subscribers_py[mask]["time_to_churn"],
        event_observed=subscribers_py[mask]["churn_90d"],
        label=segment
    )
    kmf.plot_survival_function(ax=ax, linewidth=2)

ax.set_xlabel("Days")
ax.set_ylabel("Survival Probability")
ax.set_title("Kaplan-Meier Survival Curves by ARPU Segment")
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("ch41_km_curves.png", dpi=150, bbox_inches="tight")
plt.show()

Show code

# Log-rank test
low_mask = subscribers_py["arpu_segment"] == "Low ARPU"
high_mask = subscribers_py["arpu_segment"] == "High ARPU"
results = logrank_test(
    durations_A=subscribers_py[low_mask]["time_to_churn"],
    durations_B=subscribers_py[high_mask]["time_to_churn"],
    event_observed_A=subscribers_py[low_mask]["churn_90d"],
    event_observed_B=subscribers_py[high_mask]["churn_90d"]
)
print("Log-Rank Test (Low vs High ARPU):")
#> Log-Rank Test (Low vs High ARPU):
print(f"Test statistic: {results.test_statistic:.4f}")
#> Test statistic: 1.5885
print(f"P-value: {results.p_value:.4f}")
#> P-value: 0.2075

46.8.3 Cox Proportional Hazards Model

Show code
library(survival)

# Fit Cox PH model
cox_model <- coxph(
  Surv(time_to_churn, event) ~
    log1p(arpu_12m) +
    log1p(data_usage_gb) +
    log1p(voice_minutes) +
    log1p(recharge_count_12m) +
    tenure_score,
  data = subscribers
)

# Summary
summary_cox <- summary(cox_model)
print(summary_cox)
#> Call:
#> coxph(formula = Surv(time_to_churn, event) ~ log1p(arpu_12m) + 
#>     log1p(data_usage_gb) + log1p(voice_minutes) + log1p(recharge_count_12m) + 
#>     tenure_score, data = subscribers)
#> 
#>   n= 10000, number of events= 598 
#> 
#>                                coef exp(coef)  se(coef)      z Pr(>|z|)    
#> log1p(arpu_12m)           -0.223996  0.799318  0.047171 -4.749 2.05e-06 ***
#> log1p(data_usage_gb)       0.136986  1.146813  0.062533  2.191   0.0285 *  
#> log1p(voice_minutes)       0.007346  1.007373  0.051076  0.144   0.8856    
#> log1p(recharge_count_12m) -0.047931  0.953200  0.119385 -0.401   0.6881    
#> tenure_score              -0.589004  0.554879  0.138558 -4.251 2.13e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#>                           exp(coef) exp(-coef) lower .95 upper .95
#> log1p(arpu_12m)              0.7993     1.2511    0.7287    0.8767
#> log1p(data_usage_gb)         1.1468     0.8720    1.0145    1.2963
#> log1p(voice_minutes)         1.0074     0.9927    0.9114    1.1134
#> log1p(recharge_count_12m)    0.9532     1.0491    0.7543    1.2045
#> tenure_score                 0.5549     1.8022    0.4229    0.7280
#> 
#> Concordance= 0.58  (se = 0.012 )
#> Likelihood ratio test= 43.51  on 5 df,   p=3e-08
#> Wald test            = 46.04  on 5 df,   p=9e-09
#> Score (logrank) test = 46.17  on 5 df,   p=8e-09

# Extract hazard ratios
hr_table <- data.frame(
  variable = rownames(summary_cox$coefficients),
  coefficient = summary_cox$coefficients[, 1],
  hazard_ratio = summary_cox$coefficients[, 2],
  p_value = summary_cox$coefficients[, 5]
)

cat("\nHazard Ratios (Exponentiated Coefficients):\n")
#> 
#> Hazard Ratios (Exponentiated Coefficients):
print(hr_table)
#>                                            variable  coefficient hazard_ratio
#> log1p(arpu_12m)                     log1p(arpu_12m) -0.223995842    0.7993185
#> log1p(data_usage_gb)           log1p(data_usage_gb)  0.136986375    1.1468125
#> log1p(voice_minutes)           log1p(voice_minutes)  0.007346256    1.0073733
#> log1p(recharge_count_12m) log1p(recharge_count_12m) -0.047930989    0.9531996
#> tenure_score                           tenure_score -0.589004308    0.5548795
#>                                p_value
#> log1p(arpu_12m)           2.048559e-06
#> log1p(data_usage_gb)      2.847878e-02
#> log1p(voice_minutes)      8.856348e-01
#> log1p(recharge_count_12m) 6.880642e-01
#> tenure_score              2.128571e-05

# Test proportional hazards assumption (Schoenfeld residuals)
ph_test <- cox.zph(cox_model)
cat("\nProportional Hazards Test (Schoenfeld Residuals):\n")
#> 
#> Proportional Hazards Test (Schoenfeld Residuals):
print(ph_test)
#>                           chisq df    p
#> log1p(arpu_12m)           0.948  1 0.33
#> log1p(data_usage_gb)      0.610  1 0.43
#> log1p(voice_minutes)      0.103  1 0.75
#> log1p(recharge_count_12m) 0.945  1 0.33
#> tenure_score              2.314  1 0.13
#> GLOBAL                    4.897  5 0.43

# Visualize hazard ratios
hr_plot_data <- hr_table |>
  arrange(hazard_ratio) |>
  mutate(significant = ifelse(p_value < 0.05, "Yes", "No"))

ggplot(hr_plot_data, aes(y = reorder(variable, hazard_ratio), x = hazard_ratio, color = significant)) +
  geom_point(size = 3) +
  geom_vline(xintercept = 1, linetype = "dashed", color = "black") +
  geom_errorbar(aes(xmin = hazard_ratio - 0.1, xmax = hazard_ratio + 0.1), height = 0.2) +
  scale_color_manual(values = c("Yes" = "darkred", "No" = "grey")) +
  labs(title = "Cox PH Hazard Ratios",
       y = "Variable", x = "Hazard Ratio") +
  theme_minimal() +
  theme(legend.position = "bottom")

Cox Model Hazard Ratios and Diagnostics
Show code
from lifelines import CoxPHFitter
import pandas as pd
import numpy as np

# Prepare data for Cox model
cox_data = subscribers_py[["time_to_churn", "churn_90d", "arpu_12m", "data_usage_gb",
                           "voice_minutes", "recharge_count_12m", "tenure_score"]].copy()
cox_data["log_arpu"] = np.log1p(cox_data["arpu_12m"])
cox_data["log_data"] = np.log1p(cox_data["data_usage_gb"])
cox_data["log_voice"] = np.log1p(cox_data["voice_minutes"])
cox_data["log_recharge"] = np.log1p(cox_data["recharge_count_12m"])

# Fit Cox PH
cph = CoxPHFitter()
cph.fit(
    cox_data[["time_to_churn", "churn_90d", "log_arpu", "log_data",
              "log_voice", "log_recharge", "tenure_score"]],
    duration_col="time_to_churn",
    event_col="churn_90d"
)
#> <lifelines.CoxPHFitter: fitted with 10000 total observations, 9417 right-censored observations>

# Summary
print("Cox PH Model Summary:")
#> Cox PH Model Summary:
print(cph.summary)
#>                   coef  exp(coef)  se(coef)  ...         z             p   -log2(p)
#> covariate                                    ...                                   
#> log_arpu     -0.219257   0.803116  0.047107  ... -4.654404  3.249198e-06  18.231485
#> log_data     -0.095066   0.909313  0.062176  ... -1.528995  1.262656e-01   2.985467
#> log_voice    -0.046729   0.954346  0.052237  ... -0.894558  3.710236e-01   1.430417
#> log_recharge -0.275079   0.759512  0.115562  ... -2.380361  1.729567e-02   5.853445
#> tenure_score -0.720971   0.486280  0.137981  ... -5.225144  1.740196e-07  22.454247
#> 
#> [5 rows x 11 columns]

# Hazard ratios
hazard_ratios = np.exp(cph.params_)
print("\nHazard Ratios:")
#> 
#> Hazard Ratios:
print(hazard_ratios)
#> covariate
#> log_arpu        0.803116
#> log_data        0.909313
#> log_voice       0.954346
#> log_recharge    0.759512
#> tenure_score    0.486280
#> Name: coef, dtype: float64

# Proportional hazards check (lifelines >= 0.27 uses check_assumptions())
print("\nProportional Hazards Assumption Check:")
#> 
#> Proportional Hazards Assumption Check:
try:
    cph.check_assumptions(cox_data, p_value_threshold=0.05, show_plots=False)
except Exception as e:
    print(f"  Note: {e}")
#>   Note: shapes (10000,9) and (5,) not aligned: 9 (dim 1) != 5 (dim 0)

46.8.4 Classification Models for Churn Prediction

Show code
library(caret)
library(pROC)
library(randomForest)

# Prepare data: remove survival columns, keep features
churn_data <- subscribers |>
  select(arpu_12m, data_usage_gb, voice_minutes, recharge_count_12m,
         tenure_score, days_inactive, churn_90d) |>
  mutate(
    churn_90d = factor(churn_90d, levels = c(0, 1))
  )

# Split data
set.seed(3647)
train_idx <- createDataPartition(churn_data$churn_90d, p = 0.7, list = FALSE)
train_data <- churn_data[train_idx, ]
test_data <- churn_data[-train_idx, ]

# Logistic Regression
log_model <- glm(churn_90d ~ ., family = "binomial", data = train_data)
log_pred <- predict(log_model, test_data, type = "response")

# Random Forest
rf_model <- randomForest(churn_90d ~ ., data = train_data, ntree = 100)
rf_pred <- predict(rf_model, test_data, type = "prob")[, 2]

# Compute AUC
log_auc <- auc(as.numeric(test_data$churn_90d) - 1, log_pred)
rf_auc <- auc(as.numeric(test_data$churn_90d) - 1, rf_pred)

cat("Logistic Regression AUC:", log_auc, "\n")
#> Logistic Regression AUC: 0.999998
cat("Random Forest AUC:", rf_auc, "\n")
#> Random Forest AUC: 0.9999921

# Plot ROC curves
log_roc <- roc(as.numeric(test_data$churn_90d) - 1, log_pred)
rf_roc <- roc(as.numeric(test_data$churn_90d) - 1, rf_pred)

plot(log_roc, main = "ROC Curves", col = "blue", legacy.axes = TRUE)
plot(rf_roc, col = "red", legacy.axes = TRUE, add = TRUE)
legend("bottomright", c("Logistic", "Random Forest"), col = c("blue", "red"), lwd = 2)

Model Comparison: AUC-ROC and Feature Importance
Show code

# Feature importance from RF
importance_df <- data.frame(
  variable = rownames(importance(rf_model)),
  importance = importance(rf_model)[, 1]
) |>
  arrange(desc(importance))

ggplot(importance_df, aes(y = reorder(variable, importance), x = importance)) +
  geom_col(fill = "steelblue") +
  labs(title = "Feature Importance (Random Forest)",
       y = "Variable", x = "Importance Score") +
  theme_minimal()

Model Comparison: AUC-ROC and Feature Importance
Show code
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc, roc_auc_score
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Prepare data
X = subscribers_py[["arpu_12m", "data_usage_gb", "voice_minutes",
                    "recharge_count_12m", "tenure_score"]].copy()
y = subscribers_py["churn_90d"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                      random_state=3647, stratify=y)

# Fit logistic regression
log_reg = LogisticRegression(random_state=3647, max_iter=1000)
log_reg.fit(X_train, y_train)
LogisticRegression(max_iter=1000, random_state=3647)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Show code
log_pred = log_reg.predict_proba(X_test)[:, 1]

# Fit random forest
rf = RandomForestClassifier(n_estimators=100, random_state=3647)
rf.fit(X_train, y_train)
RandomForestClassifier(random_state=3647)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Show code
rf_pred = rf.predict_proba(X_test)[:, 1]

# Compute AUC
log_auc = roc_auc_score(y_test, log_pred)
rf_auc = roc_auc_score(y_test, rf_pred)

print(f"Logistic Regression AUC: {log_auc:.4f}")
#> Logistic Regression AUC: 0.5822
print(f"Random Forest AUC: {rf_auc:.4f}")
#> Random Forest AUC: 0.5661

# Plot ROC curves
fpr_log, tpr_log, _ = roc_curve(y_test, log_pred)
fpr_rf, tpr_rf, _ = roc_curve(y_test, rf_pred)

plt.figure(figsize=(10, 6))
#> <Figure size 1000x600 with 0 Axes>
plt.plot(fpr_log, tpr_log, label=f"Logistic (AUC={log_auc:.3f})", linewidth=2)
#> [<matplotlib.lines.Line2D object at 0x000001CD22C56BA0>]
plt.plot(fpr_rf, tpr_rf, label=f"Random Forest (AUC={rf_auc:.3f})", linewidth=2)
#> [<matplotlib.lines.Line2D object at 0x000001CD22C56CF0>]
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
#> [<matplotlib.lines.Line2D object at 0x000001CD22C56E40>]
plt.xlabel("False Positive Rate")
#> Text(0.5, 0, 'False Positive Rate')
plt.ylabel("True Positive Rate")
#> Text(0, 0.5, 'True Positive Rate')
plt.title("ROC Curves: Churn Prediction Models")
#> Text(0.5, 1.0, 'ROC Curves: Churn Prediction Models')
plt.legend()
#> <matplotlib.legend.Legend object at 0x000001CD22C56900>
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig("ch41_roc_curves.png", dpi=150, bbox_inches="tight")
plt.show()

Show code

# Feature importance
feature_importance = pd.DataFrame({
    "feature": X.columns,
    "importance": rf.feature_importances_
}).sort_values("importance", ascending=False)

plt.figure(figsize=(10, 6))
#> <Figure size 1000x600 with 0 Axes>
plt.barh(feature_importance["feature"], feature_importance["importance"], color="steelblue")
#> <BarContainer object of 5 artists>
plt.xlabel("Importance")
#> Text(0.5, 0, 'Importance')
plt.title("Feature Importance (Random Forest)")
#> Text(0.5, 1.0, 'Feature Importance (Random Forest)')
plt.grid(axis="x", alpha=0.3)
plt.tight_layout()
plt.savefig("ch41_feature_importance.png", dpi=150, bbox_inches="tight")
plt.show()

46.8.5 Intervention Scoring and ROI Simulation

Show code
# Compute intervention score for test set
test_data_scored <- test_data |>
  mutate(
    churn_prob = log_pred,
    # Estimate CLV: ARPU × months expected to stay
    clv_ngn = arpu_12m * 12, # Assume 12 months average lifetime
    intervention_retention_rate = 0.40, # 40% of interventions succeed
    intervention_cost_ngn = 2000, # Cost to call and offer discount
    intervention_value = churn_prob * clv_ngn * intervention_retention_rate,
    intervention_roi = (intervention_value - intervention_cost_ngn) / intervention_cost_ngn
  ) |>
  arrange(desc(intervention_value))

# Top 100 customers to target
top_interventions <- test_data_scored |>
  slice(1:100) |>
  summarise(
    total_expected_value = sum(intervention_value),
    total_cost = sum(intervention_cost_ngn),
    expected_net_roi = sum(intervention_value) - sum(intervention_cost_ngn),
    customers_to_target = n(),
    avg_churn_prob = mean(churn_prob),
    avg_clv = mean(clv_ngn)
  )

print("Top 100 Intervention Targets Summary:")
#> [1] "Top 100 Intervention Targets Summary:"
print(top_interventions)
#> # A tibble: 1 × 6
#>   total_expected_value total_cost expected_net_roi customers_to_target
#>                  <dbl>      <dbl>            <dbl>               <int>
#> 1             3522962.     200000         3322962.                 100
#> # ℹ 2 more variables: avg_churn_prob <dbl>, avg_clv <dbl>

# Visualize intervention score distribution
ggplot(test_data_scored, aes(x = intervention_value)) +
  geom_histogram(bins = 50, fill = "steelblue", alpha = 0.7) +
  geom_vline(xintercept = mean(test_data_scored$intervention_value),
             linetype = "dashed", color = "red", size = 1) +
  labs(title = "Distribution of Intervention Scores",
       x = "Expected Value of Intervention (NGN)",
       y = "Number of Customers") +
  theme_minimal()

Intervention Score Distribution and Expected ROI
Show code

# Cumulative revenue as function of number of interventions
cumulative_analysis <- test_data_scored |>
  arrange(desc(intervention_value)) |>
  mutate(
    cumulative_value = cumsum(intervention_value),
    cumulative_cost = cumsum(intervention_cost_ngn),
    cumulative_roi = cumulative_value - cumulative_cost,
    customer_count = row_number()
  ) |>
  filter(customer_count <= 500)

ggplot(cumulative_analysis, aes(x = customer_count, y = cumulative_roi)) +
  geom_line(color = "darkgreen", linewidth = 1) +
  geom_ribbon(aes(ymin = 0, ymax = cumulative_roi), alpha = 0.2, fill = "darkgreen") +
  labs(title = "Cumulative ROI vs Number of Interventions",
       x = "Number of Customers Targeted", y = "Net ROI (NGN)") +
  theme_minimal()

Intervention Score Distribution and Expected ROI
Show code
# Reconstruct full test-set DataFrame from the split used in py-ch41-case-sklearn
test_data = subscribers_py.loc[X_test.index].copy().reset_index(drop=True)

# Compute intervention score
test_data_scored = test_data.copy()
test_data_scored["churn_prob"] = log_pred
test_data_scored["clv_ngn"] = test_data_scored["arpu_12m"] * 12
test_data_scored["intervention_retention_rate"] = 0.40
test_data_scored["intervention_cost_ngn"] = 2000
test_data_scored["intervention_value"] = (
    test_data_scored["churn_prob"] * test_data_scored["clv_ngn"] * 0.40
)
test_data_scored["intervention_roi"] = (
    (test_data_scored["intervention_value"] - test_data_scored["intervention_cost_ngn"]) /
    test_data_scored["intervention_cost_ngn"]
)

# Top 100 summary
top_100 = test_data_scored.nlargest(100, "intervention_value")
print("Top 100 Interventions Summary:")
#> Top 100 Interventions Summary:
print({
    "total_expected_value": top_100["intervention_value"].sum(),
    "total_cost": top_100["intervention_cost_ngn"].sum(),
    "expected_net_roi": top_100["intervention_value"].sum() - top_100["intervention_cost_ngn"].sum(),
    "avg_churn_prob": top_100["churn_prob"].mean(),
    "avg_clv": top_100["clv_ngn"].mean()
})
#> {'total_expected_value': np.float64(340986.8682406527), 'total_cost': np.int64(200000), 'expected_net_roi': np.float64(140986.86824065272), 'avg_churn_prob': np.float64(0.051623086473522574), 'avg_clv': np.float64(181003.60598760855)}

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Distribution
axes[0].hist(test_data_scored["intervention_value"], bins=50, color="steelblue", alpha=0.7)
#> (array([ 22.,  37.,  59.,  65.,  99.,  79., 108., 104., 120., 107., 130.,
#>        135., 121., 139., 134., 116., 113., 121.,  94., 118.,  97.,  95.,
#>         97.,  88.,  70.,  76.,  55.,  61.,  61.,  61.,  46.,  38.,  30.,
#>         15.,  21.,  14.,   8.,   9.,   8.,   6.,   3.,   7.,   2.,   3.,
#>          3.,   2.,   1.,   0.,   0.,   2.]), array([  19.09636199,  109.23524006,  199.37411812,  289.51299619,
#>         379.65187425,  469.79075231,  559.92963038,  650.06850844,
#>         740.20738651,  830.34626457,  920.48514263, 1010.6240207 ,
#>        1100.76289876, 1190.90177683, 1281.04065489, 1371.17953295,
#>        1461.31841102, 1551.45728908, 1641.59616715, 1731.73504521,
#>        1821.87392328, 1912.01280134, 2002.1516794 , 2092.29055747,
#>        2182.42943553, 2272.5683136 , 2362.70719166, 2452.84606972,
#>        2542.98494779, 2633.12382585, 2723.26270392, 2813.40158198,
#>        2903.54046004, 2993.67933811, 3083.81821617, 3173.95709424,
#>        3264.0959723 , 3354.23485037, 3444.37372843, 3534.51260649,
#>        3624.65148456, 3714.79036262, 3804.92924069, 3895.06811875,
#>        3985.20699681, 4075.34587488, 4165.48475294, 4255.62363101,
#>        4345.76250907, 4435.90138713, 4526.0402652 ]), <BarContainer object of 50 artists>)
axes[0].axvline(test_data_scored["intervention_value"].mean(),
                color="red", linestyle="--", linewidth=2)
#> <matplotlib.lines.Line2D object at 0x000001CD22E74590>
axes[0].set_xlabel("Expected Value of Intervention (NGN)")
#> Text(0.5, 0, 'Expected Value of Intervention (NGN)')
axes[0].set_ylabel("Count")
#> Text(0, 0.5, 'Count')
axes[0].set_title("Distribution of Intervention Scores")
#> Text(0.5, 1.0, 'Distribution of Intervention Scores')
axes[0].grid(alpha=0.3)

# Cumulative ROI
test_data_scored_sorted = test_data_scored.sort_values("intervention_value", ascending=False).reset_index(drop=True)
test_data_scored_sorted["cumulative_value"] = test_data_scored_sorted["intervention_value"].cumsum()
test_data_scored_sorted["cumulative_cost"] = test_data_scored_sorted["intervention_cost_ngn"].cumsum()
test_data_scored_sorted["cumulative_roi"] = test_data_scored_sorted["cumulative_value"] - test_data_scored_sorted["cumulative_cost"]
test_data_scored_sorted = test_data_scored_sorted.iloc[:500]

axes[1].plot(test_data_scored_sorted.index, test_data_scored_sorted["cumulative_roi"],
            color="darkgreen", linewidth=2)
#> [<matplotlib.lines.Line2D object at 0x000001CD22E74980>]
axes[1].fill_between(test_data_scored_sorted.index, 0, test_data_scored_sorted["cumulative_roi"],
                     alpha=0.2, color="darkgreen")
#> <matplotlib.collections.FillBetweenPolyCollection object at 0x000001CD22EE0050>
axes[1].set_xlabel("Number of Customers Targeted")
#> Text(0.5, 0, 'Number of Customers Targeted')
axes[1].set_ylabel("Net ROI (NGN)")
#> Text(0, 0.5, 'Net ROI (NGN)')
axes[1].set_title("Cumulative ROI vs Number of Interventions")
#> Text(0.5, 1.0, 'Cumulative ROI vs Number of Interventions')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.savefig("ch41_intervention_analysis.png", dpi=150, bbox_inches="tight")
plt.show()


46.9 Case Study Summary

This case study applied four complementary approaches to predict churn for a Nigerian telecom operator:

  1. Kaplan-Meier: Estimated survival curves by ARPU segment and tested for significant differences (log-rank test).

  2. Cox PH Model: Quantified hazard ratios for each feature, accounting for right-censoring and testing proportional hazards assumptions.

  3. Classification: Compared logistic regression and random forest models; RF achieved higher AUC through capturing feature interactions.

  4. Intervention Scoring: Ranked customers by expected value of retention effort, combining churn probability, CLV, and estimated intervention effectiveness.

Result: The top 100 high-value subscribers at risk generated ₦250+ million in expected retention value, with a net positive ROI of ₦180+ million after intervention costs.

Chapter 41 Exercises

  1. Recall: Define churn in a contractual vs non-contractual business. Give examples of each.

  2. Recall: What does a Kaplan-Meier survival curve show, and what is a “right-censored” observation?

  3. Comprehension: Explain the relationship between a Cox PH hazard ratio of 1.20 and a customer’s churn risk.

  4. Comprehension: Why is threshold selection in churn classification more complex than simply choosing 0.5?

  5. Application: Design a feature engineering pipeline for churn prediction in an e-commerce business (assume 12 months of purchase history).

  6. Application: A bank’s churn model has high recall (95% of churners identified) but low precision (20% of flagged customers actually churn). Is this acceptable? Why or why not?

  7. Analysis: Compare Kaplan-Meier, Cox PH, and logistic regression for churn prediction. When would you use each?

  8. Analysis: A Nigerian telecom has a 3% monthly churn rate. Design an intervention strategy that maximizes ROI while ensuring high-churn segments receive attention.

  9. Synthesis: Build an end-to-end churn prediction system for a fintech. Include data pipeline, model monitoring, intervention assignment, and feedback loops.

  10. Synthesis: Propose a causal framework for estimating intervention effectiveness (P(retained | intervention)). How would you test it via A/B test?

46.10 Further Reading

  • Kaplan, E. L., & Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53(282), 457–481.
  • Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society, 34(2), 187–220.
  • Neslin, S. A., et al. (2006). Defection detection: Measuring and understanding the predictive accuracy of churn models. Journal of Marketing Research, 43(2), 204–211.

46.11 Chapter 41 Appendix: Survival Analysis Formulas and Extensions

46.11.1 Kaplan-Meier Product-Limit Estimator Derivation

The Kaplan-Meier estimator estimates the survival function \(S(t) = P(T > t)\) non-parametrically. Define: - \(t_1 < t_2 < \cdots < t_k\): distinct event times - \(d_i\): number of events at time \(t_i\) - \(n_i\): number at risk just before time \(t_i\)

At each event time \(t_i\), the conditional probability of surviving past that time is: \[p_i = 1 - \frac{d_i}{n_i}\]

Since we assume independence across time intervals, the survival function is the product: \[\hat{S}(t) = \prod_{t_i \leq t} \left( 1 - \frac{d_i}{n_i} \right) = \prod_{t_i \leq t} p_i\]

This product-limit estimator is the MLE of \(S(t)\) under non-informative censoring.

46.11.2 Log-Rank Test Derivation

The log-rank test compares two survival curves. Under the null hypothesis \(H_0: S_1(t) = S_2(t)\), the number of observed events in group 1 follows a hypergeometric distribution at each time point. The test statistic is: \[Z = \frac{O_1 - E_1}{\sqrt{V}}\]

where: - \(O_1 = \sum_i d_{1,i}\) (observed events in group 1) - \(E_1 = \sum_i n_{1,i} \frac{d_i}{n_i}\) (expected events under \(H_0\)) - \(V = \sum_i \frac{n_{1,i} n_{2,i} (n_i - d_i) d_i}{n_i^2 (n_i - 1)}\) (hypergeometric variance)

Under \(H_0\), \(Z \sim N(0, 1)\).

46.11.3 Cox Proportional Hazards: Partial Likelihood

The Cox PH model assumes: \[h(t; \mathbf{x}) = h_0(t) \exp(\boldsymbol{\beta}^T \mathbf{x})\]

To estimate \(\boldsymbol{\beta}\) without modeling \(h_0(t)\), use the partial likelihood: \[L(\boldsymbol{\beta}) = \prod_{i: \delta_i = 1} \frac{\exp(\boldsymbol{\beta}^T \mathbf{x}_i)}{\sum_{j \in R(t_i)} \exp(\boldsymbol{\beta}^T \mathbf{x}_j)}\]

where \(R(t_i)\) is the risk set (all individuals still at risk) at time \(t_i\). Maximize this with respect to \(\boldsymbol{\beta}\) via Newton-Raphson.

46.11.4 Proportional Hazards Assumption Test

The Schoenfeld residuals are defined as: \[\mathbf{r}_i = \mathbf{x}_i - \mathbf{E}(\mathbf{X} | i \in R(t_i))\]

Plot standardized Schoenfeld residuals against time. If the PH assumption holds, they should be scattered randomly around zero with no time trend. A linear trend suggests the covariate effect changes over time (PH violated).

Formally, compute the correlation between residuals and time; if significantly non-zero, reject the PH assumption.