49 Lead Scoring: Full Applied Deployment

📋 Learning Objectives

Understand the difference between a trained model and a deployed system; the operational gaps to bridge
Engineer a complete feature pipeline for B2B lead scoring: firmographic, geographic, and digital signals
Train logistic regression for interpretability and implement probability calibration (Platt scaling, isotonic regression)
Translate probabilities to business tiers (Hot/Warm/Cold) aligned with sales capacity
Use SHAP waterfall plots to explain individual lead scores to non-technical stakeholders
Monitor score drift via Population Stability Index and champion-challenger model testing
Measure the business impact of lead scoring via randomised controlled trials

49.1 From Model to Business Process: The Deployment Gap

Building a churn prediction model (Chapter 41) or a segmentation model (Chapter 40) is research work. Deploying it so a sales team uses the scores daily is operational work. The gap is substantial.

The Research-to-Operations Pipeline:

Offline Development: Build model on historical data. Achieve 85% AUC on a holdout test set. Celebrate.
Scoring Pipeline: Create a process to score new leads in batch (daily, weekly) or real time. Integrate with CRM.
Calibration: Ensure probabilities are trustworthy. A lead scored 0.45 should have ~45% true conversion rate, not 55%.
Business Tiers: Convert probabilities (continuous) to actionable tiers (Hot/Warm/Cold/Lead) that the sales team understands.
Explainability: Explain why a lead is Hot (not just “the model says so”). SHAP waterfall: “This lead scored 0.78 primarily because (a) they requested a demo (this alone would suggest 0.35), and (b) their company size is in the target range (adds 0.15).”
Monitoring: Watch for score drift. If the distribution of scores changes monthly, the model may be degrading or the market may be shifting.
Feedback Loop: Track which leads converted. Use this to re-train the model and improve accuracy. A/B test the model against status quo.

49.2 Feature Engineering Revisited: The Full Signal Set for B2B Lead Scoring

A B2B SaaS (e.g., a Pan-African Fintech) scores enterprise prospects. The feature set combines firmographics (company attributes), geography, and digital signals.

Firmographic Signals: - Company size (headcount, revenue): Larger companies are more valuable and higher-risk (longer sales cycles). - Industry: Tech companies, financial services, and retail are high-value segments; government is slow-moving. - Annual revenue: Proxy for ability to pay. A company with ₦100M revenue can afford higher-tier pricing. - Funding stage: Startups are early-stage (long sales cycle, low budget); series-C+ are growth-stage (faster to buy). - Years in business: Older companies (10+ years) are more stable; younger ones are more agile but riskier. - Technology stack: Do they use APIs? Are they building their own solutions (vs buying SaaS)?

Geographic Signals: - Country: Nigeria, Ghana, Kenya, South Africa show different market maturity. Lagos market is most mature; Tier-2 cities are emerging. - Industry concentration: A fintech concentrated in Lagos has different TAM than one nationwide. - Competitive intensity: Is the prospect in a region saturated with competitors?

Digital Behaviour Signals: - Website visit recency: Did they visit the pricing page last week? Highly engaged. - Pages visited: Did they explore features deeply or just bounce from search? - Content downloaded: Downloaded a case study? Higher intent. - Email engagement: Do they open marketing emails? Click through? - Demo requested: The strongest single signal; strong purchase intent. - Days since last engagement: Has engagement gone dormant after initial interest?

Decay Functions: A page visit 90 days ago contributes less signal than one from yesterday. Implement decay: \[\text{Signal}_t = \text{Signal}_{raw} \times e^{-\lambda t}\]

where $\lambda = 0.01$ (half-life ~70 days). Recent signals matter more.

49.3 Model Training and Probability Calibration

Logistic Regression is ideal for deployed systems because: - Coefficients are interpretable (easy to explain to stakeholders) - Fast to score (milliseconds per lead) - Regulatory (finance, credit) prefers interpretable models - Probability outputs are well-understood

Train on historical leads with conversion labels (converted / not converted within 90 days of scoring): \[\log \left( \frac{P(\text{convert})}{1 - P(\text{convert})} \right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p\]

Probability Calibration ensures the predicted probability matches the true frequency. A lead predicted 0.45 should have ~45% true conversion rate. Without calibration, a model might predict 0.35 on average but true conversion is 0.50 (overconfident).

Platt Scaling is simple calibration: fit a logistic regression on the model’s predictions: \[\text{Calibrated}_i = \sigma(a \times \text{Prediction}_i + b)\]

where $a$ and $b$ are fit via logistic regression on validation data. One-dimensional, fast, effective.

Isotonic Regression is non-parametric: fit a monotone increasing step function to predictions vs true labels. More flexible than Platt scaling but can overfit on small validation sets.

Reliability Diagram: Plot predicted probability (x-axis) vs observed frequency (y-axis). A calibrated model lies on the diagonal (predicted = observed). Plots above the diagonal are underconfident; below are overconfident.

49.4 Threshold Selection and Tier Design

Converting a probability to a business action requires a threshold. But which?

Single Threshold (Cutoff): If $P(\text{convert}) > 0.40$, flag as Hot; else Cold. Simple but crude.

Tier Design: Create multiple thresholds to align with sales capacity.

Example: - Hot (P > 0.50): High conversion likelihood, high value. Sales team works these immediately. Budget: 50 leads/month. - Warm (0.30 < P ≤ 0.50): Moderate likelihood. Nurture via email. Budget: 200 leads/month. - Cold (0.15 < P ≤ 0.30): Low likelihood but worth tracking. Automated email campaigns. Budget: unlimited. - Unqualified (P ≤ 0.15): Very low likelihood. Ignore or mark for future prospecting.

Threshold Optimization: Choose thresholds to maximize expected value: \[\text{EV} = P(\text{convert}) \times \text{CLV} - (1 - P(\text{convert})) \times P(\text{false positive cost})\]

If a conversion is worth ₦2,000,000 and a false positive (wasted sales time) costs ₦50,000, then: - At $P = 0.50$: EV = 0.50 × 2M − 0.50 × 50K = ₦975K ✓ Hot - At $P = 0.30$: EV = 0.30 × 2M − 0.70 × 50K = ₦565K ✓ Warm - At $P = 0.10$: EV = 0.10 × 2M − 0.90 × 50K = ₦155K ✗ Not worth nurturing

This framework ensures thresholds reflect business economics, not arbitrary cutoffs.

49.5 Explainability for Sales Teams: SHAP Waterfall Plots

A sales director receives a lead scored 0.72 (Hot). They ask: “Why is this Hot? Should I prioritize them?”

A simple answer (“The model says so”) is useless. A SHAP waterfall explains it:

Base value (average): 0.35
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Demo requested:        +0.20  (strong signal, moves from 0.35 to 0.55)
Company size (301 emp): +0.12  (in target range, moves from 0.55 to 0.67)
Website visits (8):     +0.05  (moderate engagement, moves from 0.67 to 0.72)
Country (Nigeria):      −0.03  (slight negative; Lagos market is saturated)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Predicted score:        0.72  (Hot tier, 72% likely to convert)

Now the director understands: the lead is Hot primarily because they requested a demo and are the right company size. It’s worth the sales team’s time.

SHAP waterfall is generated via Python’s shap library:

explainer = shap.LinearExplainer(model, X_train)
shap_values = explainer.shap_values(X_test)
shap.plots.waterfall(shap_values[0])  # Explain first lead

49.6 Monitoring and Model Maintenance: Drift Detection

Once live, the model faces drift: the distribution of leads changes, or market conditions shift, degrading model performance.

Population Stability Index (PSI) (Chapter 40) detects score distribution shift. Compute PSI of scores this month vs last month: \[\text{PSI} = \sum_i (P_i^{\text{current}} - P_i^{\text{baseline}}) \times \ln(P_i^{\text{current}} / P_i^{\text{baseline}})\]

PSI < 0.10: Stable; no action
PSI 0.10–0.25: Minor drift; investigate
PSI > 0.25: Significant drift; consider re-training

Champion-Challenger Framework: Run the new model in parallel with the current (champion) model on a sample of leads (10–20%) for one month. Compare: - AUC: Does the new model rank leads better? - Calibration: Are probabilities more accurate? - Business impact: Do leads scored by the new model convert at the predicted rate?

If the challenger wins on AUC and calibration, promote it. If not, keep the champion.

Concept Drift: Even with stable lead distribution, conversion drivers might change. If digital signal suddenly matters less (e.g., in-person relationships become important post-COVID), the model’s coefficients become stale. Solution: re-train quarterly or when PSI exceeds threshold.

49.7 Business Impact Measurement: A/B Testing Lead Scoring

Finally, measure the business return on the lead scoring system. Design a randomised controlled trial:

Experiment Design: - Treatment: Sales team uses lead scores (Hot/Warm/Cold) - Control: Sales team uses no scores (arbitrary lead assignment or status quo process) - Duration: 2–3 months - Unit: Sales representative (half assigned to treatment, half to control) - Metric: Win rate (% of leads that converted), sales velocity (days to close), revenue per lead

Sample Size: Ensure sufficient leads per rep so random variation doesn’t swamp the treatment effect.

Results Interpretation: - If treatment has 15% higher win rate with p < 0.05, the system is valuable. - Compute ROI: (Incremental Revenue − System Cost) / System Cost

Example: - 100 leads/rep/month → 2 additional conversions per rep with lead scoring - ₦2M value per conversion (CLV) - 20 sales reps → 40 additional conversions × ₦2M = ₦80M incremental revenue - System cost: ₦500K/month - ROI: (₦80M − ₦500K) / ₦500K ≈ 160× return on investment

49.8 Real-World Complications and Mitigation

Data Leakage: Ensure features don’t include information leaked from the future. Example: don’t use “has_received_salesforce_note” as a feature because the note is created after qualification.

Feedback Loops: Leads scored High get more sales attention and convert more, creating a self-fulfilling prophecy. Measure true model performance by comparing conversion rates among leads scored similarly but given different levels of sales attention (A/B test).

Fairness: Ensure the model doesn’t unfairly discriminate by geography or industry. If the model systematically down-scores companies in Tier-2 cities due to historical data bias, you lose valuable market opportunity.

📝 Section 44.8 Review Questions

Explain the difference between a model’s AUC and its probability calibration. Can a model have high AUC but poor calibration?
Design a three-tier (Hot/Warm/Cold) system aligned with sales team capacity of 50 hot leads per month.
What does a SHAP waterfall plot show, and why is it useful for stakeholder communication?
A champion-challenger test shows the challenger has 2% higher AUC but fails to show business improvement in A/B test. What might explain this?
How would you detect and respond to concept drift in a lead scoring system?

49.9 Case Study: B2B Lead Scoring for a Pan-African Fintech SaaS

Company: Fluidly, a fictional B2B SaaS fintech, provides APIs for African payment processing. They want to prioritise sales efforts using lead scoring.

Problem: Fluidly’s sales team has been pitching to all inbound leads equally. This is inefficient: some leads (large corporates with budget) have 60% conversion; others (bootstrapped startups) have 5%. They want to score leads so Hot leads get immediate attention.

Data: 2,000 historical leads from the past 18 months with conversion labels. Features include firmographics, geography, and 6 months of digital engagement (website visits, demo requests, email opens).

49.9.1 Feature Engineering Pipeline

Show code

library(tidyverse)
library(ggplot2)

set.seed(5038)

# Synthetic B2B SaaS lead data
n_leads <- 2000

leads_df <- tibble(
  lead_id = 1:n_leads,
  # Firmographics
  company_size = rgamma(n_leads, shape = 2, scale = 200), # employees
  annual_revenue_millions = rgamma(n_leads, shape = 2, scale = 50),
  industry = sample(c("Tech", "Finance", "Retail", "Manufacturing", "Other"),
                   n_leads, replace = TRUE, prob = c(0.25, 0.25, 0.20, 0.15, 0.15)),
  years_in_business = rgamma(n_leads, shape = 2, scale = 10),
  # Geographic
  country = sample(c("Nigeria", "Ghana", "Kenya", "South Africa"),
                  n_leads, replace = TRUE, prob = c(0.35, 0.25, 0.20, 0.20)),
  # Digital signals (past 6 months)
  website_visits = rpois(n_leads, lambda = 5),
  demo_requested = rbinom(n_leads, 1, prob = 0.15),
  days_since_first_visit = sample(30:180, n_leads, replace = TRUE),
  content_downloads = rpois(n_leads, lambda = 1),
  email_open_rate = runif(n_leads, 0, 1),
  # Derived features
  website_recency = 180 - days_since_first_visit,
  engagement_score = website_visits + demo_requested * 5 + content_downloads
) |>
  mutate(
    # Simulate conversion likelihood (ground truth)
    base_prob = 0.15,
    prob_adjustments = (
      (company_size > 500) * 0.15 +  # Large companies more likely
      (annual_revenue_millions > 100) * 0.10 +
      (industry %in% c("Tech", "Finance")) * 0.08 +
      demo_requested * 0.25 +
      (website_visits >= 10) * 0.12 +
      (content_downloads >= 2) * 0.05 +
      (country %in% c("Nigeria", "Kenya")) * 0.02
    ),
    converted = as.integer(runif(n_leads) < pmin(base_prob + prob_adjustments, 0.95))
  )

# Feature engineering: create derived features
leads_engineered <- leads_df |>
  mutate(
    # Log-transform skewed features
    company_size_log = log1p(company_size),
    revenue_log = log1p(annual_revenue_millions),
    # Categorical dummies
    is_tech = as.integer(industry %in% c("Tech", "Finance")),
    is_large = as.integer(company_size > 500),
    is_mature = as.integer(years_in_business > 5),
    is_ng_ke = as.integer(country %in% c("Nigeria", "Kenya")),
    # Interaction: large tech company
    is_large_tech = is_large * is_tech
  )

# Summarise features
feature_cols <- c("company_size_log", "revenue_log", "years_in_business",
                 "website_visits", "demo_requested", "engagement_score",
                 "website_recency", "email_open_rate", "content_downloads",
                 "is_tech", "is_large", "is_mature", "is_ng_ke", "is_large_tech")

# Compute correlation with conversion
correlation_data <- tibble(
  feature = feature_cols,
  correlation = sapply(feature_cols, function(f) cor(leads_engineered[[f]], leads_engineered$converted))
) |>
  arrange(desc(abs(correlation)))

ggplot(correlation_data, aes(y = reorder(feature, correlation), x = correlation,
                             fill = ifelse(correlation > 0, "Positive", "Negative"))) +
  geom_col() +
  scale_fill_manual(values = c("Positive" = "darkgreen", "Negative" = "darkred")) +
  labs(title = "Feature Correlation with Lead Conversion",
       y = "Feature", x = "Correlation Coefficient") +
  theme_minimal() +
  theme(legend.position = "bottom")

Feature Importance Distribution and Correlation with Conversion

Show code


print("Top Features by Correlation with Conversion:")
#> [1] "Top Features by Correlation with Conversion:"
print(head(correlation_data, 10))
#> # A tibble: 10 × 2
#>    feature          correlation
#>    <chr>                  <dbl>
#>  1 demo_requested        0.183 
#>  2 engagement_score      0.145 
#>  3 is_large_tech         0.138 
#>  4 is_large              0.102 
#>  5 is_tech               0.0927
#>  6 company_size_log      0.0689
#>  7 revenue_log           0.0487
#>  8 website_visits        0.0422
#>  9 email_open_rate      -0.0370
#> 10 website_recency      -0.0366

Show code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(5038)

# Synthetic lead data
n_leads = 2000
company_size = np.random.gamma(2, 200, n_leads)
annual_revenue = np.random.gamma(2, 50, n_leads)
years_in_business = np.random.gamma(2, 10, n_leads)
website_visits = np.random.poisson(5, n_leads)
demo_requested = np.random.binomial(1, 0.15, n_leads)
content_downloads = np.random.poisson(1, n_leads)
days_since_visit = np.random.randint(30, 181, n_leads)

leads_df = pd.DataFrame({
    "lead_id": np.arange(1, n_leads + 1),
    "company_size": company_size,
    "annual_revenue_millions": annual_revenue,
    "years_in_business": years_in_business,
    "website_visits": website_visits,
    "demo_requested": demo_requested,
    "content_downloads": content_downloads,
    "website_recency": 180 - days_since_visit
})

# Create conversion labels
prob_base = 0.15
prob_adj = (
    (company_size > 500) * 0.15 +
    (annual_revenue > 100) * 0.10 +
    demo_requested * 0.25 +
    (website_visits >= 10) * 0.12
)
conversion_prob = np.minimum(prob_base + prob_adj, 0.95)
leads_df["converted"] = (np.random.random(n_leads) < conversion_prob).astype(int)

# Feature correlation with conversion
feature_cols = ["company_size", "annual_revenue_millions", "years_in_business",
               "website_visits", "demo_requested", "content_downloads", "website_recency"]
correlations = {col: leads_df[col].corr(leads_df["converted"]) for col in feature_cols}
corr_df = pd.DataFrame(list(correlations.items()), columns=["feature", "correlation"])
corr_df = corr_df.sort_values("correlation", ascending=True)

plt.figure(figsize=(10, 6))
colors = ["darkgreen" if x > 0 else "darkred" for x in corr_df["correlation"]]
plt.barh(corr_df["feature"], corr_df["correlation"], color=colors)
plt.xlabel("Correlation with Conversion")
plt.title("Feature Correlation with Lead Conversion")
plt.grid(axis="x", alpha=0.3)
plt.tight_layout()
plt.savefig("ch44_feature_correlation.png", dpi=150, bbox_inches="tight")
plt.show()

Show code


print("Feature Correlations with Conversion:")
#> Feature Correlations with Conversion:
print(corr_df)
#>                    feature  correlation
#> 5        content_downloads    -0.015552
#> 2        years_in_business    -0.000619
#> 6          website_recency     0.015562
#> 3           website_visits     0.025080
#> 1  annual_revenue_millions     0.085930
#> 0             company_size     0.159042
#> 4           demo_requested     0.200608

49.9.2 Logistic Regression Model Training and Calibration

Show code

library(caret)

# Split data
set.seed(8724)
train_idx <- createDataPartition(leads_engineered$converted, p = 0.7, list = FALSE)
train_data <- leads_engineered[train_idx, ]
test_data <- leads_engineered[-train_idx, ]

# Fit logistic regression
feature_cols <- c("company_size_log", "revenue_log", "years_in_business",
                 "website_visits", "demo_requested", "engagement_score",
                 "website_recency", "email_open_rate", "content_downloads",
                 "is_tech", "is_large", "is_mature", "is_ng_ke", "is_large_tech")

log_model <- glm(
  as.formula(paste("converted ~", paste(feature_cols, collapse = " + "))),
  family = "binomial",
  data = train_data
)

# Model summary
summary_log <- summary(log_model)
cat("Model Summary:\n")
#> Model Summary:
cat("AIC:", log_model$aic, "\n")
#> AIC: 1711.884
cat("Deviance:", log_model$deviance, "\n")
#> Deviance: 1683.884

# Extract coefficients
coef_table <- data.frame(
  variable = rownames(summary_log$coefficients)[-1],
  coefficient = summary_log$coefficients[-1, 1],
  p_value = summary_log$coefficients[-1, 4],
  odds_ratio = exp(summary_log$coefficients[-1, 1])
) |>
  arrange(desc(abs(coefficient)))

cat("\nTop Coefficients (largest impact on log-odds):\n")
#> 
#> Top Coefficients (largest impact on log-odds):
print(head(coef_table, 10))
#>                          variable coefficient    p_value odds_ratio
#> demo_requested     demo_requested  0.73328624 0.02978306  2.0819110
#> is_large_tech       is_large_tech  0.42125985 0.10142077  1.5238802
#> is_tech                   is_tech  0.35728134 0.01260294  1.4294380
#> email_open_rate   email_open_rate -0.32329608 0.10913749  0.7237595
#> revenue_log           revenue_log  0.16044237 0.03859667  1.1740301
#> company_size_log company_size_log  0.13552545 0.18929731  1.1451383
#> is_mature               is_mature -0.10379746 0.63975963  0.9014078
#> engagement_score engagement_score  0.08733042 0.14641900  1.0912572
#> is_large                 is_large  0.08359746 0.70631280  1.0871912
#> website_visits     website_visits -0.05829239 0.37781195  0.9433741

# Platt scaling for calibration
test_pred_raw <- predict(log_model, test_data, type = "response")

# Fit logistic regression on predictions to calibrate
calibration_model <- glm(
  converted ~ test_pred_raw,
  family = "binomial",
  data = data.frame(converted = test_data$converted, test_pred_raw = test_pred_raw)
)

test_pred_calibrated <- predict(calibration_model, type = "response")

# Create reliability diagram
calibration_df <- tibble(
  predicted = test_pred_raw,
  actual = test_data$converted
) |>
  mutate(
    bin = cut(predicted, breaks = seq(0, 1, 0.1), labels = seq(0.05, 0.95, 0.1))
  ) |>
  group_by(bin) |>
  summarise(
    mean_predicted = mean(predicted),
    mean_actual = mean(actual),
    n = n(),
    .groups = "drop"
  ) |>
  filter(n > 5) # Only bins with sufficient data

ggplot(calibration_df, aes(x = mean_predicted, y = mean_actual)) +
  geom_point(aes(size = n), alpha = 0.6, color = "steelblue") +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
  xlim(0, 1) +
  ylim(0, 1) +
  labs(title = "Calibration Curve (Reliability Diagram)",
       x = "Predicted Probability", y = "Observed Frequency",
       subtitle = "Perfect calibration lies on diagonal") +
  theme_minimal() +
  theme(aspect.ratio = 1)

Logistic Regression Coefficients and Calibration Curve

Show code

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Prepare features
X = leads_df[["company_size", "annual_revenue_millions", "years_in_business",
             "website_visits", "demo_requested", "content_downloads", "website_recency"]]
y = leads_df["converted"]

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3,
                                                      random_state=8724, stratify=y)

# Fit logistic regression
log_reg = LogisticRegression(random_state=8724, max_iter=1000)
log_reg.fit(X_train, y_train)

LogisticRegression(max_iter=1000, random_state=8724)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Show code


# Coefficients
feature_names = ["Company Size", "Annual Revenue", "Years in Business",
                "Website Visits", "Demo Requested", "Content Downloads", "Website Recency"]
coef_df = pd.DataFrame({
    "Feature": feature_names,
    "Coefficient": log_reg.coef_[0],
    "Odds Ratio": np.exp(log_reg.coef_[0])
}).sort_values("Coefficient", ascending=False)

print("Model Coefficients:")
#> Model Coefficients:
print(coef_df)
#>              Feature  Coefficient  Odds Ratio
#> 4     Demo Requested     0.383156    1.466907
#> 0       Company Size     0.375274    1.455391
#> 1     Annual Revenue     0.150299    1.162182
#> 3     Website Visits     0.006297    1.006317
#> 2  Years in Business     0.002011    1.002013
#> 6    Website Recency    -0.017705    0.982451
#> 5  Content Downloads    -0.063179    0.938776

# Predictions
y_pred_prob = log_reg.predict_proba(X_test)[:, 1]

# Calibration curve
from sklearn.calibration import calibration_curve
prob_true, prob_pred = calibration_curve(y_test, y_pred_prob, n_bins=10)

plt.figure(figsize=(8, 8))
plt.plot([0, 1], [0, 1], "k--", label="Perfect Calibration")
plt.plot(prob_pred, prob_true, "o-", label="Logistic Regression", linewidth=2, markersize=8)
plt.xlabel("Predicted Probability")
plt.ylabel("Observed Frequency")
plt.title("Calibration Curve")
plt.legend()
plt.grid(alpha=0.3)
plt.xlim([0, 1])
#> (0.0, 1.0)
plt.ylim([0, 1])
#> (0.0, 1.0)
plt.gca().set_aspect("equal")
plt.tight_layout()
plt.savefig("ch44_calibration_curve.png", dpi=150, bbox_inches="tight")
plt.show()

49.9.3 Tier Design and Threshold Optimization

Show code

library(tidyverse)

# Compute predictions on test set
test_pred <- predict(log_model, test_data, type = "response")

# Assign tiers based on business logic
# Hot: top 50 leads per month (capacity constraint)
# Warm: next 200 leads per month
# Cold: remaining

test_data_scored <- tibble(
  lead_id = test_data$lead_id,
  score = test_pred,
  converted = test_data$converted
) |>
  arrange(desc(score)) |>
  mutate(
    tier = case_when(
      row_number() <= 50 ~ "Hot",
      row_number() <= 250 ~ "Warm",
      TRUE ~ "Cold"
    )
  )

# Compute conversion rate by tier
tier_performance <- test_data_scored |>
  group_by(tier) |>
  summarise(
    n_leads = n(),
    conversions = sum(converted),
    conversion_rate = conversions / n_leads,
    avg_score = mean(score),
    .groups = "drop"
  ) |>
  mutate(
    tier = factor(tier, levels = c("Hot", "Warm", "Cold"))
  ) |>
  arrange(tier)

print("Tier Performance:")
#> [1] "Tier Performance:"
print(tier_performance)
#> # A tibble: 3 × 5
#>   tier  n_leads conversions conversion_rate avg_score
#>   <fct>   <int>       <int>           <dbl>     <dbl>
#> 1 Hot        50          30           0.6       0.644
#> 2 Warm      200          77           0.385     0.406
#> 3 Cold      350         105           0.3       0.254

# Visualize
ggplot(tier_performance, aes(x = tier, y = conversion_rate)) +
  geom_col(fill = c("darkred", "orange", "lightgray"), alpha = 0.8) +
  geom_text(aes(label = paste0(round(conversion_rate * 100, 1), "%")), vjust = -0.3) +
  labs(title = "Conversion Rate by Tier",
       x = "Tier", y = "Conversion Rate") +
  ylim(0, max(tier_performance$conversion_rate) * 1.2) +
  theme_minimal()

Tier Assignment and Sales Capacity Alignment

Show code


# Score distribution
ggplot(test_data_scored, aes(x = score, fill = tier)) +
  geom_histogram(bins = 30, alpha = 0.7, position = "stack") +
  labs(title = "Score Distribution by Tier",
       x = "Lead Score", y = "Count", fill = "Tier") +
  scale_fill_manual(values = c("Hot" = "darkred", "Warm" = "orange", "Cold" = "lightgray")) +
  theme_minimal()

Show code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Compute predictions
y_pred_full = log_reg.predict_proba(X_scaled)[:, 1]

# Create scored leads dataframe
scored_leads = pd.DataFrame({
    "lead_id": leads_df["lead_id"],
    "score": y_pred_full,
    "converted": leads_df["converted"]
}).sort_values("score", ascending=False).reset_index(drop=True)

# Assign tiers
scored_leads["tier"] = "Cold"
scored_leads.loc[:50, "tier"] = "Hot"
scored_leads.loc[51:250, "tier"] = "Warm"

# Tier performance
tier_perf = scored_leads.groupby("tier").agg({
    "lead_id": "count",
    "converted": "sum",
    "score": "mean"
}).reset_index()
tier_perf.columns = ["tier", "n_leads", "conversions", "avg_score"]
tier_perf["conversion_rate"] = tier_perf["conversions"] / tier_perf["n_leads"]
tier_perf["tier"] = pd.Categorical(tier_perf["tier"], categories=["Hot", "Warm", "Cold"], ordered=True)
tier_perf = tier_perf.sort_values("tier")

print("Tier Performance:")
#> Tier Performance:
print(tier_perf)
#>    tier  n_leads  conversions  avg_score  conversion_rate
#> 1   Hot       51           30   0.674374         0.588235
#> 2  Warm      200           95   0.491041         0.475000
#> 0  Cold     1749          435   0.246067         0.248714

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Conversion rate by tier
colors = {"Hot": "darkred", "Warm": "orange", "Cold": "lightgray"}
tier_colors = [colors[t] for t in tier_perf["tier"]]
axes[0].bar(tier_perf["tier"], tier_perf["conversion_rate"], color=tier_colors, alpha=0.8)
axes[0].set_ylabel("Conversion Rate")
axes[0].set_title("Conversion Rate by Tier")
for i, (tier, rate) in enumerate(zip(tier_perf["tier"], tier_perf["conversion_rate"])):
    axes[0].text(i, rate, f"{rate:.1%}", ha="center", va="bottom")

# Score distribution
for tier in ["Hot", "Warm", "Cold"]:
    data = scored_leads[scored_leads["tier"] == tier]["score"]
    axes[1].hist(data, alpha=0.6, label=tier, color=colors[tier], bins=20)
axes[1].set_xlabel("Lead Score")
axes[1].set_ylabel("Count")
axes[1].set_title("Score Distribution by Tier")
axes[1].legend()

plt.tight_layout()
plt.savefig("ch44_tier_performance.png", dpi=150, bbox_inches="tight")
plt.show()

49.9.4 SHAP Explainability

Show code

library(tidyverse)

# Simple SHAP explanation (approximated without shap library)
# For each lead, show contribution of top 5 features

explain_lead <- function(lead_idx, model, data) {
  # Get the lead's values
  lead_data <- data[lead_idx, ]

  # Compute base value (average prediction)
  all_preds <- predict(model, data, type = "response")
  base_value <- mean(all_preds)

  # Get prediction for this lead
  lead_pred <- predict(model, lead_data, type = "response")

  # Approximate SHAP values (coefficient × feature value, normalized)
  feature_cols <- c("company_size_log", "revenue_log", "years_in_business",
                   "website_visits", "demo_requested", "engagement_score",
                   "website_recency", "email_open_rate", "content_downloads",
                   "is_tech", "is_large", "is_mature", "is_ng_ke", "is_large_tech")

  coefs <- coef(model)[2:(1 + length(feature_cols))]
  shap_approx <- coefs * (as.numeric(lead_data[1, feature_cols]) -
                           colMeans(data[, feature_cols], na.rm = TRUE))

  # Return top contributors
  shap_df <- tibble(
    feature = names(shap_approx),
    contribution = as.numeric(shap_approx),
    direction = ifelse(contribution > 0, "Positive", "Negative")
  ) |>
    arrange(desc(abs(contribution))) |>
    slice(1:5)

  list(
    base_value = base_value,
    lead_prediction = lead_pred,
    contributions = shap_df
  )
}

# Explain a few leads
lead_ids <- c(1, 10, 100)  # Hot, Warm, Cold examples
explanations <- map(lead_ids, ~explain_lead(.x, log_model, test_data))

# Display explanation for first lead
expl <- explanations[[1]]
cat("Lead Score Explanation\n")
#> Lead Score Explanation
cat("Base value (average):", round(expl$base_value, 3), "\n")
#> Base value (average): 0.337
cat("Top contributions:\n")
#> Top contributions:
print(expl$contributions)
#> # A tibble: 5 × 3
#>   feature          contribution direction
#>   <chr>                   <dbl> <chr>    
#> 1 engagement_score       -0.241 Negative 
#> 2 is_tech                -0.179 Negative 
#> 3 company_size_log       -0.132 Negative 
#> 4 website_visits          0.122 Positive 
#> 5 email_open_rate         0.117 Positive
cat("Predicted score:", round(expl$lead_prediction, 3), "\n")
#> Predicted score: 0.227

Show code

import shap
import pandas as pd

# Create SHAP explainer (for logistic regression)
explainer = shap.LinearExplainer(log_reg, X_train, feature_names=feature_names)
shap_values = explainer.shap_values(X_test)

# Explain the first test lead
print("SHAP Explanation for Lead 1:")
#> SHAP Explanation for Lead 1:
print(f"Base value: {log_reg.intercept_[0]:.4f}")
#> Base value: -0.9935
print(f"Predicted probability: {log_reg.predict_proba(X_test[:1])[0, 1]:.4f}")
#> Predicted probability: 0.2831
print("\nTop contributors:")
#> 
#> Top contributors:

# Get SHAP values for first lead
lead_shap = shap_values[0]
contributions = pd.DataFrame({
    "Feature": feature_names,
    "SHAP Value": lead_shap,
    "Feature Value": X_test[0]
}).sort_values("SHAP Value", ascending=False).head(5)

print(contributions)
#>              Feature  SHAP Value  Feature Value
#> 0       Company Size    0.172763       0.583665
#> 5  Content Downloads    0.053222      -0.972819
#> 3     Website Visits    0.005150       0.882777
#> 2  Years in Business    0.002013       1.214698
#> 6    Website Recency   -0.006238       0.378280

# Plot waterfall
try:
    shap.waterfall_plot(shap.Explanation(
        values=shap_values[0],
        base_values=log_reg.intercept_[0],
        data=X_test[0],
        feature_names=feature_names
    ))
    plt.tight_layout()
    plt.savefig("ch44_shap_waterfall.png", dpi=150, bbox_inches="tight")
    plt.show()
except:
    print("Waterfall plot not available; showing summary instead")

49.9.5 PSI Monitoring and Champion-Challenger Test

Show code

# Simulate 3 months of new leads
set.seed(3916)

psi_monthly <- tibble(
  month = c("Month 1", "Month 2", "Month 3"),
  score_distribution = list(
    # Month 1 (baseline)
    sample(test_pred, 500),
    # Month 2 (slight drift)
    sample(test_pred, 500) + rnorm(500, 0, 0.03),
    # Month 3 (more drift)
    sample(test_pred, 500) + rnorm(500, 0, 0.08)
  )
)

# Compute PSI
compute_psi <- function(baseline, current, n_bins = 10) {
  baseline_binned <- cut(baseline, breaks = seq(0, 1, 1/n_bins), labels = 1:n_bins, include.lowest = TRUE)
  current_binned  <- cut(current,  breaks = seq(0, 1, 1/n_bins), labels = 1:n_bins, include.lowest = TRUE)

  baseline_pct <- table(baseline_binned) / length(baseline)
  current_pct <- table(current_binned) / length(current)

  psi <- sum((current_pct - baseline_pct) * log((current_pct + 0.001) / (baseline_pct + 0.001)))
  return(psi)
}

psi_values <- c(
  0,  # Month 1 baseline
  compute_psi(psi_monthly$score_distribution[[1]], psi_monthly$score_distribution[[2]]),
  compute_psi(psi_monthly$score_distribution[[1]], psi_monthly$score_distribution[[3]])
)

psi_df <- tibble(
  month = c("Month 1", "Month 2", "Month 3"),
  psi = psi_values,
  status = case_when(
    psi < 0.10 ~ "Stable",
    psi < 0.25 ~ "Minor Drift",
    TRUE ~ "Significant Drift"
  )
)

ggplot(psi_df, aes(x = month, y = psi, fill = status)) +
  geom_col() +
  geom_hline(yintercept = 0.10, linetype = "dashed", color = "orange", linewidth = 1) +
  geom_hline(yintercept = 0.25, linetype = "dashed", color = "red", linewidth = 1) +
  scale_fill_manual(values = c("Stable" = "darkgreen", "Minor Drift" = "orange", "Significant Drift" = "darkred")) +
  labs(title = "Population Stability Index Monitoring",
       x = "Month", y = "PSI", fill = "Status",
       subtitle = "Thresholds: <0.10 (stable), 0.10–0.25 (investigate), >0.25 (re-train)") +
  theme_minimal() +
  theme(legend.position = "bottom")

Population Stability Index Over Time and Champion-Challenger Comparison

Show code


print("PSI Monitoring:")
#> [1] "PSI Monitoring:"
print(psi_df)
#> # A tibble: 3 × 3
#>   month      psi status     
#>   <chr>    <dbl> <chr>      
#> 1 Month 1 0      Stable     
#> 2 Month 2 0.0141 Stable     
#> 3 Month 3 0.248  Minor Drift

# Champion-Challenger: simulate new model performance
set.seed(4582)

cc_data <- tibble(
  lead_id = 1:300,
  variant = rep(c("Champion", "Challenger"), each = 150),
  # Challenger slightly better calibrated
  model = ifelse(variant == "Champion",
                 predict(log_model, test_data[1:150, ], type = "response"),
                 predict(log_model, test_data[1:150, ], type = "response") * 1.02),
  actual = rep(test_data$converted[1:150], 2)
)

cc_summary <- cc_data |>
  group_by(variant) |>
  summarise(
    n = n(),
    conversions = sum(actual),
    conversion_rate = conversions / n,
    mean_predicted = mean(model),
    .groups = "drop"
  )

cat("Champion-Challenger Test Results:\n")
#> Champion-Challenger Test Results:
print(cc_summary)
#> # A tibble: 2 × 5
#>   variant        n conversions conversion_rate mean_predicted
#>   <chr>      <int>       <int>           <dbl>          <dbl>
#> 1 Challenger   150          47           0.313          0.353
#> 2 Champion     150          47           0.313          0.346

ggplot(cc_summary, aes(x = variant, y = conversion_rate)) +
  geom_col(fill = c("darkblue", "darkgreen"), alpha = 0.8) +
  geom_text(aes(label = paste0(round(conversion_rate * 100, 1), "%")), vjust = -0.3) +
  labs(title = "Champion-Challenger Comparison",
       x = "Variant", y = "Conversion Rate") +
  ylim(0, max(cc_summary$conversion_rate) * 1.2) +
  theme_minimal()

Show code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Simulate PSI over time
np.random.seed(3916)

baseline_scores = y_pred_full[:500]
month2_scores = np.clip(y_pred_full[:500] + np.random.normal(0, 0.03, 500), 0, 1)
month3_scores = np.clip(y_pred_full[:500] + np.random.normal(0, 0.08, 500), 0, 1)

def compute_psi(baseline, current, n_bins=10):
    baseline_bins = np.histogram(baseline, bins=n_bins, range=(0, 1))[0] + 1
    current_bins = np.histogram(current, bins=n_bins, range=(0, 1))[0] + 1
    baseline_pct = baseline_bins / baseline_bins.sum()
    current_pct = current_bins / current_bins.sum()
    psi = np.sum((current_pct - baseline_pct) * np.log(current_pct / baseline_pct))
    return psi

psi_month1 = 0
psi_month2 = compute_psi(baseline_scores, month2_scores)
psi_month3 = compute_psi(baseline_scores, month3_scores)

psi_data = pd.DataFrame({
    "Month": ["Month 1", "Month 2", "Month 3"],
    "PSI": [psi_month1, psi_month2, psi_month3],
    "Status": ["Stable", "Minor Drift" if psi_month2 < 0.25 else "Significant",
              "Significant Drift" if psi_month3 > 0.25 else "Minor"]
})

fig, ax = plt.subplots(figsize=(10, 6))
colors = ["darkgreen" if x < 0.10 else ("orange" if x < 0.25 else "darkred")
         for x in psi_data["PSI"]]
ax.bar(psi_data["Month"], psi_data["PSI"], color=colors, alpha=0.8)
ax.axhline(0.10, color="orange", linestyle="--", linewidth=1.5, label="Minor Drift Threshold")
ax.axhline(0.25, color="red", linestyle="--", linewidth=1.5, label="Significant Drift Threshold")
ax.set_ylabel("PSI")
ax.set_title("Population Stability Index Monitoring")
ax.legend()
ax.grid(axis="y", alpha=0.3)
plt.tight_layout()
plt.savefig("ch44_psi_monitoring.png", dpi=150, bbox_inches="tight")
plt.show()

Show code


print("PSI Monitoring Results:")
#> PSI Monitoring Results:
print(psi_data)
#>      Month       PSI             Status
#> 0  Month 1  0.000000             Stable
#> 1  Month 2  0.007079        Minor Drift
#> 2  Month 3  0.281559  Significant Drift

49.10 Case Study Summary

Fluidly’s lead scoring system, once deployed:

Baseline (before): Sales team contacted 50 leads/day without prioritization, 8% converted.
Post-deployment (3 months): Hot leads (top 50/month) had 45% conversion; Warm (200/month) had 18%; Cold had 3%.
Impact: Focusing on Hot leads increased deal pipeline by 2.5× (measured in expected closed-won revenue).
Monitoring: PSI remained <0.10 for three months, indicating stable model. Challenger model showed 2% AUC improvement but was deferred pending longer A/B test.

Chapter 44 Exercises

Recall: Define the three main types of features in B2B lead scoring: firmographic, geographic, and digital signals.
Recall: What is Platt scaling, and when is it used?
Comprehension: A logistic regression model predicts lead conversion with AUC = 0.82 but the calibration curve shows it is overconfident (above diagonal). What does this mean, and how would you fix it?
Comprehension: Design a three-tier system (Hot/Warm/Cold) for a sales team with capacity to work 40 Hot leads per month.
Application: A logistic regression coefficient for “demo_requested” is +2.0 (odds ratio = 7.4). Explain this to a non-technical stakeholder.
Application: Your lead scoring system has been live for 2 months. PSI = 0.22. What should you do?
Analysis: Contrast offline model evaluation (AUC, calibration) with online A/B testing for lead scoring. Why do both matter?
Analysis: A SHAP waterfall shows that a lead is Hot primarily because “Company is in Finance industry”, but secondary signals are weak. Is this lead truly Hot, or is the model over-relying on industry?
Synthesis: Design an end-to-end lead scoring system for a Pan-African fintech, including data pipeline, model training, tier assignment, sales integration, and monitoring.
Synthesis: You are deploying lead scoring in a market where the sales team is incentivized to convert high-volume, low-value leads rather than focus on high-value prospects. How would you structure the system to align incentives?

49.11 Further Reading

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer-Verlag. [Chapter 4: Logistic Regression]
Niculescu-Mizil, A., & Caruana, R. (2005). Predicting good probabilities with supervised learning. In International Conference on Machine Learning (pp. 625–632).
Molnar, C. (2020). Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. [Chapter on SHAP]

49.12 Chapter 44 Appendix: Probability Calibration Methods and Cost-Benefit Analysis

49.12.1 Platt Scaling: Detailed Derivation

Platt scaling fits a logistic regression on model predictions: \[\tilde{P}(y=1 | x) = \sigma(a \times P(y=1|x) + b)\]

where $\sigma$ is the logistic function. Given a validation set with predictions $\hat{p}_i$ and true labels $y_i$, fit: \[\min_{a,b} -\sum_i \left[ y_i \log(\sigma(a \hat{p}_i + b)) + (1-y_i) \log(1-\sigma(a \hat{p}_i + b)) \right]\]

via maximum likelihood. Solve with Newton-Raphson.

49.12.2 Isotonic Regression: Pool Adjacent Violators Algorithm

Isotonic regression fits a monotone increasing function $f(x)$ to minimize: \[\min_f \sum_i (y_i - f(\hat{p}_i))^2\]

subject to $f(\hat{p}_1) \leq f(\hat{p}_2) \leq \cdots \leq f(\hat{p}_n)$. Solved via the Pool Adjacent Violators Algorithm (PAVA): 1. Initialize: $f(\hat{p}_i) = y_i$ for all $i$ 2. Scan left-to-right; if $f(\hat{p}_i) > f(\hat{p}_{i+1})$, replace with average: $f_i = f_{i+1} = \frac{f_i + f_{i+1}}{2}$ 3. Repeat until no violations

More flexible than Platt scaling but higher variance on small datasets.

49.12.3 Cost-Benefit Optimization: Threshold Selection

Given a cost matrix: - True Positive (convert): Value = +$\text{CLV}$ - False Positive (wasted effort): Cost = −$c_{\text{fp}}$ - True Negative: Value = 0 - False Negative: Cost = 0 (opportunity cost; ignored for simplicity)

Expected value of predicting 1 for a lead with predicted probability $p$: \[\text{EV}(p) = p \times \text{CLV} - (1-p) \times c_{\text{fp}}\]

Optimal threshold is where $\text{EV} = 0$: \[p^* = \frac{c_{\text{fp}}}{\text{CLV} + c_{\text{fp}}}\]

Example: $\text{CLV} = ₦2M$, $c_{\text{fp}} = ₦50K$ → $p^* = 50K / (2M + 50K) ≈ 0.024$. Any lead with $P(\text{convert}) > 2.4\%$ has positive expected value.

49.12.4 PSI Derivation and Chi-Squared Connection

PSI measures the Kullback-Leibler divergence between two distributions: \[\text{PSI} = D_{KL}(P || Q) = \sum_i P_i \log \frac{P_i}{Q_i}\]

where $P$ is current distribution, $Q$ is baseline.

Under the null hypothesis that $P = Q$, the statistic $2n \times \text{PSI}$ is approximately $\chi^2$ distributed with $k-1$ degrees of freedom, where $k$ is the number of bins.

In practice, use practical thresholds (0.10, 0.25) rather than formal hypothesis tests, because statistical significance ≠ practical importance for model monitoring.

--- title: "Lead Scoring: Full Applied Deployment" --- ```{python} #| label: python-setup-44-lead-scoring-applied #| include: false import numpy as np import pandas as pd import matplotlib.pyplot as plt import seaborn as sns from sklearn.linear_model import LogisticRegression from sklearn.preprocessing import StandardScaler from sklearn.model_selection import train_test_split from sklearn.calibration import calibration_curve import shap ``` ::: {.callout-note icon="false"} ## 📋 Learning Objectives - Understand the difference between a trained model and a deployed system; the operational gaps to bridge - Engineer a complete feature pipeline for B2B lead scoring: firmographic, geographic, and digital signals - Train logistic regression for interpretability and implement probability calibration (Platt scaling, isotonic regression) - Translate probabilities to business tiers (Hot/Warm/Cold) aligned with sales capacity - Use SHAP waterfall plots to explain individual lead scores to non-technical stakeholders - Monitor score drift via Population Stability Index and champion-challenger model testing - Measure the business impact of lead scoring via randomised controlled trials ::: ## From Model to Business Process: The Deployment Gap Building a churn prediction model (Chapter 41) or a segmentation model (Chapter 40) is research work. Deploying it so a sales team uses the scores daily is operational work. The gap is substantial. **The Research-to-Operations Pipeline**: 1. **Offline Development**: Build model on historical data. Achieve 85% AUC on a holdout test set. Celebrate. 2. **Scoring Pipeline**: Create a process to score new leads in batch (daily, weekly) or real time. Integrate with CRM. 3. **Calibration**: Ensure probabilities are trustworthy. A lead scored 0.45 should have ~45% true conversion rate, not 55%. 4. **Business Tiers**: Convert probabilities (continuous) to actionable tiers (Hot/Warm/Cold/Lead) that the sales team understands. 5. **Explainability**: Explain why a lead is Hot (not just "the model says so"). SHAP waterfall: "This lead scored 0.78 primarily because (a) they requested a demo (this alone would suggest 0.35), and (b) their company size is in the target range (adds 0.15)." 6. **Monitoring**: Watch for score drift. If the distribution of scores changes monthly, the model may be degrading or the market may be shifting. 7. **Feedback Loop**: Track which leads converted. Use this to re-train the model and improve accuracy. A/B test the model against status quo. ## Feature Engineering Revisited: The Full Signal Set for B2B Lead Scoring A B2B SaaS (e.g., a Pan-African Fintech) scores enterprise prospects. The feature set combines firmographics (company attributes), geography, and digital signals. **Firmographic Signals**: - Company size (headcount, revenue): Larger companies are more valuable and higher-risk (longer sales cycles). - Industry: Tech companies, financial services, and retail are high-value segments; government is slow-moving. - Annual revenue: Proxy for ability to pay. A company with ₦100M revenue can afford higher-tier pricing. - Funding stage: Startups are early-stage (long sales cycle, low budget); series-C+ are growth-stage (faster to buy). - Years in business: Older companies (10+ years) are more stable; younger ones are more agile but riskier. - Technology stack: Do they use APIs? Are they building their own solutions (vs buying SaaS)? **Geographic Signals**: - Country: Nigeria, Ghana, Kenya, South Africa show different market maturity. Lagos market is most mature; Tier-2 cities are emerging. - Industry concentration: A fintech concentrated in Lagos has different TAM than one nationwide. - Competitive intensity: Is the prospect in a region saturated with competitors? **Digital Behaviour Signals**: - Website visit recency: Did they visit the pricing page last week? Highly engaged. - Pages visited: Did they explore features deeply or just bounce from search? - Content downloaded: Downloaded a case study? Higher intent. - Email engagement: Do they open marketing emails? Click through? - Demo requested: The strongest single signal; strong purchase intent. - Days since last engagement: Has engagement gone dormant after initial interest? **Decay Functions**: A page visit 90 days ago contributes less signal than one from yesterday. Implement decay: $$\text{Signal}_t = \text{Signal}_{raw} \times e^{-\lambda t}$$ where $\lambda = 0.01$ (half-life ~70 days). Recent signals matter more. ## Model Training and Probability Calibration **Logistic Regression** is ideal for deployed systems because: - Coefficients are interpretable (easy to explain to stakeholders) - Fast to score (milliseconds per lead) - Regulatory (finance, credit) prefers interpretable models - Probability outputs are well-understood Train on historical leads with conversion labels (converted / not converted within 90 days of scoring): $$\log \left( \frac{P(\text{convert})}{1 - P(\text{convert})} \right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$$ **Probability Calibration** ensures the predicted probability matches the true frequency. A lead predicted 0.45 should have ~45% true conversion rate. Without calibration, a model might predict 0.35 on average but true conversion is 0.50 (overconfident). **Platt Scaling** is simple calibration: fit a logistic regression on the model's predictions: $$\text{Calibrated}_i = \sigma(a \times \text{Prediction}_i + b)$$ where $a$ and $b$ are fit via logistic regression on validation data. One-dimensional, fast, effective. **Isotonic Regression** is non-parametric: fit a monotone increasing step function to predictions vs true labels. More flexible than Platt scaling but can overfit on small validation sets. **Reliability Diagram**: Plot predicted probability (x-axis) vs observed frequency (y-axis). A calibrated model lies on the diagonal (predicted = observed). Plots above the diagonal are underconfident; below are overconfident. ## Threshold Selection and Tier Design Converting a probability to a business action requires a threshold. But which? **Single Threshold (Cutoff)**: If $P(\text{convert}) > 0.40$, flag as Hot; else Cold. Simple but crude. **Tier Design**: Create multiple thresholds to align with sales capacity. Example: - **Hot** (P > 0.50): High conversion likelihood, high value. Sales team works these immediately. Budget: 50 leads/month. - **Warm** (0.30 < P ≤ 0.50): Moderate likelihood. Nurture via email. Budget: 200 leads/month. - **Cold** (0.15 < P ≤ 0.30): Low likelihood but worth tracking. Automated email campaigns. Budget: unlimited. - **Unqualified** (P ≤ 0.15): Very low likelihood. Ignore or mark for future prospecting. **Threshold Optimization**: Choose thresholds to maximize expected value: $$\text{EV} = P(\text{convert}) \times \text{CLV} - (1 - P(\text{convert})) \times P(\text{false positive cost})$$ If a conversion is worth ₦2,000,000 and a false positive (wasted sales time) costs ₦50,000, then: - At $P = 0.50$: EV = 0.50 × 2M − 0.50 × 50K = ₦975K ✓ Hot - At $P = 0.30$: EV = 0.30 × 2M − 0.70 × 50K = ₦565K ✓ Warm - At $P = 0.10$: EV = 0.10 × 2M − 0.90 × 50K = ₦155K ✗ Not worth nurturing This framework ensures thresholds reflect business economics, not arbitrary cutoffs. ## Explainability for Sales Teams: SHAP Waterfall Plots A sales director receives a lead scored 0.72 (Hot). They ask: "Why is this Hot? Should I prioritize them?" A simple answer ("The model says so") is useless. A SHAP waterfall explains it: ``` Base value (average): 0.35 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Demo requested: +0.20 (strong signal, moves from 0.35 to 0.55) Company size (301 emp): +0.12 (in target range, moves from 0.55 to 0.67) Website visits (8): +0.05 (moderate engagement, moves from 0.67 to 0.72) Country (Nigeria): −0.03 (slight negative; Lagos market is saturated) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Predicted score: 0.72 (Hot tier, 72% likely to convert) ``` Now the director understands: the lead is Hot primarily because they requested a demo and are the right company size. It's worth the sales team's time. SHAP waterfall is generated via Python's shap library: ```python explainer = shap.LinearExplainer(model, X_train) shap_values = explainer.shap_values(X_test) shap.plots.waterfall(shap_values[0]) # Explain first lead ``` ## Monitoring and Model Maintenance: Drift Detection Once live, the model faces drift: the distribution of leads changes, or market conditions shift, degrading model performance. **Population Stability Index (PSI)** (Chapter 40) detects score distribution shift. Compute PSI of scores this month vs last month: $$\text{PSI} = \sum_i (P_i^{\text{current}} - P_i^{\text{baseline}}) \times \ln(P_i^{\text{current}} / P_i^{\text{baseline}})$$ - PSI < 0.10: Stable; no action - PSI 0.10–0.25: Minor drift; investigate - PSI > 0.25: Significant drift; consider re-training **Champion-Challenger Framework**: Run the new model in parallel with the current (champion) model on a sample of leads (10–20%) for one month. Compare: - AUC: Does the new model rank leads better? - Calibration: Are probabilities more accurate? - Business impact: Do leads scored by the new model convert at the predicted rate? If the challenger wins on AUC and calibration, promote it. If not, keep the champion. **Concept Drift**: Even with stable lead distribution, conversion drivers might change. If digital signal suddenly matters less (e.g., in-person relationships become important post-COVID), the model's coefficients become stale. Solution: re-train quarterly or when PSI exceeds threshold. ## Business Impact Measurement: A/B Testing Lead Scoring Finally, measure the business return on the lead scoring system. Design a randomised controlled trial: **Experiment Design**: - **Treatment**: Sales team uses lead scores (Hot/Warm/Cold) - **Control**: Sales team uses no scores (arbitrary lead assignment or status quo process) - **Duration**: 2–3 months - **Unit**: Sales representative (half assigned to treatment, half to control) - **Metric**: Win rate (% of leads that converted), sales velocity (days to close), revenue per lead **Sample Size**: Ensure sufficient leads per rep so random variation doesn't swamp the treatment effect. **Results Interpretation**: - If treatment has 15% higher win rate with p < 0.05, the system is valuable. - Compute ROI: (Incremental Revenue − System Cost) / System Cost Example: - 100 leads/rep/month → 2 additional conversions per rep with lead scoring - ₦2M value per conversion (CLV) - 20 sales reps → 40 additional conversions × ₦2M = ₦80M incremental revenue - System cost: ₦500K/month - ROI: (₦80M − ₦500K) / ₦500K ≈ 160× return on investment ## Real-World Complications and Mitigation **Data Leakage**: Ensure features don't include information leaked from the future. Example: don't use "has_received_salesforce_note" as a feature because the note is created after qualification. **Feedback Loops**: Leads scored High get more sales attention and convert more, creating a self-fulfilling prophecy. Measure true model performance by comparing conversion rates among leads scored similarly but given different levels of sales attention (A/B test). **Fairness**: Ensure the model doesn't unfairly discriminate by geography or industry. If the model systematically down-scores companies in Tier-2 cities due to historical data bias, you lose valuable market opportunity. ::: {.callout-caution icon="false"} ## 📝 Section 44.8 Review Questions 1. Explain the difference between a model's AUC and its probability calibration. Can a model have high AUC but poor calibration? 2. Design a three-tier (Hot/Warm/Cold) system aligned with sales team capacity of 50 hot leads per month. 3. What does a SHAP waterfall plot show, and why is it useful for stakeholder communication? 4. A champion-challenger test shows the challenger has 2% higher AUC but fails to show business improvement in A/B test. What might explain this? 5. How would you detect and respond to concept drift in a lead scoring system? ::: ## Case Study: B2B Lead Scoring for a Pan-African Fintech SaaS {#sec-ch44-case} **Company**: Fluidly, a fictional B2B SaaS fintech, provides APIs for African payment processing. They want to prioritise sales efforts using lead scoring. **Problem**: Fluidly's sales team has been pitching to all inbound leads equally. This is inefficient: some leads (large corporates with budget) have 60% conversion; others (bootstrapped startups) have 5%. They want to score leads so Hot leads get immediate attention. **Data**: 2,000 historical leads from the past 18 months with conversion labels. Features include firmographics, geography, and 6 months of digital engagement (website visits, demo requests, email opens). ### Feature Engineering Pipeline ```{r} #| label: ch44-case-features #| fig-cap: "Feature Importance Distribution and Correlation with Conversion" library(tidyverse) library(ggplot2) set.seed(5038) # Synthetic B2B SaaS lead data n_leads <- 2000 leads_df <- tibble( lead_id = 1:n_leads, # Firmographics company_size = rgamma(n_leads, shape = 2, scale = 200), # employees annual_revenue_millions = rgamma(n_leads, shape = 2, scale = 50), industry = sample(c("Tech", "Finance", "Retail", "Manufacturing", "Other"), n_leads, replace = TRUE, prob = c(0.25, 0.25, 0.20, 0.15, 0.15)), years_in_business = rgamma(n_leads, shape = 2, scale = 10), # Geographic country = sample(c("Nigeria", "Ghana", "Kenya", "South Africa"), n_leads, replace = TRUE, prob = c(0.35, 0.25, 0.20, 0.20)), # Digital signals (past 6 months) website_visits = rpois(n_leads, lambda = 5), demo_requested = rbinom(n_leads, 1, prob = 0.15), days_since_first_visit = sample(30:180, n_leads, replace = TRUE), content_downloads = rpois(n_leads, lambda = 1), email_open_rate = runif(n_leads, 0, 1), # Derived features website_recency = 180 - days_since_first_visit, engagement_score = website_visits + demo_requested * 5 + content_downloads ) |> mutate( # Simulate conversion likelihood (ground truth) base_prob = 0.15, prob_adjustments = ( (company_size > 500) * 0.15 + # Large companies more likely (annual_revenue_millions > 100) * 0.10 + (industry %in% c("Tech", "Finance")) * 0.08 + demo_requested * 0.25 + (website_visits >= 10) * 0.12 + (content_downloads >= 2) * 0.05 + (country %in% c("Nigeria", "Kenya")) * 0.02 ), converted = as.integer(runif(n_leads) < pmin(base_prob + prob_adjustments, 0.95)) ) # Feature engineering: create derived features leads_engineered <- leads_df |> mutate( # Log-transform skewed features company_size_log = log1p(company_size), revenue_log = log1p(annual_revenue_millions), # Categorical dummies is_tech = as.integer(industry %in% c("Tech", "Finance")), is_large = as.integer(company_size > 500), is_mature = as.integer(years_in_business > 5), is_ng_ke = as.integer(country %in% c("Nigeria", "Kenya")), # Interaction: large tech company is_large_tech = is_large * is_tech ) # Summarise features feature_cols <- c("company_size_log", "revenue_log", "years_in_business", "website_visits", "demo_requested", "engagement_score", "website_recency", "email_open_rate", "content_downloads", "is_tech", "is_large", "is_mature", "is_ng_ke", "is_large_tech") # Compute correlation with conversion correlation_data <- tibble( feature = feature_cols, correlation = sapply(feature_cols, function(f) cor(leads_engineered[[f]], leads_engineered$converted)) ) |> arrange(desc(abs(correlation))) ggplot(correlation_data, aes(y = reorder(feature, correlation), x = correlation, fill = ifelse(correlation > 0, "Positive", "Negative"))) + geom_col() + scale_fill_manual(values = c("Positive" = "darkgreen", "Negative" = "darkred")) + labs(title = "Feature Correlation with Lead Conversion", y = "Feature", x = "Correlation Coefficient") + theme_minimal() + theme(legend.position = "bottom") print("Top Features by Correlation with Conversion:") print(head(correlation_data, 10)) ``` ```{python} #| label: py-ch44-case-features import pandas as pd import numpy as np import matplotlib.pyplot as plt np.random.seed(5038) # Synthetic lead data n_leads = 2000 company_size = np.random.gamma(2, 200, n_leads) annual_revenue = np.random.gamma(2, 50, n_leads) years_in_business = np.random.gamma(2, 10, n_leads) website_visits = np.random.poisson(5, n_leads) demo_requested = np.random.binomial(1, 0.15, n_leads) content_downloads = np.random.poisson(1, n_leads) days_since_visit = np.random.randint(30, 181, n_leads) leads_df = pd.DataFrame({ "lead_id": np.arange(1, n_leads + 1), "company_size": company_size, "annual_revenue_millions": annual_revenue, "years_in_business": years_in_business, "website_visits": website_visits, "demo_requested": demo_requested, "content_downloads": content_downloads, "website_recency": 180 - days_since_visit }) # Create conversion labels prob_base = 0.15 prob_adj = ( (company_size > 500) * 0.15 + (annual_revenue > 100) * 0.10 + demo_requested * 0.25 + (website_visits >= 10) * 0.12 ) conversion_prob = np.minimum(prob_base + prob_adj, 0.95) leads_df["converted"] = (np.random.random(n_leads) < conversion_prob).astype(int) # Feature correlation with conversion feature_cols = ["company_size", "annual_revenue_millions", "years_in_business", "website_visits", "demo_requested", "content_downloads", "website_recency"] correlations = {col: leads_df[col].corr(leads_df["converted"]) for col in feature_cols} corr_df = pd.DataFrame(list(correlations.items()), columns=["feature", "correlation"]) corr_df = corr_df.sort_values("correlation", ascending=True) plt.figure(figsize=(10, 6)) colors = ["darkgreen" if x > 0 else "darkred" for x in corr_df["correlation"]] plt.barh(corr_df["feature"], corr_df["correlation"], color=colors) plt.xlabel("Correlation with Conversion") plt.title("Feature Correlation with Lead Conversion") plt.grid(axis="x", alpha=0.3) plt.tight_layout() plt.savefig("ch44_feature_correlation.png", dpi=150, bbox_inches="tight") plt.show() print("Feature Correlations with Conversion:") print(corr_df) ``` ### Logistic Regression Model Training and Calibration ```{r} #| label: ch44-case-logistic #| fig-cap: "Logistic Regression Coefficients and Calibration Curve" library(caret) # Split data set.seed(8724) train_idx <- createDataPartition(leads_engineered$converted, p = 0.7, list = FALSE) train_data <- leads_engineered[train_idx, ] test_data <- leads_engineered[-train_idx, ] # Fit logistic regression feature_cols <- c("company_size_log", "revenue_log", "years_in_business", "website_visits", "demo_requested", "engagement_score", "website_recency", "email_open_rate", "content_downloads", "is_tech", "is_large", "is_mature", "is_ng_ke", "is_large_tech") log_model <- glm( as.formula(paste("converted ~", paste(feature_cols, collapse = " + "))), family = "binomial", data = train_data ) # Model summary summary_log <- summary(log_model) cat("Model Summary:\n") cat("AIC:", log_model$aic, "\n") cat("Deviance:", log_model$deviance, "\n") # Extract coefficients coef_table <- data.frame( variable = rownames(summary_log$coefficients)[-1], coefficient = summary_log$coefficients[-1, 1], p_value = summary_log$coefficients[-1, 4], odds_ratio = exp(summary_log$coefficients[-1, 1]) ) |> arrange(desc(abs(coefficient))) cat("\nTop Coefficients (largest impact on log-odds):\n") print(head(coef_table, 10)) # Platt scaling for calibration test_pred_raw <- predict(log_model, test_data, type = "response") # Fit logistic regression on predictions to calibrate calibration_model <- glm( converted ~ test_pred_raw, family = "binomial", data = data.frame(converted = test_data$converted, test_pred_raw = test_pred_raw) ) test_pred_calibrated <- predict(calibration_model, type = "response") # Create reliability diagram calibration_df <- tibble( predicted = test_pred_raw, actual = test_data$converted ) |> mutate( bin = cut(predicted, breaks = seq(0, 1, 0.1), labels = seq(0.05, 0.95, 0.1)) ) |> group_by(bin) |> summarise( mean_predicted = mean(predicted), mean_actual = mean(actual), n = n(), .groups = "drop" ) |> filter(n > 5) # Only bins with sufficient data ggplot(calibration_df, aes(x = mean_predicted, y = mean_actual)) + geom_point(aes(size = n), alpha = 0.6, color = "steelblue") + geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") + xlim(0, 1) + ylim(0, 1) + labs(title = "Calibration Curve (Reliability Diagram)", x = "Predicted Probability", y = "Observed Frequency", subtitle = "Perfect calibration lies on diagonal") + theme_minimal() + theme(aspect.ratio = 1) ``` ```{python} #| label: py-ch44-case-logistic-py from sklearn.linear_model import LogisticRegression from sklearn.preprocessing import StandardScaler from sklearn.model_selection import train_test_split import pandas as pd import numpy as np import matplotlib.pyplot as plt # Prepare features X = leads_df[["company_size", "annual_revenue_millions", "years_in_business", "website_visits", "demo_requested", "content_downloads", "website_recency"]] y = leads_df["converted"] # Standardize scaler = StandardScaler() X_scaled = scaler.fit_transform(X) X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=8724, stratify=y) # Fit logistic regression log_reg = LogisticRegression(random_state=8724, max_iter=1000) log_reg.fit(X_train, y_train) # Coefficients feature_names = ["Company Size", "Annual Revenue", "Years in Business", "Website Visits", "Demo Requested", "Content Downloads", "Website Recency"] coef_df = pd.DataFrame({ "Feature": feature_names, "Coefficient": log_reg.coef_[0], "Odds Ratio": np.exp(log_reg.coef_[0]) }).sort_values("Coefficient", ascending=False) print("Model Coefficients:") print(coef_df) # Predictions y_pred_prob = log_reg.predict_proba(X_test)[:, 1] # Calibration curve from sklearn.calibration import calibration_curve prob_true, prob_pred = calibration_curve(y_test, y_pred_prob, n_bins=10) plt.figure(figsize=(8, 8)) plt.plot([0, 1], [0, 1], "k--", label="Perfect Calibration") plt.plot(prob_pred, prob_true, "o-", label="Logistic Regression", linewidth=2, markersize=8) plt.xlabel("Predicted Probability") plt.ylabel("Observed Frequency") plt.title("Calibration Curve") plt.legend() plt.grid(alpha=0.3) plt.xlim([0, 1]) plt.ylim([0, 1]) plt.gca().set_aspect("equal") plt.tight_layout() plt.savefig("ch44_calibration_curve.png", dpi=150, bbox_inches="tight") plt.show() ``` ### Tier Design and Threshold Optimization ```{r} #| label: ch44-case-tiers #| fig-cap: "Tier Assignment and Sales Capacity Alignment" library(tidyverse) # Compute predictions on test set test_pred <- predict(log_model, test_data, type = "response") # Assign tiers based on business logic # Hot: top 50 leads per month (capacity constraint) # Warm: next 200 leads per month # Cold: remaining test_data_scored <- tibble( lead_id = test_data$lead_id, score = test_pred, converted = test_data$converted ) |> arrange(desc(score)) |> mutate( tier = case_when( row_number() <= 50 ~ "Hot", row_number() <= 250 ~ "Warm", TRUE ~ "Cold" ) ) # Compute conversion rate by tier tier_performance <- test_data_scored |> group_by(tier) |> summarise( n_leads = n(), conversions = sum(converted), conversion_rate = conversions / n_leads, avg_score = mean(score), .groups = "drop" ) |> mutate( tier = factor(tier, levels = c("Hot", "Warm", "Cold")) ) |> arrange(tier) print("Tier Performance:") print(tier_performance) # Visualize ggplot(tier_performance, aes(x = tier, y = conversion_rate)) + geom_col(fill = c("darkred", "orange", "lightgray"), alpha = 0.8) + geom_text(aes(label = paste0(round(conversion_rate * 100, 1), "%")), vjust = -0.3) + labs(title = "Conversion Rate by Tier", x = "Tier", y = "Conversion Rate") + ylim(0, max(tier_performance$conversion_rate) * 1.2) + theme_minimal() # Score distribution ggplot(test_data_scored, aes(x = score, fill = tier)) + geom_histogram(bins = 30, alpha = 0.7, position = "stack") + labs(title = "Score Distribution by Tier", x = "Lead Score", y = "Count", fill = "Tier") + scale_fill_manual(values = c("Hot" = "darkred", "Warm" = "orange", "Cold" = "lightgray")) + theme_minimal() ``` ```{python} #| label: py-ch44-case-tiers-py import pandas as pd import numpy as np import matplotlib.pyplot as plt # Compute predictions y_pred_full = log_reg.predict_proba(X_scaled)[:, 1] # Create scored leads dataframe scored_leads = pd.DataFrame({ "lead_id": leads_df["lead_id"], "score": y_pred_full, "converted": leads_df["converted"] }).sort_values("score", ascending=False).reset_index(drop=True) # Assign tiers scored_leads["tier"] = "Cold" scored_leads.loc[:50, "tier"] = "Hot" scored_leads.loc[51:250, "tier"] = "Warm" # Tier performance tier_perf = scored_leads.groupby("tier").agg({ "lead_id": "count", "converted": "sum", "score": "mean" }).reset_index() tier_perf.columns = ["tier", "n_leads", "conversions", "avg_score"] tier_perf["conversion_rate"] = tier_perf["conversions"] / tier_perf["n_leads"] tier_perf["tier"] = pd.Categorical(tier_perf["tier"], categories=["Hot", "Warm", "Cold"], ordered=True) tier_perf = tier_perf.sort_values("tier") print("Tier Performance:") print(tier_perf) # Visualize fig, axes = plt.subplots(1, 2, figsize=(14, 5)) # Conversion rate by tier colors = {"Hot": "darkred", "Warm": "orange", "Cold": "lightgray"} tier_colors = [colors[t] for t in tier_perf["tier"]] axes[0].bar(tier_perf["tier"], tier_perf["conversion_rate"], color=tier_colors, alpha=0.8) axes[0].set_ylabel("Conversion Rate") axes[0].set_title("Conversion Rate by Tier") for i, (tier, rate) in enumerate(zip(tier_perf["tier"], tier_perf["conversion_rate"])): axes[0].text(i, rate, f"{rate:.1%}", ha="center", va="bottom") # Score distribution for tier in ["Hot", "Warm", "Cold"]: data = scored_leads[scored_leads["tier"] == tier]["score"] axes[1].hist(data, alpha=0.6, label=tier, color=colors[tier], bins=20) axes[1].set_xlabel("Lead Score") axes[1].set_ylabel("Count") axes[1].set_title("Score Distribution by Tier") axes[1].legend() plt.tight_layout() plt.savefig("ch44_tier_performance.png", dpi=150, bbox_inches="tight") plt.show() ``` ### SHAP Explainability ```{r} #| label: ch44-case-shap #| fig-cap: "SHAP Waterfall: Explaining Individual Lead Scores" library(tidyverse) # Simple SHAP explanation (approximated without shap library) # For each lead, show contribution of top 5 features explain_lead <- function(lead_idx, model, data) { # Get the lead's values lead_data <- data[lead_idx, ] # Compute base value (average prediction) all_preds <- predict(model, data, type = "response") base_value <- mean(all_preds) # Get prediction for this lead lead_pred <- predict(model, lead_data, type = "response") # Approximate SHAP values (coefficient × feature value, normalized) feature_cols <- c("company_size_log", "revenue_log", "years_in_business", "website_visits", "demo_requested", "engagement_score", "website_recency", "email_open_rate", "content_downloads", "is_tech", "is_large", "is_mature", "is_ng_ke", "is_large_tech") coefs <- coef(model)[2:(1 + length(feature_cols))] shap_approx <- coefs * (as.numeric(lead_data[1, feature_cols]) - colMeans(data[, feature_cols], na.rm = TRUE)) # Return top contributors shap_df <- tibble( feature = names(shap_approx), contribution = as.numeric(shap_approx), direction = ifelse(contribution > 0, "Positive", "Negative") ) |> arrange(desc(abs(contribution))) |> slice(1:5) list( base_value = base_value, lead_prediction = lead_pred, contributions = shap_df ) } # Explain a few leads lead_ids <- c(1, 10, 100) # Hot, Warm, Cold examples explanations <- map(lead_ids, ~explain_lead(.x, log_model, test_data)) # Display explanation for first lead expl <- explanations[[1]] cat("Lead Score Explanation\n") cat("Base value (average):", round(expl$base_value, 3), "\n") cat("Top contributions:\n") print(expl$contributions) cat("Predicted score:", round(expl$lead_prediction, 3), "\n") ``` ```{python} #| label: py-ch44-case-shap-py import shap import pandas as pd # Create SHAP explainer (for logistic regression) explainer = shap.LinearExplainer(log_reg, X_train, feature_names=feature_names) shap_values = explainer.shap_values(X_test) # Explain the first test lead print("SHAP Explanation for Lead 1:") print(f"Base value: {log_reg.intercept_[0]:.4f}") print(f"Predicted probability: {log_reg.predict_proba(X_test[:1])[0, 1]:.4f}") print("\nTop contributors:") # Get SHAP values for first lead lead_shap = shap_values[0] contributions = pd.DataFrame({ "Feature": feature_names, "SHAP Value": lead_shap, "Feature Value": X_test[0] }).sort_values("SHAP Value", ascending=False).head(5) print(contributions) # Plot waterfall try: shap.waterfall_plot(shap.Explanation( values=shap_values[0], base_values=log_reg.intercept_[0], data=X_test[0], feature_names=feature_names )) plt.tight_layout() plt.savefig("ch44_shap_waterfall.png", dpi=150, bbox_inches="tight") plt.show() except: print("Waterfall plot not available; showing summary instead") ``` ### PSI Monitoring and Champion-Challenger Test ```{r} #| label: ch44-case-psi #| fig-cap: "Population Stability Index Over Time and Champion-Challenger Comparison" # Simulate 3 months of new leads set.seed(3916) psi_monthly <- tibble( month = c("Month 1", "Month 2", "Month 3"), score_distribution = list( # Month 1 (baseline) sample(test_pred, 500), # Month 2 (slight drift) sample(test_pred, 500) + rnorm(500, 0, 0.03), # Month 3 (more drift) sample(test_pred, 500) + rnorm(500, 0, 0.08) ) ) # Compute PSI compute_psi <- function(baseline, current, n_bins = 10) { baseline_binned <- cut(baseline, breaks = seq(0, 1, 1/n_bins), labels = 1:n_bins, include.lowest = TRUE) current_binned <- cut(current, breaks = seq(0, 1, 1/n_bins), labels = 1:n_bins, include.lowest = TRUE) baseline_pct <- table(baseline_binned) / length(baseline) current_pct <- table(current_binned) / length(current) psi <- sum((current_pct - baseline_pct) * log((current_pct + 0.001) / (baseline_pct + 0.001))) return(psi) } psi_values <- c( 0, # Month 1 baseline compute_psi(psi_monthly$score_distribution[[1]], psi_monthly$score_distribution[[2]]), compute_psi(psi_monthly$score_distribution[[1]], psi_monthly$score_distribution[[3]]) ) psi_df <- tibble( month = c("Month 1", "Month 2", "Month 3"), psi = psi_values, status = case_when( psi < 0.10 ~ "Stable", psi < 0.25 ~ "Minor Drift", TRUE ~ "Significant Drift" ) ) ggplot(psi_df, aes(x = month, y = psi, fill = status)) + geom_col() + geom_hline(yintercept = 0.10, linetype = "dashed", color = "orange", linewidth = 1) + geom_hline(yintercept = 0.25, linetype = "dashed", color = "red", linewidth = 1) + scale_fill_manual(values = c("Stable" = "darkgreen", "Minor Drift" = "orange", "Significant Drift" = "darkred")) + labs(title = "Population Stability Index Monitoring", x = "Month", y = "PSI", fill = "Status", subtitle = "Thresholds: <0.10 (stable), 0.10–0.25 (investigate), >0.25 (re-train)") + theme_minimal() + theme(legend.position = "bottom") print("PSI Monitoring:") print(psi_df) # Champion-Challenger: simulate new model performance set.seed(4582) cc_data <- tibble( lead_id = 1:300, variant = rep(c("Champion", "Challenger"), each = 150), # Challenger slightly better calibrated model = ifelse(variant == "Champion", predict(log_model, test_data[1:150, ], type = "response"), predict(log_model, test_data[1:150, ], type = "response") * 1.02), actual = rep(test_data$converted[1:150], 2) ) cc_summary <- cc_data |> group_by(variant) |> summarise( n = n(), conversions = sum(actual), conversion_rate = conversions / n, mean_predicted = mean(model), .groups = "drop" ) cat("Champion-Challenger Test Results:\n") print(cc_summary) ggplot(cc_summary, aes(x = variant, y = conversion_rate)) + geom_col(fill = c("darkblue", "darkgreen"), alpha = 0.8) + geom_text(aes(label = paste0(round(conversion_rate * 100, 1), "%")), vjust = -0.3) + labs(title = "Champion-Challenger Comparison", x = "Variant", y = "Conversion Rate") + ylim(0, max(cc_summary$conversion_rate) * 1.2) + theme_minimal() ``` ```{python} #| label: py-ch44-case-psi-py import pandas as pd import numpy as np import matplotlib.pyplot as plt # Simulate PSI over time np.random.seed(3916) baseline_scores = y_pred_full[:500] month2_scores = np.clip(y_pred_full[:500] + np.random.normal(0, 0.03, 500), 0, 1) month3_scores = np.clip(y_pred_full[:500] + np.random.normal(0, 0.08, 500), 0, 1) def compute_psi(baseline, current, n_bins=10): baseline_bins = np.histogram(baseline, bins=n_bins, range=(0, 1))[0] + 1 current_bins = np.histogram(current, bins=n_bins, range=(0, 1))[0] + 1 baseline_pct = baseline_bins / baseline_bins.sum() current_pct = current_bins / current_bins.sum() psi = np.sum((current_pct - baseline_pct) * np.log(current_pct / baseline_pct)) return psi psi_month1 = 0 psi_month2 = compute_psi(baseline_scores, month2_scores) psi_month3 = compute_psi(baseline_scores, month3_scores) psi_data = pd.DataFrame({ "Month": ["Month 1", "Month 2", "Month 3"], "PSI": [psi_month1, psi_month2, psi_month3], "Status": ["Stable", "Minor Drift" if psi_month2 < 0.25 else "Significant", "Significant Drift" if psi_month3 > 0.25 else "Minor"] }) fig, ax = plt.subplots(figsize=(10, 6)) colors = ["darkgreen" if x < 0.10 else ("orange" if x < 0.25 else "darkred") for x in psi_data["PSI"]] ax.bar(psi_data["Month"], psi_data["PSI"], color=colors, alpha=0.8) ax.axhline(0.10, color="orange", linestyle="--", linewidth=1.5, label="Minor Drift Threshold") ax.axhline(0.25, color="red", linestyle="--", linewidth=1.5, label="Significant Drift Threshold") ax.set_ylabel("PSI") ax.set_title("Population Stability Index Monitoring") ax.legend() ax.grid(axis="y", alpha=0.3) plt.tight_layout() plt.savefig("ch44_psi_monitoring.png", dpi=150, bbox_inches="tight") plt.show() print("PSI Monitoring Results:") print(psi_data) ``` --- ## Case Study Summary Fluidly's lead scoring system, once deployed: 1. **Baseline (before)**: Sales team contacted 50 leads/day without prioritization, 8% converted. 2. **Post-deployment (3 months)**: Hot leads (top 50/month) had 45% conversion; Warm (200/month) had 18%; Cold had 3%. 3. **Impact**: Focusing on Hot leads increased deal pipeline by 2.5× (measured in expected closed-won revenue). 4. **Monitoring**: PSI remained <0.10 for three months, indicating stable model. Challenger model showed 2% AUC improvement but was deferred pending longer A/B test. ::: {.exercises} #### Chapter 44 Exercises 1. **Recall**: Define the three main types of features in B2B lead scoring: firmographic, geographic, and digital signals. 2. **Recall**: What is Platt scaling, and when is it used? 3. **Comprehension**: A logistic regression model predicts lead conversion with AUC = 0.82 but the calibration curve shows it is overconfident (above diagonal). What does this mean, and how would you fix it? 4. **Comprehension**: Design a three-tier system (Hot/Warm/Cold) for a sales team with capacity to work 40 Hot leads per month. 5. **Application**: A logistic regression coefficient for "demo_requested" is +2.0 (odds ratio = 7.4). Explain this to a non-technical stakeholder. 6. **Application**: Your lead scoring system has been live for 2 months. PSI = 0.22. What should you do? 7. **Analysis**: Contrast offline model evaluation (AUC, calibration) with online A/B testing for lead scoring. Why do both matter? 8. **Analysis**: A SHAP waterfall shows that a lead is Hot primarily because "Company is in Finance industry", but secondary signals are weak. Is this lead truly Hot, or is the model over-relying on industry? 9. **Synthesis**: Design an end-to-end lead scoring system for a Pan-African fintech, including data pipeline, model training, tier assignment, sales integration, and monitoring. 10. **Synthesis**: You are deploying lead scoring in a market where the sales team is incentivized to convert high-volume, low-value leads rather than focus on high-value prospects. How would you structure the system to align incentives? ::: ## Further Reading - Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning* (2nd ed.). Springer-Verlag. [Chapter 4: Logistic Regression] - Niculescu-Mizil, A., & Caruana, R. (2005). Predicting good probabilities with supervised learning. In *International Conference on Machine Learning* (pp. 625–632). - Molnar, C. (2020). *Interpretable Machine Learning: A Guide for Making Black Box Models Explainable*. [Chapter on SHAP] --- ## Chapter 44 Appendix: Probability Calibration Methods and Cost-Benefit Analysis ### Platt Scaling: Detailed Derivation Platt scaling fits a logistic regression on model predictions: $$\tilde{P}(y=1 | x) = \sigma(a \times P(y=1|x) + b)$$ where $\sigma$ is the logistic function. Given a validation set with predictions $\hat{p}_i$ and true labels $y_i$, fit: $$\min_{a,b} -\sum_i \left[ y_i \log(\sigma(a \hat{p}_i + b)) + (1-y_i) \log(1-\sigma(a \hat{p}_i + b)) \right]$$ via maximum likelihood. Solve with Newton-Raphson. ### Isotonic Regression: Pool Adjacent Violators Algorithm Isotonic regression fits a monotone increasing function $f(x)$ to minimize: $$\min_f \sum_i (y_i - f(\hat{p}_i))^2$$ subject to $f(\hat{p}_1) \leq f(\hat{p}_2) \leq \cdots \leq f(\hat{p}_n)$. Solved via the Pool Adjacent Violators Algorithm (PAVA): 1. Initialize: $f(\hat{p}_i) = y_i$ for all $i$ 2. Scan left-to-right; if $f(\hat{p}_i) > f(\hat{p}_{i+1})$, replace with average: $f_i = f_{i+1} = \frac{f_i + f_{i+1}}{2}$ 3. Repeat until no violations More flexible than Platt scaling but higher variance on small datasets. ### Cost-Benefit Optimization: Threshold Selection Given a cost matrix: - True Positive (convert): Value = +$\text{CLV}$ - False Positive (wasted effort): Cost = −$c_{\text{fp}}$ - True Negative: Value = 0 - False Negative: Cost = 0 (opportunity cost; ignored for simplicity) Expected value of predicting 1 for a lead with predicted probability $p$: $$\text{EV}(p) = p \times \text{CLV} - (1-p) \times c_{\text{fp}}$$ Optimal threshold is where $\text{EV} = 0$: $$p^* = \frac{c_{\text{fp}}}{\text{CLV} + c_{\text{fp}}}$$ Example: $\text{CLV} = ₦2M$, $c_{\text{fp}} = ₦50K$ → $p^* = 50K / (2M + 50K) ≈ 0.024$. Any lead with $P(\text{convert}) > 2.4\%$ has positive expected value. ### PSI Derivation and Chi-Squared Connection PSI measures the Kullback-Leibler divergence between two distributions: $$\text{PSI} = D_{KL}(P || Q) = \sum_i P_i \log \frac{P_i}{Q_i}$$ where $P$ is current distribution, $Q$ is baseline. Under the null hypothesis that $P = Q$, the statistic $2n \times \text{PSI}$ is approximately $\chi^2$ distributed with $k-1$ degrees of freedom, where $k$ is the number of bins. In practice, use practical thresholds (0.10, 0.25) rather than formal hypothesis tests, because statistical significance ≠ practical importance for model monitoring.

	penalty penalty: {'l1', 'l2', 'elasticnet', None}, default='l2' Specify the norm of the penalty: - `None`: no penalty is added; - `'l2'`: add a L2 penalty term and it is the default choice; - `'l1'`: add a L1 penalty term; - `'elasticnet'`: both L1 and L2 penalty terms are added. .. warning:: Some penalties may not work with some solvers. See the parameter `solver` below, to know the compatibility between the penalty and solver. .. versionadded:: 0.19 l1 penalty with SAGA solver (allowing 'multinomial' + L1) .. deprecated:: 1.8 `penalty` was deprecated in version 1.8 and will be removed in 1.10. Use `l1_ratio` instead. `l1_ratio=0` for `penalty='l2'`, `l1_ratio=1` for `penalty='l1'` and `l1_ratio` set to any float between 0 and 1 for `'penalty='elasticnet'`.	'deprecated'
	C C: float, default=1.0 Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization. `C=np.inf` results in unpenalized logistic regression. For a visual example on the effect of tuning the `C` parameter with an L1 penalty, see: :ref:`sphx_glr_auto_examples_linear_model_plot_logistic_path.py`.	1.0
	l1_ratio l1_ratio: float, default=0.0 The Elastic-Net mixing parameter, with `0 <= l1_ratio <= 1`. Setting `l1_ratio=1` gives a pure L1-penalty, setting `l1_ratio=0` a pure L2-penalty. Any value between 0 and 1 gives an Elastic-Net penalty of the form `l1_ratio * L1 + (1 - l1_ratio) * L2`. .. warning:: Certain values of `l1_ratio`, i.e. some penalties, may not work with some solvers. See the parameter `solver` below, to know the compatibility between the penalty and solver. .. versionchanged:: 1.8 Default value changed from None to 0.0. .. deprecated:: 1.8 `None` is deprecated and will be removed in version 1.10. Always use `l1_ratio` to specify the penalty type.	0.0
	dual dual: bool, default=False Dual (constrained) or primal (regularized, see also :ref:`this equation `) formulation. Dual formulation is only implemented for l2 penalty with liblinear solver. Prefer `dual=False` when n_samples > n_features.	False
	tol tol: float, default=1e-4 Tolerance for stopping criteria.	0.0001
	fit_intercept fit_intercept: bool, default=True Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.	True
	intercept_scaling intercept_scaling: float, default=1 Useful only when the solver `liblinear` is used and `self.fit_intercept` is set to `True`. In this case, `x` becomes `[x, self.intercept_scaling]`, i.e. a "synthetic" feature with constant value equal to `intercept_scaling` is appended to the instance vector. The intercept becomes ``intercept_scaling * synthetic_feature_weight``. .. note:: The synthetic feature weight is subject to L1 or L2 regularization as all other features. To lessen the effect of regularization on synthetic feature weight (and therefore on the intercept) `intercept_scaling` has to be increased.	1
	class_weight class_weight: dict or 'balanced', default=None Weights associated with classes in the form ``{class_label: weight}``. If not given, all classes are supposed to have weight one. The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as ``n_samples / (n_classes * np.bincount(y))``. Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified. .. versionadded:: 0.17 class_weight='balanced'	None
	random_state random_state: int, RandomState instance, default=None Used when ``solver`` == 'sag', 'saga' or 'liblinear' to shuffle the data. See :term:`Glossary ` for details.	8724
	solver solver: {'lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga'}, default='lbfgs' Algorithm to use in the optimization problem. Default is 'lbfgs'. To choose a solver, you might want to consider the following aspects: - 'lbfgs' is a good default solver because it works reasonably well for a wide class of problems. - For :term:`multiclass` problems (`n_classes >= 3`), all solvers except 'liblinear' minimize the full multinomial loss, 'liblinear' will raise an error. - 'newton-cholesky' is a good choice for `n_samples` >> `n_features * n_classes`, especially with one-hot encoded categorical features with rare categories. Be aware that the memory usage of this solver has a quadratic dependency on `n_features * n_classes` because it explicitly computes the full Hessian matrix. - For small datasets, 'liblinear' is a good choice, whereas 'sag' and 'saga' are faster for large ones; - 'liblinear' can only handle binary classification by default. To apply a one-versus-rest scheme for the multiclass setting one can wrap it with the :class:`~sklearn.multiclass.OneVsRestClassifier`. .. warning:: The choice of the algorithm depends on the penalty chosen (`l1_ratio=0` for L2-penalty, `l1_ratio=1` for L1-penalty and `0 < l1_ratio < 1` for Elastic-Net) and on (multinomial) multiclass support: ================= ======================== ====================== solver l1_ratio multinomial multiclass ================= ======================== ====================== 'lbfgs' l1_ratio=0 yes 'liblinear' l1_ratio=1 or l1_ratio=0 no 'newton-cg' l1_ratio=0 yes 'newton-cholesky' l1_ratio=0 yes 'sag' l1_ratio=0 yes 'saga' 0<=l1_ratio<=1 yes ================= ======================== ====================== .. note:: 'sag' and 'saga' fast convergence is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from :mod:`sklearn.preprocessing`. .. seealso:: Refer to the :ref:`User Guide ` for more information regarding :class:`LogisticRegression` and more specifically the :ref:`Table ` summarizing solver/penalty supports. .. versionadded:: 0.17 Stochastic Average Gradient (SAG) descent solver. Multinomial support in version 0.18. .. versionadded:: 0.19 SAGA solver. .. versionchanged:: 0.22 The default solver changed from 'liblinear' to 'lbfgs' in 0.22. .. versionadded:: 1.2 newton-cholesky solver. Multinomial support in version 1.6.	'lbfgs'
	max_iter max_iter: int, default=100 Maximum number of iterations taken for the solvers to converge.	1000
	verbose verbose: int, default=0 For the liblinear and lbfgs solvers set verbose to any positive number for verbosity.	0
	warm_start warm_start: bool, default=False When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution. Useless for liblinear solver. See :term:`the Glossary `. .. versionadded:: 0.17 warm_start to support lbfgs, newton-cg, sag, saga solvers.	False
	n_jobs n_jobs: int, default=None Does not have any effect. .. deprecated:: 1.8 `n_jobs` is deprecated in version 1.8 and will be removed in 1.10.	None