17 Introduction to Classification

📋 Learning Objectives

By the end of this chapter, you will be able to:

Understand the classification problem and its business applications
Distinguish between binary, multiclass, and multi-label classification
Construct and interpret confusion matrices
Compute and interpret precision, recall, F1-score, and other metrics
Plot and interpret ROC curves and calculate AUC
Handle imbalanced datasets using resampling and weighting
Choose thresholds to optimise for business costs, not just accuracy
Compare classifiers using appropriate metrics for imbalanced data

17.1 What Is Classification?

Classification is a supervised learning task in which the goal is to predict a categorical label (class) for each observation.

17.1.1 Binary vs Multiclass vs Multi-label

Type	# Classes	Example	Output
Binary	2	Spam (yes/no), Default (yes/no), Churn (yes/no)	Single 0/1 prediction
Multiclass	3+ mutually exclusive	Email category (spam/promotions/primary/social), Product rating (1–5 stars)	Single class prediction
Multi-label	Multiple non-exclusive	Document topics (can be about tech AND politics AND finance)	Multiple 0/1 predictions
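To make the three output shapes in the table concrete, here is a minimal sketch in plain Python. The label values and the `topics` vocabulary are made up for illustration; they are not taken from a real data set.

```python
# Binary: one 0/1 label per observation.
binary_labels = [0, 1, 0]  # e.g. churn: no, yes, no

# Multiclass: exactly one class per observation, chosen from 3+ options.
multiclass_labels = ["spam", "primary", "social"]

# Multi-label: an observation may carry several labels at once,
# so the target is one 0/1 indicator per possible label.
topics = ["tech", "politics", "finance"]  # hypothetical label vocabulary
document_topics = [{"tech", "finance"}, {"politics"}, set()]

multilabel_targets = [
    [1 if topic in doc else 0 for topic in topics]
    for doc in document_topics
]

for doc, row in zip(document_topics, multilabel_targets):
    print(sorted(doc), "->", row)
```

Note that a multi-label row may contain several 1s or none at all, whereas a multiclass observation always has exactly one class.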

17.1.2 Classification vs Regression: When to Use Which?

Aspect	Regression	Classification
Outcome	Continuous (price, demand, temperature)	Categorical (yes/no, category, class)
Model output	Real number	Probability or class label
Metrics	RMSE, MAE, R²	Accuracy, Precision, Recall, AUC
Example	Predict house price	Predict house will sell (yes/no)

Key insight: If the outcome is inherently categorical (a loan either defaults or doesn’t), use classification. Don’t force regression on categorical targets.

17.1.3 Common Business Problems

Credit default prediction: Will this loan applicant default within 12 months? (Binary)
Customer churn: Will this customer cancel their subscription? (Binary)
Fraud detection: Is this transaction fraudulent? (Binary)
Lead scoring: Is this lead likely to convert? (Binary)
Disease diagnosis: Does patient have disease A, B, or C? (Multiclass)
Image classification: Is this image a cat, dog, or bird? (Multiclass)

📝 Section 17.1 Review Questions

Why is predicting “will churn” a classification problem, not regression?
When would you use multi-label classification instead of multiclass?
Give two examples each of binary, multiclass, and multi-label problems in African business contexts.
What is the key difference between classification output (0/1 label) and predicted probability?

17.2 The Confusion Matrix

The confusion matrix summarises model predictions vs actual outcomes in a 2×2 table (for binary classification).

📘 Theory: Confusion Matrix for Binary Classification

	Predicted Negative	Predicted Positive
Actual Negative	TN (True Negative)	FP (False Positive)
Actual Positive	FN (False Negative)	TP (True Positive)

where:
- TP (True Positive): Model predicted “1”, actual was “1” (correct positive prediction)
- TN (True Negative): Model predicted “0”, actual was “0” (correct negative prediction)
- FP (False Positive): Model predicted “1”, actual was “0” (incorrectly predicted positive; “false alarm”)
- FN (False Negative): Model predicted “0”, actual was “1” (incorrectly predicted negative; “miss”)

Medical analogy:
- TP: Correctly diagnose disease (patient has disease, test is positive) ✓
- TN: Correctly clear patient (patient is healthy, test is negative) ✓
- FP: False alarm (patient is healthy, but test says positive); leads to unnecessary treatment
- FN: Dangerous miss (patient has disease, but test says negative); patient goes untreated
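The four cells can be counted directly from paired actual and predicted labels. The sketch below uses plain Python and a small made-up sample; `confusion_counts` is a hypothetical helper name, not a library function.

```python
def confusion_counts(y_actual, y_pred):
    """Return (tn, fp, fn, tp) for 0/1 labels, with 1 as the positive class."""
    tn = fp = fn = tp = 0
    for actual, pred in zip(y_actual, y_pred):
        if actual == 1 and pred == 1:
            tp += 1   # correct positive prediction
        elif actual == 0 and pred == 0:
            tn += 1   # correct negative prediction
        elif actual == 0 and pred == 1:
            fp += 1   # false alarm
        else:
            fn += 1   # miss

    return tn, fp, fn, tp


# Toy data (made up): 1 = default, 0 = no default
y_actual = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

tn, fp, fn, tp = confusion_counts(y_actual, y_pred)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

The return order (TN, FP, FN, TP) is chosen to match the row-by-row layout of the table above.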

17.2.1 Worked Example: Credit Default Confusion Matrix


library(tidyverse)
library(caret)

# Simulate credit default data
set.seed(42)
n <- 1000
y_actual <- rbinom(n, 1, 0.15)  # 15% default rate
y_pred_prob <- 0.1 + 0.4 * y_actual + rnorm(n, 0, 0.15)
y_pred_prob <- pmin(pmax(y_pred_prob, 0), 1)  # Constrain to [0, 1]
y_pred <- ifelse(y_pred_prob > 0.5, 1, 0)  # Classify at threshold 0.5

# Compute confusion matrix
cm <- confusionMatrix(as.factor(y_pred), as.factor(y_actual), positive = "1")
print(cm)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 848  70
#>          1   3  79
#>                                           
#>                Accuracy : 0.927           
#>                  95% CI : (0.9091, 0.9423)
#>     No Information Rate : 0.851           
#>     P-Value [Acc > NIR] : 1.324e-13       
#>                                           
#>                   Kappa : 0.6466          
#>                                           
#>  Mcnemar's Test P-Value : 1.121e-14       
#>                                           
#>             Sensitivity : 0.5302          
#>             Specificity : 0.9965          
#>          Pos Pred Value : 0.9634          
#>          Neg Pred Value : 0.9237          
#>              Prevalence : 0.1490          
#>          Detection Rate : 0.0790          
#>    Detection Prevalence : 0.0820          
#>       Balanced Accuracy : 0.7633          
#>                                           
#>        'Positive' Class : 1               
#> 

cat("\n\nConfusion Matrix Breakdown:\n\n")
#> 
#> 
#> Confusion Matrix Breakdown:
tp <- cm$table[2, 2]
tn <- cm$table[1, 1]
fp <- cm$table[2, 1]
fn <- cm$table[1, 2]

cat("True Positives (TP):", tp, " - Correctly predicted defaults\n")
#> True Positives (TP): 79  - Correctly predicted defaults
cat("True Negatives (TN):", tn, " - Correctly predicted non-defaults\n")
#> True Negatives (TN): 848  - Correctly predicted non-defaults
cat("False Positives (FP):", fp, " - Non-defaults incorrectly predicted as defaults\n")
#> False Positives (FP): 3  - Non-defaults incorrectly predicted as defaults
cat("False Negatives (FN):", fn, " - Defaults incorrectly predicted as non-defaults\n\n")
#> False Negatives (FN): 70  - Defaults incorrectly predicted as non-defaults

# Visualise confusion matrix
png("confusion_matrix.png", width = 600, height = 500)
fourfoldplot(cm$table, color = c("#eee", "#ff6b6b"),
             main = "Confusion Matrix: Credit Default Prediction")

dev.off()
#> png 
#>   2

cat("\nConfusion Matrix Proportions:\n")
#> 
#> Confusion Matrix Proportions:
prop_table <- cm$table / sum(cm$table)
print(prop_table)
#>           Reference
#> Prediction     0     1
#>          0 0.848 0.070
#>          1 0.003 0.079

cat("\n\nInterpretation:\n")
#> 
#> 
#> Interpretation:
cat("- Proportion of loans that defaulted:", round(mean(y_actual), 3), "\n")
#> - Proportion of loans that defaulted: 0.149
cat("- Of the", sum(y_actual), " actual defaults, we caught", tp, "(",
    round(tp / sum(y_actual) * 100, 1), "%)\n")
#> - Of the 149  actual defaults, we caught 79 ( 53 %)
cat("- Of the", sum(1 - y_actual), " actual non-defaults, we incorrectly flagged", fp, "(",
    round(fp / sum(1 - y_actual) * 100, 1), "%)\n")
#> - Of the 851  actual non-defaults, we incorrectly flagged 3 ( 0.4 %)


import numpy as np
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

np.random.seed(42)
n = 1000

# Simulate credit default data
y_actual = np.random.binomial(1, 0.15, n)  # 15% default rate
y_pred_prob = 0.1 + 0.4 * y_actual + np.random.normal(0, 0.15, n)
y_pred_prob = np.clip(y_pred_prob, 0, 1)
y_pred = (y_pred_prob > 0.5).astype(int)

# Compute confusion matrix
cm = confusion_matrix(y_actual, y_pred)
print("Confusion Matrix:")
#> Confusion Matrix:
print(cm)
#> [[846   1]
#>  [ 69  84]]
print()

# Unpack
tn, fp, fn, tp = cm.ravel()

print("Confusion Matrix Breakdown:\n")
#> Confusion Matrix Breakdown:
print(f"True Positives (TP): {tp} - Correctly predicted defaults")
#> True Positives (TP): 84 - Correctly predicted defaults
print(f"True Negatives (TN): {tn} - Correctly predicted non-defaults")
#> True Negatives (TN): 846 - Correctly predicted non-defaults
print(f"False Positives (FP): {fp} - Non-defaults incorrectly predicted as defaults")
#> False Positives (FP): 1 - Non-defaults incorrectly predicted as defaults
print(f"False Negatives (FN): {fn} - Defaults incorrectly predicted as non-defaults\n")
#> False Negatives (FN): 69 - Defaults incorrectly predicted as non-defaults

# Visualise
fig, ax = plt.subplots(figsize=(8, 6))
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Non-Default', 'Default'])
disp.plot(ax=ax, cmap='Blues', values_format='d')
ax.set_title('Confusion Matrix: Credit Default Prediction', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.savefig('confusion_matrix.png', dpi=300, bbox_inches='tight')
plt.show()



# Proportions
cm_pct = cm / cm.sum()
print("Confusion Matrix Proportions:")
#> Confusion Matrix Proportions:
print(cm_pct)
#> [[0.846 0.001]
#>  [0.069 0.084]]
print()

# Interpretation
print("Interpretation:")
#> Interpretation:
print(f"- Proportion of loans that defaulted: {y_actual.mean():.3f}")
#> - Proportion of loans that defaulted: 0.153
print(f"- Of the {y_actual.sum()} actual defaults, we caught {tp} ({tp/y_actual.sum()*100:.1f}%)")
#> - Of the 153 actual defaults, we caught 84 (54.9%)
print(f"- Of the {(1-y_actual).sum():.0f} actual non-defaults, we incorrectly flagged {fp} ({fp/(1-y_actual).sum()*100:.1f}%)")
#> - Of the 847 actual non-defaults, we incorrectly flagged 1 (0.1%)

📝 Section 17.2 Review Questions

In a medical diagnostic test, which is worse: FP (false alarm) or FN (miss)? Why?
If a model has high TP and TN but many FP, what does this suggest?
How would you explain FN and FP to a non-technical business stakeholder?
In fraud detection, is it worse to miss fraud (FN) or to flag legitimate transactions (FP)?

17.3 Classification Metrics

A single metric (accuracy) is insufficient for classification, especially with imbalanced classes.

🔑 Key Formulas: Classification Metrics

Accuracy: What fraction of predictions were correct? \[\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\]

Precision: Of all positive predictions, how many were truly positive? (Low precision means many false positives.) \[\text{Precision} = \frac{TP}{TP + FP}\]

Recall (Sensitivity): Of all actual positives, how many did we catch? (Low recall means many false negatives.) \[\text{Recall} = \frac{TP}{TP + FN}\]

Specificity: Of all actual negatives, how many did we correctly classify? \[\text{Specificity} = \frac{TN}{TN + FP}\]

F1-Score: Harmonic mean of precision and recall; balances both \[\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\]

Precision-Recall Trade-off: Raising the classification threshold (predicting positive only for high-confidence cases) typically increases precision but can only decrease recall or leave it unchanged; lowering the threshold does the reverse.
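As a quick arithmetic check, the formulas above can be applied to the four counts from the R confusion matrix in Section 17.2.1 (TP = 79, TN = 848, FP = 3, FN = 70). This is a minimal sketch in plain Python; `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the metrics defined above from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Counts taken from the R output in Section 17.2.1
metrics = classification_metrics(tp=79, tn=848, fp=3, fn=70)
for name, value in metrics.items():
    print(f"{name:12s} {value:.4f}")
```

Accuracy is high (0.927) even though recall is only about 0.53, which is exactly the gap that a single accuracy figure hides.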

17.3.1 When to Use Which Metric?

Scenario	Best Metric	Rationale
Balanced classes (50% each)	Accuracy	All errors equally costly
Imbalanced (5% positive)	Precision, Recall, F1	Accuracy is misleading
Medical screening	Recall/Sensitivity	Missing disease is dangerous; FN costly
Spam filter	Precision	False alarms (FP) annoy users more than misses (FN)
Fraud detection	Recall (catch fraud) or F1	Both FN (missed fraud) and FP (false alarms) are costly
General-purpose	F1 or AUC	F1 balances precision/recall; AUC summarises performance across all thresholds

17.3.2 Worked Example: Computing All Metrics


# Compute all classification metrics

library(caret)

# Continue from previous confusion matrix
# y_actual and y_pred defined as before

# Using caret::confusionMatrix (already printed above)
# Let's manually compute key metrics for clarity

accuracy <- (tp + tn) / (tp + tn + fp + fn)
precision <- tp / (tp + fp)
recall <- tp / (tp + fn)
specificity <- tn / (tn + fp)
f1 <- 2 * (precision * recall) / (precision + recall)

cat("Classification Metrics Summary:\n\n")
#> Classification Metrics Summary:
cat("Accuracy:    ", round(accuracy, 4), " (", round(accuracy * 100, 2), "%)\n")
#> Accuracy:     0.927  ( 92.7 %)
cat("Precision:   ", round(precision, 4), " (of positive predictions, ", round(precision * 100, 1), "% true)\n")
#> Precision:    0.9634  (of positive predictions,  96.3 % true)
cat("Recall:      ", round(recall, 4), " (of actual positives, ", round(recall * 100, 1), "% caught)\n")
#> Recall:       0.5302  (of actual positives,  53 % caught)
cat("Specificity: ", round(specificity, 4), " (of actual negatives, ", round(specificity * 100, 1), "% correct)\n")
#> Specificity:  0.9965  (of actual negatives,  99.6 % correct)
cat("F1-Score:    ", round(f1, 4), "\n\n")
#> F1-Score:     0.684

# Interpretation in business context
cat("Business Interpretation (Credit Default):\n\n")
#> Business Interpretation (Credit Default):
cat("Accuracy = ", round(accuracy * 100, 1), "%\n")
#> Accuracy =  92.7 %
cat("  - Overall, the model correctly classifies loans", round(accuracy * 100, 1), "% of the time.\n")
#>   - Overall, the model correctly classifies loans 92.7 % of the time.
cat("  - But this is misleading if defaults are rare!\n\n")
#>   - But this is misleading if defaults are rare!

cat("Precision = ", round(precision * 100, 1), "%\n")
#> Precision =  96.3 %
cat("  - Of loans flagged as default risk,", round(precision * 100, 1), "% actually defaulted.\n")
#>   - Of loans flagged as default risk, 96.3 % actually defaulted.
cat("  - High precision means few false alarms; low false positives.\n\n")
#>   - High precision means few false alarms; low false positives.

cat("Recall = ", round(recall * 100, 1), "%\n")
#> Recall =  53 %
cat("  - Of loans that actually defaulted, we caught", round(recall * 100, 1), "%.\n")
#>   - Of loans that actually defaulted, we caught 53 %.
cat("  - If recall is low, we miss many defaults (risky for lender).\n\n")
#>   - If recall is low, we miss many defaults (risky for lender).

cat("F1-Score = ", round(f1, 4), "\n")
#> F1-Score =  0.684
cat("  - Balances precision and recall; useful for imbalanced data.\n")
#>   - Balances precision and recall; useful for imbalanced data.

# Sensitivity-Specificity trade-off
cat("\n\nSensitivity-Specificity Trade-off:\n")
#> 
#> 
#> Sensitivity-Specificity Trade-off:
cat("Sensitivity (Recall) = ", round(recall, 4), " - Fraction of defaults caught\n")
#> Sensitivity (Recall) =  0.5302  - Fraction of defaults caught
cat("Specificity =          ", round(specificity, 4), " - Fraction of non-defaults correctly approved\n")
#> Specificity =           0.9965  - Fraction of non-defaults correctly approved
cat("Sum =                  ", round(recall + specificity, 4), " (usually < 2, reflecting trade-off)\n")
#> Sum =                   1.5267  (usually < 2, reflecting trade-off)


import pandas as pd  # needed for the threshold comparison table below
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Use y_actual and y_pred from before

# Compute metrics
accuracy = accuracy_score(y_actual, y_pred)
precision = precision_score(y_actual, y_pred)
recall = recall_score(y_actual, y_pred)
f1 = f1_score(y_actual, y_pred)

# For specificity, need to compute manually
specificity = tn / (tn + fp)

print("Classification Metrics Summary:\n")
#> Classification Metrics Summary:
print(f"Accuracy:    {accuracy:.4f} ({accuracy*100:.2f}%)")
#> Accuracy:    0.9300 (93.00%)
print(f"Precision:   {precision:.4f} (of positive predictions, {precision*100:.1f}% true)")
#> Precision:   0.9882 (of positive predictions, 98.8% true)
print(f"Recall:      {recall:.4f} (of actual positives, {recall*100:.1f}% caught)")
#> Recall:      0.5490 (of actual positives, 54.9% caught)
print(f"Specificity: {specificity:.4f} (of actual negatives, {specificity*100:.1f}% correct)")
#> Specificity: 0.9988 (of actual negatives, 99.9% correct)
print(f"F1-Score:    {f1:.4f}\n")
#> F1-Score:    0.7059

# Business interpretation
print("Business Interpretation (Credit Default):\n")
#> Business Interpretation (Credit Default):
print(f"Accuracy = {accuracy*100:.1f}%")
#> Accuracy = 93.0%
print("  - Overall, the model correctly classifies loans {:.1f}% of the time.".format(accuracy*100))
#>   - Overall, the model correctly classifies loans 93.0% of the time.
print("  - But this is misleading if defaults are rare!\n")
#>   - But this is misleading if defaults are rare!

print(f"Precision = {precision*100:.1f}%")
#> Precision = 98.8%
print("  - Of loans flagged as default risk, {:.1f}% actually defaulted.".format(precision*100))
#>   - Of loans flagged as default risk, 98.8% actually defaulted.
print("  - High precision means few false alarms; low false positives.\n")
#>   - High precision means few false alarms; low false positives.

print(f"Recall = {recall*100:.1f}%")
#> Recall = 54.9%
print("  - Of loans that actually defaulted, we caught {:.1f}%.".format(recall*100))
#>   - Of loans that actually defaulted, we caught 54.9%.
print("  - If recall is low, we miss many defaults (risky for lender).\n")
#>   - If recall is low, we miss many defaults (risky for lender).

print(f"F1-Score = {f1:.4f}")
#> F1-Score = 0.7059
print("  - Balances precision and recall; useful for imbalanced data.\n")
#>   - Balances precision and recall; useful for imbalanced data.

print("\nSensitivity-Specificity Trade-off:")
#> 
#> Sensitivity-Specificity Trade-off:
print(f"Sensitivity (Recall) = {recall:.4f} - Fraction of defaults caught")
#> Sensitivity (Recall) = 0.5490 - Fraction of defaults caught
print(f"Specificity =          {specificity:.4f} - Fraction of non-defaults correctly approved")
#> Specificity =          0.9988 - Fraction of non-defaults correctly approved
print(f"Sum =                  {recall + specificity:.4f} (usually < 2, reflecting trade-off)")
#> Sum =                  1.5478 (usually < 2, reflecting trade-off)

# Create a comparison table for different thresholds
thresholds = [0.3, 0.4, 0.5, 0.6, 0.7]
results = []

for thresh in thresholds:
    y_pred_thresh = (y_pred_prob > thresh).astype(int)
    cm_thresh = confusion_matrix(y_actual, y_pred_thresh)
    tn_t, fp_t, fn_t, tp_t = cm_thresh.ravel()

    acc = (tp_t + tn_t) / (tp_t + tn_t + fp_t + fn_t)
    prec = tp_t / (tp_t + fp_t) if (tp_t + fp_t) > 0 else 0
    rec = tp_t / (tp_t + fn_t) if (tp_t + fn_t) > 0 else 0
    f1_t = 2 * (prec * rec) / (prec + rec) if (prec + rec) > 0 else 0

    results.append({
        'Threshold': thresh,
        'Accuracy': acc,
        'Precision': prec,
        'Recall': rec,
        'F1': f1_t
    })

threshold_df = pd.DataFrame(results)
print("\n\nThreshold Sensitivity Analysis:")
#> 
#> 
#> Threshold Sensitivity Analysis:
print(threshold_df.to_string(index=False))
#>  Threshold  Accuracy  Precision   Recall       F1
#>        0.3     0.881   0.572650 0.875817 0.692506
#>        0.4     0.936   0.811189 0.758170 0.783784
#>        0.5     0.930   0.988235 0.549020 0.705882
#>        0.6     0.889   1.000000 0.274510 0.430769
#>        0.7     0.856   1.000000 0.058824 0.111111

📝 Section 17.3 Review Questions

If a loan default model has 95% accuracy but only 30% recall, what is the problem?
Why is F1-score useful for imbalanced datasets?
In fraud detection, would you optimise for high precision or high recall? Justify.
If you increase the classification threshold from 0.5 to 0.7, what happens to precision and recall?

17.4 The ROC Curve and AUC

The ROC (Receiver Operating Characteristic) curve plots the trade-off between true positive rate (sensitivity) and false positive rate across all classification thresholds.

📘 Theory: ROC Curve and AUC

As the classification threshold moves from 1 (classify nothing as positive) to 0 (classify everything as positive), two rates both rise from 0 to 1:

True Positive Rate (TPR) = Recall = TP / (TP + FN) — fraction of actual positives caught
False Positive Rate (FPR) = 1 − Specificity = FP / (FP + TN) — fraction of actual negatives incorrectly flagged

ROC curve: Plot TPR (y-axis) vs FPR (x-axis) as threshold varies

Perfect classifier: The curve passes through the top-left corner (TPR = 1, FPR = 0), meaning some threshold separates the two classes without error

Random classifier: 45-degree diagonal line (no better than coin flip)

AUC (Area Under the ROC Curve): Integral of the ROC curve from FPR = 0 to FPR = 1
- AUC = 1: Perfect discrimination
- AUC = 0.5: Random guessing
- AUC = 0.7–0.8: Good classifier
- AUC > 0.8: Very good classifier

Statistical interpretation: AUC = P(score of random positive > score of random negative), the probability that if you pick one default loan and one non-default loan, the model ranks them correctly.

🔑 Key Formula: AUC

\[\text{AUC} = \int_0^1 \text{TPR}(\text{FPR}) \, d(\text{FPR})\]

Equivalently, AUC equals the Wilcoxon-Mann-Whitney U statistic divided by the number of positive-negative pairs, which is the probability that the model ranks a random positive higher than a random negative.
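The equivalence between the area definition and the ranking interpretation can be checked numerically. The sketch below uses only the Python standard library; the simulated scores follow the same recipe as the earlier worked example but with a different random generator, so the AUC will not match the values printed elsewhere in this chapter. Both function names are hypothetical.

```python
import random


def auc_pairwise(y_actual, scores):
    """P(score of a random positive > score of a random negative); ties count 1/2."""
    pos = [s for y, s in zip(y_actual, scores) if y == 1]
    neg = [s for y, s in zip(y_actual, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_trapezoid(y_actual, scores):
    """Area under the ROC curve, traced by lowering the threshold step by step."""
    n_pos = sum(y_actual)
    n_neg = len(y_actual) - n_pos
    ranked = sorted(zip(scores, y_actual), key=lambda pair: -pair[0])
    tp = fp = 0
    prev_fpr = prev_tpr = 0.0
    area = 0.0
    i = 0
    while i < len(ranked):
        j = i
        # Observations with tied scores cross the threshold together
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        fpr, tpr = fp / n_neg, tp / n_pos
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2  # trapezoid rule
        prev_fpr, prev_tpr = fpr, tpr
        i = j
    return area


# Simulated data (illustrative only): 15% positives, noisy scores clipped to [0, 1]
random.seed(0)
y_actual = [1 if random.random() < 0.15 else 0 for _ in range(500)]
scores = [min(max(0.1 + 0.4 * y + random.gauss(0, 0.15), 0.0), 1.0) for y in y_actual]

auc_rank = auc_pairwise(y_actual, scores)
auc_area = auc_trapezoid(y_actual, scores)
print(f"Pairwise ranking AUC: {auc_rank:.4f}")
print(f"Trapezoidal ROC AUC:  {auc_area:.4f}")
```

The pairwise version is quadratic in the sample size, so it is useful for building intuition rather than for large data sets.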

17.4.1 Worked Example: ROC Curve and AUC Calculation


library(pROC)

# Compute ROC curve
roc_obj <- roc(y_actual, y_pred_prob, ci = TRUE)

cat("ROC Curve Statistics:\n\n")
#> ROC Curve Statistics:
cat("AUC:", round(auc(roc_obj), 4), "\n")
#> AUC: 0.9835
cat("95% CI: [", round(roc_obj$ci[1], 4), ", ", round(roc_obj$ci[3], 4), "]\n\n")
#> 95% CI: [ 0.9771 ,  0.9898 ]

# Plot ROC curve
png("roc_curve.png", width = 700, height = 600)
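plot(roc_obj, main = "ROC Curve: Credit Default Prediction",
     legacy.axes = TRUE,  # label the x-axis as 1 - specificity (FPR)
     xlab = "False Positive Rate (1 - Specificity)",
     ylab = "True Positive Rate (Sensitivity)",
     lwd = 2, col = "steelblue")
# Add diagonal line (random classifier). pROC plots specificity on a
# reversed x-axis internally, so the chance line is y = 1 - x.
abline(1, -1, col = "red", lty = 2, lwd = 1)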
# Add AUC annotation
legend("bottomright", legend = paste0("AUC = ", round(auc(roc_obj), 3)),
       col = "steelblue", lwd = 2, bty = "n", cex = 1.2)
dev.off()
#> png 
#>   2

cat("ROC Curve Interpretation:\n\n")
#> ROC Curve Interpretation:
cat("- If AUC > 0.5, the model is better than random guessing.\n")
#> - If AUC > 0.5, the model is better than random guessing.
cat("- If AUC = 0.7–0.8, the model has good discriminative ability.\n")
#> - If AUC = 0.7–0.8, the model has good discriminative ability.
cat("- AUC >0.8 indicates very good model performance.\n\n")
#> - AUC >0.8 indicates very good model performance.
cat("For this model, AUC =", round(auc(roc_obj), 4),
    ", indicating", if(as.numeric(auc(roc_obj)) > 0.7) "good" else "fair", "discrimination.\n")
#> For this model, AUC = 0.9835 , indicating good discrimination.

# Compare classifiers
y_pred_prob_alt <- 0.05 + 0.2 * y_actual + rnorm(n, 0, 0.2)
y_pred_prob_alt <- pmin(pmax(y_pred_prob_alt, 0), 1)

roc_obj_alt <- roc(y_actual, y_pred_prob_alt)

cat("\n\nComparing Two Classifiers:\n")
#> 
#> 
#> Comparing Two Classifiers:
cat("Classifier 1 AUC:", round(auc(roc_obj), 4), "\n")
#> Classifier 1 AUC: 0.9835
cat("Classifier 2 AUC:", round(auc(roc_obj_alt), 4), "\n")
#> Classifier 2 AUC: 0.7736
cat("Difference:", round(auc(roc_obj) - auc(roc_obj_alt), 4), "\n\n")
#> Difference: 0.2099
cat("Classifier 1 is superior (higher AUC).\n")
#> Classifier 1 is superior (higher AUC).

# Plot both ROCs
png("roc_comparison.png", width = 700, height = 600)
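plot(roc_obj, main = "ROC Comparison: Two Classifiers",
     legacy.axes = TRUE,  # label the x-axis as 1 - specificity (FPR)
     xlab = "False Positive Rate", ylab = "True Positive Rate",
     lwd = 2, col = "steelblue")
lines(roc_obj_alt, lwd = 2, col = "darkred")
# Chance line: pROC's internal x-coordinate is specificity, so use y = 1 - x
abline(1, -1, col = "gray", lty = 2, lwd = 1)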
legend("bottomright",
       legend = c(paste0("Classifier 1 (AUC = ", round(auc(roc_obj), 3), ")"),
                  paste0("Classifier 2 (AUC = ", round(auc(roc_obj_alt), 3), ")"),
                  "Random (AUC = 0.5)"),
       col = c("steelblue", "darkred", "gray"), lwd = 2, lty = c(1, 1, 2),
       bty = "n", cex = 1)
dev.off()
#> png 
#>   2


from sklearn.metrics import roc_curve, auc, roc_auc_score
import matplotlib.pyplot as plt

# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_actual, y_pred_prob)
roc_auc = auc(fpr, tpr)

print(f"ROC Curve Statistics:\n")
#> ROC Curve Statistics:
print(f"AUC: {roc_auc:.4f}\n")
#> AUC: 0.9620

# Plot ROC curve
fig, ax = plt.subplots(figsize=(8, 7))
ax.plot(fpr, tpr, lw=2, color='steelblue', label=f'ROC Curve (AUC = {roc_auc:.3f})')
ax.plot([0, 1], [0, 1], lw=1, color='red', linestyle='--', label='Random Classifier (AUC = 0.5)')
ax.set_xlabel('False Positive Rate (1 - Specificity)', fontsize=11)
ax.set_ylabel('True Positive Rate (Sensitivity)', fontsize=11)
ax.set_title('ROC Curve: Credit Default Prediction', fontsize=12, fontweight='bold')
ax.legend(fontsize=10, loc='lower right')
ax.grid(True, alpha=0.3)
ax.set_xlim([-0.02, 1.02])
ax.set_ylim([-0.02, 1.02])

plt.tight_layout()
plt.savefig('roc_curve.png', dpi=300, bbox_inches='tight')
plt.show()



print("ROC Curve Interpretation:\n")
#> ROC Curve Interpretation:
print("- If AUC > 0.5, the model is better than random guessing.")
#> - If AUC > 0.5, the model is better than random guessing.
print("- If AUC = 0.7–0.8, the model has good discriminative ability.")
#> - If AUC = 0.7–0.8, the model has good discriminative ability.
print("- AUC > 0.8 indicates very good model performance.\n")
#> - AUC > 0.8 indicates very good model performance.
print(f"For this model, AUC = {roc_auc:.4f}, indicating {'good' if roc_auc > 0.7 else 'fair'} discrimination.\n")
#> For this model, AUC = 0.9620, indicating good discrimination.

# Compare classifiers
y_pred_prob_alt = 0.05 + 0.2 * y_actual + np.random.normal(0, 0.2, n)
y_pred_prob_alt = np.clip(y_pred_prob_alt, 0, 1)

roc_auc_alt = roc_auc_score(y_actual, y_pred_prob_alt)

print("Comparing Two Classifiers:")
#> Comparing Two Classifiers:
print(f"Classifier 1 AUC: {roc_auc:.4f}")
#> Classifier 1 AUC: 0.9620
print(f"Classifier 2 AUC: {roc_auc_alt:.4f}")
#> Classifier 2 AUC: 0.7987
print(f"Difference: {roc_auc - roc_auc_alt:.4f}\n")
#> Difference: 0.1632
print("Classifier 1 is superior (higher AUC).\n")
#> Classifier 1 is superior (higher AUC).

# Plot both ROCs
fpr_alt, tpr_alt, _ = roc_curve(y_actual, y_pred_prob_alt)
roc_auc_alt = auc(fpr_alt, tpr_alt)

fig, ax = plt.subplots(figsize=(8, 7))
ax.plot(fpr, tpr, lw=2, color='steelblue', label=f'Classifier 1 (AUC = {roc_auc:.3f})')
ax.plot(fpr_alt, tpr_alt, lw=2, color='darkred', label=f'Classifier 2 (AUC = {roc_auc_alt:.3f})')
ax.plot([0, 1], [0, 1], lw=1, color='gray', linestyle='--', label='Random (AUC = 0.5)')
ax.set_xlabel('False Positive Rate', fontsize=11)
ax.set_ylabel('True Positive Rate', fontsize=11)
ax.set_title('ROC Comparison: Two Classifiers', fontsize=12, fontweight='bold')
ax.legend(fontsize=10, loc='lower right')
ax.grid(True, alpha=0.3)
ax.set_xlim([-0.02, 1.02])
ax.set_ylim([-0.02, 1.02])

plt.tight_layout()
plt.savefig('roc_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

📝 Section 17.4 Review Questions

What does AUC = 0.75 mean in plain English?
Why is AUC better than accuracy for evaluating classifiers on imbalanced datasets?
If two models have AUCs of 0.72 and 0.75, which should you deploy? Does the difference matter?
Can a model have high accuracy but low AUC? Explain with an example.

17.5 Class Imbalance

Class imbalance occurs when one class vastly outnumbers another (e.g., 2% defaults, 98% non-defaults). This breaks standard accuracy-based evaluation.

📘 Theory: The Class Imbalance Problem

Naive baseline: Always predict “no default.” Accuracy = 98%! (correctly predicts non-defaults, ignores defaults)

Why accuracy fails: The model appears excellent but catches zero defaults.
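The naive baseline is easy to reproduce. The sketch below builds a made-up portfolio with the 2% default rate mentioned above and scores the "always predict no default" rule in plain Python.

```python
# Made-up portfolio: 20 defaults (1) and 980 non-defaults (0), i.e. a 2% default rate
y_actual = [1] * 20 + [0] * 980

# Naive baseline: predict "no default" for every loan
y_pred = [0] * len(y_actual)

correct = sum(1 for a, p in zip(y_actual, y_pred) if a == p)
caught = sum(1 for a, p in zip(y_actual, y_pred) if a == 1 and p == 1)

accuracy = correct / len(y_actual)
recall = caught / sum(y_actual)

print(f"Accuracy: {accuracy:.2%}")
print(f"Recall:   {recall:.2%}")
```

Any real model should be judged against this baseline: beating 98% accuracy is hard, while beating 0% recall is the part that matters to the lender.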

Solutions:

Threshold adjustment: Lower the classification threshold to predict more positives
Cost-sensitive learning: Assign higher misclassification cost to minority class (e.g., FN = 10× FP)
Resampling:
- Oversampling: Duplicate minority class observations (may overfit)
- Undersampling: Remove majority class observations (loses information)
- SMOTE (Synthetic Minority Oversampling Technique): Synthetically create minority observations
Metrics: Use Precision, Recall, F1, AUC instead of Accuracy
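Threshold adjustment and cost-sensitive thinking can be combined: instead of defaulting to 0.5, pick the threshold that minimises total misclassification cost. The sketch below assumes the 10:1 cost ratio used as an example above (a missed default costs ten times a false alarm) and simulated scores; both the costs and the data are illustrative assumptions, not estimates from a real portfolio.

```python
import random

COST_FP = 1.0    # assumed cost of flagging a good loan
COST_FN = 10.0   # assumed cost of missing a default (10x a false alarm)

# Simulated data (illustrative only): 15% defaults, noisy scores clipped to [0, 1]
random.seed(1)
y_actual = [1 if random.random() < 0.15 else 0 for _ in range(2000)]
scores = [min(max(0.1 + 0.4 * y + random.gauss(0, 0.15), 0.0), 1.0) for y in y_actual]


def total_cost(threshold):
    """Total cost when loans with score > threshold are flagged as defaults."""
    fp = sum(1 for y, s in zip(y_actual, scores) if y == 0 and s > threshold)
    fn = sum(1 for y, s in zip(y_actual, scores) if y == 1 and s <= threshold)
    return COST_FP * fp + COST_FN * fn


# Evaluate a grid of candidate thresholds: 0.05, 0.10, ..., 0.95
thresholds = [round(0.05 * k, 2) for k in range(1, 20)]
costs = {t: total_cost(t) for t in thresholds}
best_threshold = min(costs, key=costs.get)

print(f"Cost at threshold 0.50: {costs[0.5]:.0f}")
print(f"Lowest-cost threshold:  {best_threshold:.2f} (cost {costs[best_threshold]:.0f})")
```

Because misses are assumed to be far more expensive than false alarms, the lowest-cost threshold in this simulation falls below 0.5. In practice the threshold should be chosen on validation data, not on the data used for the final evaluation.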

🔑 Key Formula: SMOTE (Synthetic Minority Oversampling)

For each minority sample:
1. Find k nearest minority neighbors
2. Randomly select one neighbor
3. Draw a line segment between the sample and neighbor
4. Randomly place a new synthetic sample on this line segment

This creates new minority observations that interpolate between existing ones instead of duplicating them exactly.
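The four steps can be sketched in a few lines of plain Python. This is a teaching sketch under simplifying assumptions (Euclidean distance, numeric features only, a fixed number of synthetic points per sample); `smote_sketch` is a hypothetical name and is not the implementation used by `smotefamily` or `imblearn`.

```python
import math
import random


def smote_sketch(minority, n_per_sample=1, k=5, seed=0):
    """Generate synthetic minority points by interpolating towards neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for x in minority:
        # Step 1: find the k nearest minority neighbours of x (excluding x itself)
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(n_per_sample):
            # Step 2: randomly select one neighbour
            neighbour = rng.choice(neighbours)
            # Steps 3-4: place a new point at a random position on the
            # segment between x and the chosen neighbour
            gap = rng.random()
            synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, neighbour)])
    return synthetic


# Made-up minority class with two features
minority = [[2.0, 2.1], [1.8, 2.4], [2.5, 1.9], [2.2, 2.8], [1.6, 1.7], [2.9, 2.3]]
synthetic = smote_sketch(minority, n_per_sample=2, k=3)

print(f"Original minority points: {len(minority)}")
print(f"Synthetic points created: {len(synthetic)}")
```

Every synthetic point lies between two existing minority points, so SMOTE fills in the region the minority class already occupies and cannot invent minority cases in new regions of feature space.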

17.5.1 Worked Example: Handling Class Imbalance


library(smotefamily)  # CRAN replacement for removed 'unbalanced' package
library(caret)

# Create severely imbalanced dataset
set.seed(42)
n_total   <- 1000
n_positive <- 20   # 2% fraud/default rate
n_negative <- n_total - n_positive

X_positive <- matrix(rnorm(n_positive * 5, mean = 2, sd = 1), nrow = n_positive)
X_negative <- matrix(rnorm(n_negative * 5, mean = 0, sd = 1), nrow = n_negative)

X_imbalanced <- rbind(X_positive, X_negative)
y_imbalanced <- c(rep(1, n_positive), rep(0, n_negative))

imbalanced_data <- data.frame(X_imbalanced, y = y_imbalanced)

cat("Original Data:\n")
#> Original Data:
cat("Total:", nrow(imbalanced_data), "\n")
#> Total: 1000
cat("Positive class:", sum(y_imbalanced), " (", round(mean(y_imbalanced) * 100, 1), "%)\n")
#> Positive class: 20  ( 2 %)
cat("Negative class:", sum(1 - y_imbalanced), " (", round(mean(1 - y_imbalanced) * 100, 1), "%)\n\n")
#> Negative class: 980  ( 98 %)

# Method 1: Standard logistic regression (baseline, fitted and evaluated on the full data set)
fit_standard       <- glm(y ~ ., data = imbalanced_data, family = binomial())
pred_standard      <- predict(fit_standard, type = "response")
pred_standard_label <- ifelse(pred_standard > 0.5, 1, 0)

cat("Method 1: Standard Logistic Regression (Threshold = 0.5)\n")
#> Method 1: Standard Logistic Regression (Threshold = 0.5)
cat("Accuracy:", round(mean(pred_standard_label == y_imbalanced), 3), "\n")
#> Accuracy: 0.997
cat("Recall (catch positives):", round(sum(pred_standard_label[y_imbalanced == 1] == 1) / sum(y_imbalanced), 3), "\n")
#> Recall (catch positives): 0.9
cat("Precision:", round(sum(pred_standard_label == 1 & y_imbalanced == 1) / max(sum(pred_standard_label == 1), 1), 3), "\n\n")
#> Precision: 0.947

# Method 2: Adjusted threshold
pred_adj_label <- ifelse(pred_standard > 0.2, 1, 0)

cat("Method 2: Adjusted Threshold (Threshold = 0.2)\n")
#> Method 2: Adjusted Threshold (Threshold = 0.2)
cat("Accuracy:", round(mean(pred_adj_label == y_imbalanced), 3), "\n")
#> Accuracy: 0.995
cat("Recall (catch positives):", round(sum(pred_adj_label[y_imbalanced == 1] == 1) / sum(y_imbalanced), 3), "\n")
#> Recall (catch positives): 0.9
cat("Precision:", round(sum(pred_adj_label == 1 & y_imbalanced == 1) / max(sum(pred_adj_label == 1), 1), 3), "\n\n")
#> Precision: 0.857

# Method 3: SMOTE oversampling on training set
train_idx  <- sample(1:nrow(imbalanced_data), size = 0.7 * nrow(imbalanced_data))
train_data <- imbalanced_data[train_idx, ]
test_data  <- imbalanced_data[-train_idx, ]

# smotefamily::SMOTE returns $data with features + 'class' column
smote_result         <- SMOTE(X = train_data[, -ncol(train_data)],
                               target = train_data$y, K = 5)
train_data_smote     <- smote_result$data
train_data_smote$y   <- as.numeric(as.character(train_data_smote$class))
train_data_smote$class <- NULL

cat("Method 3: SMOTE\n")
#> Method 3: SMOTE
cat("Original training positive rate:", round(mean(train_data$y), 3), "\n")
#> Original training positive rate: 0.017
cat("SMOTE-augmented positive rate:",  round(mean(train_data_smote$y), 3), "\n\n")
#> SMOTE-augmented positive rate: 0.499

fit_smote       <- glm(y ~ ., data = train_data_smote, family = binomial())
pred_smote      <- predict(fit_smote, newdata = test_data, type = "response")
pred_smote_label <- ifelse(pred_smote > 0.5, 1, 0)

cat("Performance on Test Set:\n")
#> Performance on Test Set:
cat("Accuracy:", round(mean(pred_smote_label == test_data$y), 3), "\n")
#> Accuracy: 0.987
cat("Recall:",   round(sum(pred_smote_label[test_data$y == 1] == 1) / max(sum(test_data$y), 1), 3), "\n")
#> Recall: 0.75
cat("Precision:", round(sum(pred_smote_label == 1 & test_data$y == 1) / max(sum(pred_smote_label == 1), 1), 3), "\n\n")
#> Precision: 0.75

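# Summary table
# Note: Methods 1 and 2 are scored on the full dataset (y_imbalanced), while
# Method 3 is scored on the held-out test set, so the rows are not strictly comparable.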
results_summary <- tibble(
  Method   = c("Standard (threshold 0.5)", "Adjusted threshold (0.2)", "SMOTE + Logistic"),
  Accuracy = c(round(mean(pred_standard_label == y_imbalanced), 3),
               round(mean(pred_adj_label == y_imbalanced), 3),
               round(mean(pred_smote_label == test_data$y), 3)),
  Recall   = c(round(sum(pred_standard_label[y_imbalanced==1]==1) / sum(y_imbalanced), 3),
               round(sum(pred_adj_label[y_imbalanced==1]==1)      / sum(y_imbalanced), 3),
               round(sum(pred_smote_label[test_data$y==1]==1)     / max(sum(test_data$y), 1), 3))
)
print(results_summary)
#> # A tibble: 3 × 3
#>   Method                   Accuracy Recall
#>   <chr>                       <dbl>  <dbl>
#> 1 Standard (threshold 0.5)    0.997   0.9 
#> 2 Adjusted threshold (0.2)    0.995   0.9 
#> 3 SMOTE + Logistic            0.987   0.75

## Python

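import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split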

# Create severely imbalanced dataset
np.random.seed(42)
n_total = 1000
n_positive = 20
n_negative = n_total - n_positive

X_positive = np.random.normal(2, 1, (n_positive, 5))
X_negative = np.random.normal(0, 1, (n_negative, 5))

X_imbalanced = np.vstack([X_positive, X_negative])
y_imbalanced = np.hstack([np.ones(n_positive), np.zeros(n_negative)])

print("Original Data:")
#> Original Data:
print(f"Total: {len(y_imbalanced)}")
#> Total: 1000
print(f"Positive class: {int(sum(y_imbalanced))} ({sum(y_imbalanced)/len(y_imbalanced)*100:.1f}%)")
#> Positive class: 20 (2.0%)
print(f"Negative class: {int(sum(1-y_imbalanced))} ({sum(1-y_imbalanced)/len(y_imbalanced)*100:.1f}%)\n")
#> Negative class: 980 (98.0%)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_imbalanced, y_imbalanced,
                                                      test_size=0.3, random_state=42,
                                                      stratify=y_imbalanced)

# Method 1: Standard logistic regression (threshold 0.5)
lr_standard = LogisticRegression()
lr_standard.fit(X_train, y_train)
#> LogisticRegression()
y_pred_prob_std = lr_standard.predict_proba(X_test)[:, 1]
y_pred_std = (y_pred_prob_std > 0.5).astype(int)

print("Method 1: Standard Logistic Regression (Threshold = 0.5)")
#> Method 1: Standard Logistic Regression (Threshold = 0.5)
print(f"Accuracy: {accuracy_score(y_test, y_pred_std):.3f}")
#> Accuracy: 0.993
print(f"Recall: {recall_score(y_test, y_pred_std, zero_division=0):.3f}")
#> Recall: 0.833
if sum(y_pred_std) > 0:
    print(f"Precision: {precision_score(y_test, y_pred_std):.3f}")
#> Precision: 0.833
print()

# Method 2: Adjusted threshold (0.2)
y_pred_adj = (y_pred_prob_std > 0.2).astype(int)

print("Method 2: Adjusted Threshold (Threshold = 0.2)")
#> Method 2: Adjusted Threshold (Threshold = 0.2)
print(f"Accuracy: {accuracy_score(y_test, y_pred_adj):.3f}")
#> Accuracy: 0.993
print(f"Recall: {recall_score(y_test, y_pred_adj, zero_division=0):.3f}")
#> Recall: 1.000
if sum(y_pred_adj) > 0:
    print(f"Precision: {precision_score(y_test, y_pred_adj):.3f}")
#> Precision: 0.750
print()

# Method 3: SMOTE
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

print("Method 3: SMOTE")
#> Method 3: SMOTE
print(f"Original training set positive class rate: {y_train.mean():.3f}")
#> Original training set positive class rate: 0.020
print(f"SMOTE-augmented training set positive class rate: {y_train_smote.mean():.3f}\n")
#> SMOTE-augmented training set positive class rate: 0.500

lr_smote = LogisticRegression()
lr_smote.fit(X_train_smote, y_train_smote)
#> LogisticRegression()
y_pred_smote = lr_smote.predict(X_test)

print("Performance on Test Set:")
#> Performance on Test Set:
print(f"Accuracy: {accuracy_score(y_test, y_pred_smote):.3f}")
#> Accuracy: 0.993
print(f"Recall: {recall_score(y_test, y_pred_smote):.3f}")
#> Recall: 1.000
print(f"Precision: {precision_score(y_test, y_pred_smote):.3f}\n")
#> Precision: 0.750

# Summary table
results_df = pd.DataFrame({
    'Method': ['Standard (threshold 0.5)', 'Adjusted threshold (0.2)', 'SMOTE + Logistic'],
    'Accuracy': [accuracy_score(y_test, y_pred_std),
                 accuracy_score(y_test, y_pred_adj),
                 accuracy_score(y_test, y_pred_smote)],
    'Recall': [recall_score(y_test, y_pred_std, zero_division=0),
               recall_score(y_test, y_pred_adj, zero_division=0),
               recall_score(y_test, y_pred_smote)]
})

print("Summary Table:")
#> Summary Table:
print(results_df.to_string(index=False))
#>                   Method  Accuracy   Recall
#> Standard (threshold 0.5)  0.993333 0.833333
#> Adjusted threshold (0.2)  0.993333 1.000000
#>         SMOTE + Logistic  0.993333 1.000000

📝 Section 17.5 Review Questions

Why does a “naive baseline” (always predict the majority class) achieve high accuracy on imbalanced data?
Explain SMOTE: how does it create synthetic minority samples?
When building a fraud detection model with a 0.5% fraud rate, would you use random CV splits or stratified splits? Why?
For a highly imbalanced dataset, would you optimise for recall or precision?

17.6 Choosing the Right Threshold for Business

The classification threshold (typically 0.5) should reflect business costs rather than being fixed arbitrarily.

📘 Theory: Cost-Sensitive Classification

Define a cost matrix:

	Predict Negative	Predict Positive
Actually Negative	0 (correct)	Cost_FP (false alarm)
Actually Positive	Cost_FN (miss)	0 (correct)

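Expected cost of each decision, where \(\hat{p}\) is the predicted probability of the positive class: \[\text{Cost}(\text{predict positive}) = (1 - \hat{p}) \times \text{Cost\_FP}, \qquad \text{Cost}(\text{predict negative}) = \hat{p} \times \text{Cost\_FN}\]

Optimal threshold: Minimises expected cost. Predict positive when \((1 - \hat{p}) \times \text{Cost\_FP} < \hat{p} \times \text{Cost\_FN}\), i.e. when \[\hat{p} > \frac{\text{Cost\_FP}}{\text{Cost\_FP} + \text{Cost\_FN}}\]

Example: Loan approval, where the positive prediction is “approve” (the applicant is expected to repay)
- Cost_FP (approve bad loan that defaults): $10,000
- Cost_FN (reject good loan): $500 profit × probability of no default ≈ $400 opportunity cost
- Cost_FP / Cost_FN = 10,000 / 400 = 25

Implication: We should approve a loan only when the predicted probability of repayment > 25 / (25 + 1) ≈ 0.96, a much higher threshold than 0.5. This rule assumes the predicted probabilities are well calibrated.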
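The 0.96 figure can be reproduced in a few lines. The following is a minimal sketch, assuming that the predicted probability refers to the positive class and is well calibrated; the function names `expected_costs` and `cost_optimal_threshold` are hypothetical helpers introduced here for illustration.

```python
def expected_costs(p_hat, cost_fp, cost_fn):
    """Expected cost of each decision for a single observation.

    p_hat is the predicted probability of the positive class (assumed calibrated).
    Predicting positive is wrong with probability (1 - p_hat);
    predicting negative is wrong with probability p_hat.
    """
    return {
        "predict_positive": (1 - p_hat) * cost_fp,
        "predict_negative": p_hat * cost_fn,
    }


def cost_optimal_threshold(cost_fp, cost_fn):
    """Probability above which predicting positive has the lower expected cost."""
    return cost_fp / (cost_fp + cost_fn)


# Costs from the loan example: $10,000 per false positive, $400 per false negative
threshold = cost_optimal_threshold(cost_fp=10_000, cost_fn=400)
print(f"Break-even threshold: {threshold:.3f}")

# Cheapest decision at a few predicted probabilities
decisions = {}
for p_hat in (0.50, 0.90, 0.99):
    costs = expected_costs(p_hat, cost_fp=10_000, cost_fn=400)
    decisions[p_hat] = min(costs, key=costs.get)
    print(p_hat, costs, decisions[p_hat])
```

Only the highest of the three probabilities clears the break-even point, so a positive prediction is the cheaper decision only for near-certain cases when false positives are 25× more expensive.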

17.6.1 Worked Example: Cost-Optimised Threshold

# Cost-sensitive classification threshold

# Use previous fraud/default example
set.seed(42)
n <- 1000
y_actual <- rbinom(n, 1, 0.15)
y_pred_prob <- 0.1 + 0.4 * y_actual + rnorm(n, 0, 0.15)
y_pred_prob <- pmin(pmax(y_pred_prob, 0), 1)

# Define business costs
cost_fp <- 10000  # Cost of approving loan that defaults
cost_fn <- 400    # Opportunity cost of rejecting good loan

# Grid search over thresholds
thresholds <- seq(0.1, 0.9, by = 0.05)
total_costs <- c()

for (thresh in thresholds) {
  y_pred <- ifelse(y_pred_prob > thresh, 1, 0)
  fp <- sum((y_pred == 1) & (y_actual == 0))
  fn <- sum((y_pred == 0) & (y_actual == 1))
  total_cost <- cost_fp * fp + cost_fn * fn
  total_costs <- c(total_costs, total_cost)
}

# Find optimal threshold
optimal_idx <- which.min(total_costs)
optimal_threshold <- thresholds[optimal_idx]
optimal_cost <- total_costs[optimal_idx]

cat("Cost-Sensitive Threshold Optimisation:\n\n")
#> Cost-Sensitive Threshold Optimisation:
cat("Cost Matrix:\n")
#> Cost Matrix:
cat("  False Positive (approve bad loan): $", cost_fp, "\n")
#>   False Positive (approve bad loan): $ 10000
cat("  False Negative (reject good loan): $", cost_fn, "\n")
#>   False Negative (reject good loan): $ 400
cat("  Cost ratio: ", cost_fp / cost_fn, ":1\n\n")
#>   Cost ratio:  25 :1

cat("Optimal Threshold: ", optimal_threshold, "\n")
#> Optimal Threshold:  0.65
cat("Expected Cost at Optimal Threshold: $", optimal_cost, "\n\n")
#> Expected Cost at Optimal Threshold: $ 49600

# Compare to standard threshold (0.5)
y_pred_standard <- ifelse(y_pred_prob > 0.5, 1, 0)
fp_standard <- sum((y_pred_standard == 1) & (y_actual == 0))
fn_standard <- sum((y_pred_standard == 0) & (y_actual == 1))
cost_standard <- cost_fp * fp_standard + cost_fn * fn_standard

cat("Comparison:\n")
#> Comparison:
cat("Threshold 0.5 (standard): Cost = $", cost_standard, "\n")
#> Threshold 0.5 (standard): Cost = $ 58000
cat("Threshold", optimal_threshold, "(optimal): Cost = $", optimal_cost, "\n")
#> Threshold 0.65 (optimal): Cost = $ 49600
cat("Savings: $", cost_standard - optimal_cost, " (",
    round((cost_standard - optimal_cost) / cost_standard * 100, 1), "%)\n\n")
#> Savings: $ 8400  ( 14.5 %)

# Visualise cost curve
png("cost_optimisation.png", width = 700, height = 500)
plot(thresholds, total_costs, type = "b", pch = 16, cex = 1,
     xlab = "Classification Threshold", ylab = "Total Cost ($)",
     main = "Cost-Sensitive Threshold Optimisation",
     lwd = 2, col = "steelblue")
abline(v = 0.5, col = "gray", lty = 2, lwd = 1)               # standard threshold (labelled in the legend)
abline(v = optimal_threshold, col = "red", lty = 2, lwd = 2)  # cost-optimal threshold (labelled in the legend)
points(optimal_threshold, optimal_cost, pch = 17, cex = 3, col = "red")
legend("topright", legend = c("Standard threshold (0.5)", paste0("Optimal (", optimal_threshold, ")")),
       col = c("gray", "red"), lty = 2, lwd = c(1, 2))
dev.off()
#> png 
#>   2

cat("\nThreshold Sensitivity Table:\n")
#> 
#> Threshold Sensitivity Table:
sensitivity_table <- tibble(
  Threshold = thresholds,
  Cost = total_costs,
  FP = sapply(thresholds, function(t) sum((ifelse(y_pred_prob > t, 1, 0) == 1) & (y_actual == 0))),
  FN = sapply(thresholds, function(t) sum((ifelse(y_pred_prob > t, 1, 0) == 0) & (y_actual == 1))),
  Recall = sapply(thresholds, function(t) {
    y_pred <- ifelse(y_pred_prob > t, 1, 0)
    if (sum(y_actual) > 0) sum((y_pred == 1) & (y_actual == 1)) / sum(y_actual) else NA
  })
)
print(sensitivity_table)
#> # A tibble: 17 × 5
#>    Threshold    Cost    FP    FN  Recall
#>        <dbl>   <dbl> <int> <int>   <dbl>
#>  1      0.1  4130000   413     0 1      
#>  2      0.15 3040000   304     0 1      
#>  3      0.2  2050000   205     0 1      
#>  4      0.25 1200400   120     1 0.993  
#>  5      0.3   702800    70     7 0.953  
#>  6      0.35  348400    34    21 0.859  
#>  7      0.4   143600    13    34 0.772  
#>  8      0.45  101600     8    54 0.638  
#>  9      0.5    58000     3    70 0.530  
#> 10      0.55   58000     2    95 0.362  
#> 11      0.6    53200     1   108 0.275  
#> 12      0.65   49600     0   124 0.168  
#> 13      0.7    53200     0   133 0.107  
#> 14      0.75   56800     0   142 0.0470 
#> 15      0.8    58000     0   145 0.0268 
#> 16      0.85   59200     0   148 0.00671
#> 17      0.9    59200     0   148 0.00671

## Python

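# Cost-sensitive classification threshold
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Regenerate the simulated data (same design as the R example; NumPy draws different random numbers)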
np.random.seed(42)
n = 1000
y_actual = np.random.binomial(1, 0.15, n)
y_pred_prob = 0.1 + 0.4 * y_actual + np.random.normal(0, 0.15, n)
y_pred_prob = np.clip(y_pred_prob, 0, 1)

# Define business costs
cost_fp = 10000  # Cost of approving loan that defaults
cost_fn = 400    # Opportunity cost of rejecting good loan

# Grid search over thresholds
thresholds = np.arange(0.1, 0.95, 0.05)
total_costs = []
results = []

for thresh in thresholds:
    y_pred = (y_pred_prob > thresh).astype(int)
    fp = np.sum((y_pred == 1) & (y_actual == 0))
    fn = np.sum((y_pred == 0) & (y_actual == 1))
    total_cost = cost_fp * fp + cost_fn * fn
    total_costs.append(total_cost)

    recall = np.sum((y_pred == 1) & (y_actual == 1)) / np.sum(y_actual) if np.sum(y_actual) > 0 else 0
    results.append({'Threshold': thresh, 'Cost': total_cost, 'FP': fp, 'FN': fn, 'Recall': recall})

# Find optimal threshold
optimal_idx = np.argmin(total_costs)
optimal_threshold = thresholds[optimal_idx]
optimal_cost = total_costs[optimal_idx]

print("Cost-Sensitive Threshold Optimisation:\n")
#> Cost-Sensitive Threshold Optimisation:
print("Cost Matrix:")
#> Cost Matrix:
print(f"  False Positive (approve bad loan): ${cost_fp:,}")
#>   False Positive (approve bad loan): $10,000
print(f"  False Negative (reject good loan): ${cost_fn:,}")
#>   False Negative (reject good loan): $400
print(f"  Cost ratio: {cost_fp / cost_fn:.1f}:1\n")
#>   Cost ratio: 25.0:1

print(f"Optimal Threshold: {optimal_threshold:.2f}")
#> Optimal Threshold: 0.50
print(f"Expected Cost at Optimal Threshold: ${optimal_cost:,.0f}\n")
#> Expected Cost at Optimal Threshold: $37,600

# Compare to standard threshold (0.5)
y_pred_standard = (y_pred_prob > 0.5).astype(int)
fp_standard = np.sum((y_pred_standard == 1) & (y_actual == 0))
fn_standard = np.sum((y_pred_standard == 0) & (y_actual == 1))
cost_standard = cost_fp * fp_standard + cost_fn * fn_standard

print("Comparison:")
#> Comparison:
print(f"Threshold 0.5 (standard): Cost = ${cost_standard:,.0f}")
#> Threshold 0.5 (standard): Cost = $37,600
print(f"Threshold {optimal_threshold:.2f} (optimal): Cost = ${optimal_cost:,.0f}")
#> Threshold 0.50 (optimal): Cost = $37,600
print(f"Savings: ${cost_standard - optimal_cost:,.0f} ({(cost_standard - optimal_cost)/cost_standard*100:.1f}%)\n")
#> Savings: $0 (0.0%)

# Visualise
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(thresholds, total_costs, 'o-', linewidth=2, markersize=8, color='steelblue', label='Total Cost')
ax.axvline(0.5, color='gray', linestyle='--', linewidth=1, alpha=0.7, label='Standard (0.5)')
ax.axvline(optimal_threshold, color='red', linestyle='--', linewidth=2, alpha=0.7, label=f'Optimal ({optimal_threshold:.2f})')
ax.plot(optimal_threshold, optimal_cost, marker='^', markersize=12, color='red', zorder=5)
ax.set_xlabel('Classification Threshold', fontsize=11)
ax.set_ylabel('Total Cost ($)', fontsize=11)
ax.set_title('Cost-Sensitive Threshold Optimisation', fontsize=12, fontweight='bold')
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x/1e6:.1f}M' if x >= 1e6 else f'${x/1e3:.0f}K'))

plt.tight_layout()
plt.savefig('cost_optimisation.png', dpi=300, bbox_inches='tight')
plt.show()

# Sensitivity table
results_df = pd.DataFrame(results)
print("Threshold Sensitivity Table:")
#> Threshold Sensitivity Table:
print(results_df.to_string(index=False))
#>  Threshold    Cost  FP  FN   Recall
#>       0.10 4510000 451   0 1.000000
#>       0.15 3371200 337   3 0.980392
#>       0.20 2301600 230   4 0.973856
#>       0.25 1603600 160   9 0.941176
#>       0.30 1007600 100  19 0.875817
#>       0.35  551200  54  28 0.816993
#>       0.40  284800  27  37 0.758170
#>       0.45   92000   7  55 0.640523
#>       0.50   37600   1  69 0.549020
#>       0.55   45600   1  89 0.418301
#>       0.60   44400   0 111 0.274510
#>       0.65   51200   0 128 0.163399
#>       0.70   57600   0 144 0.058824
#>       0.75   59200   0 148 0.032680
#>       0.80   59600   0 149 0.026144
#>       0.85   60000   0 150 0.019608
#>       0.90   61200   0 153 0.000000

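Key insight: In the R run the cost-minimising threshold is 0.65, higher than the standard 0.5, reflecting the asymmetric costs: false positives (approving bad loans) are 25× more expensive than false negatives (rejecting good loans). Moving from 0.5 to 0.65 cuts total cost from $58,000 to $49,600 (14.5%). In the Python run, which draws different random numbers, the grid minimum falls at 0.50 itself, so the best threshold depends on the distribution of predicted scores as well as on the cost ratio.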
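The grid search used in the worked example can be wrapped in a small helper so that the same comparison is easy to rerun under different cost assumptions. This is a minimal sketch: the function names `total_cost` and `best_threshold` and the six-row toy dataset are illustrative assumptions, not part of the chapter's simulated data.

```python
import numpy as np


def total_cost(y_true, y_prob, threshold, cost_fp, cost_fn):
    """Total misclassification cost when predicting positive for y_prob > threshold."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_prob) > threshold
    fp = np.sum(y_pred & (y_true == 0))   # predicted positive, actually negative
    fn = np.sum(~y_pred & (y_true == 1))  # predicted negative, actually positive
    return cost_fp * fp + cost_fn * fn


def best_threshold(y_true, y_prob, thresholds, cost_fp, cost_fn):
    """Return (threshold, cost) with the lowest total cost on the grid.

    On ties, np.argmin keeps the first threshold in the grid.
    """
    costs = [total_cost(y_true, y_prob, t, cost_fp, cost_fn) for t in thresholds]
    idx = int(np.argmin(costs))
    return thresholds[idx], costs[idx]


# Toy data (hypothetical): four negatives, two positives
y_true = np.array([0, 0, 0, 0, 1, 1])
y_prob = np.array([0.1, 0.3, 0.6, 0.2, 0.7, 0.4])
grid = [0.25, 0.50, 0.65]

# False positives expensive (as in the loan example): the grid favours a high threshold
best_t, best_cost = best_threshold(y_true, y_prob, grid, cost_fp=10_000, cost_fn=400)
print(best_t, best_cost)

# False negatives expensive (as in fraud detection): the grid favours a low threshold
flipped_t, flipped_cost = best_threshold(y_true, y_prob, grid, cost_fp=400, cost_fn=10_000)
print(flipped_t, flipped_cost)
```

Swapping the two costs moves the selected threshold from the top of the grid to the bottom, which is why the cost matrix should be agreed before any threshold is tuned.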

📝 Section 17.6 Review Questions

In fraud detection, which cost is typically higher: false positives or false negatives? Why?
If Cost_FP = Cost_FN, what is the optimal threshold?
How would you estimate the cost matrix for your business?
If you cannot estimate exact costs, what threshold would you use as a default?

17.7 Case Study: Fraud vs. Not-Fraud

Business Context: A mobile money platform (like Flutterwave, Paystack) processes millions of transactions. Fraud detection is critical: prevent fraudulent transfers while minimising false positives (transaction declines for legitimate users).

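Data: Simulated, highly imbalanced Nigerian mobile money dataset (1% fraud, 99% legitimate). Features: transaction amount, recipient type, device, time, location, historical user behaviour. The simulation below uses four of them: transaction amount, a new-recipient flag, an unusual-time flag, and the number of device changes in the past month.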

17.7.1 Full Analysis

# Mobile money fraud detection case study

set.seed(42)
n <- 5000
fraud_rate <- 0.01
n_fraud <- round(n * fraud_rate)
n_legit <- n - n_fraud

# Features
transaction_amount <- c(
  rlnorm(n_fraud, meanlog = 10.5, sdlog = 1.5),  # Fraud: higher amounts
  rlnorm(n_legit, meanlog = 9.5, sdlog = 1.5)
)

is_new_recipient <- c(
  rbinom(n_fraud, 1, 0.6),  # Fraud: more new recipients
  rbinom(n_legit, 1, 0.2)
)

is_unusual_time <- c(
  rbinom(n_fraud, 1, 0.5),  # Fraud: more odd hours
  rbinom(n_legit, 1, 0.1)
)

device_changes_1mo <- c(
  rpois(n_fraud, 3),  # Fraud: frequent device changes
  rpois(n_legit, 0.5)
)

fraud_label <- c(rep(1, n_fraud), rep(0, n_legit))

fraud_data <- tibble(
  transaction_amount,
  is_new_recipient,
  is_unusual_time,
  device_changes_1mo,
  fraud = fraud_label
)

cat("Fraud Dataset Characteristics:\n\n")
#> Fraud Dataset Characteristics:
cat("Total transactions:", nrow(fraud_data), "\n")
#> Total transactions: 5000
cat("Fraud cases:", sum(fraud_data$fraud), " (", round(mean(fraud_data$fraud) * 100, 2), "%)\n")
#> Fraud cases: 50  ( 1 %)
cat("Legitimate transactions:", sum(1 - fraud_data$fraud), " (", round(mean(1 - fraud_data$fraud) * 100, 2), "%)\n\n")
#> Legitimate transactions: 4950  ( 99 %)

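# Train-test split. Note: createDataPartition() stratifies by class only when the
# outcome is a factor; with the numeric 0/1 column used here the fraud rate is not
# preserved exactly (pass factor(fraud_data$fraud) for a class-stratified split).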
train_idx <- createDataPartition(fraud_data$fraud, p = 0.7, list = FALSE)
train_data <- fraud_data[train_idx, ]
test_data <- fraud_data[-train_idx, ]

# Model 1: Logistic regression without SMOTE
fit_standard <- glm(fraud ~ ., data = train_data, family = binomial(link = "logit"))
pred_standard <- predict(fit_standard, newdata = test_data, type = "response")

# Model 2: SMOTE + Logistic regression
smote_result2      <- SMOTE(X = as.data.frame(train_data[, -ncol(train_data)]),
                             target = train_data$fraud, K = 5)
train_smote_df     <- smote_result2$data
train_smote_df$fraud  <- as.numeric(as.character(train_smote_df$class))
train_smote_df$class  <- NULL
fit_smote <- glm(fraud ~ ., data = train_smote_df, family = binomial(link = "logit"))
pred_smote <- predict(fit_smote, newdata = test_data, type = "response")

# Evaluation
cat("Model Comparison (at default threshold 0.5):\n\n")
#> Model Comparison (at default threshold 0.5):

for (pred_name in c("Standard", "SMOTE")) {
  if (pred_name == "Standard") {
    pred <- pred_standard
  } else {
    pred <- pred_smote
  }

  y_pred_label <- ifelse(pred > 0.5, 1, 0)
  acc <- mean(y_pred_label == test_data$fraud)
  rec <- sum((y_pred_label == 1) & (test_data$fraud == 1)) / sum(test_data$fraud)
  prec <- if (sum(y_pred_label == 1) > 0) {
    sum((y_pred_label == 1) & (test_data$fraud == 1)) / sum(y_pred_label == 1)
  } else {
    0
  }
  f1 <- if (prec + rec > 0) 2 * (prec * rec) / (prec + rec) else 0

  cat(pred_name, ":\n")
  cat("  Accuracy:", round(acc, 4), "\n")
  cat("  Recall (catch fraud):", round(rec, 4), "\n")
  cat("  Precision:", round(prec, 4), "\n")
  cat("  F1:", round(f1, 4), "\n\n")
}
#> Standard :
#>   Accuracy: 0.994 
#>   Recall (catch fraud): 0.4615 
#>   Precision: 0.75 
#>   F1: 0.5714 
#> 
#> SMOTE :
#>   Accuracy: 0.942 
#>   Recall (catch fraud): 0.6923 
#>   Precision: 0.0978 
#>   F1: 0.1714

# Cost optimisation: Fraud is very costly, false alarms annoying but less costly
cost_fp <- 100  # User frustration; operational cost
cost_fn <- 5000  # Fraud loss (chargebacks, reputation)

# Grid search
thresholds <- seq(0.01, 0.5, by = 0.02)
costs_standard <- c()
costs_smote <- c()

for (thresh in thresholds) {
  y_pred_std <- ifelse(pred_standard > thresh, 1, 0)
  fp_std <- sum((y_pred_std == 1) & (test_data$fraud == 0))
  fn_std <- sum((y_pred_std == 0) & (test_data$fraud == 1))
  costs_standard <- c(costs_standard, cost_fp * fp_std + cost_fn * fn_std)

  y_pred_smt <- ifelse(pred_smote > thresh, 1, 0)
  fp_smt <- sum((y_pred_smt == 1) & (test_data$fraud == 0))
  fn_smt <- sum((y_pred_smt == 0) & (test_data$fraud == 1))
  costs_smote <- c(costs_smote, cost_fp * fp_smt + cost_fn * fn_smt)
}

optimal_idx_std <- which.min(costs_standard)
optimal_thresh_std <- thresholds[optimal_idx_std]
optimal_cost_std <- costs_standard[optimal_idx_std]

optimal_idx_smt <- which.min(costs_smote)
optimal_thresh_smt <- thresholds[optimal_idx_smt]
optimal_cost_smt <- costs_smote[optimal_idx_smt]

cat("Cost-Optimised Thresholds:\n\n")
#> Cost-Optimised Thresholds:
cat("Cost Matrix:\n")
#> Cost Matrix:
cat("  False Positive (flag legitimate): $", cost_fp, "\n")
#>   False Positive (flag legitimate): $ 100
cat("  False Negative (miss fraud): $", cost_fn, "\n\n")
#>   False Negative (miss fraud): $ 5000

cat("Standard Model: Optimal threshold =", optimal_thresh_std, ", Expected cost = $", optimal_cost_std, "\n")
#> Standard Model: Optimal threshold = 0.01 , Expected cost = $ 23900
cat("SMOTE Model: Optimal threshold =", optimal_thresh_smt, ", Expected cost = $", optimal_cost_smt, "\n\n")
#> SMOTE Model: Optimal threshold = 0.37 , Expected cost = $ 24600

# Performance at optimal threshold
y_pred_std_opt <- ifelse(pred_standard > optimal_thresh_std, 1, 0)
y_pred_smt_opt <- ifelse(pred_smote > optimal_thresh_smt, 1, 0)

rec_std_opt <- sum((y_pred_std_opt == 1) & (test_data$fraud == 1)) / sum(test_data$fraud)
rec_smt_opt <- sum((y_pred_smt_opt == 1) & (test_data$fraud == 1)) / sum(test_data$fraud)

cat("Fraud Detection Rate at Optimal Threshold:\n")
#> Fraud Detection Rate at Optimal Threshold:
cat("  Standard Model: ", round(rec_std_opt * 100, 1), "% of fraud caught\n")
#>   Standard Model:  76.9 % of fraud caught
cat("  SMOTE Model: ", round(rec_smt_opt * 100, 1), "% of fraud caught\n\n")
#>   SMOTE Model:  76.9 % of fraud caught

# Business recommendation
cat("Business Recommendation:\n\n")
#> Business Recommendation:
cat("Use SMOTE-trained model with threshold =", optimal_thresh_smt, "\n")
#> Use SMOTE-trained model with threshold = 0.37
cat("This threshold catches", round(rec_smt_opt * 100, 1), "% of fraudulent transactions\n")
#> This threshold catches 76.9 % of fraudulent transactions
cat("while keeping false positive rate to", round(sum((y_pred_smt_opt == 1) & (test_data$fraud == 0)) / sum(1 - test_data$fraud) * 100, 1), "%\n")
#> while keeping false positive rate to 6.5 %
cat("(expected cost: $", optimal_cost_smt, "per", nrow(test_data), "transactions)\n")
#> (expected cost: $ 24600 per 1500 transactions)

## Python

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from imblearn.over_sampling import SMOTE

# Mobile money fraud detection case study (simulated data)
np.random.seed(42)
n = 5000
fraud_rate = 0.01
n_fraud = int(n * fraud_rate)
n_legit = n - n_fraud

# Features
transaction_amount = np.hstack([
    np.random.lognormal(mean=10.5, sigma=1.5, size=n_fraud),
    np.random.lognormal(mean=9.5,  sigma=1.5, size=n_legit)
])

is_new_recipient = np.hstack([
    np.random.binomial(1, 0.6, n_fraud),
    np.random.binomial(1, 0.2, n_legit)
])

is_unusual_time = np.hstack([
    np.random.binomial(1, 0.5, n_fraud),
    np.random.binomial(1, 0.1, n_legit)
])

device_changes = np.hstack([
    np.random.poisson(3, n_fraud),
    np.random.poisson(0.5, n_legit)
])

fraud_label = np.hstack([np.ones(n_fraud), np.zeros(n_legit)])

X = np.column_stack([transaction_amount, is_new_recipient, is_unusual_time, device_changes])
y = fraud_label

print("Fraud Dataset Characteristics:\n")
#> Fraud Dataset Characteristics:
print(f"Total transactions: {len(y)}")
#> Total transactions: 5000
print(f"Fraud cases: {int(sum(y))} ({sum(y)/len(y)*100:.2f}%)")
#> Fraud cases: 50 (1.00%)
print(f"Legitimate transactions: {int(sum(1-y))} ({sum(1-y)/len(y)*100:.2f}%)\n")
#> Legitimate transactions: 4950 (99.00%)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Model 1: Standard logistic regression
lr_standard = LogisticRegression()
lr_standard.fit(X_train, y_train)
#> LogisticRegression()
pred_standard = lr_standard.predict_proba(X_test)[:, 1]

# Model 2: SMOTE + Logistic regression
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

lr_smote = LogisticRegression()
lr_smote.fit(X_train_smote, y_train_smote)
#> LogisticRegression()
pred_smote = lr_smote.predict_proba(X_test)[:, 1]

print("Model Comparison (at default threshold 0.5):\n")
#> Model Comparison (at default threshold 0.5):

for pred_name, pred in [("Standard", pred_standard), ("SMOTE", pred_smote)]:
    y_pred_label = (pred > 0.5).astype(int)
    acc = accuracy_score(y_test, y_pred_label)
    rec = recall_score(y_test, y_pred_label, zero_division=0)
    prec = precision_score(y_test, y_pred_label, zero_division=0)
    f1 = f1_score(y_test, y_pred_label, zero_division=0)

    print(f"{pred_name}:")
    print(f"  Accuracy: {acc:.4f}")
    print(f"  Recall (catch fraud): {rec:.4f}")
    print(f"  Precision: {prec:.4f}")
    print(f"  F1: {f1:.4f}\n")
#> Standard:
#>   Accuracy: 0.9920
#>   Recall (catch fraud): 0.2667
#>   Precision: 0.8000
#>   F1: 0.4000
#> 
#> SMOTE:
#>   Accuracy: 0.8453
#>   Recall (catch fraud): 0.9333
#>   Precision: 0.0571
#>   F1: 0.1077

# Cost optimisation
cost_fp = 100  # User frustration
cost_fn = 5000  # Fraud loss

thresholds = np.arange(0.01, 0.5, 0.02)
costs_standard = []
costs_smote = []

for thresh in thresholds:
    y_pred_std = (pred_standard > thresh).astype(int)
    fp_std = np.sum((y_pred_std == 1) & (y_test == 0))
    fn_std = np.sum((y_pred_std == 0) & (y_test == 1))
    costs_standard.append(cost_fp * fp_std + cost_fn * fn_std)

    y_pred_smt = (pred_smote > thresh).astype(int)
    fp_smt = np.sum((y_pred_smt == 1) & (y_test == 0))
    fn_smt = np.sum((y_pred_smt == 0) & (y_test == 1))
    costs_smote.append(cost_fp * fp_smt + cost_fn * fn_smt)

optimal_idx_std = np.argmin(costs_standard)
optimal_thresh_std = thresholds[optimal_idx_std]
optimal_cost_std = costs_standard[optimal_idx_std]

optimal_idx_smt = np.argmin(costs_smote)
optimal_thresh_smt = thresholds[optimal_idx_smt]
optimal_cost_smt = costs_smote[optimal_idx_smt]

print("Cost-Optimised Thresholds:\n")
#> Cost-Optimised Thresholds:
print(f"Cost Matrix:")
#> Cost Matrix:
print(f"  False Positive (flag legitimate): ${cost_fp}")
#>   False Positive (flag legitimate): $100
print(f"  False Negative (miss fraud): ${cost_fn}\n")
#>   False Negative (miss fraud): $5000

print(f"Standard Model: Optimal threshold = {optimal_thresh_std:.2f}, Expected cost = ${optimal_cost_std:,.0f}")
#> Standard Model: Optimal threshold = 0.05, Expected cost = $20,100
print(f"SMOTE Model: Optimal threshold = {optimal_thresh_smt:.2f}, Expected cost = ${optimal_cost_smt:,.0f}\n")
#> SMOTE Model: Optimal threshold = 0.47, Expected cost = $28,100

# Performance at optimal thresholds
y_pred_std_opt = (pred_standard > optimal_thresh_std).astype(int)
y_pred_smt_opt = (pred_smote > optimal_thresh_smt).astype(int)

rec_std_opt = np.sum((y_pred_std_opt == 1) & (y_test == 1)) / np.sum(y_test)
rec_smt_opt = np.sum((y_pred_smt_opt == 1) & (y_test == 1)) / np.sum(y_test)

fpr_std_opt = np.sum((y_pred_std_opt == 1) & (y_test == 0)) / np.sum(1 - y_test)
fpr_smt_opt = np.sum((y_pred_smt_opt == 1) & (y_test == 0)) / np.sum(1 - y_test)

print("Fraud Detection Rate at Optimal Threshold:")
#> Fraud Detection Rate at Optimal Threshold:
print(f"  Standard Model: {rec_std_opt*100:.1f}% of fraud caught")
#>   Standard Model: 80.0% of fraud caught
print(f"  SMOTE Model: {rec_smt_opt*100:.1f}% of fraud caught\n")
#>   SMOTE Model: 93.3% of fraud caught

print("Business Recommendation:\n")
#> Business Recommendation:
print(f"Use SMOTE-trained model with threshold = {optimal_thresh_smt:.2f}")
#> Use SMOTE-trained model with threshold = 0.47
print(f"This catches {rec_smt_opt*100:.1f}% of fraudulent transactions")
#> This catches 93.3% of fraudulent transactions
print(f"while keeping false positive rate to {fpr_smt_opt*100:.1f}%")
#> while keeping false positive rate to 15.6%
print(f"(expected cost: ${optimal_cost_smt:,.0f} per {len(y_test)} transactions)")
#> (expected cost: $28,100 per 1500 transactions)

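Key Insights:
1. Standard logistic regression on imbalanced data is biased toward the majority class: at threshold 0.5 it catches only 46% (R) and 27% (Python) of fraud.
2. SMOTE substantially improves recall at threshold 0.5 (46% → 69% in R, 27% → 93% in Python), at the price of much lower precision (0.75 → 0.10 in R, 0.80 → 0.06 in Python).
3. The cost-optimised thresholds (0.01 and 0.05 for the standard model, 0.37 and 0.47 for the SMOTE model) are lower than the standard 0.5, reflecting the high cost of missed fraud.
4. At its optimal threshold the SMOTE model catches 76.9% (R) and 93.3% (Python) of fraud, with false positive rates of 6.5% and 15.6%. The standard model with a lowered threshold reaches a lower expected cost in both runs ($23,900 vs $24,600 in R; $20,100 vs $28,100 in Python), so compare models on expected cost as well as recall before committing to one.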
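Why can a model with high recall still flag mostly legitimate transactions? Precision depends on class prevalence as well as on recall and the false positive rate. The sketch below applies the standard identity precision = TP / (TP + FP), written in terms of rates, to the Python run's SMOTE figures (93.3% recall, 15.6% false positive rate, 1% fraud); the function name `precision_from_rates` is a hypothetical helper introduced for illustration.

```python
def precision_from_rates(recall, fpr, prevalence):
    """Precision implied by recall, false positive rate, and class prevalence.

    Per transaction, the share of true positives is recall * prevalence and the
    share of false positives is fpr * (1 - prevalence).
    """
    tp_share = recall * prevalence
    fp_share = fpr * (1 - prevalence)
    return tp_share / (tp_share + fp_share)


# Figures reported for the SMOTE model in the Python run above
precision_rare = precision_from_rates(recall=0.933, fpr=0.156, prevalence=0.01)
print(f"Implied precision at 1% fraud: {precision_rare:.3f}")

# Same recall and false positive rate on a hypothetical balanced population
precision_balanced = precision_from_rates(recall=0.933, fpr=0.156, prevalence=0.50)
print(f"Implied precision at 50% fraud: {precision_balanced:.3f}")
```

At a 1% fraud rate the implied precision is roughly 6%, so most flagged transactions are legitimate even though most fraud is caught. This is the reason the false positive cost belongs in the comparison alongside recall.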

17.8 Chapter Exercises

Chapter 17 Exercises

Exercise 17.1: Confusion Matrix Construction

Given:
- 1,000 credit applications
- Model predicts 150 as “will default”
- Of these 150 predictions, 120 actually defaulted
- In reality, 200 applicants defaulted

Construct the confusion matrix (TP, TN, FP, FN)
Compute accuracy, precision, recall, specificity, F1-score
Interpret each metric in business terms

Exercise 17.2: Imbalanced Data Challenge

A fraud detection model achieves 99.5% accuracy. The dataset has 0.5% fraud rate.

Why is 99.5% accuracy misleading?
If the model predicts “no fraud” for all transactions, what is its accuracy?
What metrics should you use instead of accuracy?

Exercise 17.3: ROC Curve Interpretation

Two classifiers have AUCs of 0.75 and 0.80.

What does each AUC mean?
Is the difference practically significant?
Why is AUC better than accuracy for comparing classifiers on imbalanced data?

Exercise 17.4: SMOTE Implementation

Given an imbalanced binary classification dataset (20% positive, 80% negative):

Explain SMOTE conceptually
Implement SMOTE on training data
Compare model performance (accuracy, recall, F1) with and without SMOTE

Exercise 17.5: Cost Matrix Design

For a credit default prediction model, define:

Cost of false positive (approving applicant who will default)
Cost of false negative (rejecting applicant who wouldn’t default)
Compute cost-optimal threshold given your cost matrix
Compare optimal threshold to standard 0.5

Exercise 17.6: Real-World Case

Select a business problem (churn, fraud, disease diagnosis, etc.). Execute:

Collect or generate a dataset
Build a baseline and an improved classifier
Compute confusion matrix and all metrics
Plot ROC curve and report AUC
Design a cost matrix and find optimal threshold
Write a business recommendation

Exercise 17.7: Threshold Sensitivity Analysis

For a classification model with predicted probabilities:

Vary threshold from 0.1 to 0.9 in steps of 0.1
Compute accuracy, precision, recall, F1 for each threshold
Plot curves showing trade-offs
Recommend a threshold given business constraints

Exercise 17.8: Multi-Class Classification

If your problem has 3+ classes (e.g., customer segment prediction), compute:

Macro-averaged metrics (average across classes)
Weighted-averaged metrics (weight by class size)
Per-class precision/recall
Discuss which averaging method is appropriate

Exercise 17.9: Class Weights vs SMOTE

Compare two imbalance-handling approaches:

Logistic regression with class_weight='balanced'
Logistic regression on SMOTE-augmented data

Compare on test set: accuracy, precision, recall, F1, AUC.

Which is better? Why?

Exercise 17.10: Synthesis - Complete Classification Pipeline

Build an end-to-end classification model:

Exploratory analysis: Visualise class distribution, feature distributions, and correlations
Data preparation: Handle imbalance using SMOTE or class weights
Model building: Fit at least 2 classifiers (logistic regression, decision tree, etc.)
Evaluation: Compute confusion matrix, all metrics, ROC curve, AUC
Threshold optimisation: Design cost matrix and find optimal threshold
Comparison: Rank models by business objective (not just accuracy)
Reporting: Create a one-page business summary with recommendation

17.9 Further Reading

Fawcett, T. (2006). “An Introduction to ROC Analysis.” Pattern Recognition Letters, 27(8), 861–874.
- Definitive reference on ROC curves and AUC.
Davis, J., & Goadrich, M. (2006). “The Relationship between Precision-Recall and ROC Curves.” Proceedings of the 23rd International Conference on Machine Learning.
- Why precision-recall curves matter for imbalanced data.
He, H., & Garcia, E. A. (2009). “Learning from Imbalanced Data.” IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263–1284.
- Comprehensive survey of techniques for imbalanced classification.

End of Chapter 17