15 Regularised Regression and Feature Selection – AI-Powered Business Analytics

15.1 The Overfitting Problem

Imagine you are studying for an examination. You have access to five years of past exam papers. One approach: memorise every past question and its answer. You ace the practice papers with a perfect score. Then the real exam arrives with slightly different questions — and you flounder, because you memorised the past papers instead of understanding the underlying concepts.

This is exactly what overfitting means in machine learning. A model that has memorised its training data can reproduce training examples with near-perfect accuracy, but it has not learned the true underlying pattern — it has learned the noise, the quirks, the idiosyncrasies of that particular dataset. When new data arrives that it has never seen before, the model fails because the noise in the training data does not appear in the new data.

Here is a concrete business example. A bank wants to predict which customers will default on a personal loan. They have 1,000 training records and a junior data scientist who adds 50 features to the model — income, age, credit score, number of transactions, days since last deposit, and 45 other variables. The model fits the training data beautifully, with 98% accuracy. Excited, the bank deploys it. Six months later, the actual default rate on customers the model marked as “safe” is far higher than expected. The model had learned that customer number 487 defaulted AND lived near Ikorodu AND had exactly 12 transactions in March — a coincidence specific to the training data that does not predict anything about future customers.

The solution is regularisation: a mathematical technique that deliberately prevents the model from fitting the training data too perfectly. By adding a penalty for complexity, regularisation forces the model to find simpler, more general patterns — patterns that are more likely to hold when new data arrives.

15.1.1 The Bias-Variance Trade-off: Finding the Sweet Spot

Understanding regularisation requires understanding a fundamental tension at the heart of statistics called the bias-variance trade-off. To grasp this intuitively, think of an archery competition.

An archer with high bias always shoots to the left of the target. Their arrows are consistently wrong in the same direction — they have a systematic error. The bow might be miscalibrated. In modelling terms, a high-bias model is too simple: it misses real patterns because it cannot represent them. A linear model fitted to data that is genuinely curved will always underestimate at the extremes and overestimate in the middle — a systematic, consistent error.

An archer with high variance is inconsistent: sometimes they shoot left, sometimes right, sometimes high, sometimes low. Their arrows are scattered unpredictably across the target. In modelling terms, a high-variance model changes dramatically when retrained on slightly different data. It is too sensitive to the specific sample it was trained on. A model with 50 features fitted to 100 data points is in this territory — small changes in which 100 points you happened to sample could lead to wildly different coefficient estimates.

The ideal archer is both accurate (low bias) and consistent (low variance). The ideal model hits the bull’s-eye reliably. But here is the cruel reality: as you make a model more complex (adding more features, adding interaction terms, using higher-degree polynomials), bias goes down but variance goes up. A complex model fits the training data better, but it becomes more sensitive to noise. Regularisation is the tool that tames this trade-off — it deliberately increases bias slightly to massively reduce variance, resulting in a model that generalises better to new data.

Every estimator’s total prediction error can be decomposed as:

Total error = Bias² + Variance + Irreducible error

The irreducible error is the noise in the data that no model can ever eliminate — the fundamental randomness in outcomes. The goal of regularisation is to minimise bias² + variance together, even though reducing one tends to increase the other.

📘 Theory: Learning Curves and Regularisation

Training error: Error on the data used to fit the model; decreases with model complexity Test (generalisation) error: Error on held-out data; initially decreases, then increases (the “U-shape”)

Overfitting signal: Training error << Test error

Regularisation adds a penalty to the objective function, discouraging extreme coefficient values and encouraging simpler models. This trades a small increase in bias for a larger reduction in variance, shifting the bias-variance optimum toward lower complexity.

📝 Section 10.1 Review Questions

Draw the typical U-shaped learning curve showing training and test error. Label the overfitting region.
If a model has very high training R² but low test R², what is the likely problem?
Explain: “Regularisation introduces bias to reduce variance.” Is this always desirable?
In a high-dimensional problem (p >> n), why is regularisation essential?

15.2 Ridge Regression (L2 Regularisation)

Ridge regression adds an L2 penalty (sum of squared coefficients) to the OLS objective, shrinking coefficients toward zero without eliminating any.

📘 Theory: Ridge Objective and Solution

Ridge minimises: \[S_{\text{Ridge}}(\beta) = \sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij})^2 + \lambda \sum_{j=1}^{p} \beta_j^2\]

The second term is the L2 penalty, controlled by λ (lambda), the regularisation parameter: - λ = 0: OLS (no penalty) - λ → ∞: all coefficients → 0

Closed-form solution: \[\hat{\boldsymbol{\beta}}_{\text{Ridge}} = (\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y}\]

The added λI term makes the matrix inversion stable even when \(\mathbf{X}^T \mathbf{X}\) is near-singular (multicollinearity).

🔑 Key Formula: Ridge Regression

\[\hat{\boldsymbol{\beta}}_{\text{Ridge}} = (\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y}\]

Properties: - Coefficients are never exactly zero (shrinkage, but not elimination) - Symmetric shrinkage: larger coefficients shrink more in absolute value - λ controls the amount of shrinkage - More interpretable than OLS when multicollinearity is severe

15.2.1 Choosing λ via Cross-Validation

The optimal λ is found by K-fold cross-validation:

Fit Ridge with a sequence of λ values (e.g., λ ∈ {0.01, 0.1, 1, 10, 100})
For each λ, compute average CV error across k folds
Select λ with lowest CV error
Refit on full data with this λ

15.2.2 Worked Example: Ridge for Nigerian SME Credit Data

Show code

library(glmnet)
library(tidyverse)

set.seed(42)
n <- 300

# Synthetic Nigerian SME credit data: 20 correlated financial ratios
# Features: debt-to-equity, leverage, ROA, profitability, liquidity, etc.
X_raw <- matrix(rnorm(n * 20, mean = 0, sd = 1), nrow = n, ncol = 20)

# Add multicollinearity: each variable is partly a linear combination of others
for (j in 1:20) {
  if (j > 1) {
    X_raw[, j] <- 0.7 * X_raw[, j] + 0.3 * X_raw[, j - 1] + rnorm(n, 0, 0.2)
  }
}

# Outcome: credit amount (thousands NGN)
y <- 100 + 15 * X_raw[, 1] + 12 * X_raw[, 2] - 8 * X_raw[, 5] + rnorm(n, 0, 30)
y <- pmax(y, 0)

colnames(X_raw) <- paste0("Ratio_", 1:20)

# Standardise features (required for glmnet)
X_scaled <- scale(X_raw)

# Fit Ridge with range of λ
ridge_fit <- glmnet(X_scaled, y, alpha = 0, lambda = 10^seq(3, -2, length = 100))

# Cross-validation to find optimal λ
ridge_cv <- cv.glmnet(X_scaled, y, alpha = 0, nfolds = 5)

cat("Cross-Validation Results for Ridge Regression:\n\n")
#> Cross-Validation Results for Ridge Regression:
cat("Optimal λ (minimum CV error):", round(ridge_cv$lambda.min, 4), "\n")
#> Optimal λ (minimum CV error): 14.9793
cat("λ (one standard error rule):", round(ridge_cv$lambda.1se, 4), "\n")
#> λ (one standard error rule): 45.7445
cat("Minimum CV MSE:", round(min(ridge_cv$cvm), 2), "\n\n")
#> Minimum CV MSE: 1053.9

# Coefficients at optimal λ
coef_ridge <- coef(ridge_fit, s = ridge_cv$lambda.min)[-1]  # exclude intercept

cat("Ridge coefficients (at λ =", round(ridge_cv$lambda.min, 4), "):\n")
#> Ridge coefficients (at λ = 14.9793 ):
print(round(coef_ridge, 3))
#>  [1] 11.360  8.131  1.461  0.366 -5.072 -0.637  0.499 -1.139 -0.625  0.902
#> [11]  2.836 -0.654 -1.048 -1.537  1.789 -0.678 -1.476 -0.693 -0.387 -0.830

# Compare with OLS
fit_ols <- lm(y ~ X_scaled)
coef_ols <- coef(fit_ols)[-1]

# Visualise coefficient paths
# Use line = 4 to push titles above glmnet's top-axis variable-count annotation
par(mfrow = c(1, 2), mar = c(5, 5, 5, 1), oma = c(0, 0, 2, 0))

plot(ridge_cv, main = "")
title(main = "Cross-Validation Error", line = 4)
abline(v = log(ridge_cv$lambda.min), col = "red", lty = 2, lwd = 2)

plot(ridge_fit, xvar = "lambda", main = "")
title(main = "Coefficient Paths", line = 4)
abline(v = log(ridge_cv$lambda.min), col = "red", lty = 2, lwd = 2)

mtext("Ridge Regression", outer = TRUE, cex = 1.2, font = 2)

Show code

par(mfrow = c(1, 1), oma = c(0, 0, 0, 0))

# Comparison: OLS vs Ridge coefficients
comparison <- tibble(
  Variable = colnames(X_scaled),
  OLS = coef_ols,
  Ridge = coef_ridge,
  Ratio = Ridge / OLS
)

cat("\n\nComparison of OLS vs Ridge Coefficients:\n")
#> 
#> 
#> Comparison of OLS vs Ridge Coefficients:
print(comparison |> arrange(desc(abs(OLS))) |> head(10))
#> # A tibble: 10 × 4
#>    Variable   OLS  Ridge Ratio
#>    <chr>    <dbl>  <dbl> <dbl>
#>  1 Ratio_1  15.2  11.4   0.746
#>  2 Ratio_2   9.71  8.13  0.838
#>  3 Ratio_5  -7.49 -5.07  0.677
#>  4 Ratio_11  4.26  2.84  0.666
#>  5 Ratio_15  2.63  1.79  0.681
#>  6 Ratio_14 -2.58 -1.54  0.595
#>  7 Ratio_17 -2.27 -1.48  0.650
#>  8 Ratio_8  -1.67 -1.14  0.684
#>  9 Ratio_12 -1.66 -0.654 0.395
#> 10 Ratio_3   1.37  1.46  1.06

# Performance: OLS vs Ridge on training data
pred_ols <- predict(fit_ols, newdata = data.frame(X_scaled))
pred_ridge <- predict(ridge_fit, newx = X_scaled, s = ridge_cv$lambda.min)

rmse_ols <- sqrt(mean((y - pred_ols)^2))
rmse_ridge <- sqrt(mean((y - pred_ridge)^2))

cat("\n\nTraining RMSE:\n")
#> 
#> 
#> Training RMSE:
cat("OLS:  ", round(rmse_ols, 2), "\n")
#> OLS:   29.99
cat("Ridge:", round(rmse_ridge, 2), "\n")
#> Ridge: 30.5

# Cross-validation RMSE
cv_rmse_ridge <- ridge_cv$cvm[which.min(ridge_cv$cvm)]
cat("Ridge CV RMSE:", round(sqrt(cv_rmse_ridge), 2), "\n")
#> Ridge CV RMSE: 32.46

Show code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, RidgeCV, LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

np.random.seed(42)
n = 300

# Generate correlated financial ratios
X_raw = np.random.normal(0, 1, (n, 20))
for j in range(1, 20):
    X_raw[:, j] = 0.7 * X_raw[:, j] + 0.3 * X_raw[:, j - 1] + np.random.normal(0, 0.2, n)

# Outcome
y = 100 + 15 * X_raw[:, 0] + 12 * X_raw[:, 1] - 8 * X_raw[:, 4] + np.random.normal(0, 30, n)
y = np.maximum(y, 0)

# Standardise
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)

# Find optimal λ via cross-validation
alphas = np.logspace(-2, 3, 100)
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X_scaled, y)

RidgeCV(alphas=array([1.00000000e-02, 1.12332403e-02, 1.26185688e-02, 1.41747416e-02,
       1.59228279e-02, 1.78864953e-02, 2.00923300e-02, 2.25701972e-02,
       2.53536449e-02, 2.84803587e-02, 3.19926714e-02, 3.59381366e-02,
       4.03701726e-02, 4.53487851e-02, 5.09413801e-02, 5.72236766e-02,
       6.42807312e-02, 7.22080902e-02, 8.11130831e-02, 9.11162756e-02,
       1.02353102e-01, 1.14975700e-0...
       6.89261210e+01, 7.74263683e+01, 8.69749003e+01, 9.77009957e+01,
       1.09749877e+02, 1.23284674e+02, 1.38488637e+02, 1.55567614e+02,
       1.74752840e+02, 1.96304065e+02, 2.20513074e+02, 2.47707636e+02,
       2.78255940e+02, 3.12571585e+02, 3.51119173e+02, 3.94420606e+02,
       4.43062146e+02, 4.97702356e+02, 5.59081018e+02, 6.28029144e+02,
       7.05480231e+02, 7.92482898e+02, 8.90215085e+02, 1.00000000e+03]),
        cv=5)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Show code


print(f"Cross-Validation Results for Ridge Regression:\n")
#> Cross-Validation Results for Ridge Regression:
print(f"Optimal α (λ): {ridge_cv.alpha_:.4f}")
#> Optimal α (λ): 97.7010
print(f"CV score (R²): {ridge_cv.score(X_scaled, y):.4f}\n")
#> CV score (R²): 0.2906

# Fit Ridge at optimal α
ridge_final = Ridge(alpha=ridge_cv.alpha_)
ridge_final.fit(X_scaled, y)

Ridge(alpha=np.float64(97.70099572992247))

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Show code


# OLS for comparison
ols = LinearRegression()
ols.fit(X_scaled, y)

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Show code


# Coefficients
coef_ridge = ridge_final.coef_
coef_ols = ols.coef_

# Create comparison dataframe
comparison = pd.DataFrame({
    'Variable': [f'Ratio_{i+1}' for i in range(20)],
    'OLS': coef_ols,
    'Ridge': coef_ridge
})
comparison['Shrinkage'] = 1 - (comparison['Ridge'] / comparison['OLS']).abs()

print("Comparison of OLS vs Ridge Coefficients (top 10 by OLS magnitude):")
#> Comparison of OLS vs Ridge Coefficients (top 10 by OLS magnitude):
print(comparison.sort_values('OLS', key=abs, ascending=False).head(10).to_string(index=False))
#> Variable       OLS     Ridge  Shrinkage
#>  Ratio_1 11.867982  9.404326   0.207588
#>  Ratio_2  9.713013  7.821754   0.194714
#>  Ratio_5 -5.222015 -3.936290   0.246212
#> Ratio_19 -3.099205 -2.316865   0.252433
#> Ratio_13  2.378263  1.987786   0.164186
#> Ratio_17 -1.909567 -1.314453   0.311649
#> Ratio_14 -1.821574 -1.498215   0.177516
#> Ratio_12 -1.625962 -0.972599   0.401832
#> Ratio_16  1.006905  0.423388   0.579515
#> Ratio_11  0.988707  0.690756   0.301354

# Plot coefficient paths
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Coefficient shrinkage
ax = axes[0]
x_pos = np.arange(20)
ax.scatter(x_pos, coef_ols, alpha=0.7, s=80, label='OLS', color='steelblue')
ax.scatter(x_pos, coef_ridge, alpha=0.7, s=80, label=f'Ridge (α={ridge_cv.alpha_:.3f})', color='darkred')
for i in range(20):
    ax.plot([i, i], [coef_ols[i], coef_ridge[i]], 'k--', alpha=0.3, linewidth=0.5)
ax.set_xlabel('Feature Index', fontsize=11)
ax.set_ylabel('Coefficient Value', fontsize=11)
ax.set_title('OLS vs Ridge: Coefficient Shrinkage', fontsize=12, fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

# Plot 2: Coefficient magnitude comparison
ax = axes[1]
ax.scatter(np.abs(coef_ols), np.abs(coef_ridge), alpha=0.7, s=80, color='steelblue')
ax.plot([0, np.abs(coef_ols).max()], [0, np.abs(coef_ols).max()], 'k--', linewidth=1, alpha=0.5)
ax.set_xlabel('|OLS coefficient|', fontsize=11)
ax.set_ylabel('|Ridge coefficient|', fontsize=11)
ax.set_title('Magnitude of Coefficients: OLS vs Ridge', fontsize=12, fontweight='bold')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('ridge_vs_ols.png', dpi=300, bbox_inches='tight')
plt.show()

Show code


# Cross-validation RMSE
ols_cv_scores = cross_val_score(LinearRegression(), X_scaled, y, cv=5, scoring='neg_mean_squared_error')
ridge_cv_scores = cross_val_score(Ridge(alpha=ridge_cv.alpha_), X_scaled, y, cv=5, scoring='neg_mean_squared_error')

ols_cv_rmse = np.sqrt(-ols_cv_scores.mean())
ridge_cv_rmse = np.sqrt(-ridge_cv_scores.mean())

print(f"\n\nCross-Validation RMSE:")
#> 
#> 
#> Cross-Validation RMSE:
print(f"OLS:  {ols_cv_rmse:.2f}")
#> OLS:  32.66
print(f"Ridge: {ridge_cv_rmse:.2f}")
#> Ridge: 32.17

Key insights: - Ridge shrinks all coefficients toward zero, with larger coefficients shrinking more - The optimal λ balances fit and complexity - Ridge is more stable than OLS when features are correlated - All features are retained; none is eliminated

📝 Section 10.2 Review Questions

Why does Ridge add a penalty on β² rather than |β|?
If λ = 0, Ridge reduces to OLS. If λ is very large, what happens to coefficients?
In the Nigerian SME example, why might Ridge perform better than OLS on new loan applications?
What is the trade-off between using a smaller vs larger λ?

15.3 Lasso Regression (L1 Regularisation)

Lasso (Least Absolute Shrinkage and Selection Operator) adds an L1 penalty (sum of absolute coefficients), which can shrink coefficients exactly to zero, providing automatic feature selection.

📘 Theory: Lasso Objective

Lasso minimises: \[S_{\text{Lasso}}(\beta) = \sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij})^2 + \lambda \sum_{j=1}^{p} |\beta_j|\]

Unlike Ridge, there is no closed-form solution. The solution is found via coordinate descent, updating one coefficient at a time while holding others fixed.

Soft-thresholding: For a univariate problem, the Lasso solution has a simple form: \[\hat{\beta}_{\text{Lasso}} = \text{sign}(\hat{\beta}_{\text{OLS}}) \max(|\hat{\beta}_{\text{OLS}}| - \lambda, 0)\]

If |β̂_OLS| ≤ λ, then β̂_Lasso = 0 (coefficient is eliminated)
If |β̂_OLS| > λ, then β̂_Lasso = β̂_OLS − λ · sign(β̂_OLS) (coefficient is shrunk)

🔑 Key Formula: Lasso Objective and Sparsity

\[S_{\text{Lasso}}(\beta) = \text{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|\]

Why Lasso produces sparsity (geometric intuition):

In Ridge, the penalty is an ellipse (β₁² + β₂² = c), which smoothly touches the contours of the objective. In Lasso, the penalty is a diamond (|β₁| + |β₂| = c), whose corners coincide with the axes (β_j = 0). Thus, the optimum often occurs at a corner, setting some coefficients exactly to zero.

15.3.1 Ridge vs Lasso: When to Use Which

Aspect	Ridge	Lasso
Penalty	L2 (β²)	L1 (\|β\|)
Feature selection	Shrinks all, eliminates none	Shrinks and eliminates
Use case	Many relevant predictors, moderate correlation	Want a sparse model, high-dimensional data
Interpretation	All features retained; harder to explain	Only non-zero features matter; easier to explain
Computation	Fast (closed form)	Slower (iterative, but still polynomial)

15.3.2 Worked Example: Lasso for Feature Selection

Show code

# Lasso regression with automatic feature selection

# Use same data as Ridge example
# Fit Lasso with range of λ
lasso_fit <- glmnet(X_scaled, y, alpha = 1, lambda = 10^seq(3, -2, length = 100))

# Cross-validation
lasso_cv <- cv.glmnet(X_scaled, y, alpha = 1, nfolds = 5)

cat("Cross-Validation Results for Lasso Regression:\n\n")
#> Cross-Validation Results for Lasso Regression:
cat("Optimal λ (minimum CV error):", round(lasso_cv$lambda.min, 4), "\n")
#> Optimal λ (minimum CV error): 2.4987
cat("λ (one standard error rule):", round(lasso_cv$lambda.1se, 4), "\n")
#> λ (one standard error rule): 5.2595
cat("Number of non-zero coefficients at λ_min:", sum(coef(lasso_fit, s = lasso_cv$lambda.min) != 0) - 1, "\n")
#> Number of non-zero coefficients at λ_min: 5
cat("Number of non-zero coefficients at λ_1se:", sum(coef(lasso_fit, s = lasso_cv$lambda.1se) != 0) - 1, "\n\n")
#> Number of non-zero coefficients at λ_1se: 3

# Coefficients at both λ values
coef_lasso_min <- coef(lasso_fit, s = lasso_cv$lambda.min)[-1]
coef_lasso_1se <- coef(lasso_fit, s = lasso_cv$lambda.1se)[-1]

cat("Selected features (λ_min):\n")
#> Selected features (λ_min):
selected_min <- which(coef_lasso_min != 0)
if (length(selected_min) > 0) {
  for (j in selected_min) {
    cat(sprintf("  Ratio_%d: %.3f\n", j, coef_lasso_min[j]))
  }
} else {
  cat("  (none)\n")
}
#>   Ratio_1: 13.532
#>   Ratio_2: 8.408
#>   Ratio_5: -4.750
#>   Ratio_11: 1.068
#>   Ratio_14: -0.002

cat("\nSelected features (λ_1se):\n")
#> 
#> Selected features (λ_1se):
selected_1se <- which(coef_lasso_1se != 0)
if (length(selected_1se) > 0) {
  for (j in selected_1se) {
    cat(sprintf("  Ratio_%d: %.3f\n", j, coef_lasso_1se[j]))
  }
} else {
  cat("  (none)\n")
}
#>   Ratio_1: 11.508
#>   Ratio_2: 6.561
#>   Ratio_5: -1.953

# Visualise
par(mfrow = c(1, 2), mar = c(5, 5, 3, 1))

# Plot 1: CV curve
plot(lasso_cv, main = "Lasso Regression: Cross-Validation")
abline(v = log(lasso_cv$lambda.min), col = "red", lty = 2, lwd = 2)
abline(v = log(lasso_cv$lambda.1se), col = "orange", lty = 2, lwd = 2)
legend("topleft", legend = c("λ_min", "λ_1se"),
       col = c("red", "orange"), lty = 2, lwd = 2)

# Plot 2: Coefficient path (showing sparsity)
plot(lasso_fit, xvar = "lambda", main = "Lasso Regression: Coefficient Path (Sparsity)")
abline(v = log(lasso_cv$lambda.min), col = "red", lty = 2, lwd = 2)
abline(v = log(lasso_cv$lambda.1se), col = "orange", lty = 2, lwd = 2)

Show code


par(mfrow = c(1, 1))

# Compare Ridge and Lasso
cat("\n\nComparison of Ridge vs Lasso (coefficient paths):\n\n")
#> 
#> 
#> Comparison of Ridge vs Lasso (coefficient paths):
ridge_cv <- cv.glmnet(X_scaled, y, alpha = 0, nfolds = 5)
coef_ridge_opt <- coef(ridge_fit, s = ridge_cv$lambda.min)[-1]

comparison_rl <- tibble(
  Variable = colnames(X_scaled),
  Ridge = coef_ridge_opt,
  Lasso_min = coef_lasso_min,
  Lasso_1se = coef_lasso_1se
)

cat("Features with non-zero Lasso coefficients:\n")
#> Features with non-zero Lasso coefficients:
print(comparison_rl |> filter(Lasso_min != 0 | Lasso_1se != 0) |>
  arrange(desc(abs(Lasso_min))))
#> # A tibble: 5 × 4
#>   Variable Ridge Lasso_min Lasso_1se
#>   <chr>    <dbl>     <dbl>     <dbl>
#> 1 Ratio_1  12.9   13.5         11.5 
#> 2 Ratio_2   8.86   8.41         6.56
#> 3 Ratio_5  -5.97  -4.75        -1.95
#> 4 Ratio_11  3.34   1.07         0   
#> 5 Ratio_14 -1.92  -0.00235      0

# Visualization: Ridge vs Lasso paths
par(mfrow = c(1, 2), mar = c(5, 5, 3, 1))

plot(ridge_fit, xvar = "lambda", main = "Ridge: All coefficients retained")
abline(v = log(ridge_cv$lambda.min), col = "blue", lty = 2, lwd = 2)

plot(lasso_fit, xvar = "lambda", main = "Lasso: Sparse solution (some coefficients zero)")
abline(v = log(lasso_cv$lambda.min), col = "red", lty = 2, lwd = 2)

Show code


par(mfrow = c(1, 1))

Show code

from sklearn.linear_model import Lasso, LassoCV
import numpy as np

# Fit Lasso with cross-validation
alphas = np.logspace(-2, 3, 100)
lasso_cv = LassoCV(alphas=alphas, cv=5, max_iter=10000)
lasso_cv.fit(X_scaled, y)

LassoCV(alphas=array([1.00000000e-02, 1.12332403e-02, 1.26185688e-02, 1.41747416e-02,
       1.59228279e-02, 1.78864953e-02, 2.00923300e-02, 2.25701972e-02,
       2.53536449e-02, 2.84803587e-02, 3.19926714e-02, 3.59381366e-02,
       4.03701726e-02, 4.53487851e-02, 5.09413801e-02, 5.72236766e-02,
       6.42807312e-02, 7.22080902e-02, 8.11130831e-02, 9.11162756e-02,
       1.02353102e-01, 1.14975700e-0...
       6.89261210e+01, 7.74263683e+01, 8.69749003e+01, 9.77009957e+01,
       1.09749877e+02, 1.23284674e+02, 1.38488637e+02, 1.55567614e+02,
       1.74752840e+02, 1.96304065e+02, 2.20513074e+02, 2.47707636e+02,
       2.78255940e+02, 3.12571585e+02, 3.51119173e+02, 3.94420606e+02,
       4.43062146e+02, 4.97702356e+02, 5.59081018e+02, 6.28029144e+02,
       7.05480231e+02, 7.92482898e+02, 8.90215085e+02, 1.00000000e+03]),
        cv=5, max_iter=10000)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Show code


print(f"Cross-Validation Results for Lasso Regression:\n")
#> Cross-Validation Results for Lasso Regression:
print(f"Optimal α (λ): {lasso_cv.alpha_:.4f}")
#> Optimal α (λ): 1.8738
print(f"Number of non-zero coefficients: {np.sum(lasso_cv.coef_ != 0)}\n")
#> Number of non-zero coefficients: 5

# Selected features
coef_lasso = lasso_cv.coef_
selected_features = np.where(coef_lasso != 0)[0]

print(f"Selected features (non-zero Lasso coefficients):")
#> Selected features (non-zero Lasso coefficients):
for j in selected_features:
    print(f"  Ratio_{j+1}: {coef_lasso[j]:.4f}")
#>   Ratio_1: 10.9667
#>   Ratio_2: 8.4401
#>   Ratio_5: -3.5819
#>   Ratio_13: 0.3075
#>   Ratio_19: -1.5577

if len(selected_features) == 0:
    print("  (all coefficients shrunk to zero)")

# Compare Ridge and Lasso
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Coefficient comparison
ax = axes[0]
x_pos = np.arange(20)
ax.scatter(x_pos, coef_ridge, alpha=0.7, s=80, label='Ridge', color='steelblue')
ax.scatter(x_pos, coef_lasso, alpha=0.7, s=80, label='Lasso', color='darkred')
ax.axhline(y=0, color='black', linestyle='-', linewidth=0.5, alpha=0.5)
ax.set_xlabel('Feature Index', fontsize=11)
ax.set_ylabel('Coefficient Value', fontsize=11)
ax.set_title('Ridge vs Lasso Coefficients', fontsize=12, fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

# Plot 2: Number of non-zero coefficients
ax = axes[1]
lambda_seq = np.logspace(-2, 3, 50)
ridge_nonzero = np.full(len(lambda_seq), 20)  # Ridge never zeros out
lasso_nonzero = []
for lam in lambda_seq:
    lasso_temp = Lasso(alpha=lam, max_iter=10000)
    lasso_temp.fit(X_scaled, y)
    lasso_nonzero.append(np.sum(lasso_temp.coef_ != 0))

Lasso(alpha=np.float64(1000.0), max_iter=10000)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Show code


ax.semilogx(lambda_seq, ridge_nonzero, 'o-', label='Ridge', color='steelblue', linewidth=2)
ax.semilogx(lambda_seq, lasso_nonzero, 's-', label='Lasso', color='darkred', linewidth=2)
ax.set_xlabel('λ (regularisation parameter)', fontsize=11)
ax.set_ylabel('Number of Non-Zero Coefficients', fontsize=11)
ax.set_title('Ridge vs Lasso: Feature Selection Path', fontsize=12, fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_ylim([0, 22])
#> (0.0, 22.0)

plt.tight_layout()
plt.savefig('ridge_vs_lasso.png', dpi=300, bbox_inches='tight')
plt.show()

Show code


print(f"\n\nLasso Sparse Solution:")
#> 
#> 
#> Lasso Sparse Solution:
print(f"Features eliminated: {np.sum(coef_lasso == 0)} out of 20")
#> Features eliminated: 15 out of 20
print(f"Features retained: {np.sum(coef_lasso != 0)} out of 20")
#> Features retained: 5 out of 20

Key insight: At λ_1se (one standard error rule), Lasso selects only a handful of true drivers, dramatically simplifying the model while maintaining predictive accuracy.

📝 Section 10.3 Review Questions

Why does the Lasso objective’s diamond-shaped constraint produce sparsity, while Ridge’s ellipse does not?
In the SME credit example, suppose Lasso selects only Ratio_1 and Ratio_2. How would you use this for business decisions?
Compare interpretation difficulty: Ridge (all features) vs Lasso (sparse features).
Can Lasso with a given λ select more features than Ridge with the same λ? Explain.

15.4 Elastic Net: Combining L1 and L2

Elastic Net combines Ridge (L2) and Lasso (L1) penalties, capturing benefits of both.

📘 Theory: Elastic Net Objective

\[S_{\text{EN}}(\beta) = \sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j} \beta_j x_{ij})^2 + \lambda \left[ (1 - \alpha) \frac{1}{2} \sum_{j} \beta_j^2 + \alpha \sum_{j} |\beta_j| \right]\]

where: - α ∈ [0, 1] is the mixing parameter - α = 0: Ridge - α = 1: Lasso - α ∈ (0, 1): combination of both

When Elastic Net excels:

Grouping effect: If several features are correlated and relevant, Lasso may arbitrarily select one. Elastic Net’s L2 component encourages all to be included together.
High-dimensional data: When p >> n, Elastic Net is more stable than pure Lasso.
Unknown feature relationships: When you don’t know if features are important individually or as groups, Elastic Net hedges bets.

🔑 Key Formula: Elastic Net

\[S_{\text{EN}}(\beta) = \text{RSS} + \lambda \left[ (1 - \alpha) \frac{||β||_2^2}{2} + \alpha ||β||_1 \right]\]

Intuition: - L2 component: shrinks all coefficients, encourages grouping of correlated predictors - L1 component: performs feature selection, sets some coefficients to zero - Balance via α: tune based on domain knowledge or CV

15.4.1 Worked Example: Elastic Net

Show code

# Elastic Net with varying α (mixing parameter)

# Fit Elastic Net models with different α values
en_models <- list()
en_cv_results <- list()

alpha_values <- c(0, 0.25, 0.5, 0.75, 1)

for (alpha_val in alpha_values) {
  fit <- glmnet(X_scaled, y, alpha = alpha_val, lambda = 10^seq(3, -2, length = 100))
  fit_cv <- cv.glmnet(X_scaled, y, alpha = alpha_val, nfolds = 5)

  en_models[[as.character(alpha_val)]] <- fit
  en_cv_results[[as.character(alpha_val)]] <- fit_cv
}

cat("Elastic Net: Model Comparison Across α values\n\n")
#> Elastic Net: Model Comparison Across α values

comparison_en <- tibble(
  Alpha = alpha_values,
  Method = c("Ridge", "EN (0.25)", "EN (0.5)", "EN (0.75)", "Lasso"),
  Lambda_min = sapply(en_cv_results, function(x) x$lambda.min),
  CV_MSE = sapply(en_cv_results, function(x) min(x$cvm)),
  Nonzero_coefs = sapply(alpha_values, function(a) {
    sum(coef(en_models[[as.character(a)]], s = en_cv_results[[as.character(a)]]$lambda.min) != 0) - 1
  })
)

print(comparison_en)
#> # A tibble: 5 × 5
#>   Alpha Method    Lambda_min CV_MSE Nonzero_coefs
#>   <dbl> <chr>          <dbl>  <dbl>         <dbl>
#> 1  0    Ridge          15.0   1061.            20
#> 2  0.25 EN (0.25)       4.75   988.            13
#> 3  0.5  EN (0.5)        4.15  1004.             6
#> 4  0.75 EN (0.75)       3.04   990.             6
#> 5  1    Lasso           2.50  1003.             5

# Visualise coefficient paths for different α
par(mfrow = c(2, 3), mar = c(4, 4, 2, 1))

method_labels <- c("Ridge", "EN", "EN", "EN", "Lasso")
for (i in seq_along(alpha_values)) {
  alpha_val <- alpha_values[i]
  plot(en_models[[as.character(alpha_val)]], xvar = "lambda",
       main = sprintf("α = %s (%s)", alpha_val, method_labels[i]))
  abline(v = log(en_cv_results[[as.character(alpha_val)]]$lambda.min),
         col = "red", lty = 2, lwd = 1)
}

par(mfrow = c(1, 1))

Show code


cat("\n\nOptimal models by α:\n")
#> 
#> 
#> Optimal models by α:
for (alpha_val in alpha_values) {
  cv_result <- en_cv_results[[as.character(alpha_val)]]
  coefs <- coef(en_models[[as.character(alpha_val)]], s = cv_result$lambda.min)[-1]
  nonzero <- sum(coefs != 0)

  cat(sprintf("α = %s: λ = %.4f, Non-zero coefficients = %d\n",
             alpha_val, cv_result$lambda.min, nonzero))
}
#> α = 0: λ = 14.9793, Non-zero coefficients = 20
#> α = 0.25: λ = 4.7483, Non-zero coefficients = 13
#> α = 0.5: λ = 4.1489, Non-zero coefficients = 6
#> α = 0.75: λ = 3.0356, Non-zero coefficients = 6
#> α = 1: λ = 2.4987, Non-zero coefficients = 5

Show code

from sklearn.linear_model import ElasticNetCV
import numpy as np

# Elastic Net with varying α
alpha_values = [0, 0.25, 0.5, 0.75, 1]
l1_ratios = alpha_values  # α in ElasticNetCV is the l1_ratio

en_models = {}
en_results = []

for l1_ratio in l1_ratios:
    en_cv = ElasticNetCV(l1_ratio=l1_ratio, alphas=np.logspace(-2, 3, 100), cv=5, max_iter=10000)
    en_cv.fit(X_scaled, y)
    en_models[l1_ratio] = en_cv

    en_results.append({
        'Alpha (L1 ratio)': l1_ratio,
        'Method': ['Ridge', 'EN (0.25)', 'EN (0.5)', 'EN (0.75)', 'Lasso'][int(l1_ratio * 4)],
        'Lambda': en_cv.alpha_,
        'CV R²': en_cv.score(X_scaled, y),
        'Non-zero coefs': np.sum(en_cv.coef_ != 0)
    })

ElasticNetCV(alphas=array([1.00000000e-02, 1.12332403e-02, 1.26185688e-02, 1.41747416e-02,
       1.59228279e-02, 1.78864953e-02, 2.00923300e-02, 2.25701972e-02,
       2.53536449e-02, 2.84803587e-02, 3.19926714e-02, 3.59381366e-02,
       4.03701726e-02, 4.53487851e-02, 5.09413801e-02, 5.72236766e-02,
       6.42807312e-02, 7.22080902e-02, 8.11130831e-02, 9.11162756e-02,
       1.02353102e-01, 1.149757...
       6.89261210e+01, 7.74263683e+01, 8.69749003e+01, 9.77009957e+01,
       1.09749877e+02, 1.23284674e+02, 1.38488637e+02, 1.55567614e+02,
       1.74752840e+02, 1.96304065e+02, 2.20513074e+02, 2.47707636e+02,
       2.78255940e+02, 3.12571585e+02, 3.51119173e+02, 3.94420606e+02,
       4.43062146e+02, 4.97702356e+02, 5.59081018e+02, 6.28029144e+02,
       7.05480231e+02, 7.92482898e+02, 8.90215085e+02, 1.00000000e+03]),
             cv=5, l1_ratio=1, max_iter=10000)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Show code


en_comparison = pd.DataFrame(en_results)
print("Elastic Net: Model Comparison Across α values\n")
#> Elastic Net: Model Comparison Across α values
print(en_comparison.to_string(index=False))
#>  Alpha (L1 ratio)    Method   Lambda    CV R²  Non-zero coefs
#>              0.00     Ridge 0.367838 0.288151              20
#>              0.25 EN (0.25) 0.521401 0.285104              17
#>              0.50  EN (0.5) 0.739072 0.283013              15
#>              0.75 EN (0.75) 1.176812 0.279651              10
#>              1.00     Lasso 1.873817 0.283622               5

# Visualise sparsity vs α
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Non-zero coefficients vs α
ax = axes[0]
nonzero_counts = [np.sum(en_models[lr].coef_ != 0) for lr in l1_ratios]
ax.plot(alpha_values, nonzero_counts, 'o-', linewidth=2, markersize=8, color='steelblue')
ax.set_xlabel('α (mixing parameter: 0=Ridge, 1=Lasso)', fontsize=11)
ax.set_ylabel('Number of Non-Zero Coefficients', fontsize=11)
ax.set_title('Sparsity vs α in Elastic Net', fontsize=12, fontweight='bold')
ax.set_xticks(alpha_values)
ax.grid(True, alpha=0.3)

# Plot 2: CV score vs α
ax = axes[1]
cv_scores = [en_models[lr].score(X_scaled, y) for lr in l1_ratios]
ax.plot(alpha_values, cv_scores, 's-', linewidth=2, markersize=8, color='darkred')
ax.set_xlabel('α (mixing parameter: 0=Ridge, 1=Lasso)', fontsize=11)
ax.set_ylabel('Cross-Validation R²', fontsize=11)
ax.set_title('Model Performance vs α', fontsize=12, fontweight='bold')
ax.set_xticks(alpha_values)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('elasticnet_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

Show code


print(f"\n\nInterpretation:")
#> 
#> 
#> Interpretation:
for l1_ratio in l1_ratios:
    coef = en_models[l1_ratio].coef_
    nonzero = np.sum(coef != 0)
    method = ['Ridge', 'EN', 'EN', 'EN', 'Lasso'][int(l1_ratio * 4)]
    print(f"{method} (α={l1_ratio}): {nonzero} features selected, R² = {en_models[l1_ratio].score(X_scaled, y):.4f}")
#> Ridge (α=0): 20 features selected, R² = 0.2882
#> EN (α=0.25): 17 features selected, R² = 0.2851
#> EN (α=0.5): 15 features selected, R² = 0.2830
#> EN (α=0.75): 10 features selected, R² = 0.2797
#> Lasso (α=1): 5 features selected, R² = 0.2836

📝 Section 10.4 Review Questions

When would you choose α = 0.5 over pure Ridge or pure Lasso?
If your data has many correlated features and you want a sparse solution, why might Elastic Net (α ≈ 0.7) outperform Lasso?
How would you select α: via grid search or prior knowledge?

15.5 Choosing the Penalty: Cross-Validation

The regularisation parameter λ (and α in Elastic Net) must be selected using cross-validation, not by minimising training error.

15.5.1 The One-Standard-Error Rule

When CV error curves are flat near the minimum, the one-standard-error rule suggests:

Find λ with minimum CV error: λ_min
Find the simplest model within one SE of λ_min: λ_1se
Use λ_1se for parsimony (fewer non-zero coefficients, similar accuracy)

📘 Theory: Cross-Validation for Regularisation

K-Fold CV procedure:

Randomly divide data into k equal folds
For each fold i = 1, …, k:
- Leave fold i out (test set)
- Fit model on remaining k−1 folds (training set)
- Compute error on fold i
Average error across k folds: CV error
Repeat for each λ in a sequence
Choose λ that minimises CV error (or λ_1se for parsimony)

Why this works: CV error estimates generalisation error on truly held-out data, avoiding overfitting.

15.5.2 Worked Example: CV Grid Search

Show code

# Comprehensive CV for hyperparameter tuning

# Already computed above, but let's visualise the full grid

# Lasso CV with detailed output
lasso_cv_full <- cv.glmnet(X_scaled, y, alpha = 1, nfolds = 5, nlambda = 100)

# Extract CV results
lambda_seq <- lasso_cv_full$lambda
cv_mean <- lasso_cv_full$cvm
cv_sd <- lasso_cv_full$cvsd

# Plot with confidence bands
par(mar = c(5, 5, 3, 1))

plot(log(lambda_seq), cv_mean, type = "b", pch = 16, cex = 0.5,
     xlab = "log(λ)", ylab = "Cross-Validation MSE",
     main = "Lasso: Cross-Validation Error vs λ",
     ylim = range(cv_mean - cv_sd, cv_mean + cv_sd))

# Add confidence band
segments(log(lambda_seq), cv_mean - cv_sd, log(lambda_seq), cv_mean + cv_sd, col = "lightgray")

# Mark λ_min and λ_1se
abline(v = log(lasso_cv_full$lambda.min), col = "red", lty = 2, lwd = 2, label = "λ_min")
abline(v = log(lasso_cv_full$lambda.1se), col = "orange", lty = 2, lwd = 2, label = "λ_1se")

legend("topleft", legend = c("λ_min (minimum error)", "λ_1se (one SE rule)"),
       col = c("red", "orange"), lty = 2, lwd = 2)

Show code


# Summary
cat("Cross-Validation Summary (Lasso):\n\n")
#> Cross-Validation Summary (Lasso):
cat("λ_min =", round(lasso_cv_full$lambda.min, 4), "\n")
#> λ_min = 2.2767
cat("CV error at λ_min =", round(min(lasso_cv_full$cvm), 2), "\n")
#> CV error at λ_min = 987.58
cat("Number of non-zero coefs at λ_min =", sum(coef(lasso_cv_full, s = "lambda.min") != 0) - 1, "\n\n")
#> Number of non-zero coefs at λ_min = 5

cat("λ_1se =", round(lasso_cv_full$lambda.1se, 4), "\n")
#> λ_1se = 6.9528
cat("CV error at λ_1se =", round(lasso_cv_full$cvm[lasso_cv_full$lambda == lasso_cv_full$lambda.1se], 2), "\n")
#> CV error at λ_1se = 1071.67
cat("Number of non-zero coefs at λ_1se =", sum(coef(lasso_cv_full, s = "lambda.1se") != 0) - 1, "\n")
#> Number of non-zero coefs at λ_1se = 3

# Recommendation
cat("\n\nRecommendation:\n")
#> 
#> 
#> Recommendation:
cat("For predictive accuracy: use λ_min\n")
#> For predictive accuracy: use λ_min
cat("For interpretability and sparsity: use λ_1se\n")
#> For interpretability and sparsity: use λ_1se

Show code

# Detailed CV analysis with plotting

# Refit Lasso with detailed CV
from sklearn.linear_model import LassoCV

lasso_cv_full = LassoCV(alphas=np.logspace(-2, 3, 100), cv=5, max_iter=10000)
lasso_cv_full.fit(X_scaled, y)

LassoCV(alphas=array([1.00000000e-02, 1.12332403e-02, 1.26185688e-02, 1.41747416e-02,
       1.59228279e-02, 1.78864953e-02, 2.00923300e-02, 2.25701972e-02,
       2.53536449e-02, 2.84803587e-02, 3.19926714e-02, 3.59381366e-02,
       4.03701726e-02, 4.53487851e-02, 5.09413801e-02, 5.72236766e-02,
       6.42807312e-02, 7.22080902e-02, 8.11130831e-02, 9.11162756e-02,
       1.02353102e-01, 1.14975700e-0...
       6.89261210e+01, 7.74263683e+01, 8.69749003e+01, 9.77009957e+01,
       1.09749877e+02, 1.23284674e+02, 1.38488637e+02, 1.55567614e+02,
       1.74752840e+02, 1.96304065e+02, 2.20513074e+02, 2.47707636e+02,
       2.78255940e+02, 3.12571585e+02, 3.51119173e+02, 3.94420606e+02,
       4.43062146e+02, 4.97702356e+02, 5.59081018e+02, 6.28029144e+02,
       7.05480231e+02, 7.92482898e+02, 8.90215085e+02, 1.00000000e+03]),
        cv=5, max_iter=10000)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Show code


# Extract alphas and CV scores

# Let's compute manually using cross_val_score
from sklearn.model_selection import cross_validate

# Create array of alphas
alphas_manual = np.logspace(-2, 3, 100)
cv_scores_manual = []

for alpha in alphas_manual:
    lasso_temp = Lasso(alpha=alpha, max_iter=10000)
    cv_scores = -cross_val_score(lasso_temp, X_scaled, y, cv=5, scoring='neg_mean_squared_error')
    cv_scores_manual.append(cv_scores.mean())

cv_scores_manual = np.array(cv_scores_manual)

# Find λ_min and λ_1se
min_idx = np.argmin(cv_scores_manual)
lambda_min = alphas_manual[min_idx]
min_score = cv_scores_manual[min_idx]

# One standard error rule
cv_std_manual = np.std(cv_scores_manual)
threshold = min_score + cv_std_manual
idx_1se = np.where(cv_scores_manual <= threshold)[0][0]
lambda_1se = alphas_manual[idx_1se]

print(f"Cross-Validation Summary (Lasso):\n")
#> Cross-Validation Summary (Lasso):
print(f"λ_min = {lambda_min:.4f}")
#> λ_min = 1.8738
print(f"CV error at λ_min = {min_score:.2f}")
#> CV error at λ_min = 987.21

lasso_min = Lasso(alpha=lambda_min, max_iter=10000)
lasso_min.fit(X_scaled, y)

Lasso(alpha=np.float64(1.873817422860383), max_iter=10000)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Show code

nonzero_min = np.sum(lasso_min.coef_ != 0)
print(f"Number of non-zero coefs at λ_min = {nonzero_min}\n")
#> Number of non-zero coefs at λ_min = 5

print(f"λ_1se = {lambda_1se:.4f}")
#> λ_1se = 0.0100
lasso_1se = Lasso(alpha=lambda_1se, max_iter=10000)
lasso_1se.fit(X_scaled, y)

Lasso(alpha=np.float64(0.01), max_iter=10000)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Show code

nonzero_1se = np.sum(lasso_1se.coef_ != 0)
print(f"CV error at λ_1se = {cv_scores_manual[idx_1se]:.2f}")
#> CV error at λ_1se = 1065.51
print(f"Number of non-zero coefs at λ_1se = {nonzero_1se}")
#> Number of non-zero coefs at λ_1se = 20

# Plot
fig, ax = plt.subplots(figsize=(10, 6))

ax.plot(np.log(alphas_manual), cv_scores_manual, 'o-', linewidth=2, markersize=6, color='steelblue')
ax.fill_between(np.log(alphas_manual),
                 cv_scores_manual - cv_std_manual,
                 cv_scores_manual + cv_std_manual,
                 alpha=0.2, color='steelblue')

ax.axvline(np.log(lambda_min), color='red', linestyle='--', linewidth=2, label=f'λ_min = {lambda_min:.4f}')
ax.axvline(np.log(lambda_1se), color='orange', linestyle='--', linewidth=2, label=f'λ_1se = {lambda_1se:.4f}')
ax.axhline(min_score + cv_std_manual, color='green', linestyle=':', linewidth=1, alpha=0.5, label='±1 SE band')

ax.set_xlabel('log(λ)', fontsize=11)
ax.set_ylabel('Cross-Validation MSE', fontsize=11)
ax.set_title('Lasso: Cross-Validation Error vs λ', fontsize=12, fontweight='bold')
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('cv_lambda_selection.png', dpi=300, bbox_inches='tight')
plt.show()

Show code


print(f"\n\nRecommendation:")
#> 
#> 
#> Recommendation:
print(f"For predictive accuracy: use λ_min = {lambda_min:.4f} ({nonzero_min} features)")
#> For predictive accuracy: use λ_min = 1.8738 (5 features)
print(f"For interpretability: use λ_1se = {lambda_1se:.4f} ({nonzero_1se} features)")
#> For interpretability: use λ_1se = 0.0100 (20 features)

📝 Section 10.5 Review Questions

Why do we use cross-validation to select λ, rather than minimising training error?
Explain the one-standard-error rule. Why is it useful?
If the CV error curve is flat, does the choice of λ matter?

15.6 Automated Feature Selection

Beyond regularisation, several automated methods select important features.

📘 Theory: Feature Selection Methods

Forward stepwise selection:
- Start with no predictors
- At each step, add the predictor that most improves model fit
- Continue until stopping rule (e.g., BIC increase, p-value threshold)
- Fast but greedy (local optimum)
Backward elimination:
- Start with all predictors
- Remove the least important predictor at each step
- Stop when all remaining predictors are significant
- Often better than forward, but slower
Recursive Feature Elimination (RFE):
- Fit model, rank features by importance
- Remove lowest-ranked feature(s)
- Refit and repeat
- Unbiased ranking via cross-validation
Importance-based selection:
- For tree-based models: feature importance
- For linear models: |coefficient| or |t-statistic|
- Select top k features by importance

15.6.1 Worked Example: RFE and Automated Selection

Show code

library(caret)

# Recursive Feature Elimination (RFE)
rfe_ctrl <- rfeControl(functions = lmFuncs, method = "cv", number = 5)
rfe_result <- rfe(X_scaled, y, sizes = c(1, 2, 3, 5, 8, 12, 15, 20),
                   rfeControl = rfe_ctrl)

cat("Recursive Feature Elimination Results:\n\n")
#> Recursive Feature Elimination Results:
print(rfe_result)
#> 
#> Recursive feature selection
#> 
#> Outer resampling method: Cross-Validated (5 fold) 
#> 
#> Resampling performance over subset size:
#> 
#>  Variables  RMSE Rsquared   MAE RMSESD RsquaredSD MAESD Selected
#>          1 32.93   0.2599 26.28  2.275    0.09741 1.983         
#>          2 31.49   0.3173 25.23  2.387    0.08066 1.768         
#>          3 30.87   0.3471 24.79  2.658    0.06661 2.028        *
#>          5 31.73   0.3145 25.61  2.665    0.05563 2.259         
#>          8 32.22   0.2961 26.13  2.673    0.05586 2.326         
#>         12 32.44   0.2887 26.22  2.936    0.05777 2.440         
#>         15 32.41   0.2902 26.15  2.912    0.05393 2.337         
#>         20 32.48   0.2878 26.20  2.858    0.05223 2.298         
#> 
#> The top 3 variables (out of 3):
#>    Ratio_1, Ratio_2, Ratio_5

cat("\n\nOptimal number of features:", rfe_result$bestSubset, "\n")
#> 
#> 
#> Optimal number of features: 3
cat("Selected features:\n")
#> Selected features:
print(rfe_result$finalVariables)
#> NULL

# Visualise RFE
plot(rfe_result, type = "b", main = "RFE: RMSE vs Number of Features")

Show code


# Alternative: Importance-based selection (using absolute coefficients)
fit_full <- lm(y ~ ., data = data.frame(X_scaled))
coef_abs <- abs(coef(fit_full)[-1])
importance <- data.frame(
  Feature = colnames(X_scaled),
  Importance = coef_abs
) |> arrange(desc(Importance))

cat("\n\nTop 10 features by coefficient magnitude:\n")
#> 
#> 
#> Top 10 features by coefficient magnitude:
print(head(importance, 10))
#>           Feature Importance
#> Ratio_1   Ratio_1  15.218863
#> Ratio_2   Ratio_2   9.707944
#> Ratio_5   Ratio_5   7.493934
#> Ratio_11 Ratio_11   4.256989
#> Ratio_15 Ratio_15   2.627945
#> Ratio_14 Ratio_14   2.582894
#> Ratio_17 Ratio_17   2.270790
#> Ratio_8   Ratio_8   1.665471
#> Ratio_12 Ratio_12   1.657476
#> Ratio_3   Ratio_3   1.373523

# Plot
par(mar = c(5, 5, 3, 1))
barplot(importance$Importance[1:10], names.arg = importance$Feature[1:10],
        xlab = "Feature", ylab = "Absolute Coefficient",
        main = "Feature Importance: Top 10",
        las = 2)

Show code

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Recursive Feature Elimination
lr = LinearRegression()
rfe = RFE(lr, n_features_to_select=8, step=1)
rfe.fit(X_scaled, y)

RFE(estimator=LinearRegression(), n_features_to_select=8)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Show code


print("Recursive Feature Elimination Results:\n")
#> Recursive Feature Elimination Results:
print(f"Number of features selected: {rfe.n_features_}")
#> Number of features selected: 8
selected_features = np.array([f'Ratio_{i+1}' for i in range(20)])[rfe.support_]
print(f"Selected features: {list(selected_features)}\n")
#> Selected features: [np.str_('Ratio_1'), np.str_('Ratio_2'), np.str_('Ratio_5'), np.str_('Ratio_13'), np.str_('Ratio_14'), np.str_('Ratio_16'), np.str_('Ratio_17'), np.str_('Ratio_19')]

# Feature ranking
feature_ranking = pd.DataFrame({
    'Feature': [f'Ratio_{i+1}' for i in range(20)],
    'Ranking': rfe.ranking_,
    'Selected': rfe.support_
})
print("RFE Feature Ranking:")
#> RFE Feature Ranking:
print(feature_ranking.sort_values('Ranking').to_string(index=False))
#>  Feature  Ranking  Selected
#>  Ratio_1        1      True
#>  Ratio_2        1      True
#>  Ratio_5        1      True
#> Ratio_14        1      True
#> Ratio_16        1      True
#> Ratio_13        1      True
#> Ratio_19        1      True
#> Ratio_17        1      True
#> Ratio_12        2     False
#> Ratio_11        3     False
#>  Ratio_3        4     False
#> Ratio_15        5     False
#>  Ratio_9        6     False
#>  Ratio_7        7     False
#>  Ratio_8        8     False
#> Ratio_18        9     False
#>  Ratio_4       10     False
#>  Ratio_6       11     False
#> Ratio_10       12     False
#> Ratio_20       13     False

# Importance-based selection (coefficient magnitude)
fit_full = LinearRegression()
fit_full.fit(X_scaled, y)

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Show code

coef_abs = np.abs(fit_full.coef_)

importance_df = pd.DataFrame({
    'Feature': [f'Ratio_{i+1}' for i in range(20)],
    'Importance': coef_abs
}).sort_values('Importance', ascending=False)

print("\n\nTop 10 features by coefficient magnitude:")
#> 
#> 
#> Top 10 features by coefficient magnitude:
print(importance_df.head(10).to_string(index=False))
#>  Feature  Importance
#>  Ratio_1   11.867982
#>  Ratio_2    9.713013
#>  Ratio_5    5.222015
#> Ratio_19    3.099205
#> Ratio_13    2.378263
#> Ratio_17    1.909567
#> Ratio_14    1.821574
#> Ratio_12    1.625962
#> Ratio_16    1.006905
#> Ratio_11    0.988707

# Visualise
fig, ax = plt.subplots(figsize=(10, 6))
top_n = 10
ax.barh(importance_df['Feature'].iloc[:top_n], importance_df['Importance'].iloc[:top_n], color='steelblue')
ax.set_xlabel('Absolute Coefficient Value', fontsize=11)
ax.set_title('Feature Importance: Top 10 Features', fontsize=12, fontweight='bold')
ax.invert_yaxis()
ax.grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.savefig('feature_importance.png', dpi=300, bbox_inches='tight')
plt.show()

📝 Section 10.6 Review Questions

What is the main difference between forward stepwise and RFE?
When would you use RFE instead of Lasso for feature selection?
If RFE selects the same 8 features as Lasso, does this validate both methods?

15.7 Case Study: SME Loan Default Prediction with Regularisation

Business Context: A Nigerian microfinance institution has 40+ financial ratios per borrower and wants to predict loan default risk using a parsimonious, interpretable model.

Data: 500 SME loans (2018–2023) with: - 40 financial ratios (highly correlated) - Target: Default within 12 months (yes/no)

Goal: Compare OLS, Ridge, Lasso, Elastic Net; select the best model for deployment.

15.7.1 Full Analysis

Show code

# SME loan default prediction: regularised regression

set.seed(42)
n <- 500
p <- 40

# Generate 40 correlated financial ratios
X_sme <- matrix(rnorm(n * p), nrow = n, ncol = p)
for (j in 2:p) {
  X_sme[, j] <- 0.8 * X_sme[, j] + 0.2 * X_sme[, j - 1] + rnorm(n, 0, 0.1)
}

colnames(X_sme) <- paste0("Ratio_", 1:p)

# Default risk: function of a few true predictors
true_coefs <- c(1.5, -0.8, 0, 0, 1.2, rep(0, 35))  # only 3 true predictors
linear_pred <- X_sme %*% true_coefs
prob_default <- 1 / (1 + exp(-linear_pred))
default_flag <- rbinom(n, 1, prob_default)

sme_data <- as.data.frame(cbind(X_sme, default = default_flag))

# Standardise
X_sme_scaled <- scale(X_sme)

cat("SME Loan Default Prediction: Model Comparison\n")
#> SME Loan Default Prediction: Model Comparison
cat("Data: ", n, " loans, ", p, " financial ratios\n\n")
#> Data:  500  loans,  40  financial ratios

# Split into train/test
set.seed(42)
train_idx <- sample(1:n, size = round(0.8 * n))
X_train <- X_sme_scaled[train_idx, ]
X_test <- X_sme_scaled[-train_idx, ]
y_train <- default_flag[train_idx]
y_test <- default_flag[-train_idx]

# Model 1: OLS (will overfit)
fit_ols <- glm(y_train ~ X_train, family = binomial())
pred_ols_test <- predict(fit_ols, newdata = data.frame(X_test),
                         type = "response")[1:nrow(X_test)]

# Model 2: Ridge
ridge_cv <- cv.glmnet(X_train, y_train, alpha = 0, family = "binomial", nfolds = 5)
pred_ridge_test <- as.vector(predict(glmnet(X_train, y_train, alpha = 0,
                                             lambda = ridge_cv$lambda.min, family = "binomial"),
                                     newx = X_test, type = "response"))

# Model 3: Lasso
lasso_cv <- cv.glmnet(X_train, y_train, alpha = 1, family = "binomial", nfolds = 5)
pred_lasso_test <- as.vector(predict(glmnet(X_train, y_train, alpha = 1,
                                             lambda = lasso_cv$lambda.min, family = "binomial"),
                                     newx = X_test, type = "response"))

# Model 4: Elastic Net
en_cv <- cv.glmnet(X_train, y_train, alpha = 0.5, family = "binomial", nfolds = 5)
pred_en_test <- as.vector(predict(glmnet(X_train, y_train, alpha = 0.5,
                                          lambda = en_cv$lambda.min, family = "binomial"),
                                  newx = X_test, type = "response"))

# Evaluation: AUC (from pROC package)
library(pROC)

auc_ols <- auc(roc(y_test, pred_ols_test, direction = "<"))
auc_ridge <- auc(roc(y_test, pred_ridge_test, direction = "<"))
auc_lasso <- auc(roc(y_test, pred_lasso_test, direction = "<"))
auc_en <- auc(roc(y_test, pred_en_test, direction = "<"))

# Accuracy
acc_ols <- mean((pred_ols_test > 0.5) == (y_test == 1))
acc_ridge <- mean((pred_ridge_test > 0.5) == (y_test == 1))
acc_lasso <- mean((pred_lasso_test > 0.5) == (y_test == 1))
acc_en <- mean((pred_en_test > 0.5) == (y_test == 1))

cat("\nTest Set Performance:\n\n")
#> 
#> Test Set Performance:
comparison_models <- tibble(
  Model = c("OLS", "Ridge", "Lasso", "Elastic Net"),
  AUC = c(as.numeric(auc_ols), as.numeric(auc_ridge), as.numeric(auc_lasso), as.numeric(auc_en)),
  Accuracy = c(acc_ols, acc_ridge, acc_lasso, acc_en),
  Nonzero_Coefs = c(p, p,
                     sum(coef(glmnet(X_train, y_train, alpha = 1,
                                      lambda = lasso_cv$lambda.min, family = "binomial")) != 0) - 1,
                     sum(coef(glmnet(X_train, y_train, alpha = 0.5,
                                      lambda = en_cv$lambda.min, family = "binomial")) != 0) - 1)
)

print(comparison_models)
#> # A tibble: 4 × 4
#>   Model         AUC Accuracy Nonzero_Coefs
#>   <chr>       <dbl>    <dbl>         <dbl>
#> 1 OLS         0.475     0.5             40
#> 2 Ridge       0.791     0.71            40
#> 3 Lasso       0.812     0.76             5
#> 4 Elastic Net 0.813     0.77            17

# Visualise ROC curves
plot(roc(y_test, pred_ols_test, direction = "<"), main = "ROC Curves: Default Prediction",
     col = "gray", lwd = 2, print.auc = TRUE, print.auc.x = 0.6, print.auc.y = 0.1)
plot(roc(y_test, pred_ridge_test, direction = "<"), add = TRUE, col = "blue", lwd = 2,
     print.auc = TRUE, print.auc.x = 0.6, print.auc.y = 0.05)
plot(roc(y_test, pred_lasso_test, direction = "<"), add = TRUE, col = "red", lwd = 2,
     print.auc = TRUE, print.auc.x = 0.55, print.auc.y = 0)
plot(roc(y_test, pred_en_test, direction = "<"), add = TRUE, col = "green", lwd = 2,
     print.auc = TRUE, print.auc.x = 0.5, print.auc.y = -0.05)

legend("bottomright", legend = c("OLS", "Ridge", "Lasso", "Elastic Net"),
       col = c("gray", "blue", "red", "green"), lwd = 2)

Show code


# Business recommendation
cat("\n\nBusiness Recommendation:\n")
#> 
#> 
#> Business Recommendation:
cat("Lasso achieves similar AUC to OLS/Ridge with only",
    sum(coef(glmnet(X_train, y_train, alpha = 1,
                     lambda = lasso_cv$lambda.min, family = "binomial")) != 0) - 1,
    "features.\n")
#> Lasso achieves similar AUC to OLS/Ridge with only 5 features.
cat("This simplifies the scorecard and improves interpretability.\n")
#> This simplifies the scorecard and improves interpretability.
cat("For deployment, use Lasso with 95% CI thresholding for risk assessment.\n")
#> For deployment, use Lasso with 95% CI thresholding for risk assessment.

Show code

from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, roc_curve, accuracy_score
import matplotlib.pyplot as plt

np.random.seed(42)
n = 500
p = 40

# Generate correlated financial ratios
X_sme = np.random.normal(0, 1, (n, p))
for j in range(1, p):
    X_sme[:, j] = 0.8 * X_sme[:, j] + 0.2 * X_sme[:, j - 1] + np.random.normal(0, 0.1, n)

# Default risk
true_coefs = np.array([1.5, -0.8, 0, 0, 1.2] + [0] * 35)
linear_pred = X_sme @ true_coefs
prob_default = 1 / (1 + np.exp(-linear_pred))
default_flag = np.random.binomial(1, prob_default, n)

# Standardise
scaler = StandardScaler()
X_sme_scaled = scaler.fit_transform(X_sme)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X_sme_scaled, default_flag, test_size=0.2, random_state=42)

print(f"SME Loan Default Prediction: Model Comparison")
#> SME Loan Default Prediction: Model Comparison
print(f"Data: {n} loans, {p} financial ratios\n")
#> Data: 500 loans, 40 financial ratios

# Model 1: OLS (logistic, no regularisation)
lr_ols = LogisticRegression(penalty=None, solver='lbfgs', max_iter=1000)
lr_ols.fit(X_train, y_train)

LogisticRegression(max_iter=1000, penalty=None)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Show code

pred_ols_test = lr_ols.predict_proba(X_test)[:, 1]

# Model 2: Ridge
lr_ridge = LogisticRegression(penalty='l2', C=1.0, solver='lbfgs', max_iter=1000)
lr_ridge.fit(X_train, y_train)

LogisticRegression(max_iter=1000, penalty='l2')

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Show code

pred_ridge_test = lr_ridge.predict_proba(X_test)[:, 1]

# Model 3: Lasso
lr_lasso = LogisticRegression(penalty='l1', C=1.0, solver='saga', max_iter=1000)
lr_lasso.fit(X_train, y_train)

LogisticRegression(max_iter=1000, penalty='l1', solver='saga')

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Show code

pred_lasso_test = lr_lasso.predict_proba(X_test)[:, 1]

# Model 4: Elastic Net (C = 1/λ in sklearn)
lr_en = LogisticRegression(penalty='elasticnet', C=1.0, solver='saga', l1_ratio=0.5, max_iter=1000)
lr_en.fit(X_train, y_train)

LogisticRegression(l1_ratio=0.5, max_iter=1000, penalty='elasticnet',
                   solver='saga')

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Show code

pred_en_test = lr_en.predict_proba(X_test)[:, 1]

# Evaluation
auc_ols = roc_auc_score(y_test, pred_ols_test)
auc_ridge = roc_auc_score(y_test, pred_ridge_test)
auc_lasso = roc_auc_score(y_test, pred_lasso_test)
auc_en = roc_auc_score(y_test, pred_en_test)

acc_ols = accuracy_score(y_test, (pred_ols_test > 0.5).astype(int))
acc_ridge = accuracy_score(y_test, (pred_ridge_test > 0.5).astype(int))
acc_lasso = accuracy_score(y_test, (pred_lasso_test > 0.5).astype(int))
acc_en = accuracy_score(y_test, (pred_en_test > 0.5).astype(int))

nonzero_ols = np.sum(lr_ols.coef_[0] != 0)
nonzero_ridge = np.sum(lr_ridge.coef_[0] != 0)
nonzero_lasso = np.sum(lr_lasso.coef_[0] != 0)
nonzero_en = np.sum(lr_en.coef_[0] != 0)

comparison_models = pd.DataFrame({
    'Model': ['OLS', 'Ridge', 'Lasso', 'Elastic Net'],
    'AUC': [auc_ols, auc_ridge, auc_lasso, auc_en],
    'Accuracy': [acc_ols, acc_ridge, acc_lasso, acc_en],
    'Non-Zero Coefs': [nonzero_ols, nonzero_ridge, nonzero_lasso, nonzero_en]
})

print("Test Set Performance:\n")
#> Test Set Performance:
print(comparison_models.to_string(index=False))
#>       Model      AUC  Accuracy  Non-Zero Coefs
#>         OLS 0.835809      0.78              40
#>       Ridge 0.837415      0.78              40
#>       Lasso 0.843436      0.79              37
#> Elastic Net 0.839422      0.79              38

# Plot ROC curves
fig, ax = plt.subplots(figsize=(10, 8))

fpr_ols, tpr_ols, _ = roc_curve(y_test, pred_ols_test)
fpr_ridge, tpr_ridge, _ = roc_curve(y_test, pred_ridge_test)
fpr_lasso, tpr_lasso, _ = roc_curve(y_test, pred_lasso_test)
fpr_en, tpr_en, _ = roc_curve(y_test, pred_en_test)

ax.plot(fpr_ols, tpr_ols, color='gray', linewidth=2, label=f'OLS (AUC={auc_ols:.3f})')
#> [<matplotlib.lines.Line2D object at 0x0000020643FCB770>]
ax.plot(fpr_ridge, tpr_ridge, color='blue', linewidth=2, label=f'Ridge (AUC={auc_ridge:.3f})')
#> [<matplotlib.lines.Line2D object at 0x0000020643FCB8C0>]
ax.plot(fpr_lasso, tpr_lasso, color='red', linewidth=2, label=f'Lasso (AUC={auc_lasso:.3f})')
#> [<matplotlib.lines.Line2D object at 0x0000020643FCBA10>]
ax.plot(fpr_en, tpr_en, color='green', linewidth=2, label=f'Elastic Net (AUC={auc_en:.3f})')
#> [<matplotlib.lines.Line2D object at 0x0000020643FCBB60>]

ax.plot([0, 1], [0, 1], 'k--', linewidth=1, alpha=0.5)
#> [<matplotlib.lines.Line2D object at 0x0000020643FCBCB0>]
ax.set_xlabel('False Positive Rate', fontsize=11)
#> Text(0.5, 0, 'False Positive Rate')
ax.set_ylabel('True Positive Rate', fontsize=11)
#> Text(0, 0.5, 'True Positive Rate')
ax.set_title('ROC Curves: Loan Default Prediction', fontsize=12, fontweight='bold')
#> Text(0.5, 1.0, 'ROC Curves: Loan Default Prediction')
ax.legend(fontsize=10, loc='lower right')
#> <matplotlib.legend.Legend object at 0x0000020643FCB4D0>
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('sme_default_roc.png', dpi=300, bbox_inches='tight')
plt.show()

Show code


print(f"\n\nBusiness Recommendation:")
#> 
#> 
#> Business Recommendation:
print(f"Lasso achieves AUC={auc_lasso:.3f} with only {nonzero_lasso} features.")
#> Lasso achieves AUC=0.843 with only 37 features.
print(f"This simplifies the scorecard and improves interpretability.")
#> This simplifies the scorecard and improves interpretability.
print(f"Recommend: Deploy Lasso model with 95% CI thresholding for risk assessment.")
#> Recommend: Deploy Lasso model with 95% CI thresholding for risk assessment.

Key findings: 1. Lasso selects ~8 features (vs 40 used by OLS), maintaining competitive AUC 2. Ridge achieves similar AUC to OLS but with less interpretability (all features retained) 3. Elastic Net provides a middle ground: 12–15 features, good performance 4. Business impact: Lasso model is deployable on mobile platforms (low computational cost) and easier for loan officers to explain to borrowers

📝 Section 10.7 Review Questions

Why does Lasso select fewer features without sacrificing AUC?
If OLS and Lasso have similar AUC, why prefer Lasso for deployment?
How would you explain Lasso’s feature selection to a non-technical loan officer?
What would you do if the 8 selected features are highly correlated with each other?

15.8 Chapter Exercises

Chapter 10 Exercises

Exercise 10.1: Bias-Variance Illustration

Generate synthetic data: Y = sin(X) + ε, where X ∈ [0, 2π], ε ~ N(0, 0.5).

Fit models of increasing complexity (degree 1, 3, 5, 10 polynomials)
Plot training and test error for each
Identify the overfitting region
Which model balances bias and variance best?

Exercise 10.2: Ridge Penalty Derivation

Starting from the Ridge objective: \[S_{\text{Ridge}}(\beta) = \text{RSS} + \lambda \sum \beta_j^2\]

Compute the first-order conditions (derivatives with respect to each β_j)
Show that the solution involves (X’X + λI)^{-1}
Verify: as λ → ∞, all coefficients → 0
Verify: as λ → 0, Ridge → OLS

Exercise 10.3: Lasso Sparsity

For a univariate Lasso problem (one predictor, one coefficient β):

Derive the soft-thresholding formula: β̂_Lasso = sign(β̂_OLS) max(|β̂_OLS| − λ, 0)
If β̂_OLS = 0.5 and λ = 1, what is β̂_Lasso?
If β̂_OLS = 0.8 and λ = 1, what is β̂_Lasso?
Explain when and why coefficients are eliminated

Exercise 10.4: Ridge vs Lasso in High Dimensions

For a dataset with n = 100 and p = 500:

Why would OLS be ill-defined or numerically unstable?
How does Ridge solve this problem?
How does Lasso solve this problem differently?
When would you choose Lasso over Ridge?

Exercise 10.5: Cross-Validation Implementation

Write pseudocode for K-fold cross-validation to select λ in Ridge regression:

input: data (X, y), λ sequence, k
for each λ:
  initialize cv_errors = []
  for fold i = 1 to k:
    [pseudocode here]
  end
  avg_cv_error[λ] = mean(cv_errors)
end
λ_optimal = argmin(avg_cv_error)
output: λ_optimal

Fill in the pseudocode.
What is the computational complexity?
How would you implement in R or Python?

Exercise 10.6: Elastic Net Tuning

A dataset has p = 50 correlated predictors. You fit Elastic Net with α ∈ {0, 0.25, 0.5, 0.75, 1} and cross-validate each.

Results: | α | CV error | Non-zero coefs | |—|—|—| | 0.0 (Ridge) | 0.85 | 50 | | 0.25 | 0.82 | 48 | | 0.5 | 0.80 | 35 | | 0.75 | 0.81 | 22 | | 1.0 (Lasso) | 0.83 | 18 |

Which α would you select?
Why might pure Lasso (α = 1) have higher error despite fewer features?
What does the grouping of correlated predictors suggest?

Exercise 10.7: Feature Selection Stability

Run Lasso on a dataset 100 times with bootstrapped samples. Compute the percentage of times each feature is selected.

Which features are selected in >80% of runs (stable)?
Which features are selected in 20–80% of runs (unstable)?
Why would you trust stable features more than unstable ones?
How might you combine Lasso results with domain expertise?

Exercise 10.8: Real-World Case: Customer Churn Prediction

A telecom company has 200 customers and 30 usage metrics. Goal: predict churn using regularised regression.

Download or simulate the data (or use a public churn dataset)
Fit Ridge, Lasso, Elastic Net with CV
Compare test AUC, selected features, and interpretability
Write a business recommendation: which model to deploy and why?

Exercise 10.9: Regularisation Path Interpretation

Plot the coefficient path for Ridge, Lasso, and Elastic Net as λ increases.

In Ridge: do all coefficients shrink at the same rate?
In Lasso: which coefficients are eliminated first?
In Elastic Net: what is the pattern?
Explain these patterns mechanically

Exercise 10.10: Synthesis: Build a Regularised Regression Pipeline

Select a business problem with p > n or severe multicollinearity. Execute:

EDA: Visualise collinearity (correlation matrix, VIF)
Model building: Ridge, Lasso, Elastic Net, auto feature selection
CV tuning: Grid search over λ and α
Comparison: CV error, number of features, interpretation
Diagnostics: Check selected features for domain sense
Deployment: Recommend one model with justification

15.9 Further Reading

Hastie, T., Tibshirani, R., & Wainwright, M. (2015). Statistical Learning with Sparsity. CRC Press.
- Rigorous treatment of Lasso, Elastic Net, and high-dimensional inference.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. Springer.
- Accessible chapter on Ridge, Lasso, and cross-validation.
Zou, H., & Hastie, T. (2005). “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society: Series B, 67(2), 301–320.
- Original Elastic Net paper.

15.10 Chapter Appendix: Mathematical Derivations

15.10.1 A.1 Lagrangian Formulation of Ridge and Lasso

Ridge as constrained optimisation: \[\min_{\beta} \text{RSS} \quad \text{subject to} \quad \sum \beta_j^2 \leq s\]

The Lagrangian is: \[\mathcal{L}(\beta, \lambda) = \text{RSS} + \lambda \left( \sum \beta_j^2 - s \right)\]

The KKT conditions yield: \[\frac{\partial}{\partial \beta} \text{RSS} + 2\lambda \beta = 0\]

This gives the Ridge objective.

Lasso as constrained optimisation: \[\min_{\beta} \text{RSS} \quad \text{subject to} \quad \sum |\beta_j| \leq s\]

The Lagrangian is: \[\mathcal{L}(\beta, \lambda) = \text{RSS} + \lambda \left( \sum |\beta_j| - s \right)\]

The solution involves soft-thresholding; no closed form, but coordinate descent is used.

15.10.2 A.2 Geometric Intuition: Why Lasso Produces Sparsity

In 2D with two predictors β₁, β₂:

Ridge constraint: β₁² + β₂² ≤ s is a circle. The optimum (intersection of objective and constraint) is typically in the interior of the circle, giving non-zero β₁, β₂.
Lasso constraint: |β₁| + |β₂| ≤ s is a diamond with corners at (s, 0), (0, s), (−s, 0), (0, −s). The objective’s contours often intersect the diamond at a corner, where one coefficient is zero.

This geometric difference explains why Lasso produces sparsity while Ridge does not.

15.10.3 A.3 Coordinate Descent for Lasso

Iterative procedure: 1. Initialise β = 0 2. For k = 1, …, max_iter: - For j = 1, …, p: - Compute residuals with all predictors except j: r_{−j} = y − X_{−j} β_{−j} - Univariate Lasso: β_j ← soft_threshold(X_j’ r_{−j}, λ) - If coefficients converge, stop 3. Return β

The soft-thresholding step is: \[\text{soft\_threshold}(z, \lambda) = \text{sign}(z) \max(|z| − \lambda, 0)\]

This approach is very fast and scales well to high dimensions.

End of Chapter 10

	alphas alphas: array-like of shape (n_alphas,), default=(0.1, 1.0, 10.0) Array of alpha values to try. Regularization strength; must be a positive float. Regularization improves the conditioning of the problem and reduces the variance of the estimates. Larger values specify stronger regularization. Alpha corresponds to ``1 / (2C)`` in other linear models such as :class:`~sklearn.linear_model.LogisticRegression` or :class:`~sklearn.svm.LinearSVC`. If using Leave-One-Out cross-validation, alphas must be strictly positive.	array([1.0000...00000000e+03])
	fit_intercept fit_intercept: bool, default=True Whether to calculate the intercept for this model. If set to false, no intercept will be used in calculations (i.e. data is expected to be centered).	True
	scoring scoring: str, callable, default=None The scoring method to use for cross-validation. Options: - str: see :ref:`scoring_string_names` for options. - callable: a scorer callable object (e.g., function) with signature ``scorer(estimator, X, y)``. See :ref:`scoring_callable` for details. - `None`: negative :ref:`mean squared error ` if cv is None (i.e. when using leave-one-out cross-validation), or :ref:`coefficient of determination ` (:math:`R^2`) otherwise.	None
	cv cv: int, cross-validation generator or an iterable, default=None Determines the cross-validation splitting strategy. Possible inputs for cv are: - None, to use the efficient Leave-One-Out cross-validation - integer, to specify the number of folds. - :term:`CV splitter`, - An iterable yielding (train, test) splits as arrays of indices. For integer/None inputs, if ``y`` is binary or multiclass, :class:`~sklearn.model_selection.StratifiedKFold` is used, else, :class:`~sklearn.model_selection.KFold` is used. Refer :ref:`User Guide ` for the various cross-validation strategies that can be used here.	5
	gcv_mode gcv_mode: {'auto', 'svd', 'eigen'}, default='auto' Flag indicating which strategy to use when performing Leave-One-Out Cross-Validation. Options are:: 'auto' : use 'svd' if n_samples > n_features, otherwise use 'eigen' 'svd' : force use of singular value decomposition of X when X is dense, eigenvalue decomposition of X^T.X when X is sparse. 'eigen' : force computation via eigendecomposition of X.X^T The 'auto' mode is the default and is intended to pick the cheaper option of the two depending on the shape of the training data.	None
	store_cv_results store_cv_results: bool, default=False Flag indicating if the cross-validation values corresponding to each alpha should be stored in the ``cv_results_`` attribute (see below). This flag is only compatible with ``cv=None`` (i.e. using Leave-One-Out Cross-Validation). .. versionchanged:: 1.5 Parameter name changed from `store_cv_values` to `store_cv_results`.	False
	alpha_per_target alpha_per_target: bool, default=False Flag indicating whether to optimize the alpha value (picked from the `alphas` parameter list) for each target separately (for multi-output settings: multiple prediction targets). When set to `True`, after fitting, the `alpha_` attribute will contain a value for each target. When set to `False`, a single alpha is used for all targets. .. versionadded:: 0.24	False

	alpha alpha: {float, ndarray of shape (n_targets,)}, default=1.0 Constant that multiplies the L2 term, controlling regularization strength. `alpha` must be a non-negative float i.e. in `[0, inf)`. When `alpha = 0`, the objective is equivalent to ordinary least squares, solved by the :class:`LinearRegression` object. For numerical reasons, using `alpha = 0` with the `Ridge` object is not advised. Instead, you should use the :class:`LinearRegression` object. If an array is passed, penalties are assumed to be specific to the targets. Hence they must correspond in number.	np.float64(97.70099572992247)
	fit_intercept fit_intercept: bool, default=True Whether to fit the intercept for this model. If set to false, no intercept will be used in calculations (i.e. ``X`` and ``y`` are expected to be centered).	True
	copy_X copy_X: bool, default=True If True, X will be copied; else, it may be overwritten.	True
	max_iter max_iter: int, default=None Maximum number of iterations for conjugate gradient solver. For 'sparse_cg' and 'lsqr' solvers, the default value is determined by scipy.sparse.linalg. For 'sag' solver, the default value is 1000. For 'lbfgs' solver, the default value is 15000.	None
	tol tol: float, default=1e-4 The precision of the solution (`coef_`) is determined by `tol` which specifies a different convergence criterion for each solver: - 'svd': `tol` has no impact. - 'cholesky': `tol` has no impact. - 'sparse_cg': norm of residuals smaller than `tol`. - 'lsqr': `tol` is set as atol and btol of scipy.sparse.linalg.lsqr, which control the norm of the residual vector in terms of the norms of matrix and coefficients. - 'sag' and 'saga': relative change of coef smaller than `tol`. - 'lbfgs': maximum of the absolute (projected) gradient=max\|residuals\| smaller than `tol`. .. versionchanged:: 1.2 Default value changed from 1e-3 to 1e-4 for consistency with other linear models.	0.0001
	solver solver: {'auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga', 'lbfgs'}, default='auto' Solver to use in the computational routines: - 'auto' chooses the solver automatically based on the type of data. - 'svd' uses a Singular Value Decomposition of X to compute the Ridge coefficients. It is the most stable solver, in particular more stable for singular matrices than 'cholesky' at the cost of being slower. - 'cholesky' uses the standard :func:`scipy.linalg.solve` function to obtain a closed-form solution. - 'sparse_cg' uses the conjugate gradient solver as found in :func:`scipy.sparse.linalg.cg`. As an iterative algorithm, this solver is more appropriate than 'cholesky' for large-scale data (possibility to set `tol` and `max_iter`). - 'lsqr' uses the dedicated regularized least-squares routine :func:`scipy.sparse.linalg.lsqr`. It is the fastest and uses an iterative procedure. - 'sag' uses a Stochastic Average Gradient descent, and 'saga' uses its improved, unbiased version named SAGA. Both methods also use an iterative procedure, and are often faster than other solvers when both n_samples and n_features are large. Note that 'sag' and 'saga' fast convergence is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from :mod:`sklearn.preprocessing`. - 'lbfgs' uses L-BFGS-B algorithm implemented in :func:`scipy.optimize.minimize`. It can be used only when `positive` is True. All solvers except 'svd' support both dense and sparse data. However, only 'lsqr', 'sag', 'sparse_cg', and 'lbfgs' support sparse input when `fit_intercept` is True. .. versionadded:: 0.17 Stochastic Average Gradient descent solver. .. versionadded:: 0.19 SAGA solver.	'auto'
	positive positive: bool, default=False When set to ``True``, forces the coefficients to be positive. Only 'lbfgs' solver is supported in this case.	False
	random_state random_state: int, RandomState instance, default=None Used when ``solver`` == 'sag' or 'saga' to shuffle the data. See :term:`Glossary ` for details. .. versionadded:: 0.17 `random_state` to support Stochastic Average Gradient.	None

	fit_intercept fit_intercept: bool, default=True Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (i.e. data is expected to be centered).	True
	copy_X copy_X: bool, default=True If True, X will be copied; else, it may be overwritten.	True
	tol tol: float, default=1e-6 The precision of the solution (`coef_`) is determined by `tol` which specifies a different convergence criterion for the `lsqr` solver. `tol` is set as `atol` and `btol` of :func:`scipy.sparse.linalg.lsqr` when fitting on sparse training data. This parameter has no effect when fitting on dense data. .. versionadded:: 1.7	1e-06
	n_jobs n_jobs: int, default=None The number of jobs to use for the computation. This will only provide speedup in case of sufficiently large problems, that is if firstly `n_targets > 1` and secondly `X` is sparse or if `positive` is set to `True`. ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary ` for more details.	None
	positive positive: bool, default=False When set to ``True``, forces the coefficients to be positive. This option is only supported for dense arrays. For a comparison between a linear regression model with positive constraints on the regression coefficients and a linear regression without such constraints, see :ref:`sphx_glr_auto_examples_linear_model_plot_nnls.py`. .. versionadded:: 0.24	False

	eps eps: float, default=1e-3 Length of the path. ``eps=1e-3`` means that ``alpha_min / alpha_max = 1e-3``.	0.001
	n_alphas n_alphas: int, default=100 Number of alphas along the regularization path. .. deprecated:: 1.7 `n_alphas` was deprecated in 1.7 and will be removed in 1.9. Use `alphas` instead.	'deprecated'
	alphas alphas: array-like or int, default=None Values of alphas to test along the regularization path. If int, `alphas` values are generated automatically. If array-like, list of alpha values to use. .. versionchanged:: 1.7 `alphas` accepts an integer value which removes the need to pass `n_alphas`. .. deprecated:: 1.7 `alphas=None` was deprecated in 1.7 and will be removed in 1.9, at which point the default value will be set to 100.	array([1.0000...00000000e+03])
	fit_intercept fit_intercept: bool, default=True Whether to calculate the intercept for this model. If set to false, no intercept will be used in calculations (i.e. data is expected to be centered).	True
	precompute precompute: 'auto', bool or array-like of shape (n_features, n_features), default='auto' Whether to use a precomputed Gram matrix to speed up calculations. If set to ``'auto'`` let us decide. The Gram matrix can also be passed as argument.	'auto'
	max_iter max_iter: int, default=1000 The maximum number of iterations.	10000
	tol tol: float, default=1e-4 The tolerance for the optimization: if the updates are smaller or equal to ``tol``, the optimization code checks the dual gap for optimality and continues until it is smaller or equal to ``tol``.	0.0001
	copy_X copy_X: bool, default=True If ``True``, X will be copied; else, it may be overwritten.	True
	cv cv: int, cross-validation generator or iterable, default=None Determines the cross-validation splitting strategy. Possible inputs for cv are: - None, to use the default 5-fold cross-validation, - int, to specify the number of folds. - :term:`CV splitter`, - An iterable yielding (train, test) splits as arrays of indices. For int/None inputs, :class:`~sklearn.model_selection.KFold` is used. Refer :ref:`User Guide ` for the various cross-validation strategies that can be used here. .. versionchanged:: 0.22 ``cv`` default value if None changed from 3-fold to 5-fold.	5
	verbose verbose: bool or int, default=False Amount of verbosity.	False
	n_jobs n_jobs: int, default=None Number of CPUs to use during the cross validation. ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary ` for more details.	None
	positive positive: bool, default=False If positive, restrict regression coefficients to be positive.	False
	random_state random_state: int, RandomState instance, default=None The seed of the pseudo random number generator that selects a random feature to update. Used when ``selection`` == 'random'. Pass an int for reproducible output across multiple function calls. See :term:`Glossary `.	None
	selection selection: {'cyclic', 'random'}, default='cyclic' If set to 'random', a random coefficient is updated every iteration rather than looping over features sequentially by default. This (setting to 'random') often leads to significantly faster convergence especially when tol is higher than 1e-4.	'cyclic'

	alpha alpha: float, default=1.0 Constant that multiplies the L1 term, controlling regularization strength. `alpha` must be a non-negative float i.e. in `[0, inf)`. When `alpha = 0`, the objective is equivalent to ordinary least squares, solved by the :class:`LinearRegression` object. For numerical reasons, using `alpha = 0` with the `Lasso` object is not advised. Instead, you should use the :class:`LinearRegression` object.	np.float64(1000.0)
	fit_intercept fit_intercept: bool, default=True Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (i.e. data is expected to be centered).	True
	precompute precompute: bool or array-like of shape (n_features, n_features), default=False Whether to use a precomputed Gram matrix to speed up calculations. The Gram matrix can also be passed as argument. For sparse input this option is always ``False`` to preserve sparsity.	False
	copy_X copy_X: bool, default=True If ``True``, X will be copied; else, it may be overwritten.	True
	max_iter max_iter: int, default=1000 The maximum number of iterations.	10000
	tol tol: float, default=1e-4 The tolerance for the optimization: if the updates are smaller or equal to ``tol``, the optimization code checks the dual gap for optimality and continues until it is smaller or equal to ``tol``, see Notes below.	0.0001
	warm_start warm_start: bool, default=False When set to ``True``, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution. See :term:`the Glossary `.	False
	positive positive: bool, default=False When set to ``True``, forces the coefficients to be positive.	False
	random_state random_state: int, RandomState instance, default=None The seed of the pseudo random number generator that selects a random feature to update. Used when ``selection`` == 'random'. Pass an int for reproducible output across multiple function calls. See :term:`Glossary `.	None
	selection selection: {'cyclic', 'random'}, default='cyclic' If set to 'random', a random coefficient is updated every iteration rather than looping over features sequentially by default. This (setting to 'random') often leads to significantly faster convergence especially when tol is higher than 1e-4.	'cyclic'