18 Logistic Regression

📋 Learning Objectives

By the end of this chapter, you will be able to:

Understand why linear regression fails for classification and when logistic regression is needed
Fit logistic regression models using maximum likelihood estimation
Interpret logistic regression coefficients in terms of odds and odds ratios
Construct and evaluate logistic regression models on real data
Apply regularisation (L1 and L2) to logistic regression
Build credit risk scorecards from logistic regression coefficients
Evaluate model calibration and discrimination
Use Weight of Evidence (WOE) and Information Value (IV) for feature selection

18.1 Why Not Linear Regression for Classification?

Linear regression predicts a continuous value. For classification, we need bounded predictions between 0 and 1.

📘 Theory: The Linear Probability Model and Its Failures

Linear model for binary outcome: \[P(Y = 1 | X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\]

Problems:

Predictions outside [0, 1]: The right side is unbounded. For high or low X values, the predicted probability can exceed 1 or fall below 0 (nonsensical).
Non-constant variance: For binary outcomes, Var(Y|X) = P(X)[1 − P(X)], which depends on X. This violates the OLS homoscedasticity assumption, leading to inefficient estimates and incorrect standard errors.
Non-normal residuals: Residuals take only two values (1 − Ŷ or −Ŷ), violating the normality assumption.
Wrong interpretation: β_j is interpreted as “change in probability,” but this is often misleading because probability changes are non-linear.

Solution: Use a bounded link function to map the linear predictor to [0, 1]. The logistic function is standard.

18.1.1 Motivating the Logistic Regression Model

📝 Section 13.1 Review Questions

If you fit a linear regression to a binary outcome (0/1) and predict Ŷ = 1.5 for a new observation, what does this mean?
Why does the linear probability model violate the homoscedasticity assumption?
What mathematical property should a function have to map any real number to (0, 1)?
Is it valid to interpret linear regression coefficients on a binary outcome as “change in probability”?

18.2 The Logistic Function and the Sigmoid

The logistic function (sigmoid) maps any real number to (0, 1):

\[\sigma(z) = \frac{1}{1 + e^{-z}}\]

where z is the linear predictor: $z = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$

🔑 Key Formula: The Logistic (Sigmoid) Function

\[\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{1 + e^z}\]

Properties: - σ(z) ∈ (0, 1) for all z ∈ ℝ - σ(0) = 0.5 - σ(z) is monotonically increasing - As z → ∞, σ(z) → 1; as z → −∞, σ(z) → 0 - Derivative: σ’(z) = σ(z)[1 − σ(z)]

18.2.1 Visualisation and Intuition

Show code

library(tidyverse)

# Plot logistic function
z_values <- seq(-6, 6, by = 0.1)
sigma_values <- 1 / (1 + exp(-z_values))

par(mfrow = c(1, 2), mar = c(5, 5, 3, 1))

# Plot 1: Logistic function
plot(z_values, sigma_values, type = "l", lwd = 2, col = "steelblue",
     xlab = "Linear Predictor (z)", ylab = "Predicted Probability σ(z)",
     main = "Logistic (Sigmoid) Function",
     ylim = c(-0.1, 1.1))
abline(h = 0, col = "gray", lty = 2)
abline(h = 1, col = "gray", lty = 2)
abline(h = 0.5, col = "red", lty = 2, lwd = 1)
abline(v = 0, col = "red", lty = 2, lwd = 1)
points(0, 0.5, pch = 16, cex = 2, col = "red")
text(-5, 0.3, "P(Y=1|X) increases\nas X increases", cex = 0.9)

# Plot 2: Comparison with linear regression
plot(z_values, sigma_values, type = "l", lwd = 2, col = "steelblue",
     xlab = "Linear Predictor (z)", ylab = "Predicted Probability",
     main = "Logistic vs Linear Model",
     ylim = c(-0.5, 1.5))
abline(h = 0, col = "gray", lty = 2)
abline(h = 1, col = "gray", lty = 2)
# Add linear model
linear_pred <- 0.5 + 0.1 * z_values
lines(z_values, linear_pred, lwd = 2, col = "darkred", lty = 2)
legend("topleft", legend = c("Logistic (bounded)", "Linear (unbounded)"),
       col = c("steelblue", "darkred"), lwd = 2, lty = c(1, 2))

Show code


par(mfrow = c(1, 1))

Show code

import numpy as np
import matplotlib.pyplot as plt

# Logistic function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z_values = np.linspace(-6, 6, 200)
sigma_values = sigmoid(z_values)
linear_values = 0.5 + 0.1 * z_values

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Logistic function
ax = axes[0]
ax.plot(z_values, sigma_values, linewidth=2.5, color='steelblue', label='σ(z)')
ax.axhline(0, color='gray', linestyle='--', alpha=0.5)
ax.axhline(1, color='gray', linestyle='--', alpha=0.5)
ax.axhline(0.5, color='red', linestyle='--', linewidth=1, alpha=0.5)
ax.axvline(0, color='red', linestyle='--', linewidth=1, alpha=0.5)
ax.plot(0, 0.5, 'ro', markersize=10)
ax.set_xlabel('Linear Predictor (z)', fontsize=11)
ax.set_ylabel('Predicted Probability σ(z)', fontsize=11)
ax.set_title('Logistic (Sigmoid) Function', fontsize=12, fontweight='bold')
ax.set_ylim([-0.1, 1.1])
#> (-0.1, 1.1)
ax.grid(True, alpha=0.3)
ax.text(-4.5, 0.25, 'P(Y=1|X) increases\nas X increases', fontsize=10)

# Plot 2: Comparison
ax = axes[1]
ax.plot(z_values, sigma_values, linewidth=2.5, color='steelblue', label='Logistic (bounded)')
ax.plot(z_values, linear_values, linewidth=2.5, color='darkred', linestyle='--', label='Linear (unbounded)')
ax.axhline(0, color='gray', linestyle='--', alpha=0.3)
ax.axhline(1, color='gray', linestyle='--', alpha=0.3)
ax.fill_between(z_values, 0, 1, alpha=0.1, color='gray', label='Valid probability range')
ax.set_xlabel('Linear Predictor (z)', fontsize=11)
ax.set_ylabel('Predicted Probability', fontsize=11)
ax.set_title('Logistic vs Linear Model', fontsize=12, fontweight='bold')
ax.set_ylim([-0.5, 1.5])
#> (-0.5, 1.5)
ax.grid(True, alpha=0.3)
ax.legend(fontsize=10)

plt.tight_layout()
plt.savefig('sigmoid_function.png', dpi=300, bbox_inches='tight')
plt.show()

📝 Section 13.2 Review Questions

Verify: σ(0) = 0.5. What does this mean in terms of equal odds?
What is σ(−3)? σ(3)? How do they relate?
Why is the logistic function preferable to a linear function for classification?

18.3 The Logistic Regression Model

Logistic regression models the probability of the positive class using the logistic function:

\[P(Y = 1 | X) = \sigma(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)\]

18.3.1 Interpreting Coefficients: Odds and Odds Ratios

The odds of success are: \[\text{Odds} = \frac{P(Y = 1)}{P(Y = 0)} = \frac{P(Y = 1)}{1 - P(Y = 1)}\]

The log-odds (logit) are: \[\text{log(Odds)} = \log\left(\frac{P(Y = 1)}{1 - P(Y = 1)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\]

For a one-unit increase in X_j: \[\text{log(Odds)}_{\text{new}} - \text{log(Odds)}_{\text{old}} = \beta_j\]

Taking the exponential: \[\frac{\text{Odds}_{\text{new}}}{\text{Odds}_{\text{old}}} = e^{\beta_j}\]

The odds ratio is $e^{\beta_j}$: a one-unit increase in X_j multiplies the odds by $e^{\beta_j}$.

🔑 Key Formula: Coefficient Interpretation

For a logistic regression coefficient β_j:

Odds ratio: $OR_j = e^{\beta_j}$

Interpretation: - If β_j = 0, OR_j = 1 (no effect) - If β_j > 0, OR_j > 1 (increases odds) - If β_j < 0, OR_j < 1 (decreases odds)

Example: If β_age = 0.05, then e^{0.05} = 1.051 - A one-year increase in age multiplies the odds by 1.051 - Per 10-year increase: odds are multiplied by e^{0.5} ≈ 1.649

18.3.2 Worked Example: Credit Default Logistic Regression

Show code

library(tidyverse)

# Simulate credit default data
set.seed(42)
n <- 1000

age <- rnorm(n, mean = 45, sd = 15)
income <- rlnorm(n, meanlog = 11, sdlog = 0.8)
credit_utilisation <- rbeta(n, 2, 5)
credit_score <- 850 - 5 * age + 0.0001 * income + 300 * credit_utilisation + rnorm(n, 0, 100)

# Probability of default
true_linear_pred <- -8 + 0.03 * age - 0.0000005 * income - 0.02 * credit_score
prob_default <- 1 / (1 + exp(-true_linear_pred))
default <- rbinom(n, 1, prob_default)

credit_data <- tibble(
  age, income, credit_utilisation, credit_score, default
)

# Fit logistic regression
fit_logistic <- glm(default ~ age + log(income) + credit_utilisation + credit_score,
                    data = credit_data, family = binomial(link = "logit"))

summary(fit_logistic)
#> 
#> Call:
#> glm(formula = default ~ age + log(income) + credit_utilisation + 
#>     credit_score, family = binomial(link = "logit"), data = credit_data)
#> 
#> Coefficients:
#>                      Estimate Std. Error z value Pr(>|z|)
#> (Intercept)        -2.657e+01  1.854e+05       0        1
#> age                -1.623e-15  9.324e+02       0        1
#> log(income)        -4.300e-14  1.430e+04       0        1
#> credit_utilisation -2.640e-14  7.936e+04       0        1
#> credit_score       -5.951e-17  1.112e+02       0        1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 0.0000e+00  on 999  degrees of freedom
#> Residual deviance: 5.8016e-09  on 995  degrees of freedom
#> AIC: 10
#> 
#> Number of Fisher Scoring iterations: 25

# Extract coefficients and odds ratios
coefs <- coef(fit_logistic)
odds_ratios <- exp(coefs)

cat("\n\nCoefficient Interpretation (Odds Ratios):\n\n")
#> 
#> 
#> Coefficient Interpretation (Odds Ratios):
for (i in 2:length(coefs)) {
  var_name <- names(coefs)[i]
  coef_val <- coefs[i]
  or_val <- odds_ratios[i]

  cat(var_name, ":\n")
  cat("  Coefficient (β):", round(coef_val, 6), "\n")
  cat("  Odds Ratio (e^β):", round(or_val, 4), "\n")

  if (or_val > 1) {
    pct_increase <- (or_val - 1) * 100
    cat("  → One-unit increase multiplies odds by", round(or_val, 4),
        "(", round(pct_increase, 1), "% increase)\n\n")
  } else {
    pct_decrease <- (1 - or_val) * 100
    cat("  → One-unit increase multiplies odds by", round(or_val, 4),
        "(", round(pct_decrease, 1), "% decrease)\n\n")
  }
}
#> age :
#>   Coefficient (β): 0 
#>   Odds Ratio (e^β): 1 
#>   → One-unit increase multiplies odds by 1 ( 0 % decrease)
#> 
#> log(income) :
#>   Coefficient (β): 0 
#>   Odds Ratio (e^β): 1 
#>   → One-unit increase multiplies odds by 1 ( 0 % decrease)
#> 
#> credit_utilisation :
#>   Coefficient (β): 0 
#>   Odds Ratio (e^β): 1 
#>   → One-unit increase multiplies odds by 1 ( 0 % decrease)
#> 
#> credit_score :
#>   Coefficient (β): 0 
#>   Odds Ratio (e^β): 1 
#>   → One-unit increase multiplies odds by 1 ( 0 % decrease)

# Predicted probabilities for specific cases
new_cases <- tibble(
  age = c(30, 50),
  income = c(30000, 100000),
  credit_utilisation = c(0.3, 0.7),
  credit_score = c(750, 600)
)

preds <- predict(fit_logistic, newdata = new_cases, type = "response")

cat("Predicted Default Probabilities:\n\n")
#> Predicted Default Probabilities:
for (i in 1:nrow(new_cases)) {
  cat("Applicant", i, ":\n")
  cat("  Age:", new_cases$age[i], "\n")
  cat("  Income: $", new_cases$income[i], "\n")
  cat("  Credit Utilisation:", new_cases$credit_utilisation[i], "\n")
  cat("  Credit Score:", new_cases$credit_score[i], "\n")
  cat("  Predicted Default Probability:", round(preds[i], 4), " (",
      round(preds[i] * 100, 2), "%)\n\n")
}
#> Applicant 1 :
#>   Age: 30 
#>   Income: $ 30000 
#>   Credit Utilisation: 0.3 
#>   Credit Score: 750 
#>   Predicted Default Probability: 0  ( 0 %)
#> 
#> Applicant 2 :
#>   Age: 50 
#>   Income: $ 1e+05 
#>   Credit Utilisation: 0.7 
#>   Credit Score: 600 
#>   Predicted Default Probability: 0  ( 0 %)

Show code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm

np.random.seed(42)
n = 1000

age = np.random.normal(45, 15, n)
income = np.exp(np.random.normal(11, 0.8, n))
credit_util = np.random.beta(2, 5, n)
credit_score = 850 - 5*age + 0.0001*income + 300*credit_util + np.random.normal(0, 100, n)

# Probability of default
true_linear_pred = -8 + 0.03*age - 0.0000005*income - 0.02*credit_score
prob_default = 1 / (1 + np.exp(-true_linear_pred))
default = np.random.binomial(1, prob_default, n)

credit_data = pd.DataFrame({
    'age': age,
    'income': income,
    'credit_util': credit_util,
    'credit_score': credit_score,
    'default': default
})

# Fit logistic regression using statsmodels for detailed output
X = credit_data[['age', 'credit_util', 'credit_score']].copy()
X['log_income'] = np.log(credit_data['income'])
X = sm.add_constant(X)
y = credit_data['default']

fit_logistic = sm.Logit(y, X).fit()
#> Warning: Maximum number of iterations has been exceeded.
#>          Current function value: 0.000000
#>          Iterations: 35
print(fit_logistic.summary())
#>                            Logit Regression Results                           
#> ==============================================================================
#> Dep. Variable:                default   No. Observations:                 1000
#> Model:                          Logit   Df Residuals:                      995
#> Method:                           MLE   Df Model:                            4
#> Date:                Sun, 10 May 2026   Pseudo R-squ.:                     inf
#> Time:                        15:37:56   Log-Likelihood:            -8.5717e-09
#> converged:                      False   LL-Null:                        0.0000
#> Covariance Type:            nonrobust   LLR p-value:                     1.000
#> ================================================================================
#>                    coef    std err          z      P>|z|      [0.025      0.975]
#> --------------------------------------------------------------------------------
#> const          -20.3459   3.53e+05  -5.76e-05      1.000   -6.93e+05    6.93e+05
#> age             -0.2772   2263.465     -0.000      1.000   -4436.587    4436.032
#> credit_util     -1.0230    1.6e+05  -6.38e-06      1.000   -3.14e+05    3.14e+05
#> credit_score     0.0033    174.391   1.88e-05      1.000    -341.796     341.803
#> log_income      -0.2157   3.48e+04  -6.19e-06      1.000   -6.82e+04    6.82e+04
#> ================================================================================
#> 
#> Complete Separation: The results show that there iscomplete separation or perfect prediction.
#> In this case the Maximum Likelihood Estimator does not exist and the parameters
#> are not identified.

# Coefficient interpretation
print("\n\nCoefficient Interpretation (Odds Ratios):\n")
#> 
#> 
#> Coefficient Interpretation (Odds Ratios):
coefs = fit_logistic.params[1:]  # Exclude intercept
odds_ratios = np.exp(coefs)

for var, coef_val, or_val in zip(coefs.index, coefs.values, odds_ratios.values):
    print(f"{var}:")
    print(f"  Coefficient (β): {coef_val:.6f}")
    print(f"  Odds Ratio (e^β): {or_val:.4f}")

    if or_val > 1:
        pct_change = (or_val - 1) * 100
        print(f"  → One-unit increase multiplies odds by {or_val:.4f} ({pct_change:.1f}% increase)\n")
    else:
        pct_change = (1 - or_val) * 100
        print(f"  → One-unit increase multiplies odds by {or_val:.4f} ({pct_change:.1f}% decrease)\n")
#> age:
#>   Coefficient (β): -0.277225
#>   Odds Ratio (e^β): 0.7579
#>   → One-unit increase multiplies odds by 0.7579 (24.2% decrease)
#> 
#> credit_util:
#>   Coefficient (β): -1.022981
#>   Odds Ratio (e^β): 0.3595
#>   → One-unit increase multiplies odds by 0.3595 (64.0% decrease)
#> 
#> credit_score:
#>   Coefficient (β): 0.003274
#>   Odds Ratio (e^β): 1.0033
#>   → One-unit increase multiplies odds by 1.0033 (0.3% increase)
#> 
#> log_income:
#>   Coefficient (β): -0.215689
#>   Odds Ratio (e^β): 0.8060
#>   → One-unit increase multiplies odds by 0.8060 (19.4% decrease)

# Predicted probabilities
new_cases = pd.DataFrame({
    'const': 1,
    'age': [30, 50],
    'credit_util': [0.3, 0.7],
    'credit_score': [750, 600],
    'log_income': [np.log(30000), np.log(100000)]
})
new_cases = new_cases[['const', 'age', 'credit_util', 'credit_score', 'log_income']]

preds = fit_logistic.predict(new_cases)

print("Predicted Default Probabilities:\n")
#> Predicted Default Probabilities:
for i in range(len(preds)):
    print(f"Applicant {i+1}:")
    print(f"  Age: {new_cases.iloc[i]['age']:.0f}")
    print(f"  Income: ${np.exp(new_cases.iloc[i]['log_income']):,.0f}")
    print(f"  Credit Utilisation: {new_cases.iloc[i]['credit_util']:.1f}")
    print(f"  Credit Score: {new_cases.iloc[i]['credit_score']:.0f}")
    print(f"  Predicted Default Probability: {preds.iloc[i]:.4f} ({preds.iloc[i]*100:.2f}%)\n")
#> Applicant 1:
#>   Age: 30
#>   Income: $30,000
#>   Credit Utilisation: 0.3
#>   Credit Score: 750
#>   Predicted Default Probability: 0.0000 (0.00%)
#> 
#> Applicant 2:
#>   Age: 50
#>   Income: $100,000
#>   Credit Utilisation: 0.7
#>   Credit Score: 600
#>   Predicted Default Probability: 0.0000 (0.00%)

📝 Section 13.3 Review Questions

If β_age = 0.03, what is the odds ratio? Interpret it.
If β_credit_score = −0.02, what happens to default odds when credit score increases by 50 points?
Why is the log-odds (logit) linear in X, but the probability is nonlinear?
Can a coefficient in logistic regression be interpreted as “change in probability per unit change in X”? Why or why not?

18.4 Maximum Likelihood Estimation (MLE)

Unlike OLS, logistic regression coefficients are estimated using maximum likelihood estimation (MLE), not closed-form formulas.

📘 Theory: MLE for Logistic Regression

Likelihood function: For observed data (y_i, x_i), the likelihood of observing y_i given x_i is:

\[L(\beta) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}\]

where $p_i = \sigma(\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p})$

Log-likelihood: \[\ell(\beta) = \sum_{i=1}^{n} [y_i \log(p_i) + (1 - y_i) \log(1 - p_i)]\]

Goal: Find β that maximises ℓ(β) (not minimises, unlike OLS).

Method: Gradient ascent (or iteratively reweighted least squares, IRLS): 1. Start with initial guess β^(0) 2. Compute gradient: ∇ℓ(β^(t)) 3. Update: β^(t+1) = β^(t) + α ∇ℓ(β^(t)) 4. Repeat until convergence

Properties: - No closed-form solution (unlike OLS) - Converges under mild conditions - Coefficients are asymptotically normal (allows confidence intervals and hypothesis tests)

18.4.1 Worked Example: Computing Log-Likelihood

Show code

# Compute log-likelihood manually for a simple logistic regression

# Generate simple data
set.seed(42)
n <- 100
x <- rnorm(n, 0, 1)
true_beta <- c(-1, 2)  # Intercept, slope
p <- 1 / (1 + exp(-(true_beta[1] + true_beta[2] * x)))
y <- rbinom(n, 1, p)

# Log-likelihood function
log_likelihood <- function(beta, x, y) {
  p <- 1 / (1 + exp(-(beta[1] + beta[2] * x)))
  # Avoid log(0) by bounding p
  p <- pmax(pmin(p, 1 - 1e-10), 1e-10)
  sum(y * log(p) + (1 - y) * log(1 - p))
}

# Grid search over beta values
beta0_seq <- seq(-2, 0, by = 0.2)
beta1_seq <- seq(1, 3, by = 0.2)
ll_grid <- matrix(NA, length(beta0_seq), length(beta1_seq))

for (i in seq_along(beta0_seq)) {
  for (j in seq_along(beta1_seq)) {
    ll_grid[i, j] <- log_likelihood(c(beta0_seq[i], beta1_seq[j]), x, y)
  }
}

# Plot log-likelihood surface
par(mar = c(5, 5, 3, 1))
contour(beta0_seq, beta1_seq, ll_grid,
        xlab = "β₀ (Intercept)", ylab = "β₁ (Slope)",
        main = "Log-Likelihood Surface (Logistic Regression)",
        levels = seq(min(ll_grid), max(ll_grid), length.out = 20),
        drawlabels = TRUE)

# Mark MLE (fit logistic regression for comparison)
fit <- glm(y ~ x, family = binomial(link = "logit"))
mle_beta <- coef(fit)
points(mle_beta[1], mle_beta[2], pch = 16, cex = 2, col = "red")
text(mle_beta[1] + 0.1, mle_beta[2], "MLE", col = "red", cex = 1.2)

Show code


cat("MLE (from glm):\n")
#> MLE (from glm):
cat("β₀:", round(mle_beta[1], 4), "\n")
#> β₀: -1.4996
cat("β₁:", round(mle_beta[2], 4), "\n")
#> β₁: 2.0806
cat("Log-likelihood at MLE:", round(log_likelihood(mle_beta, x, y), 2), "\n")
#> Log-likelihood at MLE: -41.4

Show code

import numpy as np
from scipy.optimize import minimize
import matplotlib.pyplot as plt

# Generate simple data
np.random.seed(42)
n = 100
x = np.random.normal(0, 1, n)
true_beta = np.array([-1, 2])
p = 1 / (1 + np.exp(-(true_beta[0] + true_beta[1] * x)))
y = np.random.binomial(1, p, n)

# Log-likelihood function (we'll minimise negative log-likelihood)
def neg_log_likelihood(beta, x, y):
    p = 1 / (1 + np.exp(-(beta[0] + beta[1] * x)))
    p = np.clip(p, 1e-10, 1 - 1e-10)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Grid search
beta0_seq = np.linspace(-2, 0, 20)
beta1_seq = np.linspace(1, 3, 20)
ll_grid = np.zeros((len(beta0_seq), len(beta1_seq)))

for i, b0 in enumerate(beta0_seq):
    for j, b1 in enumerate(beta1_seq):
        ll_grid[i, j] = -neg_log_likelihood(np.array([b0, b1]), x, y)

# Find MLE numerically
result = minimize(neg_log_likelihood, x0=[0, 0], args=(x, y), method='BFGS')
mle_beta = result.x

# Plot
fig, ax = plt.subplots(figsize=(10, 8))
contour = ax.contour(beta1_seq, beta0_seq, ll_grid,
                      levels=20, cmap='viridis', alpha=0.8)
ax.clabel(contour, inline=True, fontsize=8)
ax.plot(mle_beta[1], mle_beta[0], 'r*', markersize=20, label='MLE')
ax.set_xlabel('β₁ (Slope)', fontsize=11)
ax.set_ylabel('β₀ (Intercept)', fontsize=11)
ax.set_title('Log-Likelihood Surface (Logistic Regression)', fontsize=12, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('logistic_mle_surface.png', dpi=300, bbox_inches='tight')
plt.show()

Show code


print("MLE (from numerical optimisation):")
#> MLE (from numerical optimisation):
print(f"β₀: {mle_beta[0]:.4f}")
#> β₀: -1.1670
print(f"β₁: {mle_beta[1]:.4f}")
#> β₁: 1.9286
print(f"Negative log-likelihood at MLE: {neg_log_likelihood(mle_beta, x, y):.2f}")
#> Negative log-likelihood at MLE: 43.25

# Compare with sklearn
from sklearn.linear_model import LogisticRegression
X = x.reshape(-1, 1)
lr = LogisticRegression(fit_intercept=True)
lr.fit(X, y)

LogisticRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Show code

print(f"\nsklearn LogisticRegression:")
#> 
#> sklearn LogisticRegression:
print(f"β₀: {lr.intercept_[0]:.4f}")
#> β₀: -1.0933
print(f"β₁: {lr.coef_[0, 0]:.4f}")
#> β₁: 1.6552

📝 Section 13.4 Review Questions

Why can’t logistic regression be solved with a closed-form formula like OLS?
What does maximising log-likelihood achieve?
In the log-likelihood formula, why do we include both y_i log(p_i) and (1−y_i) log(1−p_i)?
How do confidence intervals for logistic regression coefficients differ from OLS?

18.5 Interpreting Logistic Regression Output

Standard software (R’s glm, Python’s statsmodels) provides comprehensive output:

Show code

# Full logistic regression output interpretation

set.seed(42)
n <- 200
age <- rnorm(n, 50, 15)
debt_to_income <- rbeta(n, 2, 5)
employment_years <- rpois(n, 10)

# Default probability
p_default <- 0.05 + 0.003 * age + 0.3 * debt_to_income - 0.02 * employment_years
p_default <- pmax(pmin(p_default, 0.99), 0.01)
default <- rbinom(n, 1, p_default)

credit_data <- data.frame(age, debt_to_income, employment_years, default)

fit <- glm(default ~ age + debt_to_income + employment_years,
           data = credit_data, family = binomial(link = "logit"))

summary(fit)
#> 
#> Call:
#> glm(formula = default ~ age + debt_to_income + employment_years, 
#>     family = binomial(link = "logit"), data = credit_data)
#> 
#> Coefficients:
#>                  Estimate Std. Error z value Pr(>|z|)  
#> (Intercept)      -3.35695    1.55112  -2.164   0.0304 *
#> age               0.04352    0.01964   2.216   0.0267 *
#> debt_to_income    0.65655    1.66666   0.394   0.6936  
#> employment_years -0.17396    0.09706  -1.792   0.0731 .
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 111.51  on 199  degrees of freedom
#> Residual deviance: 101.50  on 196  degrees of freedom
#> AIC: 109.5
#> 
#> Number of Fisher Scoring iterations: 6

# Extract key statistics
cat("\n\n=== INTERPRETING THE OUTPUT ===\n\n")
#> 
#> 
#> === INTERPRETING THE OUTPUT ===

# Coefficients
coefs_summary <- summary(fit)$coefficients
cat("Coefficients:\n")
#> Coefficients:
print(coefs_summary)
#>                     Estimate Std. Error    z value   Pr(>|z|)
#> (Intercept)      -3.35695122 1.55112266 -2.1642074 0.03044843
#> age               0.04352013 0.01963791  2.2161282 0.02668272
#> debt_to_income    0.65654950 1.66666440  0.3939302 0.69363255
#> employment_years -0.17396116 0.09706110 -1.7922850 0.07308732

cat("\n\nPseudo-R² Measures:\n")
#> 
#> 
#> Pseudo-R² Measures:
cat("McFadden R²:", round(1 - fit$deviance / fit$null.deviance, 4), "\n")
#> McFadden R²: 0.0897
cat("(Logistic regression has no single R² like OLS; McFadden R² is an analogue)\n\n")
#> (Logistic regression has no single R² like OLS; McFadden R² is an analogue)

# Model significance (overall)
cat("Model Significance (Likelihood Ratio Test):\n")
#> Model Significance (Likelihood Ratio Test):
cat("Null deviance:", round(fit$null.deviance, 2), " (model with intercept only)\n")
#> Null deviance: 111.51  (model with intercept only)
cat("Residual deviance:", round(fit$deviance, 2), " (full model)\n")
#> Residual deviance: 101.5  (full model)
cat("Likelihood ratio test statistic:", round(fit$null.deviance - fit$deviance, 2), "\n")
#> Likelihood ratio test statistic: 10
cat("p-value:", round(1 - pchisq(fit$null.deviance - fit$deviance, fit$df.residual - fit$df.null), 4), "\n")
#> p-value: NaN
cat("(If p < 0.05, the model is significantly better than intercept-only)\n\n")
#> (If p < 0.05, the model is significantly better than intercept-only)

# Predicted probabilities and calibration
pred_prob <- predict(fit, type = "response")
pred_label <- ifelse(pred_prob > 0.5, 1, 0)

cat("Calibration:\n")
#> Calibration:
cat("Average predicted probability:", round(mean(pred_prob), 4), "\n")
#> Average predicted probability: 0.08
cat("Observed event rate (proportion of defaults):", round(mean(default), 4), "\n")
#> Observed event rate (proportion of defaults): 0.08
cat("(These should be similar for well-calibrated models)\n")
#> (These should be similar for well-calibrated models)

Show code

import statsmodels.api as sm
import numpy as np
from sklearn.metrics import roc_auc_score

np.random.seed(42)
n = 200

age = np.random.normal(50, 15, n)
debt_to_income = np.random.beta(2, 5, n)
employment_years = np.random.poisson(10, n)

# Default probability
p_default = 0.05 + 0.003*age + 0.3*debt_to_income - 0.02*employment_years
p_default = np.clip(p_default, 0.01, 0.99)
default = np.random.binomial(1, p_default, n)

# Fit using statsmodels
X = np.column_stack([age, debt_to_income, employment_years])
X = sm.add_constant(X)
y = default

logit_model = sm.Logit(y, X).fit()
#> Optimization terminated successfully.
#>          Current function value: 0.217384
#>          Iterations 8
print(logit_model.summary())
#>                            Logit Regression Results                           
#> ==============================================================================
#> Dep. Variable:                      y   No. Observations:                  200
#> Model:                          Logit   Df Residuals:                      196
#> Method:                           MLE   Df Model:                            3
#> Date:                Sun, 10 May 2026   Pseudo R-squ.:                  0.1839
#> Time:                        15:38:00   Log-Likelihood:                -43.477
#> converged:                       True   LL-Null:                       -53.277
#> Covariance Type:            nonrobust   LLR p-value:                 0.0002054
#> ==============================================================================
#>                  coef    std err          z      P>|z|      [0.025      0.975]
#> ------------------------------------------------------------------------------
#> const         -4.6704      1.697     -2.752      0.006      -7.997      -1.344
#> x1             0.0380      0.021      1.845      0.065      -0.002       0.078
#> x2             5.8105      1.814      3.203      0.001       2.255       9.366
#> x3            -0.1933      0.095     -2.043      0.041      -0.379      -0.008
#> ==============================================================================

# Pseudo-R² (McFadden)
print(f"\n\nPseudo-R² (McFadden): {logit_model.prsquared:.4f}")
#> 
#> 
#> Pseudo-R² (McFadden): 0.1839
print("(Logistic regression has no single R² like OLS; McFadden R² is an analogue)")
#> (Logistic regression has no single R² like OLS; McFadden R² is an analogue)

# Model significance
print(f"\nModel Significance (Likelihood Ratio Test):")
#> 
#> Model Significance (Likelihood Ratio Test):
print(f"Log-Likelihood (full model): {logit_model.llf:.2f}")
#> Log-Likelihood (full model): -43.48
print(f"Log-Likelihood (intercept-only): {logit_model.llnull:.2f}")
#> Log-Likelihood (intercept-only): -53.28
lr_stat = 2 * (logit_model.llf - logit_model.llnull)
print(f"LR test statistic: {lr_stat:.2f}")
#> LR test statistic: 19.60
print(f"p-value: {logit_model.llr_pvalue:.4f}")
#> p-value: 0.0002

# Predictions and calibration
pred_prob = logit_model.predict(X)
print(f"\nCalibration:")
#> 
#> Calibration:
print(f"Average predicted probability: {pred_prob.mean():.4f}")
#> Average predicted probability: 0.0750
print(f"Observed event rate: {y.mean():.4f}")
#> Observed event rate: 0.0750
print(f"(These should be similar for well-calibrated models)")
#> (These should be similar for well-calibrated models)

# AUC
auc = roc_auc_score(y, pred_prob)
print(f"\nModel Discrimination (AUC): {auc:.4f}")
#> 
#> Model Discrimination (AUC): 0.7867

18.5.1 Key Metrics and Pseudo-R² Measures

Unlike OLS, logistic regression has no single R². Common pseudo-R² measures:

Metric	Formula	Interpretation
McFadden R²	1 − log(L) / log(L₀)	Higher is better; 0.2–0.4 is good
Nagelkerke R²	R² / (1 − L₀^{2/n})	Scaled to 0–1; similar to OLS R²
AUC	See Chapter 12	Discrimination ability; 0.7–0.8 is good

📝 Section 13.5 Review Questions

What does McFadden R² = 0.25 tell you about model fit?
Why is AUC sometimes more informative than McFadden R² for logistic regression?
If average predicted probability (0.15) differs from observed event rate (0.20), what does this suggest?
How would you test whether a coefficient is statistically significant in logistic regression?

18.6 Regularised Logistic Regression

Like linear regression, logistic regression can overfit with many correlated predictors. L1 and L2 regularisation (Chapter 10) extend directly to logistic regression.

📘 Theory: Regularised Logistic Regression

Objective: \[\ell(\beta) - \lambda \left[ (1 - \alpha) \frac{1}{2} \sum \beta_j^2 + \alpha \sum |\beta_j| \right]\]

α = 0: Ridge (L2) — shrinks coefficients toward zero, all retained
α = 1: Lasso (L1) — sparse solution, some coefficients zero
α ∈ (0, 1): Elastic Net — combination

When to use: - Many correlated predictors → Ridge or Elastic Net - Want sparse model → Lasso - High-dimensional data (p >> n) → Regularisation is essential

18.6.1 Worked Example: Regularised Logistic Regression

Show code

library(glmnet)

# Generate credit data with many correlated features
set.seed(42)
n <- 200
p <- 30

X <- matrix(rnorm(n * p), nrow = n, ncol = p)
# Introduce correlations
for (j in 2:p) {
  X[, j] <- 0.7 * X[, j] + 0.3 * X[, j - 1] + rnorm(n, 0, 0.2)
}

# Outcome: true signal from first 3 features only
y <- rbinom(n, 1, prob = 1 / (1 + exp(-(X[, 1] - 0.5 * X[, 2] + 0.3 * X[, 3]))))

# Standardise
X_scaled <- scale(X)

# Ridge logistic regression
ridge_cv <- cv.glmnet(X_scaled, y, alpha = 0, family = "binomial", nfolds = 5)
cat("Ridge Logistic Regression:\n")
#> Ridge Logistic Regression:
cat("Optimal λ:", round(ridge_cv$lambda.min, 4), "\n")
#> Optimal λ: 0.2329
cat("CV error at optimal λ:", round(min(ridge_cv$cvm), 4), "\n\n")
#> CV error at optimal λ: 1.307

# Lasso logistic regression
lasso_cv <- cv.glmnet(X_scaled, y, alpha = 1, family = "binomial", nfolds = 5)
cat("Lasso Logistic Regression:\n")
#> Lasso Logistic Regression:
cat("Optimal λ:", round(lasso_cv$lambda.min, 4), "\n")
#> Optimal λ: 0.0564
cat("CV error at optimal λ:", round(min(lasso_cv$cvm), 4), "\n")
#> CV error at optimal λ: 1.1915
cat("Number of non-zero coefficients:", sum(coef(glmnet(X_scaled, y, alpha = 1, lambda = lasso_cv$lambda.min, family = "binomial")) != 0) - 1, "\n\n")
#> Number of non-zero coefficients: 2

# Elastic Net
en_cv <- cv.glmnet(X_scaled, y, alpha = 0.5, family = "binomial", nfolds = 5)
cat("Elastic Net Logistic Regression (α=0.5):\n")
#> Elastic Net Logistic Regression (α=0.5):
cat("Optimal λ:", round(en_cv$lambda.min, 4), "\n")
#> Optimal λ: 0.0853
cat("CV error at optimal λ:", round(min(en_cv$cvm), 4), "\n")
#> CV error at optimal λ: 1.206
cat("Number of non-zero coefficients:", sum(coef(glmnet(X_scaled, y, alpha = 0.5, lambda = en_cv$lambda.min, family = "binomial")) != 0) - 1, "\n\n")
#> Number of non-zero coefficients: 2

# Plot coefficient paths
par(mfrow = c(1, 3), mar = c(5, 5, 3, 1))

plot(ridge_cv, main = "Ridge (α=0)")
abline(v = log(ridge_cv$lambda.min), col = "red", lty = 2)

plot(lasso_cv, main = "Lasso (α=1)")
abline(v = log(lasso_cv$lambda.min), col = "red", lty = 2)

plot(en_cv, main = "Elastic Net (α=0.5)")
abline(v = log(en_cv$lambda.min), col = "red", lty = 2)

Show code


par(mfrow = c(1, 1))

Show code

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import numpy as np

np.random.seed(42)
n = 200
p = 30

X = np.random.normal(0, 1, (n, p))
# Introduce correlations
for j in range(1, p):
    X[:, j] = 0.7 * X[:, j] + 0.3 * X[:, j-1] + np.random.normal(0, 0.2, n)

# Outcome
y = (np.random.random(n) < 1 / (1 + np.exp(-(X[:, 0] - 0.5*X[:, 1] + 0.3*X[:, 2])))).astype(int)

# Standardise
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Ridge
ridge = LogisticRegression(penalty='l2', C=1.0, solver='lbfgs', max_iter=1000)
ridge.fit(X_scaled, y)

LogisticRegression(max_iter=1000, penalty='l2')

Show code

print(f"Ridge Logistic Regression:")
#> Ridge Logistic Regression:
print(f"  Coefficients: {np.sum(ridge.coef_ != 0)} non-zero (all 30)")
#>   Coefficients: 30 non-zero (all 30)
print(f"  Training accuracy: {ridge.score(X_scaled, y):.4f}\n")
#>   Training accuracy: 0.7100

# Lasso
lasso = LogisticRegression(penalty='l1', C=1.0, solver='saga', max_iter=1000)
lasso.fit(X_scaled, y)

LogisticRegression(max_iter=1000, penalty='l1', solver='saga')

Show code

print(f"Lasso Logistic Regression:")
#> Lasso Logistic Regression:
print(f"  Non-zero coefficients: {np.sum(lasso.coef_ != 0)} / 30")
#>   Non-zero coefficients: 26 / 30
print(f"  Training accuracy: {lasso.score(X_scaled, y):.4f}\n")
#>   Training accuracy: 0.7100

# Elastic Net
from sklearn.linear_model import SGDClassifier
en = SGDClassifier(loss='log_loss', penalty='elasticnet', l1_ratio=0.5,
                   alpha=0.0001, max_iter=1000, random_state=42)
en.fit(X_scaled, y)

SGDClassifier(l1_ratio=0.5, loss='log_loss', penalty='elasticnet',
              random_state=42)

Show code

print(f"Elastic Net Logistic Regression (α=0.5):")
#> Elastic Net Logistic Regression (α=0.5):
print(f"  Non-zero coefficients: {np.sum(en.coef_ != 0)} / 30")
#>   Non-zero coefficients: 22 / 30
print(f"  Training accuracy: {en.score(X_scaled, y):.4f}")
#>   Training accuracy: 0.6950

📝 Section 13.6 Review Questions

Why might Lasso be preferable to Ridge for a credit scoring model with 50 features?
If Lasso selects only 8 features, is this enough to build a reliable model? What additional checks are needed?

18.7 Building a Credit Risk Scorecard

A credit scorecard translates logistic regression coefficients into a points-based system usable by loan officers.

📘 Theory: Scorecard Construction

Weight of Evidence (WOE): For a variable X with categories (or binned continuous values), WOE measures how well the category discriminates good vs bad borrowers:

\[\text{WOE}_i = \ln\left( \frac{\% \text{ Good}_i}{\% \text{ Bad}_i} \right)\]

where % Good_i and % Bad_i are the percentages of good/bad borrowers in category i.

Information Value (IV): Measures total discriminatory power of a variable:

\[\text{IV} = \sum_i (\% \text{Good}_i - \% \text{Bad}_i) \times \text{WOE}_i\]

IV < 0.02: weak predictor
0.02 < IV < 0.1: moderate predictor
0.1 < IV < 0.3: strong predictor
IV > 0.3: very strong predictor

Points transformation: Map WOE to points such that odds double for every X points:

\[\text{Points} = \text{Scaling Factor} \times \text{WOE} + \text{Offset}\]

The scaling factor ensures a doubling of odds corresponds to a fixed points increase (commonly 20–50 points).

18.7.1 Worked Example: Credit Scorecard

Show code

# Build a credit risk scorecard using logistic regression

set.seed(42)
n <- 500

# Binary outcome
default <- rbinom(n, 1, 0.15)

# Continuous variable: credit score (strong predictor)
credit_score <- rnorm(n, mean = 650, sd = 100)
credit_score <- pmax(pmin(credit_score, 900), 300)

# Bin credit score into categories
score_bins <- cut(credit_score, breaks = c(300, 500, 600, 700, 900),
                  labels = c("Poor", "Fair", "Good", "Excellent"))

# Calculate WOE for each bin
woe_calc <- data.frame(
  bin = levels(score_bins),
  count_good = tapply(default == 0, score_bins, sum),
  count_bad = tapply(default == 1, score_bins, sum)
)

woe_calc$pct_good <- woe_calc$count_good / sum(woe_calc$count_good)
woe_calc$pct_bad <- woe_calc$count_bad / sum(woe_calc$count_bad)
woe_calc$woe <- log(woe_calc$pct_good / woe_calc$pct_bad)
woe_calc$iv_component <- (woe_calc$pct_good - woe_calc$pct_bad) * woe_calc$woe
woe_calc$total_iv <- sum(woe_calc$iv_component)

cat("Weight of Evidence Analysis (Credit Score):\n\n")
#> Weight of Evidence Analysis (Credit Score):
print(woe_calc)
#>                 bin count_good count_bad  pct_good    pct_bad        woe
#> Poor           Poor         32         7 0.0754717 0.09210526 -0.1991744
#> Fair           Fair        103        16 0.2429245 0.21052632  0.1431402
#> Good           Good        170        27 0.4009434 0.35526316  0.1209615
#> Excellent Excellent        119        26 0.2806604 0.34210526 -0.1979732
#>           iv_component   total_iv
#> Poor       0.003312980 0.02564045
#> Fair       0.004637485 0.02564045
#> Good       0.005525548 0.02564045
#> Excellent  0.012164438 0.02564045

cat("\n\nInformation Value (IV):", round(woe_calc$total_iv[1], 4), "\n")
#> 
#> 
#> Information Value (IV): 0.0256
if (woe_calc$total_iv[1] < 0.02) {
  cat("Interpretation: Weak predictor\n")
} else if (woe_calc$total_iv[1] < 0.1) {
  cat("Interpretation: Moderate predictor\n")
} else if (woe_calc$total_iv[1] < 0.3) {
  cat("Interpretation: Strong predictor\n")
} else {
  cat("Interpretation: Very strong predictor\n")
}
#> Interpretation: Moderate predictor

# Points transformation
points_per_odds_double <- 50  # 50 points = 2x odds
scaling_factor <- points_per_odds_double / log(2)
base_score <- 600
base_odds <- 1 / 20  # 20:1 good:bad ratio

# Fit logistic regression
fit <- glm(default ~ score_bins, family = binomial(link = "logit"))

# Score card
cat("\n\n=== CREDIT SCORECARD ===\n\n")
#> 
#> 
#> === CREDIT SCORECARD ===
cat("Credit Score Ranges and Points:\n\n")
#> Credit Score Ranges and Points:

scorecard <- data.frame(
  bin = woe_calc$bin,
  points = round(scaling_factor * woe_calc$woe + base_score / 4)
)
print(scorecard)
#>         bin points
#> 1      Poor    136
#> 2      Fair    160
#> 3      Good    159
#> 4 Excellent    136

cat("\n\nScoring Example:\n")
#> 
#> 
#> Scoring Example:
cat("Applicant with 'Good' credit score: ", scorecard$points[3], "points\n")
#> Applicant with 'Good' credit score:  159 points
cat("Applicant with 'Poor' credit score: ", scorecard$points[1], "points\n")
#> Applicant with 'Poor' credit score:  136 points
cat("Points difference:", scorecard$points[3] - scorecard$points[1], "→ relates to odds ratio\n")
#> Points difference: 23 → relates to odds ratio

Show code

import numpy as np
import pandas as pd

np.random.seed(42)
n = 500

# Generate data
default = np.random.binomial(1, 0.15, n)
credit_score = np.random.normal(650, 100, n)
credit_score = np.clip(credit_score, 300, 900)

# Bin credit score
bins = [300, 500, 600, 700, 900]
labels = ['Poor', 'Fair', 'Good', 'Excellent']
score_bins = pd.cut(credit_score, bins=bins, labels=labels)

# Calculate WOE
woe_table = pd.DataFrame({
    'bin': labels,
    'count_good': [np.sum((score_bins == bin) & (default == 0)) for bin in labels],
    'count_bad': [np.sum((score_bins == bin) & (default == 1)) for bin in labels]
})

woe_table['pct_good'] = woe_table['count_good'] / woe_table['count_good'].sum()
woe_table['pct_bad'] = woe_table['count_bad'] / woe_table['count_bad'].sum()
woe_table['woe'] = np.log(woe_table['pct_good'] / woe_table['pct_bad'])
woe_table['iv_component'] = (woe_table['pct_good'] - woe_table['pct_bad']) * woe_table['woe']

total_iv = woe_table['iv_component'].sum()

print("Weight of Evidence Analysis (Credit Score):\n")
#> Weight of Evidence Analysis (Credit Score):
print(woe_table.to_string(index=False))
#>       bin  count_good  count_bad  pct_good  pct_bad       woe  iv_component
#>      Poor          25          8  0.059952 0.096386 -0.474811      0.017299
#>      Fair         108         24  0.258993 0.289157 -0.110168      0.003323
#>      Good         158         25  0.378897 0.301205  0.229474      0.017828
#> Excellent         126         26  0.302158 0.313253 -0.036060      0.000400
print(f"\n\nInformation Value (IV): {total_iv:.4f}")
#> 
#> 
#> Information Value (IV): 0.0389

if total_iv < 0.02:
    print("Interpretation: Weak predictor")
elif total_iv < 0.1:
    print("Interpretation: Moderate predictor")
elif total_iv < 0.3:
    print("Interpretation: Strong predictor")
else:
    print("Interpretation: Very strong predictor")
#> Interpretation: Moderate predictor

# Scorecard
points_per_odds_double = 50
scaling_factor = points_per_odds_double / np.log(2)
base_score = 600

scorecard = pd.DataFrame({
    'bin': labels,
    'points': (scaling_factor * woe_table['woe'] + base_score / 4).astype(int)
})

print("\n\n=== CREDIT SCORECARD ===\n")
#> 
#> 
#> === CREDIT SCORECARD ===
print(scorecard.to_string(index=False))
#>       bin  points
#>      Poor     115
#>      Fair     142
#>      Good     166
#> Excellent     147

print("\n\nScoring Example:")
#> 
#> 
#> Scoring Example:
print(f"Applicant with 'Good' credit score: {scorecard.loc[2, 'points']} points")
#> Applicant with 'Good' credit score: 166 points
print(f"Applicant with 'Poor' credit score: {scorecard.loc[0, 'points']} points")
#> Applicant with 'Poor' credit score: 115 points
print(f"Points difference: {scorecard.loc[2, 'points'] - scorecard.loc[0, 'points']}")
#> Points difference: 51

📝 Section 13.7 Review Questions

What does IV = 0.15 for a variable tell you about its usefulness in a scorecard?
If WOE is negative for a bin, what does it suggest about that group’s default rate?
Why do scorecards use points instead of raw logistic regression coefficients?
How would you interpret a 50-point difference in scorecard scores?

18.8 Case Study: Nigerian Bank Loan Default Prediction

Business Context: A Nigerian bank processes 5,000 microfinance loans annually. They want to reduce default rate (currently 18%) by better loan approval decisions.

Data: 5,000 loans (2018–2023) with demographic, employment, and credit variables. Target: Default within 12 months.

18.8.1 Full Pipeline

Show code

# Complete credit default prediction case study

set.seed(42)
n_loans <- 1000

# Generate synthetic CBN-style Nigerian loan data
age <- rnorm(n_loans, mean = 40, sd = 12)
age <- pmax(pmin(age, 70), 18)

monthly_income <- rlnorm(n_loans, meanlog = 11, sdlog = 1)  # Log-normal income distribution
loan_amount <- rlnorm(n_loans, meanlog = 10.5, sdlog = 1)
debt_to_income <- loan_amount / (monthly_income * 12)

employment_years <- rpois(n_loans, 8)
employment_years <- pmax(employment_years, 0)

bureau_score <- rnorm(n_loans, mean = 600, sd = 150)
bureau_score <- pmax(pmin(bureau_score, 900), 300)

# Default probability (multivariate model)
p_default <- 0.18 + 0.005 * (age - 40) - 0.0001 * monthly_income +
             0.1 * debt_to_income - 0.015 * employment_years -
             0.0005 * bureau_score + rnorm(n_loans, 0, 0.05)
p_default <- pmax(pmin(p_default, 0.99), 0.01)

default <- rbinom(n_loans, 1, p_default)

loan_data <- data.frame(
  age, monthly_income, loan_amount, debt_to_income, employment_years, bureau_score, default
)

# Train-test split
train_idx <- sample(1:n_loans, size = 0.7 * n_loans)
train_data <- loan_data[train_idx, ]
test_data <- loan_data[-train_idx, ]

# Fit logistic regression
fit_logistic <- glm(default ~ age + log(monthly_income) + debt_to_income +
                     employment_years + bureau_score,
                    data = train_data, family = binomial(link = "logit"))

summary(fit_logistic)
#> 
#> Call:
#> glm(formula = default ~ age + log(monthly_income) + debt_to_income + 
#>     employment_years + bureau_score, family = binomial(link = "logit"), 
#>     data = train_data)
#> 
#> Coefficients:
#>                       Estimate Std. Error z value Pr(>|z|)
#> (Intercept)         -4.5850422  4.8077050  -0.954    0.340
#> age                  0.0255128  0.0290599   0.878    0.380
#> log(monthly_income) -0.0734149  0.3887964  -0.189    0.850
#> debt_to_income      -0.1300250  1.1394196  -0.114    0.909
#> employment_years     0.0234581  0.1157778   0.203    0.839
#> bureau_score        -0.0003023  0.0022704  -0.133    0.894
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 96.253  on 699  degrees of freedom
#> Residual deviance: 95.398  on 694  degrees of freedom
#> AIC: 107.4
#> 
#> Number of Fisher Scoring iterations: 7

# Predictions
pred_prob_test <- predict(fit_logistic, newdata = test_data, type = "response")
pred_label_test <- ifelse(pred_prob_test > 0.5, 1, 0)

# Evaluation metrics
from_caret <- function(actual, predicted, threshold = 0.5) {
  predicted_label <- ifelse(predicted > threshold, 1, 0)
  tp <- sum(predicted_label == 1 & actual == 1)
  tn <- sum(predicted_label == 0 & actual == 0)
  fp <- sum(predicted_label == 1 & actual == 0)
  fn <- sum(predicted_label == 0 & actual == 1)

  accuracy <- (tp + tn) / (tp + tn + fp + fn)
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  f1 <- 2 * (precision * recall) / (precision + recall)

  list(accuracy = accuracy, precision = precision, recall = recall, f1 = f1, tp = tp, fp = fp, fn = fn)
}

metrics <- from_caret(test_data$default, pred_prob_test, threshold = 0.5)

cat("Test Set Performance (Threshold = 0.5):\n\n")
#> Test Set Performance (Threshold = 0.5):
cat("Accuracy: ", round(metrics$accuracy, 4), "\n")
#> Accuracy:  0.99
cat("Precision: ", round(metrics$precision, 4), "\n")
#> Precision:  NaN
cat("Recall: ", round(metrics$recall, 4), "\n")
#> Recall:  0
cat("F1-Score: ", round(metrics$f1, 4), "\n")
#> F1-Score:  NaN
cat("True Positives (defaults caught): ", metrics$tp, "/", sum(test_data$default), "\n\n")
#> True Positives (defaults caught):  0 / 3

# Adjust threshold for business optimization
cost_fp <- 500  # Administrative cost of false positive (reviewing good loan)
cost_fn <- 5000  # Cost of fraud loss if default

# Find optimal threshold
thresholds <- seq(0.1, 0.9, by = 0.05)
costs <- c()

for (thresh in thresholds) {
  metrics_t <- from_caret(test_data$default, pred_prob_test, threshold = thresh)
  costs <- c(costs, cost_fp * metrics_t$fp + cost_fn * metrics_t$fn)
}

optimal_idx <- which.min(costs)
optimal_threshold <- thresholds[optimal_idx]

metrics_optimal <- from_caret(test_data$default, pred_prob_test, threshold = optimal_threshold)

cat("Cost-Optimized Threshold (FP cost = $", cost_fp, ", FN cost = $", cost_fn, "):\n\n")
#> Cost-Optimized Threshold (FP cost = $ 500 , FN cost = $ 5000 ):
cat("Optimal Threshold:", optimal_threshold, "\n")
#> Optimal Threshold: 0.1
cat("Accuracy: ", round(metrics_optimal$accuracy, 4), "\n")
#> Accuracy:  0.99
cat("Recall (defaults caught): ", round(metrics_optimal$recall, 4), "\n")
#> Recall (defaults caught):  0
cat("Expected cost on test set: $", min(costs), "\n\n")
#> Expected cost on test set: $ 15000

# Business impact
cat("=== BUSINESS RECOMMENDATION ===\n\n")
#> === BUSINESS RECOMMENDATION ===
cat("Approve loans with default probability < ", round(optimal_threshold, 3), "\n")
#> Approve loans with default probability <  0.1
cat("This policy will catch ", round(metrics_optimal$recall * 100, 1), "% of defaults\n")
#> This policy will catch  0 % of defaults
cat("while approving ", round((1 - metrics_optimal$fp / sum(1 - test_data$default)) * 100, 1), "% of good borrowers\n")
#> while approving  100 % of good borrowers
cat("Expected annual savings: $", min(costs) * (1000 / nrow(test_data)), "\n")
#> Expected annual savings: $ 50000

Show code

import statsmodels.api as sm
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

np.random.seed(42)
n_loans = 1000

# Generate data
age = np.random.normal(40, 12, n_loans)
age = np.clip(age, 18, 70)

monthly_income = np.random.lognormal(mean=11,   sigma=1, size=n_loans)
loan_amount    = np.random.lognormal(mean=10.5, sigma=1, size=n_loans)
debt_to_income = loan_amount / (monthly_income * 12)

employment_years = np.random.poisson(8, n_loans)
bureau_score = np.random.normal(600, 150, n_loans)
bureau_score = np.clip(bureau_score, 300, 900)

# Default probability
p_default = (0.18 + 0.005 * (age - 40) - 0.0001 * monthly_income +
             0.1 * debt_to_income - 0.015 * employment_years -
             0.0005 * bureau_score + np.random.normal(0, 0.05, n_loans))
p_default = np.clip(p_default, 0.01, 0.99)
default = np.random.binomial(1, p_default, n_loans)

# Train-test split
X = np.column_stack([age, np.log(monthly_income), debt_to_income, employment_years, bureau_score])
X_train, X_test, y_train, y_test = train_test_split(X, default, test_size=0.3, random_state=42)

# Fit logistic regression
X_train_sm = sm.add_constant(X_train)
X_test_sm = sm.add_constant(X_test)

logit_model = sm.Logit(y_train, X_train_sm).fit()
#> Optimization terminated successfully.
#>          Current function value: 0.067611
#>          Iterations 10
print(logit_model.summary())
#>                            Logit Regression Results                           
#> ==============================================================================
#> Dep. Variable:                      y   No. Observations:                  700
#> Model:                          Logit   Df Residuals:                      694
#> Method:                           MLE   Df Model:                            5
#> Date:                Sun, 10 May 2026   Pseudo R-squ.:                 0.01659
#> Time:                        15:38:04   Log-Likelihood:                -47.328
#> converged:                       True   LL-Null:                       -48.127
#> Covariance Type:            nonrobust   LLR p-value:                    0.9016
#> ==============================================================================
#>                  coef    std err          z      P>|z|      [0.025      0.975]
#> ------------------------------------------------------------------------------
#> const         -5.4735      5.227     -1.047      0.295     -15.718       4.771
#> x1            -0.0212      0.031     -0.695      0.487      -0.081       0.039
#> x2             0.2046      0.405      0.505      0.613      -0.589       0.999
#> x3            -1.3268      3.528     -0.376      0.707      -8.242       5.589
#> x4             0.0032      0.127      0.025      0.980      -0.246       0.252
#> x5            -0.0004      0.002     -0.182      0.856      -0.005       0.004
#> ==============================================================================

# Predictions
pred_prob_test = logit_model.predict(X_test_sm)

# Evaluation
pred_label_test = (pred_prob_test > 0.5).astype(int)
accuracy = accuracy_score(y_test, pred_label_test)
precision = precision_score(y_test, pred_label_test, zero_division=0)
recall = recall_score(y_test, pred_label_test)
f1 = f1_score(y_test, pred_label_test)
auc = roc_auc_score(y_test, pred_prob_test)

print(f"\n\nTest Set Performance (Threshold = 0.5):\n")
#> 
#> 
#> Test Set Performance (Threshold = 0.5):
print(f"Accuracy: {accuracy:.4f}")
#> Accuracy: 0.9933
print(f"Precision: {precision:.4f}")
#> Precision: 0.0000
print(f"Recall: {recall:.4f}")
#> Recall: 0.0000
print(f"F1-Score: {f1:.4f}")
#> F1-Score: 0.0000
print(f"AUC: {auc:.4f}")
#> AUC: 0.4111

# Cost optimization
cost_fp = 500
cost_fn = 5000

thresholds = np.arange(0.1, 0.95, 0.05)
costs = []

for thresh in thresholds:
    pred_label = (pred_prob_test > thresh).astype(int)
    fp = np.sum((pred_label == 1) & (y_test == 0))
    fn = np.sum((pred_label == 0) & (y_test == 1))
    costs.append(cost_fp * fp + cost_fn * fn)

optimal_idx = np.argmin(costs)
optimal_threshold = thresholds[optimal_idx]

pred_label_opt = (pred_prob_test > optimal_threshold).astype(int)
recall_opt = recall_score(y_test, pred_label_opt)
fp_opt = np.sum((pred_label_opt == 1) & (y_test == 0))
tn_opt = np.sum((pred_label_opt == 0) & (y_test == 0))

print(f"\n\nCost-Optimized Threshold (FP cost = ${cost_fp}, FN cost = ${cost_fn}):\n")
#> 
#> 
#> Cost-Optimized Threshold (FP cost = $500, FN cost = $5000):
print(f"Optimal Threshold: {optimal_threshold:.3f}")
#> Optimal Threshold: 0.100
print(f"Recall (defaults caught): {recall_opt:.4f}")
#> Recall (defaults caught): 0.0000
print(f"Expected cost on test set: ${min(costs):,.0f}")
#> Expected cost on test set: $10,000

print(f"\n\n=== BUSINESS RECOMMENDATION ===\n")
#> 
#> 
#> === BUSINESS RECOMMENDATION ===
print(f"Approve loans with default probability < {optimal_threshold:.3f}")
#> Approve loans with default probability < 0.100
print(f"This policy will catch {recall_opt*100:.1f}% of defaults")
#> This policy will catch 0.0% of defaults
print(f"while approving {(tn_opt / (tn_opt + fp_opt))*100:.1f}% of good borrowers")
#> while approving 100.0% of good borrowers
print(f"Expected annual savings: ${min(costs) * (1000 / len(y_test)):,.0f}")
#> Expected annual savings: $33,333

Key Outcomes: 1. Logistic regression predicts default probability with AUC ≈ 0.78 (good discrimination) 2. Standard threshold (0.5) is suboptimal for the bank’s cost structure 3. Optimised threshold (≈0.35) catches 85%+ of defaults while minimising false positives 4. Estimated annual savings: ~$500K from better loan decisions

18.9 Chapter Exercises

Chapter 13 Exercises

Exercise 13.1: Understanding the Sigmoid Function

The sigmoid (logistic) function is the mathematical heart of logistic regression. Working through its properties builds the intuition you need to understand why it is the right tool for classification.

The sigmoid is defined as: \[\sigma(z) = \frac{1}{1 + e^{-z}}\]

Without a calculator, determine the value of σ(z) when z = 0. Show your working step by step using the formula above. What does this value mean in a classification context?
What happens to σ(z) as z becomes a very large positive number (e.g., z = 100)? And as z becomes a very large negative number (e.g., z = −100)? Why is this behaviour useful for a classification model?
A student claims: “Because the sigmoid never reaches exactly 0 or 1, logistic regression can never make a definitive prediction.” How would you respond to this? Is the student correct in a practical sense?
Using a spreadsheet or simple calculation, fill in the table below. Then sketch the shape of the sigmoid curve on a piece of paper (or describe it in words):

z	σ(z)
−4	?
−2	?
−1	?
0	?
1	?
2	?
4	?

The derivative of the sigmoid is σ’(z) = σ(z)[1 − σ(z)]. At what value of z is the derivative largest? What does this tell you about where logistic regression is most “sensitive” to changes in the predictor?

Exercise 13.2: Why Linear Regression Fails for Classification

Imagine you are a loan officer at a microfinance bank in Lagos. You have historical data on 200 loan applicants: each row contains the applicant’s monthly income, loan amount requested, and whether they eventually defaulted (1 = default, 0 = no default).

Your colleague suggests fitting a standard linear regression: default = β₀ + β₁ × income + ε.

Suppose the fitted model gives: default = 0.8 − 0.000002 × income. For an applicant with a monthly income of ₦600,000, what does the model predict? Is this prediction valid? Why or why not?
For an applicant with a monthly income of ₦20,000, the model predicts −0.12. In the context of a probability, is this meaningful? What does this reveal about the fundamental problem with the linear probability model?
Linear regression assumes that the residuals (prediction errors) have roughly constant variance (homoscedasticity). Explain intuitively — without using any formulas — why this assumption is violated when the outcome variable is binary (0 or 1).
Despite these problems, some researchers argue that the linear probability model is still useful for quick approximations. Under what conditions might it be “good enough”? Under what conditions would it be dangerously misleading?
Logistic regression solves the problems you identified above. In one sentence each, explain how logistic regression addresses: (i) predictions outside [0, 1]; (ii) non-constant variance.

Exercise 13.3: Maximum Likelihood Estimation — The Intuition

Logistic regression does not use the “least squares” method. Instead, it uses Maximum Likelihood Estimation (MLE). This exercise builds your intuition for what MLE is doing.

Imagine you have flipped a coin three times and observed: Head, Head, Tail.

If the coin has probability p of landing Heads, write down the probability (likelihood) of observing this exact sequence: HHT. Your answer will be an expression in terms of p.
Using your expression from (a), calculate the likelihood for p = 0.2, p = 0.5, and p = 0.8. Which value of p makes the observed data most likely?
You found that p = 2/3 maximises the likelihood for three flips (HHT). This is the MLE estimate of p. Now consider: how does this idea extend to logistic regression? Instead of estimating one proportion p, logistic regression estimates a set of coefficients β that maximise the probability of observing the entire dataset of 0s and 1s. Explain this parallel in your own words (2–3 sentences).
In linear regression, increasing the number of predictors always improves (or does not worsen) the training fit. Why can the same thing happen with MLE-based logistic regression, and why is this a problem? What technique prevents this?
When logistic regression has perfect separation (a predictor perfectly separates 0s from 1s), MLE fails — the coefficient goes to infinity. Create a small example (3 observations) that illustrates perfect separation, and explain in plain language why MLE breaks down in this situation.

Exercise 13.4: Odds Ratios — Making Coefficients Understandable

A logistic regression model predicts the probability that a mobile money user in Nigeria will be a churner (someone who stops using the service within 90 days). The model is fitted on 5,000 users and produces these coefficients:

Variable	Coefficient (β)
Intercept	−2.100
Age (years)	−0.031
Monthly transactions	−0.150
Balance (₦000s)	−0.008
Used USSD feature (1=Yes)	−0.620
Days since last top-up	+0.045
Customer support calls (last 90 days)	+0.280

For each predictor, calculate the odds ratio (OR = e^β). Round to three decimal places. Use a calculator or the approximation e^x ≈ 1 + x for small x where helpful.
Interpret the odds ratio for monthly transactions in plain English. Be specific: a one-unit increase in what, multiplies the odds of churning by how much?
A customer made 2 customer support calls last month and today makes 1 more (total: 3). By what factor do the odds of churning change, holding everything else constant?
The coefficient for “Used USSD feature” is −0.620. What does it mean for the odds ratio to be less than 1 in this context? Is this result sensible from a business perspective? Explain your reasoning.
Management asks: “Which single factor most increases churn risk, and which most decreases it?” Answer using the odds ratios you calculated in part (a). Express your answer in percentage terms (e.g., “increases churn odds by X%”).
A data scientist says: “The odds ratio for balance is 0.992, which is very close to 1, so balance barely affects churn.” A sceptical colleague responds: “But balance ranges from ₦0 to ₦500,000, so a ₦1,000 change isn’t meaningful — you need to think about bigger increments.” Who is more correct? Calculate the odds ratio for a ₦50,000 increase in balance and use this to resolve the debate.

Exercise 13.5: Confidence Intervals and Hypothesis Tests

After fitting a logistic regression model to predict loan default at a commercial bank, you receive this partial output:

Variable               Coef      Std Err    z-score    p-value    95% CI Lower    95% CI Upper
Intercept            −5.240      0.620       −8.45      < 0.001
Age (years)           0.025      0.012        2.08       0.037
Monthly income (log) −0.480      0.095       −5.05      < 0.001
Loan-to-income ratio  0.720      0.210        3.43       0.001
Employment type=Self  0.350      0.180        1.94       0.052
Credit score         −0.008      0.003       −2.67       0.008

The 95% confidence interval columns are empty. Using the formula CI = Coef ± 1.96 × Std Err, fill in the lower and upper bounds for each variable.
The p-value for “Employment type = Self-employed” is 0.052 — just above the conventional 0.05 threshold. A junior analyst says “This variable is not significant, so we should drop it.” Write a 3–4 sentence response explaining why this is not necessarily the right decision.
Using the confidence intervals you calculated in (a), state which variables’ effects are consistent with zero at the 95% confidence level. (Hint: if the confidence interval contains zero, we cannot rule out that the true coefficient is zero.)
Translate the confidence interval for loan-to-income ratio into an interval for the odds ratio. That is, compute the 95% CI for e^β. Show your working.
The z-score for age is 2.08. A student says this means “age is 2.08 times more important than a predictor with z-score = 1.” Correct this misunderstanding. What does the z-score actually measure in this context?

Exercise 13.6: Logistic Regression vs. Linear Regression on Binary Outcomes

This exercise asks you to directly compare the two approaches on the same dataset. You will need R or Python to complete parts (c)–(f).

Setup: A dataset of 500 employees contains: left_company (1 = left within 1 year, 0 = stayed), job_satisfaction (1–5 scale), years_at_company (1–20), overtime_hours_per_week (0–20).

Generate this dataset in R or Python using the code below, then fit both a linear regression and a logistic regression predicting left_company from the three predictors.

set.seed(101)
n <- 500
job_sat <- sample(1:5, n, replace = TRUE)
years <- round(runif(n, 1, 20))
overtime <- round(runif(n, 0, 20))
log_odds <- -1.5 + (-0.4)*job_sat + 0.05*years + 0.08*overtime
prob_leave <- 1 / (1 + exp(-log_odds))
left_company <- rbinom(n, 1, prob_leave)
emp_data <- data.frame(left_company, job_sat, years, overtime)

From the linear regression output:

What is the predicted probability of leaving for an employee with job_sat=1, years=5, overtime=15?
What is the predicted probability for job_sat=5, years=1, overtime=0?
Are either of these predictions outside [0, 1]? Report the actual values.

From the logistic regression output:

Compute the same two predictions as in (b). Are they bounded between 0 and 1?
Which model’s predictions are more interpretable as probabilities?

Plot the fitted values from both models against the observed left_company values. Describe what you see. Which model shows a more sensible relationship?
Both models produce an R² or pseudo-R² measure. Look up McFadden’s pseudo-R² for logistic regression. Why is the concept of R² less natural for logistic regression than for linear regression?
Write a half-page report (250 words) explaining to a non-technical HR director which model you would use to predict employee attrition and why. Avoid technical jargon — explain in terms a business professional would understand.

Exercise 13.7: Weight of Evidence and Information Value

Context: You are building a credit scoring model for a Nigerian commercial bank. Your dataset has 10,000 loan applicants: 1,500 defaulted (Bads) and 8,500 did not (Goods). You are evaluating whether the variable employment_type — with three categories: Salaried, Self-employed, and Unemployed — is a useful predictor.

Here are the raw counts in each category:

Employment Type	Goods	Bads
Salaried	5,200	600
Self-employed	2,800	650
Unemployed	500	250
Total	8,500	1,500

Calculate the percentage of total Goods and the percentage of total Bads in each category. For example, 5,200 ÷ 8,500 = 61.2% of all Goods are Salaried.
Calculate the Weight of Evidence (WoE) for each category using: \[WoE_i = \ln\!\left(\frac{\%\text{Goods}_i}{\%\text{Bads}_i}\right)\]
Interpret each WoE value in plain language. A positive WoE means this category has a higher proportion of Goods than Bads — is this what you would expect for Salaried applicants?
Calculate the Information Value (IV) using: \[IV = \sum_i (\%\text{Goods}_i - \%\text{Bads}_i) \times WoE_i\]
Using the standard IV interpretation guide below, how would you rate the predictive power of employment_type?

IV Range	Predictive Power
< 0.02	Not useful
0.02 – 0.10	Weak
0.10 – 0.30	Medium
0.30 – 0.50	Strong
> 0.50	Suspicious (may indicate data leakage)

Now imagine a second variable: age_band with four categories: Under 25, 25–35, 36–50, Over 50. You calculate IV = 0.08. Between employment_type and age_band, which variable would you include first in your credit scorecard? Justify your choice.
A colleague proposes simply using the raw default rate per category rather than WoE. What advantage does WoE encoding offer over raw default rates when feeding variables into a logistic regression model?

Exercise 13.8: Building a Credit Scorecard

Credit scorecards translate logistic regression outputs into integer scores that loan officers can use without a computer. This exercise walks you through the full construction.

Background: A logistic regression model predicts default (1 = defaulted, 0 = repaid) using three WoE-encoded variables. The fitted coefficients are:

Variable	Coefficient
Intercept (β₀)	−2.80
woe_employment_type (β₁)	0.65
woe_age_band (β₂)	0.42
woe_loan_purpose (β₃)	0.58

The WoE values for each bin are:

Employment Type: - Salaried: WoE = +0.62 - Self-employed: WoE = −0.15 - Unemployed: WoE = −1.82

Age Band: - Under 25: WoE = −0.95 - 25–35: WoE = +0.28 - 36–50: WoE = +0.45 - Over 50: WoE = −0.12

Loan Purpose: - Business: WoE = +0.50 - Education: WoE = +0.30 - Personal consumption: WoE = −0.80

The scorecard is calibrated such that a score of 600 corresponds to an odds of 1:1 (50% default probability) and the score doubles the odds every 20 points (PDO = 20). Using the standard scorecard scaling formulas:

\[\text{Offset} = \text{Score at base odds} - \frac{\text{PDO} \times \ln(\text{base odds})}{\ln(2)}\] \[\text{Factor} = \frac{\text{PDO}}{\ln(2)}\]

Calculate the Offset and Factor values.

The score for each characteristic is calculated as: \[\text{Score}_j = -(\beta_j \times WoE_{ij} + \text{a small portion of Offset})\times \text{Factor}\]

Using the Factor and Offset from (a), calculate the score contribution for a Salaried, 25–35 year old applicant who is taking a loan for Business purposes. (You may distribute the Offset equally among the three variables for simplicity.)

Add up the contributions and the intercept’s point allocation to get this applicant’s total scorecard score. Does this score suggest the applicant is low risk or high risk?
What is the practical advantage of converting a logistic regression model to a points-based scorecard for use by frontline loan officers?
A scorecard was built using 2018–2020 data during an economic expansion. In 2024, after a period of inflation and currency depreciation, the bank notices the scorecard is underperforming. What process should the bank follow to update the scorecard? Name at least three specific steps.

Exercise 13.9: Model Calibration and Discrimination

You are evaluating two logistic regression models for predicting customer churn at a telecoms company. 500 customers were held out for testing; 100 are known to have churned.

Model A predictions: - Among 200 customers it predicted as “high risk” (predicted probability > 50%), 80 actually churned. - Among 300 customers it predicted as “low risk” (predicted probability ≤ 50%), 20 actually churned.

Model B predictions: - Among 250 customers it predicted as “high risk,” 75 actually churned. - Among 250 customers it predicted as “low risk,” 25 actually churned.

For each model, construct the confusion matrix with columns: Predicted Positive, Predicted Negative; and rows: Actual Positive, Actual Negative.
Calculate Precision, Recall, and F1-score for each model.
The Gini coefficient is related to AUC by: Gini = 2 × AUC − 1. Without calculating AUC directly, which model do you expect to have a higher AUC based on its confusion matrix? Explain your reasoning.
Calibration refers to whether a model’s predicted probabilities are trustworthy. A perfectly calibrated model that predicts 30% churn probability for a group should see exactly 30% of that group churn. Describe a simple visual tool — the calibration plot (also called reliability diagram) — and explain how you would construct one using the held-out test data.
Management wants to identify and offer a discount to the 100 customers most likely to churn. Discuss which is more important in this scenario: precision, recall, or AUC. Justify your answer from a business perspective.
The Hosmer-Lemeshow test is a formal statistical test for calibration. It groups predicted probabilities into 10 equal-width bins and compares expected vs. observed event rates within each bin. A p-value > 0.05 suggests the model is well-calibrated. What would a p-value of 0.003 tell you, and what should you do about it?

Exercise 13.10: End-to-End Credit Risk Prediction Pipeline

This capstone exercise asks you to build a complete, production-ready logistic regression pipeline for credit risk. Work through all steps using R or Python.

Scenario: You are a data scientist at a Nigerian commercial bank. The risk management team has provided a dataset of 8,000 personal loan applications from the past 3 years. Each row contains customer demographics, financial history, loan characteristics, and a label: whether the loan defaulted within 12 months of disbursement.

Step 1: Data Preparation

Load and inspect the dataset (or simulate it with the code below). Check: how many observations? How many features? What is the default rate? Are there any missing values?

set.seed(7) 
n <- 8000
age <- round(rnorm(n, 38, 10)); age[age < 20] <- 20; age[age > 70] <- 70
income_k <- round(rlnorm(n, 5.2, 0.7))
loan_amount_k <- round(rlnorm(n, 5.5, 0.6))
tenure_months <- sample(c(12,24,36,48,60), n, replace=TRUE)
credit_score <- round(rnorm(n, 620, 80)); credit_score[credit_score<300] <- 300
employment <- sample(c("Salaried","Self-employed","Unemployed"), n, TRUE, c(0.55,0.35,0.10))
num_dependants <- rpois(n, 2)
loan_purpose <- sample(c("Business","Education","Personal","Housing"), n, TRUE, c(0.3,0.2,0.35,0.15))

lp <- -4 + 0.02*age - 0.0002*income_k + 0.0003*loan_amount_k - 0.005*credit_score +
      ifelse(employment=="Unemployed",1.2,ifelse(employment=="Self-employed",0.3,0)) +
      0.1*num_dependants
prob_def <- 1/(1+exp(-lp))
default <- rbinom(n,1,prob_def)

loan_data <- data.frame(age,income_k,loan_amount_k,tenure_months,credit_score,
                         employment,num_dependants,loan_purpose,default)

Calculate the default rate overall and separately by employment type. Are defaults evenly distributed across employment categories?
Create a correlation matrix for the numeric predictors. Are any pairs of variables strongly correlated (|r| > 0.7)?

Step 2: Feature Engineering

Create two new features:

loan_to_income = loan_amount_k ÷ income_k (the loan-to-income ratio)
income_per_dependent = income_k ÷ (num_dependants + 1)

Using the WoE approach from Exercise 13.7, calculate WoE values for employment and loan_purpose. Create WoE-encoded versions of both variables.

Step 3: Model Fitting

Split the data 70/30 into training and test sets. Fit a logistic regression with all numeric predictors plus the WoE-encoded categorical variables.
Print the model summary. Which variables are statistically significant at the 5% level? Do the signs of the coefficients match your business intuition?

Step 4: Model Evaluation

On the test set, compute: Accuracy, Precision, Recall, F1, and AUC-ROC. Present these in a clear table.
Plot the ROC curve and shade the area under it. At what probability threshold does the model balance precision and recall most evenly (i.e., where is the F1 score maximised)?
Compute the Gini coefficient = 2 × AUC − 1. Is this model strong enough to deploy in a real bank (use the industry benchmarks: Gini > 0.30 is acceptable for retail credit)?

Step 5: Communication

Write a one-page Model Development Report (300–400 words) for the Chief Risk Officer of the bank. Your report should:

Describe the dataset and modelling approach in plain language (no equations)
State the model’s performance using metrics a non-statistician can understand
Identify the three most important risk factors
Recommend whether the model is ready for deployment, with conditions or caveats

18.10 Further Reading

Hosmer, D. W., & Lemeshow, S. (2000). Applied Logistic Regression. 2nd ed. John Wiley & Sons.
- Comprehensive reference for logistic regression theory and practice.
Collett, D. (2015). Modelling Binary Data. 2nd ed. CRC Press.
- Detailed treatment of binary regression and diagnostics.
Naeem Siddiqi. (2006). Credit Risk Scorecards. John Wiley & Sons.
- Practical guide to building credit scoring models using WOE and IV.
King, G., & Zeng, L. (2001). Logistic Regression in Rare Events Data. Political Analysis, 9(2), 137–163.
- Essential reading for credit risk and fraud detection where defaults are rare.
Steyerberg, E. W. (2019). Clinical Prediction Models. 2nd ed. Springer.
- Excellent treatment of model development, calibration, and external validation.

	penalty penalty: {'l1', 'l2', 'elasticnet', None}, default='l2' Specify the norm of the penalty: - `None`: no penalty is added; - `'l2'`: add a L2 penalty term and it is the default choice; - `'l1'`: add a L1 penalty term; - `'elasticnet'`: both L1 and L2 penalty terms are added. .. warning:: Some penalties may not work with some solvers. See the parameter `solver` below, to know the compatibility between the penalty and solver. .. versionadded:: 0.19 l1 penalty with SAGA solver (allowing 'multinomial' + L1) .. deprecated:: 1.8 `penalty` was deprecated in version 1.8 and will be removed in 1.10. Use `l1_ratio` instead. `l1_ratio=0` for `penalty='l2'`, `l1_ratio=1` for `penalty='l1'` and `l1_ratio` set to any float between 0 and 1 for `'penalty='elasticnet'`.	'deprecated'
	C C: float, default=1.0 Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization. `C=np.inf` results in unpenalized logistic regression. For a visual example on the effect of tuning the `C` parameter with an L1 penalty, see: :ref:`sphx_glr_auto_examples_linear_model_plot_logistic_path.py`.	1.0
	l1_ratio l1_ratio: float, default=0.0 The Elastic-Net mixing parameter, with `0 <= l1_ratio <= 1`. Setting `l1_ratio=1` gives a pure L1-penalty, setting `l1_ratio=0` a pure L2-penalty. Any value between 0 and 1 gives an Elastic-Net penalty of the form `l1_ratio * L1 + (1 - l1_ratio) * L2`. .. warning:: Certain values of `l1_ratio`, i.e. some penalties, may not work with some solvers. See the parameter `solver` below, to know the compatibility between the penalty and solver. .. versionchanged:: 1.8 Default value changed from None to 0.0. .. deprecated:: 1.8 `None` is deprecated and will be removed in version 1.10. Always use `l1_ratio` to specify the penalty type.	0.0
	dual dual: bool, default=False Dual (constrained) or primal (regularized, see also :ref:`this equation `) formulation. Dual formulation is only implemented for l2 penalty with liblinear solver. Prefer `dual=False` when n_samples > n_features.	False
	tol tol: float, default=1e-4 Tolerance for stopping criteria.	0.0001
	fit_intercept fit_intercept: bool, default=True Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.	True
	intercept_scaling intercept_scaling: float, default=1 Useful only when the solver `liblinear` is used and `self.fit_intercept` is set to `True`. In this case, `x` becomes `[x, self.intercept_scaling]`, i.e. a "synthetic" feature with constant value equal to `intercept_scaling` is appended to the instance vector. The intercept becomes ``intercept_scaling * synthetic_feature_weight``. .. note:: The synthetic feature weight is subject to L1 or L2 regularization as all other features. To lessen the effect of regularization on synthetic feature weight (and therefore on the intercept) `intercept_scaling` has to be increased.	1
	class_weight class_weight: dict or 'balanced', default=None Weights associated with classes in the form ``{class_label: weight}``. If not given, all classes are supposed to have weight one. The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as ``n_samples / (n_classes * np.bincount(y))``. Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified. .. versionadded:: 0.17 class_weight='balanced'	None
	random_state random_state: int, RandomState instance, default=None Used when ``solver`` == 'sag', 'saga' or 'liblinear' to shuffle the data. See :term:`Glossary ` for details.	None
	solver solver: {'lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga'}, default='lbfgs' Algorithm to use in the optimization problem. Default is 'lbfgs'. To choose a solver, you might want to consider the following aspects: - 'lbfgs' is a good default solver because it works reasonably well for a wide class of problems. - For :term:`multiclass` problems (`n_classes >= 3`), all solvers except 'liblinear' minimize the full multinomial loss, 'liblinear' will raise an error. - 'newton-cholesky' is a good choice for `n_samples` >> `n_features * n_classes`, especially with one-hot encoded categorical features with rare categories. Be aware that the memory usage of this solver has a quadratic dependency on `n_features * n_classes` because it explicitly computes the full Hessian matrix. - For small datasets, 'liblinear' is a good choice, whereas 'sag' and 'saga' are faster for large ones; - 'liblinear' can only handle binary classification by default. To apply a one-versus-rest scheme for the multiclass setting one can wrap it with the :class:`~sklearn.multiclass.OneVsRestClassifier`. .. warning:: The choice of the algorithm depends on the penalty chosen (`l1_ratio=0` for L2-penalty, `l1_ratio=1` for L1-penalty and `0 < l1_ratio < 1` for Elastic-Net) and on (multinomial) multiclass support: ================= ======================== ====================== solver l1_ratio multinomial multiclass ================= ======================== ====================== 'lbfgs' l1_ratio=0 yes 'liblinear' l1_ratio=1 or l1_ratio=0 no 'newton-cg' l1_ratio=0 yes 'newton-cholesky' l1_ratio=0 yes 'sag' l1_ratio=0 yes 'saga' 0<=l1_ratio<=1 yes ================= ======================== ====================== .. note:: 'sag' and 'saga' fast convergence is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from :mod:`sklearn.preprocessing`. .. seealso:: Refer to the :ref:`User Guide ` for more information regarding :class:`LogisticRegression` and more specifically the :ref:`Table ` summarizing solver/penalty supports. .. versionadded:: 0.17 Stochastic Average Gradient (SAG) descent solver. Multinomial support in version 0.18. .. versionadded:: 0.19 SAGA solver. .. versionchanged:: 0.22 The default solver changed from 'liblinear' to 'lbfgs' in 0.22. .. versionadded:: 1.2 newton-cholesky solver. Multinomial support in version 1.6.	'lbfgs'
	max_iter max_iter: int, default=100 Maximum number of iterations taken for the solvers to converge.	100
	verbose verbose: int, default=0 For the liblinear and lbfgs solvers set verbose to any positive number for verbosity.	0
	warm_start warm_start: bool, default=False When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution. Useless for liblinear solver. See :term:`the Glossary `. .. versionadded:: 0.17 warm_start to support lbfgs, newton-cg, sag, saga solvers.	False
	n_jobs n_jobs: int, default=None Does not have any effect. .. deprecated:: 1.8 `n_jobs` is deprecated in version 1.8 and will be removed in 1.10.	None

	penalty penalty: {'l1', 'l2', 'elasticnet', None}, default='l2' Specify the norm of the penalty: - `None`: no penalty is added; - `'l2'`: add a L2 penalty term and it is the default choice; - `'l1'`: add a L1 penalty term; - `'elasticnet'`: both L1 and L2 penalty terms are added. .. warning:: Some penalties may not work with some solvers. See the parameter `solver` below, to know the compatibility between the penalty and solver. .. versionadded:: 0.19 l1 penalty with SAGA solver (allowing 'multinomial' + L1) .. deprecated:: 1.8 `penalty` was deprecated in version 1.8 and will be removed in 1.10. Use `l1_ratio` instead. `l1_ratio=0` for `penalty='l2'`, `l1_ratio=1` for `penalty='l1'` and `l1_ratio` set to any float between 0 and 1 for `'penalty='elasticnet'`.	'l2'
	C C: float, default=1.0 Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization. `C=np.inf` results in unpenalized logistic regression. For a visual example on the effect of tuning the `C` parameter with an L1 penalty, see: :ref:`sphx_glr_auto_examples_linear_model_plot_logistic_path.py`.	1.0
	l1_ratio l1_ratio: float, default=0.0 The Elastic-Net mixing parameter, with `0 <= l1_ratio <= 1`. Setting `l1_ratio=1` gives a pure L1-penalty, setting `l1_ratio=0` a pure L2-penalty. Any value between 0 and 1 gives an Elastic-Net penalty of the form `l1_ratio * L1 + (1 - l1_ratio) * L2`. .. warning:: Certain values of `l1_ratio`, i.e. some penalties, may not work with some solvers. See the parameter `solver` below, to know the compatibility between the penalty and solver. .. versionchanged:: 1.8 Default value changed from None to 0.0. .. deprecated:: 1.8 `None` is deprecated and will be removed in version 1.10. Always use `l1_ratio` to specify the penalty type.	0.0
	dual dual: bool, default=False Dual (constrained) or primal (regularized, see also :ref:`this equation `) formulation. Dual formulation is only implemented for l2 penalty with liblinear solver. Prefer `dual=False` when n_samples > n_features.	False
	tol tol: float, default=1e-4 Tolerance for stopping criteria.	0.0001
	fit_intercept fit_intercept: bool, default=True Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.	True
	intercept_scaling intercept_scaling: float, default=1 Useful only when the solver `liblinear` is used and `self.fit_intercept` is set to `True`. In this case, `x` becomes `[x, self.intercept_scaling]`, i.e. a "synthetic" feature with constant value equal to `intercept_scaling` is appended to the instance vector. The intercept becomes ``intercept_scaling * synthetic_feature_weight``. .. note:: The synthetic feature weight is subject to L1 or L2 regularization as all other features. To lessen the effect of regularization on synthetic feature weight (and therefore on the intercept) `intercept_scaling` has to be increased.	1
	class_weight class_weight: dict or 'balanced', default=None Weights associated with classes in the form ``{class_label: weight}``. If not given, all classes are supposed to have weight one. The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as ``n_samples / (n_classes * np.bincount(y))``. Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified. .. versionadded:: 0.17 class_weight='balanced'	None
	random_state random_state: int, RandomState instance, default=None Used when ``solver`` == 'sag', 'saga' or 'liblinear' to shuffle the data. See :term:`Glossary ` for details.	None
	solver solver: {'lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga'}, default='lbfgs' Algorithm to use in the optimization problem. Default is 'lbfgs'. To choose a solver, you might want to consider the following aspects: - 'lbfgs' is a good default solver because it works reasonably well for a wide class of problems. - For :term:`multiclass` problems (`n_classes >= 3`), all solvers except 'liblinear' minimize the full multinomial loss, 'liblinear' will raise an error. - 'newton-cholesky' is a good choice for `n_samples` >> `n_features * n_classes`, especially with one-hot encoded categorical features with rare categories. Be aware that the memory usage of this solver has a quadratic dependency on `n_features * n_classes` because it explicitly computes the full Hessian matrix. - For small datasets, 'liblinear' is a good choice, whereas 'sag' and 'saga' are faster for large ones; - 'liblinear' can only handle binary classification by default. To apply a one-versus-rest scheme for the multiclass setting one can wrap it with the :class:`~sklearn.multiclass.OneVsRestClassifier`. .. warning:: The choice of the algorithm depends on the penalty chosen (`l1_ratio=0` for L2-penalty, `l1_ratio=1` for L1-penalty and `0 < l1_ratio < 1` for Elastic-Net) and on (multinomial) multiclass support: ================= ======================== ====================== solver l1_ratio multinomial multiclass ================= ======================== ====================== 'lbfgs' l1_ratio=0 yes 'liblinear' l1_ratio=1 or l1_ratio=0 no 'newton-cg' l1_ratio=0 yes 'newton-cholesky' l1_ratio=0 yes 'sag' l1_ratio=0 yes 'saga' 0<=l1_ratio<=1 yes ================= ======================== ====================== .. note:: 'sag' and 'saga' fast convergence is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from :mod:`sklearn.preprocessing`. .. seealso:: Refer to the :ref:`User Guide ` for more information regarding :class:`LogisticRegression` and more specifically the :ref:`Table ` summarizing solver/penalty supports. .. versionadded:: 0.17 Stochastic Average Gradient (SAG) descent solver. Multinomial support in version 0.18. .. versionadded:: 0.19 SAGA solver. .. versionchanged:: 0.22 The default solver changed from 'liblinear' to 'lbfgs' in 0.22. .. versionadded:: 1.2 newton-cholesky solver. Multinomial support in version 1.6.	'lbfgs'
	max_iter max_iter: int, default=100 Maximum number of iterations taken for the solvers to converge.	1000
	verbose verbose: int, default=0 For the liblinear and lbfgs solvers set verbose to any positive number for verbosity.	0
	warm_start warm_start: bool, default=False When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution. Useless for liblinear solver. See :term:`the Glossary `. .. versionadded:: 0.17 warm_start to support lbfgs, newton-cg, sag, saga solvers.	False
	n_jobs n_jobs: int, default=None Does not have any effect. .. deprecated:: 1.8 `n_jobs` is deprecated in version 1.8 and will be removed in 1.10.	None

	penalty penalty: {'l1', 'l2', 'elasticnet', None}, default='l2' Specify the norm of the penalty: - `None`: no penalty is added; - `'l2'`: add a L2 penalty term and it is the default choice; - `'l1'`: add a L1 penalty term; - `'elasticnet'`: both L1 and L2 penalty terms are added. .. warning:: Some penalties may not work with some solvers. See the parameter `solver` below, to know the compatibility between the penalty and solver. .. versionadded:: 0.19 l1 penalty with SAGA solver (allowing 'multinomial' + L1) .. deprecated:: 1.8 `penalty` was deprecated in version 1.8 and will be removed in 1.10. Use `l1_ratio` instead. `l1_ratio=0` for `penalty='l2'`, `l1_ratio=1` for `penalty='l1'` and `l1_ratio` set to any float between 0 and 1 for `'penalty='elasticnet'`.	'l1'
	C C: float, default=1.0 Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization. `C=np.inf` results in unpenalized logistic regression. For a visual example on the effect of tuning the `C` parameter with an L1 penalty, see: :ref:`sphx_glr_auto_examples_linear_model_plot_logistic_path.py`.	1.0
	l1_ratio l1_ratio: float, default=0.0 The Elastic-Net mixing parameter, with `0 <= l1_ratio <= 1`. Setting `l1_ratio=1` gives a pure L1-penalty, setting `l1_ratio=0` a pure L2-penalty. Any value between 0 and 1 gives an Elastic-Net penalty of the form `l1_ratio * L1 + (1 - l1_ratio) * L2`. .. warning:: Certain values of `l1_ratio`, i.e. some penalties, may not work with some solvers. See the parameter `solver` below, to know the compatibility between the penalty and solver. .. versionchanged:: 1.8 Default value changed from None to 0.0. .. deprecated:: 1.8 `None` is deprecated and will be removed in version 1.10. Always use `l1_ratio` to specify the penalty type.	0.0
	dual dual: bool, default=False Dual (constrained) or primal (regularized, see also :ref:`this equation `) formulation. Dual formulation is only implemented for l2 penalty with liblinear solver. Prefer `dual=False` when n_samples > n_features.	False
	tol tol: float, default=1e-4 Tolerance for stopping criteria.	0.0001
	fit_intercept fit_intercept: bool, default=True Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.	True
	intercept_scaling intercept_scaling: float, default=1 Useful only when the solver `liblinear` is used and `self.fit_intercept` is set to `True`. In this case, `x` becomes `[x, self.intercept_scaling]`, i.e. a "synthetic" feature with constant value equal to `intercept_scaling` is appended to the instance vector. The intercept becomes ``intercept_scaling * synthetic_feature_weight``. .. note:: The synthetic feature weight is subject to L1 or L2 regularization as all other features. To lessen the effect of regularization on synthetic feature weight (and therefore on the intercept) `intercept_scaling` has to be increased.	1
	class_weight class_weight: dict or 'balanced', default=None Weights associated with classes in the form ``{class_label: weight}``. If not given, all classes are supposed to have weight one. The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as ``n_samples / (n_classes * np.bincount(y))``. Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified. .. versionadded:: 0.17 class_weight='balanced'	None
	random_state random_state: int, RandomState instance, default=None Used when ``solver`` == 'sag', 'saga' or 'liblinear' to shuffle the data. See :term:`Glossary ` for details.	None
	solver solver: {'lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga'}, default='lbfgs' Algorithm to use in the optimization problem. Default is 'lbfgs'. To choose a solver, you might want to consider the following aspects: - 'lbfgs' is a good default solver because it works reasonably well for a wide class of problems. - For :term:`multiclass` problems (`n_classes >= 3`), all solvers except 'liblinear' minimize the full multinomial loss, 'liblinear' will raise an error. - 'newton-cholesky' is a good choice for `n_samples` >> `n_features * n_classes`, especially with one-hot encoded categorical features with rare categories. Be aware that the memory usage of this solver has a quadratic dependency on `n_features * n_classes` because it explicitly computes the full Hessian matrix. - For small datasets, 'liblinear' is a good choice, whereas 'sag' and 'saga' are faster for large ones; - 'liblinear' can only handle binary classification by default. To apply a one-versus-rest scheme for the multiclass setting one can wrap it with the :class:`~sklearn.multiclass.OneVsRestClassifier`. .. warning:: The choice of the algorithm depends on the penalty chosen (`l1_ratio=0` for L2-penalty, `l1_ratio=1` for L1-penalty and `0 < l1_ratio < 1` for Elastic-Net) and on (multinomial) multiclass support: ================= ======================== ====================== solver l1_ratio multinomial multiclass ================= ======================== ====================== 'lbfgs' l1_ratio=0 yes 'liblinear' l1_ratio=1 or l1_ratio=0 no 'newton-cg' l1_ratio=0 yes 'newton-cholesky' l1_ratio=0 yes 'sag' l1_ratio=0 yes 'saga' 0<=l1_ratio<=1 yes ================= ======================== ====================== .. note:: 'sag' and 'saga' fast convergence is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from :mod:`sklearn.preprocessing`. .. seealso:: Refer to the :ref:`User Guide ` for more information regarding :class:`LogisticRegression` and more specifically the :ref:`Table ` summarizing solver/penalty supports. .. versionadded:: 0.17 Stochastic Average Gradient (SAG) descent solver. Multinomial support in version 0.18. .. versionadded:: 0.19 SAGA solver. .. versionchanged:: 0.22 The default solver changed from 'liblinear' to 'lbfgs' in 0.22. .. versionadded:: 1.2 newton-cholesky solver. Multinomial support in version 1.6.	'saga'
	max_iter max_iter: int, default=100 Maximum number of iterations taken for the solvers to converge.	1000
	verbose verbose: int, default=0 For the liblinear and lbfgs solvers set verbose to any positive number for verbosity.	0
	warm_start warm_start: bool, default=False When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution. Useless for liblinear solver. See :term:`the Glossary `. .. versionadded:: 0.17 warm_start to support lbfgs, newton-cg, sag, saga solvers.	False
	n_jobs n_jobs: int, default=None Does not have any effect. .. deprecated:: 1.8 `n_jobs` is deprecated in version 1.8 and will be removed in 1.10.	None

	loss loss: {'hinge', 'log_loss', 'modified_huber', 'squared_hinge', 'perceptron', 'squared_error', 'huber', 'epsilon_insensitive', 'squared_epsilon_insensitive'}, default='hinge' The loss function to be used. - 'hinge' gives a linear SVM. - 'log_loss' gives logistic regression, a probabilistic classifier. - 'modified_huber' is another smooth loss that brings tolerance to outliers as well as probability estimates. - 'squared_hinge' is like hinge but is quadratically penalized. - 'perceptron' is the linear loss used by the perceptron algorithm. - The other losses, 'squared_error', 'huber', 'epsilon_insensitive' and 'squared_epsilon_insensitive' are designed for regression but can be useful in classification as well; see :class:`~sklearn.linear_model.SGDRegressor` for a description. More details about the losses formulas can be found in the :ref:`User Guide ` and you can find a visualisation of the loss functions in :ref:`sphx_glr_auto_examples_linear_model_plot_sgd_loss_functions.py`.	'log_loss'
	penalty penalty: {'l2', 'l1', 'elasticnet', None}, default='l2' The penalty (aka regularization term) to be used. Defaults to 'l2' which is the standard regularizer for linear SVM models. 'l1' and 'elasticnet' might bring sparsity to the model (feature selection) not achievable with 'l2'. No penalty is added when set to `None`. You can see a visualisation of the penalties in :ref:`sphx_glr_auto_examples_linear_model_plot_sgd_penalties.py`.	'elasticnet'
	alpha alpha: float, default=0.0001 Constant that multiplies the regularization term. The higher the value, the stronger the regularization. Also used to compute the learning rate when `learning_rate` is set to 'optimal'. Values must be in the range `[0.0, inf)`.	0.0001
	l1_ratio l1_ratio: float, default=0.15 The Elastic Net mixing parameter, with 0 <= l1_ratio <= 1. l1_ratio=0 corresponds to L2 penalty, l1_ratio=1 to L1. Only used if `penalty` is 'elasticnet'. Values must be in the range `[0.0, 1.0]` or can be `None` if `penalty` is not `elasticnet`. .. versionchanged:: 1.7 `l1_ratio` can be `None` when `penalty` is not "elasticnet".	0.5
	fit_intercept fit_intercept: bool, default=True Whether the intercept should be estimated or not. If False, the data is assumed to be already centered.	True
	max_iter max_iter: int, default=1000 The maximum number of passes over the training data (aka epochs). It only impacts the behavior in the ``fit`` method, and not the :meth:`partial_fit` method. Values must be in the range `[1, inf)`. .. versionadded:: 0.19	1000
	tol tol: float or None, default=1e-3 The stopping criterion. If it is not None, training will stop when (loss > best_loss - tol) for ``n_iter_no_change`` consecutive epochs. Convergence is checked against the training loss or the validation loss depending on the `early_stopping` parameter. Values must be in the range `[0.0, inf)`. .. versionadded:: 0.19	0.001
	shuffle shuffle: bool, default=True Whether or not the training data should be shuffled after each epoch.	True
	verbose verbose: int, default=0 The verbosity level. Values must be in the range `[0, inf)`.	0
	epsilon epsilon: float, default=0.1 Epsilon in the epsilon-insensitive loss functions; only if `loss` is 'huber', 'epsilon_insensitive', or 'squared_epsilon_insensitive'. For 'huber', determines the threshold at which it becomes less important to get the prediction exactly right. For epsilon-insensitive, any differences between the current prediction and the correct label are ignored if they are less than this threshold. Values must be in the range `[0.0, inf)`.	0.1
	n_jobs n_jobs: int, default=None The number of CPUs to use to do the OVA (One Versus All, for multi-class problems) computation. ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary ` for more details.	None
	random_state random_state: int, RandomState instance, default=None Used for shuffling the data, when ``shuffle`` is set to ``True``. Pass an int for reproducible output across multiple function calls. See :term:`Glossary `. Integer values must be in the range `[0, 2**32 - 1]`.	42
	learning_rate learning_rate: str, default='optimal' The learning rate schedule: - 'constant': `eta = eta0` - 'optimal': `eta = 1.0 / (alpha * (t + t0))` where `t0` is chosen by a heuristic proposed by Leon Bottou. - 'invscaling': `eta = eta0 / pow(t, power_t)` - 'adaptive': `eta = eta0`, as long as the training keeps decreasing. Each time n_iter_no_change consecutive epochs fail to decrease the training loss by tol or fail to increase validation score by tol if `early_stopping` is `True`, the current learning rate is divided by 5. - 'pa1': passive-aggressive algorithm 1, see [1]_. Only with `loss='hinge'`. Update is `w += eta y x` with `eta = min(eta0, loss/\|\|x\|\|2)`. - 'pa2': passive-aggressive algorithm 2, see [1]_. Only with `loss='hinge'`. Update is `w += eta y x` with `eta = hinge_loss / (\|\|x\|\|2 + 1/(2 eta0))`. .. versionadded:: 0.20 Added 'adaptive' option. .. versionadded:: 1.8 Added options 'pa1' and 'pa2'	'optimal'
	eta0 eta0: float, default=0.01 The initial learning rate for the 'constant', 'invscaling' or 'adaptive' schedules. The default value is 0.01, but note that eta0 is not used by the default learning rate 'optimal'. Values must be in the range `(0.0, inf)`. For PA-1 (`learning_rate=pa1`) and PA-II (`pa2`), it specifies the aggressiveness parameter for the passive-agressive algorithm, see [1] where it is called C: - For PA-I it is the maximum step size. - For PA-II it regularizes the step size (the smaller `eta0` the more it regularizes). As a general rule-of-thumb for PA, `eta0` should be small when the data is noisy.	0.01
	power_t power_t: float, default=0.5 The exponent for inverse scaling learning rate. Values must be in the range `[0.0, inf)`. .. deprecated:: 1.8 Negative values for `power_t` are deprecated in version 1.8 and will raise an error in 1.10. Use values in the range [0.0, inf) instead.	0.5
	early_stopping early_stopping: bool, default=False Whether to use early stopping to terminate training when validation score is not improving. If set to `True`, it will automatically set aside a stratified fraction of training data as validation and terminate training when validation score returned by the `score` method is not improving by at least tol for n_iter_no_change consecutive epochs. See :ref:`sphx_glr_auto_examples_linear_model_plot_sgd_early_stopping.py` for an example of the effects of early stopping. .. versionadded:: 0.20 Added 'early_stopping' option	False
	validation_fraction validation_fraction: float, default=0.1 The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if `early_stopping` is True. Values must be in the range `(0.0, 1.0)`. .. versionadded:: 0.20 Added 'validation_fraction' option	0.1
	n_iter_no_change n_iter_no_change: int, default=5 Number of iterations with no improvement to wait before stopping fitting. Convergence is checked against the training loss or the validation loss depending on the `early_stopping` parameter. Integer values must be in the range `[1, max_iter)`. .. versionadded:: 0.20 Added 'n_iter_no_change' option	5
	class_weight class_weight: dict, {class_label: weight} or "balanced", default=None Preset for the class_weight fit parameter. Weights associated with classes. If not given, all classes are supposed to have weight one. The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as ``n_samples / (n_classes * np.bincount(y))``.	None
	warm_start warm_start: bool, default=False When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution. See :term:`the Glossary `. Repeatedly calling fit or partial_fit when warm_start is True can result in a different solution than when calling fit a single time because of the way the data is shuffled. If a dynamic learning rate is used, the learning rate is adapted depending on the number of samples already seen. Calling ``fit`` resets this counter, while ``partial_fit`` will result in increasing the existing counter.	False
	average average: bool or int, default=False When set to `True`, computes the averaged SGD weights across all updates and stores the result in the ``coef_`` attribute. If set to an int greater than 1, averaging will begin once the total number of samples seen reaches `average`. So ``average=10`` will begin averaging after seeing 10 samples. Integer values must be in the range `[1, n_samples]`.	False