61  Bayesian Thinking in Business Analytics

Note📋 Learning Objectives

By the end of this chapter, you will: - Understand probability as degree of belief (Bayesian) vs. long-run frequency (Frequentist) - Apply Bayes’ theorem to sequential decision-making with data updates - Use conjugate priors for efficient computation (Beta-Binomial, Normal-Normal, Gamma-Poisson) - Build Bayesian regression models with credible intervals (not confidence intervals) - Apply probabilistic programming (Stan, PyMC) for complex models - Update beliefs sequentially as new data arrives - Make credible inferences about parameters, not just reject/fail-to-reject null hypotheses

61.1 What Is Bayesian Statistics? Learning from Experience

Suppose you are a loan officer at a microfinance institution in Kano. A new applicant, Mr Musa, sits across from you. You have no transaction history for him — he is a first-time borrower. But you are not completely in the dark. You know that, historically, traders in his market segment have a 15% default rate. You also notice that he has a well-organised account book, offers collateral, and his references check out. How should you use this background knowledge together with the new information in front of you?

This is Bayesian reasoning in everyday life. You start with a prior belief based on existing knowledge (“traders in this category default 15% of the time”). You observe new evidence (his behaviour, records, and references). You update your belief to produce a posterior — a revised estimate of his default risk that combines both the background knowledge and the new evidence. If his individual signals are strong, your posterior belief shifts toward “this person is a better-than-average risk.” If his signals are weak or mixed, your posterior stays close to the 15% prior.

Bayesian statistics formalises this intuitive process into mathematics. It is not a radically different kind of statistics from what you have learned so far — it is the same probability theory applied in a different way. The key difference is philosophical: Bayesian statistics treats uncertainty about unknown quantities as probability distributions that can be updated as new data arrives. This means you can answer questions like “Given everything I know, what is the probability that this borrower defaults?” rather than “If I repeat this experiment infinitely many times, how often would my test reject the null hypothesis?”

This distinction matters enormously for business. In the real world, you do not run an infinite number of experiments. You make one decision per customer, per product launch, per marketing campaign. You want to incorporate everything you already know (prior knowledge) and combine it with the specific data in front of you (likelihood) to arrive at the best possible belief about the unknown quantity (posterior). This is exactly what Bayesian statistics does.

61.1.1 Two Ways to Think About Probability

There are two ways to interpret probability, and understanding their difference is the key to understanding Bayesian statistics.

Frequentist probability (what you learned in Chapter 6) says: probability is the long-run frequency of an event if you repeated an experiment many times. The probability of a coin landing heads is 0.5 because, if you flipped it a million times, half would be heads. This interpretation works well for repeatable experiments. But it becomes awkward when the event is a one-time decision: “What is the probability that this specific merger succeeds?” — you cannot repeat the merger a million times to find out.

Bayesian probability is broader: probability is a measure of degree of belief about an uncertain quantity. You can have a probability distribution over “will this merger succeed?” based on your knowledge and analysis, even though the event happens only once. As you gather more evidence (due diligence reports, market analysis, regulatory signals), you update that probability. This is perfectly natural and reflects how thoughtful decision-makers actually reason.

In practice, for large datasets and standard problems, Bayesian and frequentist methods often give very similar answers. The Bayesian approach becomes distinctly more valuable when: (i) data is scarce and prior knowledge is important; (ii) you need to update beliefs sequentially as data arrives; (iii) you want a direct probability statement about the parameter of interest rather than a p-value. As a Nigerian bank makes credit decisions on new customers continuously, a Bayesian approach updates the default probability model daily, automatically incorporating new patterns — something frequentist batch methods cannot do efficiently.

Philosophically, Bayesian thinking reflects how humans actually learn: start with a prior belief (a borrower’s market segment has historically had a 10% default rate), observe new data (this particular borrower’s payment behaviour and collateral), update your belief (perhaps their default risk is now assessed at 7%), and make a decision based on the posterior.

61.2 Bayes’ Theorem from First Principles

Bayes’ theorem is derived from the definition of conditional probability. If \(A\) and \(B\) are events:

\[ P(A | B) = \frac{P(B | A) P(A)}{P(B)} \]

where: - \(P(A | B)\) = posterior: probability of \(A\) given we observed \(B\) - \(P(B | A)\) = likelihood: probability of observing \(B\) if \(A\) is true - \(P(A)\) = prior: our belief in \(A\) before seeing data - \(P(B)\) = marginal likelihood: probability of observing \(B\) under all possible values of \(A\)

The medical test example: A disease has prevalence 1% (1 in 100 people). A test is 95% sensitive (detects 95% of diseased) and 90% specific (correctly identifies 90% of healthy). You test positive. What is the probability you have the disease?

Using Bayes: \[ P(\text{Disease} | \text{Positive}) = \frac{P(\text{Positive} | \text{Disease}) \times P(\text{Disease})}{P(\text{Positive})} \]

\[ = \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.10 \times 0.99} = \frac{0.0095}{0.0095 + 0.099} \approx 0.087 \]

So despite the positive test, there’s only an 8.7% chance you have the disease! This is the “base rate fallacy”—the low prior probability dominates. This intuition is crucial for business: a model predicting a rare event (fraud, equipment failure) must have very high precision to be useful.

Note📘 Theory: Bayes’ Theorem for Continuous Parameters

For a continuous parameter \(\theta\) and observed data \(y\):

\[ p(\theta | y) = \frac{p(y | \theta) p(\theta)}{p(y)} \]

where \(p(\theta)\) is the prior, \(p(y|\theta)\) is the likelihood, and \(p(\theta|y)\) is the posterior. The denominator is the marginal likelihood:

\[ p(y) = \int p(y | \theta) p(\theta) d\theta \]

Often intractable, but proportionality suffices: \(p(\theta | y) \propto p(y | \theta) p(\theta)\).

Tip🔑 Key Formula

\[\text{Posterior} \propto \text{Likelihood} \times \text{Prior}\]

Show code
library(tidyverse)

# Example: Bayesian updating of default probability for Nigerian microfinance borrowers
# Prior: historical default rate in the zone is 15%
prior_prob_default <- 0.15

# Test: credit bureau data shows positive history
# Likelihood: P(positive history | default) = 0.3 (30% of defaulters have spotty history)
#           P(positive history | no default) = 0.9 (90% of non-defaulters have clean history)
likelihood_given_default <- 0.3
likelihood_given_no_default <- 0.9

# Marginal likelihood
marginal_likelihood <- (likelihood_given_default * prior_prob_default) +
                       (likelihood_given_no_default * (1 - prior_prob_default))

# Posterior
posterior_prob_default <- (likelihood_given_default * prior_prob_default) / marginal_likelihood

cat("\n=== Bayesian Update: Borrower Default Risk ===\n")
#> 
#> === Bayesian Update: Borrower Default Risk ===
cat("Prior P(default):", round(prior_prob_default, 3), "\n")
#> Prior P(default): 0.15
cat("Likelihood P(clean history | no default):", round(likelihood_given_no_default, 3), "\n")
#> Likelihood P(clean history | no default): 0.9
cat("Posterior P(default | clean history):", round(posterior_prob_default, 3), "\n")
#> Posterior P(default | clean history): 0.056
cat("Risk reduction:", round((prior_prob_default - posterior_prob_default) / prior_prob_default * 100, 1), "%\n")
#> Risk reduction: 63 %

# Sequential update: borrower has made 12 consecutive on-time payments
# Each payment updates the belief
n_payments <- 12
prob_payment_given_default <- 0.3  # Defaulters sometimes make sporadic payments
prob_payment_given_no_default <- 0.98  # Non-defaulters almost always pay on time

current_posterior <- posterior_prob_default

cat("\n=== Sequential Updates ===\n")
#> 
#> === Sequential Updates ===
for (payment in 1:n_payments) {
  # Bayes update for this payment
  likelihood <- (prob_payment_given_default * current_posterior) +
                (prob_payment_given_no_default * (1 - current_posterior))
  current_posterior <- (prob_payment_given_default * current_posterior) / likelihood

  if (payment %in% c(1, 3, 6, 12)) {
    cat("After payment", payment, "- P(default) =", round(current_posterior, 4), "\n")
  }
}
#> After payment 1 - P(default) = 0.0177 
#> After payment 3 - P(default) = 0.0017 
#> After payment 6 - P(default) = 0 
#> After payment 12 - P(default) = 0

cat("\nFinal posterior after 12 payments: P(default) =", round(current_posterior, 4), "\n")
#> 
#> Final posterior after 12 payments: P(default) = 0
Show code
import numpy as np

# Bayesian update: default probability
prior = 0.15
likelihood_default = 0.3
likelihood_no_default = 0.9

marginal = likelihood_default * prior + likelihood_no_default * (1 - prior)
posterior = (likelihood_default * prior) / marginal

print("\n=== Bayesian Update: Borrower Default Risk ===")
#> 
#> === Bayesian Update: Borrower Default Risk ===
print(f"Prior P(default): {prior:.3f}")
#> Prior P(default): 0.150
print(f"Likelihood P(clean history | no default): {likelihood_no_default:.3f}")
#> Likelihood P(clean history | no default): 0.900
print(f"Posterior P(default | clean history): {posterior:.3f}")
#> Posterior P(default | clean history): 0.056
print(f"Risk reduction: {(prior - posterior) / prior * 100:.1f}%")
#> Risk reduction: 63.0%

# Sequential updates
current_posterior = posterior
prob_pay_default = 0.3
prob_pay_no_default = 0.98

print("\n=== Sequential Updates (12 Payments) ===")
#> 
#> === Sequential Updates (12 Payments) ===
for i in range(1, 13):
    likelihood = prob_pay_default * current_posterior + prob_pay_no_default * (1 - current_posterior)
    current_posterior = (prob_pay_default * current_posterior) / likelihood

    if i in [1, 3, 6, 12]:
        print(f"After payment {i}: P(default) = {current_posterior:.4f}")
#> After payment 1: P(default) = 0.0177
#> After payment 3: P(default) = 0.0017
#> After payment 6: P(default) = 0.0000
#> After payment 12: P(default) = 0.0000

print(f"\nFinal posterior: P(default) = {current_posterior:.4f}")
#> 
#> Final posterior: P(default) = 0.0000

61.3 The Prior, Likelihood, and Posterior

Prior: Your belief before seeing data. For a new loan applicant in Lagos, the prior might be the historical default rate in Lagos (say, 12%). This can be informative (based on years of data) or vague (weak information).

Likelihood: How probable is the observed data under different parameter values? For a borrower with 12 months of on-time payments, the likelihood of observing this given they will eventually default is low (0.3^12); given they will not default, it is high (0.98^12).

Posterior: Your updated belief after seeing data. The ratio of posterior to prior is driven by the likelihood ratio. As more data arrives, the posterior becomes more concentrated.

Key insight: with enough data, the posterior is dominated by the likelihood, and the prior “forgets.” With sparse data, the prior remains influential.

61.4 Conjugate Priors

Some prior-likelihood combinations have a posterior in the same family, enabling analytical updates without numerical integration.

Beta-Binomial: Prior \(\theta \sim \text{Beta}(\alpha, \beta)\), likelihood is Binomial. Posterior is \(\text{Beta}(\alpha + \text{successes}, \beta + \text{failures})\). Example: estimate the probability a customer defaults (binary outcome). Prior: \(\text{Beta}(2, 8)\) (expecting ~20% default). Observe 15 defaults, 85 no-defaults in new cohort. Posterior: \(\text{Beta}(2 + 15, 8 + 85) = \text{Beta}(17, 93)\), with posterior mean = \(17 / 110 = 0.155\).

Normal-Normal: Prior and likelihood are both Normal. Posterior is Normal with mean = weighted average of prior mean and data mean.

Gamma-Poisson: Prior \(\lambda \sim \text{Gamma}(\alpha, \beta)\), likelihood is Poisson. Posterior is \(\text{Gamma}(\alpha + \text{sum of counts}, \beta + n)\).

Show code
library(tidyverse)

# Beta-Binomial conjugacy
# Prior: Beta(2, 8) - expect ~20% default rate
prior_alpha <- 2
prior_beta <- 8

# Observed data: 25 defaults, 75 non-defaults
observed_defaults <- 25
observed_non_defaults <- 75

# Posterior
post_alpha <- prior_alpha + observed_defaults
post_beta <- prior_beta + observed_non_defaults

# Prior and posterior means
prior_mean <- prior_alpha / (prior_alpha + prior_beta)
post_mean <- post_alpha / (post_alpha + post_beta)

cat("\n=== Beta-Binomial Conjugacy ===\n")
#> 
#> === Beta-Binomial Conjugacy ===
cat("Prior Beta(α=", prior_alpha, ", β=", prior_beta, ")\n")
#> Prior Beta(α= 2 , β= 8 )
cat("Prior mean (expected default rate):", round(prior_mean, 3), "\n")
#> Prior mean (expected default rate): 0.2
cat("Observed: ", observed_defaults, " defaults out of ", observed_defaults + observed_non_defaults, "\n")
#> Observed:  25  defaults out of  100
cat("Observed default rate:", round(observed_defaults / (observed_defaults + observed_non_defaults), 3), "\n")
#> Observed default rate: 0.25
cat("Posterior Beta(α=", post_alpha, ", β=", post_beta, ")\n")
#> Posterior Beta(α= 27 , β= 83 )
cat("Posterior mean:", round(post_mean, 3), "\n")
#> Posterior mean: 0.245

# Credible interval (Bayesian confidence interval)
lower <- qbeta(0.025, post_alpha, post_beta)
upper <- qbeta(0.975, post_alpha, post_beta)

cat("\n95% Credible Interval:", round(lower, 3), "to", round(upper, 3), "\n")
#> 
#> 95% Credible Interval: 0.17 to 0.33
cat("Interpretation: Given the data, the default rate is between",
    round(lower * 100, 1), "% and", round(upper * 100, 1), "% with 95% probability.\n")
#> Interpretation: Given the data, the default rate is between 17 % and 33 % with 95% probability.

# Visualization
x <- seq(0, 1, length.out = 1000)
prior_density <- dbeta(x, prior_alpha, prior_beta)
post_density <- dbeta(x, post_alpha, post_beta)

n_pts <- length(x)
ggplot(tibble(x = rep(x, 2), density = c(prior_density, post_density),
              distribution = rep(c("Prior", "Posterior"), each = n_pts)),
       aes(x = x, y = density, color = distribution, linetype = distribution)) +
  geom_line(linewidth = 1) +
  labs(
    title = "Beta-Binomial Conjugacy: Default Rate Update",
    x = "Default Rate", y = "Density",
    color = "", linetype = ""
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

Show code
import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt

# Beta-Binomial
prior_alpha, prior_beta = 2, 8
observed_defaults, observed_non_defaults = 25, 75

post_alpha = prior_alpha + observed_defaults
post_beta = prior_beta + observed_non_defaults

prior_mean = prior_alpha / (prior_alpha + prior_beta)
post_mean = post_alpha / (post_alpha + post_beta)
observed_rate = observed_defaults / (observed_defaults + observed_non_defaults)

print("\n=== Beta-Binomial Conjugacy ===")
#> 
#> === Beta-Binomial Conjugacy ===
print(f"Prior Beta(α={prior_alpha}, β={prior_beta})")
#> Prior Beta(α=2, β=8)
print(f"Prior mean: {prior_mean:.3f}")
#> Prior mean: 0.200
print(f"Observed default rate: {observed_rate:.3f}")
#> Observed default rate: 0.250
print(f"Posterior Beta(α={post_alpha}, β={post_beta})")
#> Posterior Beta(α=27, β=83)
print(f"Posterior mean: {post_mean:.3f}")
#> Posterior mean: 0.245

# Credible interval
lower = beta.ppf(0.025, post_alpha, post_beta)
upper = beta.ppf(0.975, post_alpha, post_beta)

print(f"\n95% Credible Interval: {lower:.3f} to {upper:.3f}")
#> 
#> 95% Credible Interval: 0.170 to 0.330
print(f"Interpretation: Default rate is between {lower*100:.1f}% and {upper*100:.1f}% with 95% probability")
#> Interpretation: Default rate is between 17.0% and 33.0% with 95% probability

# Visualization
x = np.linspace(0, 1, 1000)
prior_density = beta.pdf(x, prior_alpha, prior_beta)
post_density = beta.pdf(x, post_alpha, post_beta)

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(x, prior_density, label='Prior', linewidth=2)
ax.plot(x, post_density, label='Posterior', linewidth=2)
ax.axvline(prior_mean, color='steelblue', linestyle='--', alpha=0.5, label=f'Prior Mean: {prior_mean:.3f}')
ax.axvline(post_mean, color='orange', linestyle='--', alpha=0.5, label=f'Posterior Mean: {post_mean:.3f}')
ax.fill_between(x[(x >= lower) & (x <= upper)], 0,
                post_density[(x >= lower) & (x <= upper)], alpha=0.3, color='orange')
ax.set_xlabel('Default Rate')
ax.set_ylabel('Density')
ax.set_title('Beta-Binomial Conjugacy: Default Rate Update')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

61.5 Bayesian Regression

Ordinary regression gives point estimates of coefficients. Bayesian regression gives posterior distributions. For a simple linear regression \(y = \alpha + \beta x + \epsilon\) with \(\epsilon \sim N(0, \sigma^2)\), assign priors to \(\alpha\), \(\beta\), and \(\sigma\). The posterior is a distribution, and credible intervals (not confidence intervals) are interpretable.

Unlike frequentist CIs, a Bayesian credible interval directly states: “The coefficient is in this range with 95% probability given the data and prior.”

We demonstrate Bayesian simple linear regression using grid approximation (no external MCMC package required). The setting: predict monthly loan repayment rate from customer engagement score for a Nigerian microfinance bank.

Show code
library(tidyverse)

set.seed(8273)

# Simulate data: loan repayment rate ~ engagement score
n_obs       <- 120
engagement  <- runif(n_obs, 1, 10)
repay_rate  <- 0.40 + 0.05 * engagement + rnorm(n_obs, 0, 0.08)
repay_rate  <- pmin(pmax(repay_rate, 0.05), 1.00)

loan_df <- tibble(engagement = engagement, repay_rate = repay_rate)

# --- Grid approximation for Bayesian linear regression ---
# Model: repay_rate = alpha + beta * engagement + eps, eps ~ N(0, sigma^2)
# Priors: alpha ~ N(0.40, 0.10), beta ~ N(0.05, 0.02), sigma ~ Uniform(0.01, 0.30)

alpha_grid <- seq(0.20, 0.60, length.out = 40)
beta_grid  <- seq(0.00, 0.12, length.out = 40)
sigma      <- 0.08  # fix sigma at MLE estimate for 2D grid

# Compute log-posterior on grid
log_post <- outer(alpha_grid, beta_grid, FUN = function(a, b) {
  log_prior    <- dnorm(a, 0.40, 0.10, log = TRUE) + dnorm(b, 0.05, 0.02, log = TRUE)
  fitted       <- a + b * engagement
  log_lik      <- sum(dnorm(repay_rate, mean = fitted, sd = sigma, log = TRUE))
  log_prior + log_lik
})

# Normalise to get posterior probabilities
post_prob <- exp(log_post - max(log_post))
post_prob <- post_prob / sum(post_prob)

# Marginal posteriors
alpha_marginal <- rowSums(post_prob)
beta_marginal  <- colSums(post_prob)

alpha_map <- alpha_grid[which.max(alpha_marginal)]
beta_map  <- beta_grid[which.max(beta_marginal)]

# 95% credible intervals
alpha_cdf <- cumsum(alpha_marginal) / sum(alpha_marginal)
alpha_ci  <- c(alpha_grid[which(alpha_cdf >= 0.025)[1]],
               alpha_grid[which(alpha_cdf >= 0.975)[1]])

beta_cdf  <- cumsum(beta_marginal) / sum(beta_marginal)
beta_ci   <- c(beta_grid[which(beta_cdf >= 0.025)[1]],
               beta_grid[which(beta_cdf >= 0.975)[1]])

cat("=== Bayesian Linear Regression (Grid Approximation) ===\n")
#> === Bayesian Linear Regression (Grid Approximation) ===
cat("Intercept (alpha):\n")
#> Intercept (alpha):
cat("  MAP estimate:", round(alpha_map, 4), "\n")
#>   MAP estimate: 0.3949
cat("  95% Credible Interval:", round(alpha_ci, 4), "\n")
#>   95% Credible Interval: 0.2308 0.5692
cat("\nSlope (beta, per unit engagement):\n")
#> 
#> Slope (beta, per unit engagement):
cat("  MAP estimate:", round(beta_map, 4), "\n")
#>   MAP estimate: 0.0492
cat("  95% Credible Interval:", round(beta_ci, 4), "\n")
#>   95% Credible Interval: 0.0123 0.0892
cat("\nInterpretation: a 1-unit increase in engagement raises expected\n")
#> 
#> Interpretation: a 1-unit increase in engagement raises expected
cat("repayment rate by", round(beta_map * 100, 1), "pp (95% CI:",
    round(beta_ci[1] * 100, 1), "–", round(beta_ci[2] * 100, 1), "pp)\n")
#> repayment rate by 4.9 pp (95% CI: 1.2 – 8.9 pp)

# Visualise: posterior over (alpha, beta) + data
par(mfrow = c(1, 2))
image(alpha_grid, beta_grid, post_prob,
      main = "Joint Posterior: α × β",
      xlab = "Intercept (α)", ylab = "Slope (β)",
      col  = hcl.colors(20, "YlOrRd"))
abline(v = alpha_map, h = beta_map, lty = 2, col = "white")

plot(engagement, repay_rate,
     col = "steelblue", pch = 16, cex = 0.7,
     main = "Posterior Predictive Fit",
     xlab = "Engagement Score", ylab = "Repayment Rate")
abline(a = alpha_map, b = beta_map, col = "red", lwd = 2)

Show code
par(mfrow = c(1, 1))
Show code
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(8273)

# Simulate data
n_obs      = 120
engagement = np.random.uniform(1, 10, n_obs)
repay_rate = np.clip(0.40 + 0.05 * engagement + np.random.normal(0, 0.08, n_obs), 0.05, 1.0)

# Grid approximation
alpha_grid = np.linspace(0.20, 0.60, 40)
beta_grid  = np.linspace(0.00, 0.12, 40)
sigma      = 0.08

# Log-posterior
AA, BB = np.meshgrid(alpha_grid, beta_grid, indexing='ij')
log_prior = (-(AA - 0.40)**2 / (2 * 0.10**2) +
             -(BB - 0.05)**2 / (2 * 0.02**2))

# Vectorised log-likelihood
fitted    = AA[:, :, np.newaxis] + BB[:, :, np.newaxis] * engagement[np.newaxis, np.newaxis, :]
log_lik   = np.sum(-0.5 * ((repay_rate - fitted) / sigma)**2, axis=2)

log_post  = log_prior + log_lik
post      = np.exp(log_post - log_post.max())
post     /= post.sum()

# Marginals
alpha_marg = post.sum(axis=1)
beta_marg  = post.sum(axis=0)

alpha_map  = alpha_grid[alpha_marg.argmax()]
beta_map   = beta_grid[beta_marg.argmax()]

# Credible intervals
alpha_cdf  = np.cumsum(alpha_marg) / alpha_marg.sum()
beta_cdf   = np.cumsum(beta_marg) / beta_marg.sum()
alpha_ci   = (alpha_grid[np.searchsorted(alpha_cdf, 0.025)],
              alpha_grid[np.searchsorted(alpha_cdf, 0.975)])
beta_ci    = (beta_grid[np.searchsorted(beta_cdf, 0.025)],
              beta_grid[np.searchsorted(beta_cdf, 0.975)])

print("=== Bayesian Linear Regression (Grid Approximation) ===")
#> === Bayesian Linear Regression (Grid Approximation) ===
print(f"Intercept MAP: {alpha_map:.4f},  95% CI: [{alpha_ci[0]:.4f}, {alpha_ci[1]:.4f}]")
#> Intercept MAP: 0.4051,  95% CI: [0.3846, 0.4359]
print(f"Slope MAP:     {beta_map:.4f},   95% CI: [{beta_ci[0]:.4f}, {beta_ci[1]:.4f}]")
#> Slope MAP:     0.0492,   95% CI: [0.0431, 0.0523]
print(f"\nInterpretation: 1-unit engagement → +{beta_map*100:.1f}pp repayment rate")
#> 
#> Interpretation: 1-unit engagement → +4.9pp repayment rate
print(f"  (95% CI: {beta_ci[0]*100:.1f}{beta_ci[1]*100:.1f} pp)")
#>   (95% CI: 4.3 – 5.2 pp)

# Visualise
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 5))

ax1.contourf(alpha_grid, beta_grid, post.T, levels=20, cmap='YlOrRd')
ax1.axvline(alpha_map, color='white', linestyle='--', linewidth=1)
ax1.axhline(beta_map,  color='white', linestyle='--', linewidth=1)
ax1.set_xlabel('Intercept (α)', fontsize=11)
ax1.set_ylabel('Slope (β)',     fontsize=11)
ax1.set_title('Joint Posterior: α × β', fontweight='bold')

x_line = np.linspace(1, 10, 100)
ax2.scatter(engagement, repay_rate, s=20, alpha=0.6, color='steelblue')
ax2.plot(x_line, alpha_map + beta_map * x_line, color='red',
         linewidth=2, label='MAP fit')
ax2.set_xlabel('Engagement Score', fontsize=11)
ax2.set_ylabel('Repayment Rate', fontsize=11)
ax2.set_title('Posterior Predictive Fit', fontweight='bold')
ax2.legend()
ax2.grid(alpha=0.3)

plt.tight_layout()
plt.show()

61.6 Probabilistic Programming: Stan and PyMC

For complex models, analytical posteriors don’t exist. Markov Chain Monte Carlo (MCMC) algorithms sample from the posterior numerically. Stan (R) and PyMC (Python) are languages where you write the model generatively, and the software automatically generates MCMC samplers.

The example below shows a conceptual Stan/PyMC model for a hierarchical default-rate analysis (50 Nigerian states, each with its own default rate, drawn from a common national distribution). We show the model specification and explain the output rather than running the full MCMC, which can take several minutes.

Show code
cat("=== Hierarchical Bayesian Model: Default Rates Across States ===\n\n")
#> === Hierarchical Bayesian Model: Default Rates Across States ===

cat("Stan model specification (conceptual — not executed):\n")
#> Stan model specification (conceptual — not executed):
cat("────────────────────────────────────────────────────\n")
#> ────────────────────────────────────────────────────
cat("data {\n")
#> data {
cat("  int<lower=0> N_states;             // number of states (50)\n")
#>   int<lower=0> N_states;             // number of states (50)
cat("  int<lower=0> n[N_states];          // loans per state\n")
#>   int<lower=0> n[N_states];          // loans per state
cat("  int<lower=0> k[N_states];          // defaults per state\n")
#>   int<lower=0> k[N_states];          // defaults per state
cat("}\n\n")
#> }
cat("parameters {\n")
#> parameters {
cat("  real<lower=0,upper=1> mu;          // national mean default rate\n")
#>   real<lower=0,upper=1> mu;          // national mean default rate
cat("  real<lower=0>         kappa;       // concentration (higher = less pooling)\n")
#>   real<lower=0>         kappa;       // concentration (higher = less pooling)
cat("  real<lower=0,upper=1> theta[N_states]; // state-level default rates\n")
#>   real<lower=0,upper=1> theta[N_states]; // state-level default rates
cat("}\n\n")
#> }
cat("model {\n")
#> model {
cat("  mu    ~ beta(2, 8);                // prior: ~20% national default rate\n")
#>   mu    ~ beta(2, 8);                // prior: ~20% national default rate
cat("  kappa ~ exponential(0.1);\n")
#>   kappa ~ exponential(0.1);
cat("  for (s in 1:N_states) {\n")
#>   for (s in 1:N_states) {
cat("    theta[s] ~ beta(mu * kappa, (1 - mu) * kappa);  // hyperprior\n")
#>     theta[s] ~ beta(mu * kappa, (1 - mu) * kappa);  // hyperprior
cat("    k[s]     ~ binomial(n[s], theta[s]);             // likelihood\n")
#>     k[s]     ~ binomial(n[s], theta[s]);             // likelihood
cat("  }\n")
#>   }
cat("}\n\n")
#> }

cat("Key idea: partial pooling\n")
#> Key idea: partial pooling
cat("  - States with many loans: posterior theta[s] ≈ observed default rate\n")
#>   - States with many loans: posterior theta[s] ≈ observed default rate
cat("  - States with few loans: posterior theta[s] shrinks toward national mu\n")
#>   - States with few loans: posterior theta[s] shrinks toward national mu
cat("  - This prevents noisy estimates for data-sparse states\n\n")
#>   - This prevents noisy estimates for data-sparse states

# Simulate what the hierarchical model would produce
set.seed(9147)
n_states  <- 36  # Nigeria's 36 states
mu_true   <- 0.15
kappa     <- 20   # concentration

# Each state has a true default rate drawn from the hyperprior
theta_true <- rbeta(n_states, mu_true * kappa, (1 - mu_true) * kappa)

# Observed data (variable sample sizes across states)
n_loans   <- round(rexp(n_states, 1 / 200)) + 20
k_defaults<- rbinom(n_states, n_loans, theta_true)
obs_rate  <- k_defaults / n_loans

# Naive estimate vs partial pooling (James-Stein shrinkage approximation)
pooled_estimate <- (k_defaults + mu_true * kappa) / (n_loans + kappa)

state_df <- tibble(
  state        = paste0("State_", 1:n_states),
  n_loans      = n_loans,
  k_defaults   = k_defaults,
  obs_rate     = obs_rate,
  pooled_est   = pooled_estimate,
  true_rate    = theta_true
)

cat("=== Partial Pooling Effect (Simulated) ===\n")
#> === Partial Pooling Effect (Simulated) ===
cat(sprintf("%-12s %8s %10s %10s %10s\n",
            "State", "n_loans", "Obs Rate", "Pooled", "True Rate"))
#> State         n_loans   Obs Rate     Pooled  True Rate
cat(paste(rep("-", 54), collapse = ""), "\n")
#> ------------------------------------------------------
for (i in 1:10) {
  cat(sprintf("%-12s %8d %10.3f %10.3f %10.3f\n",
              state_df$state[i], state_df$n_loans[i],
              state_df$obs_rate[i], state_df$pooled_est[i],
              state_df$true_rate[i]))
}
#> State_1          1170      0.145      0.145      0.135
#> State_2            74      0.149      0.149      0.123
#> State_3            57      0.140      0.143      0.080
#> State_4           441      0.088      0.091      0.104
#> State_5           677      0.131      0.132      0.126
#> State_6            85      0.071      0.086      0.075
#> State_7            36      0.361      0.286      0.289
#> State_8           159      0.145      0.145      0.152
#> State_9            51      0.078      0.099      0.092
#> State_10          232      0.073      0.079      0.118

# RMSE comparison
rmse_naive  <- sqrt(mean((obs_rate - theta_true)^2))
rmse_pooled <- sqrt(mean((pooled_estimate - theta_true)^2))
cat(sprintf("\nRMSE (naive observed rate): %.4f\n", rmse_naive))
#> 
#> RMSE (naive observed rate): 0.0339
cat(sprintf("RMSE (partial pooling):     %.4f\n", rmse_pooled))
#> RMSE (partial pooling):     0.0290
cat(sprintf("Improvement:                %.1f%%\n", (rmse_naive - rmse_pooled) / rmse_naive * 100))
#> Improvement:                14.7%
Show code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

print("=== Hierarchical Bayesian Model: State-Level Default Rates ===\n")
#> === Hierarchical Bayesian Model: State-Level Default Rates ===

print("PyMC model specification (conceptual — not executed):")
#> PyMC model specification (conceptual — not executed):
print("─" * 54)
#> ──────────────────────────────────────────────────────
print("import pymc as pm")
#> import pymc as pm
print("")
print("with pm.Model() as hierarchical_model:")
#> with pm.Model() as hierarchical_model:
print("    # Hyperpriors (national distribution)")
#>     # Hyperpriors (national distribution)
print("    mu    = pm.Beta('mu', alpha=2, beta=8)")
#>     mu    = pm.Beta('mu', alpha=2, beta=8)
print("    kappa = pm.Exponential('kappa', lam=0.1)")
#>     kappa = pm.Exponential('kappa', lam=0.1)
print("")
print("    # State-level default rates")
#>     # State-level default rates
print("    theta = pm.Beta('theta', alpha=mu*kappa,")
#>     theta = pm.Beta('theta', alpha=mu*kappa,
print("                            beta=(1-mu)*kappa, shape=N_states)")
#>                             beta=(1-mu)*kappa, shape=N_states)
print("")
print("    # Likelihood")
#>     # Likelihood
print("    k = pm.Binomial('k', n=n_loans, p=theta, observed=k_obs)")
#>     k = pm.Binomial('k', n=n_loans, p=theta, observed=k_obs)
print("")
print("    # Sample with NUTS")
#>     # Sample with NUTS
print("    trace = pm.sample(2000, tune=1000, target_accept=0.9)")
#>     trace = pm.sample(2000, tune=1000, target_accept=0.9)
print()

# Simulate partial pooling
np.random.seed(9147)
n_states  = 36
mu_true   = 0.15
kappa     = 20.0

theta_true   = np.random.beta(mu_true * kappa, (1 - mu_true) * kappa, n_states)
n_loans      = np.clip(np.random.exponential(200, n_states).astype(int), 20, None)
k_defaults   = np.random.binomial(n_loans, theta_true)
obs_rate     = k_defaults / n_loans
pooled_est   = (k_defaults + mu_true * kappa) / (n_loans + kappa)

rmse_naive   = np.sqrt(np.mean((obs_rate   - theta_true)**2))
rmse_pooled  = np.sqrt(np.mean((pooled_est - theta_true)**2))

print(f"RMSE (naive observed rate):  {rmse_naive:.4f}")
#> RMSE (naive observed rate):  0.0438
print(f"RMSE (partial pooling):      {rmse_pooled:.4f}")
#> RMSE (partial pooling):      0.0403
print(f"Improvement:                 {(rmse_naive - rmse_pooled) / rmse_naive * 100:.1f}%")
#> Improvement:                 8.1%

# Visualise shrinkage
fig, ax = plt.subplots(figsize=(11, 5))
ax.scatter(obs_rate, theta_true,   s=n_loans/10, alpha=0.5,
           color='steelblue', label='Naive (obs rate)')
ax.scatter(pooled_est, theta_true, s=n_loans/10, alpha=0.5,
           color='#2ca02c', label='Partial pooling')
ax.plot([0, 0.5], [0, 0.5], '--', color='grey', linewidth=1)
ax.set_xlabel('Estimated Default Rate', fontsize=11)
ax.set_ylabel('True Default Rate', fontsize=11)
ax.set_title('Partial Pooling: Estimates Closer to True Values\n'
             '(point size = sample size)', fontweight='bold')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

Example: estimating customer lifetime value (CLV) which depends on retention probability and spending per period. Neither retention nor spending is directly observed; instead, we observe purchase histories. A hierarchical Bayesian model pools information across customers, shrinking estimates of low-data individuals toward the population mean.

61.7 When to Choose Bayesian vs. Frequentist

Choose Bayesian when: - Small sample sizes (prior knowledge stabilizes estimates) - Sequential updating (credit scoring, A/B testing) - Honest uncertainty quantification matters (credible intervals) - Hierarchical structure (customers nested in regions nested in countries) - Prior information is valuable

Choose Frequentist when: - Large samples (prior matters less) - Regulatory requirements mandate p-values and fixed significance levels - Computational speed is critical - The question is naturally phrased as “Can we reject the null?” (equivalence tests, non-inferiority)

61.8 Case Study: Bayesian Default Probability Updating

A Nigerian microfinance bank develops a dynamic credit score. The prior default probability for a borrower in a rural zone is 20% (historical data). As the customer makes payments (6 months), the Bayesian model updates: after 6 on-time payments, default probability drops to 8%; after 12 months, 5%. The bank uses this dynamic risk assessment to adjust interest rates: at 8% default risk, the rate is 18%; at 5%, the rate is 14%. This automated pricing responds to observed behavior, rewarding reliable payers and reducing rates, while maintaining spreads based on true risk.

Sequential Bayes enables this: each payment is processed as new data, posterior updated overnight, rates adjusted daily. A frequentist approach would require monthly or quarterly batch re-estimation, missing opportunities for real-time adaptation.

Chapter 56 Exercises

  1. Prior Sensitivity Analysis: Build a Bayesian logistic regression for churn prediction. Test with three priors (weakly informative, moderately informative, strongly informative). How much do estimates differ?

  2. Hierarchical Model: Model default rates across 50 Nigerian states. State-level estimates are noisy (few observations). Use a hierarchical prior to shrink state estimates toward the national mean, with state-specific offsets. Compare to frequentist state-wise estimates.

  3. A/B Test with Bayesian Stopping: Run an A/B test on conversion rates. Use a Bayesian approach to decide when to stop: stop when posterior P(A is better than B) exceeds 95% or P(B better) exceeds 95%. Compare to frequentist fixed-sample test.

  4. Causal Inference with Bayesian Model Averaging: Estimate the causal effect of a promotion on sales using multiple model specifications (different sets of controls). Weight models by posterior probability (Bayesian Model Averaging) rather than selecting one.

  5. Posterior Predictive Checking: After fitting a Bayesian model to historical data, generate simulations from the posterior predictive distribution. Visually compare to observed data to assess model fit.

61.9 Further Reading

  • McElreath, R. (2020). Statistical rethinking: A Bayesian course with examples in R and Stan. Chapman and Hall/CRC.
  • Gelman, A., et al. (2013). Bayesian data analysis (3rd ed.). Chapman and Hall/CRC.
  • Carpenter, B., et al. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 1–32.
  • PyMC Documentation: https://www.pymc.io/

61.10 Chapter 56 Appendix: Mathematical Derivations

61.10.1 A56.1 Bayes’ Theorem from Probability Axioms

From the definition of conditional probability:

\[P(A \cap B) = P(A|B) P(B) = P(B|A) P(A)\]

Rearranging:

\[P(A|B) = \frac{P(B|A) P(A)}{P(B)}\]

For disjoint partitions \(A_1, \ldots, A_k\):

\[P(A_i | B) = \frac{P(B|A_i) P(A_i)}{\sum_{j=1}^k P(B|A_j) P(A_j)}\]

61.10.2 A56.2 Beta-Binomial Conjugacy Derivation

Prior: \(\theta \sim \text{Beta}(\alpha, \beta)\), so \(p(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}\).

Likelihood given \(n\) trials with \(k\) successes: \(p(k|\theta) \propto \theta^k (1-\theta)^{n-k}\).

Posterior: \(p(\theta|k) \propto \theta^k (1-\theta)^{n-k} \theta^{\alpha-1}(1-\theta)^{\beta-1} = \theta^{\alpha+k-1} (1-\theta)^{\beta+n-k-1}\).

This is \(\text{Beta}(\alpha + k, \beta + n - k)\).

61.10.3 A56.3 Markov Chain Monte Carlo (MCMC)

To sample from a complex posterior \(p(\theta|y)\), MCMC uses a random walk that, over many iterations, converges to the stationary distribution \(p(\theta|y)\). The Metropolis-Hastings algorithm:

  1. Start with \(\theta^{(0)}\)
  2. At iteration \(t\), propose \(\theta' \sim q(\theta' | \theta^{(t)})\) (symmetric proposal)
  3. Accept with probability \(\min(1, \frac{p(\theta'|y)}{p(\theta^{(t)}|y)})\)
  4. If accepted, \(\theta^{(t+1)} = \theta'\); otherwise, \(\theta^{(t+1)} = \theta^{(t)}\)

The No-U-Turn Sampler (NUTS), used by Stan, is more efficient: it adapts the proposal variance and step size automatically.

61.10.4 A56.4 Credible Interval vs. Confidence Interval

A 95% credible interval is a range \([a, b]\) such that \(P(a < \theta < b | y) = 0.95\). This is a direct probability statement about the parameter given data.

A 95% confidence interval is a procedure such that, if repeated infinitely, 95% of intervals contain the true parameter. This is a statement about the long-run procedure, not the current interval.


61.11 Chapter 56 Acknowledgments and Final Thoughts

This chapter closes a 7-chapter sequence on analytics for African business. From demand forecasting (Ch. 50) through operations, people, and simulation, the themes are: data-driven decisions, quantified uncertainty, fairness, and actionable insights.

Bayesian thinking embodies the best of analytics: starting with prior knowledge (experience, institutional memory), updating beliefs as new data arrives, and making decisions under uncertainty. It is the natural framework for iterative improvement and continuous learning—the essence of modern business analytics.

For readers in Nigeria and across Africa: the tools in this book are powerful, but their ethical application matters most. An attrition model that accurately predicts who will leave is useful only if it informs fair retention offers, not discriminatory targeting. A failure prediction model is valuable only if it guides equitable maintenance, not biased intervention. Always audit your models for bias, be transparent about uncertainty, and place human judgment at the center of decisions.


61.12 Appendix: Selected Solutions to Chapter Exercises

(Abbreviated solutions to key exercises across chapters)

61.12.1 Chapter 50: Exercise 1 – Promotional Calendar Optimization

Formulate as a mixed-integer program. Variables: \(x_{ij} \in \{0,1\}\) = 1 if SKU \(i\) is promoted in window \(j\). Maximize \(\sum_i \sum_j (\text{Uplift}_{ij} - \text{Cannibalization}_{ij}) x_{ij}\) subject to: each SKU promoted at most twice, promotion budget constraints, production capacity constraints. Solve with lpSolve (R) or PuLP (Python).

61.12.2 Chapter 53: Exercise 2 – Promotion Prediction Model

Build XGBoost to predict promotion within 12 months. Features: prior promotions, current grade, performance rating, tenure, engagement. Compare predicted promotion probability to attrition risk. Identify flight risks: high attrition risk, high promotion probability (likely to leave if not promoted).

61.12.3 Chapter 55: Exercise 4 – Bootstrap Parameter Estimation

Sample historical cost data with replacement (100 times). For each sample, estimate μ and σ. Use this distribution of (μ, σ) as the prior in the simulation. Results show that parameter uncertainty widens the final NPV distribution.