38  Neural Networks: An Accessible Introduction

Author

Bongo Adi

Note📋 Learning Objectives
  • Understand the biological inspiration and mathematical formulation of artificial neurons
  • Build and train feedforward neural networks from first principles using matrix algebra
  • Apply backpropagation to optimise network weights on real data
  • Recognise when neural networks are the appropriate tool versus simpler alternatives
  • Implement a complete neural network pipeline on Nigerian telecom churn data

38.1 The Biological Analogy and Beyond

The story of artificial neural networks begins not in a computer science lab, but in the neuroscience of the 1940s. Warren McCulloch and Walter Pitts observed that biological neurons operate as thresholding devices: they receive signals through dendrites from other neurons, integrate those signals in the cell body, and fire an electrical impulse down the axon if the cumulative signal exceeded a threshold. This observation—that a neuron was fundamentally a threshold function—led to a beautiful mathematical abstraction: the artificial neuron. However, it is crucial to understand that the analogy, while pedagogically useful, is limited. Real neurons are far more complex: they communicate through chemical synapses with variable efficacy, their firing rate depends on temporal patterns of input, they exhibit plasticity over days and weeks, and they operate in networks that exploit feedback loops, neuromodulators, and mechanisms we still do not fully understand. The artificial neuron is a caricature—a drastic simplification that captures one essential idea (integration and thresholding) and discards almost everything else. We use it anyway because, despite its crudeness, it works.

The practical history of neural networks is one of cycles of hype and disappointment. Frank Rosenblatt introduced the perceptron in 1957, a single-layer network that could learn a linear decision boundary. For years, researchers dreamed of stacking layers to build deeper networks. That dream collided with a mathematical wall in the 1970s and 1980s: the vanishing gradient problem. When you tried to train deep networks by backpropagation (the algorithm for computing weight gradients), the gradients flowing backward through many layers would shrink exponentially, becoming so small that learning in the early layers stalled. The field retreated. Support vector machines, random forests, and kernel methods became fashionable. Then, between 2011 and 2012, three breakthroughs converged: GPUs made it cheap to train large networks in hours instead of months, massive datasets became available (ImageNet had 14 million labelled images by 2010), and researchers—Hinton, Bengio, LeCun, and others—discovered that ReLU activations and careful weight initialisation largely sidestepped the vanishing gradient problem. The 2012 ImageNet competition saw a convolutional neural network (AlexNet) vastly outperform traditional computer vision methods, and the deep learning renaissance began. Today, neural networks drive language models, computer vision, speech recognition, and play dominoes.

38.2 The Artificial Neuron

At the heart of every neural network sits the artificial neuron: a function that takes multiple numeric inputs, weights them, sums them, adds a bias, and passes the result through an activation function. Mathematically, given inputs \(x_1, x_2, \ldots, x_n\), weights \(w_1, w_2, \ldots, w_n\), and bias \(b\), we compute the pre-activation value (or “logit”) as: \[z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b = \mathbf{w}^T \mathbf{x} + b\]

We then apply an activation function \(f\) to introduce nonlinearity: \[a = f(z)\]

The output \(a\) becomes the input to the next layer. Without the activation function—if \(f\) were the identity function—stacking layers would be pointless because the composition of linear functions is linear, and a multi-layer network would compute no more than a single linear transformation. Activation functions are what give neural networks their expressive power.

The sigmoid activation function, \(\sigma(z) = \frac{1}{1 + e^{-z}}\), was historically the standard. It maps any input to the range \((0, 1)\), making it natural for binary classification (interpret the output as a probability). However, sigmoid has a sharp peak in its derivative around \(z = 0\), meaning gradients are tiny except in a narrow region. This exacerbates the vanishing gradient problem in deep networks. The hyperbolic tangent, \(\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}\), maps to \((-1, 1)\) and has a steeper gradient than sigmoid, making it slightly better for hidden layers. Today, the Rectified Linear Unit (ReLU), defined as \(f(z) = \max(0, z)\), dominates hidden layers. It is simple to compute, its gradient is a constant 1 for positive inputs, and it naturally gives rise to sparse activations (many neurons output exactly zero). For multi-class classification, the softmax function normalises a vector of logits into a probability distribution: given logits \(z_1, \ldots, z_K\) for \(K\) classes, softmax outputs \[p_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}\] Each \(p_k\) is between 0 and 1, and they sum to 1, so we can interpret them as class probabilities.

Note📘 Theory: Activation Functions and Their Properties
Tip🔑 Key Formula

The artificial neuron computes: \[a = f\left(\sum_{i=1}^{n} w_i x_i + b\right) = f(\mathbf{w}^T \mathbf{x} + b)\]

where \(f\) is an activation function such as: - Sigmoid: \(\sigma(z) = \frac{1}{1+e^{-z}}\) - ReLU: \(f(z) = \max(0, z)\) - Tanh: \(\tanh(z) = \frac{e^{2z}-1}{e^{2z}+1}\) - Softmax (for output): \(p_k = \frac{e^{z_k}}{\sum_j e^{z_j}}\)

Show code
# Load required libraries
library(ggplot2)
library(dplyr)
library(gridExtra)

# Define activation functions
sigmoid <- function(z) {
  1 / (1 + exp(-z))
}

relu <- function(z) { r <- pmax(0, z); dim(r) <- dim(z); r }

tanh_activation <- function(z) {
  tanh(z)
}

softmax <- function(z) {
  exp_z <- exp(z - max(z))  # Subtract max for numerical stability
  exp_z / sum(exp_z)
}

# Create input range
z_range <- seq(-5, 5, by = 0.01)

# Compute activations
df_activations <- data.frame(
  z = z_range,
  sigmoid = sigmoid(z_range),
  relu = relu(z_range),
  tanh = tanh_activation(z_range)
)

# Plot activations
df_long <- df_activations |>
  tidyr::pivot_longer(-z, names_to = "activation", values_to = "output")

p_activations <- ggplot(df_long, aes(x = z, y = output, colour = activation)) +
  geom_line(linewidth = 1) +
  theme_minimal() +
  labs(
    title = "Activation Functions",
    x = "Pre-activation (z)",
    y = "Output",
    colour = "Activation"
  ) +
  theme(legend.position = "bottom")

# Plot derivatives
df_deriv <- data.frame(
  z = z_range,
  sigmoid_deriv = sigmoid(z_range) * (1 - sigmoid(z_range)),
  relu_deriv = as.numeric(z_range > 0),
  tanh_deriv = 1 - tanh_activation(z_range)^2
)

df_deriv_long <- df_deriv |>
  tidyr::pivot_longer(-z, names_to = "derivative", values_to = "grad")

p_derivatives <- ggplot(df_deriv_long, aes(x = z, y = grad, colour = derivative)) +
  geom_line(linewidth = 1) +
  theme_minimal() +
  labs(
    title = "Gradients of Activation Functions",
    x = "Pre-activation (z)",
    y = "Gradient",
    colour = "Derivative"
  ) +
  theme(legend.position = "bottom")

# Print plots
print(p_activations)

Show code
print(p_derivatives)

Show code

# Demonstrate softmax
logits <- c(2.0, 1.0, 0.1)
probs <- softmax(logits)
cat("Logits:", logits, "\n")
#> Logits: 2 1 0.1
cat("Softmax probabilities:", probs, "\n")
#> Softmax probabilities: 0.6590011 0.242433 0.09856589
cat("Sum of probabilities:", sum(probs), "\n")
#> Sum of probabilities: 1
Show code
import numpy as np
import matplotlib.pyplot as plt

# Define activation functions
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

def tanh_activation(z):
    return np.tanh(z)

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # Subtract max for numerical stability
    return exp_z / np.sum(exp_z)

# Create input range
z_range = np.linspace(-5, 5, 200)

# Compute activations
sigmoid_out = sigmoid(z_range)
relu_out = relu(z_range)
tanh_out = tanh_activation(z_range)

# Compute derivatives
sigmoid_deriv = sigmoid(z_range) * (1 - sigmoid(z_range))
relu_deriv = (z_range > 0).astype(float)
tanh_deriv = 1 - tanh_activation(z_range)**2

# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Activation functions
ax1.plot(z_range, sigmoid_out, label='Sigmoid', linewidth=2)
ax1.plot(z_range, relu_out, label='ReLU', linewidth=2)
ax1.plot(z_range, tanh_out, label='Tanh', linewidth=2)
ax1.set_xlabel('Pre-activation (z)')
ax1.set_ylabel('Output')
ax1.set_title('Activation Functions')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Derivatives
ax2.plot(z_range, sigmoid_deriv, label='Sigmoid derivative', linewidth=2)
ax2.plot(z_range, relu_deriv, label='ReLU derivative', linewidth=2)
ax2.plot(z_range, tanh_deriv, label='Tanh derivative', linewidth=2)
ax2.set_xlabel('Pre-activation (z)')
ax2.set_ylabel('Gradient')
ax2.set_title('Gradients of Activation Functions')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Show code

# Demonstrate softmax
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(f"Logits: {logits}")
#> Logits: [2.  1.  0.1]
print(f"Softmax probabilities: {probs}")
#> Softmax probabilities: [0.65900114 0.24243297 0.09856589]
print(f"Sum of probabilities: {probs.sum()}")
#> Sum of probabilities: 1.0
Caution📝 Section 33.2 Review Questions
  1. Why is an activation function necessary? What would happen if we used the identity function \(f(z) = z\) in every layer?
  2. Explain why ReLU is preferred over sigmoid in modern deep networks, even though sigmoid was historically standard.
  3. The softmax function maps \(K\) logits to a probability distribution. What is the key property that makes it suitable for multi-class classification?
  4. Compute the sigmoid derivative \(\frac{d\sigma}{dz}\) analytically. At which value of \(z\) is the derivative maximum?

38.3 The Feedforward Network

A feedforward network is a series of layers stacked in sequence, where the output of one layer becomes the input to the next. The architecture flows information in one direction only: input → hidden layers → output. Layer 0 is the input layer (raw features). Layers 1 through \(L-1\) are hidden layers (their size and activation are design choices). Layer \(L\) is the output layer (whose activation depends on the task: sigmoid for binary classification, softmax for multi-class, linear for regression).

During the forward pass, we compute activations layer by layer. If \(\mathbf{a}^{(\ell)}\) denotes the activation vector at layer \(\ell\), and \(\mathbf{W}^{(\ell)}\) and \(\mathbf{b}^{(\ell)}\) are the weight matrix and bias vector connecting layer \(\ell-1\) to layer \(\ell\), then: \[\mathbf{z}^{(\ell)} = \mathbf{W}^{(\ell)} \mathbf{a}^{(\ell-1)} + \mathbf{b}^{(\ell)}\] \[\mathbf{a}^{(\ell)} = f^{(\ell)}\left(\mathbf{z}^{(\ell)}\right)\]

Starting from \(\mathbf{a}^{(0)} = \mathbf{x}\) (the input), we compute forward through all layers to get the final output \(\mathbf{a}^{(L)}\). Why does depth matter? Each layer learns a representation of the input at a different level of abstraction. Early layers (close to the input) learn low-level features: edges, corners, textures. Middle layers compose these into mid-level features: shapes, patterns. Deep layers recognise high-level semantic concepts: objects, entities, relationships. This hierarchy of representations is what makes deep networks powerful. A shallow network may need exponentially many neurons to capture such abstractions; a deeper network can build them up incrementally, using far fewer parameters.

The Universal Approximation Theorem states (roughly) that a feedforward network with one hidden layer and a nonlinear activation can approximate any continuous function on a compact domain, given enough hidden neurons. This is a powerful existence result: it guarantees that neural networks are expressive enough, in principle, to learn any relationship in the data. However, the theorem says nothing about how many neurons are needed (it could be astronomically large), how long training takes, or whether your optimisation algorithm will find a good solution. In practice, deeper networks with fewer total parameters often train faster and generalise better than shallow wide networks on real-world data.

Note📘 Theory: Feedforward Forward Pass
Tip🔑 Key Formula

For a feedforward network with \(L\) layers: \[\mathbf{a}^{(\ell)} = f^{(\ell)}\left(\mathbf{W}^{(\ell)} \mathbf{a}^{(\ell-1)} + \mathbf{b}^{(\ell)}\right), \quad \ell = 1, 2, \ldots, L\]

with \(\mathbf{a}^{(0)} = \mathbf{x}\) and output \(\hat{\mathbf{y}} = \mathbf{a}^{(L)}\).

Show code
# Build a simple 2-layer feedforward network from scratch in R
# We'll use a synthetic dataset with 2 inputs, 1 output (binary classification)

set.seed(2974)

# Activation functions
sigmoid <- function(z) 1 / (1 + exp(-z))
relu <- function(z) { r <- pmax(0, z); dim(r) <- dim(z); r }
softmax <- function(z) {
  exp_z <- exp(z - max(z))
  exp_z / sum(exp_z)
}

# Feedforward pass for a 2-layer network
feedforward <- function(X, W1, b1, W2, b2) {
  # Layer 1: hidden layer with ReLU
  Z1 <- X %*% W1 + matrix(rep(b1, nrow(X)), nrow = nrow(X), byrow = TRUE)
  A1 <- relu(Z1)

  # Layer 2: output layer with sigmoid (binary classification)
  Z2 <- A1 %*% W2 + matrix(rep(b2, nrow(X)), nrow = nrow(X), byrow = TRUE)
  A2 <- sigmoid(Z2)

  list(Z1 = Z1, A1 = A1, Z2 = Z2, A2 = A2)
}

# Generate synthetic data
n_samples <- 200
n_features <- 2
X <- matrix(rnorm(n_samples * n_features, sd = 2), nrow = n_samples, ncol = n_features)
# Nonlinear decision boundary: y = 1 if x1^2 + x2^2 > 4
Y <- as.numeric((X[, 1]^2 + X[, 2]^2) > 4)

# Initialize weights (small random values to break symmetry)
n_hidden <- 5
W1 <- matrix(rnorm(n_features * n_hidden, sd = 0.5), nrow = n_features, ncol = n_hidden)
b1 <- rnorm(n_hidden, sd = 0.1)
W2 <- matrix(rnorm(n_hidden * 1, sd = 0.5), nrow = n_hidden, ncol = 1)
b2 <- rnorm(1, sd = 0.1)

# Forward pass on the first 5 samples to show the computation
sample_indices <- 1:5
X_sample <- X[sample_indices, ]
results <- feedforward(X_sample, W1, b1, W2, b2)

cat("Sample inputs (first 5):\n")
#> Sample inputs (first 5):
print(X_sample)
#>             [,1]       [,2]
#> [1,]  0.59101290  3.7989968
#> [2,] -0.75013193  2.3211992
#> [3,] -0.03018071 -0.4979861
#> [4,]  1.78040344 -0.7276358
#> [5,]  1.30294891  0.8711701
cat("\nHidden layer activations (ReLU output):\n")
#> 
#> Hidden layer activations (ReLU output):
print(results$A1)
#>          [,1]      [,2] [,3] [,4]      [,5]
#> [1,] 3.260446 0.0000000    0    0 0.3176315
#> [2,] 2.128653 0.0000000    0    0 0.2661552
#> [3,] 0.000000 0.1453142    0    0 0.2608819
#> [4,] 0.000000 1.4824295    0    0 0.3101112
#> [5,] 0.813603 0.6370789    0    0 0.3111415
cat("\nNetwork output (sigmoid):\n")
#> 
#> Network output (sigmoid):
print(results$A2)
#>           [,1]
#> [1,] 0.2864065
#> [2,] 0.3505693
#> [3,] 0.4651329
#> [4,] 0.2790401
#> [5,] 0.3427895
cat("\nTrue labels:\n")
#> 
#> True labels:
print(Y[sample_indices])
#> [1] 1 1 0 0 0

# Visualize the network architecture
cat("\n=== Network Architecture ===\n")
#> 
#> === Network Architecture ===
cat("Input layer: 2 neurons\n")
#> Input layer: 2 neurons
cat("Hidden layer: 5 neurons (ReLU)\n")
#> Hidden layer: 5 neurons (ReLU)
cat("Output layer: 1 neuron (Sigmoid)\n")
#> Output layer: 1 neuron (Sigmoid)
cat("Total parameters:", nrow(W1)*ncol(W1) + length(b1) + nrow(W2)*ncol(W2) + length(b2), "\n")
#> Total parameters: 21
Show code
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(2974)

# Activation functions
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

# Feedforward pass for a 2-layer network
def feedforward(X, W1, b1, W2, b2):
    # Layer 1: hidden layer with ReLU
    Z1 = X @ W1 + b1
    A1 = relu(Z1)

    # Layer 2: output layer with sigmoid
    Z2 = A1 @ W2 + b2
    A2 = sigmoid(Z2)

    return {"Z1": Z1, "A1": A1, "Z2": Z2, "A2": A2}

# Generate synthetic nonlinear data
n_samples = 200
n_features = 2
X = np.random.randn(n_samples, n_features) * 2
Y = ((X[:, 0]**2 + X[:, 1]**2) > 4).astype(int)

# Initialize weights
n_hidden = 5
W1 = np.random.randn(n_features, n_hidden) * 0.5
b1 = np.random.randn(n_hidden) * 0.1
W2 = np.random.randn(n_hidden, 1) * 0.5
b2 = np.random.randn(1) * 0.1

# Forward pass on sample
results = feedforward(X[:5], W1, b1, W2, b2)

print("Sample inputs (first 5):")
#> Sample inputs (first 5):
print(X[:5])
#> [[-2.21564493  0.59976686]
#>  [-0.47714532  0.54812272]
#>  [-1.91015816  3.29127775]
#>  [ 0.13120205  1.1544781 ]
#>  [ 2.32261095  3.58430902]]
print("\nHidden layer activations (ReLU output):")
#> 
#> Hidden layer activations (ReLU output):
print(results['A1'])
#> [[0.         0.64406645 0.         0.         0.        ]
#>  [0.         0.38110653 0.07570392 0.         0.08609739]
#>  [0.         2.06792926 0.4724978  0.         1.18913564]
#>  [0.31842811 0.62899807 0.62512759 0.25410106 0.7718491 ]
#>  [1.75301403 1.65565274 2.70383849 2.02199661 3.40213261]]
print("\nNetwork output (sigmoid):")
#> 
#> Network output (sigmoid):
print(results['A2'])
#> [[0.45034023]
#>  [0.46083887]
#>  [0.25633343]
#>  [0.30354774]
#>  [0.01952779]]
print("\nTrue labels:")
#> 
#> True labels:
print(Y[:5])
#> [1 0 1 0 1]

# Visualize decision boundary
h = 0.02
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
Z_grid = np.c_[xx.ravel(), yy.ravel()]
predictions = feedforward(Z_grid, W1, b1, W2, b2)['A2'].reshape(xx.shape)

fig, ax = plt.subplots(figsize=(8, 6))
ax.contourf(xx, yy, predictions, levels=[0, 0.5, 1], colors=['lightblue', 'lightcoral'], alpha=0.6)
ax.scatter(X[Y == 0, 0], X[Y == 0, 1], c='blue', label='Class 0', s=30)
ax.scatter(X[Y == 1, 0], X[Y == 1, 1], c='red', label='Class 1', s=30)
ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
ax.set_title('2-Layer Network Decision Boundary (Untrained Weights)')
ax.legend()
plt.show()

Show code

print("\n=== Network Architecture ===")
#> 
#> === Network Architecture ===
print(f"Input layer: {n_features} neurons")
#> Input layer: 2 neurons
print(f"Hidden layer: {n_hidden} neurons (ReLU)")
#> Hidden layer: 5 neurons (ReLU)
print(f"Output layer: 1 neuron (Sigmoid)")
#> Output layer: 1 neuron (Sigmoid)
print(f"Total parameters: {W1.size + b1.size + W2.size + b2.size}")
#> Total parameters: 21
Caution📝 Section 33.3 Review Questions
  1. What is the purpose of the hidden layer in a feedforward network? Why can’t a single-layer network with the same total number of neurons fit a nonlinear boundary?
  2. If a network has 5 input neurons, 10 hidden neurons (layer 1), 8 hidden neurons (layer 2), and 3 output neurons, how many weight matrices do we have? What are their dimensions?
  3. The Universal Approximation Theorem guarantees that a single hidden layer can approximate any continuous function. Why do we still use deep networks in practice?

38.4 Loss Functions

A loss function quantifies how wrong the network’s predictions are. During training, we adjust weights to minimise the loss. The choice of loss function depends on the task.

For regression (predicting a continuous value), Mean Squared Error (MSE) is standard: \[L_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]

For binary classification, Binary Cross-Entropy (BCE) is nearly universal: \[L_{\text{BCE}} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]\]

For multi-class classification (predicting one of \(K\) classes), Categorical Cross-Entropy is used: \[L_{\text{CCE}} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_{i,k} \log(\hat{y}_{i,k})\]

where \(y_{i,k}\) is 1 if sample \(i\) belongs to class \(k\) and 0 otherwise, and \(\hat{y}_{i,k}\) is the model’s predicted probability for class \(k\). Cross-entropy penalises confident wrong predictions heavily. If the true label is \(y = 1\) but the model predicts \(\hat{y} = 0.99\), the loss is small. If \(\hat{y} = 0.01\), the loss is large (approximately \(-\log(0.01) \approx 4.6\)). This makes sense: being confidently wrong is worse than being uncertainly right.

Note📘 Theory: Loss Functions for Machine Learning Tasks
Tip🔑 Key Formula

Mean Squared Error (Regression): \[L = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]

Binary Cross-Entropy (Binary Classification): \[L = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right]\]

Categorical Cross-Entropy (Multi-class): \[L = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_{i,k} \log(\hat{y}_{i,k})\]

Show code
# Compute loss functions

binary_cross_entropy <- function(y_true, y_pred) {
  # Clip predictions to avoid log(0)
  y_pred <- pmax(pmin(y_pred, 1 - 1e-7), 1e-7)
  loss <- -mean(y_true * log(y_pred) + (1 - y_true) * log(1 - y_pred))
  return(loss)
}

mean_squared_error <- function(y_true, y_pred) {
  loss <- mean((y_true - y_pred)^2)
  return(loss)
}

categorical_cross_entropy <- function(y_true_matrix, y_pred_probs) {
  # y_true_matrix: n x K one-hot encoded matrix
  # y_pred_probs: n x K prediction probability matrix
  y_pred_probs <- pmax(pmin(y_pred_probs, 1 - 1e-7), 1e-7)
  loss <- -mean(rowSums(y_true_matrix * log(y_pred_probs)))
  return(loss)
}

# Example: Binary classification
set.seed(5183)
y_true_binary <- c(0, 1, 1, 0, 1)
y_pred_binary <- c(0.1, 0.8, 0.9, 0.3, 0.6)

bce_loss <- binary_cross_entropy(y_true_binary, y_pred_binary)
cat("Binary Cross-Entropy Loss:", round(bce_loss, 4), "\n")
#> Binary Cross-Entropy Loss: 0.2603

# Example: Regression
y_true_reg <- c(1.0, 2.5, 3.2, 0.8, 4.1)
y_pred_reg <- c(1.1, 2.3, 3.5, 0.5, 4.2)

mse_loss <- mean_squared_error(y_true_reg, y_pred_reg)
cat("Mean Squared Error Loss:", round(mse_loss, 4), "\n")
#> Mean Squared Error Loss: 0.048

# Example: Multi-class classification (3 classes)
y_true_multiclass <- matrix(c(
  1, 0, 0,
  0, 1, 0,
  0, 0, 1,
  1, 0, 0,
  0, 1, 0
), nrow = 5, ncol = 3, byrow = TRUE)

y_pred_multiclass <- matrix(c(
  0.7, 0.2, 0.1,
  0.1, 0.8, 0.1,
  0.1, 0.2, 0.7,
  0.6, 0.3, 0.1,
  0.2, 0.7, 0.1
), nrow = 5, ncol = 3, byrow = TRUE)

cce_loss <- categorical_cross_entropy(y_true_multiclass, y_pred_multiclass)
cat("Categorical Cross-Entropy Loss:", round(cce_loss, 4), "\n")
#> Categorical Cross-Entropy Loss: 0.3608

# Visualize the effect of prediction confidence on BCE loss
pred_range <- seq(0.01, 0.99, by = 0.01)
loss_y_true_1 <- -log(pred_range)  # Loss when true label is 1
loss_y_true_0 <- -log(1 - pred_range)  # Loss when true label is 0

df_loss <- data.frame(
  prediction = c(pred_range, pred_range),
  loss = c(loss_y_true_1, loss_y_true_0),
  true_label = c(rep("y=1", length(pred_range)), rep("y=0", length(pred_range)))
)

library(ggplot2)
ggplot(df_loss, aes(x = prediction, y = loss, colour = true_label)) +
  geom_line(linewidth = 1) +
  theme_minimal() +
  labs(
    title = "Binary Cross-Entropy Loss: Impact of Confidence",
    x = "Predicted Probability",
    y = "Loss",
    colour = "True Label"
  ) +
  ylim(0, 5)

Show code
import numpy as np
import matplotlib.pyplot as plt

def binary_cross_entropy(y_true, y_pred):
    # Clip predictions to avoid log(0)
    y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
    loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return loss

def mean_squared_error(y_true, y_pred):
    loss = np.mean((y_true - y_pred)**2)
    return loss

def categorical_cross_entropy(y_true, y_pred):
    # y_true: one-hot encoded (n x K)
    # y_pred: predicted probabilities (n x K)
    y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
    loss = -np.mean(np.sum(y_true * np.log(y_pred), axis=1))
    return loss

# Example: Binary classification
y_true_binary = np.array([0, 1, 1, 0, 1])
y_pred_binary = np.array([0.1, 0.8, 0.9, 0.3, 0.6])

bce = binary_cross_entropy(y_true_binary, y_pred_binary)
print(f"Binary Cross-Entropy Loss: {bce:.4f}")
#> Binary Cross-Entropy Loss: 0.2603

# Example: Regression
y_true_reg = np.array([1.0, 2.5, 3.2, 0.8, 4.1])
y_pred_reg = np.array([1.1, 2.3, 3.5, 0.5, 4.2])

mse = mean_squared_error(y_true_reg, y_pred_reg)
print(f"Mean Squared Error Loss: {mse:.4f}")
#> Mean Squared Error Loss: 0.0480

# Example: Multi-class classification
y_true_multiclass = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0]
])

y_pred_multiclass = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.2, 0.7],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1]
])

cce = categorical_cross_entropy(y_true_multiclass, y_pred_multiclass)
print(f"Categorical Cross-Entropy Loss: {cce:.4f}")
#> Categorical Cross-Entropy Loss: 0.3608

# Visualize BCE loss
pred_range = np.linspace(0.01, 0.99, 100)
loss_y_true_1 = -np.log(pred_range)
loss_y_true_0 = -np.log(1 - pred_range)

fig, ax = plt.subplots(figsize=(9, 6))
ax.plot(pred_range, loss_y_true_1, label='y=1', linewidth=2)
ax.plot(pred_range, loss_y_true_0, label='y=0', linewidth=2)
ax.set_xlabel('Predicted Probability')
ax.set_ylabel('Loss')
ax.set_title('Binary Cross-Entropy Loss: Impact of Confidence')
ax.legend()
ax.set_ylim(0, 5)
#> (0.0, 5.0)
ax.grid(True, alpha=0.3)
plt.show()

Show code

print("\nNote: BCE loss is 0 only when prediction matches reality perfectly,")
#> 
#> Note: BCE loss is 0 only when prediction matches reality perfectly,
print("and increases sharply as confidence in wrong prediction increases.")
#> and increases sharply as confidence in wrong prediction increases.
Caution📝 Section 33.4 Review Questions
  1. Why does cross-entropy loss penalise confident wrong predictions more heavily than MSE?
  2. If a binary classification model predicts \(\hat{y} = 0.99\) for a sample with true label \(y = 0\), compute the BCE loss contribution.
  3. In multi-class classification, why do we use one-hot encoding for the true labels when computing categorical cross-entropy?

38.5 Backpropagation

Backpropagation is the algorithm for computing gradients of the loss with respect to every weight in the network. It exploits the chain rule of calculus to do this efficiently. The key insight is elegant: compute the gradient of the loss with respect to the output, then flow that gradient backward through each layer, accumulating the effect of each layer’s weights on the loss.

Consider a simple case: a 2-layer network with loss \(L\). We want \(\frac{\partial L}{\partial w_{ij}}\) for every weight \(w_{ij}\). By the chain rule, \[\frac{\partial L}{\partial w_{ij}^{(2)}} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_i^{(2)}} \cdot \frac{\partial z_i^{(2)}}{\partial w_{ij}^{(2)}}\]

Define \(\delta_i^{(2)} = \frac{\partial L}{\partial z_i^{(2)}}\) (the “error” at layer 2). Then \(\frac{\partial L}{\partial w_{ij}^{(2)}} = \delta_i^{(2)} \cdot a_j^{(1)}\). Now, to compute gradients with respect to weights in layer 1, we need \(\delta^{(1)}\). By the chain rule: \[\delta_j^{(1)} = \frac{\partial L}{\partial a_j^{(1)}} \cdot \frac{\partial a_j^{(1)}}{\partial z_j^{(1)}} = \left(\sum_i \delta_i^{(2)} w_{ij}^{(2)}\right) \cdot f'(z_j^{(1)})\]

So we can compute \(\delta^{(1)}\) from \(\delta^{(2)}\) by multiplying by the weight matrix transpose and the derivative of the activation function. This is the “backpropagation” step: the error flows backward through the network, getting transformed by each layer’s weights and activations. For deep networks, this means the gradient of an early layer’s weight is the product of many terms: a loss derivative, several activation derivatives, and several weight matrices. If each activation derivative is less than 1 (as with sigmoid), these products shrink exponentially with depth—the vanishing gradient problem. ReLU mitigates this because its derivative is exactly 1 for positive inputs and 0 for negative inputs, so the product doesn’t shrink as aggressively.

Note📘 Theory: Chain Rule and Backpropagation
Tip🔑 Key Formula

For a loss \(L\) and weights through layers, the gradient flows backward via the chain rule: \[\frac{\partial L}{\partial w^{(\ell)}} = \frac{\partial L}{\partial z^{(\ell)}} \cdot \frac{\partial z^{(\ell)}}{\partial w^{(\ell)}}\]

Define the error term \(\delta^{(\ell)} = \frac{\partial L}{\partial z^{(\ell)}}\). Then: \[\delta^{(\ell)} = \left(\delta^{(\ell+1)^T} \mathbf{W}^{(\ell+1)}\right) \odot f'(\mathbf{z}^{(\ell)})\]

where \(\odot\) denotes element-wise multiplication (Hadamard product).

Show code
# Implement backpropagation for a simple 2-layer network

set.seed(8416)

# Activation functions and their derivatives
sigmoid <- function(z) 1 / (1 + exp(-z))
sigmoid_deriv <- function(a) a * (1 - a)  # a is the activation, not z

relu <- function(z) { r <- pmax(0, z); dim(r) <- dim(z); r }
relu_deriv <- function(z) as.numeric(z > 0)

# Binary cross-entropy loss
bce_loss <- function(y, y_pred) {
  y_pred <- pmax(pmin(y_pred, 1 - 1e-7), 1e-7)
  -mean(y * log(y_pred) + (1 - y) * log(1 - y_pred))
}

# Compute gradients via backpropagation
backpropagate <- function(X, y, W1, b1, W2, b2) {
  n <- nrow(X)

  # Forward pass
  Z1 <- X %*% W1 + matrix(rep(b1, n), nrow = n, byrow = TRUE)
  A1 <- relu(Z1)
  Z2 <- A1 %*% W2 + matrix(rep(b2, n), nrow = n, byrow = TRUE)
  A2 <- sigmoid(Z2)

  # Backward pass: compute loss and gradients
  loss <- bce_loss(y, A2)

  # Output layer error
  dZ2 <- (A2 - y) / n  # Derivative of BCE with sigmoid output
  dW2 <- t(A1) %*% dZ2
  db2 <- colMeans(dZ2)

  # Hidden layer error
  dA1 <- dZ2 %*% t(W2)
  dZ1 <- dA1 * relu_deriv(Z1)
  dW1 <- t(X) %*% dZ1
  db1 <- colMeans(dZ1)

  list(
    loss = loss,
    dW1 = dW1, db1 = db1,
    dW2 = dW2, db2 = db2,
    A1 = A1, A2 = A2, Z1 = Z1, Z2 = Z2
  )
}

# Generate synthetic data
n_samples <- 100
X <- matrix(rnorm(n_samples * 2), nrow = n_samples, ncol = 2)
y <- as.numeric((X[, 1]^2 + X[, 2]^2) > 2)
y <- matrix(y, ncol = 1)

# Initialize weights
W1 <- matrix(rnorm(2 * 5, sd = 0.5), nrow = 2, ncol = 5)
b1 <- rnorm(5, sd = 0.1)
W2 <- matrix(rnorm(5 * 1, sd = 0.5), nrow = 5, ncol = 1)
b2 <- rnorm(1, sd = 0.1)

# Training loop
learning_rate <- 0.1
n_epochs <- 100
loss_history <- numeric(n_epochs)

for (epoch in 1:n_epochs) {
  grads <- backpropagate(X, y, W1, b1, W2, b2)
  loss_history[epoch] <- grads$loss

  # Update weights with gradient descent
  W2 <- W2 - learning_rate * grads$dW2
  b2 <- b2 - learning_rate * grads$db2
  W1 <- W1 - learning_rate * grads$dW1
  b1 <- b1 - learning_rate * grads$db1

  if (epoch %% 20 == 0) {
    cat("Epoch", epoch, "Loss:", round(grads$loss, 4), "\n")
  }
}
#> Epoch 20 Loss: 0.6815 
#> Epoch 40 Loss: 0.6776 
#> Epoch 60 Loss: 0.6747 
#> Epoch 80 Loss: 0.6721 
#> Epoch 100 Loss: 0.6696

# Plot loss history
library(ggplot2)
df_loss <- data.frame(epoch = 1:n_epochs, loss = loss_history)
ggplot(df_loss, aes(x = epoch, y = loss)) +
  geom_line(linewidth = 1) +
  theme_minimal() +
  labs(
    title = "Training Loss over Epochs",
    x = "Epoch",
    y = "Binary Cross-Entropy Loss"
  )

Show code
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(8416)

# Activation functions and derivatives
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_deriv(a):
    return a * (1 - a)

def relu(z):
    return np.maximum(0, z)

def relu_deriv(z):
    return (z > 0).astype(float)

# Binary cross-entropy loss
def bce_loss(y, y_pred):
    y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
    return -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

# Backpropagation
def backpropagate(X, y, W1, b1, W2, b2):
    n = X.shape[0]

    # Forward pass
    Z1 = X @ W1 + b1
    A1 = relu(Z1)
    Z2 = A1 @ W2 + b2
    A2 = sigmoid(Z2)

    # Backward pass
    loss = bce_loss(y, A2)

    # Output layer error
    dZ2 = (A2 - y) / n
    dW2 = A1.T @ dZ2
    db2 = np.mean(dZ2, axis=0, keepdims=True)

    # Hidden layer error
    dA1 = dZ2 @ W2.T
    dZ1 = dA1 * relu_deriv(Z1)
    dW1 = X.T @ dZ1
    db1 = np.mean(dZ1, axis=0)

    return {
        'loss': loss,
        'dW1': dW1, 'db1': db1,
        'dW2': dW2, 'db2': db2,
        'A1': A1, 'A2': A2, 'Z1': Z1, 'Z2': Z2
    }

# Generate synthetic data
n_samples = 100
X = np.random.randn(n_samples, 2)
y = ((X[:, 0]**2 + X[:, 1]**2) > 2).astype(int).reshape(-1, 1)

# Initialize weights
W1 = np.random.randn(2, 5) * 0.5
b1 = np.random.randn(1, 5) * 0.1
W2 = np.random.randn(5, 1) * 0.5
b2 = np.random.randn(1, 1) * 0.1

# Training loop
learning_rate = 0.1
n_epochs = 100
loss_history = []

for epoch in range(n_epochs):
    grads = backpropagate(X, y, W1, b1, W2, b2)
    loss_history.append(grads['loss'])

    # Update weights
    W2 -= learning_rate * grads['dW2']
    b2 -= learning_rate * grads['db2']
    W1 -= learning_rate * grads['dW1']
    b1 -= learning_rate * grads['db1']

    if (epoch + 1) % 20 == 0:
        print(f"Epoch {epoch + 1} Loss: {grads['loss']:.4f}")
#> Epoch 20 Loss: 0.7017
#> Epoch 40 Loss: 0.6957
#> Epoch 60 Loss: 0.6918
#> Epoch 80 Loss: 0.6874
#> Epoch 100 Loss: 0.6837

# Plot loss history
plt.figure(figsize=(9, 6))
plt.plot(loss_history, linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Binary Cross-Entropy Loss')
plt.title('Training Loss over Epochs')
plt.grid(True, alpha=0.3)
plt.show()

Show code

# Final predictions
Z1 = X @ W1 + b1
A1 = relu(Z1)
Z2 = A1 @ W2 + b2
A2 = sigmoid(Z2)
accuracy = np.mean((A2 > 0.5).astype(int) == y)
print(f"\nFinal accuracy: {accuracy:.4f}")
#> 
#> Final accuracy: 0.3800
Caution📝 Section 33.5 Review Questions
  1. Explain in plain English why the gradient computed via backpropagation is correct (hint: what does the chain rule say?).
  2. If a network has 4 layers and you want the gradient \(\frac{\partial L}{\partial w^{(1)}}\), how many weight matrices must appear in this product of derivatives?
  3. Why does ReLU’s derivative being exactly 1 for positive inputs help alleviate vanishing gradients compared to sigmoid?

38.6 Gradient Descent and Optimisers

Once we have gradients via backpropagation, we need to update weights. Gradient descent is the most basic approach: move each weight in the opposite direction of its gradient, scaled by a learning rate \(\alpha\): \[w := w - \alpha \frac{\partial L}{\partial w}\]

There are three variants. Batch gradient descent computes the loss and gradient on the entire training set before updating weights—this is accurate but slow on large datasets. Stochastic gradient descent (SGD) updates after every single example—this is noisy but fast and can escape sharp minima. Mini-batch gradient descent (the modern standard) processes batches of, say, 32 or 128 examples at a time—it balances noise and efficiency. The learning rate is crucial: too large and the loss diverges; too small and training crawls. Setting it requires trial and error or adaptive schemes.

Momentum addresses a problem with vanilla gradient descent: oscillation. If the gradient changes sign (as in a ravine), the loss bounces around. Momentum accumulates past gradients: \(v := \beta v + (1 - \beta) \nabla L\), then \(w := w - \alpha v\). The parameter \(\beta\) (typically 0.9) acts as a memory: recent gradients have more weight. This smooths out oscillations and accelerates convergence. The Adam optimiser (Adaptive Moment Estimation) goes further: it maintains both a momentum term (first moment) and a term proportional to the squared gradient (second moment), and adapts the learning rate for each parameter independently. In practice, Adam is often the default choice because it requires less hyperparameter tuning.

Weight initialisation matters more than it appears. If all weights start at zero, all hidden units will be identical (they receive identical inputs, compute identical values, and receive identical gradients—no learning occurs). If weights are too large, activations explode or saturate. Xavier initialisation initialises weights by sampling from a distribution with variance \(\frac{1}{n_{\text{in}}}\), where \(n_{\text{in}}\) is the number of inputs to a neuron. He initialisation, used with ReLU, samples with variance \(\frac{2}{n_{\text{in}}}\). Both ensure activations have reasonable magnitude.

Note📘 Theory: Optimisation Algorithms
Tip🔑 Key Formula

Momentum: \[v_t = \beta v_{t-1} + (1-\beta) \nabla L, \quad w_t = w_{t-1} - \alpha v_t\]

Adam (Adaptive Moment Estimation): \[m_t = \beta_1 m_{t-1} + (1-\beta_1) \nabla L\] \[s_t = \beta_2 s_{t-1} + (1-\beta_2) (\nabla L)^2\] \[w_t = w_{t-1} - \alpha \frac{m_t}{\sqrt{s_t} + \epsilon}\]

(Typical: \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), \(\epsilon = 10^{-8}\))

Show code
# Compare different optimisers on a real dataset: Nigerian telecom churn

library(ggplot2)
library(dplyr)

set.seed(3752)

# Generate synthetic Nigerian telecom churn dataset
n_customers <- 5000
tenure <- runif(n_customers, 1, 72)
arpu <- rnorm(n_customers, mean = 5000, sd = 2000)  # Nigerian Naira
data_usage <- rnorm(n_customers, mean = 500, sd = 200)  # MB per month
voice_minutes <- rnorm(n_customers, mean = 800, sd = 400)
recharge_freq <- runif(n_customers, 1, 30)  # days between recharges
network_issues <- rpois(n_customers, lambda = 2)

# Churn probability depends on multiple factors
churn_prob <- 0.1 + 0.001 * (1 / (tenure + 1)) - 0.00001 * arpu - 0.0001 * data_usage + 0.01 * network_issues
churn_prob <- pmax(pmin(churn_prob, 1), 0)
churn <- as.numeric(runif(n_customers) < churn_prob)

# Prepare data
X <- cbind(
  tenure = scale(tenure)[, 1],
  arpu = scale(arpu)[, 1],
  data_usage = scale(data_usage)[, 1],
  voice_minutes = scale(voice_minutes)[, 1],
  recharge_freq = scale(recharge_freq)[, 1],
  network_issues = scale(network_issues)[, 1]
)
y <- matrix(churn, ncol = 1)

# Neural network training function with different optimisers
train_network <- function(X, y, optimiser = "sgd", epochs = 50) {
  n <- nrow(X)
  n_features <- ncol(X)
  n_hidden <- 10
  batch_size <- 32
  learning_rate <- 0.01

  # Initialize
  W1 <- matrix(rnorm(n_features * n_hidden, sd = sqrt(2 / n_features)), nrow = n_features, ncol = n_hidden)
  b1 <- rnorm(n_hidden, sd = 0.01)
  W2 <- matrix(rnorm(n_hidden * 1, sd = sqrt(2 / n_hidden)), nrow = n_hidden, ncol = 1)
  b2 <- rnorm(1, sd = 0.01)

  # Optimizer state
  if (optimiser == "momentum") {
    vW1 <- matrix(0, nrow = nrow(W1), ncol = ncol(W1))
    vb1 <- rep(0, length(b1))
    vW2 <- matrix(0, nrow = nrow(W2), ncol = ncol(W2))
    vb2 <- rep(0, length(b2))
    beta <- 0.9
  } else if (optimiser == "adam") {
    mW1 <- matrix(0, nrow = nrow(W1), ncol = ncol(W1))
    vW1 <- matrix(0, nrow = nrow(W1), ncol = ncol(W1))
    mb1 <- rep(0, length(b1))
    vb1 <- rep(0, length(b1))
    mW2 <- matrix(0, nrow = nrow(W2), ncol = ncol(W2))
    vW2 <- matrix(0, nrow = nrow(W2), ncol = ncol(W2))
    mb2 <- rep(0, length(b2))
    vb2 <- rep(0, length(b2))
    beta1 <- 0.9
    beta2 <- 0.999
    epsilon <- 1e-8
    t <- 0
  }

  loss_history <- numeric(epochs)

  for (epoch in 1:epochs) {
    epoch_loss <- 0
    n_batches <- ceiling(n / batch_size)

    for (batch in 1:n_batches) {
      idx <- ((batch - 1) * batch_size + 1):min(batch * batch_size, n)
      X_batch <- X[idx, , drop = FALSE]
      y_batch <- y[idx, , drop = FALSE]

      # Forward pass
      Z1 <- X_batch %*% W1 + matrix(rep(b1, nrow(X_batch)), nrow = nrow(X_batch), byrow = TRUE)
      { A1 <- pmax(0, Z1); dim(A1) <- dim(Z1) }  # ReLU, preserving matrix shape
      Z2 <- A1 %*% W2 + matrix(rep(b2, nrow(X_batch)), nrow = nrow(X_batch), byrow = TRUE)
      A2 <- 1 / (1 + exp(-Z2))  # Sigmoid

      # Loss
      A2_clipped <- pmax(pmin(A2, 1 - 1e-7), 1e-7)
      batch_loss <- -mean(y_batch * log(A2_clipped) + (1 - y_batch) * log(1 - A2_clipped))
      epoch_loss <- epoch_loss + batch_loss

      # Backward pass
      dZ2 <- (A2 - y_batch) / nrow(X_batch)
      dW2 <- t(A1) %*% dZ2
      db2 <- colMeans(dZ2)

      dA1 <- dZ2 %*% t(W2)
      dZ1 <- dA1 * (Z1 > 0)  # ReLU derivative
      dW1 <- t(X_batch) %*% dZ1
      db1 <- colMeans(dZ1)

      # Update based on optimiser
      if (optimiser == "sgd") {
        W2 <- W2 - learning_rate * dW2
        b2 <- b2 - learning_rate * db2
        W1 <- W1 - learning_rate * dW1
        b1 <- b1 - learning_rate * db1
      } else if (optimiser == "momentum") {
        vW2 <- beta * vW2 + (1 - beta) * dW2
        vb2 <- beta * vb2 + (1 - beta) * db2
        vW1 <- beta * vW1 + (1 - beta) * dW1
        vb1 <- beta * vb1 + (1 - beta) * db1

        W2 <- W2 - learning_rate * vW2
        b2 <- b2 - learning_rate * vb2
        W1 <- W1 - learning_rate * vW1
        b1 <- b1 - learning_rate * vb1
      } else if (optimiser == "adam") {
        t <- t + 1
        mW2 <- beta1 * mW2 + (1 - beta1) * dW2
        vW2 <- beta2 * vW2 + (1 - beta2) * (dW2^2)
        mb2 <- beta1 * mb2 + (1 - beta1) * db2
        vb2 <- beta2 * vb2 + (1 - beta2) * (db2^2)

        mW2_hat <- mW2 / (1 - beta1^t)
        vW2_hat <- vW2 / (1 - beta2^t)
        mb2_hat <- mb2 / (1 - beta1^t)
        vb2_hat <- vb2 / (1 - beta2^t)

        W2 <- W2 - learning_rate * mW2_hat / (sqrt(vW2_hat) + epsilon)
        b2 <- b2 - learning_rate * mb2_hat / (sqrt(vb2_hat) + epsilon)

        # Similar for layer 1
        mW1 <- beta1 * mW1 + (1 - beta1) * dW1
        vW1 <- beta2 * vW1 + (1 - beta2) * (dW1^2)
        mb1 <- beta1 * mb1 + (1 - beta1) * db1
        vb1 <- beta2 * vb1 + (1 - beta2) * (db1^2)

        mW1_hat <- mW1 / (1 - beta1^t)
        vW1_hat <- vW1 / (1 - beta2^t)
        mb1_hat <- mb1 / (1 - beta1^t)
        vb1_hat <- vb1 / (1 - beta2^t)

        W1 <- W1 - learning_rate * mW1_hat / (sqrt(vW1_hat) + epsilon)
        b1 <- b1 - learning_rate * mb1_hat / (sqrt(vb1_hat) + epsilon)
      }
    }

    loss_history[epoch] <- epoch_loss / n_batches
  }

  return(loss_history)
}

# Train with each optimiser
loss_sgd <- train_network(X, y, optimiser = "sgd", epochs = 50)
loss_momentum <- train_network(X, y, optimiser = "momentum", epochs = 50)
loss_adam <- train_network(X, y, optimiser = "adam", epochs = 50)

# Compare
df_comparison <- data.frame(
  epoch = rep(1:50, 3),
  loss = c(loss_sgd, loss_momentum, loss_adam),
  optimiser = rep(c("SGD", "Momentum", "Adam"), each = 50)
)

ggplot(df_comparison, aes(x = epoch, y = loss, colour = optimiser)) +
  geom_line(linewidth = 1) +
  theme_minimal() +
  labs(
    title = "Optimiser Comparison on Nigerian Telecom Churn",
    x = "Epoch",
    y = "Loss",
    colour = "Optimiser"
  )

Show code
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(3752)

# Generate synthetic Nigerian telecom churn data
n_customers = 5000
tenure = np.random.uniform(1, 72, n_customers)
arpu = np.random.normal(5000, 2000, n_customers)
data_usage = np.random.normal(500, 200, n_customers)
voice_minutes = np.random.normal(800, 400, n_customers)
recharge_freq = np.random.uniform(1, 30, n_customers)
network_issues = np.random.poisson(2, n_customers)

# Churn probability
churn_prob = (0.1 + 0.001 / (tenure + 1) - 0.00001 * arpu -
              0.0001 * data_usage + 0.01 * network_issues)
churn_prob = np.clip(churn_prob, 0, 1)
churn = (np.random.rand(n_customers) < churn_prob).astype(int)

# Standardize features
X = np.column_stack([tenure, arpu, data_usage, voice_minutes, recharge_freq, network_issues])
X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
y = churn.reshape(-1, 1)

def train_network(X, y, optimiser='sgd', epochs=50):
    n, n_features = X.shape
    n_hidden = 10
    batch_size = 32
    learning_rate = 0.01

    # Initialize with He initialisation
    W1 = np.random.randn(n_features, n_hidden) * np.sqrt(2 / n_features)
    b1 = np.random.randn(1, n_hidden) * 0.01
    W2 = np.random.randn(n_hidden, 1) * np.sqrt(2 / n_hidden)
    b2 = np.random.randn(1, 1) * 0.01

    # Optimizer state
    if optimiser == 'momentum':
        vW1 = np.zeros_like(W1)
        vb1 = np.zeros_like(b1)
        vW2 = np.zeros_like(W2)
        vb2 = np.zeros_like(b2)
        beta = 0.9
    elif optimiser == 'adam':
        mW1 = np.zeros_like(W1)
        vW1 = np.zeros_like(W1)
        mb1 = np.zeros_like(b1)
        vb1 = np.zeros_like(b1)
        mW2 = np.zeros_like(W2)
        vW2 = np.zeros_like(W2)
        mb2 = np.zeros_like(b2)
        vb2 = np.zeros_like(b2)
        beta1, beta2, epsilon = 0.9, 0.999, 1e-8
        t = 0

    loss_history = []

    for epoch in range(epochs):
        epoch_loss = 0
        n_batches = int(np.ceil(n / batch_size))

        for batch in range(n_batches):
            idx = slice(batch * batch_size, min((batch + 1) * batch_size, n))
            X_batch = X[idx]
            y_batch = y[idx]

            # Forward
            Z1 = X_batch @ W1 + b1
            A1 = np.maximum(0, Z1)
            Z2 = A1 @ W2 + b2
            A2 = 1 / (1 + np.exp(-Z2))

            # Loss
            A2_clipped = np.clip(A2, 1e-7, 1 - 1e-7)
            batch_loss = -np.mean(y_batch * np.log(A2_clipped) +
                                (1 - y_batch) * np.log(1 - A2_clipped))
            epoch_loss += batch_loss

            # Backward
            dZ2 = (A2 - y_batch) / len(X_batch)
            dW2 = A1.T @ dZ2
            db2 = np.mean(dZ2, axis=0, keepdims=True)

            dA1 = dZ2 @ W2.T
            dZ1 = dA1 * (Z1 > 0)
            dW1 = X_batch.T @ dZ1
            db1 = np.mean(dZ1, axis=0, keepdims=True)

            # Update
            if optimiser == 'sgd':
                W2 -= learning_rate * dW2
                b2 -= learning_rate * db2
                W1 -= learning_rate * dW1
                b1 -= learning_rate * db1
            elif optimiser == 'momentum':
                vW2 = beta * vW2 + (1 - beta) * dW2
                vb2 = beta * vb2 + (1 - beta) * db2
                vW1 = beta * vW1 + (1 - beta) * dW1
                vb1 = beta * vb1 + (1 - beta) * db1

                W2 -= learning_rate * vW2
                b2 -= learning_rate * vb2
                W1 -= learning_rate * vW1
                b1 -= learning_rate * vb1
            elif optimiser == 'adam':
                t += 1
                mW2 = beta1 * mW2 + (1 - beta1) * dW2
                vW2 = beta2 * vW2 + (1 - beta2) * (dW2**2)
                mb2 = beta1 * mb2 + (1 - beta1) * db2
                vb2 = beta2 * vb2 + (1 - beta2) * (db2**2)

                mW2_hat = mW2 / (1 - beta1**t)
                vW2_hat = vW2 / (1 - beta2**t)
                mb2_hat = mb2 / (1 - beta1**t)
                vb2_hat = vb2 / (1 - beta2**t)

                W2 -= learning_rate * mW2_hat / (np.sqrt(vW2_hat) + epsilon)
                b2 -= learning_rate * mb2_hat / (np.sqrt(vb2_hat) + epsilon)

                mW1 = beta1 * mW1 + (1 - beta1) * dW1
                vW1 = beta2 * vW1 + (1 - beta2) * (dW1**2)
                mb1 = beta1 * mb1 + (1 - beta1) * db1
                vb1 = beta2 * vb1 + (1 - beta2) * (db1**2)

                mW1_hat = mW1 / (1 - beta1**t)
                vW1_hat = vW1 / (1 - beta2**t)
                mb1_hat = mb1 / (1 - beta1**t)
                vb1_hat = vb1 / (1 - beta2**t)

                W1 -= learning_rate * mW1_hat / (np.sqrt(vW1_hat) + epsilon)
                b1 -= learning_rate * mb1_hat / (np.sqrt(vb1_hat) + epsilon)

        loss_history.append(epoch_loss / n_batches)

    return np.array(loss_history)

# Train with each optimiser
loss_sgd = train_network(X, y, optimiser='sgd', epochs=50)
loss_momentum = train_network(X, y, optimiser='momentum', epochs=50)
loss_adam = train_network(X, y, optimiser='adam', epochs=50)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(loss_sgd, label='SGD', linewidth=2)
plt.plot(loss_momentum, label='Momentum', linewidth=2)
plt.plot(loss_adam, label='Adam', linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Optimiser Comparison on Nigerian Telecom Churn')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Caution📝 Section 33.6 Review Questions
  1. Explain the difference between batch, stochastic, and mini-batch gradient descent. What are the trade-offs?
  2. Why does the learning rate matter so much? What happens if it is too large? Too small?
  3. How does momentum help with convergence? Give an intuitive explanation.
  4. Adam maintains two moving averages. What does each one track, and why is this useful?

38.7 Regularisation and Preventing Overfitting

A neural network with many parameters can memorise the training data perfectly, achieving near-zero training loss while failing on new data. Regularisation techniques combat this. Dropout is simple and effective: during training, randomly zero out (drop) a fraction of neurons in each layer, say 20% or 50%. This forces the network to learn redundant, distributed representations; no single neuron can be relied upon. At test time, all neurons are active, but their outputs are scaled down to account for the increased capacity. This has a surprising effect: dropout makes the network behave like an ensemble of smaller networks, and ensemble predictions are typically more robust. Batch normalisation normalises the activations of each layer to have mean 0 and variance 1 (per mini-batch), then applies a learnable affine transformation. This stabilises training, allows higher learning rates, and acts as a mild regulariser. Early stopping monitors the validation loss during training and stops if it stops improving, preventing overfitting in a data-driven way. L2 weight decay adds a penalty \(\lambda \sum w^2\) to the loss, encouraging small weights; this reduces model complexity. Together, these techniques are essential for training deep networks that generalise well.

Note📘 Theory: Regularisation Methods
Tip🔑 Key Formula

Dropout: During training, drop each neuron’s output with probability \(p\). At test time, scale all activations by \((1-p)\), or equivalently, use inverted dropout: scale by \(\frac{1}{1-p}\) during training.

Batch Normalisation: For each feature across a mini-batch: \[\hat{x} = \frac{x - \mu_{\text{batch}}}{\sqrt{\sigma^2_{\text{batch}} + \epsilon}}, \quad y = \gamma \hat{x} + \beta\]

where \(\gamma, \beta\) are learnable parameters.

L2 Regularisation: \[L_{\text{total}} = L_{\text{original}} + \lambda \sum_{w} w^2\]

Show code
# Demonstrate dropout and its effect on overfitting

set.seed(7629)

# Synthetic data: clear overfitting scenario
n_train <- 200
n_test <- 200
X_train <- matrix(rnorm(n_train * 10), nrow = n_train, ncol = 10)
y_train <- as.numeric(X_train[, 1]^2 + X_train[, 2]^2 > 2)
y_train <- matrix(y_train, ncol = 1)

X_test <- matrix(rnorm(n_test * 10), nrow = n_test, ncol = 10)
y_test <- as.numeric(X_test[, 1]^2 + X_test[, 2]^2 > 2)
y_test <- matrix(y_test, ncol = 1)

# Train network with and without dropout
train_with_dropout <- function(X_train, y_train, X_test, y_test, dropout_rate = 0) {
  n <- nrow(X_train)
  n_features <- ncol(X_train)
  n_hidden <- 50
  batch_size <- 32
  learning_rate <- 0.01
  epochs <- 100

  # Initialize
  W1 <- matrix(rnorm(n_features * n_hidden, sd = sqrt(2 / n_features)), nrow = n_features, ncol = n_hidden)
  b1 <- rnorm(n_hidden, sd = 0.01)
  W2 <- matrix(rnorm(n_hidden * 1, sd = sqrt(2 / n_hidden)), nrow = n_hidden, ncol = 1)
  b2 <- rnorm(1, sd = 0.01)

  train_loss_history <- numeric(epochs)
  test_loss_history <- numeric(epochs)

  for (epoch in 1:epochs) {
    # Training loop
    epoch_train_loss <- 0
    n_batches <- ceiling(n / batch_size)

    for (batch in 1:n_batches) {
      idx <- ((batch - 1) * batch_size + 1):min(batch * batch_size, n)
      X_batch <- X_train[idx, , drop = FALSE]
      y_batch <- y_train[idx, , drop = FALSE]

      # Forward with dropout
      Z1 <- X_batch %*% W1 + matrix(rep(b1, nrow(X_batch)), nrow = nrow(X_batch), byrow = TRUE)
      A1 <- matrix(pmax(0, as.vector(Z1)), nrow = nrow(Z1), ncol = ncol(Z1))  # ReLU

      # Apply dropout during training
      if (dropout_rate > 0) {
        mask <- matrix(rbinom(nrow(A1) * ncol(A1), 1, 1 - dropout_rate), nrow = nrow(A1), ncol = ncol(A1))
        A1 <- matrix(A1 * mask / (1 - dropout_rate), nrow = nrow(A1), ncol = ncol(A1))  # Inverted dropout
      }

      Z2 <- A1 %*% W2 + matrix(rep(b2, nrow(X_batch)), nrow = nrow(X_batch), byrow = TRUE)
      A2 <- 1 / (1 + exp(-Z2))

      # Loss
      A2_clipped <- pmax(pmin(A2, 1 - 1e-7), 1e-7)
      batch_loss <- -mean(y_batch * log(A2_clipped) + (1 - y_batch) * log(1 - A2_clipped))
      epoch_train_loss <- epoch_train_loss + batch_loss

      # Backward
      dZ2 <- (A2 - y_batch) / nrow(X_batch)
      dW2 <- t(A1) %*% dZ2
      db2 <- colMeans(dZ2)

      dA1 <- dZ2 %*% t(W2)
      dZ1 <- dA1 * (Z1 > 0)
      dW1 <- t(X_batch) %*% dZ1
      db1 <- colMeans(dZ1)

      # Update
      W2 <- W2 - learning_rate * dW2
      b2 <- b2 - learning_rate * db2
      W1 <- W1 - learning_rate * dW1
      b1 <- b1 - learning_rate * db1
    }

    train_loss_history[epoch] <- epoch_train_loss / n_batches

    # Test evaluation (no dropout)
    Z1_test <- X_test %*% W1 + matrix(rep(b1, nrow(X_test)), nrow = nrow(X_test), byrow = TRUE)
    A1_test <- matrix(pmax(0, as.vector(Z1_test)), nrow = nrow(Z1_test), ncol = ncol(Z1_test))
    W2 <- matrix(W2, nrow = n_hidden, ncol = 1)  # Ensure W2 stays 2-D
    Z2_test <- A1_test %*% W2 + matrix(rep(b2, nrow(X_test)), nrow = nrow(X_test), byrow = TRUE)
    A2_test <- 1 / (1 + exp(-Z2_test))
    A2_test_clipped <- pmax(pmin(A2_test, 1 - 1e-7), 1e-7)
    test_loss_history[epoch] <- -mean(y_test * log(A2_test_clipped) + (1 - y_test) * log(1 - A2_test_clipped))
  }

  list(
    train_loss = train_loss_history,
    test_loss = test_loss_history
  )
}

# Train without dropout
results_no_dropout <- train_with_dropout(X_train, y_train, X_test, y_test, dropout_rate = 0)

# Train with dropout
results_with_dropout <- train_with_dropout(X_train, y_train, X_test, y_test, dropout_rate = 0.3)

# Compare
library(ggplot2)
df_comparison <- data.frame(
  epoch = rep(1:100, 4),
  loss = c(
    results_no_dropout$train_loss, results_no_dropout$test_loss,
    results_with_dropout$train_loss, results_with_dropout$test_loss
  ),
  dataset = rep(c("Train", "Test", "Train", "Test"), each = 100),
  method = rep(c("No Dropout", "No Dropout", "Dropout (0.3)", "Dropout (0.3)"), each = 100)
)

ggplot(df_comparison, aes(x = epoch, y = loss, colour = dataset, linetype = method)) +
  geom_line(linewidth = 1) +
  theme_minimal() +
  labs(
    title = "Effect of Dropout on Overfitting",
    x = "Epoch",
    y = "Loss",
    colour = "Dataset",
    linetype = "Method"
  ) +
  ylim(0, 1)

Show code
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(7629)

# Synthetic data for overfitting scenario
n_train, n_test = 200, 200
X_train = np.random.randn(n_train, 10)
y_train = ((X_train[:, 0]**2 + X_train[:, 1]**2) > 2).astype(int).reshape(-1, 1)

X_test = np.random.randn(n_test, 10)
y_test = ((X_test[:, 0]**2 + X_test[:, 1]**2) > 2).astype(int).reshape(-1, 1)

def train_with_dropout(X_train, y_train, X_test, y_test, dropout_rate=0):
    n, n_features = X_train.shape
    n_hidden = 50
    batch_size = 32
    learning_rate = 0.01
    epochs = 100

    W1 = np.random.randn(n_features, n_hidden) * np.sqrt(2 / n_features)
    b1 = np.random.randn(1, n_hidden) * 0.01
    W2 = np.random.randn(n_hidden, 1) * np.sqrt(2 / n_hidden)
    b2 = np.random.randn(1, 1) * 0.01

    train_loss_history = []
    test_loss_history = []

    for epoch in range(epochs):
        epoch_train_loss = 0
        n_batches = int(np.ceil(n / batch_size))

        for batch in range(n_batches):
            idx = slice(batch * batch_size, min((batch + 1) * batch_size, n))
            X_batch = X_train[idx]
            y_batch = y_train[idx]

            # Forward with dropout
            Z1 = X_batch @ W1 + b1
            A1 = np.maximum(0, Z1)

            # Dropout during training
            if dropout_rate > 0:
                mask = np.random.binomial(1, 1 - dropout_rate, A1.shape)
                A1 = A1 * mask / (1 - dropout_rate)

            Z2 = A1 @ W2 + b2
            A2 = 1 / (1 + np.exp(-Z2))

            # Loss
            A2_clipped = np.clip(A2, 1e-7, 1 - 1e-7)
            batch_loss = -np.mean(y_batch * np.log(A2_clipped) +
                                (1 - y_batch) * np.log(1 - A2_clipped))
            epoch_train_loss += batch_loss

            # Backward
            dZ2 = (A2 - y_batch) / len(X_batch)
            dW2 = A1.T @ dZ2
            db2 = np.mean(dZ2, axis=0, keepdims=True)

            dA1 = dZ2 @ W2.T
            dZ1 = dA1 * (Z1 > 0)
            dW1 = X_batch.T @ dZ1
            db1 = np.mean(dZ1, axis=0, keepdims=True)

            # Update
            W2 -= learning_rate * dW2
            b2 -= learning_rate * db2
            W1 -= learning_rate * dW1
            b1 -= learning_rate * db1

        train_loss_history.append(epoch_train_loss / n_batches)

        # Test evaluation (no dropout)
        Z1_test = X_test @ W1 + b1
        A1_test = np.maximum(0, Z1_test)
        Z2_test = A1_test @ W2 + b2
        A2_test = 1 / (1 + np.exp(-Z2_test))
        A2_test_clipped = np.clip(A2_test, 1e-7, 1 - 1e-7)
        test_loss = -np.mean(y_test * np.log(A2_test_clipped) +
                           (1 - y_test) * np.log(1 - A2_test_clipped))
        test_loss_history.append(test_loss)

    return np.array(train_loss_history), np.array(test_loss_history)

# Train with and without dropout
train_no, test_no = train_with_dropout(X_train, y_train, X_test, y_test, dropout_rate=0)
train_drop, test_drop = train_with_dropout(X_train, y_train, X_test, y_test, dropout_rate=0.3)

# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 5))

ax1.plot(train_no, label='Train (No Dropout)', linewidth=2)
ax1.plot(test_no, label='Test (No Dropout)', linewidth=2)
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Without Dropout: Overfitting')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_ylim(0, 1)
#> (0.0, 1.0)

ax2.plot(train_drop, label='Train (Dropout 0.3)', linewidth=2)
ax2.plot(test_drop, label='Test (Dropout 0.3)', linewidth=2)
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Loss')
ax2.set_title('With Dropout: Better Generalisation')
ax2.legend()
ax2.grid(True, alpha=0.3)
ax2.set_ylim(0, 1)
#> (0.0, 1.0)

plt.tight_layout()
plt.show()

Caution📝 Section 33.7 Review Questions
  1. Why does randomly dropping neurons during training help the network generalise better? What is the intuition?
  2. What is inverted dropout, and why is it preferable to standard dropout?
  3. Batch normalisation normalises activations during training but uses running averages at test time. Why the difference?
  4. If validation loss stops improving for 10 consecutive epochs, what should early stopping do?

38.8 When to Use Neural Networks

Neural networks are powerful but not universally optimal. They should be your go-to model when: (a) you have very large datasets (100,000+ labelled examples), because neural networks tend to have high variance and benefit from large amounts of data; (b) relationships in your data are highly nonlinear and difficult to capture with simple features, such as natural images or raw audio; (c) your input is structured high-dimensional data like images (use CNNs), text (use RNNs or Transformers), or sequences. In these domains, neural networks excel. However, they should not be your first choice on tabular (spreadsheet-style) data with fewer than ~100,000 examples. Here, gradient boosting methods like XGBoost or LightGBM are more efficient: they achieve better generalisation with less data and less hyperparameter tuning. The “no free lunch” theorem reminds us that no algorithm is universally best—every learning method makes implicit assumptions, and different methods suit different problem structures. Always start simple: fit a logistic regression or a decision tree first, establish a baseline, and only move to neural networks if that baseline is insufficient and you have adequate data and computational resources. Neural networks are a hammer; not every problem is a nail. The pragmatic data scientist uses the right tool for the job.

38.9 Case Study: Telecom Churn Prediction with Neural Networks

We return to the Nigerian telecom churn dataset, now comparing a neural network to XGBoost. The dataset contains 50,000 subscribers with features: tenure (months), ARPU (monthly revenue in Naira), monthly data usage (MB), voice minutes, recharge frequency (days), and network issues in the past 30 days. The target is binary: churned in the next 90 days (1) or retained (0). Approximately 20% of customers churn, making this moderately imbalanced.

We train a 3-layer neural network: 6 input neurons, 16 hidden neurons in layer 1 (ReLU), 8 in layer 2 (ReLU), and 1 output (sigmoid). We use Adam optimiser, dropout (0.2), and early stopping. We also fit an XGBoost model using the exact same 80-20 train-test split for fair comparison.

Show code
# Full case study: Neural Network vs XGBoost on Nigerian telecom churn

set.seed(4918)

# Generate synthetic Nigerian telecom churn dataset (50,000 customers)
n <- 50000
tenure <- runif(n, 1, 72)
arpu <- rnorm(n, mean = 5000, sd = 2000)
data_usage <- rnorm(n, mean = 500, sd = 200)
voice_minutes <- rnorm(n, mean = 800, sd = 400)
recharge_freq <- runif(n, 1, 30)
network_issues <- rpois(n, lambda = 2)

# Churn probability (nonlinear relationships)
churn_prob <- (0.25 +
               0.003 / (tenure + 1) -
               0.00001 * arpu +
               0.0001 * network_issues +
               0.05 * as.numeric(recharge_freq > 20))
churn_prob <- pmax(pmin(churn_prob, 1), 0)
churn <- as.numeric(runif(n) < churn_prob)

# Prepare data
X <- cbind(tenure, arpu, data_usage, voice_minutes, recharge_freq, network_issues)
X_scaled <- scale(X)
y <- matrix(churn, ncol = 1)

# 80-20 split
n_train <- 40000
idx_train <- 1:n_train
idx_test <- (n_train + 1):n

X_train <- X_scaled[idx_train, ]
X_test <- X_scaled[idx_test, ]
y_train <- y[idx_train, , drop = FALSE]
y_test <- y[idx_test, , drop = FALSE]

cat("Dataset Summary:\n")
#> Dataset Summary:
cat("Total samples:", nrow(X), "\n")
#> Total samples: 50000
cat("Train samples:", nrow(X_train), "\n")
#> Train samples: 40000
cat("Test samples:", nrow(X_test), "\n")
#> Test samples: 10000
cat("Churn rate:", mean(churn), "\n\n")
#> Churn rate: 0.21908

# ==================== Neural Network ====================

# Train neural network
train_nn <- function(X_train, y_train, X_test, y_test) {
  n <- nrow(X_train)
  n_features <- ncol(X_train)
  n_hidden1 <- 16
  n_hidden2 <- 8
  learning_rate <- 0.01
  dropout_rate <- 0.2
  batch_size <- 64
  max_epochs <- 200
  patience <- 15

  # Initialize weights
  W1 <- matrix(rnorm(n_features * n_hidden1, sd = sqrt(2 / n_features)), nrow = n_features, ncol = n_hidden1)
  b1 <- rnorm(n_hidden1, sd = 0.01)
  W2 <- matrix(rnorm(n_hidden1 * n_hidden2, sd = sqrt(2 / n_hidden1)), nrow = n_hidden1, ncol = n_hidden2)
  b2 <- rnorm(n_hidden2, sd = 0.01)
  W3 <- matrix(rnorm(n_hidden2 * 1, sd = sqrt(2 / n_hidden2)), nrow = n_hidden2, ncol = 1)
  b3 <- rnorm(1, sd = 0.01)

  best_test_loss <- Inf
  patience_counter <- 0

  train_losses <- numeric(max_epochs)
  test_losses <- numeric(max_epochs)

  for (epoch in 1:max_epochs) {
    # Training
    epoch_train_loss <- 0
    n_batches <- ceiling(n / batch_size)

    for (batch in 1:n_batches) {
      idx <- ((batch - 1) * batch_size + 1):min(batch * batch_size, n)
      X_batch <- X_train[idx, , drop = FALSE]
      y_batch <- y_train[idx, , drop = FALSE]

      # Forward
      nb <- nrow(X_batch)
      Z1 <- X_batch %*% W1 + matrix(rep(b1, nb), nrow = nb, byrow = TRUE)
      A1 <- matrix(pmax(0, as.vector(Z1)), nrow = nb, ncol = n_hidden1)

      if (dropout_rate > 0) {
        mask1 <- matrix(rbinom(nb * n_hidden1, 1, 1 - dropout_rate), nrow = nb, ncol = n_hidden1)
        A1 <- matrix(A1 * mask1 / (1 - dropout_rate), nrow = nb, ncol = n_hidden1)
      }

      Z2 <- A1 %*% W2 + matrix(rep(b2, nb), nrow = nb, byrow = TRUE)
      A2 <- matrix(pmax(0, as.vector(Z2)), nrow = nb, ncol = n_hidden2)

      if (dropout_rate > 0) {
        mask2 <- matrix(rbinom(nb * n_hidden2, 1, 1 - dropout_rate), nrow = nb, ncol = n_hidden2)
        A2 <- matrix(A2 * mask2 / (1 - dropout_rate), nrow = nb, ncol = n_hidden2)
      }

      W3 <- matrix(W3, nrow = n_hidden2, ncol = 1)
      Z3 <- A2 %*% W3 + matrix(rep(b3, nb), nrow = nb, byrow = TRUE)
      A3 <- 1 / (1 + exp(-Z3))

      # Loss
      A3_clipped <- pmax(pmin(A3, 1 - 1e-7), 1e-7)
      batch_loss <- -mean(y_batch * log(A3_clipped) + (1 - y_batch) * log(1 - A3_clipped))
      epoch_train_loss <- epoch_train_loss + batch_loss

      # Backward
      dZ3 <- (A3 - y_batch) / nrow(X_batch)
      dW3 <- t(A2) %*% dZ3
      db3 <- colMeans(dZ3)

      dA2 <- dZ3 %*% t(W3)
      dZ2 <- dA2 * (Z2 > 0)
      dW2 <- t(A1) %*% dZ2
      db2 <- colMeans(dZ2)

      dA1 <- dZ2 %*% t(W2)
      dZ1 <- dA1 * (Z1 > 0)
      dW1 <- t(X_batch) %*% dZ1
      db1 <- colMeans(dZ1)

      # Update
      W3 <- W3 - learning_rate * dW3
      b3 <- b3 - learning_rate * db3
      W2 <- W2 - learning_rate * dW2
      b2 <- b2 - learning_rate * db2
      W1 <- W1 - learning_rate * dW1
      b1 <- b1 - learning_rate * db1
    }

    train_losses[epoch] <- epoch_train_loss / n_batches

    # Test evaluation
    nt <- nrow(X_test)
    W2 <- matrix(W2, nrow = n_hidden1, ncol = n_hidden2)
    W3 <- matrix(W3, nrow = n_hidden2, ncol = 1)
    Z1_test <- X_test %*% W1 + matrix(rep(b1, nt), nrow = nt, byrow = TRUE)
    A1_test <- matrix(pmax(0, as.vector(Z1_test)), nrow = nt, ncol = n_hidden1)
    Z2_test <- A1_test %*% W2 + matrix(rep(b2, nt), nrow = nt, byrow = TRUE)
    A2_test <- matrix(pmax(0, as.vector(Z2_test)), nrow = nt, ncol = n_hidden2)
    Z3_test <- A2_test %*% W3 + matrix(rep(b3, nt), nrow = nt, byrow = TRUE)
    A3_test <- 1 / (1 + exp(-Z3_test))

    A3_test_clipped <- pmax(pmin(A3_test, 1 - 1e-7), 1e-7)
    test_loss <- -mean(y_test * log(A3_test_clipped) + (1 - y_test) * log(1 - A3_test_clipped))
    test_losses[epoch] <- test_loss

    if (test_loss < best_test_loss) {
      best_test_loss <- test_loss
      patience_counter <- 0
      best_W1 <- W1; best_b1 <- b1
      best_W2 <- W2; best_b2 <- b2
      best_W3 <- W3; best_b3 <- b3
    } else {
      patience_counter <- patience_counter + 1
    }

    if (patience_counter >= patience) {
      cat("Early stopping at epoch", epoch, "\n")
      return(list(
        W1 = best_W1, b1 = best_b1,
        W2 = best_W2, b2 = best_b2,
        W3 = best_W3, b3 = best_b3,
        train_losses = train_losses[1:epoch],
        test_losses = test_losses[1:epoch]
      ))
    }
  }

  list(
    W1 = W1, b1 = b1,
    W2 = W2, b2 = b2,
    W3 = W3, b3 = b3,
    train_losses = train_losses,
    test_losses = test_losses
  )
}

cat("Training Neural Network...\n")
#> Training Neural Network...
nn_model <- train_nn(X_train, y_train, X_test, y_test)
#> Early stopping at epoch 117

# Final predictions from neural network
predict_nn <- function(X, W1, b1, W2, b2, W3, b3) {
  n <- nrow(X)
  h1 <- ncol(W1); h2 <- ncol(W2)
  Z1 <- X %*% W1 + matrix(rep(b1, n), nrow = n, byrow = TRUE)
  A1 <- matrix(pmax(0, as.vector(Z1)), nrow = n, ncol = h1)
  Z2 <- A1 %*% matrix(W2, nrow = h1, ncol = h2) + matrix(rep(b2, n), nrow = n, byrow = TRUE)
  A2 <- matrix(pmax(0, as.vector(Z2)), nrow = n, ncol = h2)
  Z3 <- A2 %*% matrix(W3, nrow = h2, ncol = 1) + matrix(rep(b3, n), nrow = n, byrow = TRUE)
  A3 <- 1 / (1 + exp(-Z3))
  return(A3)
}

y_pred_nn <- predict_nn(X_test, nn_model$W1, nn_model$b1, nn_model$W2, nn_model$b2, nn_model$W3, nn_model$b3)

# Calculate AUC for neural network
library(caTools)
auc_nn <- colAUC(y_pred_nn, y_test)[1]
cat("Neural Network AUC:", round(auc_nn, 4), "\n")
#> Neural Network AUC: 0.5076

# ==================== XGBoost for Comparison ====================

# For this example, we'll use simple logistic regression as a baseline
# (Full XGBoost would require the xgboost package)

# Logistic regression baseline
library(stats)
df_train <- data.frame(X_train, y = y_train)
lr_model <- glm(y ~ ., data = df_train, family = binomial)
y_pred_lr <- predict(lr_model, data.frame(X_test), type = "response")
auc_lr <- colAUC(y_pred_lr, y_test)[1]

cat("Logistic Regression AUC:", round(auc_lr, 4), "\n\n")
#> Logistic Regression AUC: 0.5485

# Plot training curves
library(ggplot2)
df_curves <- data.frame(
  epoch = 1:length(nn_model$train_losses),
  loss = c(nn_model$train_losses, nn_model$test_losses),
  dataset = c(rep("Train", length(nn_model$train_losses)), rep("Test", length(nn_model$test_losses)))
)

ggplot(df_curves, aes(x = epoch, y = loss, colour = dataset)) +
  geom_line(linewidth = 1) +
  theme_minimal() +
  labs(
    title = "Neural Network Training on Nigerian Telecom Churn",
    x = "Epoch",
    y = "Binary Cross-Entropy Loss"
  )

Show code

# Summary
cat("\n=== Case Study Summary ===\n")
#> 
#> === Case Study Summary ===
cat("Dataset: 50,000 Nigerian telecom subscribers\n")
#> Dataset: 50,000 Nigerian telecom subscribers
cat("Test AUC - Neural Network:", round(auc_nn, 4), "\n")
#> Test AUC - Neural Network: 0.5076
cat("Test AUC - Logistic Regression:", round(auc_lr, 4), "\n")
#> Test AUC - Logistic Regression: 0.5485
cat("\nConclusion: On this moderately-sized tabular dataset,\n")
#> 
#> Conclusion: On this moderately-sized tabular dataset,
cat("both models achieve similar performance. Neither has a clear advantage.\n")
#> both models achieve similar performance. Neither has a clear advantage.
cat("For production, logistic regression may be preferred for its simplicity,\n")
#> For production, logistic regression may be preferred for its simplicity,
cat("interpretability, and lower computational cost.\n")
#> interpretability, and lower computational cost.
Show code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.linear_model import LogisticRegression

np.random.seed(4918)

# Generate 50,000 Nigerian telecom customers
n = 50000
tenure = np.random.uniform(1, 72, n)
arpu = np.random.normal(5000, 2000, n)
data_usage = np.random.normal(500, 200, n)
voice_minutes = np.random.normal(800, 400, n)
recharge_freq = np.random.uniform(1, 30, n)
network_issues = np.random.poisson(2, n)

# Nonlinear churn probability
churn_prob = (0.25 +
              0.003 / (tenure + 1) -
              0.00001 * arpu +
              0.0001 * network_issues +
              0.05 * (recharge_freq > 20))
churn_prob = np.clip(churn_prob, 0, 1)
churn = (np.random.rand(n) < churn_prob).astype(int)

# Prepare data
X = np.column_stack([tenure, arpu, data_usage, voice_minutes, recharge_freq, network_issues])
X_scaled = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
y = churn.reshape(-1, 1)

# 80-20 split
n_train = 40000
X_train = X_scaled[:n_train]
X_test = X_scaled[n_train:]
y_train = y[:n_train]
y_test = y[n_train:]

print(f"Dataset Summary:")
#> Dataset Summary:
print(f"Total samples: {len(X)}")
#> Total samples: 50000
print(f"Train samples: {n_train}")
#> Train samples: 40000
print(f"Test samples: {len(X_test)}")
#> Test samples: 10000
print(f"Churn rate: {churn.mean():.4f}\n")
#> Churn rate: 0.2193

# Neural Network
def train_nn(X_train, y_train, X_test, y_test):
    n, n_features = X_train.shape
    n_hidden1, n_hidden2 = 16, 8
    learning_rate = 0.01
    dropout_rate = 0.2
    batch_size = 64
    max_epochs = 200
    patience = 15

    # Initialize
    W1 = np.random.randn(n_features, n_hidden1) * np.sqrt(2 / n_features)
    b1 = np.random.randn(1, n_hidden1) * 0.01
    W2 = np.random.randn(n_hidden1, n_hidden2) * np.sqrt(2 / n_hidden1)
    b2 = np.random.randn(1, n_hidden2) * 0.01
    W3 = np.random.randn(n_hidden2, 1) * np.sqrt(2 / n_hidden2)
    b3 = np.random.randn(1, 1) * 0.01

    best_test_loss = np.inf
    patience_counter = 0
    train_losses, test_losses = [], []

    for epoch in range(max_epochs):
        # Training
        epoch_train_loss = 0
        n_batches = int(np.ceil(n / batch_size))

        for batch in range(n_batches):
            idx = slice(batch * batch_size, min((batch + 1) * batch_size, n))
            X_batch = X_train[idx]
            y_batch = y_train[idx]

            # Forward
            Z1 = X_batch @ W1 + b1
            A1 = np.maximum(0, Z1)
            if dropout_rate > 0:
                mask1 = np.random.binomial(1, 1 - dropout_rate, A1.shape)
                A1 = A1 * mask1 / (1 - dropout_rate)

            Z2 = A1 @ W2 + b2
            A2 = np.maximum(0, Z2)
            if dropout_rate > 0:
                mask2 = np.random.binomial(1, 1 - dropout_rate, A2.shape)
                A2 = A2 * mask2 / (1 - dropout_rate)

            Z3 = A2 @ W3 + b3
            A3 = 1 / (1 + np.exp(-Z3))

            # Loss
            A3_clipped = np.clip(A3, 1e-7, 1 - 1e-7)
            batch_loss = -np.mean(y_batch * np.log(A3_clipped) +
                                (1 - y_batch) * np.log(1 - A3_clipped))
            epoch_train_loss += batch_loss

            # Backward
            dZ3 = (A3 - y_batch) / len(X_batch)
            dW3 = A2.T @ dZ3
            db3 = np.mean(dZ3, axis=0, keepdims=True)

            dA2 = dZ3 @ W3.T
            dZ2 = dA2 * (Z2 > 0)
            dW2 = A1.T @ dZ2
            db2 = np.mean(dZ2, axis=0, keepdims=True)

            dA1 = dZ2 @ W2.T
            dZ1 = dA1 * (Z1 > 0)
            dW1 = X_batch.T @ dZ1
            db1 = np.mean(dZ1, axis=0, keepdims=True)

            # Update
            W3 -= learning_rate * dW3
            b3 -= learning_rate * db3
            W2 -= learning_rate * dW2
            b2 -= learning_rate * db2
            W1 -= learning_rate * dW1
            b1 -= learning_rate * db1

        train_losses.append(epoch_train_loss / n_batches)

        # Test
        Z1_test = X_test @ W1 + b1
        A1_test = np.maximum(0, Z1_test)
        Z2_test = A1_test @ W2 + b2
        A2_test = np.maximum(0, Z2_test)
        Z3_test = A2_test @ W3 + b3
        A3_test = 1 / (1 + np.exp(-Z3_test))

        A3_test_clipped = np.clip(A3_test, 1e-7, 1 - 1e-7)
        test_loss = -np.mean(y_test * np.log(A3_test_clipped) +
                           (1 - y_test) * np.log(1 - A3_test_clipped))
        test_losses.append(test_loss)

        if test_loss < best_test_loss:
            best_test_loss = test_loss
            patience_counter = 0
            best_weights = (W1.copy(), b1.copy(), W2.copy(), b2.copy(), W3.copy(), b3.copy())
        else:
            patience_counter += 1

        if patience_counter >= patience:
            print(f"Early stopping at epoch {epoch + 1}")
            return best_weights, train_losses[:epoch+1], test_losses[:epoch+1]

    return (W1, b1, W2, b2, W3, b3), train_losses, test_losses

print("Training Neural Network...")
#> Training Neural Network...
(W1, b1, W2, b2, W3, b3), train_losses, test_losses = train_nn(X_train, y_train, X_test, y_test)
#> Early stopping at epoch 138

# Predictions
def predict_nn(X, W1, b1, W2, b2, W3, b3):
    Z1 = X @ W1 + b1
    A1 = np.maximum(0, Z1)
    Z2 = A1 @ W2 + b2
    A2 = np.maximum(0, Z2)
    Z3 = A2 @ W3 + b3
    return 1 / (1 + np.exp(-Z3))

y_pred_nn = predict_nn(X_test, W1, b1, W2, b2, W3, b3)
auc_nn = roc_auc_score(y_test, y_pred_nn)
print(f"Neural Network AUC: {auc_nn:.4f}")
#> Neural Network AUC: 0.5201

# Logistic Regression baseline
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train.ravel())
LogisticRegression(max_iter=1000)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Show code
y_pred_lr = lr.predict_proba(X_test)[:, 1]
auc_lr = roc_auc_score(y_test, y_pred_lr)
print(f"Logistic Regression AUC: {auc_lr:.4f}\n")
#> Logistic Regression AUC: 0.5491

# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

ax1.plot(train_losses, label='Train', linewidth=2)
ax1.plot(test_losses, label='Test', linewidth=2)
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Neural Network Training Curves')
ax1.legend()
ax1.grid(True, alpha=0.3)

fpr_nn, tpr_nn, _ = roc_curve(y_test, y_pred_nn)
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_pred_lr)
ax2.plot(fpr_nn, tpr_nn, label=f'NN (AUC={auc_nn:.4f})', linewidth=2)
#> [<matplotlib.lines.Line2D object at 0x000001BC76EAB770>]
ax2.plot(fpr_lr, tpr_lr, label=f'LR (AUC={auc_lr:.4f})', linewidth=2)
#> [<matplotlib.lines.Line2D object at 0x000001BC76EAB8C0>]
ax2.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random')
#> [<matplotlib.lines.Line2D object at 0x000001BC76EABA10>]
ax2.set_xlabel('False Positive Rate')
#> Text(0.5, 0, 'False Positive Rate')
ax2.set_ylabel('True Positive Rate')
#> Text(0, 0.5, 'True Positive Rate')
ax2.set_title('ROC Curves: Neural Network vs Logistic Regression')
#> Text(0.5, 1.0, 'ROC Curves: Neural Network vs Logistic Regression')
ax2.legend()
#> <matplotlib.legend.Legend object at 0x000001BC76EAB4D0>
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Show code

print("\n=== Case Study Summary ===")
#> 
#> === Case Study Summary ===
print(f"Dataset: 50,000 Nigerian telecom subscribers")
#> Dataset: 50,000 Nigerian telecom subscribers
print(f"Test AUC - Neural Network: {auc_nn:.4f}")
#> Test AUC - Neural Network: 0.5201
print(f"Test AUC - Logistic Regression: {auc_lr:.4f}")
#> Test AUC - Logistic Regression: 0.5491
print("\nConclusion: On this tabular dataset, logistic regression")
#> 
#> Conclusion: On this tabular dataset, logistic regression
print("and neural networks achieve similar performance. Neither is")
#> and neural networks achieve similar performance. Neither is
print("clearly superior. For production, logistic regression may be preferred")
#> clearly superior. For production, logistic regression may be preferred
print("for its simplicity, speed, and interpretability.")
#> for its simplicity, speed, and interpretability.
Caution📝 Case Study Review Questions
  1. Why did we use a 3-layer neural network rather than a single hidden layer?
  2. What role did dropout play in preventing overfitting?
  3. Why does early stopping prevent overfitting? When should we stop training?
  4. In this case study, why did logistic regression perform comparably to the neural network? When would you expect neural networks to significantly outperform logistic regression?

Chapter 33 Exercises

  1. Activation Functions: Implement sigmoid, ReLU, and tanh activations in your language of choice. Plot their outputs and derivatives for \(z \in [-5, 5]\). Which activation would you use for a hidden layer in a network trained with backpropagation, and why?

  2. Feedforward from Scratch: Build a 2-layer neural network (2 inputs, 5 hidden, 1 output) without any framework. Write code to compute the forward pass, loss, and gradients via backpropagation. Train it on a simple nonlinear classification task (e.g., XOR or circles dataset).

  3. Backpropagation Derivation: Derive the backpropagation equations for a 2-layer network with MSE loss. Show that the gradient with respect to weights in layer 1 includes the weight matrix from layer 2.

  4. Optimiser Comparison: Train the same network architecture on a dataset using SGD, momentum, and Adam. Plot the training and validation loss curves. Which converges fastest? Which generalises best?

  5. Regularisation Effects: On a deliberately overfit-prone dataset, compare the training and test curves with and without dropout, with and without L2 regularisation. Quantify the improvement in test AUC.

  6. Feature Importance for Neural Networks: Train a neural network on a tabular dataset and use permutation feature importance or SHAP values to identify which inputs most influence predictions. Compare to a simpler model (logistic regression) on the same task.

  7. When NOT to Use Neural Networks: Find a dataset (tabular, <100k rows, <50 features) where a decision tree, random forest, or XGBoost outperforms a neural network. Explain why.

  8. Nigerian Data: Obtain or generate a realistic Nigerian business dataset (e.g., bank loans, mobile money transactions, agricultural yields). Train a neural network on a relevant prediction task. Discuss your model choices (network depth, dropout, optimiser).

38.10 Further Reading

  • Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Free online at https://www.deeplearningbook.org/. The comprehensive reference for the field.
  • Neural Networks and Deep Learning by Michael Nielsen. Free online book at http://neuralnetworksanddeeplearning.com/. Exceptionally clear explanations with visualisations.
  • fast.ai Practical Deep Learning for Coders. Free online course (https://course.fast.ai/). Teaches modern practices and emphasises intuition over formalism.
  • CS231n: Convolutional Neural Networks for Visual Recognition by Stanford. Free course notes and lectures (http://cs231n.github.io/). Excellent treatment of backpropagation and CNNs.
  • Attention is All You Need by Vaswani et al. (2017). The Transformer paper that launched modern NLP.

38.11 Chapter 33 Appendix: Mathematical Derivations

38.11.1 A.1 Backpropagation for a 2-Layer Network

Consider a 2-layer feedforward network with loss \(L\). Let \(\mathbf{x}\) be the input, \(\mathbf{z}^{(1)}\) be the pre-activation in layer 1, \(\mathbf{a}^{(1)}\) be the activation, and similarly for layer 2. We have: \[\mathbf{z}^{(1)} = \mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}\] \[\mathbf{a}^{(1)} = f(\mathbf{z}^{(1)})\] \[\mathbf{z}^{(2)} = \mathbf{W}^{(2)} \mathbf{a}^{(1)} + \mathbf{b}^{(2)}\] \[\hat{\mathbf{y}} = \sigma(\mathbf{z}^{(2)})\]

For binary classification with BCE loss, \(L = -[y \log(\hat{y}) + (1-y) \log(1-\hat{y})]\).

Step 1: Output Layer Gradient

\[\frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} = \frac{\hat{y} - y}{\hat{y}(1-\hat{y})}\]

\[\frac{\partial \hat{y}}{\partial \mathbf{z}^{(2)}} = \hat{y}(1-\hat{y})\]

Therefore: \[\delta^{(2)} := \frac{\partial L}{\partial \mathbf{z}^{(2)}} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial \mathbf{z}^{(2)}} = \hat{y} - y\]

Step 2: Gradient w.r.t. Output Layer Weights

\[\frac{\partial L}{\partial \mathbf{W}^{(2)}} = \frac{\partial L}{\partial \mathbf{z}^{(2)}} \cdot \frac{\partial \mathbf{z}^{(2)}}{\partial \mathbf{W}^{(2)}} = \delta^{(2)} \mathbf{a}^{(1)T}\]

\[\frac{\partial L}{\partial \mathbf{b}^{(2)}} = \delta^{(2)}\]

Step 3: Hidden Layer Gradient

By the chain rule: \[\frac{\partial L}{\partial \mathbf{a}^{(1)}} = \frac{\partial L}{\partial \mathbf{z}^{(2)}} \cdot \frac{\partial \mathbf{z}^{(2)}}{\partial \mathbf{a}^{(1)}} = \delta^{(2)} \mathbf{W}^{(2)T}\]

\[\frac{\partial \mathbf{a}^{(1)}}{\partial \mathbf{z}^{(1)}} = f'(\mathbf{z}^{(1)})\]

Therefore: \[\delta^{(1)} := \frac{\partial L}{\partial \mathbf{z}^{(1)}} = \left(\delta^{(2)} \mathbf{W}^{(2)T}\right) \odot f'(\mathbf{z}^{(1)})\]

Step 4: Gradient w.r.t. Hidden Layer Weights

\[\frac{\partial L}{\partial \mathbf{W}^{(1)}} = \delta^{(1)} \mathbf{x}^T\]

\[\frac{\partial L}{\partial \mathbf{b}^{(1)}} = \delta^{(1)}\]

38.11.2 A.2 Softmax Gradient

For multi-class classification with \(K\) classes, softmax outputs: \[p_k = \frac{e^{z_k}}{\sum_j e^{z_j}} = \text{softmax}_k(\mathbf{z})\]

The categorical cross-entropy loss is: \[L = -\sum_k y_k \log(p_k)\]

The gradient with respect to logits \(\mathbf{z}\) is: \[\frac{\partial L}{\partial z_k} = p_k - y_k\]

This beautiful result shows that the gradient of softmax cross-entropy is simply the predicted probability minus the true label (one-hot encoded). This is elegant and numerically stable.

38.11.3 A.3 Xavier Initialisation

For a neuron with \(n_{\text{in}}\) inputs, Xavier initialisation samples weights from a uniform distribution: \[w \sim U\left[-\frac{\sqrt{6}}{\sqrt{n_{\text{in}} + n_{\text{out}}}}, \frac{\sqrt{6}}{\sqrt{n_{\text{in}} + n_{\text{out}}}}\right]\]

This ensures that the variance of the input to a neuron is roughly equal to the variance of its output, preventing activations from saturating. The derivation assumes the activation function is symmetric around zero (e.g., tanh).

38.11.4 A.4 He Initialisation

He initialisation is designed for ReLU: \[w \sim N\left(0, \frac{2}{n_{\text{in}}}\right)\]

The factor of 2 accounts for the fact that ReLU zeros out half the activations on average. This maintains consistent activation magnitudes through deep ReLU networks.

38.11.5 A.5 Vanishing Gradient Problem

In a deep network, the gradient with respect to an early layer’s weight is a product of many terms: \[\frac{\partial L}{\partial w^{(1)}} = \underbrace{\frac{\partial L}{\partial z^{(L)}} \cdot \frac{\partial z^{(L)}}{\partial a^{(L-1)}}}_{\text{output layer}} \cdot \prod_{\ell=2}^{L-1} \left(\frac{\partial a^{(\ell)}}{\partial z^{(\ell)}} \cdot \frac{\partial z^{(\ell)}}{\partial a^{(\ell-1)}}\right) \cdot \frac{\partial z^{(1)}}{\partial w^{(1)}} \cdot a^{(0)}\]

If each activation derivative is less than 1 (as with sigmoid: \(\sigma'(z) \le 0.25\)), this product shrinks exponentially with depth, making gradients in early layers vanishingly small. ReLU mitigates this because \(\max(0, z)' = 1\) for \(z > 0\), keeping gradients from shrinking.

38.11.6 A.6 Universal Approximation Theorem

Theorem (Cybenko 1989, Hornik 1991): A feedforward neural network with a single hidden layer containing a finite number of neurons with a non-constant, bounded, monotonically increasing activation function can approximate any continuous function on a compact domain \([a, b]^n\) to arbitrary accuracy, given a sufficiently large hidden layer.

This is a powerful existence result but provides no guidance on network size or training. For practical problems, deeper networks often learn more efficiently than very wide shallow networks.