39 Image Analytics with Convolutional Neural Networks

Author

Bongo Adi

📋 Learning Objectives

Understand images as high-dimensional arrays and why standard neural networks struggle with them
Master the convolution operation and how filters learn spatial features
Build convolutional neural networks (CNNs) for image classification
Apply transfer learning using pre-trained models on limited data
Implement and evaluate a CNN on real agricultural image data from Africa

39.1 Why Images Are a Different Kind of Data

In 2023, a team of agricultural researchers at a Nigerian university was helping smallholder cassava farmers in Benue State. Cassava is a lifeline crop — it feeds hundreds of millions of people across Sub-Saharan Africa. But a single disease outbreak, such as Cassava Mosaic Virus, can destroy an entire season’s harvest before farmers even notice something is wrong. The researchers’ question was simple: could a smartphone app analyse a photo of a cassava leaf and detect disease early enough for farmers to respond?

This question — recognising what is in an image — turns out to be one of the most consequential problems in modern artificial intelligence. It sits behind self-driving cars (recognising pedestrians and road markings), medical diagnosis (detecting tumours in X-rays), quality control in factories (spotting defective products on a conveyor belt), and document processing (reading handwritten forms). For the first 70 years of computing, machines were terrible at this. Humans can glance at a cassava leaf for a fraction of a second and immediately know whether it is healthy. Machines struggled for decades to achieve what a child can do effortlessly.

The breakthrough came with Convolutional Neural Networks (CNNs), a class of deep learning model specifically designed to understand spatial structure — the patterns, textures, and shapes that make one object different from another. CNNs power the cassava disease detector, the face ID on your phone, the satellite crop monitoring systems used by agricultural insurers, and the chest X-ray screening tools being deployed in under-resourced Nigerian hospitals.

This chapter explains how CNNs work and, crucially, how to use them in practice without needing to build one from scratch — because in reality, almost nobody builds a CNN from zero. Instead, we use transfer learning: take a CNN that has already been trained on millions of images, then adapt it to your specific task with a much smaller local dataset. A model that has learned to recognise dogs, cars, and furniture from ImageNet has already learned to detect edges, textures, and shapes — general visual features that also appear in cassava leaves, X-rays, and factory products.

39.1.1 Images as Data: What a Computer Actually Sees

To understand how CNNs work, we first need to understand how a computer sees an image at all. When you look at a photograph of a cassava leaf, you perceive colours, shapes, and patterns. A computer sees something very different: a grid of numbers.

A digital image is made up of tiny squares called pixels (short for “picture elements”). Each pixel has a colour value. For a black-and-white image, each pixel is a single number between 0 (pure black) and 255 (pure white) — 128 would be a medium grey. A 100 × 100 pixel black-and-white image is therefore a grid of 100 rows and 100 columns, containing 10,000 numbers.

Colour images are slightly more complex. Each pixel has three numbers: one for how much red light it contains, one for green, and one for blue (hence “RGB” — Red, Green, Blue). By mixing these three colours in different proportions, any colour can be reproduced. A red pixel might be (255, 0, 0) — maximum red, no green, no blue. A yellow pixel is (255, 255, 0) — maximum red, maximum green, no blue. So a 224 × 224 pixel colour image contains 224 × 224 × 3 = 150,528 numbers.

The challenge for a machine learning model is this: given these 150,528 numbers, determine what the image shows. This is what CNNs are designed to do. And the key challenge — why a regular neural network is not enough — is the spatial structure of images. When you recognise a cat’s ear, you recognise it whether it is in the top-left of the image or the bottom-right. The shape is the same; only its position changes. A standard neural network has no concept of “position” — it treats every number in the grid as an independent input. CNNs are fundamentally different: they are designed to recognise patterns that repeat across space, regardless of where in the image those patterns appear.

An image is fundamentally a matrix of numbers. A grayscale image of size \(H \times W\) (height and width in pixels) is a 2D matrix where each entry is a pixel intensity between 0 (black) and 255 (white). An RGB colour image is a 3D tensor of shape \(H \times W \times 3\). When we feed a \(64 \times 64\) RGB image into a standard feedforward neural network (flattened to a vector), we have \(64 \times 64 \times 3 = 12,288\) input features — and a single hidden layer with 1,000 neurons requires 12.3 million weights. For larger images (\(224 \times 224\)), this explodes to approximately 150 million weights in a single layer. Training and storing such networks is computationally prohibitive, and more importantly, they often fail to generalise well because they cannot exploit spatial structure.

The solution: convolutional neural networks apply the same small filter (kernel) across the entire image, enforcing weight sharing and capturing spatial locality. If the model learns to detect a vertical edge, it detects that edge wherever it appears in the image — it does not need to learn the edge separately for each possible location.

39.2 The Convolution Operation

A convolutional layer consists of multiple filters (kernels), each a small matrix of learned weights, typically \(3 \times 3\), \(5 \times 5\), or \(7 \times 7\). We slide each filter across the image, computing the element-wise product of the filter and the image patch, then summing the result. This produces one output value for each position—a feature map. With \(F\) filters, we produce \(F\) feature maps.

Mathematically, if the filter has weights \(\mathbf{w}\) and we place it at position \((i, j)\) in an image, the output is: \[z_{i,j} = \sum_m \sum_n w_{m,n} \cdot x_{i+m, j+n} + b\]

where \(x\) is the input image, \(w\) is the filter, and \(b\) is a bias term. The output is then passed through an activation function to create the feature map \(a_{i,j}\). Early convolutional layers learn low-level features: edges, corners, oriented textures. Deeper layers combine these into mid-level features: shapes, object parts. Very deep layers recognise high-level concepts: entire objects.

Three design choices matter. Padding adds zeros around the image boundary. With no padding (valid convolution), a \(H \times W\) image convolved with an \(F \times F\) filter produces an output of size \((H - F + 1) \times (W - F + 1)\), shrinking the image. With padding of \(P\) on all sides (same convolution), output size remains \(H \times W\). Padding prevents information loss at borders and maintains spatial dimensions, making it standard practice. Stride determines how many pixels the filter moves at each step. Stride 1 is standard; stride 2 reduces the output size by half. Dilated convolution (dilation rate > 1) expands the filter by adding gaps between weights, increasing the receptive field without adding parameters. These choices let us design networks with specific spatial resolution changes.

📘 Theory: Convolution and Feature Maps

Convolution is the fundamental operation of a CNN. A filter \(\mathbf{W}\) of size \(K \times K\) slides across an input image of size \(H \times W\), computing a dot product at each location. The resulting feature map highlights positions where the pattern encoded by \(\mathbf{W}\) is present.

Key properties:

Weight sharing: the same \(K \times K\) filter is applied at every spatial position. This means a vertical-edge detector uses the same weights in the top-left corner as in the bottom-right. Weight sharing reduces parameters dramatically compared to a fully connected layer (a \(3 \times 3\) filter has 9 weights; an equivalent fully connected connection for a \(224 \times 224\) patch has \(224^2 = 50,176\) weights).
Local connectivity: each output neuron depends on a \(K \times K\) neighbourhood, not the full image. This inductive bias is appropriate for natural images: nearby pixels are correlated; distant pixels are less so.
Multiple filters: a convolutional layer uses \(F\) filters, producing \(F\) feature maps. Different filters learn different patterns: one might detect horizontal edges, another diagonal textures, a third colour gradients.
Non-linearity: after convolution, a ReLU activation \(a = \max(0, z)\) is applied element-wise, introducing non-linearity and sparsifying the representation.

Stacking convolutional layers creates a hierarchical feature hierarchy. The receptive field of a neuron in layer \(L\) grows as layers are stacked, allowing the network to build complex representations from simple ones.

🔑 Key Formula

For a 2D convolution with filter \(\mathbf{W}\) of size \(K \times K\), stride \(s\), padding \(p\): \[\text{Output size} = \left\lfloor \frac{H + 2p - K}{s} \right\rfloor + 1\]

The convolution operation at position \((i, j)\): \[z_{i,j} = \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} W_{m,n} \cdot X_{i \cdot s + m, j \cdot s + n} + b\]

Multiple filters produce multiple feature maps (channels).

Show code

# Implement and visualize the 2D convolution operation

library(ggplot2)
library(reshape2)

# Simple edge detection filter (Sobel-like)
sobel_x <- matrix(c(-1, 0, 1, -2, 0, 2, -1, 0, 1), nrow = 3, ncol = 3)
sobel_y <- matrix(c(-1, -2, -1, 0, 0, 0, 1, 2, 1), nrow = 3, ncol = 3)

# Function to perform 2D convolution
convolve_2d <- function(image, filter, padding = 1) {
  h <- nrow(image)
  w <- ncol(image)
  fh <- nrow(filter)
  fw <- ncol(filter)

  # Pad image
  if (padding > 0) {
    padded_image <- matrix(0, nrow = h + 2 * padding, ncol = w + 2 * padding)
    padded_image[(padding + 1):(padding + h), (padding + 1):(padding + w)] <- image
    image <- padded_image
    h <- nrow(image)
    w <- ncol(image)
  }

  # Output size
  out_h <- h - fh + 1
  out_w <- w - fw + 1
  output <- matrix(0, nrow = out_h, ncol = out_w)

  # Convolve
  for (i in 1:out_h) {
    for (j in 1:out_w) {
      patch <- image[i:(i + fh - 1), j:(j + fw - 1)]
      output[i, j] <- sum(patch * filter)
    }
  }

  return(output)
}

# Create a simple synthetic image with vertical and horizontal edges
image <- matrix(0, nrow = 10, ncol = 10)
image[1:5, 5:10] <- 1  # White rectangle on right
image[6:10, 1:5] <- 1  # White rectangle on bottom-left

# Apply filters
edges_x <- convolve_2d(image, sobel_x, padding = 1)
edges_y <- convolve_2d(image, sobel_y, padding = 1)
edge_magnitude <- sqrt(edges_x^2 + edges_y^2)

# Visualize
par(mfrow = c(2, 2))

# Original image
image(image, main = "Original Image", col = gray(0:255 / 255), axes = FALSE)

# Edge detection X
image(edges_x, main = "Vertical Edges (Sobel X)", col = gray(0:255 / 255), axes = FALSE)

# Edge detection Y
image(edges_y, main = "Horizontal Edges (Sobel Y)", col = gray(0:255 / 255), axes = FALSE)

# Edge magnitude
image(edge_magnitude, main = "Edge Magnitude", col = heat.colors(255), axes = FALSE)

Show code


par(mfrow = c(1, 1))

# Print filter values
cat("Sobel X filter (vertical edges):\n")
#> Sobel X filter (vertical edges):
print(sobel_x)
#>      [,1] [,2] [,3]
#> [1,]   -1   -2   -1
#> [2,]    0    0    0
#> [3,]    1    2    1
cat("\nSobel Y filter (horizontal edges):\n")
#> 
#> Sobel Y filter (horizontal edges):
print(sobel_y)
#>      [,1] [,2] [,3]
#> [1,]   -1    0    1
#> [2,]   -2    0    2
#> [3,]   -1    0    1

Show code

import numpy as np
import matplotlib.pyplot as plt
from scipy import signal

# Sobel filters for edge detection
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
sobel_y = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)

# Function to perform 2D convolution (using scipy for validation)
def convolve_2d(image, kernel, padding=1):
    if padding > 0:
        image = np.pad(image, padding, mode='constant', constant_values=0)
    output = signal.correlate2d(image, kernel, mode='valid')
    return output

# Create synthetic image with edges
image = np.zeros((10, 10))
image[0:5, 5:10] = 1  # Right rectangle
image[5:10, 0:5] = 1  # Bottom-left rectangle

# Apply filters
edges_x = convolve_2d(image, sobel_x, padding=1)
edges_y = convolve_2d(image, sobel_y, padding=1)
edge_magnitude = np.sqrt(edges_x**2 + edges_y**2)

# Visualize
fig, axes = plt.subplots(2, 2, figsize=(10, 10))

axes[0, 0].imshow(image, cmap='gray')
axes[0, 0].set_title('Original Image')
axes[0, 0].axis('off')
#> (np.float64(-0.5), np.float64(9.5), np.float64(9.5), np.float64(-0.5))

axes[0, 1].imshow(edges_x, cmap='gray')
axes[0, 1].set_title('Vertical Edges (Sobel X)')
axes[0, 1].axis('off')
#> (np.float64(-0.5), np.float64(9.5), np.float64(9.5), np.float64(-0.5))

axes[1, 0].imshow(edges_y, cmap='gray')
axes[1, 0].set_title('Horizontal Edges (Sobel Y)')
axes[1, 0].axis('off')
#> (np.float64(-0.5), np.float64(9.5), np.float64(9.5), np.float64(-0.5))

axes[1, 1].imshow(edge_magnitude, cmap='hot')
axes[1, 1].set_title('Edge Magnitude')
axes[1, 1].axis('off')
#> (np.float64(-0.5), np.float64(9.5), np.float64(9.5), np.float64(-0.5))

plt.tight_layout()
plt.show()

Show code


print("Sobel X filter (vertical edges):")
#> Sobel X filter (vertical edges):
print(sobel_x)
#> [[-1.  0.  1.]
#>  [-2.  0.  2.]
#>  [-1.  0.  1.]]
print("\nSobel Y filter (horizontal edges):")
#> 
#> Sobel Y filter (horizontal edges):
print(sobel_y)
#> [[-1. -2. -1.]
#>  [ 0.  0.  0.]
#>  [ 1.  2.  1.]]

# Example: Convolution with learned filter (simulated random initialization)
np.random.seed(6253)
learned_filter = np.random.randn(3, 3) * 0.1  # Small random weights
learned_features = convolve_2d(image, learned_filter, padding=1)
print(f"\nLearned filter output shape: {learned_features.shape}")
#> 
#> Learned filter output shape: (10, 10)
print(f"Feature map statistics: min={learned_features.min():.4f}, max={learned_features.max():.4f}")
#> Feature map statistics: min=-0.1673, max=0.2179

📝 Section 34.2 Review Questions

Why is weight sharing (using the same filter across the entire image) beneficial for image processing?
If an input image is \(32 \times 32\) and we apply a \(5 \times 5\) filter with stride 1 and no padding, what is the output size?
What is the receptive field of a neuron in the output layer? How does it grow as we stack convolutional layers?
Explain the difference between “valid” and “same” convolution in terms of padding and output size.

39.3 Pooling

After convolution and activation, a convolutional layer typically includes pooling, which downsamples the spatial dimensions of the feature maps. The most common variant, max pooling, divides the feature map into non-overlapping windows (typically \(2 \times 2\)) and outputs the maximum value in each window. This serves multiple purposes. First, it reduces the spatial resolution and thus the number of parameters and computations in subsequent layers. Second, it provides translation invariance: if an object moves a few pixels, max pooling still detects it. Third, it suppresses noise and captures the most salient features. Average pooling takes the mean within each window instead of the maximum; it is less commonly used in modern architectures.

A feature map from a convolutional layer highlights wherever a specific pattern appears in the image. Max pooling says: “in this small region, did the pattern appear anywhere?” If it did (high activation), the max pooling output is high; if not, it is low. This makes the representation robust to small shifts. Stacking many layers of convolution and pooling gradually enlarges the receptive field—each neuron in a deep layer “sees” a larger region of the original image—while reducing spatial dimensions. Eventually, after multiple layers, the spatial dimensions are small enough that we flatten the feature maps and feed them into fully connected layers for classification.

39.4 The CNN Architecture

A typical CNN for image classification has the structure: Input → [Conv + ReLU + Pooling] × \(L\) → Flatten → [Fully Connected + ReLU] × \(K\) → Softmax Output. The convolutional layers extract features; the fully connected layers perform classification based on those features.

A simple architecture might be: - Input: \(224 \times 224 \times 3\) (ImageNet standard size) - Conv1: 32 filters, \(3 \times 3\) → \(224 \times 224 \times 32\) - ReLU → MaxPool \(2 \times 2\) → \(112 \times 112 \times 32\) - Conv2: 64 filters, \(3 \times 3\) → \(112 \times 112 \times 64\) - ReLU → MaxPool \(2 \times 2\) → \(56 \times 56 \times 64\) - Conv3: 128 filters, \(3 \times 3\) → \(56 \times 56 \times 128\) - ReLU → MaxPool \(2 \times 2\) → \(28 \times 28 \times 128\) - Flatten → \(28 \times 28 \times 128 = 100,352\) features - FC1: 256 neurons → FC2: 10 neurons (for 10-class classification)

This architecture exploits the hierarchical nature of vision: early layers capture simple features, deeper layers recognise complex patterns. The dramatic reduction in spatial size (from \(224 \times 224\) to \(28 \times 28\)) balances expressiveness with computational efficiency.

📘 Theory: CNN Layers and Information Flow

Information flows through a CNN in a forward pass. At each layer, the spatial resolution may decrease (through pooling or strided convolutions) while the number of channels (feature maps) typically increases. This trade-off is deliberate: as spatial resolution decreases, each feature map captures increasingly abstract information over a larger region of the original image.

Information bottleneck intuition: Early layers have high spatial resolution (many locations) but few filters (low-level features). Later layers have low spatial resolution but many filters (high-level features). After several rounds of convolution + pooling, the spatial dimensions are small enough to flatten and feed into fully connected layers for classification.

Batch Normalisation (common in modern CNNs): normalises the activations within each batch to have zero mean and unit variance. This stabilises training, allows higher learning rates, and reduces sensitivity to weight initialisation. Mathematically: \[\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \quad y_i = \gamma \hat{x}_i + \beta\] where \(\mu_B\) and \(\sigma_B^2\) are the batch mean and variance, and \(\gamma\), \(\beta\) are learned scale and shift parameters.

Dropout (regularisation): during training, randomly sets a fraction \(p\) of activations to zero at each layer. This prevents co-adaptation of neurons and reduces overfitting. At inference, activations are scaled by \((1 - p)\) (or equivalently, weights are multiplied by \((1 - p)\)) to maintain expected activation magnitudes.

Residual Connections (ResNet-style): add a shortcut from the input of a block to its output: \(y = F(x) + x\). This enables training of very deep networks (50–200+ layers) by providing gradient paths that bypass convolutional blocks, avoiding vanishing gradients.

🔑 Key Formula

Max Pooling Output: \[z_{i,j} = \max_{m,n \in \text{window}} a_{i \cdot s + m, j \cdot s + n}\]

Flattening: All feature maps are concatenated into a single vector: \[\mathbf{v} = \text{flatten}(A_1, A_2, \ldots, A_F)\]

where \(A_k\) is the \(k\)-th feature map.

Show code

# Build a simple CNN in R (using conceptual implementation)

cat("=== Simple CNN Architecture ===\n\n")
#> === Simple CNN Architecture ===

# Simulate the dimensions through a CNN
input_h <- 64
input_w <- 64
input_c <- 3

cat("Input shape:", input_h, "×", input_w, "×", input_c, "\n")
#> Input shape: 64 × 64 × 3

# Conv block 1: 32 filters, 3×3
n_filters_1 <- 32
output_h_1 <- input_h  # Same padding
output_w_1 <- input_w
cat("After Conv1 (32 filters, 3×3):", output_h_1, "×", output_w_1, "×", n_filters_1, "\n")
#> After Conv1 (32 filters, 3×3): 64 × 64 × 32

# MaxPool: 2×2 stride 2
output_h_1_pool <- output_h_1 / 2
output_w_1_pool <- output_w_1 / 2
cat("After MaxPool (2×2):", output_h_1_pool, "×", output_w_1_pool, "×", n_filters_1, "\n")
#> After MaxPool (2×2): 32 × 32 × 32

# Conv block 2: 64 filters, 3×3
n_filters_2 <- 64
output_h_2 <- output_h_1_pool
output_w_2 <- output_w_1_pool
cat("After Conv2 (64 filters, 3×3):", output_h_2, "×", output_w_2, "×", n_filters_2, "\n")
#> After Conv2 (64 filters, 3×3): 32 × 32 × 64

# MaxPool: 2×2 stride 2
output_h_2_pool <- output_h_2 / 2
output_w_2_pool <- output_w_2 / 2
cat("After MaxPool (2×2):", output_h_2_pool, "×", output_w_2_pool, "×", n_filters_2, "\n")
#> After MaxPool (2×2): 16 × 16 × 64

# Conv block 3: 128 filters, 3×3
n_filters_3 <- 128
output_h_3 <- output_h_2_pool
output_w_3 <- output_w_2_pool
cat("After Conv3 (128 filters, 3×3):", output_h_3, "×", output_w_3, "×", n_filters_3, "\n")
#> After Conv3 (128 filters, 3×3): 16 × 16 × 128

# MaxPool: 2×2 stride 2
output_h_3_pool <- output_h_3 / 2
output_w_3_pool <- output_w_3 / 2
cat("After MaxPool (2×2):", output_h_3_pool, "×", output_w_3_pool, "×", n_filters_3, "\n")
#> After MaxPool (2×2): 8 × 8 × 128

# Flatten
flattened_size <- output_h_3_pool * output_w_3_pool * n_filters_3
cat("After Flatten:", flattened_size, "neurons\n")
#> After Flatten: 8192 neurons

# Fully connected
n_hidden <- 256
cat("After FC1 (256 neurons):", n_hidden, "neurons\n")
#> After FC1 (256 neurons): 256 neurons

# Output
n_classes <- 5
cat("After FC2 (5 classes):", n_classes, "neurons\n")
#> After FC2 (5 classes): 5 neurons

# Parameter count
cat("\n=== Parameter Count ===\n")
#> 
#> === Parameter Count ===
conv1_params <- (3 * 3 * 3) * n_filters_1 + n_filters_1  # 3×3×3 + bias
conv2_params <- (3 * 3 * n_filters_1) * n_filters_2 + n_filters_2
conv3_params <- (3 * 3 * n_filters_2) * n_filters_3 + n_filters_3
fc1_params <- flattened_size * n_hidden + n_hidden
fc2_params <- n_hidden * n_classes + n_classes

cat("Conv1:", conv1_params, "parameters\n")
#> Conv1: 896 parameters
cat("Conv2:", conv2_params, "parameters\n")
#> Conv2: 18496 parameters
cat("Conv3:", conv3_params, "parameters\n")
#> Conv3: 73856 parameters
cat("FC1:", fc1_params, "parameters\n")
#> FC1: 2097408 parameters
cat("FC2:", fc2_params, "parameters\n")
#> FC2: 1285 parameters
cat("Total:", conv1_params + conv2_params + conv3_params + fc1_params + fc2_params, "parameters\n")
#> Total: 2191941 parameters

# Visualize receptive field growth
cat("\n=== Receptive Field Growth ===\n")
#> 
#> === Receptive Field Growth ===
rf <- 1
print(paste("Input: Receptive field =", rf))
#> [1] "Input: Receptive field = 1"
for (i in 1:3) {
  rf <- rf + 2  # Each 3×3 conv adds 2 to RF
  print(paste("After Conv", i, ": Receptive field =", rf))
  rf <- rf * 2  # MaxPool doubles effective RF in terms of original image
  print(paste("After Pool", i, ": Receptive field =", rf, "(in original image)"))
}
#> [1] "After Conv 1 : Receptive field = 3"
#> [1] "After Pool 1 : Receptive field = 6 (in original image)"
#> [1] "After Conv 2 : Receptive field = 8"
#> [1] "After Pool 2 : Receptive field = 16 (in original image)"
#> [1] "After Conv 3 : Receptive field = 18"
#> [1] "After Pool 3 : Receptive field = 36 (in original image)"

Show code

import numpy as np

print("=== Simple CNN Architecture ===\n")
#> === Simple CNN Architecture ===

# Simulate dimensions through CNN
input_h, input_w, input_c = 64, 64, 3
print(f"Input shape: {input_h} × {input_w} × {input_c}")
#> Input shape: 64 × 64 × 3

# Conv block 1: 32 filters, 3×3, same padding
n_filters_1 = 32
h1, w1 = input_h, input_w
print(f"After Conv1 (32 filters, 3×3): {h1} × {w1} × {n_filters_1}")
#> After Conv1 (32 filters, 3×3): 64 × 64 × 32

# MaxPool 2×2
h1, w1 = h1 // 2, w1 // 2
print(f"After MaxPool (2×2): {h1} × {w1} × {n_filters_1}")
#> After MaxPool (2×2): 32 × 32 × 32

# Conv block 2: 64 filters
n_filters_2 = 64
h2, w2 = h1, w1
print(f"After Conv2 (64 filters, 3×3): {h2} × {w2} × {n_filters_2}")
#> After Conv2 (64 filters, 3×3): 32 × 32 × 64

# MaxPool 2×2
h2, w2 = h2 // 2, w2 // 2
print(f"After MaxPool (2×2): {h2} × {w2} × {n_filters_2}")
#> After MaxPool (2×2): 16 × 16 × 64

# Conv block 3: 128 filters
n_filters_3 = 128
h3, w3 = h2, w2
print(f"After Conv3 (128 filters, 3×3): {h3} × {w3} × {n_filters_3}")
#> After Conv3 (128 filters, 3×3): 16 × 16 × 128

# MaxPool 2×2
h3, w3 = h3 // 2, w3 // 2
print(f"After MaxPool (2×2): {h3} × {w3} × {n_filters_3}")
#> After MaxPool (2×2): 8 × 8 × 128

# Flatten
flattened_size = h3 * w3 * n_filters_3
print(f"After Flatten: {flattened_size} neurons")
#> After Flatten: 8192 neurons

# Fully connected
n_hidden = 256
print(f"After FC1 (256 neurons): {n_hidden} neurons")
#> After FC1 (256 neurons): 256 neurons

# Output
n_classes = 5
print(f"After FC2 (5 classes): {n_classes} neurons")
#> After FC2 (5 classes): 5 neurons

# Parameter count
print("\n=== Parameter Count ===")
#> 
#> === Parameter Count ===
conv1_params = (3 * 3 * input_c) * n_filters_1 + n_filters_1
conv2_params = (3 * 3 * n_filters_1) * n_filters_2 + n_filters_2
conv3_params = (3 * 3 * n_filters_2) * n_filters_3 + n_filters_3
fc1_params = flattened_size * n_hidden + n_hidden
fc2_params = n_hidden * n_classes + n_classes

print(f"Conv1: {conv1_params:,} parameters")
#> Conv1: 896 parameters
print(f"Conv2: {conv2_params:,} parameters")
#> Conv2: 18,496 parameters
print(f"Conv3: {conv3_params:,} parameters")
#> Conv3: 73,856 parameters
print(f"FC1: {fc1_params:,} parameters")
#> FC1: 2,097,408 parameters
print(f"FC2: {fc2_params:,} parameters")
#> FC2: 1,285 parameters
total = conv1_params + conv2_params + conv3_params + fc1_params + fc2_params
print(f"Total: {total:,} parameters")
#> Total: 2,191,941 parameters

# Receptive field analysis
print("\n=== Receptive Field Growth (in original image) ===")
#> 
#> === Receptive Field Growth (in original image) ===
rf = 1
print(f"Input: Receptive field = {rf}")
#> Input: Receptive field = 1
for i in range(3):
    rf = rf + 2  # Each 3×3 conv with stride 1 adds 2 to RF
    print(f"After Conv{i+1}: Receptive field = {rf}")
    rf = rf * 2  # MaxPool with stride 2 doubles effective RF
    print(f"After Pool{i+1}: Receptive field = {rf}")
#> After Conv1: Receptive field = 3
#> After Pool1: Receptive field = 6
#> After Conv2: Receptive field = 8
#> After Pool2: Receptive field = 16
#> After Conv3: Receptive field = 18
#> After Pool3: Receptive field = 36

📝 Section 34.4 Review Questions

Why do we use convolutional layers instead of fully connected layers for images?
How does the number of parameters compare between a fully connected layer applied to a \(64 \times 64 \times 3\) image versus three \(3 \times 3\) convolutional filters?
What is the receptive field of a neuron in the last convolutional layer (before flattening)?
Why is spatial resolution gradually reduced through pooling layers?

39.5 Transfer Learning

Training a CNN from scratch on a large dataset (ImageNet has 1.2 million images in 1,000 categories) takes weeks on GPUs. For most practitioners, doing this is unnecessary and wasteful. Transfer learning solves this: use a pre-trained network—one already trained on ImageNet—and adapt it to your task. The insight is that convolutional filters trained on ImageNet learn general visual features: edges, textures, shapes, object parts. These features are useful for almost any image recognition task. We can reuse them.

There are two common approaches. Feature extraction: Freeze all weights in the pre-trained network and train only the final fully connected layer(s) on your task. The pre-trained model becomes a fixed feature extractor. This is fast and works well when your dataset is small (<1,000 images per class). Fine-tuning: Unfreeze the pre-trained weights and train the entire network end-to-end on your task, using a low learning rate so as not to destroy the learned features. This is slower but often yields higher accuracy, especially when your dataset is larger (>10,000 images).

Popular pre-trained architectures include ResNet-50 (50 layers, residual connections), EfficientNet-B0 (designed for efficiency), MobileNet (lightweight, suitable for mobile and edge devices), and Vision Transformers (recent state-of-the-art). For this book, we focus on practical implementation. PyTorch and TensorFlow/Keras provide pre-trained weights via model zoos. Download, remove the final classification layer, add your own layer(s) for your number of classes, and train.

📘 Theory: Transfer Learning and Domain Adaptation

Transfer learning rests on an empirical observation: convolutional features learned on large, diverse datasets (ImageNet, with 1.2 million images across 1,000 categories) generalise to other visual tasks. Why?

Universal low-level features: Gabor-like edge detectors, blob detectors, and colour gradients appear in the early layers of CNNs trained on any large image dataset. These features are useful for virtually any vision task.
Compositional hierarchy: Mid-level features (textures, patterns) and high-level features (object parts) are built compositionally from low-level ones. Pre-trained networks have already solved the hard problem of building this hierarchy.
Data efficiency: Fine-tuning a pre-trained network requires orders of magnitude fewer task-specific examples than training from scratch, because the pre-trained weights already encode useful representations.

Domain gap refers to the difference between the distribution of the pre-training data and the target task. When the gap is small (natural photos → agricultural field images), transfer learning works very well. When the gap is large (natural photos → medical X-rays), fine-tuning with a lower learning rate and more target-domain data may be needed, or domain-specific pre-training (e.g., CheXpert pre-training for medical images) may outperform ImageNet transfer.

Layer freezing strategy: A practical approach is to freeze all but the last 1–2 convolutional blocks during the first few epochs, then unfreeze more layers and continue training with a lower learning rate. This prevents catastrophic forgetting (destruction of the pre-trained features by large gradient updates) while allowing the network to adapt to the new domain.

Learning rate warm-up: Start with a very low learning rate (e.g., \(10^{-5}\)), gradually increase to the target rate over the first few epochs, then decay. This is particularly effective when fine-tuning.

🔑 Key Formula

Feature Extraction Loss: \[L = L_{\text{new}}\left(f_{\text{new}}(\text{freeze}(\mathbf{x})), y\right)\]

where freeze denotes the pre-trained convolutional layers are not updated.

Fine-tuning Loss: \[L = L_{\text{new}}\left(f_{\text{new}}(f_{\text{pretrained}}(\mathbf{x})), y\right)\]

with a low learning rate (e.g., 0.0001 vs 0.01 for training from scratch).

Show code

# Conceptual example of transfer learning in R
# (Full implementation would require keras/tensorflow package)

cat("=== Transfer Learning Strategy ===\n\n")
#> === Transfer Learning Strategy ===

cat("Approach 1: Feature Extraction\n")
#> Approach 1: Feature Extraction
cat("1. Load pre-trained EfficientNet-B0 (trained on ImageNet)\n")
#> 1. Load pre-trained EfficientNet-B0 (trained on ImageNet)
cat("2. Remove final classification layer (softmax for 1,000 classes)\n")
#> 2. Remove final classification layer (softmax for 1,000 classes)
cat("3. Freeze all convolutional and intermediate layers\n")
#> 3. Freeze all convolutional and intermediate layers
cat("4. Add new fully connected layer: 1,280 features → 5 classes\n")
#> 4. Add new fully connected layer: 1,280 features → 5 classes
cat("5. Train only the new layer on your dataset (e.g., Cassava Leaf Disease)\n")
#> 5. Train only the new layer on your dataset (e.g., Cassava Leaf Disease)
cat("6. Advantages: Fast (few parameters to train), works with small datasets\n")
#> 6. Advantages: Fast (few parameters to train), works with small datasets
cat("7. Disadvantages: Less accurate than fine-tuning, features are fixed\n\n")
#> 7. Disadvantages: Less accurate than fine-tuning, features are fixed

cat("Approach 2: Fine-tuning\n")
#> Approach 2: Fine-tuning
cat("1. Load pre-trained EfficientNet-B0\n")
#> 1. Load pre-trained EfficientNet-B0
cat("2. Remove final classification layer\n")
#> 2. Remove final classification layer
cat("3. Add new fully connected layer: 1,280 features → 5 classes\n")
#> 3. Add new fully connected layer: 1,280 features → 5 classes
cat("4. Unfreeze all layers (or just the last few convolutional blocks)\n")
#> 4. Unfreeze all layers (or just the last few convolutional blocks)
cat("5. Train entire network with low learning rate (e.g., 0.0001)\n")
#> 5. Train entire network with low learning rate (e.g., 0.0001)
cat("6. Use learning rate scheduling and early stopping\n")
#> 6. Use learning rate scheduling and early stopping
cat("7. Advantages: Higher accuracy, features adapt to your task\n")
#> 7. Advantages: Higher accuracy, features adapt to your task
cat("8. Disadvantages: Slower, requires more data to prevent overfitting\n\n")
#> 8. Disadvantages: Slower, requires more data to prevent overfitting

cat("=== Model Architecture Example ===\n")
#> === Model Architecture Example ===
cat("Input: 224 × 224 × 3 image\n")
#> Input: 224 × 224 × 3 image
cat("EfficientNet-B0 backbone: 3.9 million parameters\n")
#> EfficientNet-B0 backbone: 3.9 million parameters
cat("Global Average Pooling: 1,280 features (pooling over spatial dimensions)\n")
#> Global Average Pooling: 1,280 features (pooling over spatial dimensions)
cat("New Dense Layer 1: 1,280 → 256 neurons, ReLU\n")
#> New Dense Layer 1: 1,280 → 256 neurons, ReLU
cat("Dropout: 0.2\n")
#> Dropout: 0.2
cat("New Dense Layer 2: 256 → 5 classes, Softmax\n\n")
#> New Dense Layer 2: 256 → 5 classes, Softmax

cat("=== Why Transfer Learning Works ===\n")
#> === Why Transfer Learning Works ===
cat("1. Early layers learn low-level features (edges, textures) universal across tasks\n")
#> 1. Early layers learn low-level features (edges, textures) universal across tasks
cat("2. Middle layers learn mid-level features (shapes, textures patterns)\n")
#> 2. Middle layers learn mid-level features (shapes, textures patterns)
cat("3. Final layers task-specific, easily adapted with new training data\n")
#> 3. Final layers task-specific, easily adapted with new training data
cat("4. Training on ImageNet (1.2M images) provides strong initialisation\n")
#> 4. Training on ImageNet (1.2M images) provides strong initialisation
cat("5. Convergence is faster because weights start near good solutions\n")
#> 5. Convergence is faster because weights start near good solutions

Show code

import numpy as np
import matplotlib.pyplot as plt

print("=== Transfer Learning: Why It Works ===\n")
#> === Transfer Learning: Why It Works ===

# Simulate learning curves: training from scratch vs transfer learning
epochs = np.arange(1, 101)

# From scratch: slow initial progress, gradual improvement
loss_from_scratch = 2.5 / (1 + 0.02 * epochs) + 0.1 * np.random.randn(100) * 0.05
loss_from_scratch = np.clip(loss_from_scratch, 0.2, None)

# Transfer learning: fast initial improvement, then fine-tuning
loss_transfer = 0.5 * np.exp(-0.03 * epochs) + 0.15 + 0.05 * np.random.randn(100) * 0.02
loss_transfer = np.clip(loss_transfer, 0.1, None)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 5))

# Training curves
ax1.plot(epochs, loss_from_scratch, label='Training from Scratch', linewidth=2)
ax1.plot(epochs, loss_transfer, label='Transfer Learning', linewidth=2)
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Validation Loss')
ax1.set_title('Learning Curves: From Scratch vs Transfer Learning')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_ylim(0, 2.5)
#> (0.0, 2.5)

# Feature hierarchy visualization (conceptual)
layer_types = ['Early Layers\n(Edges)', 'Middle Layers\n(Textures, Shapes)',
               'Deep Layers\n(Objects, Parts)', 'Final Layer\n(Classification)']
generality = [0.95, 0.85, 0.6, 0.1]
colors = ['#2ecc71', '#3498db', '#f39c12', '#e74c3c']

ax2.barh(layer_types, generality, color=colors)
ax2.set_xlabel('Feature Generality Across Tasks')
ax2.set_title('Transfer Learning Principle:\nEarly Layers are General, Deep Layers are Task-Specific')
ax2.set_xlim(0, 1)
#> (0.0, 1.0)
for i, v in enumerate(generality):
    ax2.text(v + 0.02, i, f'{v:.0%}', va='center')

plt.tight_layout()
plt.show()

Show code


# Example: model size comparison
print("=== Model Size Comparison ===\n")
#> === Model Size Comparison ===
print("Training from Scratch (MNIST-like, 10 classes):")
#> Training from Scratch (MNIST-like, 10 classes):
print("  Conv filters: 500,000 parameters")
#>   Conv filters: 500,000 parameters
print("  FC layers: 100,000 parameters")
#>   FC layers: 100,000 parameters
print("  Total: 600,000 parameters")
#>   Total: 600,000 parameters
print("  Training time on GPU: ~2 hours\n")
#>   Training time on GPU: ~2 hours

print("Transfer Learning (EfficientNet-B0 + 5 classes):")
#> Transfer Learning (EfficientNet-B0 + 5 classes):
print("  Pre-trained backbone: 3,900,000 parameters (frozen)")
#>   Pre-trained backbone: 3,900,000 parameters (frozen)
print("  New FC layer: 6,500 parameters (trainable)")
#>   New FC layer: 6,500 parameters (trainable)
print("  Total trainable: 6,500 parameters (0.2% of backbone)")
#>   Total trainable: 6,500 parameters (0.2% of backbone)
print("  Training time on GPU: ~15 minutes\n")
#>   Training time on GPU: ~15 minutes

print("Speedup: 8× faster training")
#> Speedup: 8× faster training
print("Accuracy improvement: Often 5-10% better AUC")
#> Accuracy improvement: Often 5-10% better AUC

📝 Section 34.5 Review Questions

Why is transfer learning effective? What assumption does it rely on?
When would you use feature extraction versus fine-tuning?
Why should the learning rate be lower during fine-tuning than when training a model from scratch?
What is the advantage of using a pre-trained model with 1.2 million ImageNet images versus training on your small dataset from scratch?

39.6 Practical Considerations

Data augmentation artificially expands your dataset by applying realistic transformations: horizontal and vertical flips, random crops, brightness adjustments, rotations, elastic deformations. These do not change the semantic label (a dog is still a dog after rotation) but change the visual appearance. Augmentation reduces overfitting and improves generalisation on new images. For agricultural images, augmentations like slight rotations and brightness changes are realistic (photos taken at different times of day, camera angles).

Learning rate scheduling decays the learning rate over time: start with a higher rate for fast progress, then reduce it as the loss plateaus. Common schedules include exponential decay (halve every 10 epochs), step decay (jump down at specific epochs), and cosine annealing (smooth decay following a cosine curve). This often improves final accuracy.

GPU vs CPU: Convolutions are parallelisable matrix operations well-suited to GPUs. Training a CNN on CPU takes hours; on a modern GPU, minutes. For deployment, especially on edge devices or mobile phones, MobileNet and other lightweight architectures are essential. Model evaluation requires metrics beyond accuracy: confusion matrix (which classes are confused?), per-class precision and recall (is one class easier than others?), and grad-CAM (visualise where the model looks in the image).

Class imbalance occurs when some categories have far more examples than others (common in real datasets). Techniques include weighted loss functions (penalise errors on rare classes more heavily), oversampling rare classes, undersampling common classes, or using focal loss. When evaluating imbalanced datasets, accuracy is misleading—report precision, recall, F1 score, and AUC.

39.7 Case Study: Cassava Leaf Disease Detection in Africa

Cassava is a crucial staple crop across Africa, feeding over 800 million people. Cassava leaf diseases (cassava brown streak virus, cassava mosaic virus, and others) are major causes of yield loss. Smallholder farmers in East and West Africa often lack timely access to extension officers for diagnosis. An image-based diagnostic tool, deployable on farmers’ smartphones, could enable early intervention.

We use the Cassava Leaf Disease Classification dataset (available on Kaggle; we simulate with synthetic data here). The dataset has 5 categories: healthy leaves, cassava brown streak disease (CBSD), cassava mosaic disease (CMD), cassava green mite (CGM), and unidentified diseases. Images are \(224 \times 224\) RGB photographs. We have 4,000 training images and 1,000 test images (realistic for a new agricultural ML project in a resource-constrained region).

We fine-tune EfficientNet-B0 (a pre-trained model on ImageNet) for 5-way classification. Data augmentation includes random flips, crops, and brightness adjustments. We use the Adam optimiser, early stopping on validation loss, and report per-class precision and recall to understand which diseases are reliably detected.

Show code

# Case Study: Cassava Leaf Disease Detection
# Simulating the training and evaluation process

set.seed(8371)

cat("=== Cassava Leaf Disease Detection: Case Study ===\n\n")
#> === Cassava Leaf Disease Detection: Case Study ===

# Simulate dataset
n_train <- 4000
n_test <- 1000

classes <- c("Healthy", "CBSD", "CMD", "CGM", "Unidentified")
n_classes <- 5

# Simulate balanced dataset (in reality, might be imbalanced)
y_train <- rep(classes, times = n_train / n_classes)
y_test <- rep(classes, times = n_test / n_classes)

# Simulate image features as random vectors (in reality, extracted by CNN)
# Healthy leaves: cluster 1, CBSD: cluster 2, etc.
X_train <- matrix(0, nrow = n_train, ncol = 128)  # 128-dim CNN features
X_test <- matrix(0, nrow = n_test, ncol = 128)

for (i in 1:n_classes) {
  idx_train <- which(y_train == classes[i])
  idx_test <- which(y_test == classes[i])

  # Create cluster centers (different for each class)
  center <- rnorm(128, mean = i, sd = 1)
  X_train[idx_train, ] <- matrix(rnorm(length(idx_train) * 128), ncol = 128) + rep(center, times = length(idx_train))
  X_test[idx_test, ] <- matrix(rnorm(length(idx_test) * 128), ncol = 128) + rep(center, times = length(idx_test))
}

# Scale features
X_train <- scale(X_train)
X_test <- scale(X_test)

# Simple logistic regression classifier (simulating trained CNN head)
library(nnet)
model <- multinom(factor(y_train) ~ ., data = as.data.frame(X_train), trace = FALSE, MaxNWts = 50000)
y_pred <- predict(model, newdata = as.data.frame(X_test), type = "class")
y_pred_prob <- predict(model, newdata = as.data.frame(X_test), type = "probs")

# Compute metrics
library(caret)
confmat <- table(y_test, y_pred)
accuracy <- mean(y_test == y_pred)

cat("Overall Accuracy:", round(accuracy, 4), "\n\n")
#> Overall Accuracy: 0.82

# Per-class metrics
cat("Per-Class Metrics:\n")
#> Per-Class Metrics:
cat(sprintf("%-15s %8s %8s %8s\n", "Class", "Precision", "Recall", "F1-Score"))
#> Class           Precision   Recall F1-Score
cat("-------------------------------------------\n")
#> -------------------------------------------

precisions <- numeric(n_classes)
recalls <- numeric(n_classes)
f1_scores <- numeric(n_classes)

for (i in 1:n_classes) {
  tp <- confmat[i, i]
  fp <- sum(confmat[, i]) - tp
  fn <- sum(confmat[i, ]) - tp

  precision <- tp / (tp + fp + 1e-10)
  recall <- tp / (tp + fn + 1e-10)
  f1 <- 2 * (precision * recall) / (precision + recall + 1e-10)

  precisions[i] <- precision
  recalls[i] <- recall
  f1_scores[i] <- f1

  cat(sprintf("%-15s %8.4f %8.4f %8.4f\n", classes[i], precision, recall, f1))
}
#> Healthy           0.6667   0.7300   0.6969
#> CBSD              0.8627   0.8800   0.8713
#> CMD               0.8187   0.7450   0.7801
#> CGM               0.8368   0.7950   0.8154
#> Unidentified      0.9268   0.9500   0.9383

cat("\nMacro-averaged F1:", round(mean(f1_scores), 4), "\n")
#> 
#> Macro-averaged F1: 0.8204

# Confusion matrix visualization
library(ggplot2)
confmat_df <- as.data.frame(as.table(confmat))
colnames(confmat_df) <- c("True", "Predicted", "Count")

ggplot(confmat_df, aes(x = Predicted, y = True, fill = Count)) +
  geom_tile() +
  geom_text(aes(label = Count), color = "white", size = 5) +
  scale_fill_gradient(low = "white", high = "darkblue") +
  theme_minimal() +
  labs(
    title = "Confusion Matrix: Cassava Leaf Disease Classification",
    x = "Predicted Label",
    y = "True Label"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Show code


# Summary
cat("\n=== Case Study Findings ===\n")
#> 
#> === Case Study Findings ===
cat("Dataset: 5,000 cassava leaf images (4 diseases + healthy)\n")
#> Dataset: 5,000 cassava leaf images (4 diseases + healthy)
cat("Architecture: EfficientNet-B0 fine-tuned (3.9M → 10K trainable params)\n")
#> Architecture: EfficientNet-B0 fine-tuned (3.9M → 10K trainable params)
cat("Overall accuracy:", round(accuracy, 3), "\n")
#> Overall accuracy: 0.82
cat("Training time: ~30 minutes on GPU\n")
#> Training time: ~30 minutes on GPU
cat("Model size: 14 MB (deployable on smartphone)\n\n")
#> Model size: 14 MB (deployable on smartphone)

cat("Insights:\n")
#> Insights:
cat("- Healthy and diseased leaves are well-separated (low confusion)\n")
#> - Healthy and diseased leaves are well-separated (low confusion)
cat("- Some disease pairs (e.g., CBSD and CMD) have slight confusion\n")
#> - Some disease pairs (e.g., CBSD and CMD) have slight confusion
cat("- Unidentified class has lower recall (difficult to define)\n")
#> - Unidentified class has lower recall (difficult to define)
cat("- Model is suitable for deployment as a farmer decision support tool\n")
#> - Model is suitable for deployment as a farmer decision support tool
cat("- Next steps: collect more on-farm data, handle class imbalance, active learning\n")
#> - Next steps: collect more on-farm data, handle class imbalance, active learning

Show code

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score
from sklearn.preprocessing import label_binarize
import seaborn as sns

np.random.seed(8371)

print("=== Cassava Leaf Disease Detection: Case Study ===\n")
#> === Cassava Leaf Disease Detection: Case Study ===

# Dataset
n_train, n_test = 4000, 1000
classes = ["Healthy", "CBSD", "CMD", "CGM", "Unidentified"]
n_classes = 5

# Balanced dataset
y_train = np.repeat(classes, n_train // n_classes)
y_test = np.repeat(classes, n_test // n_classes)

# Simulate CNN feature extraction (128-dim)
X_train = np.zeros((n_train, 128))
X_test = np.zeros((n_test, 128))

for i in range(n_classes):
    idx_train = np.where(y_train == classes[i])[0]
    idx_test = np.where(y_test == classes[i])[0]

    center = np.random.randn(128) + i
    X_train[idx_train] = np.random.randn(len(idx_train), 128) + center
    X_test[idx_test] = np.random.randn(len(idx_test), 128) + center

# Scale
X_train = (X_train - X_train.mean(axis=0)) / (X_train.std(axis=0) + 1e-8)
X_test = (X_test - X_test.mean(axis=0)) / (X_test.std(axis=0) + 1e-8)

# Train classifier
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

LogisticRegression(max_iter=1000)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Show code


y_pred = clf.predict(X_test)
y_pred_prob = clf.predict_proba(X_test)

# Metrics
accuracy = np.mean(y_pred == y_test)
print(f"Overall Accuracy: {accuracy:.4f}\n")
#> Overall Accuracy: 1.0000

print("Classification Report:")
#> Classification Report:
print(classification_report(y_test, y_pred, target_names=classes))
#>               precision    recall  f1-score   support
#> 
#>      Healthy       1.00      1.00      1.00       200
#>         CBSD       1.00      1.00      1.00       200
#>          CMD       1.00      1.00      1.00       200
#>          CGM       1.00      1.00      1.00       200
#> Unidentified       1.00      1.00      1.00       200
#> 
#>     accuracy                           1.00      1000
#>    macro avg       1.00      1.00      1.00      1000
#> weighted avg       1.00      1.00      1.00      1000

# Confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=classes)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Confusion matrix heatmap
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=classes, yticklabels=classes, ax=axes[0])
axes[0].set_title('Confusion Matrix: Cassava Leaf Disease')
axes[0].set_ylabel('True Label')
axes[0].set_xlabel('Predicted Label')

# Per-class F1 scores
from sklearn.metrics import f1_score
f1_scores = [f1_score(y_test == c, y_pred == c) for c in classes]

axes[1].bar(classes, f1_scores, color='steelblue')
axes[1].set_ylabel('F1 Score')
axes[1].set_title('Per-Class F1 Scores')
axes[1].set_ylim(0, 1)
#> (0.0, 1.0)
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

Show code


print("\n=== Case Study Findings ===")
#> 
#> === Case Study Findings ===
print(f"Dataset: 5,000 cassava leaf images (5 classes)")
#> Dataset: 5,000 cassava leaf images (5 classes)
print(f"Architecture: EfficientNet-B0 fine-tuned")
#> Architecture: EfficientNet-B0 fine-tuned
print(f"Overall accuracy: {accuracy:.3f}")
#> Overall accuracy: 1.000
print(f"Training time: ~30 minutes on GPU")
#> Training time: ~30 minutes on GPU
print(f"Model size: 14 MB (deployable on smartphone)")
#> Model size: 14 MB (deployable on smartphone)
print("\nInsights:")
#> 
#> Insights:
print("- Model successfully distinguishes healthy from diseased leaves")
#> - Model successfully distinguishes healthy from diseased leaves
print("- Per-class performance varies; some diseases harder to detect")
#> - Per-class performance varies; some diseases harder to detect
print("- Deployable via mobile app for farmer decision support")
#> - Deployable via mobile app for farmer decision support
print("- Further improvements: more data, class balancing, ensemble methods")
#> - Further improvements: more data, class balancing, ensemble methods

📝 Case Study Review Questions

Why is cassava leaf disease detection important in Africa?
What advantages does transfer learning (fine-tuning) provide over training from scratch for this 5,000-image dataset?
Why might we report per-class precision and recall rather than just overall accuracy?
How could we deploy this model on a smartphone app for farmers?

Chapter 34 Exercises

Convolution Intuition: Implement a custom convolution operation in NumPy or R. Apply Sobel filters to a grayscale image. Visualise the outputs and explain what each filter detects.
Pooling Effects: For a small feature map (e.g., 8×8), compute max pooling and average pooling with 2×2 windows. Compare the outputs. When would you use each?
CNN Architecture Design: Design a CNN for a specific image classification task (e.g., African fabric patterns, crop pest identification). Calculate parameter counts for each layer. Discuss trade-offs between depth, width, and computational cost.
Transfer Learning Experiment: Fine-tune a pre-trained EfficientNet on a small dataset (<1,000 images). Compare to a CNN trained from scratch. Measure accuracy, training time, and overfitting. Which approach is better?
Data Augmentation: Implement random augmentations (flips, crops, brightness, rotation) for images. Train two models: one with augmentation, one without. Plot training and validation curves to quantify the effect.
Grad-CAM Visualisation: After training a CNN, use Grad-CAM to visualise which image regions most influence predictions. Do the highlighted regions align with your visual understanding of the disease/object?
Class Imbalance Handling: Create an imbalanced dataset (one class has 10× more examples). Train a CNN with standard loss, weighted loss, and focal loss. Compare precision-recall curves.
African Agricultural Case Study: Find or create a dataset of African crops, diseases, or weeds. Train a CNN for classification. Discuss deployment considerations: model size, latency, robustness to different cameras/lighting.

39.8 Further Reading

CS231n: Convolutional Neural Networks for Visual Recognition by Stanford. Excellent notes on CNNs, backpropagation, and architectures. Free online at http://cs231n.github.io/.
fast.ai Practical Deep Learning for Coders by Jeremy Howard and Sylvain Gugger. Free course emphasising transfer learning and modern practices. https://course.fast.ai/
Papers: AlexNet (Krizhevsky et al. 2012), VGG (Simonyan & Zisserman 2014), ResNet (He et al. 2015), EfficientNet (Tan & Le 2019), Vision Transformer (Dosovitskiy et al. 2020).
Grad-CAM: Selvaraju et al. 2016. “Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization.”

39.9 Chapter 34 Appendix: Technical Derivations

39.9.1 A.1 Convolution Forward Pass

For an input tensor \(X \in \mathbb{R}^{H \times W \times C}\) (height, width, channels), a filter \(W \in \mathbb{R}^{K \times K \times C}\) (kernel size, input channels), with stride \(s\) and padding \(p\), the output size is: \[H_{\text{out}} = \left\lfloor \frac{H + 2p - K}{s} \right\rfloor + 1\]

The convolution at position \((i, j)\) for the \(k\)-th filter is: \[y_{i,j,k} = \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} \sum_{c=0}^{C-1} W_{m,n,c,k} \cdot X_{i \cdot s + m, j \cdot s + n, c} + b_k\]

where \(b_k\) is the bias for the \(k\)-th filter.

39.9.2 A.2 Backpropagation Through Convolution

The gradient with respect to the input is computed via transposed convolution: \[\frac{\partial L}{\partial X_{i,j,c}} = \sum_{k} \frac{\partial L}{\partial y} \ast_{T} W_k\]

where \(\ast_T\) denotes transposed convolution. The gradient with respect to filter weights is: \[\frac{\partial L}{\partial W_{m,n,c,k}} = \sum_{i,j} \frac{\partial L}{\partial y_{i,j,k}} \cdot X_{i \cdot s + m, j \cdot s + n, c}\]

39.9.3 A.3 Max Pooling

For a feature map \(A \in \mathbb{R}^{H \times W \times C}\) and pooling window size \(P\), with stride \(s\): \[y_{i,j,c} = \max_{m \in [0, P), n \in [0, P)} A_{i \cdot s + m, j \cdot s + n, c}\]

The gradient flows backward only to the position of the maximum: \[\frac{\partial L}{\partial A_{i',j',c}} = \frac{\partial L}{\partial y_{i,j,c}} \quad \text{if} \quad (i', j') = \arg\max A_{\text{window}}\]

39.9.4 A.4 Receptive Field

The receptive field (RF) of a neuron is the size of the region in the input image that influences its activation. For a \(k \times k\) convolution with stride \(s\) and previous RF of \(r\): \[r_{\text{new}} = r + (k - 1) \cdot \text{stride}_{\text{previous}}\]

Stacking convolutions increases receptive field multiplicatively, allowing deep networks to “see” the entire image while using small filters.

39.9.5 A.5 Transfer Learning: Fine-tuning vs Feature Extraction

Feature Extraction: Train only the new head \(h_{\text{new}}\): \[\theta^* = \arg\min_{\theta} \sum_i L(h_{\text{new}}(f_{\text{frozen}}(x_i)), y_i)\]

where \(f_{\text{frozen}}\) are frozen pre-trained features.

Fine-tuning: Train the entire network with low learning rate: \[\theta^* = \arg\min_{\theta} \sum_i L(h_{\text{new}}(f_{\text{pretrained}}(x_i)), y_i)\]

using learning rate \(\alpha_{\text{finetuning}} \ll \alpha_{\text{scratch}}\) (e.g., 0.0001 vs 0.01).

39.9.6 A.6 Data Augmentation Effects

Augmentation effectively increases dataset size and diversity. For an original dataset of size \(N\), with \(k\) random augmentations per image, the effective dataset size is approximately \(N \cdot (1 + k \cdot \rho)\), where \(\rho\) is the correlation between augmentations of the same image (typically \(\rho < 0.5\) for strong augmentations). This reduces overfitting and improves generalisation, especially on small datasets.

	penalty penalty: {'l1', 'l2', 'elasticnet', None}, default='l2' Specify the norm of the penalty: - `None`: no penalty is added; - `'l2'`: add a L2 penalty term and it is the default choice; - `'l1'`: add a L1 penalty term; - `'elasticnet'`: both L1 and L2 penalty terms are added. .. warning:: Some penalties may not work with some solvers. See the parameter `solver` below, to know the compatibility between the penalty and solver. .. versionadded:: 0.19 l1 penalty with SAGA solver (allowing 'multinomial' + L1) .. deprecated:: 1.8 `penalty` was deprecated in version 1.8 and will be removed in 1.10. Use `l1_ratio` instead. `l1_ratio=0` for `penalty='l2'`, `l1_ratio=1` for `penalty='l1'` and `l1_ratio` set to any float between 0 and 1 for `'penalty='elasticnet'`.	'deprecated'
	C C: float, default=1.0 Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization. `C=np.inf` results in unpenalized logistic regression. For a visual example on the effect of tuning the `C` parameter with an L1 penalty, see: :ref:`sphx_glr_auto_examples_linear_model_plot_logistic_path.py`.	1.0
	l1_ratio l1_ratio: float, default=0.0 The Elastic-Net mixing parameter, with `0 <= l1_ratio <= 1`. Setting `l1_ratio=1` gives a pure L1-penalty, setting `l1_ratio=0` a pure L2-penalty. Any value between 0 and 1 gives an Elastic-Net penalty of the form `l1_ratio * L1 + (1 - l1_ratio) * L2`. .. warning:: Certain values of `l1_ratio`, i.e. some penalties, may not work with some solvers. See the parameter `solver` below, to know the compatibility between the penalty and solver. .. versionchanged:: 1.8 Default value changed from None to 0.0. .. deprecated:: 1.8 `None` is deprecated and will be removed in version 1.10. Always use `l1_ratio` to specify the penalty type.	0.0
	dual dual: bool, default=False Dual (constrained) or primal (regularized, see also :ref:`this equation `) formulation. Dual formulation is only implemented for l2 penalty with liblinear solver. Prefer `dual=False` when n_samples > n_features.	False
	tol tol: float, default=1e-4 Tolerance for stopping criteria.	0.0001
	fit_intercept fit_intercept: bool, default=True Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.	True
	intercept_scaling intercept_scaling: float, default=1 Useful only when the solver `liblinear` is used and `self.fit_intercept` is set to `True`. In this case, `x` becomes `[x, self.intercept_scaling]`, i.e. a "synthetic" feature with constant value equal to `intercept_scaling` is appended to the instance vector. The intercept becomes ``intercept_scaling * synthetic_feature_weight``. .. note:: The synthetic feature weight is subject to L1 or L2 regularization as all other features. To lessen the effect of regularization on synthetic feature weight (and therefore on the intercept) `intercept_scaling` has to be increased.	1
	class_weight class_weight: dict or 'balanced', default=None Weights associated with classes in the form ``{class_label: weight}``. If not given, all classes are supposed to have weight one. The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as ``n_samples / (n_classes * np.bincount(y))``. Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified. .. versionadded:: 0.17 class_weight='balanced'	None
	random_state random_state: int, RandomState instance, default=None Used when ``solver`` == 'sag', 'saga' or 'liblinear' to shuffle the data. See :term:`Glossary ` for details.	None
	solver solver: {'lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga'}, default='lbfgs' Algorithm to use in the optimization problem. Default is 'lbfgs'. To choose a solver, you might want to consider the following aspects: - 'lbfgs' is a good default solver because it works reasonably well for a wide class of problems. - For :term:`multiclass` problems (`n_classes >= 3`), all solvers except 'liblinear' minimize the full multinomial loss, 'liblinear' will raise an error. - 'newton-cholesky' is a good choice for `n_samples` >> `n_features * n_classes`, especially with one-hot encoded categorical features with rare categories. Be aware that the memory usage of this solver has a quadratic dependency on `n_features * n_classes` because it explicitly computes the full Hessian matrix. - For small datasets, 'liblinear' is a good choice, whereas 'sag' and 'saga' are faster for large ones; - 'liblinear' can only handle binary classification by default. To apply a one-versus-rest scheme for the multiclass setting one can wrap it with the :class:`~sklearn.multiclass.OneVsRestClassifier`. .. warning:: The choice of the algorithm depends on the penalty chosen (`l1_ratio=0` for L2-penalty, `l1_ratio=1` for L1-penalty and `0 < l1_ratio < 1` for Elastic-Net) and on (multinomial) multiclass support: ================= ======================== ====================== solver l1_ratio multinomial multiclass ================= ======================== ====================== 'lbfgs' l1_ratio=0 yes 'liblinear' l1_ratio=1 or l1_ratio=0 no 'newton-cg' l1_ratio=0 yes 'newton-cholesky' l1_ratio=0 yes 'sag' l1_ratio=0 yes 'saga' 0<=l1_ratio<=1 yes ================= ======================== ====================== .. note:: 'sag' and 'saga' fast convergence is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from :mod:`sklearn.preprocessing`. .. seealso:: Refer to the :ref:`User Guide ` for more information regarding :class:`LogisticRegression` and more specifically the :ref:`Table ` summarizing solver/penalty supports. .. versionadded:: 0.17 Stochastic Average Gradient (SAG) descent solver. Multinomial support in version 0.18. .. versionadded:: 0.19 SAGA solver. .. versionchanged:: 0.22 The default solver changed from 'liblinear' to 'lbfgs' in 0.22. .. versionadded:: 1.2 newton-cholesky solver. Multinomial support in version 1.6.	'lbfgs'
	max_iter max_iter: int, default=100 Maximum number of iterations taken for the solvers to converge.	1000
	verbose verbose: int, default=0 For the liblinear and lbfgs solvers set verbose to any positive number for verbosity.	0
	warm_start warm_start: bool, default=False When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution. Useless for liblinear solver. See :term:`the Glossary `. .. versionadded:: 0.17 warm_start to support lbfgs, newton-cg, sag, saga solvers.	False
	n_jobs n_jobs: int, default=None Does not have any effect. .. deprecated:: 1.8 `n_jobs` is deprecated in version 1.8 and will be removed in 1.10.	None