---
title: "Image Analytics with Convolutional Neural Networks"
author: "Bongo Adi"
---
```{python}
#| label: python-setup-34-image-analytics-cnn
#| include: false
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import signal
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score
from sklearn.preprocessing import label_binarize
from sklearn.metrics import f1_score
```
::: {.callout-note icon="false"}
## 📋 Learning Objectives
- Understand images as high-dimensional arrays and why standard neural networks struggle with them
- Master the convolution operation and how filters learn spatial features
- Build convolutional neural networks (CNNs) for image classification
- Apply transfer learning using pre-trained models on limited data
- Implement and evaluate a CNN on real agricultural image data from Africa
:::
## Why Images Are a Different Kind of Data
In 2023, a team of agricultural researchers at a Nigerian university was helping smallholder cassava farmers in Benue State. Cassava is a lifeline crop — it feeds hundreds of millions of people across Sub-Saharan Africa. But a single disease outbreak, such as Cassava Mosaic Virus, can destroy an entire season's harvest before farmers even notice something is wrong. The researchers' question was simple: could a smartphone app analyse a photo of a cassava leaf and detect disease early enough for farmers to respond?
This question — recognising what is in an image — turns out to be one of the most consequential problems in modern artificial intelligence. It sits behind self-driving cars (recognising pedestrians and road markings), medical diagnosis (detecting tumours in X-rays), quality control in factories (spotting defective products on a conveyor belt), and document processing (reading handwritten forms). For the first 70 years of computing, machines were terrible at this. Humans can glance at a cassava leaf for a fraction of a second and immediately know whether it is healthy. Machines struggled for decades to achieve what a child can do effortlessly.
The breakthrough came with **Convolutional Neural Networks (CNNs)**, a class of deep learning model specifically designed to understand spatial structure — the patterns, textures, and shapes that make one object different from another. CNNs power the cassava disease detector, the face ID on your phone, the satellite crop monitoring systems used by agricultural insurers, and the chest X-ray screening tools being deployed in under-resourced Nigerian hospitals.
This chapter explains how CNNs work and, crucially, how to use them in practice without needing to build one from scratch — because in reality, almost nobody builds a CNN from zero. Instead, we use **transfer learning**: take a CNN that has already been trained on millions of images, then adapt it to your specific task with a much smaller local dataset. A model that has learned to recognise dogs, cars, and furniture from ImageNet has already learned to detect edges, textures, and shapes — general visual features that also appear in cassava leaves, X-rays, and factory products.
### Images as Data: What a Computer Actually Sees
To understand how CNNs work, we first need to understand how a computer sees an image at all. When you look at a photograph of a cassava leaf, you perceive colours, shapes, and patterns. A computer sees something very different: a grid of numbers.
A digital image is made up of tiny squares called **pixels** (short for "picture elements"). Each pixel has a colour value. For a black-and-white image, each pixel is a single number between 0 (pure black) and 255 (pure white) — 128 would be a medium grey. A 100 × 100 pixel black-and-white image is therefore a grid of 100 rows and 100 columns, containing 10,000 numbers.
Colour images are slightly more complex. Each pixel has three numbers: one for how much **red** light it contains, one for **green**, and one for **blue** (hence "RGB" — Red, Green, Blue). By mixing these three colours in different proportions, any colour can be reproduced. A red pixel might be (255, 0, 0) — maximum red, no green, no blue. A yellow pixel is (255, 255, 0) — maximum red, maximum green, no blue. So a 224 × 224 pixel colour image contains 224 × 224 × 3 = 150,528 numbers.
The challenge for a machine learning model is this: given these 150,528 numbers, determine what the image shows. This is what CNNs are designed to do. And the key challenge — why a regular neural network is not enough — is the **spatial structure** of images. When you recognise a cat's ear, you recognise it whether it is in the top-left of the image or the bottom-right. The shape is the same; only its position changes. A standard neural network has no concept of "position" — it treats every number in the grid as an independent input. CNNs are fundamentally different: they are designed to recognise **patterns that repeat across space**, regardless of where in the image those patterns appear.
An image is fundamentally a matrix of numbers. A grayscale image of size $H \times W$ (height and width in pixels) is a 2D matrix where each entry is a pixel intensity between 0 (black) and 255 (white). An RGB colour image is a 3D tensor of shape $H \times W \times 3$. When we feed a $64 \times 64$ RGB image into a standard feedforward neural network (flattened to a vector), we have $64 \times 64 \times 3 = 12,288$ input features — and a single hidden layer with 1,000 neurons requires 12.3 million weights. For larger images ($224 \times 224$), this explodes to approximately 150 million weights in a single layer. Training and storing such networks is computationally prohibitive, and more importantly, they often fail to generalise well because they cannot exploit spatial structure.
The solution: convolutional neural networks apply the same small filter (kernel) across the entire image, enforcing **weight sharing** and capturing spatial locality. If the model learns to detect a vertical edge, it detects that edge wherever it appears in the image — it does not need to learn the edge separately for each possible location.
## The Convolution Operation
A convolutional layer consists of multiple filters (kernels), each a small matrix of learned weights, typically $3 \times 3$, $5 \times 5$, or $7 \times 7$. We slide each filter across the image, computing the element-wise product of the filter and the image patch, then summing the result. This produces one output value for each position—a feature map. With $F$ filters, we produce $F$ feature maps.
Mathematically, if the filter has weights $\mathbf{w}$ and we place it at position $(i, j)$ in an image, the output is:
$$z_{i,j} = \sum_m \sum_n w_{m,n} \cdot x_{i+m, j+n} + b$$
where $x$ is the input image, $w$ is the filter, and $b$ is a bias term. The output is then passed through an activation function to create the feature map $a_{i,j}$. Early convolutional layers learn low-level features: edges, corners, oriented textures. Deeper layers combine these into mid-level features: shapes, object parts. Very deep layers recognise high-level concepts: entire objects.
Three design choices matter. **Padding** adds zeros around the image boundary. With no padding (valid convolution), a $H \times W$ image convolved with an $F \times F$ filter produces an output of size $(H - F + 1) \times (W - F + 1)$, shrinking the image. With padding of $P$ on all sides (same convolution), output size remains $H \times W$. Padding prevents information loss at borders and maintains spatial dimensions, making it standard practice. **Stride** determines how many pixels the filter moves at each step. Stride 1 is standard; stride 2 reduces the output size by half. **Dilated convolution** (dilation rate > 1) expands the filter by adding gaps between weights, increasing the receptive field without adding parameters. These choices let us design networks with specific spatial resolution changes.
::: {.callout-note icon="false"}
## 📘 Theory: Convolution and Feature Maps
Convolution is the fundamental operation of a CNN. A filter $\mathbf{W}$ of size $K \times K$ slides across an input image of size $H \times W$, computing a dot product at each location. The resulting **feature map** highlights positions where the pattern encoded by $\mathbf{W}$ is present.
Key properties:
1. **Weight sharing**: the same $K \times K$ filter is applied at every spatial position. This means a vertical-edge detector uses the same weights in the top-left corner as in the bottom-right. Weight sharing reduces parameters dramatically compared to a fully connected layer (a $3 \times 3$ filter has 9 weights; an equivalent fully connected connection for a $224 \times 224$ patch has $224^2 = 50,176$ weights).
2. **Local connectivity**: each output neuron depends on a $K \times K$ neighbourhood, not the full image. This inductive bias is appropriate for natural images: nearby pixels are correlated; distant pixels are less so.
3. **Multiple filters**: a convolutional layer uses $F$ filters, producing $F$ feature maps. Different filters learn different patterns: one might detect horizontal edges, another diagonal textures, a third colour gradients.
4. **Non-linearity**: after convolution, a ReLU activation $a = \max(0, z)$ is applied element-wise, introducing non-linearity and sparsifying the representation.
Stacking convolutional layers creates a **hierarchical feature hierarchy**. The receptive field of a neuron in layer $L$ grows as layers are stacked, allowing the network to build complex representations from simple ones.
:::
::: {.callout-tip icon="false"}
## 🔑 Key Formula
For a 2D convolution with filter $\mathbf{W}$ of size $K \times K$, stride $s$, padding $p$:
$$\text{Output size} = \left\lfloor \frac{H + 2p - K}{s} \right\rfloor + 1$$
The convolution operation at position $(i, j)$:
$$z_{i,j} = \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} W_{m,n} \cdot X_{i \cdot s + m, j \cdot s + n} + b$$
Multiple filters produce multiple feature maps (channels).
:::
::: {.panel-tabset}
## R
```{r}
#| label: convolution-operation
# Implement and visualize the 2D convolution operation
library(ggplot2)
library(reshape2)
# Simple edge detection filter (Sobel-like)
sobel_x <- matrix(c(-1, 0, 1, -2, 0, 2, -1, 0, 1), nrow = 3, ncol = 3)
sobel_y <- matrix(c(-1, -2, -1, 0, 0, 0, 1, 2, 1), nrow = 3, ncol = 3)
# Function to perform 2D convolution
convolve_2d <- function(image, filter, padding = 1) {
h <- nrow(image)
w <- ncol(image)
fh <- nrow(filter)
fw <- ncol(filter)
# Pad image
if (padding > 0) {
padded_image <- matrix(0, nrow = h + 2 * padding, ncol = w + 2 * padding)
padded_image[(padding + 1):(padding + h), (padding + 1):(padding + w)] <- image
image <- padded_image
h <- nrow(image)
w <- ncol(image)
}
# Output size
out_h <- h - fh + 1
out_w <- w - fw + 1
output <- matrix(0, nrow = out_h, ncol = out_w)
# Convolve
for (i in 1:out_h) {
for (j in 1:out_w) {
patch <- image[i:(i + fh - 1), j:(j + fw - 1)]
output[i, j] <- sum(patch * filter)
}
}
return(output)
}
# Create a simple synthetic image with vertical and horizontal edges
image <- matrix(0, nrow = 10, ncol = 10)
image[1:5, 5:10] <- 1 # White rectangle on right
image[6:10, 1:5] <- 1 # White rectangle on bottom-left
# Apply filters
edges_x <- convolve_2d(image, sobel_x, padding = 1)
edges_y <- convolve_2d(image, sobel_y, padding = 1)
edge_magnitude <- sqrt(edges_x^2 + edges_y^2)
# Visualize
par(mfrow = c(2, 2))
# Original image
image(image, main = "Original Image", col = gray(0:255 / 255), axes = FALSE)
# Edge detection X
image(edges_x, main = "Vertical Edges (Sobel X)", col = gray(0:255 / 255), axes = FALSE)
# Edge detection Y
image(edges_y, main = "Horizontal Edges (Sobel Y)", col = gray(0:255 / 255), axes = FALSE)
# Edge magnitude
image(edge_magnitude, main = "Edge Magnitude", col = heat.colors(255), axes = FALSE)
par(mfrow = c(1, 1))
# Print filter values
cat("Sobel X filter (vertical edges):\n")
print(sobel_x)
cat("\nSobel Y filter (horizontal edges):\n")
print(sobel_y)
```
## Python
```{python}
#| label: py-convolution-operation
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
# Sobel filters for edge detection
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
sobel_y = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)
# Function to perform 2D convolution (using scipy for validation)
def convolve_2d(image, kernel, padding=1):
if padding > 0:
image = np.pad(image, padding, mode='constant', constant_values=0)
output = signal.correlate2d(image, kernel, mode='valid')
return output
# Create synthetic image with edges
image = np.zeros((10, 10))
image[0:5, 5:10] = 1 # Right rectangle
image[5:10, 0:5] = 1 # Bottom-left rectangle
# Apply filters
edges_x = convolve_2d(image, sobel_x, padding=1)
edges_y = convolve_2d(image, sobel_y, padding=1)
edge_magnitude = np.sqrt(edges_x**2 + edges_y**2)
# Visualize
fig, axes = plt.subplots(2, 2, figsize=(10, 10))
axes[0, 0].imshow(image, cmap='gray')
axes[0, 0].set_title('Original Image')
axes[0, 0].axis('off')
axes[0, 1].imshow(edges_x, cmap='gray')
axes[0, 1].set_title('Vertical Edges (Sobel X)')
axes[0, 1].axis('off')
axes[1, 0].imshow(edges_y, cmap='gray')
axes[1, 0].set_title('Horizontal Edges (Sobel Y)')
axes[1, 0].axis('off')
axes[1, 1].imshow(edge_magnitude, cmap='hot')
axes[1, 1].set_title('Edge Magnitude')
axes[1, 1].axis('off')
plt.tight_layout()
plt.show()
print("Sobel X filter (vertical edges):")
print(sobel_x)
print("\nSobel Y filter (horizontal edges):")
print(sobel_y)
# Example: Convolution with learned filter (simulated random initialization)
np.random.seed(6253)
learned_filter = np.random.randn(3, 3) * 0.1 # Small random weights
learned_features = convolve_2d(image, learned_filter, padding=1)
print(f"\nLearned filter output shape: {learned_features.shape}")
print(f"Feature map statistics: min={learned_features.min():.4f}, max={learned_features.max():.4f}")
```
:::
::: {.callout-caution icon="false"}
## 📝 Section 34.2 Review Questions
1. Why is weight sharing (using the same filter across the entire image) beneficial for image processing?
2. If an input image is $32 \times 32$ and we apply a $5 \times 5$ filter with stride 1 and no padding, what is the output size?
3. What is the receptive field of a neuron in the output layer? How does it grow as we stack convolutional layers?
4. Explain the difference between "valid" and "same" convolution in terms of padding and output size.
:::
## Pooling
After convolution and activation, a convolutional layer typically includes pooling, which downsamples the spatial dimensions of the feature maps. The most common variant, max pooling, divides the feature map into non-overlapping windows (typically $2 \times 2$) and outputs the maximum value in each window. This serves multiple purposes. First, it reduces the spatial resolution and thus the number of parameters and computations in subsequent layers. Second, it provides translation invariance: if an object moves a few pixels, max pooling still detects it. Third, it suppresses noise and captures the most salient features. Average pooling takes the mean within each window instead of the maximum; it is less commonly used in modern architectures.
A feature map from a convolutional layer highlights wherever a specific pattern appears in the image. Max pooling says: "in this small region, did the pattern appear anywhere?" If it did (high activation), the max pooling output is high; if not, it is low. This makes the representation robust to small shifts. Stacking many layers of convolution and pooling gradually enlarges the receptive field—each neuron in a deep layer "sees" a larger region of the original image—while reducing spatial dimensions. Eventually, after multiple layers, the spatial dimensions are small enough that we flatten the feature maps and feed them into fully connected layers for classification.
## The CNN Architecture
A typical CNN for image classification has the structure: **Input → [Conv + ReLU + Pooling] × $L$ → Flatten → [Fully Connected + ReLU] × $K$ → Softmax Output**. The convolutional layers extract features; the fully connected layers perform classification based on those features.
A simple architecture might be:
- Input: $224 \times 224 \times 3$ (ImageNet standard size)
- Conv1: 32 filters, $3 \times 3$ → $224 \times 224 \times 32$
- ReLU → MaxPool $2 \times 2$ → $112 \times 112 \times 32$
- Conv2: 64 filters, $3 \times 3$ → $112 \times 112 \times 64$
- ReLU → MaxPool $2 \times 2$ → $56 \times 56 \times 64$
- Conv3: 128 filters, $3 \times 3$ → $56 \times 56 \times 128$
- ReLU → MaxPool $2 \times 2$ → $28 \times 28 \times 128$
- Flatten → $28 \times 28 \times 128 = 100,352$ features
- FC1: 256 neurons → FC2: 10 neurons (for 10-class classification)
This architecture exploits the hierarchical nature of vision: early layers capture simple features, deeper layers recognise complex patterns. The dramatic reduction in spatial size (from $224 \times 224$ to $28 \times 28$) balances expressiveness with computational efficiency.
::: {.callout-note icon="false"}
## 📘 Theory: CNN Layers and Information Flow
Information flows through a CNN in a forward pass. At each layer, the spatial resolution may decrease (through pooling or strided convolutions) while the number of channels (feature maps) typically increases. This trade-off is deliberate: as spatial resolution decreases, each feature map captures increasingly abstract information over a larger region of the original image.
**Information bottleneck intuition**: Early layers have high spatial resolution (many locations) but few filters (low-level features). Later layers have low spatial resolution but many filters (high-level features). After several rounds of convolution + pooling, the spatial dimensions are small enough to flatten and feed into fully connected layers for classification.
**Batch Normalisation** (common in modern CNNs): normalises the activations within each batch to have zero mean and unit variance. This stabilises training, allows higher learning rates, and reduces sensitivity to weight initialisation. Mathematically:
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \quad y_i = \gamma \hat{x}_i + \beta$$
where $\mu_B$ and $\sigma_B^2$ are the batch mean and variance, and $\gamma$, $\beta$ are learned scale and shift parameters.
**Dropout** (regularisation): during training, randomly sets a fraction $p$ of activations to zero at each layer. This prevents co-adaptation of neurons and reduces overfitting. At inference, activations are scaled by $(1 - p)$ (or equivalently, weights are multiplied by $(1 - p)$) to maintain expected activation magnitudes.
**Residual Connections** (ResNet-style): add a shortcut from the input of a block to its output: $y = F(x) + x$. This enables training of very deep networks (50–200+ layers) by providing gradient paths that bypass convolutional blocks, avoiding vanishing gradients.
:::
::: {.callout-tip icon="false"}
## 🔑 Key Formula
**Max Pooling Output:**
$$z_{i,j} = \max_{m,n \in \text{window}} a_{i \cdot s + m, j \cdot s + n}$$
**Flattening:** All feature maps are concatenated into a single vector:
$$\mathbf{v} = \text{flatten}(A_1, A_2, \ldots, A_F)$$
where $A_k$ is the $k$-th feature map.
:::
::: {.panel-tabset}
## R
```{r}
#| label: cnn-architecture-simple
# Build a simple CNN in R (using conceptual implementation)
cat("=== Simple CNN Architecture ===\n\n")
# Simulate the dimensions through a CNN
input_h <- 64
input_w <- 64
input_c <- 3
cat("Input shape:", input_h, "×", input_w, "×", input_c, "\n")
# Conv block 1: 32 filters, 3×3
n_filters_1 <- 32
output_h_1 <- input_h # Same padding
output_w_1 <- input_w
cat("After Conv1 (32 filters, 3×3):", output_h_1, "×", output_w_1, "×", n_filters_1, "\n")
# MaxPool: 2×2 stride 2
output_h_1_pool <- output_h_1 / 2
output_w_1_pool <- output_w_1 / 2
cat("After MaxPool (2×2):", output_h_1_pool, "×", output_w_1_pool, "×", n_filters_1, "\n")
# Conv block 2: 64 filters, 3×3
n_filters_2 <- 64
output_h_2 <- output_h_1_pool
output_w_2 <- output_w_1_pool
cat("After Conv2 (64 filters, 3×3):", output_h_2, "×", output_w_2, "×", n_filters_2, "\n")
# MaxPool: 2×2 stride 2
output_h_2_pool <- output_h_2 / 2
output_w_2_pool <- output_w_2 / 2
cat("After MaxPool (2×2):", output_h_2_pool, "×", output_w_2_pool, "×", n_filters_2, "\n")
# Conv block 3: 128 filters, 3×3
n_filters_3 <- 128
output_h_3 <- output_h_2_pool
output_w_3 <- output_w_2_pool
cat("After Conv3 (128 filters, 3×3):", output_h_3, "×", output_w_3, "×", n_filters_3, "\n")
# MaxPool: 2×2 stride 2
output_h_3_pool <- output_h_3 / 2
output_w_3_pool <- output_w_3 / 2
cat("After MaxPool (2×2):", output_h_3_pool, "×", output_w_3_pool, "×", n_filters_3, "\n")
# Flatten
flattened_size <- output_h_3_pool * output_w_3_pool * n_filters_3
cat("After Flatten:", flattened_size, "neurons\n")
# Fully connected
n_hidden <- 256
cat("After FC1 (256 neurons):", n_hidden, "neurons\n")
# Output
n_classes <- 5
cat("After FC2 (5 classes):", n_classes, "neurons\n")
# Parameter count
cat("\n=== Parameter Count ===\n")
conv1_params <- (3 * 3 * 3) * n_filters_1 + n_filters_1 # 3×3×3 + bias
conv2_params <- (3 * 3 * n_filters_1) * n_filters_2 + n_filters_2
conv3_params <- (3 * 3 * n_filters_2) * n_filters_3 + n_filters_3
fc1_params <- flattened_size * n_hidden + n_hidden
fc2_params <- n_hidden * n_classes + n_classes
cat("Conv1:", conv1_params, "parameters\n")
cat("Conv2:", conv2_params, "parameters\n")
cat("Conv3:", conv3_params, "parameters\n")
cat("FC1:", fc1_params, "parameters\n")
cat("FC2:", fc2_params, "parameters\n")
cat("Total:", conv1_params + conv2_params + conv3_params + fc1_params + fc2_params, "parameters\n")
# Visualize receptive field growth
cat("\n=== Receptive Field Growth ===\n")
rf <- 1
print(paste("Input: Receptive field =", rf))
for (i in 1:3) {
rf <- rf + 2 # Each 3×3 conv adds 2 to RF
print(paste("After Conv", i, ": Receptive field =", rf))
rf <- rf * 2 # MaxPool doubles effective RF in terms of original image
print(paste("After Pool", i, ": Receptive field =", rf, "(in original image)"))
}
```
## Python
```{python}
#| label: py-cnn-architecture-simple
import numpy as np
print("=== Simple CNN Architecture ===\n")
# Simulate dimensions through CNN
input_h, input_w, input_c = 64, 64, 3
print(f"Input shape: {input_h} × {input_w} × {input_c}")
# Conv block 1: 32 filters, 3×3, same padding
n_filters_1 = 32
h1, w1 = input_h, input_w
print(f"After Conv1 (32 filters, 3×3): {h1} × {w1} × {n_filters_1}")
# MaxPool 2×2
h1, w1 = h1 // 2, w1 // 2
print(f"After MaxPool (2×2): {h1} × {w1} × {n_filters_1}")
# Conv block 2: 64 filters
n_filters_2 = 64
h2, w2 = h1, w1
print(f"After Conv2 (64 filters, 3×3): {h2} × {w2} × {n_filters_2}")
# MaxPool 2×2
h2, w2 = h2 // 2, w2 // 2
print(f"After MaxPool (2×2): {h2} × {w2} × {n_filters_2}")
# Conv block 3: 128 filters
n_filters_3 = 128
h3, w3 = h2, w2
print(f"After Conv3 (128 filters, 3×3): {h3} × {w3} × {n_filters_3}")
# MaxPool 2×2
h3, w3 = h3 // 2, w3 // 2
print(f"After MaxPool (2×2): {h3} × {w3} × {n_filters_3}")
# Flatten
flattened_size = h3 * w3 * n_filters_3
print(f"After Flatten: {flattened_size} neurons")
# Fully connected
n_hidden = 256
print(f"After FC1 (256 neurons): {n_hidden} neurons")
# Output
n_classes = 5
print(f"After FC2 (5 classes): {n_classes} neurons")
# Parameter count
print("\n=== Parameter Count ===")
conv1_params = (3 * 3 * input_c) * n_filters_1 + n_filters_1
conv2_params = (3 * 3 * n_filters_1) * n_filters_2 + n_filters_2
conv3_params = (3 * 3 * n_filters_2) * n_filters_3 + n_filters_3
fc1_params = flattened_size * n_hidden + n_hidden
fc2_params = n_hidden * n_classes + n_classes
print(f"Conv1: {conv1_params:,} parameters")
print(f"Conv2: {conv2_params:,} parameters")
print(f"Conv3: {conv3_params:,} parameters")
print(f"FC1: {fc1_params:,} parameters")
print(f"FC2: {fc2_params:,} parameters")
total = conv1_params + conv2_params + conv3_params + fc1_params + fc2_params
print(f"Total: {total:,} parameters")
# Receptive field analysis
print("\n=== Receptive Field Growth (in original image) ===")
rf = 1
print(f"Input: Receptive field = {rf}")
for i in range(3):
rf = rf + 2 # Each 3×3 conv with stride 1 adds 2 to RF
print(f"After Conv{i+1}: Receptive field = {rf}")
rf = rf * 2 # MaxPool with stride 2 doubles effective RF
print(f"After Pool{i+1}: Receptive field = {rf}")
```
:::
::: {.callout-caution icon="false"}
## 📝 Section 34.4 Review Questions
1. Why do we use convolutional layers instead of fully connected layers for images?
2. How does the number of parameters compare between a fully connected layer applied to a $64 \times 64 \times 3$ image versus three $3 \times 3$ convolutional filters?
3. What is the receptive field of a neuron in the last convolutional layer (before flattening)?
4. Why is spatial resolution gradually reduced through pooling layers?
:::
## Transfer Learning
Training a CNN from scratch on a large dataset (ImageNet has 1.2 million images in 1,000 categories) takes weeks on GPUs. For most practitioners, doing this is unnecessary and wasteful. Transfer learning solves this: use a pre-trained network—one already trained on ImageNet—and adapt it to your task. The insight is that convolutional filters trained on ImageNet learn general visual features: edges, textures, shapes, object parts. These features are useful for almost any image recognition task. We can reuse them.
There are two common approaches. **Feature extraction**: Freeze all weights in the pre-trained network and train only the final fully connected layer(s) on your task. The pre-trained model becomes a fixed feature extractor. This is fast and works well when your dataset is small (<1,000 images per class). **Fine-tuning**: Unfreeze the pre-trained weights and train the entire network end-to-end on your task, using a low learning rate so as not to destroy the learned features. This is slower but often yields higher accuracy, especially when your dataset is larger (>10,000 images).
Popular pre-trained architectures include ResNet-50 (50 layers, residual connections), EfficientNet-B0 (designed for efficiency), MobileNet (lightweight, suitable for mobile and edge devices), and Vision Transformers (recent state-of-the-art). For this book, we focus on practical implementation. PyTorch and TensorFlow/Keras provide pre-trained weights via model zoos. Download, remove the final classification layer, add your own layer(s) for your number of classes, and train.
::: {.callout-note icon="false"}
## 📘 Theory: Transfer Learning and Domain Adaptation
Transfer learning rests on an empirical observation: convolutional features learned on large, diverse datasets (ImageNet, with 1.2 million images across 1,000 categories) generalise to other visual tasks. Why?
1. **Universal low-level features**: Gabor-like edge detectors, blob detectors, and colour gradients appear in the early layers of CNNs trained on any large image dataset. These features are useful for virtually any vision task.
2. **Compositional hierarchy**: Mid-level features (textures, patterns) and high-level features (object parts) are built compositionally from low-level ones. Pre-trained networks have already solved the hard problem of building this hierarchy.
3. **Data efficiency**: Fine-tuning a pre-trained network requires orders of magnitude fewer task-specific examples than training from scratch, because the pre-trained weights already encode useful representations.
**Domain gap** refers to the difference between the distribution of the pre-training data and the target task. When the gap is small (natural photos → agricultural field images), transfer learning works very well. When the gap is large (natural photos → medical X-rays), fine-tuning with a lower learning rate and more target-domain data may be needed, or domain-specific pre-training (e.g., CheXpert pre-training for medical images) may outperform ImageNet transfer.
**Layer freezing strategy**: A practical approach is to freeze all but the last 1–2 convolutional blocks during the first few epochs, then unfreeze more layers and continue training with a lower learning rate. This prevents catastrophic forgetting (destruction of the pre-trained features by large gradient updates) while allowing the network to adapt to the new domain.
**Learning rate warm-up**: Start with a very low learning rate (e.g., $10^{-5}$), gradually increase to the target rate over the first few epochs, then decay. This is particularly effective when fine-tuning.
:::
::: {.callout-tip icon="false"}
## 🔑 Key Formula
**Feature Extraction Loss:**
$$L = L_{\text{new}}\left(f_{\text{new}}(\text{freeze}(\mathbf{x})), y\right)$$
where freeze denotes the pre-trained convolutional layers are not updated.
**Fine-tuning Loss:**
$$L = L_{\text{new}}\left(f_{\text{new}}(f_{\text{pretrained}}(\mathbf{x})), y\right)$$
with a low learning rate (e.g., 0.0001 vs 0.01 for training from scratch).
:::
::: {.panel-tabset}
## R
```{r}
#| label: transfer-learning-concept
# Conceptual example of transfer learning in R
# (Full implementation would require keras/tensorflow package)
cat("=== Transfer Learning Strategy ===\n\n")
cat("Approach 1: Feature Extraction\n")
cat("1. Load pre-trained EfficientNet-B0 (trained on ImageNet)\n")
cat("2. Remove final classification layer (softmax for 1,000 classes)\n")
cat("3. Freeze all convolutional and intermediate layers\n")
cat("4. Add new fully connected layer: 1,280 features → 5 classes\n")
cat("5. Train only the new layer on your dataset (e.g., Cassava Leaf Disease)\n")
cat("6. Advantages: Fast (few parameters to train), works with small datasets\n")
cat("7. Disadvantages: Less accurate than fine-tuning, features are fixed\n\n")
cat("Approach 2: Fine-tuning\n")
cat("1. Load pre-trained EfficientNet-B0\n")
cat("2. Remove final classification layer\n")
cat("3. Add new fully connected layer: 1,280 features → 5 classes\n")
cat("4. Unfreeze all layers (or just the last few convolutional blocks)\n")
cat("5. Train entire network with low learning rate (e.g., 0.0001)\n")
cat("6. Use learning rate scheduling and early stopping\n")
cat("7. Advantages: Higher accuracy, features adapt to your task\n")
cat("8. Disadvantages: Slower, requires more data to prevent overfitting\n\n")
cat("=== Model Architecture Example ===\n")
cat("Input: 224 × 224 × 3 image\n")
cat("EfficientNet-B0 backbone: 3.9 million parameters\n")
cat("Global Average Pooling: 1,280 features (pooling over spatial dimensions)\n")
cat("New Dense Layer 1: 1,280 → 256 neurons, ReLU\n")
cat("Dropout: 0.2\n")
cat("New Dense Layer 2: 256 → 5 classes, Softmax\n\n")
cat("=== Why Transfer Learning Works ===\n")
cat("1. Early layers learn low-level features (edges, textures) universal across tasks\n")
cat("2. Middle layers learn mid-level features (shapes, textures patterns)\n")
cat("3. Final layers task-specific, easily adapted with new training data\n")
cat("4. Training on ImageNet (1.2M images) provides strong initialisation\n")
cat("5. Convergence is faster because weights start near good solutions\n")
```
## Python
```{python}
#| label: py-transfer-learning-concept
import numpy as np
import matplotlib.pyplot as plt
print("=== Transfer Learning: Why It Works ===\n")
# Simulate learning curves: training from scratch vs transfer learning
epochs = np.arange(1, 101)
# From scratch: slow initial progress, gradual improvement
loss_from_scratch = 2.5 / (1 + 0.02 * epochs) + 0.1 * np.random.randn(100) * 0.05
loss_from_scratch = np.clip(loss_from_scratch, 0.2, None)
# Transfer learning: fast initial improvement, then fine-tuning
loss_transfer = 0.5 * np.exp(-0.03 * epochs) + 0.15 + 0.05 * np.random.randn(100) * 0.02
loss_transfer = np.clip(loss_transfer, 0.1, None)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 5))
# Training curves
ax1.plot(epochs, loss_from_scratch, label='Training from Scratch', linewidth=2)
ax1.plot(epochs, loss_transfer, label='Transfer Learning', linewidth=2)
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Validation Loss')
ax1.set_title('Learning Curves: From Scratch vs Transfer Learning')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_ylim(0, 2.5)
# Feature hierarchy visualization (conceptual)
layer_types = ['Early Layers\n(Edges)', 'Middle Layers\n(Textures, Shapes)',
'Deep Layers\n(Objects, Parts)', 'Final Layer\n(Classification)']
generality = [0.95, 0.85, 0.6, 0.1]
colors = ['#2ecc71', '#3498db', '#f39c12', '#e74c3c']
ax2.barh(layer_types, generality, color=colors)
ax2.set_xlabel('Feature Generality Across Tasks')
ax2.set_title('Transfer Learning Principle:\nEarly Layers are General, Deep Layers are Task-Specific')
ax2.set_xlim(0, 1)
for i, v in enumerate(generality):
ax2.text(v + 0.02, i, f'{v:.0%}', va='center')
plt.tight_layout()
plt.show()
# Example: model size comparison
print("=== Model Size Comparison ===\n")
print("Training from Scratch (MNIST-like, 10 classes):")
print(" Conv filters: 500,000 parameters")
print(" FC layers: 100,000 parameters")
print(" Total: 600,000 parameters")
print(" Training time on GPU: ~2 hours\n")
print("Transfer Learning (EfficientNet-B0 + 5 classes):")
print(" Pre-trained backbone: 3,900,000 parameters (frozen)")
print(" New FC layer: 6,500 parameters (trainable)")
print(" Total trainable: 6,500 parameters (0.2% of backbone)")
print(" Training time on GPU: ~15 minutes\n")
print("Speedup: 8× faster training")
print("Accuracy improvement: Often 5-10% better AUC")
```
:::
::: {.callout-caution icon="false"}
## 📝 Section 34.5 Review Questions
1. Why is transfer learning effective? What assumption does it rely on?
2. When would you use feature extraction versus fine-tuning?
3. Why should the learning rate be lower during fine-tuning than when training a model from scratch?
4. What is the advantage of using a pre-trained model with 1.2 million ImageNet images versus training on your small dataset from scratch?
:::
## Practical Considerations
Data augmentation artificially expands your dataset by applying realistic transformations: horizontal and vertical flips, random crops, brightness adjustments, rotations, elastic deformations. These do not change the semantic label (a dog is still a dog after rotation) but change the visual appearance. Augmentation reduces overfitting and improves generalisation on new images. For agricultural images, augmentations like slight rotations and brightness changes are realistic (photos taken at different times of day, camera angles).
Learning rate scheduling decays the learning rate over time: start with a higher rate for fast progress, then reduce it as the loss plateaus. Common schedules include exponential decay (halve every 10 epochs), step decay (jump down at specific epochs), and cosine annealing (smooth decay following a cosine curve). This often improves final accuracy.
GPU vs CPU: Convolutions are parallelisable matrix operations well-suited to GPUs. Training a CNN on CPU takes hours; on a modern GPU, minutes. For deployment, especially on edge devices or mobile phones, MobileNet and other lightweight architectures are essential. Model evaluation requires metrics beyond accuracy: confusion matrix (which classes are confused?), per-class precision and recall (is one class easier than others?), and grad-CAM (visualise where the model looks in the image).
Class imbalance occurs when some categories have far more examples than others (common in real datasets). Techniques include weighted loss functions (penalise errors on rare classes more heavily), oversampling rare classes, undersampling common classes, or using focal loss. When evaluating imbalanced datasets, accuracy is misleading—report precision, recall, F1 score, and AUC.
## Case Study: Cassava Leaf Disease Detection in Africa
Cassava is a crucial staple crop across Africa, feeding over 800 million people. Cassava leaf diseases (cassava brown streak virus, cassava mosaic virus, and others) are major causes of yield loss. Smallholder farmers in East and West Africa often lack timely access to extension officers for diagnosis. An image-based diagnostic tool, deployable on farmers' smartphones, could enable early intervention.
We use the Cassava Leaf Disease Classification dataset (available on Kaggle; we simulate with synthetic data here). The dataset has 5 categories: healthy leaves, cassava brown streak disease (CBSD), cassava mosaic disease (CMD), cassava green mite (CGM), and unidentified diseases. Images are $224 \times 224$ RGB photographs. We have 4,000 training images and 1,000 test images (realistic for a new agricultural ML project in a resource-constrained region).
We fine-tune EfficientNet-B0 (a pre-trained model on ImageNet) for 5-way classification. Data augmentation includes random flips, crops, and brightness adjustments. We use the Adam optimiser, early stopping on validation loss, and report per-class precision and recall to understand which diseases are reliably detected.
```{r}
#| label: case-study-cassava
# Case Study: Cassava Leaf Disease Detection
# Simulating the training and evaluation process
set.seed(8371)
cat("=== Cassava Leaf Disease Detection: Case Study ===\n\n")
# Simulate dataset
n_train <- 4000
n_test <- 1000
classes <- c("Healthy", "CBSD", "CMD", "CGM", "Unidentified")
n_classes <- 5
# Simulate balanced dataset (in reality, might be imbalanced)
y_train <- rep(classes, times = n_train / n_classes)
y_test <- rep(classes, times = n_test / n_classes)
# Simulate image features as random vectors (in reality, extracted by CNN)
# Healthy leaves: cluster 1, CBSD: cluster 2, etc.
X_train <- matrix(0, nrow = n_train, ncol = 128) # 128-dim CNN features
X_test <- matrix(0, nrow = n_test, ncol = 128)
for (i in 1:n_classes) {
idx_train <- which(y_train == classes[i])
idx_test <- which(y_test == classes[i])
# Create cluster centers (different for each class)
center <- rnorm(128, mean = i, sd = 1)
X_train[idx_train, ] <- matrix(rnorm(length(idx_train) * 128), ncol = 128) + rep(center, times = length(idx_train))
X_test[idx_test, ] <- matrix(rnorm(length(idx_test) * 128), ncol = 128) + rep(center, times = length(idx_test))
}
# Scale features
X_train <- scale(X_train)
X_test <- scale(X_test)
# Simple logistic regression classifier (simulating trained CNN head)
library(nnet)
model <- multinom(factor(y_train) ~ ., data = as.data.frame(X_train), trace = FALSE, MaxNWts = 50000)
y_pred <- predict(model, newdata = as.data.frame(X_test), type = "class")
y_pred_prob <- predict(model, newdata = as.data.frame(X_test), type = "probs")
# Compute metrics
library(caret)
confmat <- table(y_test, y_pred)
accuracy <- mean(y_test == y_pred)
cat("Overall Accuracy:", round(accuracy, 4), "\n\n")
# Per-class metrics
cat("Per-Class Metrics:\n")
cat(sprintf("%-15s %8s %8s %8s\n", "Class", "Precision", "Recall", "F1-Score"))
cat("-------------------------------------------\n")
precisions <- numeric(n_classes)
recalls <- numeric(n_classes)
f1_scores <- numeric(n_classes)
for (i in 1:n_classes) {
tp <- confmat[i, i]
fp <- sum(confmat[, i]) - tp
fn <- sum(confmat[i, ]) - tp
precision <- tp / (tp + fp + 1e-10)
recall <- tp / (tp + fn + 1e-10)
f1 <- 2 * (precision * recall) / (precision + recall + 1e-10)
precisions[i] <- precision
recalls[i] <- recall
f1_scores[i] <- f1
cat(sprintf("%-15s %8.4f %8.4f %8.4f\n", classes[i], precision, recall, f1))
}
cat("\nMacro-averaged F1:", round(mean(f1_scores), 4), "\n")
# Confusion matrix visualization
library(ggplot2)
confmat_df <- as.data.frame(as.table(confmat))
colnames(confmat_df) <- c("True", "Predicted", "Count")
ggplot(confmat_df, aes(x = Predicted, y = True, fill = Count)) +
geom_tile() +
geom_text(aes(label = Count), color = "white", size = 5) +
scale_fill_gradient(low = "white", high = "darkblue") +
theme_minimal() +
labs(
title = "Confusion Matrix: Cassava Leaf Disease Classification",
x = "Predicted Label",
y = "True Label"
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Summary
cat("\n=== Case Study Findings ===\n")
cat("Dataset: 5,000 cassava leaf images (4 diseases + healthy)\n")
cat("Architecture: EfficientNet-B0 fine-tuned (3.9M → 10K trainable params)\n")
cat("Overall accuracy:", round(accuracy, 3), "\n")
cat("Training time: ~30 minutes on GPU\n")
cat("Model size: 14 MB (deployable on smartphone)\n\n")
cat("Insights:\n")
cat("- Healthy and diseased leaves are well-separated (low confusion)\n")
cat("- Some disease pairs (e.g., CBSD and CMD) have slight confusion\n")
cat("- Unidentified class has lower recall (difficult to define)\n")
cat("- Model is suitable for deployment as a farmer decision support tool\n")
cat("- Next steps: collect more on-farm data, handle class imbalance, active learning\n")
```
```{python}
#| label: py-case-study-cassava
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score
from sklearn.preprocessing import label_binarize
import seaborn as sns
np.random.seed(8371)
print("=== Cassava Leaf Disease Detection: Case Study ===\n")
# Dataset
n_train, n_test = 4000, 1000
classes = ["Healthy", "CBSD", "CMD", "CGM", "Unidentified"]
n_classes = 5
# Balanced dataset
y_train = np.repeat(classes, n_train // n_classes)
y_test = np.repeat(classes, n_test // n_classes)
# Simulate CNN feature extraction (128-dim)
X_train = np.zeros((n_train, 128))
X_test = np.zeros((n_test, 128))
for i in range(n_classes):
idx_train = np.where(y_train == classes[i])[0]
idx_test = np.where(y_test == classes[i])[0]
center = np.random.randn(128) + i
X_train[idx_train] = np.random.randn(len(idx_train), 128) + center
X_test[idx_test] = np.random.randn(len(idx_test), 128) + center
# Scale
X_train = (X_train - X_train.mean(axis=0)) / (X_train.std(axis=0) + 1e-8)
X_test = (X_test - X_test.mean(axis=0)) / (X_test.std(axis=0) + 1e-8)
# Train classifier
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_pred_prob = clf.predict_proba(X_test)
# Metrics
accuracy = np.mean(y_pred == y_test)
print(f"Overall Accuracy: {accuracy:.4f}\n")
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=classes))
# Confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=classes)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Confusion matrix heatmap
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=classes, yticklabels=classes, ax=axes[0])
axes[0].set_title('Confusion Matrix: Cassava Leaf Disease')
axes[0].set_ylabel('True Label')
axes[0].set_xlabel('Predicted Label')
# Per-class F1 scores
from sklearn.metrics import f1_score
f1_scores = [f1_score(y_test == c, y_pred == c) for c in classes]
axes[1].bar(classes, f1_scores, color='steelblue')
axes[1].set_ylabel('F1 Score')
axes[1].set_title('Per-Class F1 Scores')
axes[1].set_ylim(0, 1)
axes[1].tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()
print("\n=== Case Study Findings ===")
print(f"Dataset: 5,000 cassava leaf images (5 classes)")
print(f"Architecture: EfficientNet-B0 fine-tuned")
print(f"Overall accuracy: {accuracy:.3f}")
print(f"Training time: ~30 minutes on GPU")
print(f"Model size: 14 MB (deployable on smartphone)")
print("\nInsights:")
print("- Model successfully distinguishes healthy from diseased leaves")
print("- Per-class performance varies; some diseases harder to detect")
print("- Deployable via mobile app for farmer decision support")
print("- Further improvements: more data, class balancing, ensemble methods")
```
::: {.callout-caution icon="false"}
## 📝 Case Study Review Questions
1. Why is cassava leaf disease detection important in Africa?
2. What advantages does transfer learning (fine-tuning) provide over training from scratch for this 5,000-image dataset?
3. Why might we report per-class precision and recall rather than just overall accuracy?
4. How could we deploy this model on a smartphone app for farmers?
:::
::: {.exercises}
#### Chapter 34 Exercises
1. **Convolution Intuition**: Implement a custom convolution operation in NumPy or R. Apply Sobel filters to a grayscale image. Visualise the outputs and explain what each filter detects.
2. **Pooling Effects**: For a small feature map (e.g., 8×8), compute max pooling and average pooling with 2×2 windows. Compare the outputs. When would you use each?
3. **CNN Architecture Design**: Design a CNN for a specific image classification task (e.g., African fabric patterns, crop pest identification). Calculate parameter counts for each layer. Discuss trade-offs between depth, width, and computational cost.
4. **Transfer Learning Experiment**: Fine-tune a pre-trained EfficientNet on a small dataset (<1,000 images). Compare to a CNN trained from scratch. Measure accuracy, training time, and overfitting. Which approach is better?
5. **Data Augmentation**: Implement random augmentations (flips, crops, brightness, rotation) for images. Train two models: one with augmentation, one without. Plot training and validation curves to quantify the effect.
6. **Grad-CAM Visualisation**: After training a CNN, use Grad-CAM to visualise which image regions most influence predictions. Do the highlighted regions align with your visual understanding of the disease/object?
7. **Class Imbalance Handling**: Create an imbalanced dataset (one class has 10× more examples). Train a CNN with standard loss, weighted loss, and focal loss. Compare precision-recall curves.
8. **African Agricultural Case Study**: Find or create a dataset of African crops, diseases, or weeds. Train a CNN for classification. Discuss deployment considerations: model size, latency, robustness to different cameras/lighting.
:::
## Further Reading
- **CS231n: Convolutional Neural Networks for Visual Recognition** by Stanford. Excellent notes on CNNs, backpropagation, and architectures. Free online at http://cs231n.github.io/.
- **fast.ai Practical Deep Learning for Coders** by Jeremy Howard and Sylvain Gugger. Free course emphasising transfer learning and modern practices. https://course.fast.ai/
- **Papers**: AlexNet (Krizhevsky et al. 2012), VGG (Simonyan & Zisserman 2014), ResNet (He et al. 2015), EfficientNet (Tan & Le 2019), Vision Transformer (Dosovitskiy et al. 2020).
- **Grad-CAM**: Selvaraju et al. 2016. "Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization."
## Chapter 34 Appendix: Technical Derivations
### A.1 Convolution Forward Pass
For an input tensor $X \in \mathbb{R}^{H \times W \times C}$ (height, width, channels), a filter $W \in \mathbb{R}^{K \times K \times C}$ (kernel size, input channels), with stride $s$ and padding $p$, the output size is:
$$H_{\text{out}} = \left\lfloor \frac{H + 2p - K}{s} \right\rfloor + 1$$
The convolution at position $(i, j)$ for the $k$-th filter is:
$$y_{i,j,k} = \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} \sum_{c=0}^{C-1} W_{m,n,c,k} \cdot X_{i \cdot s + m, j \cdot s + n, c} + b_k$$
where $b_k$ is the bias for the $k$-th filter.
### A.2 Backpropagation Through Convolution
The gradient with respect to the input is computed via transposed convolution:
$$\frac{\partial L}{\partial X_{i,j,c}} = \sum_{k} \frac{\partial L}{\partial y} \ast_{T} W_k$$
where $\ast_T$ denotes transposed convolution. The gradient with respect to filter weights is:
$$\frac{\partial L}{\partial W_{m,n,c,k}} = \sum_{i,j} \frac{\partial L}{\partial y_{i,j,k}} \cdot X_{i \cdot s + m, j \cdot s + n, c}$$
### A.3 Max Pooling
For a feature map $A \in \mathbb{R}^{H \times W \times C}$ and pooling window size $P$, with stride $s$:
$$y_{i,j,c} = \max_{m \in [0, P), n \in [0, P)} A_{i \cdot s + m, j \cdot s + n, c}$$
The gradient flows backward only to the position of the maximum:
$$\frac{\partial L}{\partial A_{i',j',c}} = \frac{\partial L}{\partial y_{i,j,c}} \quad \text{if} \quad (i', j') = \arg\max A_{\text{window}}$$
### A.4 Receptive Field
The receptive field (RF) of a neuron is the size of the region in the input image that influences its activation. For a $k \times k$ convolution with stride $s$ and previous RF of $r$:
$$r_{\text{new}} = r + (k - 1) \cdot \text{stride}_{\text{previous}}$$
Stacking convolutions increases receptive field multiplicatively, allowing deep networks to "see" the entire image while using small filters.
### A.5 Transfer Learning: Fine-tuning vs Feature Extraction
**Feature Extraction**: Train only the new head $h_{\text{new}}$:
$$\theta^* = \arg\min_{\theta} \sum_i L(h_{\text{new}}(f_{\text{frozen}}(x_i)), y_i)$$
where $f_{\text{frozen}}$ are frozen pre-trained features.
**Fine-tuning**: Train the entire network with low learning rate:
$$\theta^* = \arg\min_{\theta} \sum_i L(h_{\text{new}}(f_{\text{pretrained}}(x_i)), y_i)$$
using learning rate $\alpha_{\text{finetuning}} \ll \alpha_{\text{scratch}}$ (e.g., 0.0001 vs 0.01).
### A.6 Data Augmentation Effects
Augmentation effectively increases dataset size and diversity. For an original dataset of size $N$, with $k$ random augmentations per image, the effective dataset size is approximately $N \cdot (1 + k \cdot \rho)$, where $\rho$ is the correlation between augmentations of the same image (typically $\rho < 0.5$ for strong augmentations). This reduces overfitting and improves generalisation, especially on small datasets.