Backpropagation
The learning algorithm that trains neural networks
Imagine you're playing darts blindfolded. You throw a dart, someone tells you 'you were 10cm too far left and 5cm too high', you adjust your aim, and you try again. That's backpropagation! When a neural network makes a prediction, backpropagation works out how much each weight contributed to the error and tells the network exactly how much to adjust it. It works backwards through the network (hence 'back' propagation), using the chain rule from calculus to attribute the error to each weight. It's like following breadcrumbs backwards to find where you went wrong!
What is Backpropagation?
Backpropagation (backward propagation of errors) is the algorithm used to train neural networks by computing gradients efficiently. It propagates error signals backward through the network to obtain the gradient of the loss function with respect to each weight, which gradient descent then uses to update the weights and minimize the loss. It is the cornerstone of deep learning: without backpropagation, training deep neural networks would be computationally infeasible.
```python
# Simple Neural Network to Demonstrate Backpropagation
import numpy as np

# Simple 2-layer network: Input -> Hidden -> Output
# Goal: Learn XOR function

# Activation functions
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # Note: expects the *output* of sigmoid, i.e. x = sigmoid(z)
    return x * (1 - x)

# Training data: XOR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# Initialize weights randomly
np.random.seed(42)
input_size = 2
hidden_size = 4
output_size = 1
weights_input_hidden = np.random.randn(input_size, hidden_size)
weights_hidden_output = np.random.randn(hidden_size, output_size)
learning_rate = 0.5
epochs = 10000

# Training loop
for epoch in range(epochs):
    # === FORWARD PASS ===
    # Layer 1: Input to Hidden
    hidden_input = np.dot(X, weights_input_hidden)
    hidden_output = sigmoid(hidden_input)

    # Layer 2: Hidden to Output
    final_input = np.dot(hidden_output, weights_hidden_output)
    final_output = sigmoid(final_input)

    # Calculate loss (Mean Squared Error)
    loss = np.mean((y - final_output) ** 2)

    # === BACKWARD PASS (Backpropagation) ===
    # Output layer error and gradient
    output_error = y - final_output
    output_gradient = output_error * sigmoid_derivative(final_output)

    # Hidden layer error (propagate error backward!)
    hidden_error = output_gradient.dot(weights_hidden_output.T)
    hidden_gradient = hidden_error * sigmoid_derivative(hidden_output)

    # Update weights using the gradients
    weights_hidden_output += hidden_output.T.dot(output_gradient) * learning_rate
    weights_input_hidden += X.T.dot(hidden_gradient) * learning_rate

    if epoch % 2000 == 0:
        print(f"Epoch {epoch}: Loss = {loss:.4f}")

# Test the trained network
print("\nPredictions after training:")
for i in range(len(X)):
    hidden = sigmoid(np.dot(X[i], weights_input_hidden))
    output = sigmoid(np.dot(hidden, weights_hidden_output))
    print(f"Input: {X[i]} -> Output: {output[0]:.4f} (Target: {y[i][0]})")

# Output shows the network learned XOR using backpropagation!
# Epoch 0: Loss = 0.2816
# Epoch 2000: Loss = 0.0065
# Epoch 4000: Loss = 0.0021
# ...
# The gradients computed by backprop enabled the network to learn!
```

How Does It Work?
Backpropagation consists of two main passes through the network:
1. Forward Pass →
Compute predictions layer by layer from input to output
- Calculate: z = w·x + b
- Apply activation: a = σ(z)
- Store all intermediate values (z, a)
- Get final prediction: ŷ
- Compute loss: L(y, ŷ)
2. Backward Pass ←
Compute gradients layer by layer from output to input
- Start with the loss gradient: ∂L/∂ŷ
- Chain rule: ∂L/∂w = (∂L/∂a)·(∂a/∂z)·(∂z/∂w)
- Propagate gradients backward layer by layer
- Compute ∂L/∂w for each weight
- Use the gradients to update the weights
```python
# Detailed Forward and Backward Pass Visualization

# Simple example: 1 input -> 1 hidden neuron -> 1 output
# Goal: Learn y = 2*x

# Training data
x = 3.0
y = 6.0  # Target: 2*x

# Initialize weights
w1 = 0.5  # Input to hidden
w2 = 0.8  # Hidden to output
learning_rate = 0.1

print("=== FORWARD PASS ===")
print(f"Input: x = {x}")
print(f"Target: y = {y}")
print(f"Weights: w1 = {w1}, w2 = {w2}\n")

# Forward pass (no activation for simplicity)
h = w1 * x  # Hidden layer: h = w1 * x
print(f"Step 1: h = w1 * x = {w1} * {x} = {h}")

y_pred = w2 * h  # Output layer: y_pred = w2 * h
print(f"Step 2: y_pred = w2 * h = {w2} * {h} = {y_pred}")

loss = 0.5 * (y - y_pred) ** 2  # Loss: (1/2) * (y - y_pred)^2
print(f"Step 3: Loss = 0.5 * (y - y_pred)^2 = 0.5 * ({y} - {y_pred})^2 = {loss}\n")

print("=== BACKWARD PASS (Backpropagation) ===")
# Backward pass: compute gradients using the chain rule

# ∂Loss/∂y_pred
dL_dy_pred = -(y - y_pred)
print(f"Step 1: ∂Loss/∂y_pred = -(y - y_pred) = -({y} - {y_pred}) = {dL_dy_pred}")

# ∂Loss/∂w2 = (∂Loss/∂y_pred) * (∂y_pred/∂w2)
dy_pred_dw2 = h
dL_dw2 = dL_dy_pred * dy_pred_dw2
print(f"Step 2: ∂Loss/∂w2 = ∂Loss/∂y_pred * ∂y_pred/∂w2 = {dL_dy_pred} * {dy_pred_dw2} = {dL_dw2}")

# ∂Loss/∂h = (∂Loss/∂y_pred) * (∂y_pred/∂h)
dy_pred_dh = w2
dL_dh = dL_dy_pred * dy_pred_dh
print(f"Step 3: ∂Loss/∂h = ∂Loss/∂y_pred * ∂y_pred/∂h = {dL_dy_pred} * {dy_pred_dh} = {dL_dh}")

# ∂Loss/∂w1 = (∂Loss/∂h) * (∂h/∂w1)
dh_dw1 = x
dL_dw1 = dL_dh * dh_dw1
print(f"Step 4: ∂Loss/∂w1 = ∂Loss/∂h * ∂h/∂w1 = {dL_dh} * {dh_dw1} = {dL_dw1}\n")

print("=== WEIGHT UPDATE ===")
# Update weights using the gradients
w1_new = w1 - learning_rate * dL_dw1
w2_new = w2 - learning_rate * dL_dw2
print(f"w1 = w1 - lr * ∂Loss/∂w1 = {w1} - {learning_rate} * {dL_dw1} = {w1_new}")
print(f"w2 = w2 - lr * ∂Loss/∂w2 = {w2} - {learning_rate} * {dL_dw2} = {w2_new}")

print("\nBackpropagation computed the exact gradient for each weight!")
print("Now gradient descent can update weights to reduce loss.")
```

The Chain Rule (Mathematical Foundation)
Backpropagation relies on the chain rule from calculus to compute gradients efficiently:
Chain Rule Formula
If z = f(y) and y = g(x), then:
∂z/∂x = (∂z/∂y) × (∂y/∂x)
Example: If L = (y_pred - y)² and y_pred = w·x, find ∂L/∂w
∂L/∂w = (∂L/∂y_pred) × (∂y_pred/∂w)
∂L/∂y_pred = 2(y_pred - y)
∂y_pred/∂w = x
∂L/∂w = 2(y_pred - y) × x
The chain rule breaks down complex derivatives into simple steps that can be computed efficiently!
```python
# Chain Rule in Action: Computing Gradients

# Example: L = (y_pred - y)^2, where y_pred = w2 * (w1 * x)
# Find ∂L/∂w1 and ∂L/∂w2 using the chain rule
x = 2.0
y = 10.0
w1 = 1.5
w2 = 2.0

# Forward pass
z1 = w1 * x            # First layer
y_pred = w2 * z1       # Second layer
L = (y_pred - y) ** 2  # Loss

print("Forward Pass:")
print(f"z1 = w1 * x = {w1} * {x} = {z1}")
print(f"y_pred = w2 * z1 = {w2} * {z1} = {y_pred}")
print(f"L = (y_pred - y)^2 = ({y_pred} - {y})^2 = {L}\n")

# Backward pass using the chain rule
print("Backward Pass (Chain Rule):\n")

# ∂L/∂y_pred
dL_dy_pred = 2 * (y_pred - y)
print(f"∂L/∂y_pred = 2 * (y_pred - y) = 2 * ({y_pred} - {y}) = {dL_dy_pred}")

# ∂L/∂w2 = (∂L/∂y_pred) * (∂y_pred/∂w2)
dy_pred_dw2 = z1
dL_dw2 = dL_dy_pred * dy_pred_dw2
print(f"\n∂y_pred/∂w2 = z1 = {z1}")
print(f"∂L/∂w2 = (∂L/∂y_pred) * (∂y_pred/∂w2)")
print(f"       = {dL_dy_pred} * {dy_pred_dw2}")
print(f"       = {dL_dw2}")

# ∂L/∂z1 = (∂L/∂y_pred) * (∂y_pred/∂z1)
dy_pred_dz1 = w2
dL_dz1 = dL_dy_pred * dy_pred_dz1
print(f"\n∂y_pred/∂z1 = w2 = {w2}")
print(f"∂L/∂z1 = (∂L/∂y_pred) * (∂y_pred/∂z1)")
print(f"       = {dL_dy_pred} * {dy_pred_dz1}")
print(f"       = {dL_dz1}")

# ∂L/∂w1 = (∂L/∂z1) * (∂z1/∂w1)
dz1_dw1 = x
dL_dw1 = dL_dz1 * dz1_dw1
print(f"\n∂z1/∂w1 = x = {x}")
print(f"∂L/∂w1 = (∂L/∂z1) * (∂z1/∂w1)")
print(f"       = {dL_dz1} * {dz1_dw1}")
print(f"       = {dL_dw1}")

print("\nChain Rule Summary:")
print(f"∂L/∂w2 = {dL_dw2} (tells us how to adjust w2)")
print(f"∂L/∂w1 = {dL_dw1} (tells us how to adjust w1)")
print("\nThese gradients enable gradient descent to optimize the network!")
```

The Algorithm Steps
The complete backpropagation algorithm:
1. Initialize Weights: set random initial values for all weights and biases.
2. Forward Pass: compute predictions and store all intermediate activations (z, a).
3. Compute Loss: calculate the loss L(y, ŷ) using an appropriate loss function.
4. Backward Pass (Backprop): compute gradients ∂L/∂w for all weights using the chain rule, starting from the output.
5. Update Weights: apply gradient descent, w = w - α·∂L/∂w, for each weight.
6. Repeat: continue steps 2-5 for multiple epochs until convergence.
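The full cycle can be condensed into a short training loop. This is a minimal sketch for a single linear neuron y_pred = w*x + b with illustrative data and hyperparameters of my own choosing, not the XOR network from earlier:

```python
import numpy as np

# Steps 1-6 of the algorithm for one linear neuron (illustrative data)
rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                       # target function to learn

w, b = rng.standard_normal(2)           # Step 1: initialize weights
lr = 0.05

for epoch in range(5000):               # Step 6: repeat
    y_pred = w * x + b                  # Step 2: forward pass
    loss = np.mean((y_pred - y) ** 2)   # Step 3: compute loss (MSE)
    # Step 4: backward pass - gradients of MSE w.r.t. w and b
    dL_dw = np.mean(2 * (y_pred - y) * x)
    dL_db = np.mean(2 * (y_pred - y))
    w -= lr * dL_dw                     # Step 5: update weights
    b -= lr * dL_db

print(f"Learned w = {w:.2f}, b = {b:.2f}")  # approaches w = 2.00, b = 1.00
```

With only two parameters the "backward pass" is just two closed-form derivatives; in a multi-layer network, backpropagation computes these same quantities layer by layer via the chain rule.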
Key Concepts
Forward Pass
Compute predictions by passing input through the network layer by layer. Calculate and store intermediate values (activations) needed for backward pass.
Backward Pass
Calculate gradients by propagating error backward from output to input. Use chain rule to compute gradient of loss w.r.t. each weight.
Chain Rule
Mathematical principle that lets us compute derivatives of composite functions. Key: ∂z/∂x = (∂z/∂y) × (∂y/∂x). Allows efficient gradient computation.
Gradient
The direction and magnitude of steepest increase in loss. Negative gradient tells us how to adjust weights to reduce loss.
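A standard way to sanity-check backprop gradients is to compare them against numerical estimates from central differences. This sketch reuses the earlier toy model (h = w1*x, y_pred = w2*h, L = 0.5*(y - y_pred)²); the helper name loss_fn is my own:

```python
# Gradient check: compare analytic (chain-rule) gradients with the
# numerical estimate (f(w + eps) - f(w - eps)) / (2 * eps)
def loss_fn(w1, w2, x, y):
    y_pred = w2 * (w1 * x)
    return 0.5 * (y - y_pred) ** 2

x, y = 3.0, 6.0
w1, w2 = 0.5, 0.8
eps = 1e-6

# Analytic gradients from the chain rule
y_pred = w2 * (w1 * x)
dL_dy_pred = -(y - y_pred)
dL_dw2 = dL_dy_pred * (w1 * x)     # ∂L/∂w2 = ∂L/∂y_pred * h
dL_dw1 = dL_dy_pred * w2 * x       # ∂L/∂w1 = ∂L/∂y_pred * w2 * x

# Numerical gradients via central differences
num_dw1 = (loss_fn(w1 + eps, w2, x, y) - loss_fn(w1 - eps, w2, x, y)) / (2 * eps)
num_dw2 = (loss_fn(w1, w2 + eps, x, y) - loss_fn(w1, w2 - eps, x, y)) / (2 * eps)

print(f"w1: analytic {dL_dw1:.6f} vs numerical {num_dw1:.6f}")
print(f"w2: analytic {dL_dw2:.6f} vs numerical {num_dw2:.6f}")
```

The two estimates should agree to several decimal places; a large mismatch usually indicates a bug in the backward pass.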
Interview Tips
- 💡Explain backprop as: 'algorithm that calculates gradients by propagating errors backward through network using chain rule'
- 💡Two phases: Forward pass (compute predictions) and Backward pass (compute gradients)
- 💡Chain rule is key: ∂L/∂w = (∂L/∂y) × (∂y/∂w). Breaks complex derivatives into simple steps
- 💡Why it's efficient: one backward pass reuses the intermediate values stored during the forward pass, so all gradients together cost about as much as one extra forward pass. Computing each weight's gradient independently (e.g. one perturbed forward pass per weight) would scale with the square of the number of weights
- 💡Forward pass: store activations. Backward pass: compute gradients using stored values
- 💡Combined with gradient descent: backprop computes gradients, gradient descent updates weights
- 💡Vanishing gradient problem: gradients can become very small in deep networks, especially with sigmoid/tanh
- 💡Modern deep learning: backprop + GPU acceleration + good initialization + ReLU enables training very deep networks
- 💡Be ready to walk through a simple example: 2-layer network with one sample
- 💡Know the computational graph perspective: nodes are operations, edges carry gradients backward
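The computational-graph view can be made concrete with a tiny reverse-mode autodiff sketch. This is a hypothetical micro-implementation, not a real library: the Value class, its operators, and the graph below are my own illustration of how gradients flow backward along graph edges.

```python
# Minimal reverse-mode autodiff: each Value node records its parents and a
# closure that routes gradients backward along the graph's edges.
class Value:
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward():
            self.grad += other.data * out.grad   # d(out)/d(self) = other
            other.grad += self.data * out.grad   # d(out)/d(other) = self
        out._backward = backward
        return out

    def __sub__(self, other):
        out = Value(self.data - other.data, (self, other))
        def backward():
            self.grad += out.grad
            other.grad -= out.grad
        out._backward = backward
        return out

    def backward(self):
        # Topologically sort the graph, then run each node's backward closure
        # from the output back toward the inputs
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

# Same toy graph as the earlier example: L = 0.5 * (y - w2*(w1*x))^2
x, y = Value(3.0), Value(6.0)
w1, w2 = Value(0.5), Value(0.8)
half = Value(0.5)
err = y - w2 * (w1 * x)
L = half * (err * err)
L.backward()
print(w1.grad, w2.grad)  # matches the hand-derived chain-rule gradients
```

Frameworks like PyTorch and TensorFlow follow the same pattern at scale: the forward pass builds the graph and stores activations, and backward() replays it in reverse.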