Backpropagation
The learning algorithm that trains neural networks
Imagine you're playing darts blindfolded. You throw a dart, someone tells you 'you were 10cm too far left and 5cm too high', you adjust your aim, and you try again. That's backpropagation! When a neural network makes a prediction, backpropagation works out how much each weight contributed to the error and tells the network exactly how much to adjust it. It works backwards through the network (hence 'back' propagation), using the chain rule from calculus to attribute the error to each weight. It's like following breadcrumbs backwards to find where you went wrong!
What is Backpropagation?
Backpropagation (backward propagation of errors) is the algorithm used to train neural networks by computing gradients efficiently. It propagates error signals backward through the network to obtain the gradient of the loss function with respect to each weight, which gradient descent then uses to update the weights and minimize the loss. It is the cornerstone of deep learning: without backpropagation, training deep neural networks would be computationally infeasible.
```python
# Simple Neural Network to Demonstrate Backpropagation
import numpy as np

# Simple 2-layer network: Input -> Hidden -> Output
# Goal: Learn XOR function

# Activation functions
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # Note: expects the *output* of sigmoid, i.e. x = sigmoid(z)
    return x * (1 - x)

# Training data: XOR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# Initialize weights randomly
np.random.seed(42)
input_size = 2
hidden_size = 4
output_size = 1
weights_input_hidden = np.random.randn(input_size, hidden_size)
weights_hidden_output = np.random.randn(hidden_size, output_size)
learning_rate = 0.5
epochs = 10000

# Training loop
for epoch in range(epochs):
    # === FORWARD PASS ===
    # Layer 1: Input to Hidden
    hidden_input = np.dot(X, weights_input_hidden)
    hidden_output = sigmoid(hidden_input)

    # Layer 2: Hidden to Output
    final_input = np.dot(hidden_output, weights_hidden_output)
    final_output = sigmoid(final_input)

    # Calculate loss (Mean Squared Error)
    loss = np.mean((y - final_output) ** 2)

    # === BACKWARD PASS (Backpropagation) ===
    # Output layer error and gradient
    output_error = y - final_output
    output_gradient = output_error * sigmoid_derivative(final_output)

    # Hidden layer error (propagate error backward!)
    hidden_error = output_gradient.dot(weights_hidden_output.T)
    hidden_gradient = hidden_error * sigmoid_derivative(hidden_output)

    # Update weights using the gradients
    weights_hidden_output += hidden_output.T.dot(output_gradient) * learning_rate
    weights_input_hidden += X.T.dot(hidden_gradient) * learning_rate

    if epoch % 2000 == 0:
        print(f"Epoch {epoch}: Loss = {loss:.4f}")

# Test the trained network
print("\nPredictions after training:")
for i in range(len(X)):
    hidden = sigmoid(np.dot(X[i], weights_input_hidden))
    output = sigmoid(np.dot(hidden, weights_hidden_output))
    print(f"Input: {X[i]} -> Output: {output[0]:.4f} (Target: {y[i][0]})")

# Output shows the network learned XOR using backpropagation!
# Epoch 0: Loss = 0.2816
# Epoch 2000: Loss = 0.0065
# Epoch 4000: Loss = 0.0021
# ...
# The gradients computed by backprop enabled the network to learn!
```

How Does It Work?
Backpropagation consists of two main passes through the network:
1. Forward Pass →
Compute predictions layer by layer from input to output
- Calculate: z = w·x + b
- Apply activation: a = σ(z)
- Store all intermediate values (z, a)
- Get final prediction: ŷ
- Compute loss: L(y, ŷ)
2. Backward Pass ←
Compute gradients layer by layer from output to input
- Start with the loss gradient: ∂L/∂ŷ
- Chain rule: ∂L/∂w = (∂L/∂a)·(∂a/∂z)·(∂z/∂w)
- Propagate gradients backward layer by layer
- Compute ∂L/∂w for each weight
- Use the gradients to update the weights
```python
# Detailed Forward and Backward Pass Visualization

# Simple example: 1 input -> 1 hidden neuron -> 1 output
# Goal: Learn y = 2*x

# Training data
x = 3.0
y = 6.0  # Target: 2*x

# Initialize weights
w1 = 0.5  # Input to hidden
w2 = 0.8  # Hidden to output
learning_rate = 0.1

print("=== FORWARD PASS ===")
print(f"Input: x = {x}")
print(f"Target: y = {y}")
print(f"Weights: w1 = {w1}, w2 = {w2}\n")

# Forward pass (no activation for simplicity)
h = w1 * x  # Hidden layer: h = w1 * x
print(f"Step 1: h = w1 * x = {w1} * {x} = {h}")

y_pred = w2 * h  # Output layer: y_pred = w2 * h
print(f"Step 2: y_pred = w2 * h = {w2} * {h} = {y_pred}")

loss = 0.5 * (y - y_pred) ** 2  # Loss: (1/2) * (y - y_pred)^2
print(f"Step 3: Loss = 0.5 * (y - y_pred)^2 = 0.5 * ({y} - {y_pred})^2 = {loss}\n")

print("=== BACKWARD PASS (Backpropagation) ===")
# Backward pass: compute gradients using the chain rule

# ∂Loss/∂y_pred
dL_dy_pred = -(y - y_pred)
print(f"Step 1: ∂Loss/∂y_pred = -(y - y_pred) = -({y} - {y_pred}) = {dL_dy_pred}")

# ∂Loss/∂w2 = (∂Loss/∂y_pred) * (∂y_pred/∂w2)
dy_pred_dw2 = h
dL_dw2 = dL_dy_pred * dy_pred_dw2
print(f"Step 2: ∂Loss/∂w2 = ∂Loss/∂y_pred * ∂y_pred/∂w2 = {dL_dy_pred} * {dy_pred_dw2} = {dL_dw2}")

# ∂Loss/∂h = (∂Loss/∂y_pred) * (∂y_pred/∂h)
dy_pred_dh = w2
dL_dh = dL_dy_pred * dy_pred_dh
print(f"Step 3: ∂Loss/∂h = ∂Loss/∂y_pred * ∂y_pred/∂h = {dL_dy_pred} * {dy_pred_dh} = {dL_dh}")

# ∂Loss/∂w1 = (∂Loss/∂h) * (∂h/∂w1)
dh_dw1 = x
dL_dw1 = dL_dh * dh_dw1
print(f"Step 4: ∂Loss/∂w1 = ∂Loss/∂h * ∂h/∂w1 = {dL_dh} * {dh_dw1} = {dL_dw1}\n")

print("=== WEIGHT UPDATE ===")
# Update weights using the gradients
w1_new = w1 - learning_rate * dL_dw1
w2_new = w2 - learning_rate * dL_dw2
print(f"w1 = w1 - lr * ∂Loss/∂w1 = {w1} - {learning_rate} * {dL_dw1} = {w1_new}")
print(f"w2 = w2 - lr * ∂Loss/∂w2 = {w2} - {learning_rate} * {dL_dw2} = {w2_new}")

print("\nBackpropagation computed the exact gradient for each weight!")
print("Now gradient descent can update weights to reduce loss.")
```

The Chain Rule (Mathematical Foundation)
Backpropagation relies on the chain rule from calculus to compute gradients efficiently:
Chain Rule Formula
If z = f(y) and y = g(x), then:
∂z/∂x = (∂z/∂y) × (∂y/∂x)
Example: If L = (y_pred - y)² and y_pred = w·x, find ∂L/∂w
∂L/∂w = (∂L/∂y_pred) × (∂y_pred/∂w)
∂L/∂y_pred = 2(y_pred - y)
∂y_pred/∂w = x
∂L/∂w = 2(y_pred - y) × x
The chain rule breaks down complex derivatives into simple steps that can be computed efficiently!
```python
# Chain Rule in Action: Computing Gradients

# Example: L = (y_pred - y)^2, where y_pred = w2 * (w1 * x)
# Find ∂L/∂w1 and ∂L/∂w2 using the chain rule
x = 2.0
y = 10.0
w1 = 1.5
w2 = 2.0

# Forward pass
z1 = w1 * x            # First layer
y_pred = w2 * z1       # Second layer
L = (y_pred - y) ** 2  # Loss

print("Forward Pass:")
print(f"z1 = w1 * x = {w1} * {x} = {z1}")
print(f"y_pred = w2 * z1 = {w2} * {z1} = {y_pred}")
print(f"L = (y_pred - y)^2 = ({y_pred} - {y})^2 = {L}\n")

# Backward pass using the chain rule
print("Backward Pass (Chain Rule):\n")

# ∂L/∂y_pred
dL_dy_pred = 2 * (y_pred - y)
print(f"∂L/∂y_pred = 2 * (y_pred - y) = 2 * ({y_pred} - {y}) = {dL_dy_pred}")

# ∂L/∂w2 = (∂L/∂y_pred) * (∂y_pred/∂w2)
dy_pred_dw2 = z1
dL_dw2 = dL_dy_pred * dy_pred_dw2
print(f"\n∂y_pred/∂w2 = z1 = {z1}")
print(f"∂L/∂w2 = (∂L/∂y_pred) * (∂y_pred/∂w2)")
print(f"       = {dL_dy_pred} * {dy_pred_dw2}")
print(f"       = {dL_dw2}")

# ∂L/∂z1 = (∂L/∂y_pred) * (∂y_pred/∂z1)
dy_pred_dz1 = w2
dL_dz1 = dL_dy_pred * dy_pred_dz1
print(f"\n∂y_pred/∂z1 = w2 = {w2}")
print(f"∂L/∂z1 = (∂L/∂y_pred) * (∂y_pred/∂z1)")
print(f"       = {dL_dy_pred} * {dy_pred_dz1}")
print(f"       = {dL_dz1}")

# ∂L/∂w1 = (∂L/∂z1) * (∂z1/∂w1)
dz1_dw1 = x
dL_dw1 = dL_dz1 * dz1_dw1
print(f"\n∂z1/∂w1 = x = {x}")
print(f"∂L/∂w1 = (∂L/∂z1) * (∂z1/∂w1)")
print(f"       = {dL_dz1} * {dz1_dw1}")
print(f"       = {dL_dw1}")

print("\nChain Rule Summary:")
print(f"∂L/∂w2 = {dL_dw2} (tells us how to adjust w2)")
print(f"∂L/∂w1 = {dL_dw1} (tells us how to adjust w1)")
print("\nThese gradients enable gradient descent to optimize the network!")
```

The Algorithm Steps
The complete backpropagation algorithm:
1. Initialize Weights: set random initial values for all weights and biases.
2. Forward Pass: compute predictions and store all intermediate activations (z, a).
3. Compute Loss: calculate the loss L(y, ŷ) using an appropriate loss function.
4. Backward Pass (Backprop): compute gradients ∂L/∂w for all weights using the chain rule, starting from the output.
5. Update Weights: apply gradient descent, w = w - α·∂L/∂w, for each weight.
6. Repeat: continue steps 2-5 for multiple epochs until convergence.
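The full cycle can be condensed into a short training loop. This is a minimal sketch for a single linear neuron y_pred = w*x + b with illustrative data and hyperparameters of my own choosing, not the XOR network from earlier:

```python
import numpy as np

# Steps 1-6 of the algorithm for one linear neuron (illustrative data)
rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                       # target function to learn

w, b = rng.standard_normal(2)           # Step 1: initialize weights
lr = 0.05

for epoch in range(5000):               # Step 6: repeat
    y_pred = w * x + b                  # Step 2: forward pass
    loss = np.mean((y_pred - y) ** 2)   # Step 3: compute loss (MSE)
    # Step 4: backward pass - gradients of MSE w.r.t. w and b
    dL_dw = np.mean(2 * (y_pred - y) * x)
    dL_db = np.mean(2 * (y_pred - y))
    w -= lr * dL_dw                     # Step 5: update weights
    b -= lr * dL_db

print(f"Learned w = {w:.2f}, b = {b:.2f}")  # approaches w = 2.00, b = 1.00
```

With only two parameters the "backward pass" is just two closed-form derivatives; in a multi-layer network, backpropagation computes these same quantities layer by layer via the chain rule.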
Key Concepts
Forward Pass
Compute predictions by passing input through the network layer by layer. Calculate and store intermediate values (activations) needed for backward pass.
Backward Pass
Calculate gradients by propagating error backward from output to input. Use chain rule to compute gradient of loss w.r.t. each weight.
Chain Rule
Mathematical principle that lets us compute derivatives of composite functions. Key: ∂z/∂x = (∂z/∂y) × (∂y/∂x). Allows efficient gradient computation.
Gradient
The direction and magnitude of steepest increase in loss. Negative gradient tells us how to adjust weights to reduce loss.
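A standard way to sanity-check backprop gradients is to compare them against numerical estimates from central differences. This sketch reuses the earlier toy model (h = w1*x, y_pred = w2*h, L = 0.5*(y - y_pred)²); the helper name loss_fn is my own:

```python
# Gradient check: compare analytic (chain-rule) gradients with the
# numerical estimate (f(w + eps) - f(w - eps)) / (2 * eps)
def loss_fn(w1, w2, x, y):
    y_pred = w2 * (w1 * x)
    return 0.5 * (y - y_pred) ** 2

x, y = 3.0, 6.0
w1, w2 = 0.5, 0.8
eps = 1e-6

# Analytic gradients from the chain rule
y_pred = w2 * (w1 * x)
dL_dy_pred = -(y - y_pred)
dL_dw2 = dL_dy_pred * (w1 * x)     # ∂L/∂w2 = ∂L/∂y_pred * h
dL_dw1 = dL_dy_pred * w2 * x       # ∂L/∂w1 = ∂L/∂y_pred * w2 * x

# Numerical gradients via central differences
num_dw1 = (loss_fn(w1 + eps, w2, x, y) - loss_fn(w1 - eps, w2, x, y)) / (2 * eps)
num_dw2 = (loss_fn(w1, w2 + eps, x, y) - loss_fn(w1, w2 - eps, x, y)) / (2 * eps)

print(f"w1: analytic {dL_dw1:.6f} vs numerical {num_dw1:.6f}")
print(f"w2: analytic {dL_dw2:.6f} vs numerical {num_dw2:.6f}")
```

The two estimates should agree to several decimal places; a large mismatch usually indicates a bug in the backward pass.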
Interview Tips
- 💡Explain backprop as: 'algorithm that calculates gradients by propagating errors backward through network using chain rule'
- 💡Two phases: Forward pass (compute predictions) and Backward pass (compute gradients)
- 💡Chain rule is key: ∂L/∂w = (∂L/∂y) × (∂y/∂w). Breaks complex derivatives into simple steps
- 💡Why it's efficient: one backward pass reuses the intermediate values stored during the forward pass, so all gradients together cost about as much as one extra forward pass. Computing each weight's gradient independently (e.g. one perturbed forward pass per weight) would scale with the square of the number of weights
- 💡Forward pass: store activations. Backward pass: compute gradients using stored values
- 💡Combined with gradient descent: backprop computes gradients, gradient descent updates weights
- 💡Vanishing gradient problem: gradients can become very small in deep networks, especially with sigmoid/tanh
- 💡Modern deep learning: backprop + GPU acceleration + good initialization + ReLU enables training very deep networks
- 💡Be ready to walk through a simple example: 2-layer network with one sample
- 💡Know the computational graph perspective: nodes are operations, edges carry gradients backward
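The computational-graph view can be made concrete with a tiny reverse-mode autodiff sketch. This is a hypothetical micro-implementation, not a real library: the Value class, its operators, and the graph below are my own illustration of how gradients flow backward along graph edges.

```python
# Minimal reverse-mode autodiff: each Value node records its parents and a
# closure that routes gradients backward along the graph's edges.
class Value:
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward():
            self.grad += other.data * out.grad   # d(out)/d(self) = other
            other.grad += self.data * out.grad   # d(out)/d(other) = self
        out._backward = backward
        return out

    def __sub__(self, other):
        out = Value(self.data - other.data, (self, other))
        def backward():
            self.grad += out.grad
            other.grad -= out.grad
        out._backward = backward
        return out

    def backward(self):
        # Topologically sort the graph, then run each node's backward closure
        # from the output back toward the inputs
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

# Same toy graph as the earlier example: L = 0.5 * (y - w2*(w1*x))^2
x, y = Value(3.0), Value(6.0)
w1, w2 = Value(0.5), Value(0.8)
half = Value(0.5)
err = y - w2 * (w1 * x)
L = half * (err * err)
L.backward()
print(w1.grad, w2.grad)  # matches the hand-derived chain-rule gradients
```

Frameworks like PyTorch and TensorFlow follow the same pattern at scale: the forward pass builds the graph and stores activations, and backward() replays it in reverse.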