Gradient Descent

The optimization algorithm that makes machine learning models learn

Imagine you're hiking in foggy mountains and want to reach the lowest valley, but you can't see far! What do you do? You look around, find the steepest downward direction, take a step down, and repeat. Eventually, you'll reach a valley (maybe not THE lowest, but a low point). That's exactly how Gradient Descent works! A machine learning model is trying to find the best parameters (weights) by repeatedly taking steps in the direction that reduces the error most. It's like rolling a ball down a hill until it settles at the bottom!

What is Gradient Descent?

Gradient Descent is an optimization algorithm used to minimize a function (typically a loss/cost function in machine learning). It works by iteratively adjusting parameters in the direction of steepest decrease (the negative gradient) to find the minimum value. Think of it as the learning process: the algorithm that tells the model how to adjust its weights to reduce errors and improve predictions.

```python
# Simple Gradient Descent Example: Finding the Minimum of y = x^2

# Function: y = x^2 (parabola with minimum at x=0)
def f(x):
    return x ** 2

# Gradient (derivative): dy/dx = 2x
def gradient(x):
    return 2 * x

# Gradient Descent Algorithm
def gradient_descent(start_x, learning_rate, num_iterations):
    x = start_x
    history = [x]
    for i in range(num_iterations):
        grad = gradient(x)            # Calculate gradient
        x = x - learning_rate * grad  # Update: move opposite to gradient
        history.append(x)
        print(f"Iteration {i+1}: x = {x:.4f}, f(x) = {f(x):.4f}, gradient = {grad:.4f}")
    return x, history

# Start from x=10, try to reach the minimum at x=0
start_x = 10
learning_rate = 0.1
num_iterations = 10

final_x, history = gradient_descent(start_x, learning_rate, num_iterations)
print(f"\nStarted at x = {start_x}, f(x) = {f(start_x)}")
print(f"Ended at x = {final_x:.4f}, f(x) = {f(final_x):.4f}")
print("Found minimum! (x ≈ 0)")

# Output shows x moving closer to 0 (the minimum) each iteration:
# Iteration 1: x = 8.0000, f(x) = 64.0000, gradient = 20.0000
# Iteration 2: x = 6.4000, f(x) = 40.9600, gradient = 16.0000
# Iteration 3: x = 5.1200, f(x) = 26.2144, gradient = 12.8000
# ...
# The algorithm found the minimum by following the negative gradient!
```

How Does It Work?

The algorithm follows a simple iterative process:

  1. Initialize Parameters: start with random weights/parameters (θ)
  2. Compute Loss: calculate how wrong the predictions are using a loss function J(θ)
  3. Calculate Gradient: find ∇J(θ), the direction of steepest increase (the derivative of the loss w.r.t. the parameters)
  4. Update Parameters: move in the opposite direction: θ = θ - α·∇J(θ), where α is the learning rate
  5. Repeat: continue steps 2-4 until the loss stops decreasing (convergence); see the sketch below
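In code, these five steps collapse into a single loop. Here is a minimal sketch on a toy quadratic loss; the `target` vector and the loss itself are illustrative assumptions, not part of any real model:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=2)                # 1. Initialize parameters randomly
alpha = 0.1                               # learning rate
target = np.array([3.0, -1.0])            # toy optimum (assumed for illustration)

for step in range(100):
    loss = np.sum((theta - target) ** 2)  # 2. Compute loss J(θ)
    grad = 2 * (theta - target)           # 3. Calculate gradient ∇J(θ)
    theta = theta - alpha * grad          # 4. Update: θ = θ - α·∇J(θ)
                                          # 5. Repeat until the loss stops decreasing

print(theta)  # ≈ [3.0, -1.0], the minimum of the toy loss
```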

The Core Formula

θ = θ - α · ∇J(θ)

θ: Model parameters (weights)

α: Learning rate (step size)

∇J(θ): Gradient (derivative of loss)

```python
# Complete Gradient Descent: Linear Regression Example
import numpy as np

# Dataset: Hours studied vs Exam Score
X = np.array([1, 2, 3, 4, 5])  # Hours studied
y = np.array([2, 4, 5, 4, 5])  # Exam score

# Model: y_pred = mx + b (linear regression)
# Initialize parameters (here at zero)
m = 0.0  # slope
b = 0.0  # intercept

# Hyperparameters
learning_rate = 0.01
epochs = 1000
n = len(X)

# Training using Gradient Descent
for epoch in range(epochs):
    # 1. Make predictions
    y_pred = m * X + b
    # 2. Calculate loss (Mean Squared Error)
    loss = (1/n) * np.sum((y - y_pred) ** 2)
    # 3. Calculate gradients
    dm = -(2/n) * np.sum(X * (y - y_pred))  # ∂Loss/∂m
    db = -(2/n) * np.sum(y - y_pred)        # ∂Loss/∂b
    # 4. Update parameters (Gradient Descent step!)
    m = m - learning_rate * dm
    b = b - learning_rate * db
    # Print progress
    if epoch % 100 == 0:
        print(f"Epoch {epoch}: Loss = {loss:.4f}, m = {m:.4f}, b = {b:.4f}")

print(f"\nFinal Model: y = {m:.2f}x + {b:.2f}")
print(f"Prediction for 6 hours studied: {m * 6 + b:.2f}")

# Sample output:
# Epoch 0: Loss = 17.2000, m = 0.2640, b = 0.0800
# ...
# The loss shrinks each epoch toward the least-squares minimum (MSE = 0.48),
# and the model approaches the best-fit line y = 0.60x + 2.20
# (prediction for 6 hours studied: ~5.80).
# Gradient Descent found the best-fit line by minimizing loss!
```

Types of Gradient Descent

Three main variants based on how much data is used per update:

1. Batch GD

Uses entire dataset for each update

Pros

  • Stable convergence
  • Precise gradient

Cons

  • Slow for large datasets
  • Memory intensive

2. Stochastic GD

Uses 1 sample for each update

Pros

  • Fast updates
  • Can escape local minima

Cons

  • Noisy updates
  • Oscillates near minimum

3. Mini-Batch GD ⭐

Uses small batch (e.g., 32-256 samples)

Pros (Best of Both!)

  • Fast and stable
  • GPU-efficient
  • Most widely used

💡 Default Choice

```python
# Comparing Three Types of Gradient Descent
import numpy as np

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])

def compute_gradient(X, y, theta):
    m = len(y)
    predictions = X.dot(theta)
    errors = predictions - y
    gradient = (1/m) * X.T.dot(errors)
    return gradient

# 1. BATCH GRADIENT DESCENT
def batch_gd(X, y, learning_rate=0.01, epochs=100):
    theta = np.zeros(X.shape[1])
    for epoch in range(epochs):
        gradient = compute_gradient(X, y, theta)  # Use ALL data
        theta = theta - learning_rate * gradient
    return theta

# 2. STOCHASTIC GRADIENT DESCENT
def stochastic_gd(X, y, learning_rate=0.01, epochs=100):
    theta = np.zeros(X.shape[1])
    for epoch in range(epochs):
        for i in range(len(X)):
            xi = X[i:i+1]  # Use ONE sample
            yi = y[i:i+1]
            gradient = compute_gradient(xi, yi, theta)
            theta = theta - learning_rate * gradient
    return theta

# 3. MINI-BATCH GRADIENT DESCENT
def mini_batch_gd(X, y, batch_size=2, learning_rate=0.01, epochs=100):
    theta = np.zeros(X.shape[1])
    for epoch in range(epochs):
        indices = np.random.permutation(len(X))
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        for i in range(0, len(X), batch_size):
            Xi = X_shuffled[i:i+batch_size]  # Use a BATCH of samples
            yi = y_shuffled[i:i+batch_size]
            gradient = compute_gradient(Xi, yi, theta)
            theta = theta - learning_rate * gradient
    return theta

# Compare
print("Batch GD:", batch_gd(X, y))
print("Stochastic GD:", stochastic_gd(X, y))
print("Mini-Batch GD:", mini_batch_gd(X, y))
# All converge to similar solutions, but with different speeds and stability!
```

Advanced Variants

Improved versions that address limitations of basic gradient descent:

Momentum

v = βv + ∇J(θ); θ = θ - αv

Accumulates past gradients to accelerate in consistent directions and dampen oscillations

Faster convergence, smoother updates

RMSprop

E[g²] = βE[g²] + (1-β)g²; θ = θ - α·g/(√E[g²] + ε)

Adapts learning rate per parameter based on recent gradient magnitudes

Handles sparse data, different learning rates per parameter

Adam (Adaptive Moment) ⭐

Combines Momentum + RMSprop

Computes adaptive learning rates for each parameter using both first and second moments of gradients

Most popular optimizer! Works well in practice

AdaGrad

G = G + g²; θ = θ - α·g/(√G + ε)

Adapts learning rate based on all past gradients (accumulates sum of squared gradients)

Good for sparse features (NLP, recommender systems)
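Before the full Adam implementation below, here is a minimal sketch of the Momentum and AdaGrad updates exactly as written above; the quadratic gradient function is an assumption chosen just to make the snippet runnable:

```python
import numpy as np

def grad(theta):
    return 2 * theta  # toy gradient: minimizing f(θ) = Σθ², minimum at 0

alpha, beta, eps = 0.1, 0.9, 1e-8
theta_mom = np.array([1.0, 2.0]); v = np.zeros(2)  # Momentum state
theta_ada = np.array([1.0, 2.0]); G = np.zeros(2)  # AdaGrad state

for _ in range(50):
    # Momentum: v = βv + ∇J(θ);  θ = θ - αv
    v = beta * v + grad(theta_mom)
    theta_mom = theta_mom - alpha * v
    # AdaGrad: G = G + g²;  θ = θ - α·g/(√G + ε)
    g = grad(theta_ada)
    G = G + g ** 2
    theta_ada = theta_ada - alpha * g / (np.sqrt(G) + eps)

print("Momentum:", theta_mom)  # converges quickly, with some oscillation
print("AdaGrad: ", theta_ada)  # steps shrink as squared gradients accumulate
```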

```python
# Implementing Adam Optimizer (Most Popular!)
import numpy as np

class AdamOptimizer:
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.lr = learning_rate
        self.beta1 = beta1    # Momentum decay
        self.beta2 = beta2    # RMSprop decay
        self.epsilon = epsilon
        self.m = None         # First moment (momentum)
        self.v = None         # Second moment (RMSprop)
        self.t = 0            # Time step

    def update(self, params, gradients):
        if self.m is None:
            self.m = np.zeros_like(params)
            self.v = np.zeros_like(params)
        self.t += 1
        # Update biased first moment estimate (momentum)
        self.m = self.beta1 * self.m + (1 - self.beta1) * gradients
        # Update biased second moment estimate (RMSprop)
        self.v = self.beta2 * self.v + (1 - self.beta2) * (gradients ** 2)
        # Compute bias-corrected moments
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        # Update parameters
        params = params - self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)
        return params

# Usage Example: minimize f(θ) = θ₀² + θ₁², whose gradient is 2θ
theta = np.array([1.0, 2.0])
optimizer = AdamOptimizer(learning_rate=0.01)
for iteration in range(100):
    gradients = 2 * theta  # gradient of the toy quadratic (replace with your own)
    theta = optimizer.update(theta, gradients)
print("Optimized parameters:", theta)

# Adam combines the best of Momentum and RMSprop!
# - Momentum: Accelerates in consistent directions
# - RMSprop: Adapts learning rate per parameter
# Result: Fast, stable, and one of the most widely used optimizers
#         in PyTorch, TensorFlow, and other frameworks
```

Key Concepts

Loss Function

Measures how wrong the model's predictions are. Gradient descent minimizes this function (e.g., MSE for regression, cross-entropy for classification).

Gradient

The direction and rate of steepest increase. Negative gradient points toward steepest decrease (where we want to go).

Learning Rate (α)

Step size for each iteration. Too large: overshoots minimum. Too small: slow convergence. Critical hyperparameter!

Epoch

One complete pass through the entire training dataset. Training typically requires multiple epochs.

Convergence

When the algorithm reaches a minimum (or stops improving). Loss stops decreasing significantly.

Local vs Global Minimum

Local: lowest point in a region. Global: absolute lowest point. GD can get stuck in local minima (especially in non-convex functions).
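To make convergence concrete, here is a minimal stopping check that ends training when the loss improvement falls below a tolerance; the `tol` value is an arbitrary illustrative choice:

```python
def f(x):
    return x ** 2  # same toy function as the first example

x, learning_rate, tol = 10.0, 0.1, 1e-8
prev_loss = f(x)
for step in range(10_000):
    x -= learning_rate * 2 * x       # gradient descent step (gradient of x² is 2x)
    loss = f(x)
    if abs(prev_loss - loss) < tol:  # loss stopped decreasing: converged
        print(f"Converged after {step + 1} steps: x = {x:.6f}")
        break
    prev_loss = loss
```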

Impact of Learning Rate

For the earlier f(x) = x² example, starting at x = 10:

Too High (α = 1.1)

Takes huge steps, overshoots the minimum, bounces around, and diverges

10 → -12 → 14.4 → -17.3 → ...

Just Right (α = 0.1) ✓

Takes reasonable steps, converges smoothly to the minimum

10 → 8 → 6.4 → 5.1 → 4.1 → ...

Too Low (α = 0.0001)

Takes tiny steps, very slow convergence, may take forever

10 → 9.998 → 9.996 → 9.994 → ...
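These three regimes can be reproduced directly with the f(x) = x² example; a quick sketch:

```python
# Effect of learning rate on gradient descent for f(x) = x² (gradient = 2x)
def descend(learning_rate, steps=5, x=10.0):
    path = [round(x, 4)]
    for _ in range(steps):
        x -= learning_rate * 2 * x
        path.append(round(x, 4))
    return path

print("Too high (1.1):  ", descend(1.1))     # diverges: |x| grows every step
print("Just right (0.1):", descend(0.1))     # 10 → 8 → 6.4 → 5.12 → ...
print("Too low (0.0001):", descend(0.0001))  # barely moves: 10 → 9.998 → ...
```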

Interview Tips

  • 💡Explain GD as 'finding the best model parameters by taking steps in the direction that reduces error most'
  • 💡Core formula: θ = θ - α·∇J(θ) where θ=parameters, α=learning rate, ∇J=gradient of loss function
  • 💡Know the three types: Batch (uses all data), Stochastic (uses 1 sample), Mini-Batch (uses small batch, most common)
  • 💡Learning rate is critical: too high → oscillation/divergence, too low → slow convergence
  • 💡Understand improvements: Momentum (accelerates), Adam (adaptive learning rates), RMSprop
  • 💡Challenges: local minima, saddle points, choosing learning rate, computational cost
  • 💡Why it works: for convex functions, GD converges to the global minimum (given a suitable learning rate); for non-convex functions (neural networks) there is no such guarantee, but it works well empirically
  • 💡Be ready to explain the mathematical intuition: derivative tells us which direction increases loss, so we go opposite direction