Recurrent Neural Networks (RNN)

Neural networks with memory for sequential data

Imagine you're watching a movie. To understand what's happening now, you need to remember what happened before! You can't understand 'He was shocked' without remembering the earlier scene. That's what RNNs do - they have memory! Unlike regular neural networks that treat each input independently, RNNs maintain a 'hidden state' that carries information from previous steps. When processing a sentence word by word, the RNN remembers the context of earlier words. This makes them perfect for sequences: text, time series, speech, music. The 'recurrent' part means the network processes one element at a time, updating its memory each step, creating a feedback loop. It's like having a short-term memory!

What are Recurrent Neural Networks?

Recurrent Neural Networks (RNNs) are a class of neural networks designed for sequential data. Unlike feedforward networks, RNNs have connections that form directed cycles, creating an internal 'memory' or 'hidden state'. This allows information to persist and be passed from one time step to the next. At each step, the RNN takes an input and the previous hidden state and produces an output and a new hidden state. This architecture makes RNNs naturally suited for tasks involving sequences, such as language modeling, speech recognition, time series prediction, and machine translation.

python
# Simple RNN: Understanding Sequential Processing
import numpy as np

# Simple RNN from scratch
class SimpleRNN:
    def __init__(self, input_size, hidden_size, output_size):
        # Initialize weights
        self.W_xh = np.random.randn(input_size, hidden_size) * 0.01   # Input to hidden
        self.W_hh = np.random.randn(hidden_size, hidden_size) * 0.01  # Hidden to hidden
        self.W_hy = np.random.randn(hidden_size, output_size) * 0.01  # Hidden to output
        self.b_h = np.zeros((1, hidden_size))
        self.b_y = np.zeros((1, output_size))

    def forward(self, inputs):
        """
        Process a sequence step by step.
        inputs: list of input vectors (the sequence)
        """
        h = np.zeros((1, self.W_hh.shape[0]))  # Initial hidden state
        outputs = []
        print(f"Processing sequence of length {len(inputs)}:")
        print(f"Initial hidden state: {h[0, :3].round(2)}...\n")
        # Process each time step
        for t, x in enumerate(inputs):
            # RNN core equation: h_t = tanh(W_xh·x_t + W_hh·h_{t-1} + b_h)
            h = np.tanh(x @ self.W_xh + h @ self.W_hh + self.b_h)
            # Output: y_t = W_hy·h_t + b_y
            y = h @ self.W_hy + self.b_y
            print(f"Step {t}: Input={x[0, 0]:.2f}, Hidden state={h[0, :3].round(2)}..., Output={y[0, 0]:.4f}")
            outputs.append(y)
        return outputs, h

# Example: Process a simple sequence
print("=== Simple RNN Demo ===\n")
rnn = SimpleRNN(input_size=1, hidden_size=5, output_size=1)

# Input sequence: [1, 2, 3]
sequence = [np.array([[1.0]]), np.array([[2.0]]), np.array([[3.0]])]
outputs, final_hidden = rnn.forward(sequence)

print("\nKey Observations:")
print("- Hidden state changes at each step (memory!)")
print("- Each output depends on current input AND previous hidden state")
print("- Same weights (W_xh, W_hh, W_hy) used at all steps")

# Compare: What if inputs were independent? (feedforward)
print("\n=== Why RNN vs Feedforward? ===")
print("Feedforward: Each input processed independently, no memory")
print("RNN: Hidden state connects inputs, enabling context understanding")
print("\nExample: Understanding 'I like cats. They are cute.'")
print("- Feedforward: Can't connect 'They' to 'cats'")
print("- RNN: Hidden state remembers 'cats', understands 'They' refers to cats")

How Do RNNs Work?

RNNs process sequences step by step, maintaining memory through hidden states:

  1. Initialize Hidden State: Start with h_0 = 0 (a zero vector). This is the RNN's initial 'memory'.

  2. Process Each Time Step: For each input x_t, combine it with the previous hidden state h_{t-1} to compute the new hidden state h_t.

  3. Update Hidden State: h_t = tanh(W_xh·x_t + W_hh·h_{t-1} + b_h). This is the core RNN equation!

  4. Generate Output: y_t = W_hy·h_t + b_y. The output at time t is based on the current hidden state.

  5. Repeat for All Steps: Continue for the entire sequence. Each step updates the memory (hidden state).

python
# RNN for Sequence Processing: Character-Level Language Model
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x))
    return exp_x / np.sum(exp_x)

# Example: Learn to predict next character
text = "hello"
chars = list(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for ch, i in char_to_idx.items()}
vocab_size = len(chars)
hidden_size = 10

# Initialize RNN weights
W_xh = np.random.randn(vocab_size, hidden_size) * 0.01
W_hh = np.random.randn(hidden_size, hidden_size) * 0.01
W_hy = np.random.randn(hidden_size, vocab_size) * 0.01
b_h = np.zeros((hidden_size, 1))
b_y = np.zeros((vocab_size, 1))

def forward_pass(inputs, targets):
    """
    Forward pass through the RNN.
    inputs: list of character indices
    targets: list of target character indices (next char)
    """
    h = np.zeros((hidden_size, 1))  # Initial hidden state
    loss = 0
    print(f"\nProcessing: '{text}'")
    print("Goal: Predict next character at each step\n")
    for t in range(len(inputs)):
        # One-hot encode input
        x = np.zeros((vocab_size, 1))
        x[inputs[t]] = 1
        # RNN forward step
        h = np.tanh(W_xh.T @ x + W_hh @ h + b_h)
        # Output layer
        y = W_hy.T @ h + b_y
        probs = softmax(y)
        # Loss (cross-entropy)
        loss += -np.log(probs[targets[t], 0])
        # Display
        input_char = idx_to_char[inputs[t]]
        target_char = idx_to_char[targets[t]]
        predicted_idx = np.argmax(probs)
        predicted_char = idx_to_char[predicted_idx]
        print(f"Step {t}: Input='{input_char}' -> Target='{target_char}' | Predicted='{predicted_char}' (prob={probs[targets[t], 0]:.3f})")
    return loss / len(inputs)

# Prepare data: predict next character
inputs = [char_to_idx[ch] for ch in text[:-1]]   # "hell"
targets = [char_to_idx[ch] for ch in text[1:]]   # "ello"
loss = forward_pass(inputs, targets)
print(f"\nAverage Loss: {loss:.4f}")

print("\n=== Key Insights ===")
print("1. Hidden state h carries information from previous characters")
print("2. At each step, RNN predicts next character based on:")
print("   - Current input (current character)")
print("   - Hidden state (memory of previous characters)")
print("3. After training, RNN learns patterns: 'h' -> 'e', 'l' -> 'l', etc.")
print("4. Same weights used for all steps (parameter sharing)")

# Demonstrating memory
print("\n=== RNN Memory Demo ===")
print("Input sequence: h -> e -> l -> l")
print("\nHidden state evolution (first 3 values):")
h = np.zeros((hidden_size, 1))
for t, char in enumerate("hell"):
    x = np.zeros((vocab_size, 1))
    x[char_to_idx[char]] = 1
    h = np.tanh(W_xh.T @ x + W_hh @ h + b_h)
    print(f"After '{char}': h = [{h[0,0]:.3f}, {h[1,0]:.3f}, {h[2,0]:.3f}, ...]")
print("\nHidden state changes at each step, carrying sequence information!")

RNN Architecture

The core structure of an RNN cell:

Core RNN Equations

Hidden State Update:

h_t = tanh(W_xh·x_t + W_hh·h_{t-1} + b_h)

Output Generation:

y_t = W_hy·h_t + b_y

  • x_t: Input at time t
  • h_t: Hidden state at time t (memory)
  • y_t: Output at time t
  • W_xh, W_hh, W_hy: Weight matrices (shared across time)
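
As a quick check of the shapes these equations imply, here is a minimal single-step sketch. The sizes 4, 8, and 3 are arbitrary example values (not from the text), and the weights are stored in the W·x orientation used in the equations above:

python
import numpy as np

input_size, hidden_size, output_size = 4, 8, 3  # example sizes (assumed)

# Weight matrices, shared across all time steps
W_xh = np.random.randn(hidden_size, input_size) * 0.01   # input  -> hidden
W_hh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden -> hidden
W_hy = np.random.randn(output_size, hidden_size) * 0.01  # hidden -> output
b_h = np.zeros(hidden_size)
b_y = np.zeros(output_size)

x_t = np.random.randn(input_size)   # input at time t
h_prev = np.zeros(hidden_size)      # h_{t-1}, the previous hidden state

# h_t = tanh(W_xh·x_t + W_hh·h_{t-1} + b_h)
h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
# y_t = W_hy·h_t + b_y
y_t = W_hy @ h_t + b_y

print(h_t.shape)  # (8,) -> one hidden state per step
print(y_t.shape)  # (3,) -> one output per step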

Common RNN sequence configurations:

  • One-to-Many (e.g., image captioning): single input → sequence output
  • Many-to-One (e.g., sentiment analysis): sequence input → single output (see the sketch below)
  • Many-to-Many (e.g., machine translation): sequence input → sequence output
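
A minimal many-to-one sketch in the same spirit: run the recurrence over the whole sequence and read a single output off the final hidden state. The sizes, the dummy data, and the sigmoid readout are illustrative assumptions, not part of the examples above:

python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

input_size, hidden_size = 4, 8  # example sizes (assumed)
W_xh = np.random.randn(hidden_size, input_size) * 0.01
W_hh = np.random.randn(hidden_size, hidden_size) * 0.01
w_out = np.random.randn(hidden_size) * 0.01  # single-output readout (assumed)
b_h = np.zeros(hidden_size)
b_out = 0.0

# Dummy sequence standing in for, e.g., embedded words of a review
sequence = [np.random.randn(input_size) for _ in range(6)]

h = np.zeros(hidden_size)
for x_t in sequence:                    # many inputs...
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)

score = sigmoid(w_out @ h + b_out)      # ...one output, from the final hidden state
print(f"Sentiment-style score from final hidden state: {score:.3f}")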

The Vanishing Gradient Problem

A critical challenge in training RNNs:

The Problem

During backpropagation through time (BPTT), gradients are multiplied repeatedly by the same weight matrix W_hh and by the tanh derivative (which is less than 1). This causes exponential decay: gradients shrink toward zero, making it extremely difficult to learn long-term dependencies.

gradient_t = gradient_{t+1} × W_hh × tanh'(h_t)

If |W_hh| < 1 and tanh' < 1, then the gradient → 0 exponentially!

Consequences:

  • Can't learn long-term dependencies (e.g., remember info from 20+ steps ago)
  • Early layers don't get useful gradient signal
  • Model focuses only on recent inputs, ignores distant past

Solutions

1. LSTM / GRU: Gates control information flow, preventing vanishing gradients.

2. Gradient Clipping: Clip gradients to a maximum norm to keep them from exploding (see the sketch after this list).

3. Better Initialization: Initialize W_hh as an identity or orthogonal matrix.

4. ReLU Activation: Use ReLU instead of tanh (its gradient is 1 for positive inputs).
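
A minimal sketch of solution 2, gradient clipping by global norm. The max_norm value of 5.0 is a common but arbitrary choice, and the gradient dictionary here is a stand-in for whatever BPTT actually produces:

python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Scale all gradients down if their combined L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads.values()))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-8)
        grads = {name: g * scale for name, g in grads.items()}
    return grads

# Dummy gradients standing in for dW_xh, dW_hh, dW_hy from BPTT
grads = {"dW_xh": np.random.randn(5, 3) * 10, "dW_hh": np.random.randn(5, 5) * 10}
clipped = clip_gradients(grads, max_norm=5.0)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped.values())))  # <= 5.0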

python
# Demonstrating the Vanishing Gradient Problem
import numpy as np

# Simple experiment: how gradients decay through time
def simulate_gradient_flow(sequence_length, W_value, activation_derivative):
    """
    Simulate gradient backpropagation through time.
    """
    gradient = 1.0  # Initial gradient at the output
    gradients = [gradient]
    print(f"Initial gradient: {gradient:.4f}")
    print(f"W_hh value: {W_value:.2f}, Activation derivative: {activation_derivative:.2f}\n")
    for t in range(sequence_length):
        # Gradient flows backward: multiply by W and the activation derivative
        gradient = gradient * W_value * activation_derivative
        gradients.append(gradient)
        print(f"Step {t+1}: gradient = {gradient:.6f}")
    return gradients

print("=== Vanishing Gradient Demonstration ===\n")

# Case 1: Typical RNN (W_hh < 1, tanh' < 1)
print("Case 1: Vanilla RNN (W_hh=0.9, tanh'=0.7)")
print("-" * 50)
gradients_bad = simulate_gradient_flow(
    sequence_length=20,
    W_value=0.9,
    activation_derivative=0.7
)
print(f"\nGradient after 20 steps: {gradients_bad[-1]:.10f}")
print("→ Gradient practically vanished! Can't learn from distant past.\n")

# Case 2: Better (but still problematic)
print("\nCase 2: Better initialization (W_hh=1.0, ReLU derivative=1.0)")
print("-" * 50)
gradients_better = simulate_gradient_flow(
    sequence_length=20,
    W_value=1.0,
    activation_derivative=1.0
)
print(f"\nGradient after 20 steps: {gradients_better[-1]:.4f}")
print("→ Gradient preserved! But still challenging in practice.\n")

print("=== Why This Matters ===")
print("Example: 'The cat, which was sleeping on the cozy mat, is cute.'")
print("\nTo predict 'is' (singular), the model needs to remember 'cat' (singular).")
print("But 'cat' is ~10 words back!")
print("\nWith vanishing gradients:")
print("- RNN can't learn this long-term dependency")
print("- Gradient from 'is' doesn't reach back to 'cat'")
print("- Model will make errors on long-range grammar\n")
print("Solution: LSTM/GRU with gates that prevent vanishing gradients!")

Key Concepts

Hidden State

The RNN's memory. Carries information from previous time steps. Updated at each step: h_t = tanh(W_hh·h_{t-1} + W_xh·x_t + b_h).

Sequential Processing

Process input one element at a time (word by word, frame by frame). Each step uses previous hidden state + current input.

Parameter Sharing

The same weights are used at every time step. This allows processing sequences of any length and significantly reduces parameters compared to using separate weights per step (see the sketch after this list).

Vanishing Gradient

Gradients shrink exponentially during backpropagation through time (BPTT). Makes learning long-term dependencies difficult. Why LSTM/GRU were invented.
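
To make the Parameter Sharing point concrete, here is a small sketch comparing the parameter count of one shared RNN cell with a hypothetical network that used separate weights at every time step. All sizes here are assumed examples:

python
input_size, hidden_size, output_size = 100, 128, 100  # example sizes (assumed)
seq_len = 50

# RNN: one set of shared weights, regardless of sequence length
rnn_params = (hidden_size * input_size      # W_xh
              + hidden_size * hidden_size   # W_hh
              + output_size * hidden_size   # W_hy
              + hidden_size + output_size)  # b_h, b_y

# Hypothetical "one set of weights per time step" network with no sharing
unshared_params = rnn_params * seq_len

print(f"Shared (RNN): {rnn_params:,} parameters")
print(f"Unshared x{seq_len}: {unshared_params:,} parameters")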

Interview Tips

  • 💡 Explain RNN as: 'a neural network with memory that processes sequences step by step, maintaining a hidden state'
  • 💡 Core formulas: h_t = tanh(W_hh·h_{t-1} + W_xh·x_t + b_h), y_t = W_hy·h_t + b_y. The hidden state h_t carries information forward
  • 💡 Key advantage: Can handle variable-length sequences and uses the same weights at all time steps (parameter sharing)
  • 💡 Use cases: Language modeling, machine translation, speech recognition, time series, text generation
  • 💡 Training: Backpropagation Through Time (BPTT) - unfold the RNN through time and backpropagate as in a feedforward network (a minimal sketch follows this list)
  • 💡 Vanishing gradient problem: gradients → 0 during BPTT over many steps, so the network can't learn long-term dependencies
  • 💡 Why gradients vanish: repeated multiplication by W_hh and the tanh derivative (<1) causes exponential decay
  • 💡 Solutions: LSTM/GRU (gates control information flow), gradient clipping, better initialization, skip connections
  • 💡 RNN vs Feedforward: A feedforward network treats inputs independently; an RNN maintains context/memory across the sequence
  • 💡 Types: One-to-many (image captioning), many-to-one (sentiment analysis), many-to-many (translation, video classification)
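
As referenced in the BPTT tip above, here is a minimal sketch of one training step with backpropagation through time for the character-level setup from earlier. The learning rate, the single SGD update, and the W·x weight orientation (transposed relative to the earlier forward-pass example) are illustrative assumptions:

python
import numpy as np

# Tiny setup mirroring the character-level example above
text = "hello"
chars = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
vocab_size, hidden_size = len(chars), 10

W_xh = np.random.randn(hidden_size, vocab_size) * 0.01
W_hh = np.random.randn(hidden_size, hidden_size) * 0.01
W_hy = np.random.randn(vocab_size, hidden_size) * 0.01
b_h = np.zeros((hidden_size, 1))
b_y = np.zeros((vocab_size, 1))

def bptt_step(inputs, targets, h_prev, lr=0.1):
    """One forward pass + backprop through time + SGD update (illustrative)."""
    xs, hs, ps = {}, {-1: h_prev}, {}
    loss = 0.0
    # ---- forward: unfold the RNN through time ----
    for t in range(len(inputs)):
        xs[t] = np.zeros((vocab_size, 1))
        xs[t][inputs[t]] = 1                             # one-hot input
        hs[t] = np.tanh(W_xh @ xs[t] + W_hh @ hs[t - 1] + b_h)
        y = W_hy @ hs[t] + b_y
        ps[t] = np.exp(y - np.max(y)); ps[t] /= np.sum(ps[t])  # softmax
        loss += -np.log(ps[t][targets[t], 0])            # cross-entropy
    # ---- backward: accumulate gradients from the last step to the first ----
    dW_xh, dW_hh, dW_hy = np.zeros_like(W_xh), np.zeros_like(W_hh), np.zeros_like(W_hy)
    db_h, db_y = np.zeros_like(b_h), np.zeros_like(b_y)
    dh_next = np.zeros((hidden_size, 1))
    for t in reversed(range(len(inputs))):
        dy = ps[t].copy(); dy[targets[t]] -= 1           # softmax + cross-entropy gradient
        dW_hy += dy @ hs[t].T; db_y += dy
        dh = W_hy.T @ dy + dh_next                       # gradient flowing into h_t
        dh_raw = (1 - hs[t] ** 2) * dh                   # back through tanh
        dW_xh += dh_raw @ xs[t].T; dW_hh += dh_raw @ hs[t - 1].T; db_h += dh_raw
        dh_next = W_hh.T @ dh_raw                        # pass gradient to step t-1
    # ---- simple SGD update (gradient clipping would go here) ----
    for param, grad in [(W_xh, dW_xh), (W_hh, dW_hh), (W_hy, dW_hy), (b_h, db_h), (b_y, db_y)]:
        param -= lr * grad
    return loss, hs[len(inputs) - 1]

inputs = [char_to_idx[ch] for ch in text[:-1]]   # "hell"
targets = [char_to_idx[ch] for ch in text[1:]]   # "ello"
loss, h = bptt_step(inputs, targets, np.zeros((hidden_size, 1)))
print(f"Total loss after one BPTT step: {loss:.4f}")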