Recurrent Neural Networks (RNN)
Neural networks with memory for sequential data
Imagine you're watching a movie. To understand what's happening now, you need to remember what happened before! You can't understand 'He was shocked' without remembering the earlier scene. That's what RNNs do - they have memory! Unlike regular neural networks that treat each input independently, RNNs maintain a 'hidden state' that carries information from previous steps. When processing a sentence word by word, the RNN remembers the context of earlier words. This makes them perfect for sequences: text, time series, speech, music. The 'recurrent' part means the network processes one element at a time, updating its memory each step, creating a feedback loop. It's like having a short-term memory!
What are Recurrent Neural Networks?
Recurrent Neural Networks (RNNs) are a class of neural networks designed for sequential data. Unlike feedforward networks, RNNs have connections that form directed cycles, creating an internal 'memory' or 'hidden state'. This allows information to persist and be passed from one time step to the next. At each step, the RNN takes an input and the previous hidden state, and produces an output and a new hidden state. This architecture makes RNNs naturally suited for tasks involving sequences, such as language modeling, speech recognition, time series prediction, and machine translation.
```python
# Simple RNN: Understanding Sequential Processing
import numpy as np

# Simple RNN from scratch
class SimpleRNN:
    def __init__(self, input_size, hidden_size, output_size):
        # Initialize weights
        self.W_xh = np.random.randn(input_size, hidden_size) * 0.01   # Input to hidden
        self.W_hh = np.random.randn(hidden_size, hidden_size) * 0.01  # Hidden to hidden
        self.W_hy = np.random.randn(hidden_size, output_size) * 0.01  # Hidden to output
        self.b_h = np.zeros((1, hidden_size))
        self.b_y = np.zeros((1, output_size))

    def forward(self, inputs):
        """
        Process a sequence step by step.
        inputs: list of input vectors (the sequence)
        """
        h = np.zeros((1, self.W_hh.shape[0]))  # Initial hidden state
        outputs = []

        print(f"Processing sequence of length {len(inputs)}:")
        print(f"Initial hidden state: {h[0, :3].round(2)}...\n")

        # Process each time step
        for t, x in enumerate(inputs):
            # RNN core equation: h_t = tanh(W_xh·x_t + W_hh·h_{t-1} + b_h)
            h = np.tanh(x @ self.W_xh + h @ self.W_hh + self.b_h)

            # Output: y_t = W_hy·h_t + b_y
            y = h @ self.W_hy + self.b_y

            print(f"Step {t}: Input={x[0, 0]:.2f}, Hidden state={h[0, :3].round(2)}..., Output={y[0, 0]:.4f}")
            outputs.append(y)

        return outputs, h

# Example: Process a simple sequence
print("=== Simple RNN Demo ===\n")
rnn = SimpleRNN(input_size=1, hidden_size=5, output_size=1)

# Input sequence: [1, 2, 3]
sequence = [np.array([[1.0]]), np.array([[2.0]]), np.array([[3.0]])]
outputs, final_hidden = rnn.forward(sequence)

print("\nKey Observation:")
print("- Hidden state changes at each step (memory!)")
print("- Each output depends on current input AND previous hidden state")
print("- Same weights (W_xh, W_hh, W_hy) used at all steps")

# Compare: What if inputs were independent? (feedforward)
print("\n=== Why RNN vs Feedforward? ===")
print("Feedforward: Each input processed independently, no memory")
print("RNN: Hidden state connects inputs, enabling context understanding")
print("\nExample: Understanding 'I like cats. They are cute.'")
print("- Feedforward: Can't connect 'They' to 'cats'")
print("- RNN: Hidden state remembers 'cats', understands 'They' refers to cats")
```
How Do RNNs Work?
RNNs process sequences step by step, maintaining memory through hidden states:
1. Initialize Hidden State: Start with h_0 = 0 (a zero vector). This is the RNN's initial 'memory'.
2. Process Each Time Step: For each input x_t, combine it with the previous hidden state h_{t-1} to compute the new hidden state h_t.
3. Update Hidden State: h_t = tanh(W_xh·x_t + W_hh·h_{t-1} + b_h). This is the core RNN equation!
4. Generate Output: y_t = W_hy·h_t + b_y. The output at time t depends on the current hidden state.
5. Repeat for All Steps: Continue through the entire sequence; each step updates the memory (hidden state).
```python
# RNN for Sequence Processing: Character-Level Language Model
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x))
    return exp_x / np.sum(exp_x)

# Example: Learn to predict the next character
text = "hello"
chars = list(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for ch, i in char_to_idx.items()}
vocab_size = len(chars)
hidden_size = 10

# Initialize RNN weights
W_xh = np.random.randn(vocab_size, hidden_size) * 0.01
W_hh = np.random.randn(hidden_size, hidden_size) * 0.01
W_hy = np.random.randn(hidden_size, vocab_size) * 0.01
b_h = np.zeros((hidden_size, 1))
b_y = np.zeros((vocab_size, 1))

def forward_pass(inputs, targets):
    """
    Forward pass through the RNN.
    inputs: list of character indices
    targets: list of target character indices (the next char)
    """
    h = np.zeros((hidden_size, 1))  # Initial hidden state
    loss = 0

    print(f"\nProcessing: '{text}'")
    print("Goal: Predict next character at each step\n")

    for t in range(len(inputs)):
        # One-hot encode input
        x = np.zeros((vocab_size, 1))
        x[inputs[t]] = 1

        # RNN forward step
        h = np.tanh(W_xh.T @ x + W_hh @ h + b_h)

        # Output layer
        y = W_hy.T @ h + b_y
        probs = softmax(y)

        # Loss (cross-entropy)
        loss += -np.log(probs[targets[t], 0])

        # Display
        input_char = idx_to_char[inputs[t]]
        target_char = idx_to_char[targets[t]]
        predicted_idx = np.argmax(probs)
        predicted_char = idx_to_char[predicted_idx]
        print(f"Step {t}: Input='{input_char}' -> Target='{target_char}' | Predicted='{predicted_char}' (prob={probs[targets[t], 0]:.3f})")

    return loss / len(inputs)

# Prepare data: predict the next character
inputs = [char_to_idx[ch] for ch in text[:-1]]  # "hell"
targets = [char_to_idx[ch] for ch in text[1:]]  # "ello"

loss = forward_pass(inputs, targets)
print(f"\nAverage Loss: {loss:.4f}")

print("\n=== Key Insights ===")
print("1. Hidden state h carries information from previous characters")
print("2. At each step, RNN predicts next character based on:")
print("   - Current input (current character)")
print("   - Hidden state (memory of previous characters)")
print("3. After training, RNN learns patterns: 'h' -> 'e', 'l' -> 'l', etc.")
print("4. Same weights used for all steps (parameter sharing)")

# Demonstrating memory
print("\n=== RNN Memory Demo ===")
print("Input sequence: h -> e -> l -> l")
print("\nHidden state evolution (first 3 values):")
h = np.zeros((hidden_size, 1))
for t, char in enumerate("hell"):
    x = np.zeros((vocab_size, 1))
    x[char_to_idx[char]] = 1
    h = np.tanh(W_xh.T @ x + W_hh @ h + b_h)
    print(f"After '{char}': h = [{h[0,0]:.3f}, {h[1,0]:.3f}, {h[2,0]:.3f}, ...]")
print("\nHidden state changes at each step, carrying sequence information!")
```
RNN Architecture
The core structure of an RNN cell:
Core RNN Equations
Hidden State Update:
h_t = tanh(W_xh · x_t + W_hh · h_{t-1} + b_h)
Output Generation:
y_t = W_hy · h_t + b_y
• x_t: Input at time t
• h_t: Hidden state at time t (memory)
• y_t: Output at time t
• W_xh, W_hh, W_hy: Weight matrices (shared across time)
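These two equations are the entire vanilla RNN cell. As a quick sanity check (not part of the original walkthrough), here is a minimal sketch assuming PyTorch is available: it runs torch.nn.RNN and then reproduces its hidden states by applying the tanh update by hand with the layer's own weights.

```python
# A minimal sketch, assuming PyTorch is installed; not part of the original article.
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size, seq_len = 3, 5, 4

rnn = nn.RNN(input_size, hidden_size, batch_first=True)  # tanh activation by default
x = torch.randn(1, seq_len, input_size)                  # (batch, time, features)
h0 = torch.zeros(1, 1, hidden_size)                      # (num_layers, batch, hidden)
output, h_n = rnn(x, h0)                                 # output holds h_t for every step t

# Reproduce the same hidden states by hand:
# h_t = tanh(W_ih · x_t + b_ih + W_hh · h_{t-1} + b_hh)
W_ih, W_hh = rnn.weight_ih_l0, rnn.weight_hh_l0
b_ih, b_hh = rnn.bias_ih_l0, rnn.bias_hh_l0

h = torch.zeros(hidden_size)
for t in range(seq_len):
    h = torch.tanh(W_ih @ x[0, t] + b_ih + W_hh @ h + b_hh)
    print(f"step {t}: max |difference| vs nn.RNN = {(h - output[0, t]).abs().max():.2e}")
```

PyTorch stores separate input and hidden biases (b_ih and b_hh); their sum plays the role of b_h in the equation above.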
Common RNN input/output configurations:
- One-to-Many: single input → sequence output (e.g., image captioning)
- Many-to-One: sequence input → single output (e.g., sentiment analysis)
- Many-to-Many: sequence input → sequence output (e.g., machine translation)
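A short sketch of how the same recurrent core serves all three patterns (assuming PyTorch; the layer sizes and output heads below are made up purely for illustration):

```python
# Hypothetical sizes and output heads, assuming PyTorch; for illustration only.
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(2, 10, 8)            # batch of 2 sequences, 10 time steps, 8 features
outputs, h_n = rnn(x)                # outputs: (2, 10, 16), h_n: (1, 2, 16)

# Many-to-one (e.g., sentiment analysis): read only the final hidden state
sentiment_head = nn.Linear(16, 2)            # 2 classes (hypothetical)
sentiment_logits = sentiment_head(h_n[-1])   # (2, 2)

# Many-to-many (e.g., tagging or translation-style output): read every hidden state
step_head = nn.Linear(16, 5)                 # 5 output classes per step (hypothetical)
step_logits = step_head(outputs)             # (2, 10, 5)

# One-to-many (e.g., image captioning): feed one input (such as an image embedding)
# to initialize the state, then generate outputs one step at a time.
print(sentiment_logits.shape, step_logits.shape)
```

The recurrent layer is identical in all three cases; what differs is which hidden states are read and how inputs are fed in.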
The Vanishing Gradient Problem
A critical challenge in training RNNs:
The Problem
During backpropagation through time (BPTT), the gradient is multiplied repeatedly by the same recurrent weight matrix W_hh and by tanh derivatives (which are at most 1). These repeated multiplications cause exponential decay: the gradient shrinks toward zero, making it very difficult to learn long-term dependencies.
gradient_t = gradient_{t+1} × W_hh × tanh'(h_t)
If |W_hh| < 1 (for the matrix case: if its largest singular value is below 1) and tanh' < 1, the gradient decays toward 0 exponentially; if W_hh is well above 1, the gradient can explode instead.
Consequences:
- Can't learn long-term dependencies (e.g., remembering information from 20+ steps ago)
- Early time steps (the "early layers" of the unrolled network) don't receive a useful gradient signal
- The model focuses only on recent inputs and ignores the distant past
Solutions
1. LSTM / GRU: Gates control the flow of information, mitigating the vanishing gradient.
2. Gradient Clipping: Clip gradients to a maximum norm to keep them from exploding (see the sketch after this list).
3. Better Initialization: Initialize W_hh as an identity or orthogonal matrix.
4. ReLU Activation: Use ReLU instead of tanh (its gradient is 1 for positive values).
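To make solution 2 concrete, here is a minimal sketch of gradient clipping inside one training step, assuming PyTorch; the model, loss, and data are placeholders rather than anything from the article.

```python
# A minimal sketch of gradient clipping in a training step, assuming PyTorch.
import torch
import torch.nn as nn

model = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(4, 50, 8)        # hypothetical batch: 4 sequences of 50 steps
target = torch.randn(4, 50, 16)  # hypothetical regression target

outputs, _ = model(x)
loss = nn.functional.mse_loss(outputs, target)

optimizer.zero_grad()
loss.backward()
# Rescale gradients so their global norm is at most 1.0. This bounds exploding
# gradients; it does not fix vanishing gradients (that is what LSTM/GRU gates are for).
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```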
```python
# Demonstrating Vanishing Gradient Problem
import numpy as np

# Simple experiment: How gradients decay through time
def simulate_gradient_flow(sequence_length, W_value, activation_derivative):
    """
    Simulate gradient backpropagation through time
    """
    gradient = 1.0  # Initial gradient at output
    gradients = [gradient]

    print(f"Initial gradient: {gradient:.4f}")
    print(f"W_hh value: {W_value:.2f}, Activation derivative: {activation_derivative:.2f}\n")

    for t in range(sequence_length):
        # Gradient flows backward: multiply by W and activation derivative
        gradient = gradient * W_value * activation_derivative
        gradients.append(gradient)
        print(f"Step {t+1}: gradient = {gradient:.6f}")

    return gradients

print("=== Vanishing Gradient Demonstration ===\n")

# Case 1: Typical RNN (W_hh < 1, tanh' < 1)
print("Case 1: Vanilla RNN (W_hh=0.9, tanh'=0.7)")
print("-" * 50)
gradients_bad = simulate_gradient_flow(
    sequence_length=20,
    W_value=0.9,
    activation_derivative=0.7
)
print(f"\nGradient after 20 steps: {gradients_bad[-1]:.10f}")
print("→ Gradient practically vanished! Can't learn from distant past.\n")

# Case 2: Better (but still problematic)
print("\nCase 2: Better initialization (W_hh=1.0, ReLU derivative=1.0)")
print("-" * 50)
gradients_better = simulate_gradient_flow(
    sequence_length=20,
    W_value=1.0,
    activation_derivative=1.0
)
print(f"\nGradient after 20 steps: {gradients_better[-1]:.4f}")
print("→ Gradient preserved! But still challenging in practice.\n")

print("=== Why This Matters ===")
print("Example: 'The cat, which was sleeping on the cozy mat, is cute.'")
print("\nTo predict 'is' (singular), the model needs to remember 'cat' (singular).")
print("But 'cat' is ~10 words back!")
print("\nWith vanishing gradient:")
print("- RNN can't learn this long-term dependency")
print("- Gradient from 'is' doesn't reach back to 'cat'")
print("- Model will make errors on long-range grammar\n")
print("Solution: LSTM/GRU with gates that prevent vanishing gradient!")
```
Key Concepts
Hidden State
The RNN's memory. Carries information from previous time steps. Updated at each step: h_t = tanh(W_hh·h_{t-1} + W_xh·x_t + b).
Sequential Processing
Process input one element at a time (word by word, frame by frame). Each step uses previous hidden state + current input.
Parameter Sharing
Same weights are used at every time step. This allows processing sequences of any length and significantly reduces parameters compared to using separate weights per step (see the short sketch after these key concepts).
Vanishing Gradient
Gradients shrink exponentially during backpropagation through time (BPTT). Makes learning long-term dependencies difficult. Why LSTM/GRU were invented.
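To make the 'Parameter Sharing' point concrete, here is a short sketch (assuming PyTorch) showing that one set of recurrent weights handles sequences of very different lengths with no change to the model:

```python
# A minimal sketch of parameter sharing across time steps, assuming PyTorch.
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
print("Parameter count:", sum(p.numel() for p in rnn.parameters()))  # fixed, independent of sequence length

short_seq = torch.randn(1, 3, 4)   # 3 time steps
long_seq = torch.randn(1, 50, 4)   # 50 time steps

out_short, _ = rnn(short_seq)      # (1, 3, 8)
out_long, _ = rnn(long_seq)        # (1, 50, 8) - the same weights are reused at every step
print(out_short.shape, out_long.shape)
```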
Interview Tips
- 💡Explain RNN as: 'neural network with memory that processes sequences step-by-step, maintaining hidden state'
- 💡Core formula: h_t = tanh(W_hh·h_{t-1} + W_xh·x_t + b), y_t = W_hy·h_t. h_t is hidden state, carries information
- 💡Key advantage: Can handle variable-length sequences. Uses same weights at all time steps (parameter sharing)
- 💡Use cases: Language modeling, machine translation, speech recognition, time series, text generation
- 💡Training: Backpropagation Through Time (BPTT) - unfold the RNN through time and backpropagate as in a feedforward network (see the minimal BPTT sketch after these tips)
- 💡Vanishing gradient problem: gradients → 0 during BPTT through many steps. Can't learn long-term dependencies
- 💡Why vanishing gradient: repeated multiplication by W and tanh derivative (<1) causes exponential decay
- 💡Solutions: LSTM/GRU (gates control information flow), gradient clipping, better initialization, skip connections
- 💡RNN vs Feedforward: Feedforward treats inputs independently. RNN maintains context/memory across sequence
- 💡Types: One-to-many (image captioning), many-to-one (sentiment analysis), many-to-many (translation, video classification)
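Since the tips above mention BPTT but the code in this section only runs forward passes, here is a minimal BPTT training sketch. It re-creates the tiny 'hello' setup from earlier and uses plain SGD with gradient clipping; it is an illustration under those assumptions, not production code. The printed average loss should drop over the training steps.

```python
# A minimal BPTT training sketch for the tiny character-level RNN (illustration only).
import numpy as np

np.random.seed(0)
text = "hello"
chars = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
vocab_size, hidden_size = len(chars), 10

# Shared parameters, same shapes as in the earlier character-level example
W_xh = np.random.randn(vocab_size, hidden_size) * 0.01
W_hh = np.random.randn(hidden_size, hidden_size) * 0.01
W_hy = np.random.randn(hidden_size, vocab_size) * 0.01
b_h = np.zeros((hidden_size, 1))
b_y = np.zeros((vocab_size, 1))
params = {"W_xh": W_xh, "W_hh": W_hh, "W_hy": W_hy, "b_h": b_h, "b_y": b_y}

def loss_and_grads(inputs, targets):
    """Forward pass over the sequence, then backpropagation through time."""
    xs, hs, ps = {}, {-1: np.zeros((hidden_size, 1))}, {}
    loss = 0.0
    for t in range(len(inputs)):                         # forward: unroll through time
        xs[t] = np.zeros((vocab_size, 1))
        xs[t][inputs[t]] = 1                             # one-hot input
        hs[t] = np.tanh(W_xh.T @ xs[t] + W_hh @ hs[t - 1] + b_h)
        y = W_hy.T @ hs[t] + b_y
        ps[t] = np.exp(y - np.max(y))
        ps[t] /= np.sum(ps[t])                           # softmax probabilities
        loss += -np.log(ps[t][targets[t], 0])
    grads = {name: np.zeros_like(p) for name, p in params.items()}
    dh_next = np.zeros((hidden_size, 1))
    for t in reversed(range(len(inputs))):               # backward: walk the steps in reverse
        dy = ps[t].copy()
        dy[targets[t]] -= 1                              # softmax + cross-entropy gradient
        grads["W_hy"] += hs[t] @ dy.T
        grads["b_y"] += dy
        dh = W_hy @ dy + dh_next                         # gradient into h_t (from output and from the future)
        dh_raw = (1 - hs[t] ** 2) * dh                   # backprop through tanh
        grads["b_h"] += dh_raw
        grads["W_xh"] += xs[t] @ dh_raw.T
        grads["W_hh"] += dh_raw @ hs[t - 1].T
        dh_next = W_hh.T @ dh_raw                        # carry the gradient to the previous step
    return loss / len(inputs), grads

inputs = [char_to_idx[ch] for ch in text[:-1]]           # "hell"
targets = [char_to_idx[ch] for ch in text[1:]]           # "ello"

learning_rate = 0.1
for step in range(501):
    loss, grads = loss_and_grads(inputs, targets)
    for name in params:
        np.clip(grads[name], -5, 5, out=grads[name])     # gradient clipping
        params[name] -= learning_rate * grads[name]      # plain SGD, updates the shared weights in place
    if step % 100 == 0:
        print(f"step {step:3d}: average loss = {loss:.4f}")
```

The backward loop mirrors the forward loop in reverse: at each step it accumulates gradients for the shared weights and passes dh_next back to the previous time step, which is exactly where the repeated multiplications by W_hh behind the vanishing gradient problem come from.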