Long Short-Term Memory (LSTM)

Understanding LSTMs for sequential data and time series

Imagine trying to remember a long story! Regular neural networks forget the beginning by the time they reach the end. LSTM is like having a notebook that decides what's important to remember, what to forget, and what to pay attention to right now. It has special 'gates' (like doors) that control memory - perfect for understanding sequences like text, speech, or stock prices!

What is LSTM?

LSTM (Long Short-Term Memory) is a special type of Recurrent Neural Network (RNN) designed to learn long-term dependencies in sequential data. Introduced by Hochreiter and Schmidhuber in 1997, LSTMs solve the vanishing gradient problem that plagued traditional RNNs, making them capable of learning from sequences spanning hundreds of time steps.

The Problem: Vanishing Gradient

Traditional RNNs suffer from the vanishing gradient problem: during backpropagation through time, gradients become exponentially small, making it impossible to learn long-range dependencies. For example, to predict the last word of 'The clouds are in the sky', an RNN may have already forgotten 'clouds' by the time it reaches the end of the sentence, losing the context it needs.
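
To see how quickly this happens, here is a tiny numeric sketch; the per-step factor of 0.9 is an illustrative stand-in for the repeated Jacobian-like terms that backpropagation through time multiplies together.

# Minimal sketch: why gradients vanish in a plain RNN.
# Backpropagation through time multiplies one per-step factor per time step;
# if those factors are typically below 1, the product shrinks exponentially.
grad = 1.0
factor = 0.9              # illustrative per-step factor (|W| * |tanh'(.)| < 1)
for t in range(100):      # 100 time steps
    grad *= factor

print(f"Gradient signal after 100 steps: {grad:.2e}")  # ~2.66e-05, effectively zero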

The Solution: LSTM Architecture

LSTMs introduce a 'cell state' (long-term memory) and gating mechanisms that allow information to flow unchanged through many time steps. The gates learn when to add information, when to forget it, and when to read from memory.

LSTM Architecture

An LSTM cell maintains two states, the cell state (C_t) and the hidden state (h_t), and uses three gates (forget, input, output) to control them.

Cell State (C_t)

The 'memory' of the LSTM - information highway that runs through the entire chain

Flows straight down the cell with only minor linear interactions, allowing gradients to flow unchanged. This solves the vanishing gradient problem.

Hidden State (h_t)

The 'short-term memory' - output of the LSTM at time step t

Computed from cell state using the output gate. This is what gets passed to the next layer or used for predictions.

Forget Gate (f_t)

Decides what information to discard from cell state

Looks at h_{t-1} and x_t and outputs a number between 0 (completely forget) and 1 (completely keep) for each value in the cell state.

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

Input Gate (i_t)

Decides what new information to add to cell state

Has two parts: a sigmoid layer decides which values to update, and a tanh layer creates candidate values (C̃_t) to add.

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)

Cell State Update

Combines forget and input gates to update cell state

Multiply old state by forget gate (forgetting), add input gate times candidate values (adding new).

C_t = f_t * C_{t-1} + i_t * C̃_t

Output Gate (o_t)

Decides what to output based on cell state

Filters the cell state to decide which parts to expose as the hidden state h_t, the output at this time step.

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)
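
The equations above can be wired together into a single forward step. Below is a minimal NumPy sketch of one LSTM cell update; the toy dimensions, weight initialization, and the concatenation of [h_{t-1}, x_t] into one vector are illustrative assumptions, and real frameworks manage these weights internally.

# Minimal NumPy sketch of one LSTM cell step, following the equations above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)            # forget gate
    i_t = sigmoid(W_i @ z + b_i)            # input gate
    C_tilde = np.tanh(W_C @ z + b_C)        # candidate values
    C_t = f_t * C_prev + i_t * C_tilde      # cell state update
    o_t = sigmoid(W_o @ z + b_o)            # output gate
    h_t = o_t * np.tanh(C_t)                # hidden state / output
    return h_t, C_t

# Toy dimensions: input size 3, hidden size 4 (illustrative)
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4

def rand_W():
    return rng.normal(scale=0.1, size=(n_hid, n_hid + n_in))

zeros = np.zeros(n_hid)
h_t, C_t = lstm_step(
    rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid),
    rand_W(), rand_W(), rand_W(), rand_W(),
    zeros, zeros, zeros, zeros,
)
print(h_t.shape, C_t.shape)  # (4,) (4,)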

Code Examples

python
# LSTM Implementation for Sequence Prediction
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding

# 1. SIMPLE LSTM FOR BINARY CLASSIFICATION
print("=== Example 1: Sentiment Analysis ===")
# Suppose we have movie reviews
# Input: sequence of word indices
# Output: positive (1) or negative (0)
model_sentiment = Sequential([
    # Embedding layer: converts word indices to dense vectors
    Embedding(input_dim=10000, output_dim=128, input_length=100),
    # LSTM layer with 64 units
    # return_sequences=False (the default): only return output at the last time step
    LSTM(64, dropout=0.2, recurrent_dropout=0.2),
    # Dense output layer
    Dense(1, activation='sigmoid')
])
model_sentiment.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)
model_sentiment.summary()

# 2. STACKED LSTM FOR SEQUENCE-TO-SEQUENCE
print("\n=== Example 2: Text Generation ===")
# Multi-layer LSTM for more complex patterns
# (e.g. generation trained on a corpus such as Shakespeare text)
model_generation = Sequential([
    # First LSTM layer - return_sequences=True to feed the full sequence into the next LSTM
    LSTM(128, return_sequences=True, input_shape=(None, 100)),
    # Second LSTM layer - multiple layers can be stacked
    LSTM(128, return_sequences=True),
    # Final LSTM layer returns only the last output
    LSTM(64),
    # Output layer over the vocabulary (10,000 words)
    Dense(10000, activation='softmax')
])
model_generation.compile(
    loss='categorical_crossentropy',
    optimizer='adam'
)

# 3. BIDIRECTIONAL LSTM
print("\n=== Example 3: Named Entity Recognition ===")
from tensorflow.keras.layers import Bidirectional
# Bidirectional LSTM processes the sequence forward and backward
# Good for tasks where future context matters (like NER)
model_ner = Sequential([
    Embedding(input_dim=5000, output_dim=128, input_length=50),
    # Bidirectional LSTM doubles the output dimension (128*2=256)
    Bidirectional(LSTM(128, return_sequences=True)),
    # Another bidirectional layer
    Bidirectional(LSTM(64, return_sequences=True)),
    # Time-distributed dense layer (one prediction per time step, 9 entity types)
    tf.keras.layers.TimeDistributed(Dense(9, activation='softmax'))
])
model_ner.compile(
    loss='categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

# 4. LSTM FOR TIME SERIES FORECASTING
print("\n=== Example 4: Stock Price Prediction ===")
# Predicting the next value in a sequence
model_timeseries = Sequential([
    # LSTM for time series: 60 time steps, 3 features (open, high, low)
    LSTM(50, activation='relu', return_sequences=True, input_shape=(60, 3)),
    LSTM(50, activation='relu'),
    Dense(1)  # Predict the next close price
])
model_timeseries.compile(
    optimizer='adam',
    loss='mse'
)
# Example training data shape: (samples, timesteps, features)
# X_train shape: (1000, 60, 3) - 1000 samples, 60 time steps, 3 features
# y_train shape: (1000, 1)     - next value to predict

print("\nKey LSTM Parameters:")
print("- units: Number of LSTM units (dimensionality of the output)")
print("- return_sequences: True for stacked LSTMs, False for the final layer")
print("- dropout: Dropout rate for inputs (prevents overfitting)")
print("- recurrent_dropout: Dropout for recurrent connections")
print("- stateful: Whether to carry state over from the previous batch")

Real-World Applications

Natural Language Processing

  • Machine Translation: Google Translate originally used LSTMs (now uses Transformers)
  • Text Generation: Generating Shakespeare-style text, code completion
  • Sentiment Analysis: Analyzing customer reviews, social media sentiment
  • Named Entity Recognition: Extracting names, locations from text
  • Question Answering: Understanding context in long passages

Speech Processing

  • Speech Recognition: Converting speech to text (Siri, Google Assistant)
  • Speech Synthesis: Text-to-speech systems
  • Speaker Identification: Identifying who is speaking
  • Emotion Recognition: Detecting emotions from voice

Time Series Forecasting

  • Stock Price Prediction: Analyzing historical prices to forecast trends
  • Weather Forecasting: Predicting temperature, rainfall based on historical data
  • Energy Demand Forecasting: Predicting electricity consumption
  • Traffic Prediction: Estimating traffic patterns

Video Analysis

  • Action Recognition: Identifying actions in video sequences
  • Video Captioning: Generating descriptions of video content
  • Anomaly Detection: Detecting unusual patterns in surveillance video

LSTM vs RNN vs GRU vs Transformers

Standard RNN

Pros:

  • Simple architecture
  • Fast to train
  • Good for short sequences

Cons:

  • Vanishing gradient problem
  • Cannot learn long-term dependencies
  • Poor for sequences >10 steps

Use When: Short sequences with simple patterns (rarely used in practice)

LSTM

Pros:

  • Solves vanishing gradient
  • Learns long-term dependencies
  • Proven track record
  • Works well for sequences up to ~100 steps

Cons:

  • More parameters than RNN
  • Slower to train than GRU
  • Still struggles with very long sequences (>1000 steps)

Use When: Sequential tasks requiring long-term memory: translation, speech recognition, time series

GRU (Gated Recurrent Unit)

Pros:

  • Simpler than LSTM (fewer parameters)
  • Faster to train
  • Often similar performance to LSTM

Cons:

  • Slightly less expressive than LSTM
  • May underperform on very complex tasks

Use When: When you want LSTM-like performance with faster training (good starting point)
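
Because GRU is usually a drop-in replacement for LSTM in Keras, switching is often a one-line change. A small sketch reusing the sentiment-analysis setup from the code examples above; the layer sizes are illustrative:

# Sketch: swapping LSTM for GRU in Keras (sizes match Example 1 above, purely illustrative)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dense, Embedding

model_gru = Sequential([
    Embedding(input_dim=10000, output_dim=128, input_length=100),
    GRU(64, dropout=0.2, recurrent_dropout=0.2),   # GRU instead of LSTM
    Dense(1, activation='sigmoid')
])
model_gru.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])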

Transformers (Attention)

Pros:

  • Handles very long sequences
  • Parallel training (much faster)
  • State-of-the-art for NLP
  • No vanishing gradient

Cons:

  • Requires more data
  • More memory intensive
  • Quadratic complexity in sequence length

Use When: Long sequences (>100 steps), especially NLP tasks (GPT, BERT use this)

Key Concepts

Vanishing Gradient Solution

Cell state allows gradients to flow unchanged through time, enabling learning of long-range dependencies. The additive cell state update (C_t = f_t * C_{t-1} + i_t * C̃_t) preserves gradient flow.
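
To make this concrete, differentiate the update with respect to the previous cell state (treating the gates as roughly constant for this rough sketch):

∂C_t / ∂C_{t-1} ≈ f_t

Because this factor is the forget gate itself, learned per element and often close to 1, the backward signal is not forced through repeated weight-matrix and tanh multiplications as in a plain RNN, so it can survive across many time steps.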

Gates as Learned Functions

Gates are not hand-coded rules - they're learned from data. The network learns when to forget, when to store, and when to output based on the task.

Bidirectional LSTMs

Process sequence in both directions (forward and backward), useful when future context matters. Doubles parameters but often improves performance on tasks like NER.

Many-to-Many, Many-to-One

LSTMs support different architectures: many-to-many (translation), many-to-one (sentiment), one-to-many (image captioning), seq2seq (conversation).
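
A small sketch, assuming Keras, of how many-to-one and many-to-many differ in practice via return_sequences; the sequence length and sizes are illustrative:

# Sketch: many-to-one vs many-to-many output shapes in Keras
from tensorflow.keras.layers import LSTM, Input

x = Input(shape=(20, 8))                            # 20 time steps, 8 features per step
many_to_one = LSTM(32)(x)                           # one vector per sequence
many_to_many = LSTM(32, return_sequences=True)(x)   # one vector per time step

print(many_to_one.shape)   # (None, 32)
print(many_to_many.shape)  # (None, 20, 32)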

Training Considerations

  • Sequence Length: Longer sequences = more memory. Truncate or use attention for very long sequences.
  • Batch Size: Smaller batches due to memory constraints. Use gradient accumulation if needed.
  • Dropout: Apply dropout (0.2-0.5) to prevent overfitting. Use both dropout and recurrent_dropout.
  • Gradient Clipping: Clip gradients to prevent exploding gradients (common in RNNs), e.g. tf.clip_by_norm or the optimizer's clipnorm argument; see the sketch after this list.
  • Learning Rate: Start with smaller learning rates (1e-3 to 1e-4) than CNNs.
  • Stateful LSTMs: Use stateful=True when processing very long sequences in chunks.
  • Bidirectional: Doubles training time but often worth it for better performance.
  • Layer Normalization: Can improve training stability and speed convergence.
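
A minimal sketch, assuming a Keras workflow like the examples above, of how a few of these settings (dropout, recurrent dropout, gradient clipping, and a smaller learning rate) are typically wired up; the layer sizes and values are illustrative:

# Sketch: dropout, gradient clipping, and learning rate in one place (values illustrative)
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential([
    LSTM(64, dropout=0.3, recurrent_dropout=0.3, input_shape=(60, 3)),
    Dense(1)
])

# Gradient clipping and a smaller learning rate via the optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)
model.compile(optimizer=optimizer, loss='mse')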

Interview Tips

  • 💡Explain the vanishing gradient problem: gradients shrink exponentially in RNNs, making long-term learning impossible
  • 💡Know all three gates: Forget (discard), Input (add), Output (read) - draw the diagram if asked
  • 💡Understand cell state vs hidden state: C_t is long-term memory, h_t is short-term/output
  • 💡Explain formulas: All gates use sigmoid (0-1 output), candidate state uses tanh (-1 to 1)
  • 💡Know the cell state update: C_t = f_t * C_{t-1} + i_t * C̃_t (forget old + add new)
  • 💡Give real examples: Google Translate (originally), speech recognition, time series forecasting
  • 💡Compare with alternatives: GRU (simpler, faster), Transformers (better for NLP, parallel training)
  • 💡Discuss limitations: Still struggles with very long sequences (>1000), Transformers now dominant in NLP
  • 💡Training tips: Gradient clipping, dropout, smaller learning rates, shorter sequences
  • 💡Bidirectional LSTMs: Process both directions, doubles parameters, useful when future context matters
  • 💡Know when NOT to use: Very long sequences (use Transformers), simple patterns (use simpler models)
  • 💡Understand applications: Seq2seq (translation), many-to-one (sentiment), many-to-many (video analysis)