Transformer Architecture
Understanding self-attention, encoder-decoder, BERT, and GPT
Imagine reading a book where every word needs context from other words to understand its meaning. In the sentence 'The bank by the river', 'bank' means shoreline, but in 'I went to the bank', it means a financial institution. Transformers are like super-smart readers that look at ALL words simultaneously to understand each word's meaning based on its context! Unlike older models that read word-by-word sequentially (slow!), Transformers use 'self-attention' to process all words in parallel, understanding relationships between distant words instantly. This breakthrough made models like ChatGPT, BERT, and Google Translate incredibly powerful!
What is a Transformer?
The Transformer is a neural network architecture introduced in the 'Attention Is All You Need' paper (Vaswani et al., 2017) that revolutionized natural language processing. Unlike RNNs that process sequences sequentially, Transformers process entire sequences in parallel using a mechanism called self-attention. This allows them to capture long-range dependencies efficiently and has made them the foundation of modern language models like BERT, GPT, and ChatGPT.
❌ Old Approach: RNNs/LSTMs
- • Process sequentially (word by word)
- • Cannot parallelize → SLOW training
- • Vanishing gradient for long sequences
- • Hard to capture long-range dependencies
- • Limited context window
✅ New Approach: Transformers
- • Process entire sequence in parallel
- • Full parallelization → FAST training
- • Self-attention captures any distance
- • Excellent long-range dependencies
- • Scalable to very long sequences
# Conceptual Comparison: RNN vs Transformer
"""
SENTENCE: "The cat sat on the mat"

RNN/LSTM Processing (Sequential):
Step 1: Process "The" → hidden state h1
Step 2: Process "cat" with h1 → hidden state h2
Step 3: Process "sat" with h2 → hidden state h3
Step 4: Process "on"  with h3 → hidden state h4
Step 5: Process "the" with h4 → hidden state h5
Step 6: Process "mat" with h5 → hidden state h6

❌ Problems:
- Must wait for each step (no parallelization)
- By step 6, information from "The" is diluted
- Slow for long sequences
- Training time: O(n) sequential steps

---

TRANSFORMER Processing (Parallel):
ALL AT ONCE:
"The" attends to ["The", "cat", "sat", "on", "the", "mat"]
"cat" attends to ["The", "cat", "sat", "on", "the", "mat"]
"sat" attends to ["The", "cat", "sat", "on", "the", "mat"]
"on"  attends to ["The", "cat", "sat", "on", "the", "mat"]
"the" attends to ["The", "cat", "sat", "on", "the", "mat"]
"mat" attends to ["The", "cat", "sat", "on", "the", "mat"]

✅ Advantages:
- All words processed simultaneously (full parallelization)
- Each word directly attends to all others (no information loss)
- Fast for any sequence length
- Training time: O(1) parallel (but O(n²) complexity per step)
"""

# Simple Transformer usage with Hugging Face
from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT (a Transformer encoder)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Example sentence
text = "The bank by the river is beautiful."
inputs = tokenizer(text, return_tensors='pt')

# Process with Transformer (all tokens processed in parallel!)
with torch.no_grad():
    outputs = model(**inputs)

# Get contextualized embeddings for each word
embeddings = outputs.last_hidden_state
print(f"Input shape: {inputs['input_ids'].shape}")
print(f"Output embeddings shape: {embeddings.shape}")
print(f"\nEach word has a 768-dimensional embedding based on FULL context!")

# The word "bank" now has different embeddings in:
# "The bank by the river" (shoreline) vs "I went to the bank" (financial)
# because the Transformer looks at ALL words via self-attention!

Self-Attention Mechanism
The core innovation of Transformers - how models learn which words to focus on:
Self-Attention Formula
Attention(Q, K, V) = softmax(QK^T / √d_k) V
Q (Query)
"What am I looking for?" - represents current word
K (Key)
"What do I offer?" - represents all words to match against
V (Value)
"What information do I pass?" - actual information to aggregate
How Self-Attention Works (Step by Step)
1. Create Q, K, V matrices
Transform input embeddings with learned weight matrices: Q = XW_Q, K = XW_K, V = XW_V
2. Calculate attention scores
Compute QK^T: how similar is each word to every other word? Higher = more relevant
3. Scale scores
Divide by √d_k (dimension of keys) to prevent very large values that make gradients unstable
4. Apply softmax
Convert scores to probabilities (sum to 1). These are the attention weights!
5. Weighted sum of values
Multiply attention weights by V: aggregate information from all words, weighted by relevance
# Implementing Self-Attention from Scratch
import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V):
    """
    Self-Attention implementation

    Args:
        Q: Query matrix [batch, seq_len, d_k]
        K: Key matrix   [batch, seq_len, d_k]
        V: Value matrix [batch, seq_len, d_v]
    Returns:
        output: Context vectors [batch, seq_len, d_v]
        attention_weights: Attention scores [batch, seq_len, seq_len]
    """
    # Step 1: Calculate attention scores (Q · K^T)
    # How much should each word attend to every other word?
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1))  # [batch, seq_len, seq_len]
    print(f"Attention scores shape: {scores.shape}")
    print(f"Scores[0,0,:] (how word 0 attends to all words): {scores[0,0,:].tolist()}\n")

    # Step 2: Scale by √d_k (prevents large values)
    scores = scores / math.sqrt(d_k)

    # Step 3: Apply softmax (convert to probabilities)
    attention_weights = F.softmax(scores, dim=-1)
    print("Attention weights (after softmax):")
    print(f"Weights[0,0,:] (probabilities sum to 1): {attention_weights[0,0,:].tolist()}")
    print(f"Sum: {attention_weights[0,0,:].sum().item()}\n")

    # Step 4: Weighted sum of Values
    output = torch.matmul(attention_weights, V)

    return output, attention_weights

# Example usage
batch_size = 1
seq_len = 5    # "The cat sat on mat"
d_model = 64   # embedding dimension

# Simulate input embeddings
X = torch.randn(batch_size, seq_len, d_model)

# Create Q, K, V through linear transformations (learned during training)
W_Q = torch.randn(d_model, d_model)
W_K = torch.randn(d_model, d_model)
W_V = torch.randn(d_model, d_model)

Q = torch.matmul(X, W_Q)  # Query: what am I looking for?
K = torch.matmul(X, W_K)  # Key: what do I offer?
V = torch.matmul(X, W_V)  # Value: what info do I pass?

print("INPUT:")
print(f"Sequence length: {seq_len} words")
print(f"Embedding dimension: {d_model}\n")

# Apply self-attention
output, attention_weights = scaled_dot_product_attention(Q, K, V)

print("OUTPUT:")
print(f"Contextualized embeddings shape: {output.shape}")
print("\n✅ Each word now has context from ALL other words!")

# Visualize attention pattern
print("\nATTENTION PATTERN (who attends to whom):")
print("       Word0 Word1 Word2 Word3 Word4")
for i in range(seq_len):
    print(f"Word{i}: ", end="")
    for j in range(seq_len):
        weight = attention_weights[0, i, j].item()
        print(f"{weight:.2f}  ", end="")
    print()

# Output shows attention weights:
# High weight = strong attention (important relationship)
# Low weight  = weak attention (less relevant)
# Each row sums to 1.0

Transformer Architecture
Two main components working together:
Encoder (Understanding)
Processes input text to create rich contextual representations. Used for understanding tasks.
Components:
- • Multi-Head Self-Attention
- • Feed-Forward Network
- • Layer Normalization
- • Residual Connections
Used in:
BERT, RoBERTa (classification, QA)
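As a quick sketch, PyTorch bundles exactly these components (multi-head self-attention, feed-forward network, layer norm, residual connections) into a reusable encoder layer; the hyperparameter values below are illustrative only:

# Sketch: one encoder layer (and a stack of 6) in PyTorch - sizes are illustrative
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(
    d_model=512,           # embedding size
    nhead=8,               # attention heads
    dim_feedforward=2048,  # feed-forward hidden size
    batch_first=True,      # inputs as [batch, seq_len, d_model]
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)  # stack of 6 layers

x = torch.randn(2, 10, 512)   # [batch=2, seq_len=10, d_model=512]
out = encoder(x)              # same shape: contextualized representations
print(out.shape)              # torch.Size([2, 10, 512])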
Decoder (Generation)
Generates output text one token at a time. Uses masked (causal) self-attention and cross-attention to the encoder's output.
Components:
- • Masked Self-Attention (causal)
- • Cross-Attention to Encoder
- • Feed-Forward Network
- • Layer Norm + Residuals
Used in:
GPT, ChatGPT (text generation; these decoder-only models drop the cross-attention block)
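A minimal sketch of the causal mask that makes decoder self-attention "masked" (the sequence length is arbitrary):

# Sketch: the causal mask behind masked self-attention - illustrative size
import torch

seq_len = 5
# Upper-triangular mask: position i may only attend to positions <= i
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
print(causal_mask)
# tensor([[False,  True,  True,  True,  True],
#         [False, False,  True,  True,  True],
#         ... ])
# True entries are blocked (set to -inf before softmax), so the decoder
# cannot peek at future tokens while generating.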
Full Transformer Stack
1. Input Processing
Tokens → Embeddings + Positional Encodings
2. Encoder Layers (N=6 typical)
Each layer: Multi-Head Attention → Add & Norm → Feed-Forward → Add & Norm
3. Decoder Layers (N=6 typical)
Each layer: Masked Attention → Cross-Attention to Encoder → Feed-Forward (all with Add & Norm)
4. Output Layer
Linear → Softmax → Probability distribution over vocabulary
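A minimal sketch of this full stack using PyTorch's built-in nn.Transformer (sizes are illustrative; token embedding and positional encoding are assumed to have happened before this step):

# Sketch: full encoder-decoder stack via torch.nn.Transformer - illustrative sizes
import torch
import torch.nn as nn

model = nn.Transformer(
    d_model=512, nhead=8,
    num_encoder_layers=6, num_decoder_layers=6,
    dim_feedforward=2048, batch_first=True,
)

src = torch.randn(1, 10, 512)   # encoder input: already embedded source tokens
tgt = torch.randn(1, 7, 512)    # decoder input: already embedded target tokens
tgt_mask = model.generate_square_subsequent_mask(7)  # causal mask for the decoder

out = model(src, tgt, tgt_mask=tgt_mask)   # [1, 7, 512]
print(out.shape)
# A final nn.Linear(512, vocab_size) followed by softmax would map these
# vectors to a probability distribution over the vocabulary.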
Famous Transformer Models
Different architectures for different tasks:
BERT
Encoder-Only
Bidirectional Encoder Representations from Transformers. Reads text in both directions.
Training:
Masked Language Model (predict masked words)
Best for:
Classification, QA, NER, sentiment analysis
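A short sketch of BERT's masked-language-model objective using the Hugging Face pipeline API (the checkpoint name is just one common choice):

# Sketch: BERT fills in a masked word using context from BOTH sides
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The capital of France is [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
# Bidirectional: the prediction for [MASK] uses words to its left AND right.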
GPT
Decoder-Only
Generative Pre-trained Transformer. Autoregressive model for text generation.
Training:
Next-token prediction (predict next word)
Best for:
Text generation, completion, chatbots (ChatGPT)
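A short sketch of autoregressive generation using the publicly available GPT-2 checkpoint (prompt and length are arbitrary):

# Sketch: next-token generation with GPT-2 (a small decoder-only model)
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("The Transformer architecture is", max_new_tokens=20)
print(result[0]["generated_text"])
# Each new token is predicted from all previously generated tokens (left-to-right).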
T5
Encoder-Decoder
Text-to-Text Transfer Transformer. Treats all tasks as text-to-text.
Training:
Span corruption (predict corrupted spans)
Best for:
Translation, summarization, any seq2seq task
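A short sketch of the text-to-text idea with a small public T5 checkpoint (the task prefix "translate English to German:" is added by the pipeline):

# Sketch: T5 treats translation as plain text in, text out
from transformers import pipeline

translator = pipeline("translation_en_to_de", model="t5-small")
print(translator("The house is wonderful.")[0]["translation_text"])
# Summarization, QA, etc. work the same way - only the task prefix changes.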
Key Concepts
Self-Attention
Mechanism that weighs importance of all words when processing each word. Each word 'attends to' other words to understand context. Computes Query, Key, Value matrices to find relationships.
Multi-Head Attention
Multiple attention mechanisms running in parallel, each learning different relationships. Like having multiple perspectives on the same sentence. Increases model's capacity to capture diverse patterns.
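A minimal sketch with PyTorch's built-in multi-head attention module (dimensions are illustrative):

# Sketch: 8 attention heads over a 512-dimensional embedding
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(1, 10, 512)   # [batch, seq_len, d_model]
out, weights = mha(x, x, x)   # self-attention: Q = K = V = x
print(out.shape)      # torch.Size([1, 10, 512])
print(weights.shape)  # torch.Size([1, 10, 10]) - attention weights averaged over heads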
Positional Encoding
Since Transformers process in parallel (no inherent order), positional encodings are added to give the model information about word positions in the sequence.
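A minimal sketch of the sinusoidal positional encodings from the original paper (dimensions are illustrative):

# Sketch: sinusoidal positional encodings, added to the token embeddings
import torch
import math

def positional_encoding(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(seq_len).unsqueeze(1).float()      # [seq_len, 1]
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))     # [d_model/2]
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions use sin
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions use cos
    return pe

pe = positional_encoding(seq_len=10, d_model=512)
print(pe.shape)   # torch.Size([10, 512])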
Encoder-Decoder Structure
Encoder processes input (understanding), Decoder generates output (creation). Used in translation. Decoder-only (GPT) for generation, Encoder-only (BERT) for understanding.
Interview Tips
- 💡Transformers replaced RNNs/LSTMs by processing sequences in parallel using self-attention, enabling much faster training and better long-range dependencies
- 💡Self-attention: each word attends to all other words. Computes attention scores using Query×Key (how relevant words are), then weighted sum of Values
- 💡Key formula: Attention(Q,K,V) = softmax(QK^T/√d_k)V. Scaled dot-product prevents large values, softmax gives probabilities
- 💡Multi-head attention runs multiple attention mechanisms in parallel (typically 8-16 heads), each learning different relationships/patterns
- 💡Encoder-Decoder: Encoder understands input, Decoder generates output with cross-attention to encoder. Used in translation (seq2seq tasks)
- 💡Decoder-only (GPT): autoregressive generation, predicts next token. Encoder-only (BERT): bidirectional understanding, good for classification
- 💡Positional encodings necessary because self-attention is permutation-invariant. Uses sin/cos functions or learned embeddings
- 💡Advantages: parallelization (faster), long-range dependencies, transfer learning. Disadvantages: quadratic complexity O(n²), large memory
- 💡BERT = Bidirectional Encoder (understanding). GPT = Autoregressive Decoder (generation). T5 = Full Encoder-Decoder (seq2seq)