Transformer Architecture

Understanding self-attention, encoder-decoder, BERT, and GPT

Imagine reading a book where every word needs context from other words to understand its meaning. In the sentence 'The bank by the river', 'bank' means shoreline, but in 'I went to the bank', it means a financial institution. Transformers are like super-smart readers that look at ALL words simultaneously to understand each word's meaning based on its context! Unlike older models that read word-by-word sequentially (slow!), Transformers use 'self-attention' to process all words in parallel, understanding relationships between distant words instantly. This breakthrough made models like ChatGPT, BERT, and Google Translate incredibly powerful!

What is a Transformer?

The Transformer is a neural network architecture introduced in the 'Attention is All You Need' paper (2017) that revolutionized natural language processing. Unlike RNNs that process sequences sequentially, Transformers process entire sequences in parallel using a mechanism called self-attention. This allows them to capture long-range dependencies efficiently and has made them the foundation of modern language models like BERT, GPT, and ChatGPT.

Old Approach: RNNs/LSTMs

  • Process sequentially (word by word)
  • Cannot parallelize → SLOW training
  • Vanishing gradient for long sequences
  • Hard to capture long-range dependencies
  • Limited context window

New Approach: Transformers

  • Process entire sequence in parallel
  • Full parallelization → FAST training
  • Self-attention captures any distance
  • Excellent long-range dependencies
  • Scalable to very long sequences
python
# Conceptual Comparison: RNN vs Transformer
"""
SENTENCE: "The cat sat on the mat"
RNN/LSTM Processing (Sequential):
Step 1: Process "The" → hidden state h1
Step 2: Process "cat" with h1 → hidden state h2
Step 3: Process "sat" with h2 → hidden state h3
Step 4: Process "on" with h3 → hidden state h4
Step 5: Process "the" with h4 → hidden state h5
Step 6: Process "mat" with h5 → hidden state h6
Problems:
- Must wait for each step (no parallelization)
- By step 6, information from "The" is diluted
- Slow for long sequences
- Training time: O(n) sequential steps
---
TRANSFORMER Processing (Parallel):
ALL AT ONCE:
"The" attends to ["The", "cat", "sat", "on", "the", "mat"]
"cat" attends to ["The", "cat", "sat", "on", "the", "mat"]
"sat" attends to ["The", "cat", "sat", "on", "the", "mat"]
"on" attends to ["The", "cat", "sat", "on", "the", "mat"]
"the" attends to ["The", "cat", "sat", "on", "the", "mat"]
"mat" attends to ["The", "cat", "sat", "on", "the", "mat"]
Advantages:
- All words processed simultaneously (full parallelization)
- Each word directly attends to all others (no information loss)
- Fast for any sequence length
- Training time: O(1) sequential steps (but O(n²) attention complexity per layer)
"""
# Simple Transformer usage with Hugging Face
from transformers import BertTokenizer, BertModel
import torch
# Load pre-trained BERT (a Transformer encoder)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
# Example sentence
text = "The bank by the river is beautiful."
inputs = tokenizer(text, return_tensors='pt')
# Process with Transformer (all tokens processed in parallel!)
with torch.no_grad():
    outputs = model(**inputs)
# Get contextualized embeddings for each word
embeddings = outputs.last_hidden_state
print(f"Input shape: {inputs['input_ids'].shape}")
print(f"Output embeddings shape: {embeddings.shape}")
print(f"\nEach word has a 768-dimensional embedding based on FULL context!")
# The word "bank" now has different embeddings in:
# "The bank by the river" (shoreline) vs "I went to the bank" (financial)
# because Transformer looks at ALL words via self-attention!

Self-Attention Mechanism

The core innovation of Transformers is how the model learns which words to focus on:

Self-Attention Formula

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Q (Query)

"What am I looking for?" - represents current word

K (Key)

"What do I offer?" - represents all words to match against

V (Value)

"What information do I pass?" - actual information to aggregate

How Self-Attention Works (Step by Step)

  1. Create Q, K, V matrices

     Transform input embeddings with learned weight matrices: Q = XW_Q, K = XW_K, V = XW_V

  2. Calculate attention scores

     Compute QK^T: how similar is each word to every other word? Higher = more relevant

  3. Scale scores

     Divide by √d_k (dimension of keys) to prevent very large values that make gradients unstable

  4. Apply softmax

     Convert scores to probabilities (sum to 1). These are the attention weights!

  5. Weighted sum of values

     Multiply attention weights by V: aggregate information from all words, weighted by relevance

python
# Implementing Self-Attention from Scratch
import torch
import torch.nn.functional as F
import math
def scaled_dot_product_attention(Q, K, V):
    """
    Self-Attention implementation
    Args:
        Q: Query matrix [batch, seq_len, d_k]
        K: Key matrix [batch, seq_len, d_k]
        V: Value matrix [batch, seq_len, d_v]
    Returns:
        output: Context vectors [batch, seq_len, d_v]
        attention_weights: Attention scores [batch, seq_len, seq_len]
    """
    # Step 1: Calculate attention scores (Q · K^T)
    # How much should each word attend to every other word?
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1))  # [batch, seq_len, seq_len]
    print(f"Attention scores shape: {scores.shape}")
    print(f"Scores[0,0,:] (how word 0 attends to all words): {scores[0,0,:].tolist()}\n")
    # Step 2: Scale by √d_k (prevents large values)
    scores = scores / math.sqrt(d_k)
    # Step 3: Apply softmax (convert to probabilities)
    attention_weights = F.softmax(scores, dim=-1)
    print("Attention weights (after softmax):")
    print(f"Weights[0,0,:] (probabilities sum to 1): {attention_weights[0,0,:].tolist()}")
    print(f"Sum: {attention_weights[0,0,:].sum().item()}\n")
    # Step 4: Weighted sum of Values
    output = torch.matmul(attention_weights, V)
    return output, attention_weights
# Example usage
batch_size = 1
seq_len = 5 # "The cat sat on mat"
d_model = 64 # embedding dimension
# Simulate input embeddings
X = torch.randn(batch_size, seq_len, d_model)
# Create Q, K, V through linear transformations (learned during training)
W_Q = torch.randn(d_model, d_model)
W_K = torch.randn(d_model, d_model)
W_V = torch.randn(d_model, d_model)
Q = torch.matmul(X, W_Q) # Query: what am I looking for?
K = torch.matmul(X, W_K) # Key: what do I offer?
V = torch.matmul(X, W_V) # Value: what info do I pass?
print("INPUT:")
print(f"Sequence length: {seq_len} words")
print(f"Embedding dimension: {d_model}\n")
# Apply self-attention
output, attention_weights = scaled_dot_product_attention(Q, K, V)
print("OUTPUT:")
print(f"Contextualized embeddings shape: {output.shape}")
print("\n✅ Each word now has context from ALL other words!")
# Visualize attention pattern
print("\nATTENTION PATTERN (who attends to whom):")
print(" Word0 Word1 Word2 Word3 Word4")
for i in range(seq_len):
    print(f"Word{i}: ", end="")
    for j in range(seq_len):
        weight = attention_weights[0, i, j].item()
        print(f"{weight:.2f} ", end="")
    print()
# Output shows attention weights:
# High weight = strong attention (important relationship)
# Low weight = weak attention (less relevant)
# Each row sums to 1.0

Transformer Architecture

Two main components working together:

Encoder (Understanding)

Processes input text to create rich contextual representations. Used for understanding tasks.

Components:

  • Multi-Head Self-Attention
  • Feed-Forward Network
  • Layer Normalization
  • Residual Connections

Used in:

BERT, RoBERTa (classification, QA)

Decoder (Generation)

Generates output text one token at a time. Uses masked self-attention and cross-attention to the encoder's output.

Components:

  • Masked Self-Attention (causal)
  • Cross-Attention to Encoder
  • Feed-Forward Network
  • Layer Norm + Residuals

Used in:

GPT, ChatGPT (text generation)
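
The "masked" part is what keeps generation honest: during training, each decoder position is only allowed to attend to earlier positions, never to future tokens. Below is a minimal sketch of such a causal mask, using toy random tensors purely for illustration:

python
# Minimal sketch of causal (masked) self-attention, as used in decoder blocks
# Toy random tensors for illustration only
import torch
import torch.nn.functional as F
import math

seq_len, d_k = 5, 16
Q = torch.randn(1, seq_len, d_k)
K = torch.randn(1, seq_len, d_k)
V = torch.randn(1, seq_len, d_k)

scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)

# Causal mask: position i may only attend to positions <= i (no peeking at future tokens)
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float('-inf'))

attention_weights = F.softmax(scores, dim=-1)  # masked positions get exactly 0 weight
output = torch.matmul(attention_weights, V)
print(attention_weights[0])  # lower-triangular pattern: each row attends only to the past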

Full Transformer Stack

1. Input Processing

Tokens → Embeddings + Positional Encodings

2. Encoder Layers (N=6 typical)

Each layer: Multi-Head Attention → Add & Norm → Feed-Forward → Add & Norm

3. Decoder Layers (N=6 typical)

Each layer: Masked Attention → Cross-Attention to Encoder → Feed-Forward (all with Add & Norm)

4. Output Layer

Linear → Softmax → Probability distribution over vocabulary
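
As a rough sketch of how these pieces compose in code, PyTorch's built-in modules can be stacked as below. The sizes (d_model=512, 8 heads, 6 layers) mirror the base model from the original paper; the vocabulary size and token ids are made up for illustration, and positional encodings are omitted for brevity (see the Positional Encoding sketch further down).

python
# Rough sketch of the full encoder-decoder stack with PyTorch's built-in layers
# (illustrative sizes; positional encodings omitted for brevity)
import torch
import torch.nn as nn

d_model, nhead, num_layers, vocab_size = 512, 8, 6, 30000

embedding = nn.Embedding(vocab_size, d_model)  # 1. tokens -> embeddings
encoder_layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=2048, batch_first=True)
decoder_layer = nn.TransformerDecoderLayer(d_model, nhead, dim_feedforward=2048, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)  # 2. encoder layers
decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)  # 3. decoder layers
output_head = nn.Linear(d_model, vocab_size)                           # 4. linear -> softmax (at loss time)

src = torch.randint(0, vocab_size, (2, 10))  # source token ids [batch, src_len]
tgt = torch.randint(0, vocab_size, (2, 7))   # target token ids [batch, tgt_len]

memory = encoder(embedding(src))
# Causal mask so the decoder cannot attend to future target tokens
tgt_mask = torch.triu(torch.full((tgt.size(1), tgt.size(1)), float('-inf')), diagonal=1)
decoded = decoder(embedding(tgt), memory, tgt_mask=tgt_mask)  # masked self-attention + cross-attention
logits = output_head(decoded)
print(logits.shape)  # [2, 7, vocab_size]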

Famous Transformer Models

Different architectures for different tasks:

BERT

Encoder-Only

Bidirectional Encoder Representations from Transformers. Reads text in both directions.

Training:

Masked Language Model (predict masked words)

Best for:

Classification, QA, NER, sentiment analysis

GPT

Decoder-Only

Generative Pre-trained Transformer. Autoregressive model for text generation.

Training:

Next-token prediction (predict next word)

Best for:

Text generation, completion, chatbots (ChatGPT)

T5

Encoder-Decoder

Text-to-Text Transfer Transformer. Treats all tasks as text-to-text.

Training:

Span corruption (predict corrupted spans)

Best for:

Translation, summarization, any seq2seq task
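
A quick way to feel the difference between the three families is the Hugging Face pipeline API. This is a small sketch assuming the transformers library is installed and the standard public checkpoints below are available (each call downloads weights on first use, and outputs will vary):

python
# One illustrative task per architecture family via Hugging Face pipelines
from transformers import pipeline

# Encoder-only (BERT): masked language modeling / understanding
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The capital of France is [MASK].")[0]["token_str"])

# Decoder-only (GPT-2): autoregressive text generation
generator = pipeline("text-generation", model="gpt2")
print(generator("Transformers are", max_new_tokens=20)[0]["generated_text"])

# Encoder-decoder (T5): text-to-text, e.g. summarization
summarizer = pipeline("summarization", model="t5-small")
text = ("The Transformer architecture processes all tokens in parallel using self-attention, "
        "which enables efficient training on long sequences and transfer learning at scale.")
print(summarizer(text)[0]["summary_text"])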

Key Concepts

Self-Attention

Mechanism that weighs importance of all words when processing each word. Each word 'attends to' other words to understand context. Computes Query, Key, Value matrices to find relationships.

Multi-Head Attention

Multiple attention mechanisms running in parallel, each learning different relationships. Like having multiple perspectives on the same sentence. Increases model's capacity to capture diverse patterns.
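
A minimal sketch using PyTorch's built-in multi-head attention module (8 heads over a 512-dimensional model; toy tensors, and the per-head weights output assumes a reasonably recent PyTorch version):

python
# Sketch: multi-head self-attention with nn.MultiheadAttention (8 heads)
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 512)  # [batch, seq_len, d_model]

# Self-attention: the same tensor is used as query, key, and value
out, weights = mha(x, x, x, average_attn_weights=False)
print(out.shape)      # [2, 10, 512]   contextualized embeddings
print(weights.shape)  # [2, 8, 10, 10] one attention map per head (a different "perspective" each)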

Positional Encoding

Since Transformers process in parallel (no inherent order), positional encodings are added to give the model information about word positions in the sequence.
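
The original paper uses fixed sinusoidal encodings: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). Here is a short sketch of how they are built and added to the token embeddings (sizes are illustrative):

python
# Sketch: sinusoidal positional encodings added to token embeddings
import torch
import math

def sinusoidal_positional_encoding(max_len, d_model):
    position = torch.arange(max_len).unsqueeze(1)  # [max_len, 1]
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions: sin
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions: cos
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
token_embeddings = torch.randn(1, 50, 512)  # toy embeddings [batch, seq_len, d_model]
x = token_embeddings + pe.unsqueeze(0)      # position info injected before the first layer
print(x.shape)  # [1, 50, 512]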

Encoder-Decoder Structure

Encoder processes input (understanding), Decoder generates output (creation). Used in translation. Decoder-only (GPT) for generation, Encoder-only (BERT) for understanding.

Interview Tips

  • 💡Transformers replaced RNNs/LSTMs by processing sequences in parallel using self-attention, enabling much faster training and better long-range dependencies
  • 💡Self-attention: each word attends to all other words. Computes attention scores using Query×Key (how relevant words are), then weighted sum of Values
  • 💡Key formula: Attention(Q,K,V) = softmax(QK^T/√d_k)V. Scaled dot-product prevents large values, softmax gives probabilities
  • 💡Multi-head attention runs multiple attention mechanisms in parallel (typically 8-16 heads), each learning different relationships/patterns
  • 💡Encoder-Decoder: Encoder understands input, Decoder generates output with cross-attention to encoder. Used in translation (seq2seq tasks)
  • 💡Decoder-only (GPT): autoregressive generation, predicts next token. Encoder-only (BERT): bidirectional understanding, good for classification
  • 💡Positional encodings necessary because self-attention is permutation-invariant. Uses sin/cos functions or learned embeddings
  • 💡Advantages: parallelization (faster), long-range dependencies, transfer learning. Disadvantages: quadratic complexity O(n²), large memory
  • 💡BERT = Bidirectional Encoder (understanding). GPT = Autoregressive Decoder (generation). T5 = Full Encoder-Decoder (seq2seq)