Tokenization in NLP
Understanding Tokenization: Breaking text into meaningful units for machine learning
What is Tokenization?
Tokenization is the process of breaking down text into smaller units called tokens. These tokens can be words, subwords, or characters. Tokenization is the first and most fundamental step in any NLP pipeline, converting raw text into a format that machine learning models can process.
💡 Simple Analogy:
Think of tokenization like breaking a sentence into puzzle pieces. Just as you need to separate puzzle pieces before assembling them, NLP models need text broken into tokens before they can understand and process language. Different tokenization strategies are like different ways to cut the puzzle - you can cut by words, syllables, or even individual letters.
⚡ Why It Matters:
Tokenization directly impacts model performance, vocabulary size, and the ability to handle unseen words. Modern language models like BERT, GPT, and T5 all rely on sophisticated tokenization strategies to achieve state-of-the-art results.
Types of Tokenization
Different tokenization strategies have different trade-offs:
Word-level Tokenization
Split text into individual words
Method:
Split on whitespace and punctuation
Example:
"Hello world!" → ["Hello", "world", "!"]✓ Pros:
- • Simple and intuitive
- • Preserves word meaning
- • Easy to implement
✗ Cons:
- • Large vocabulary size
- • Cannot handle OOV (out-of-vocabulary) words
- • Different forms treated as different tokens (run, running, ran)
Use Cases:
Simple NLP tasks, small vocabulary domains
Character-level Tokenization
Split text into individual characters
Method:
Each character becomes a token
Example:
"Hello" → ["H", "e", "l", "l", "o"]✓ Pros:
- • Very small vocabulary (~100 chars)
- • No OOV problem
- • Works across languages
✗ Cons:
- • Very long sequences
- • Loses word-level meaning
- • Harder to learn patterns
Use Cases:
Spelling correction, morphologically rich languages, DNA/protein sequences
Subword Tokenization
Split text into meaningful subword units
Method:
Learn frequent subword patterns from data
Example:
"unhappiness" → ["un", "happiness"]✓ Pros:
- • Balance between word and character
- • Handles OOV words
- • Reasonable vocabulary size
- • Captures morphology
✗ Cons:
- • Requires training
- • More complex implementation
Use Cases:
Modern transformers (BERT, GPT), multilingual models
Code Example: Basic Tokenization
# Basic Tokenization Examples

import re

# 1. Simple Whitespace Tokenization
text = "Hello world! How are you?"
tokens = text.split()
print(tokens)
# Output: ['Hello', 'world!', 'How', 'are', 'you?']

# 2. Word Tokenization (with punctuation handling)
def word_tokenize(text):
    # Split on whitespace and punctuation
    return re.findall(r'\b\w+\b|[^\w\s]', text)

tokens = word_tokenize("Hello world! How are you?")
print(tokens)
# Output: ['Hello', 'world', '!', 'How', 'are', 'you', '?']

# 3. Character Tokenization
text = "Hello"
char_tokens = list(text)
print(char_tokens)
# Output: ['H', 'e', 'l', 'l', 'o']

# 4. Building Vocabulary
def build_vocab(texts):
    vocab = set()
    for text in texts:
        tokens = word_tokenize(text.lower())
        vocab.update(tokens)
    # Add special tokens
    vocab.update(['[PAD]', '[UNK]', '[CLS]', '[SEP]'])
    return {token: idx for idx, token in enumerate(sorted(vocab))}

texts = ["Hello world", "How are you", "Hello there"]
vocab = build_vocab(texts)
print(f"Vocabulary size: {len(vocab)}")
print(f"Sample mappings: {list(vocab.items())[:5]}")

# 5. Token to ID conversion
def tokens_to_ids(tokens, vocab):
    return [vocab.get(token, vocab['[UNK]']) for token in tokens]

tokens = word_tokenize("hello world")
token_ids = tokens_to_ids(tokens, vocab)
print(f"Tokens: {tokens}")
print(f"Token IDs: {token_ids}")

Subword Tokenization Algorithms
Modern NLP relies on subword tokenization. Here are the most popular algorithms:
Byte Pair Encoding (BPE)
Used by: GPT, GPT-2, RoBERTa
Iteratively merges the most frequent pair of characters/subwords
How it works:
1. Start with a character vocabulary
2. Find the most frequent adjacent pair
3. Merge this pair into a new token
4. Repeat until the desired vocab size is reached
Example:
"low" + "lowest" → learns "lo", "low", "est" as tokens
✓ Pros:
- • Data-driven
- • Language-agnostic
- • Handles rare words well
✗ Cons:
- • Greedy algorithm
- • Token boundaries may not align with morphemes
WordPiece
Used by: BERT, DistilBERT
Similar to BPE but chooses merges based on likelihood
How it works:
1. Start with a character vocabulary
2. Select the pair that maximizes the likelihood of the training data
3. Add the ## prefix to non-starting subwords
4. Repeat until the desired vocab size is reached
(A short runnable sketch follows the pros and cons below.)
Example:
"unhappiness" → ["un", "##happiness"] or ["un", "##happ", "##iness"]
✓ Pros:
- • Probabilistic approach
- • Better handles morphology
- • ## prefix indicates continuation
✗ Cons:
- • More complex than BPE
- • Requires larger training corpus
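A quick way to see the ## continuation prefix in practice is to run a pre-trained WordPiece tokenizer over words unlikely to be in its vocabulary. A minimal sketch, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint are available; the exact splits depend on the learned vocabulary.

from transformers import AutoTokenizer

# Load BERT's pre-trained WordPiece tokenizer (downloads on first use)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

for word in ["unhappiness", "tokenization", "electroencephalography"]:
    pieces = tokenizer.tokenize(word)
    # Pieces after the first carry the ## continuation prefix,
    # e.g. something like ['un', '##happi', '##ness'] (exact splits vary by vocab)
    print(f"{word:>25} -> {pieces}")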
SentencePiece
Used by: T5, ALBERT, XLNet
Treats text as raw Unicode, language-independent
How it works:
Directly trains on raw text without pre-tokenization. Implements both BPE and Unigram algorithms. Treats spaces as special characters (▁).
Example:
"Hello world" → ["▁Hello", "▁world"]
✓ Pros:
- • Truly language-agnostic
- • No pre-tokenization needed
- • Handles all languages uniformly
✗ Cons:
- • Requires training from scratch
- • Special character handling
Unigram Language Model
Used by: T5 (with SentencePiece), ALBERT
Uses probabilistic model to find best tokenization
How it works:
1. Start with a large vocabulary
2. Iteratively remove tokens that least affect the likelihood
3. Keep tokens that maximize the data likelihood
4. Repeat until the desired vocab size is reached
(A toy scoring sketch follows the pros and cons below.)
Example:
Considers all possible segmentations, picks most probable
✓ Pros:
- • Probabilistic foundation
- • Multiple segmentations possible
- • Better theoretical grounding
✗ Cons:
- • Slower than BPE
- • More computationally expensive
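To make "considers all possible segmentations" concrete, here is a toy sketch of unigram scoring: each segmentation of a word is scored by the product of its token probabilities (sum of log-probabilities), and the highest-scoring one wins. The vocabulary and probabilities below are made up for illustration, not taken from any real model.

import math

# Hypothetical unigram vocabulary with made-up probabilities
vocab = {
    "un": 0.10, "happi": 0.05, "ness": 0.08,
    "unhappi": 0.01, "happiness": 0.02, "u": 0.03, "n": 0.04,
}

def segmentations(word):
    """Enumerate all ways to split `word` into in-vocabulary pieces."""
    if not word:
        yield []
        return
    for i in range(1, len(word) + 1):
        piece = word[:i]
        if piece in vocab:
            for rest in segmentations(word[i:]):
                yield [piece] + rest

def score(pieces):
    """Log-probability of a segmentation under the unigram model."""
    return sum(math.log(vocab[p]) for p in pieces)

word = "unhappiness"
candidates = list(segmentations(word))
for seg in candidates:
    print(seg, round(score(seg), 2))
# With these made-up probabilities, ['un', 'happiness'] scores highest
print("Best segmentation:", max(candidates, key=score))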
Code Example: Hugging Face Tokenizers
# Using Hugging Face Tokenizers - Industry Standard

from transformers import AutoTokenizer

# 1. BERT WordPiece Tokenizer
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

text = "Tokenization is the first step in NLP."
encoded = bert_tokenizer(text, return_tensors='pt')

print("Tokens:", bert_tokenizer.tokenize(text))
print("Token IDs:", encoded['input_ids'])
print("Attention Mask:", encoded['attention_mask'])

# Decode back to text
decoded = bert_tokenizer.decode(encoded['input_ids'][0])
print("Decoded:", decoded)

# Output:
# Tokens: ['token', '##ization', 'is', 'the', 'first', 'step', 'in', 'nl', '##p', '.']
# Note: ## prefix indicates subword continuation

# 2. GPT-2 BPE Tokenizer
gpt2_tokenizer = AutoTokenizer.from_pretrained('gpt2')

text = "Tokenization is crucial for NLP models."
tokens = gpt2_tokenizer.tokenize(text)
print("GPT-2 Tokens:", tokens)
# BPE learns frequent subword patterns

# 3. Handling Special Tokens
text1 = "First sentence."
text2 = "Second sentence."

# For sentence pair tasks (e.g., question answering)
encoded = bert_tokenizer(
    text1, text2,
    padding='max_length',
    max_length=20,
    truncation=True,
    return_tensors='pt'
)
print("With special tokens:")
print(bert_tokenizer.convert_ids_to_tokens(encoded['input_ids'][0]))
# [CLS] first sentence . [SEP] second sentence . [SEP] [PAD] [PAD] ...

# 4. Batch Tokenization
texts = [
    "Short text",
    "This is a much longer text that needs to be tokenized",
    "Medium length"
]
encoded_batch = bert_tokenizer(
    texts,
    padding=True,       # Pad to longest in batch
    truncation=True,
    max_length=15,
    return_tensors='pt'
)
print("Batch token IDs shape:", encoded_batch['input_ids'].shape)
print("Attention masks shape:", encoded_batch['attention_mask'].shape)

# 5. Handling OOV Words with Subword Tokenization
text = "COVID-19 is a coronavirus disease."  # COVID-19 might be OOV
tokens = bert_tokenizer.tokenize(text)
print("Tokens with OOV handling:", tokens)
# Subword tokenization breaks unknown words into known subwords
# e.g., ["co", "##vid", "-", "19"] or similar

Code Example: BPE from Scratch
# Implementing Simple BPE from Scratch

from collections import Counter, defaultdict

def get_stats(vocab):
    """Count frequency of adjacent pairs"""
    pairs = defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i+1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge the most frequent pair in vocabulary"""
    new_vocab = {}
    bigram = ' '.join(pair)
    replacement = ''.join(pair)
    for word in vocab:
        new_word = word.replace(bigram, replacement)
        new_vocab[new_word] = vocab[word]
    return new_vocab

def learn_bpe(text, num_merges=10):
    """Learn BPE merges from text"""
    # Initialize vocabulary (character-level)
    vocab = Counter()
    for word in text.lower().split():
        vocab[' '.join(word) + ' </w>'] += 1

    print(f"Initial vocabulary size: {len(vocab)}")
    print(f"Sample: {list(vocab.items())[:3]}")

    # Learn merges
    merges = []
    for i in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:
            break
        best_pair = max(pairs, key=pairs.get)
        vocab = merge_vocab(best_pair, vocab)
        merges.append(best_pair)
        print(f"Merge {i+1}: {best_pair} (freq: {pairs[best_pair]})")

    return merges, vocab

# Example usage
text = """low low low low low
lower lower lower
lowest lowest"""

merges, final_vocab = learn_bpe(text, num_merges=5)

print("\nLearned merges:")
for i, merge in enumerate(merges):
    print(f"{i+1}. {merge}")

print("\nFinal vocabulary:")
for word, freq in list(final_vocab.items())[:10]:
    print(f"{word}: {freq}")

# Applying BPE to a new word
def apply_bpe(word, merges):
    """Apply learned merges to tokenize a word"""
    word = ' '.join(word) + ' </w>'
    for merge in merges:
        bigram = ' '.join(merge)
        replacement = ''.join(merge)
        word = word.replace(bigram, replacement)
    return word.split()

new_word = "lower"
tokens = apply_bpe(new_word, merges)
print(f"\nTokenizing '{new_word}': {tokens}")

Special Tokens
Tokenizers add special tokens for specific purposes:
[CLS] or <s>: Start of sequence (classification token)
Used by: BERT, RoBERTa
[SEP] or </s>: Separator between sentences or end of sequence
Used by: BERT, RoBERTa
[PAD]: Padding to make all sequences in a batch the same length
Used by: All models
[UNK]: Unknown/out-of-vocabulary token
Used by: All models
[MASK]: Masked token for masked language modeling
Used by: BERT
<bos>: Beginning of sequence
Used by: GPT-style autoregressive models
<eos>: End of sequence
Used by: GPT-style autoregressive models (GPT-2 uses a single <|endoftext|> token for both roles)
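Rather than memorizing which model uses which tokens, you can inspect them directly. A small sketch, assuming the Hugging Face transformers library; special_tokens_map and all_special_tokens are standard attributes on its tokenizer classes.

from transformers import AutoTokenizer

for name in ['bert-base-uncased', 'gpt2']:
    tok = AutoTokenizer.from_pretrained(name)
    print(name)
    print("  special tokens map:", tok.special_tokens_map)
    print("  all special tokens:", tok.all_special_tokens)
    # BERT reports [CLS]/[SEP]/[PAD]/[UNK]/[MASK]; GPT-2 mainly
    # uses a single <|endoftext|> token for bos/eos/unk.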
Tokenization Challenges
Out-of-Vocabulary (OOV) Words
Word-level tokenization cannot handle words not in vocabulary
Examples:
- • New words: "COVID-19", "cryptocurrency"
- • Typos: "helllo" instead of "hello"
- • Named entities: "Elon Musk" as single unit
💡 Solution:
Use subword tokenization (BPE, WordPiece) to break unknown words into known subwords
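The contrast is easy to see by looking up a rare word in a fixed word-level vocabulary versus passing it through a subword tokenizer. A rough sketch, assuming the Hugging Face bert-base-uncased tokenizer; the toy word-level vocabulary is invented for the example, and the exact subword splits depend on BERT's learned vocabulary.

from transformers import AutoTokenizer

# Toy word-level vocabulary (invented for illustration)
word_vocab = {'[PAD]': 0, '[UNK]': 1, 'the': 2, 'virus': 3, 'spreads': 4}

def word_level_ids(text):
    # Anything outside the vocabulary collapses to [UNK]: information is lost
    return [word_vocab.get(w, word_vocab['[UNK]']) for w in text.lower().split()]

text = "cryptocurrency spreads"
print("Word-level IDs:", word_level_ids(text))   # the rare word becomes [UNK]

# Subword tokenization keeps the rare word as a sequence of known pieces
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
print("WordPiece tokens:", bert_tokenizer.tokenize(text))
# e.g. something like ['crypto', '##cur', '##rency', 'spreads'] (splits vary by vocab)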
Multilingual Tokenization
Different languages have different writing systems and word boundaries
Examples:
- • Chinese/Japanese have no spaces
- • Arabic/Hebrew are right-to-left
- • German has compound words: "Donaudampfschifffahrtsgesellschaft"
💡 Solution:
Use language-agnostic methods like SentencePiece, Unicode normalization
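Unicode normalization is the standard-library part of that solution: visually identical strings can have different underlying code points, and normalizing them (NFKC is a common choice in tokenizer pipelines) makes them compare and tokenize consistently. A small sketch using Python's built-in unicodedata module.

import unicodedata

# Same visible text, different code points: full-width Latin vs ASCII,
# and an accented character written precomposed vs as base + combining mark
pairs = [
    ("ＮＬＰ", "NLP"),                # full-width vs ASCII letters
    ("caf\u00e9", "cafe\u0301"),      # 'é' precomposed vs 'e' + combining accent
]

for a, b in pairs:
    na, nb = unicodedata.normalize("NFKC", a), unicodedata.normalize("NFKC", b)
    print(f"{a!r} == {b!r}? {a == b}   after NFKC: {na == nb}")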
Vocabulary Size Trade-off
Large vocabulary = more memory, small vocabulary = longer sequences
Examples:
- • Word-level: 50k-100k tokens, short sequences
- • Character-level: ~100 tokens, very long sequences
- • Subword: 30k-50k tokens, balanced
💡 Solution:
Choose vocabulary size based on task, data, and computational constraints
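The sequence-length side of the trade-off can be measured directly: tokenize the same sentence at character, word, and subword granularity and compare the counts. A quick sketch, assuming the Hugging Face bert-base-uncased tokenizer; the exact subword count depends on its vocabulary.

from transformers import AutoTokenizer

text = "Tokenization strategies trade vocabulary size against sequence length."

char_tokens = list(text)                       # tiny vocab, longest sequence
word_tokens = text.split()                     # huge vocab, shortest sequence
subword_tokens = AutoTokenizer.from_pretrained(
    'bert-base-uncased').tokenize(text)        # ~30k vocab, in between

print("Characters:", len(char_tokens))
print("Words:     ", len(word_tokens))
print("Subwords:  ", len(subword_tokens))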
Normalization & Preprocessing
Should text be lowercased? How to handle punctuation, emojis, URLs?
Examples:
- • "Apple" (company) vs "apple" (fruit)
- • Hashtags: #MachineLearning
- • Emojis: 😀 🎉
💡 Solution:
Task-dependent: sentiment analysis often keeps emojis and casing, while NER needs casing to identify proper nouns
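Whether to lowercase is often baked into the tokenizer you pick: cased and uncased checkpoints of the same model normalize text differently. A small sketch assuming the Hugging Face bert-base-cased and bert-base-uncased checkpoints; exact splits depend on each vocabulary.

from transformers import AutoTokenizer

cased = AutoTokenizer.from_pretrained('bert-base-cased')
uncased = AutoTokenizer.from_pretrained('bert-base-uncased')

text = "Apple released a new iPhone 😀 #MachineLearning"

# The cased tokenizer preserves capitalization (useful for NER);
# the uncased one lowercases everything before applying WordPiece.
print("cased:  ", cased.tokenize(text))
print("uncased:", uncased.tokenize(text))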
Code Example: SentencePiece
# Using SentencePiece - Language-Agnostic Tokenization

import sentencepiece as spm

# 1. Training a SentencePiece model from scratch
# First, save training data to a file
training_data = """This is a sample text for training.
SentencePiece works on raw text without preprocessing.
It handles all languages uniformly including 日本語 and العربية."""

with open('training_data.txt', 'w', encoding='utf-8') as f:
    f.write(training_data)

# Train the model
spm.SentencePieceTrainer.train(
    input='training_data.txt',
    model_prefix='tokenizer',
    vocab_size=60,             # keep small for this tiny demo corpus; real models use 30k+
    model_type='bpe',          # or 'unigram'
    character_coverage=1.0,    # For multilingual, use 0.9995
    pad_id=0,
    unk_id=1,
    bos_id=2,
    eos_id=3
)

# 2. Loading and using the trained model
sp = spm.SentencePieceProcessor()
sp.load('tokenizer.model')

text = "Hello world! This is SentencePiece."

# Tokenize to pieces
tokens = sp.encode_as_pieces(text)
print("Tokens:", tokens)
# Output: ['▁Hello', '▁world', '!', '▁This', '▁is', '▁Sent', 'ence', 'P', 'iece', '.']
# ▁ represents space

# Tokenize to IDs
ids = sp.encode_as_ids(text)
print("Token IDs:", ids)

# Decode back
decoded = sp.decode_ids(ids)
print("Decoded:", decoded)

# 3. Using the pre-trained T5 tokenizer (uses SentencePiece)
from transformers import T5Tokenizer

t5_tokenizer = T5Tokenizer.from_pretrained('t5-small')

text = "translate English to German: Hello, how are you?"
tokens = t5_tokenizer.tokenize(text)
print("T5 Tokens:", tokens)
# T5 uses SentencePiece with special task prefixes

encoded = t5_tokenizer(text, return_tensors='pt')
print("Encoded shape:", encoded['input_ids'].shape)

Tokenization in Practice
Hugging Face Tokenizers
Fast, modern tokenization library
- ✓ Extremely fast (Rust backend)
- ✓ Supports all major algorithms
- ✓ Pre-trained tokenizers available
- ✓ Easy to use with Transformers library
spaCy
Industrial-strength NLP library
- ✓ Rule-based tokenization
- ✓ Language-specific models
- ✓ Handles complex cases well
- ✓ Integrated with other NLP tools
NLTK
Educational and research toolkit
- ✓ Multiple tokenization methods
- ✓ Good for learning
- ✓ Extensive documentation
- ✗ Slower than modern alternatives
SentencePiece
Google's unsupervised tokenizer
- ✓ Language-independent
- ✓ Trains from raw text
- ✓ BPE and Unigram support
- ✓ Used in production models
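For comparison with the Hugging Face examples above, here is roughly how the same sentence is tokenized with NLTK and spaCy. A sketch, assuming nltk (with the punkt tokenizer data; newer NLTK versions also need punkt_tab) and spacy with the small English model en_core_web_sm installed.

import nltk
import spacy

text = "Let's tokenize this sentence, shall we?"

# NLTK: classic rule/regex-based word tokenizer (needs the punkt data)
for pkg in ('punkt', 'punkt_tab'):
    nltk.download(pkg, quiet=True)
print("NLTK:", nltk.word_tokenize(text))

# spaCy: rule-based tokenizer tied to a language-specific pipeline
nlp = spacy.load("en_core_web_sm")
print("spaCy:", [token.text for token in nlp(text)])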
Code Example: Practical Pipeline
# Practical Tokenization Pipeline

from transformers import AutoTokenizer

class TextProcessor:
    """Complete text processing pipeline"""

    def __init__(self, model_name='bert-base-uncased', max_length=128):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.max_length = max_length

    def preprocess(self, text):
        """Clean and normalize text"""
        # Remove extra whitespace
        text = ' '.join(text.split())
        # Lowercase (if using uncased model)
        text = text.lower()
        return text

    def tokenize_single(self, text, add_special_tokens=True):
        """Tokenize a single text"""
        text = self.preprocess(text)
        encoded = self.tokenizer(
            text,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt',
            add_special_tokens=add_special_tokens
        )
        return {
            'input_ids': encoded['input_ids'],
            'attention_mask': encoded['attention_mask'],
            'tokens': self.tokenizer.convert_ids_to_tokens(encoded['input_ids'][0])
        }

    def tokenize_batch(self, texts):
        """Tokenize multiple texts efficiently"""
        texts = [self.preprocess(t) for t in texts]
        encoded = self.tokenizer(
            texts,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        return encoded

    def decode(self, token_ids, skip_special_tokens=True):
        """Convert token IDs back to text"""
        return self.tokenizer.decode(
            token_ids,
            skip_special_tokens=skip_special_tokens
        )

    def get_vocab_size(self):
        """Get tokenizer vocabulary size"""
        return self.tokenizer.vocab_size

    def analyze_tokenization(self, text):
        """Analyze how text is tokenized"""
        result = self.tokenize_single(text, add_special_tokens=True)
        print(f"Original text: {text}")
        print(f"Tokens: {result['tokens']}")
        print(f"Token IDs: {result['input_ids'][0].tolist()}")
        print(f"Attention mask: {result['attention_mask'][0].tolist()}")
        print(f"Number of tokens: {result['attention_mask'][0].sum().item()}")
        return result

# Example usage
processor = TextProcessor('bert-base-uncased', max_length=64)

# Single text
text = "Natural language processing is fascinating!"
result = processor.analyze_tokenization(text)

# Batch processing
texts = [
    "Short text",
    "This is a longer text that demonstrates batch tokenization",
    "Another example"
]
batch = processor.tokenize_batch(texts)
print(f"\nBatch shape: {batch['input_ids'].shape}")

# Decoding
decoded = processor.decode(batch['input_ids'][1])
print(f"Decoded text: {decoded}")

print(f"\nVocabulary size: {processor.get_vocab_size()}")

Key Concepts
Token
A single unit of text after tokenization - can be a word, subword, or character.
Vocabulary
The set of all unique tokens that the tokenizer knows. Size typically ranges from 30k-50k for modern models.
OOV (Out-of-Vocabulary)
Words that don't exist in the tokenizer's vocabulary, replaced with [UNK] token in word-level tokenization.
Subword Units
Meaningful parts of words (prefixes, suffixes, roots) that balance vocabulary size and sequence length.
Token ID
Numerical representation of a token, used as input to neural networks. "hello" might map to ID 2534.
Padding
Adding [PAD] tokens to make all sequences in a batch the same length for efficient processing.
Attention Mask
Binary mask indicating which tokens are real (1) vs padding (0), used in transformer models.
Detokenization
Converting tokens back to readable text, the reverse of tokenization.
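Several of these terms can be seen together in a few lines: token IDs, padding, the attention mask, and detokenization. A compact sketch, again assuming the Hugging Face bert-base-uncased tokenizer.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

batch = tokenizer(
    ["Tokens become IDs.", "Shorter text"],
    padding=True,              # pad the shorter sequence with [PAD]
    return_tensors='pt'
)

print("Token IDs:     ", batch['input_ids'])       # numerical inputs to the model
print("Attention mask:", batch['attention_mask'])  # 1 = real token, 0 = padding
print("Detokenized:   ", tokenizer.decode(batch['input_ids'][0],
                                          skip_special_tokens=True))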
Interview Tips
- 💡Explain tokenization as breaking text into processable units (words, subwords, or characters)
- 💡Know the three main types: word-level, character-level, and subword tokenization
- 💡Understand subword algorithms: BPE (GPT), WordPiece (BERT), SentencePiece (T5)
- 💡Explain the OOV problem and how subword tokenization solves it
- 💡Know the vocabulary size trade-off: larger vocab = shorter sequences but more memory
- 💡Understand special tokens: [CLS], [SEP], [PAD], [UNK], [MASK] and their purposes
- 💡Be familiar with Hugging Face tokenizers library - the industry standard
- 💡Explain why BERT uses WordPiece and GPT uses BPE (different design choices)
- 💡Know challenges: multilingual text, compound words, handling emojis/special characters
- 💡Understand that tokenization is not always perfectly reversible (normalization and lowercasing can lose information)
- 💡Explain how tokenization affects model performance and efficiency
- 💡Know that different languages need different tokenization strategies
- 💡Be able to implement basic word-level tokenization from scratch
- 💡Understand the relationship between tokenization and vocabulary size in memory usage
- 💡Discuss modern trends: character-level for some tasks, subword for most production systems