Tokenization in NLP
Understanding Tokenization: Breaking text into meaningful units for machine learning
What is Tokenization?
Tokenization is the process of breaking down text into smaller units called tokens. These tokens can be words, subwords, or characters. Tokenization is the first and most fundamental step in any NLP pipeline, converting raw text into a format that machine learning models can process.
💡 Simple Analogy:
Think of tokenization like breaking a sentence into puzzle pieces. Just as you need to separate puzzle pieces before assembling them, NLP models need text broken into tokens before they can understand and process language. Different tokenization strategies are like different ways to cut the puzzle - you can cut by words, syllables, or even individual letters.
⚡ Why It Matters:
Tokenization directly impacts model performance, vocabulary size, and the ability to handle unseen words. Modern language models like BERT, GPT, and T5 all rely on sophisticated tokenization strategies to achieve state-of-the-art results.
Types of Tokenization
Different tokenization strategies have different trade-offs:
Word-level Tokenization
Split text into individual words
Method:
Split on whitespace and punctuation
Example:
"Hello world!" → ["Hello", "world", "!"]✓ Pros:
- • Simple and intuitive
- • Preserves word meaning
- • Easy to implement
✗ Cons:
- • Large vocabulary size
- • Cannot handle OOV (out-of-vocabulary) words
- • Different forms treated as different tokens (run, running, ran)
Use Cases:
Simple NLP tasks, small vocabulary domains
Character-level Tokenization
Split text into individual characters
Method:
Each character becomes a token
Example:
"Hello" → ["H", "e", "l", "l", "o"]✓ Pros:
- • Very small vocabulary (~100 chars)
- • No OOV problem
- • Works across languages
✗ Cons:
- • Very long sequences
- • Loses word-level meaning
- • Harder to learn patterns
Use Cases:
Spelling correction, morphologically rich languages, DNA/protein sequences
Subword Tokenization
Split text into meaningful subword units
Method:
Learn frequent subword patterns from data
Example:
"unhappiness" → ["un", "happiness"]✓ Pros:
- • Balance between word and character
- • Handles OOV words
- • Reasonable vocabulary size
- • Captures morphology
✗ Cons:
- • Requires training
- • More complex implementation
Use Cases:
Modern transformers (BERT, GPT), multilingual models
Code Example: Basic Tokenization
# Basic Tokenization Examples

import re

# 1. Simple Whitespace Tokenization
text = "Hello world! How are you?"
tokens = text.split()
print(tokens)
# Output: ['Hello', 'world!', 'How', 'are', 'you?']

# 2. Word Tokenization (with punctuation handling)
def word_tokenize(text):
    # Split on whitespace and punctuation
    return re.findall(r'\b\w+\b|[^\w\s]', text)

tokens = word_tokenize("Hello world! How are you?")
print(tokens)
# Output: ['Hello', 'world', '!', 'How', 'are', 'you', '?']

# 3. Character Tokenization
text = "Hello"
char_tokens = list(text)
print(char_tokens)
# Output: ['H', 'e', 'l', 'l', 'o']

# 4. Building Vocabulary
def build_vocab(texts):
    vocab = set()
    for text in texts:
        tokens = word_tokenize(text.lower())
        vocab.update(tokens)
    # Add special tokens
    vocab.update(['[PAD]', '[UNK]', '[CLS]', '[SEP]'])
    return {token: idx for idx, token in enumerate(sorted(vocab))}

texts = ["Hello world", "How are you", "Hello there"]
vocab = build_vocab(texts)
print(f"Vocabulary size: {len(vocab)}")
print(f"Sample mappings: {list(vocab.items())[:5]}")

# 5. Token to ID conversion
def tokens_to_ids(tokens, vocab):
    return [vocab.get(token, vocab['[UNK]']) for token in tokens]

tokens = word_tokenize("hello world")
token_ids = tokens_to_ids(tokens, vocab)
print(f"Tokens: {tokens}")
print(f"Token IDs: {token_ids}")

Subword Tokenization Algorithms
Modern NLP relies on subword tokenization. Here are the most popular algorithms:
Byte Pair Encoding (BPE)
Used by: GPT, GPT-2, RoBERTa
Iteratively merges the most frequent pair of characters/subwords
How it works:
1. Start with a character vocabulary
2. Find the most frequent adjacent pair
3. Merge this pair into a new token
4. Repeat until the desired vocab size is reached
Example:
"low" + "lowest" → learns "lo", "low", "est" as tokens
✓ Pros:
- • Data-driven
- • Language-agnostic
- • Handles rare words well
✗ Cons:
- • Greedy algorithm
- • Token boundaries may not align with morphemes
WordPiece
Used by: BERT, DistilBERT
Similar to BPE but chooses merges based on likelihood
How it works:
1. Start with a character vocabulary
2. Select the pair that maximizes the likelihood of the training data
3. Add the ## prefix to non-starting subwords
4. Repeat until the desired vocab size is reached
(A short runnable sketch follows the pros and cons below.)
Example:
"unhappiness" → ["un", "##happiness"] or ["un", "##happ", "##iness"]
✓ Pros:
- • Probabilistic approach
- • Better handles morphology
- • ## prefix indicates continuation
✗ Cons:
- • More complex than BPE
- • Requires larger training corpus
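A quick way to see the ## continuation prefix in practice is to run a pre-trained WordPiece tokenizer over words unlikely to be in its vocabulary. A minimal sketch, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint are available; the exact splits depend on the learned vocabulary.

from transformers import AutoTokenizer

# Load BERT's pre-trained WordPiece tokenizer (downloads on first use)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

for word in ["unhappiness", "tokenization", "electroencephalography"]:
    pieces = tokenizer.tokenize(word)
    # Pieces after the first carry the ## continuation prefix,
    # e.g. something like ['un', '##happi', '##ness'] (exact splits vary by vocab)
    print(f"{word:>25} -> {pieces}")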
SentencePiece
Used by: T5, ALBERT, XLNet
Treats text as raw Unicode, language-independent
How it works:
Directly trains on raw text without pre-tokenization. Implements both BPE and Unigram algorithms. Treats spaces as special characters (▁).
Example:
"Hello world" → ["▁Hello", "▁world"]
✓ Pros:
- • Truly language-agnostic
- • No pre-tokenization needed
- • Handles all languages uniformly
✗ Cons:
- • Requires training from scratch
- • Special character handling
Unigram Language Model
Used by: T5 (with SentencePiece), ALBERT
Uses probabilistic model to find best tokenization
How it works:
1. Start with a large vocabulary
2. Iteratively remove tokens that least affect the likelihood
3. Keep tokens that maximize the data likelihood
4. Repeat until the desired vocab size is reached
(A toy scoring sketch follows the pros and cons below.)
Example:
Considers all possible segmentations, picks most probable
✓ Pros:
- • Probabilistic foundation
- • Multiple segmentations possible
- • Better theoretical grounding
✗ Cons:
- • Slower than BPE
- • More computationally expensive
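To make "considers all possible segmentations" concrete, here is a toy sketch of unigram scoring: each segmentation of a word is scored by the product of its token probabilities (sum of log-probabilities), and the highest-scoring one wins. The vocabulary and probabilities below are made up for illustration, not taken from any real model.

import math

# Hypothetical unigram vocabulary with made-up probabilities
vocab = {
    "un": 0.10, "happi": 0.05, "ness": 0.08,
    "unhappi": 0.01, "happiness": 0.02, "u": 0.03, "n": 0.04,
}

def segmentations(word):
    """Enumerate all ways to split `word` into in-vocabulary pieces."""
    if not word:
        yield []
        return
    for i in range(1, len(word) + 1):
        piece = word[:i]
        if piece in vocab:
            for rest in segmentations(word[i:]):
                yield [piece] + rest

def score(pieces):
    """Log-probability of a segmentation under the unigram model."""
    return sum(math.log(vocab[p]) for p in pieces)

word = "unhappiness"
candidates = list(segmentations(word))
for seg in candidates:
    print(seg, round(score(seg), 2))
# With these made-up probabilities, ['un', 'happiness'] scores highest
print("Best segmentation:", max(candidates, key=score))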
Code Example: Hugging Face Tokenizers
# Using Hugging Face Tokenizers - Industry Standard

from transformers import AutoTokenizer

# 1. BERT WordPiece Tokenizer
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

text = "Tokenization is the first step in NLP."
encoded = bert_tokenizer(text, return_tensors='pt')

print("Tokens:", bert_tokenizer.tokenize(text))
print("Token IDs:", encoded['input_ids'])
print("Attention Mask:", encoded['attention_mask'])

# Decode back to text
decoded = bert_tokenizer.decode(encoded['input_ids'][0])
print("Decoded:", decoded)

# Output:
# Tokens: ['token', '##ization', 'is', 'the', 'first', 'step', 'in', 'nl', '##p', '.']
# Note: ## prefix indicates subword continuation

# 2. GPT-2 BPE Tokenizer
gpt2_tokenizer = AutoTokenizer.from_pretrained('gpt2')

text = "Tokenization is crucial for NLP models."
tokens = gpt2_tokenizer.tokenize(text)
print("GPT-2 Tokens:", tokens)
# BPE learns frequent subword patterns

# 3. Handling Special Tokens
text1 = "First sentence."
text2 = "Second sentence."

# For sentence pair tasks (e.g., question answering)
encoded = bert_tokenizer(
    text1, text2,
    padding='max_length',
    max_length=20,
    truncation=True,
    return_tensors='pt'
)
print("With special tokens:")
print(bert_tokenizer.convert_ids_to_tokens(encoded['input_ids'][0]))
# [CLS] first sentence . [SEP] second sentence . [SEP] [PAD] [PAD] ...

# 4. Batch Tokenization
texts = [
    "Short text",
    "This is a much longer text that needs to be tokenized",
    "Medium length"
]
encoded_batch = bert_tokenizer(
    texts,
    padding=True,       # Pad to longest in batch
    truncation=True,
    max_length=15,
    return_tensors='pt'
)
print("Batch token IDs shape:", encoded_batch['input_ids'].shape)
print("Attention masks shape:", encoded_batch['attention_mask'].shape)

# 5. Handling OOV Words with Subword Tokenization
text = "COVID-19 is a coronavirus disease."  # COVID-19 might be OOV
tokens = bert_tokenizer.tokenize(text)
print("Tokens with OOV handling:", tokens)
# Subword tokenization breaks unknown words into known subwords
# e.g., ["co", "##vid", "-", "19"] or similar

Code Example: BPE from Scratch
# Implementing Simple BPE from Scratch

from collections import Counter, defaultdict

def get_stats(vocab):
    """Count frequency of adjacent pairs"""
    pairs = defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i+1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge the most frequent pair in vocabulary"""
    new_vocab = {}
    bigram = ' '.join(pair)
    replacement = ''.join(pair)
    for word in vocab:
        new_word = word.replace(bigram, replacement)
        new_vocab[new_word] = vocab[word]
    return new_vocab

def learn_bpe(text, num_merges=10):
    """Learn BPE merges from text"""
    # Initialize vocabulary (character-level)
    vocab = Counter()
    for word in text.lower().split():
        vocab[' '.join(word) + ' </w>'] += 1

    print(f"Initial vocabulary size: {len(vocab)}")
    print(f"Sample: {list(vocab.items())[:3]}")

    # Learn merges
    merges = []
    for i in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:
            break
        best_pair = max(pairs, key=pairs.get)
        vocab = merge_vocab(best_pair, vocab)
        merges.append(best_pair)
        print(f"Merge {i+1}: {best_pair} (freq: {pairs[best_pair]})")

    return merges, vocab

# Example usage
text = """low low low low low
lower lower lower
lowest lowest"""

merges, final_vocab = learn_bpe(text, num_merges=5)

print("\nLearned merges:")
for i, merge in enumerate(merges):
    print(f"{i+1}. {merge}")

print("\nFinal vocabulary:")
for word, freq in list(final_vocab.items())[:10]:
    print(f"{word}: {freq}")

# Applying BPE to a new word
def apply_bpe(word, merges):
    """Apply learned merges to tokenize a word"""
    word = ' '.join(word) + ' </w>'
    for merge in merges:
        bigram = ' '.join(merge)
        replacement = ''.join(merge)
        word = word.replace(bigram, replacement)
    return word.split()

new_word = "lower"
tokens = apply_bpe(new_word, merges)
print(f"\nTokenizing '{new_word}': {tokens}")

Special Tokens
Tokenizers add special tokens for specific purposes:
[CLS] or <s>: Start of sequence (classification token)
Used by: BERT, RoBERTa
[SEP] or </s>: Separator between sentences or end of sequence
Used by: BERT, RoBERTa
[PAD]: Padding to make all sequences in a batch the same length
Used by: All models
[UNK]: Unknown/out-of-vocabulary token
Used by: All models
[MASK]: Masked token for masked language modeling
Used by: BERT
<bos>: Beginning of sequence
Used by: GPT-style autoregressive models
<eos>: End of sequence
Used by: GPT-style autoregressive models (GPT-2 uses a single <|endoftext|> token for both roles)
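Rather than memorizing which model uses which tokens, you can inspect them directly. A small sketch, assuming the Hugging Face transformers library; special_tokens_map and all_special_tokens are standard attributes on its tokenizer classes.

from transformers import AutoTokenizer

for name in ['bert-base-uncased', 'gpt2']:
    tok = AutoTokenizer.from_pretrained(name)
    print(name)
    print("  special tokens map:", tok.special_tokens_map)
    print("  all special tokens:", tok.all_special_tokens)
    # BERT reports [CLS]/[SEP]/[PAD]/[UNK]/[MASK]; GPT-2 mainly
    # uses a single <|endoftext|> token for bos/eos/unk.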
Tokenization Challenges
Out-of-Vocabulary (OOV) Words
Word-level tokenization cannot handle words not in vocabulary
Examples:
- • New words: "COVID-19", "cryptocurrency"
- • Typos: "helllo" instead of "hello"
- • Named entities: "Elon Musk" as single unit
💡 Solution:
Use subword tokenization (BPE, WordPiece) to break unknown words into known subwords
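The contrast is easy to see by looking up a rare word in a fixed word-level vocabulary versus passing it through a subword tokenizer. A rough sketch, assuming the Hugging Face bert-base-uncased tokenizer; the toy word-level vocabulary is invented for the example, and the exact subword splits depend on BERT's learned vocabulary.

from transformers import AutoTokenizer

# Toy word-level vocabulary (invented for illustration)
word_vocab = {'[PAD]': 0, '[UNK]': 1, 'the': 2, 'virus': 3, 'spreads': 4}

def word_level_ids(text):
    # Anything outside the vocabulary collapses to [UNK]: information is lost
    return [word_vocab.get(w, word_vocab['[UNK]']) for w in text.lower().split()]

text = "cryptocurrency spreads"
print("Word-level IDs:", word_level_ids(text))   # the rare word becomes [UNK]

# Subword tokenization keeps the rare word as a sequence of known pieces
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
print("WordPiece tokens:", bert_tokenizer.tokenize(text))
# e.g. something like ['crypto', '##cur', '##rency', 'spreads'] (splits vary by vocab)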
Multilingual Tokenization
Different languages have different writing systems and word boundaries
Examples:
- • Chinese/Japanese have no spaces
- • Arabic/Hebrew are right-to-left
- • German has compound words: "Donaudampfschifffahrtsgesellschaft"
💡 Solution:
Use language-agnostic methods like SentencePiece, Unicode normalization
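Unicode normalization is the standard-library part of that solution: visually identical strings can have different underlying code points, and normalizing them (NFKC is a common choice in tokenizer pipelines) makes them compare and tokenize consistently. A small sketch using Python's built-in unicodedata module.

import unicodedata

# Same visible text, different code points: full-width Latin vs ASCII,
# and an accented character written precomposed vs as base + combining mark
pairs = [
    ("ＮＬＰ", "NLP"),                # full-width vs ASCII letters
    ("caf\u00e9", "cafe\u0301"),      # 'é' precomposed vs 'e' + combining accent
]

for a, b in pairs:
    na, nb = unicodedata.normalize("NFKC", a), unicodedata.normalize("NFKC", b)
    print(f"{a!r} == {b!r}? {a == b}   after NFKC: {na == nb}")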
Vocabulary Size Trade-off
Large vocabulary = more memory, small vocabulary = longer sequences
Examples:
- • Word-level: 50k-100k tokens, short sequences
- • Character-level: ~100 tokens, very long sequences
- • Subword: 30k-50k tokens, balanced
💡 Solution:
Choose vocabulary size based on task, data, and computational constraints
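The sequence-length side of the trade-off can be measured directly: tokenize the same sentence at character, word, and subword granularity and compare the counts. A quick sketch, assuming the Hugging Face bert-base-uncased tokenizer; the exact subword count depends on its vocabulary.

from transformers import AutoTokenizer

text = "Tokenization strategies trade vocabulary size against sequence length."

char_tokens = list(text)                       # tiny vocab, longest sequence
word_tokens = text.split()                     # huge vocab, shortest sequence
subword_tokens = AutoTokenizer.from_pretrained(
    'bert-base-uncased').tokenize(text)        # ~30k vocab, in between

print("Characters:", len(char_tokens))
print("Words:     ", len(word_tokens))
print("Subwords:  ", len(subword_tokens))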
Normalization & Preprocessing
Should text be lowercased? How to handle punctuation, emojis, URLs?
Examples:
- • "Apple" (company) vs "apple" (fruit)
- • Hashtags: #MachineLearning
- • Emojis: 😀 🎉
💡 Solution:
Task-dependent: sentiment analysis often keeps emojis and casing, while NER needs casing to identify proper nouns
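Whether to lowercase is often baked into the tokenizer you pick: cased and uncased checkpoints of the same model normalize text differently. A small sketch assuming the Hugging Face bert-base-cased and bert-base-uncased checkpoints; exact splits depend on each vocabulary.

from transformers import AutoTokenizer

cased = AutoTokenizer.from_pretrained('bert-base-cased')
uncased = AutoTokenizer.from_pretrained('bert-base-uncased')

text = "Apple released a new iPhone 😀 #MachineLearning"

# The cased tokenizer preserves capitalization (useful for NER);
# the uncased one lowercases everything before applying WordPiece.
print("cased:  ", cased.tokenize(text))
print("uncased:", uncased.tokenize(text))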
Code Example: SentencePiece
# Using SentencePiece - Language-Agnostic Tokenization

import sentencepiece as spm

# 1. Training a SentencePiece model from scratch
# First, save training data to a file
training_data = """This is a sample text for training.
SentencePiece works on raw text without preprocessing.
It handles all languages uniformly including 日本語 and العربية."""

with open('training_data.txt', 'w', encoding='utf-8') as f:
    f.write(training_data)

# Train the model
spm.SentencePieceTrainer.train(
    input='training_data.txt',
    model_prefix='tokenizer',
    vocab_size=60,             # keep small for this tiny demo corpus; real models use 30k+
    model_type='bpe',          # or 'unigram'
    character_coverage=1.0,    # For multilingual, use 0.9995
    pad_id=0,
    unk_id=1,
    bos_id=2,
    eos_id=3
)

# 2. Loading and using the trained model
sp = spm.SentencePieceProcessor()
sp.load('tokenizer.model')

text = "Hello world! This is SentencePiece."

# Tokenize to pieces
tokens = sp.encode_as_pieces(text)
print("Tokens:", tokens)
# Output: ['▁Hello', '▁world', '!', '▁This', '▁is', '▁Sent', 'ence', 'P', 'iece', '.']
# ▁ represents space

# Tokenize to IDs
ids = sp.encode_as_ids(text)
print("Token IDs:", ids)

# Decode back
decoded = sp.decode_ids(ids)
print("Decoded:", decoded)

# 3. Using the pre-trained T5 tokenizer (uses SentencePiece)
from transformers import T5Tokenizer

t5_tokenizer = T5Tokenizer.from_pretrained('t5-small')

text = "translate English to German: Hello, how are you?"
tokens = t5_tokenizer.tokenize(text)
print("T5 Tokens:", tokens)
# T5 uses SentencePiece with special task prefixes

encoded = t5_tokenizer(text, return_tensors='pt')
print("Encoded shape:", encoded['input_ids'].shape)

Tokenization in Practice
Hugging Face Tokenizers
Fast, modern tokenization library
- ✓ Extremely fast (Rust backend)
- ✓ Supports all major algorithms
- ✓ Pre-trained tokenizers available
- ✓ Easy to use with Transformers library
spaCy
Industrial-strength NLP library
- ✓ Rule-based tokenization
- ✓ Language-specific models
- ✓ Handles complex cases well
- ✓ Integrated with other NLP tools
NLTK
Educational and research toolkit
- ✓ Multiple tokenization methods
- ✓ Good for learning
- ✓ Extensive documentation
- ✗ Slower than modern alternatives
SentencePiece
Google's unsupervised tokenizer
- ✓ Language-independent
- ✓ Trains from raw text
- ✓ BPE and Unigram support
- ✓ Used in production models
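For comparison with the Hugging Face examples above, here is roughly how the same sentence is tokenized with NLTK and spaCy. A sketch, assuming nltk (with the punkt tokenizer data; newer NLTK versions also need punkt_tab) and spacy with the small English model en_core_web_sm installed.

import nltk
import spacy

text = "Let's tokenize this sentence, shall we?"

# NLTK: classic rule/regex-based word tokenizer (needs the punkt data)
for pkg in ('punkt', 'punkt_tab'):
    nltk.download(pkg, quiet=True)
print("NLTK:", nltk.word_tokenize(text))

# spaCy: rule-based tokenizer tied to a language-specific pipeline
nlp = spacy.load("en_core_web_sm")
print("spaCy:", [token.text for token in nlp(text)])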
Code Example: Practical Pipeline
# Practical Tokenization Pipeline

from transformers import AutoTokenizer

class TextProcessor:
    """Complete text processing pipeline"""

    def __init__(self, model_name='bert-base-uncased', max_length=128):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.max_length = max_length

    def preprocess(self, text):
        """Clean and normalize text"""
        # Remove extra whitespace
        text = ' '.join(text.split())
        # Lowercase (if using uncased model)
        text = text.lower()
        return text

    def tokenize_single(self, text, add_special_tokens=True):
        """Tokenize a single text"""
        text = self.preprocess(text)
        encoded = self.tokenizer(
            text,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt',
            add_special_tokens=add_special_tokens
        )
        return {
            'input_ids': encoded['input_ids'],
            'attention_mask': encoded['attention_mask'],
            'tokens': self.tokenizer.convert_ids_to_tokens(encoded['input_ids'][0])
        }

    def tokenize_batch(self, texts):
        """Tokenize multiple texts efficiently"""
        texts = [self.preprocess(t) for t in texts]
        encoded = self.tokenizer(
            texts,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        return encoded

    def decode(self, token_ids, skip_special_tokens=True):
        """Convert token IDs back to text"""
        return self.tokenizer.decode(
            token_ids,
            skip_special_tokens=skip_special_tokens
        )

    def get_vocab_size(self):
        """Get tokenizer vocabulary size"""
        return self.tokenizer.vocab_size

    def analyze_tokenization(self, text):
        """Analyze how text is tokenized"""
        result = self.tokenize_single(text, add_special_tokens=True)
        print(f"Original text: {text}")
        print(f"Tokens: {result['tokens']}")
        print(f"Token IDs: {result['input_ids'][0].tolist()}")
        print(f"Attention mask: {result['attention_mask'][0].tolist()}")
        print(f"Number of tokens: {result['attention_mask'][0].sum().item()}")
        return result

# Example usage
processor = TextProcessor('bert-base-uncased', max_length=64)

# Single text
text = "Natural language processing is fascinating!"
result = processor.analyze_tokenization(text)

# Batch processing
texts = [
    "Short text",
    "This is a longer text that demonstrates batch tokenization",
    "Another example"
]
batch = processor.tokenize_batch(texts)
print(f"\nBatch shape: {batch['input_ids'].shape}")

# Decoding
decoded = processor.decode(batch['input_ids'][1])
print(f"Decoded text: {decoded}")

print(f"\nVocabulary size: {processor.get_vocab_size()}")

Key Concepts
Token
A single unit of text after tokenization - can be a word, subword, or character.
Vocabulary
The set of all unique tokens that the tokenizer knows. Size typically ranges from 30k-50k for modern models.
OOV (Out-of-Vocabulary)
Words that don't exist in the tokenizer's vocabulary, replaced with [UNK] token in word-level tokenization.
Subword Units
Meaningful parts of words (prefixes, suffixes, roots) that balance vocabulary size and sequence length.
Token ID
Numerical representation of a token, used as input to neural networks. "hello" might map to ID 2534.
Padding
Adding [PAD] tokens to make all sequences in a batch the same length for efficient processing.
Attention Mask
Binary mask indicating which tokens are real (1) vs padding (0), used in transformer models.
Detokenization
Converting tokens back to readable text, the reverse of tokenization.
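Several of these terms can be seen together in a few lines: token IDs, padding, the attention mask, and detokenization. A compact sketch, again assuming the Hugging Face bert-base-uncased tokenizer.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

batch = tokenizer(
    ["Tokens become IDs.", "Shorter text"],
    padding=True,              # pad the shorter sequence with [PAD]
    return_tensors='pt'
)

print("Token IDs:     ", batch['input_ids'])       # numerical inputs to the model
print("Attention mask:", batch['attention_mask'])  # 1 = real token, 0 = padding
print("Detokenized:   ", tokenizer.decode(batch['input_ids'][0],
                                          skip_special_tokens=True))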
Interview Tips
- 💡Explain tokenization as breaking text into processable units (words, subwords, or characters)
- 💡Know the three main types: word-level, character-level, and subword tokenization
- 💡Understand subword algorithms: BPE (GPT), WordPiece (BERT), SentencePiece (T5)
- 💡Explain the OOV problem and how subword tokenization solves it
- 💡Know the vocabulary size trade-off: larger vocab = shorter sequences but more memory
- 💡Understand special tokens: [CLS], [SEP], [PAD], [UNK], [MASK] and their purposes
- 💡Be familiar with Hugging Face tokenizers library - the industry standard
- 💡Explain why BERT uses WordPiece and GPT uses BPE (different design choices)
- 💡Know challenges: multilingual text, compound words, handling emojis/special characters
- 💡Understand that tokenization is not always perfectly reversible (normalization and lowercasing can lose information)
- 💡Explain how tokenization affects model performance and efficiency
- 💡Know that different languages need different tokenization strategies
- 💡Be able to implement basic word-level tokenization from scratch
- 💡Understand the relationship between tokenization and vocabulary size in memory usage
- 💡Discuss modern trends: character-level for some tasks, subword for most production systems