Word Embeddings
Understanding Word Embeddings: Dense vector representations capturing semantic meaning
What are Word Embeddings?
Word embeddings are dense vector representations of words in a continuous vector space. Instead of representing words as discrete symbols, embeddings map words to vectors of real numbers (typically 50-300 dimensions) where semantically similar words have similar vector representations.
💡 Simple Analogy:
Think of word embeddings like GPS coordinates for words in 'meaning space'. Just as cities geographically close to each other have similar coordinates, words with similar meanings have similar vectors. 'King' and 'Queen' would be close neighbors, while 'King' and 'Banana' would be far apart.
🎯 Why Needed:
Traditional one-hot encoding creates sparse, high-dimensional vectors with no notion of similarity (all words are equally distant). Word embeddings solve this by learning dense, low-dimensional representations that capture semantic relationships and enable arithmetic operations on meaning.
One-Hot Encoding vs Word Embeddings
One-Hot Encoding
A binary vector the size of the vocabulary, with a single 1 and all other entries 0
Vocab: [cat, dog, bird]
cat = [1, 0, 0]
dog = [0, 1, 0]
❌ Issues:
- Vocabulary size = vector dimension (10k-100k)
- Sparse vectors (mostly zeros)
- No semantic similarity
- All words equally distant
Word Embeddings
Dense vector with fixed dimension (e.g., 300)
cat = [0.2, -0.5, 0.1, ...] (300 dims)
dog = [0.3, -0.4, 0.2, ...] (300 dims)
✓ Benefits:
- Fixed dimension (50-300)
- Dense vectors (all values meaningful)
- Captures semantic similarity
- Similar words have similar vectors
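To make the contrast concrete, here is a minimal numeric sketch (the dense embedding values are made up for illustration, not learned): any two distinct one-hot vectors have cosine similarity 0, while dense vectors can express graded similarity.

# Toy comparison of one-hot vs. dense vectors (illustrative values only)
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot: vocab = [cat, dog, bird]; every pair of different words scores 0
cat_onehot = np.array([1, 0, 0])
dog_onehot = np.array([0, 1, 0])
print(cosine(cat_onehot, dog_onehot))   # 0.0 -- no notion of relatedness

# Dense embeddings (hand-picked 4-dim values): related words score higher
cat_emb  = np.array([0.2, -0.5, 0.1, 0.8])
dog_emb  = np.array([0.3, -0.4, 0.2, 0.7])
bird_emb = np.array([-0.6, 0.9, 0.4, -0.1])
print(cosine(cat_emb, dog_emb))         # high: 'cat' and 'dog' are close
print(cosine(cat_emb, bird_emb))        # much lower: less related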
Popular Word Embedding Methods
Word2Vec (2013)
Mikolov et al. (Google)
Predicts words from context (CBOW) or context from words (Skip-gram)
Pros:
- Fast to train
- Good at capturing syntactic and semantic regularities
- Widely used baseline
Cons:
- One vector per word (polysemy is not captured)
- No context sensitivity
GloVe (2014)
Pennington et al. (Stanford)
Global Vectors - factorizes word co-occurrence matrix
Pros:
- Captures global co-occurrence statistics
- Strong performance on word similarity and analogy tasks
- Pre-trained vectors available
Cons:
- Static embeddings
- Memory intensive for large vocabularies
FastText (2017)
Bojanowski et al. (Facebook)
Extension of Word2Vec using character n-grams
Pros:
- Handles OOV words
- Works for morphologically rich languages
- Captures subword structure
Cons:
- Larger model size
- Slower than Word2Vec
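Code Example: FastText Subword Embeddings
Since FastText's main advantage over Word2Vec is subword handling, a small sketch is worth showing here. The toy corpus, the parameter values, and the misspelled query word 'catz' below are illustrative assumptions, not part of the original examples.

# FastText with Gensim: subword embeddings handle out-of-vocabulary words
from gensim.models import FastText

# Toy corpus (illustrative only; real training needs far more text)
sentences = [
    ['the', 'cat', 'sat', 'on', 'the', 'mat'],
    ['dogs', 'and', 'cats', 'are', 'pets'],
]

model = FastText(
    sentences,
    vector_size=50,    # Embedding dimension
    window=3,          # Context window
    min_count=1,
    min_n=3,           # Smallest character n-gram
    max_n=5            # Largest character n-gram
)

# 'catz' never appears in the corpus, but FastText builds a vector for it
# from its character n-grams (<ca, cat, atz, tz>, ...)
oov_vector = model.wv['catz']
print(f"OOV vector shape: {oov_vector.shape}")          # (50,)
print(f"Similarity(cat, catz) = {model.wv.similarity('cat', 'catz'):.3f}")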
Code Example: Word2Vec Basics
# Word Embeddings Basics with Gensim
from gensim.models import Word2Vec
import numpy as np

# Sample corpus
sentences = [
    ['king', 'queen', 'man', 'woman'],
    ['paris', 'france', 'berlin', 'germany'],
    ['cat', 'dog', 'animal', 'pet'],
    ['good', 'bad', 'happy', 'sad']
]

# Train Word2Vec model (Skip-gram)
model = Word2Vec(
    sentences,
    vector_size=100,   # Embedding dimension
    window=5,          # Context window
    min_count=1,       # Minimum word frequency
    workers=4,
    sg=1               # Skip-gram (1) or CBOW (0)
)

# Get vector for a word
king_vector = model.wv['king']
print(f"King vector shape: {king_vector.shape}")
print(f"King vector (first 5 dims): {king_vector[:5]}")

# Find similar words
similar_to_king = model.wv.most_similar('king', topn=3)
print(f"Words similar to 'king': {similar_to_king}")

# Word arithmetic: king - man + woman ≈ queen
result = model.wv.most_similar(
    positive=['king', 'woman'],
    negative=['man'],
    topn=1
)
print(f"king - man + woman = {result}")

# Cosine similarity
similarity = model.wv.similarity('king', 'queen')
print(f"Similarity(king, queen) = {similarity:.3f}")

Code Example: Pre-trained GloVe
# Using Pre-trained GloVe Embeddings
import numpy as np
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

# Convert GloVe format to Word2Vec format (one-time)
# Note: glove2word2vec is deprecated in newer gensim (>= 4.0); there you can
# load the GloVe file directly with load_word2vec_format(..., no_header=True)
glove_file = 'glove.6B.100d.txt'  # Download from Stanford NLP
word2vec_file = 'glove.6B.100d.word2vec.txt'
glove2word2vec(glove_file, word2vec_file)

# Load pre-trained embeddings
embeddings = KeyedVectors.load_word2vec_format(word2vec_file, binary=False)
print(f"Vocabulary size: {len(embeddings)}")
print(f"Embedding dimension: {embeddings.vector_size}")

# Get embedding for a word
word = 'computer'
vector = embeddings[word]
print(f"{word} embedding: {vector[:5]}...")

# Find similar words
similar_words = embeddings.most_similar('computer', topn=5)
print(f"Words similar to '{word}': {similar_words}")

# Analogies: paris - france + germany ≈ berlin
result = embeddings.most_similar(
    positive=['paris', 'germany'],
    negative=['france'],
    topn=1
)
print(f"paris - france + germany = {result[0][0]}")

# Calculate similarity
sim = embeddings.similarity('car', 'automobile')
print(f"Similarity(car, automobile) = {sim:.3f}")

Contextual Embeddings (Modern Approach)
Unlike static embeddings, contextual embeddings generate different vectors for the same word based on context.
Example:
"I went to the bank to deposit money" vs "I sat on the river bank" - 'bank' gets different vectors
ELMo (2018)
Embeddings from Language Models
Deep contextualized representations from a bidirectional LSTM language model
BERT (2019)
Bidirectional Encoder Representations from Transformers
Transformer encoder trained with masked language modeling (see the fill-mask demo below)
GPT Series
Generative Pre-trained Transformers
Transformer decoder with causal language modeling
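Since BERT's pre-training objective is masked language modeling, a quick way to see it in action is the transformers fill-mask pipeline. This is a minimal sketch; the prompt below simply reuses the 'bank' example, and the predicted tokens will vary.

# Masked language modeling in action (the training objective behind BERT)
from transformers import pipeline

# 'bert-base-uncased' matches the model used in the BERT example below
unmasker = pipeline('fill-mask', model='bert-base-uncased')

# BERT predicts the hidden token from both left and right context
for prediction in unmasker("I went to the [MASK] to deposit money.")[:3]:
    print(prediction['token_str'], round(prediction['score'], 3))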
Code Example: Contextual Embeddings (BERT)
# Contextual Embeddings with BERT
from transformers import BertTokenizer, BertModel
import torch
import numpy as np
from numpy.linalg import norm

# Load pre-trained BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Two sentences with the word "bank" in different contexts
sentence1 = "I went to the bank to deposit money"
sentence2 = "I sat on the river bank"

def get_word_embedding(sentence, word, tokenizer, model):
    # Tokenize
    tokens = tokenizer.tokenize(sentence)
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokens)
    tokens_tensor = torch.tensor([indexed_tokens])

    # Get embeddings
    with torch.no_grad():
        outputs = model(tokens_tensor)
        # Last hidden state: (batch_size, sequence_length, hidden_size)
        hidden_states = outputs.last_hidden_state

    # Find word position
    word_idx = tokens.index(word)
    word_embedding = hidden_states[0, word_idx, :]
    return word_embedding.numpy()

# Get contextual embeddings for "bank" in both sentences
bank_embedding1 = get_word_embedding(sentence1, "bank", tokenizer, model)
bank_embedding2 = get_word_embedding(sentence2, "bank", tokenizer, model)
print(f"Bank embedding 1 shape: {bank_embedding1.shape}")  # (768,)
print(f"Bank embedding 2 shape: {bank_embedding2.shape}")  # (768,)

# Calculate cosine similarity
cosine_sim = np.dot(bank_embedding1, bank_embedding2) / (norm(bank_embedding1) * norm(bank_embedding2))
print(f"Cosine similarity between 'bank' in different contexts: {cosine_sim:.3f}")

# A lower similarity shows BERT assigns context-dependent meanings
print(f"Bank (financial) first 5 dims: {bank_embedding1[:5]}")
print(f"Bank (river) first 5 dims: {bank_embedding2[:5]}")

Remarkable Properties of Word Embeddings
Semantic Similarity
Similar words have similar vectors
cosine_similarity(king, queen) > cosine_similarity(king, banana)
Word Arithmetic
Vector operations capture semantic relationships
king - man + woman ≈ queen
paris - france + germany ≈ berlin
Analogies
Can solve analogy tasks through vector math
man : woman :: king : ?
vec(woman) - vec(man) + vec(king) ≈ vec(queen) (see the sketch below)
Clustering
Related words cluster together in vector space
Countries, cities, verbs, adjectives form distinct clusters
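The arithmetic behind these analogies is nothing more than vector addition followed by a nearest-neighbor search. A minimal hand-crafted sketch: the 2-dimensional vectors and the word list below are invented so the analogy holds by construction (dimension 0 roughly encoding "royalty", dimension 1 "gender"); real embeddings learn such directions from data.

# Word arithmetic by hand (toy 2-dim vectors chosen for illustration)
import numpy as np

vectors = {
    'king':  np.array([0.9,  0.8]),
    'queen': np.array([0.9, -0.8]),
    'man':   np.array([0.1,  0.8]),
    'woman': np.array([0.1, -0.8]),
    'apple': np.array([-0.7, 0.1]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman -> nearest remaining word should be 'queen'
target = vectors['king'] - vectors['man'] + vectors['woman']
candidates = [w for w in vectors if w not in ('king', 'man', 'woman')]
best = max(candidates, key=lambda w: cosine(vectors[w], target))
print(best)   # queen -- the man->woman offset transfers to king->queen

gensim's most_similar(positive=..., negative=...) used in the earlier examples performs this same search over learned vectors.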
Applications
- Text classification and sentiment analysis
- Named entity recognition (NER)
- Machine translation
- Question answering systems
- Document similarity and clustering
- Recommendation systems
- Information retrieval and search
Key Concepts
Embedding Dimension
Size of the vector (typically 50-300). Higher dimensions capture more nuance but require more data.
Cosine Similarity
Measure of similarity between two vectors based on the angle between them. Range [-1, 1], where 1 means identical direction (see the sketch after this list).
Context Window
Number of surrounding words considered when training embeddings (e.g., 5 words before and after).
Negative Sampling
Training technique that samples a few negative examples (non-context words) per update instead of normalizing over the whole vocabulary, which keeps training efficient (see the sketch after this list).
Subword Embeddings
Embeddings for word pieces (FastText, BPE) rather than whole words, handling OOV better.
Static vs Contextual
Static: one vector per word (Word2Vec). Contextual: different vectors based on context (BERT).
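As referenced above, a short sketch makes two of these concepts concrete: cosine similarity as a normalized dot product, and the skip-gram negative-sampling loss for a single (center word, context word) pair. The vectors here are random placeholders rather than trained embeddings.

# Cosine similarity and a single skip-gram negative-sampling (SGNS) loss term
# (random placeholder vectors; a real model would learn these)
import numpy as np

rng = np.random.default_rng(0)
dim = 8

def cosine(a, b):
    # Dot product of the vectors divided by the product of their lengths
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v_center = rng.normal(size=dim)          # vector of the center word
u_context = rng.normal(size=dim)         # vector of an observed context word
u_negatives = rng.normal(size=(5, dim))  # 5 sampled non-context ("negative") words

print(f"cosine(center, context) = {cosine(v_center, u_context):.3f}")

# SGNS maximizes log sigma(u_o . v_c) + sum_k log sigma(-u_k . v_c),
# so only 1 + k word vectors are updated instead of the whole vocabulary
loss = -np.log(sigmoid(u_context @ v_center))
loss -= np.sum(np.log(sigmoid(-(u_negatives @ v_center))))
print(f"SGNS loss for this training pair: {loss:.3f}")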
Interview Tips
- Explain word embeddings as dense vector representations of words in continuous space
- Know the difference between one-hot encoding (sparse, no similarity) and embeddings (dense, semantic similarity)
- Understand Word2Vec: CBOW predicts a word from its context, Skip-gram predicts context from a word
- Know the famous example: king - man + woman ≈ queen
- Explain GloVe as a matrix factorization approach using co-occurrence statistics
- Understand that FastText handles OOV words using character n-grams
- Know the shift to contextual embeddings (ELMo, BERT) that vary by context
- Explain why embeddings work: the distributional hypothesis, i.e. words in similar contexts have similar meanings
- Be familiar with cosine similarity for measuring vector similarity
- Know typical embedding dimensions: 50-300 for static, 768-1024 for contextual
- Understand pre-trained embeddings vs training your own
- Know applications: classification, NER, translation, search
- Explain that embeddings can capture semantic and syntactic relationships
- Discuss limitations: bias in embeddings, polysemy in static embeddings