Word Embeddings

Understanding Word Embeddings: Dense vector representations capturing semantic meaning

What are Word Embeddings?

Word embeddings are dense vector representations of words in continuous vector space. Instead of representing words as discrete symbols, embeddings map words to vectors of real numbers (typically 50-300 dimensions) where semantically similar words have similar vector representations.

💡 Simple Analogy:

Think of word embeddings like GPS coordinates for words in 'meaning space'. Just as cities geographically close to each other have similar coordinates, words with similar meanings have similar vectors. 'King' and 'Queen' would be close neighbors, while 'King' and 'Banana' would be far apart.

🎯 Why Needed:

Traditional one-hot encoding creates sparse, high-dimensional vectors with no notion of similarity (all words are equally distant). Word embeddings solve this by learning dense, low-dimensional representations that capture semantic relationships and enable arithmetic operations on meaning.

One-Hot Encoding vs Word Embeddings

One-Hot Encoding

A binary vector the length of the vocabulary, with a single '1' and the rest '0's

Vocab: [cat, dog, bird]
cat = [1, 0, 0]
dog = [0, 1, 0]
bird = [0, 0, 1]

Issues:

  • Vocabulary size = vector dimension (10k-100k)
  • Sparse vectors (mostly zeros)
  • No semantic similarity
  • All words equally distant

Word Embeddings

Dense vector with fixed dimension (e.g., 300)

cat = [0.2, -0.5, 0.1, ...] (300 dims)
dog = [0.3, -0.4, 0.2, ...] (300 dims)

Benefits:

  • Fixed dimension (50-300)
  • Dense vectors (all values meaningful)
  • Captures semantic similarity
  • Similar words have similar vectors
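
A minimal NumPy sketch of this contrast, using hand-picked toy values for the dense vectors: one-hot vectors for two different words always have cosine similarity 0, while dense embeddings can express how related two words are.

python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot: vocab = [cat, dog, bird]
cat_onehot = np.array([1, 0, 0])
dog_onehot = np.array([0, 1, 0])
print(cosine(cat_onehot, dog_onehot))   # 0.0 -- every pair of distinct words is equally unrelated

# Dense embeddings (toy, hand-picked values for illustration only)
cat_dense = np.array([0.2, -0.5, 0.1])
dog_dense = np.array([0.3, -0.4, 0.2])
car_dense = np.array([-0.7, 0.9, -0.3])
print(cosine(cat_dense, dog_dense))     # high -- 'cat' and 'dog' point in a similar direction
print(cosine(cat_dense, car_dense))     # negative -- unrelated words point elsewhere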

Popular Word Embedding Methods

Word2Vec (2013)

Mikolov et al. (Google)

100-300 dimensions

Predicts words from context (CBOW) or context from words (Skip-gram)

+ Pros

  • Fast to train
  • Good at capturing syntax
  • Widely used baseline

- Cons

  • One vector per word (no polysemy)
  • No context sensitivity

GloVe (2014)

Pennington et al. (Stanford)

50-300 dimensions

Global Vectors - factorizes word co-occurrence matrix

+ Pros

  • Captures global statistics
  • Good performance
  • Pre-trained vectors available

- Cons

  • Static embeddings
  • Memory intensive for large vocab
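
To make "factorizes a word co-occurrence matrix" concrete, here is a minimal sketch of the counting step only (not GloVe's actual weighted least-squares training): build a matrix X where X[i][j] accumulates how often word j appears within a window of word i. GloVe then learns vectors such that the dot product of two word vectors (plus bias terms) approximates log X[i][j].

python
from collections import defaultdict

# Toy corpus; in practice GloVe is trained on billions of tokens
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]

window = 2
cooc = defaultdict(float)  # (center_word, context_word) -> weighted co-occurrence count

for sentence in corpus:
    for i, center in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                # GloVe weights nearby context words more heavily (1 / distance)
                cooc[(center, sentence[j])] += 1.0 / abs(i - j)

print(cooc[("sat", "cat")])   # co-occurrence weight for one pair
print(cooc[("sat", "dog")])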

FastText (2017)

Bojanowski et al. (Facebook)

100-300 dimensions

Extension of Word2Vec using character n-grams

+ Pros

  • Handles OOV words
  • Works for morphologically rich languages
  • Captures subword structure

- Cons

  • Larger model size
  • Slower than Word2Vec
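
A short sketch of the OOV behavior using gensim's FastText on a couple of toy sentences (far too small for useful vectors; purely illustrative): because vectors are composed from character n-grams, an unseen word still gets a meaningful vector instead of a lookup error.

python
from gensim.models import FastText

# Tiny toy corpus, for illustration only
sentences = [
    ['king', 'queen', 'man', 'woman'],
    ['paris', 'france', 'berlin', 'germany'],
]

model = FastText(sentences, vector_size=50, window=3, min_count=1, epochs=10)

# 'kingdom' never appears in the corpus, but FastText builds a vector
# from its character n-grams ('kin', 'ing', 'ngd', ...)
print('kingdom' in model.wv.key_to_index)      # False -- not in the vocabulary
print(model.wv['kingdom'][:5])                 # ...yet a vector is still returned
print(model.wv.similarity('king', 'kingdom'))  # nonzero, thanks to shared subwords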

Code Example: Word2Vec Basics

python
# Word Embeddings Basics with Gensim
from gensim.models import Word2Vec
import numpy as np

# Sample corpus
sentences = [
    ['king', 'queen', 'man', 'woman'],
    ['paris', 'france', 'berlin', 'germany'],
    ['cat', 'dog', 'animal', 'pet'],
    ['good', 'bad', 'happy', 'sad']
]

# Train Word2Vec model (Skip-gram)
model = Word2Vec(
    sentences,
    vector_size=100,  # Embedding dimension
    window=5,         # Context window
    min_count=1,      # Minimum word frequency
    workers=4,
    sg=1              # Skip-gram (1) or CBOW (0)
)

# Get vector for a word
king_vector = model.wv['king']
print(f"King vector shape: {king_vector.shape}")
print(f"King vector (first 5 dims): {king_vector[:5]}")

# Find similar words
similar_to_king = model.wv.most_similar('king', topn=3)
print(f"Words similar to 'king': {similar_to_king}")

# Word arithmetic: king - man + woman ≈ queen
result = model.wv.most_similar(
    positive=['king', 'woman'],
    negative=['man'],
    topn=1
)
print(f"king - man + woman = {result}")

# Cosine similarity
similarity = model.wv.similarity('king', 'queen')
print(f"Similarity(king, queen) = {similarity:.3f}")

Code Example: Pre-trained GloVe

python
# Using Pre-trained GloVe Embeddings
import numpy as np
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

# Convert GloVe format to Word2Vec format (one-time)
glove_file = 'glove.6B.100d.txt'  # Download from Stanford NLP
word2vec_file = 'glove.6B.100d.word2vec.txt'
glove2word2vec(glove_file, word2vec_file)

# Load pre-trained embeddings
embeddings = KeyedVectors.load_word2vec_format(word2vec_file, binary=False)
print(f"Vocabulary size: {len(embeddings)}")
print(f"Embedding dimension: {embeddings.vector_size}")

# Get embedding for a word
word = 'computer'
vector = embeddings[word]
print(f"{word} embedding: {vector[:5]}...")

# Find similar words
similar_words = embeddings.most_similar('computer', topn=5)
print(f"Words similar to '{word}': {similar_words}")

# Analogies: paris - france + germany ≈ berlin
result = embeddings.most_similar(
    positive=['paris', 'germany'],
    negative=['france'],
    topn=1
)
print(f"paris - france + germany = {result[0][0]}")

# Calculate similarity
sim = embeddings.similarity('car', 'automobile')
print(f"Similarity(car, automobile) = {sim:.3f}")

Contextual Embeddings (Modern Approach)

Unlike static embeddings, contextual embeddings generate different vectors for the same word based on context.

Example:

"I went to the bank to deposit money" vs "I sat on the river bank" - 'bank' gets different vectors

ELMo (2018)

Embeddings from Language Models - bi-directional LSTM

Deep contextualized representations from bidirectional LM

BERT (2019)

Bidirectional Encoder Representations from Transformers

Transformer encoder with masked language modeling

GPT Series

Generative Pre-trained Transformers

Transformer decoder with causal language modeling

Code Example: Contextual Embeddings (BERT)

python
# Contextual Embeddings with BERT
from transformers import BertTokenizer, BertModel
import torch
import numpy as np
from numpy.linalg import norm

# Load pre-trained BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Two sentences with the word "bank" in different contexts
sentence1 = "I went to the bank to deposit money"
sentence2 = "I sat on the river bank"

def get_word_embedding(sentence, word, tokenizer, model):
    # Tokenize
    tokens = tokenizer.tokenize(sentence)
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokens)
    tokens_tensor = torch.tensor([indexed_tokens])

    # Get embeddings
    with torch.no_grad():
        outputs = model(tokens_tensor)

    # Last hidden state: (batch_size, sequence_length, hidden_size)
    hidden_states = outputs.last_hidden_state

    # Find word position
    word_idx = tokens.index(word)
    word_embedding = hidden_states[0, word_idx, :]
    return word_embedding.numpy()

# Get contextual embeddings for "bank" in both sentences
bank_embedding1 = get_word_embedding(sentence1, "bank", tokenizer, model)
bank_embedding2 = get_word_embedding(sentence2, "bank", tokenizer, model)
print(f"Bank embedding 1 shape: {bank_embedding1.shape}")  # (768,)
print(f"Bank embedding 2 shape: {bank_embedding2.shape}")  # (768,)

# Calculate cosine similarity
cosine_sim = np.dot(bank_embedding1, bank_embedding2) / (norm(bank_embedding1) * norm(bank_embedding2))
print(f"Cosine similarity between 'bank' in different contexts: {cosine_sim:.3f}")
# Lower similarity shows BERT captures different meanings!
print(f"Bank (financial) first 5 dims: {bank_embedding1[:5]}")
print(f"Bank (river) first 5 dims: {bank_embedding2[:5]}")

Remarkable Properties of Word Embeddings

Semantic Similarity

Similar words have similar vectors

cosine_similarity(king, queen) > cosine_similarity(king, banana)

Word Arithmetic

Vector operations capture semantic relationships

king - man + woman ≈ queen
paris - france + germany ≈ berlin

Analogies

Can solve analogy tasks through vector math

man : woman :: king : ?
vec(woman) - vec(man) + vec(king) ≈ vec(queen)

Clustering

Related words cluster together in vector space

Countries, cities, verbs, adjectives form distinct clusters
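
A minimal sketch of the clustering property, assuming the gensim downloader model 'glove-wiki-gigaword-50' can be fetched (or is already cached) and scikit-learn is installed: projecting a handful of words to 2-D with PCA typically shows countries, animals, and emotions landing in separate groups.

python
import gensim.downloader as api
from sklearn.decomposition import PCA

# Small pre-trained GloVe vectors via gensim's downloader (assumes internet access or a cached copy)
vectors = api.load('glove-wiki-gigaword-50')

words = ['france', 'germany', 'italy',   # countries
         'cat', 'dog', 'horse',          # animals
         'happy', 'sad', 'angry']        # emotions

# Project the 50-dimensional vectors down to 2-D for inspection
coords = PCA(n_components=2).fit_transform([vectors[w] for w in words])

for word, (x, y) in zip(words, coords):
    print(f"{word:>8}: ({x:+.2f}, {y:+.2f})")
# Words from the same category tend to land near each other in the projection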

Applications

  • Text classification and sentiment analysis
  • Named entity recognition (NER)
  • Machine translation
  • Question answering systems
  • Document similarity and clustering
  • Recommendation systems
  • Information retrieval and search

Key Concepts

Embedding Dimension

Size of the vector (typically 50-300). Higher dimensions capture more nuance but require more data.

Cosine Similarity

Measure of similarity between two vectors. Range [-1, 1], where 1 means identical direction.

Context Window

Number of surrounding words considered when training embeddings (e.g., 5 words before and after).

Negative Sampling

Training technique that samples negative examples (non-context words) for efficiency; a minimal sketch follows this list.

Subword Embeddings

Embeddings for word pieces (FastText, BPE) rather than whole words, handling OOV better.

Static vs Contextual

Static: one vector per word (Word2Vec). Contextual: different vectors based on context (BERT).
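
To make the negative-sampling idea from the list above concrete, here is a minimal NumPy sketch of the skip-gram-with-negative-sampling loss for a single (center, context) pair. It is an illustration under simplified assumptions: real Word2Vec training adds gradient updates and draws negatives from a frequency-based noise distribution rather than uniformly.

python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 50

# Two embedding tables, as in Word2Vec: one for center words, one for context words
center_emb = rng.normal(scale=0.1, size=(vocab_size, dim))
context_emb = rng.normal(scale=0.1, size=(vocab_size, dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(center_id, context_id, num_negatives=5):
    """Skip-gram with negative sampling: loss for one (center, context) pair."""
    v_c = center_emb[center_id]                      # center word vector
    u_o = context_emb[context_id]                    # true context word vector
    neg_ids = rng.integers(0, vocab_size, size=num_negatives)  # random 'noise' words
    u_neg = context_emb[neg_ids]                     # (num_negatives, dim)

    pos_term = -np.log(sigmoid(u_o @ v_c))             # pull the true pair together
    neg_term = -np.sum(np.log(sigmoid(-u_neg @ v_c)))  # push noise pairs apart
    return pos_term + neg_term

print(f"SGNS loss for one training pair: {sgns_loss(center_id=3, context_id=7):.3f}")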

Interview Tips

  • 💡 Explain word embeddings as dense vector representations of words in continuous space
  • 💡 Know the difference between one-hot encoding (sparse, no similarity) and embeddings (dense, semantic similarity)
  • 💡 Understand Word2Vec: CBOW predicts word from context, Skip-gram predicts context from word
  • 💡 Know the famous example: king - man + woman ≈ queen
  • 💡 Explain GloVe as a matrix factorization approach using co-occurrence statistics
  • 💡 Understand that FastText handles OOV words using character n-grams
  • 💡 Know the shift to contextual embeddings (ELMo, BERT) that vary by context
  • 💡 Explain why embeddings work: distributional hypothesis - words in similar contexts have similar meanings
  • 💡 Be familiar with cosine similarity for measuring vector similarity
  • 💡 Know typical embedding dimensions: 50-300 for static, 768-1024 for contextual
  • 💡 Understand pre-trained embeddings vs training your own
  • 💡 Know applications: classification, NER, translation, search
  • 💡 Explain that embeddings can capture semantic and syntactic relationships
  • 💡 Discuss limitations: bias in embeddings, polysemy in static embeddings