Chapter 18: Embeddings

How Machines Represent Meaning

The Problem with Discrete Symbols

Words, categories, user IDs—discrete symbols pervade machine learning applications. But neural networks operate on continuous vectors, not discrete tokens. The mapping from symbols to vectors determines what the model can learn.

The naive approach is to assign each symbol an integer ID: “dog” = 1, “cat” = 2, “car” = 3. But these numbers are arbitrary. The fact that “cat” (ID 2) is numerically between “dog” (ID 1) and “car” (ID 3) doesn’t reflect any semantic relationship. Nothing in this encoding makes “dog” any closer to “cat” than to “xylophone,” even though dogs and cats are far more similar than dogs and xylophones.

Machine learning models need representations where similarity in meaning corresponds to similarity in representation. A model that can’t tell that “dog” and “puppy” are related, or that “king” and “queen” share properties, must learn these relationships from scratch—wasting data and parameters.

Embeddings solve this by mapping discrete symbols (words, items, users) to continuous vectors in a learned space where semantic similarity corresponds to geometric proximity. Words with similar meanings are embedded near each other; unrelated words are far apart. This representation enables models to generalize: knowledge about “dog” partially transfers to “puppy” because their embeddings are similar.

Embeddings are the foundation of modern NLP. They’re also critical for recommendation systems, search engines, and any system that needs to represent discrete objects in a continuous space.

The Problem with Discrete Symbols

Words are discrete symbols. “dog,” “cat,” and “automobile” are arbitrary labels with no inherent relationship. Assigning them integer IDs (dog=1, cat=2, automobile=3) doesn’t help—the numbers are meaningless. The “distance” from dog to cat (|2-1| = 1) happens to be smaller than from dog to automobile (|3-1| = 2), but that ordering is an accident of labeling: semantic similarity played no role in assigning the IDs.

One-hot encoding—representing each word as a binary vector with a single 1 and all other positions 0—doesn’t help either. “Dog” = [1, 0, 0, …], “cat” = [0, 1, 0, …], “car” = [0, 0, 1, …]. These vectors are all equidistant: the Euclidean distance between any pair of distinct one-hot vectors is the same. There’s no notion that “dog” and “cat” are more similar than “dog” and “quantum.”
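The equidistance claim is easy to verify numerically. A minimal sketch with a toy three-word vocabulary (the words and their order are arbitrary):

```python
import numpy as np

# Toy vocabulary: every word becomes a one-hot vector.
vocab = ["dog", "cat", "quantum"]
one_hot = np.eye(len(vocab))  # row i is the one-hot vector for vocab[i]

# Euclidean distance between any two distinct one-hot vectors is sqrt(2):
# the encoding carries no information about semantic similarity.
d_dog_cat = np.linalg.norm(one_hot[0] - one_hot[1])
d_dog_quantum = np.linalg.norm(one_hot[0] - one_hot[2])
print(d_dog_cat, d_dog_quantum)  # both equal sqrt(2) ≈ 1.414
```

Whatever pair of words you pick, the distance is the same; no geometry of meaning exists to exploit.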

Machine learning models need continuous representations where similarity in meaning corresponds to proximity in space. This is what embeddings provide.

Vector Spaces: Meaning as Geometry

An embedding maps each discrete symbol (word, token, user, item) to a point in a continuous vector space—typically 100 to 1024 dimensions. Words with similar meanings are embedded near each other; words with different meanings are far apart.

In this geometric space:

  • “dog” and “cat” are close (both animals, common pets)
  • “king” and “queen” are close (both royalty)
  • “walked” and “running” are close (both locomotion verbs)
  • “dog” and “quantum” are far apart (unrelated concepts)

Distance metrics (cosine similarity, Euclidean distance) measure semantic similarity. Close vectors represent similar meanings; distant vectors represent different meanings.

The famous example from Word2Vec:

\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}

This works because the embedding space learns consistent semantic directions. Subtracting “man” and adding “woman” shifts along the “gender” direction in the space. The resulting vector is closest to “queen”—the female equivalent of “king.”

This algebraic manipulation of meaning is possible only because embeddings encode semantics geometrically. Meaning becomes position in vector space, and relationships become directions.

[Diagram: a 2D projection of embedding space. Similar words cluster together (animals, locomotion verbs, royalty); distance in this space encodes semantic similarity, a geometric encoding of meaning.]

Similarity as Distance

In embedding space, semantic similarity corresponds to geometric proximity. Words with similar meanings are close together; words with different meanings are far apart. Distance metrics—cosine similarity, Euclidean distance—measure semantic similarity.

Cosine similarity is the most common metric:

\text{similarity}(\mathbf{v}_1, \mathbf{v}_2) = \frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{\|\mathbf{v}_1\| \, \|\mathbf{v}_2\|}

This measures the angle between two vectors, ranging from -1 (opposite) to +1 (identical direction). Words with similar meanings have high cosine similarity (angles near 0°). Words with opposite meanings have negative similarity (angles near 180°). Unrelated words have similarity near 0 (orthogonal).
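The metric is a one-liner in practice. A sketch with hypothetical 3-dimensional “embeddings” (the values are illustrative, not learned):

```python
import numpy as np

def cosine_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    """Cosine of the angle between two vectors:
    +1 same direction, 0 orthogonal, -1 opposite."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# Toy vectors chosen so "dog" and "cat" point the same way
# while "quantum" is nearly orthogonal to both.
dog = np.array([0.9, 0.8, 0.1])
cat = np.array([0.8, 0.9, 0.2])
quantum = np.array([-0.1, 0.2, 0.95])

print(cosine_similarity(dog, cat))      # high: similar direction
print(cosine_similarity(dog, quantum))  # near 0: nearly orthogonal
```

Because cosine similarity ignores vector length, it is often preferred over Euclidean distance when embedding norms vary across words.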

[Diagram: embeddings projected from high-dimensional space to 2D. Similar words cluster together: animals, vehicles, royalty. Distance in embedding space corresponds to semantic similarity.]

Example: King - Man + Woman ≈ Queen

The famous Word2Vec example demonstrates vector arithmetic in embedding space:

\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}

This works because embeddings capture semantic relationships as geometric directions. The “gender” direction is learned consistently: \vec{\text{king}} - \vec{\text{man}} strips the “male” component from “king,” leaving a royalty offset. Adding \vec{\text{woman}} applies that offset to the female side, arriving near \vec{\text{queen}}.

Similar patterns emerge for other relationships:

  • \vec{\text{Paris}} - \vec{\text{France}} + \vec{\text{Italy}} \approx \vec{\text{Rome}} (capital cities)
  • \vec{\text{walking}} - \vec{\text{walked}} + \vec{\text{swam}} \approx \vec{\text{swimming}} (verb tenses)
  • \vec{\text{bigger}} - \vec{\text{big}} + \vec{\text{small}} \approx \vec{\text{smaller}} (comparatives)

These analogies aren’t perfect—they work best on clean, high-frequency words—but they demonstrate that embeddings encode structured relationships, not just isolated meanings.
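Analogy solving reduces to a nearest-neighbor query over the vocabulary. A minimal sketch with hypothetical toy embeddings constructed so the gender offset is consistent; real analogies use learned Word2Vec or GloVe vectors, and by convention the three input words are excluded from the candidates:

```python
import numpy as np

# Hypothetical toy embeddings: the third coordinate acts as a "gender" axis.
emb = {
    "king":  np.array([0.8, 0.9, 0.1]),
    "queen": np.array([0.8, 0.9, 0.9]),
    "man":   np.array([0.1, 0.2, 0.1]),
    "woman": np.array([0.1, 0.2, 0.9]),
    "car":   np.array([0.5, -0.3, 0.2]),
}

def nearest(query: np.ndarray, exclude: set) -> str:
    """Vocabulary word whose embedding has highest cosine similarity to `query`."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    candidates = {w: v for w, v in emb.items() if w not in exclude}
    return max(candidates, key=lambda w: cos(candidates[w], query))

target = emb["king"] - emb["man"] + emb["woman"]
answer = nearest(target, exclude={"king", "man", "woman"})
print(answer)  # "queen"
```

With real embeddings the arithmetic is only approximate, which is why the nearest-neighbor step (rather than exact equality) is essential.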

Compositionality: Combining Meanings

If words are vectors, sentences can be vectors too. The simplest composition is averaging: sum the word embeddings and divide by the word count. For “The cat sat,” compute:

\vec{\text{sentence}} = \frac{\vec{\text{the}} + \vec{\text{cat}} + \vec{\text{sat}}}{3}

Surprisingly, this naive approach works reasonably well for tasks like sentiment classification or semantic similarity. The average captures the general topic and tone, even though it ignores word order and syntax.
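The averaging step is a few lines. A sketch using hypothetical static word embeddings (the lookup table values are illustrative only):

```python
import numpy as np

# Hypothetical static word-embedding table (values illustrative, not learned).
emb = {
    "the": np.array([0.1, 0.0, 0.2]),
    "cat": np.array([0.8, 0.9, 0.2]),
    "sat": np.array([0.3, 0.1, 0.7]),
}

def sentence_embedding(tokens: list) -> np.ndarray:
    """Average the word vectors: a simple bag-of-words sentence representation.
    Word order is discarded, so "dog bites man" == "man bites dog"."""
    vectors = [emb[t] for t in tokens]
    return np.mean(vectors, axis=0)

s = sentence_embedding(["the", "cat", "sat"])
print(s)  # element-wise mean of the three word vectors
```

The order-invariance noted in the comment is exactly the weakness that RNN- and Transformer-based composition addresses.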

More sophisticated composition uses weighted averaging (weight words by importance), RNNs (Chapter 17), or Transformers (Chapter 20) to produce context-dependent representations. But the principle remains: meaning composes through vector operations.

Embeddings also enable cross-lingual representations. Multilingual embeddings map words from different languages into a shared space where translations are close: \vec{\text{dog}}_{\text{English}} \approx \vec{\text{chien}}_{\text{French}}. This enables zero-shot translation and cross-lingual transfer learning.

Subword Tokenization: Handling Unknown Words

Word-level embeddings face a fundamental problem: what happens when the model encounters a word that wasn’t in the training vocabulary? If your vocabulary contains “play” but not “playing,” the model treats “playing” as an unknown token—represented by a generic [UNK] embedding that loses all meaning.

This is especially problematic for:

  • Rare words: “antidisestablishmentarianism” might appear once in a billion words
  • Typos: “teh” instead of “the”
  • New terms: “COVID-19” didn’t exist before 2020
  • Morphology: Languages like German create compound words by joining roots

Byte Pair Encoding (BPE) solves this by breaking words into subword units. The algorithm:

  1. Start with characters as the atomic units
  2. Iteratively merge the most frequent character pairs
  3. Stop after a fixed number of merges (typically 30k-50k vocabulary size)

Example: Training on text containing “playing,” “played,” “plays”:

  • Initial: [“p”, “l”, “a”, “y”, “i”, “n”, “g”, “p”, “l”, “a”, “y”, “e”, “d”, “p”, “l”, “a”, “y”, “s”]
  • Merge frequent pairs: “pl” appears often → [“pl”, “a”, “y”, …]
  • Continue merging: “pla”, “play”, …
  • Final vocabulary includes: [“play”, “ing”, “ed”, “s”]
  • “playing” → [“play”, “ing”], “played” → [“play”, “ed”]

Each subword gets its own embedding. The word embedding is composed from subword embeddings (often by summing or averaging). This means “playing” and “played” share the “play” subword embedding, capturing their semantic relationship automatically.
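The merge loop above can be sketched directly. This is a minimal BPE trainer over the three-word example; the word frequencies are hypothetical, and production tokenizers add byte-level fallback, special tokens, and far larger corpora:

```python
from collections import Counter

def get_pair_counts(corpus: dict) -> Counter:
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus: dict, pair: tuple) -> dict:
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for word, freq in corpus.items():
        symbols, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                symbols.append(word[i] + word[i + 1])
                i += 2
            else:
                symbols.append(word[i])
                i += 1
        merged[tuple(symbols)] = freq
    return merged

# Words start as character tuples; the frequencies are hypothetical counts.
corpus = {tuple("playing"): 4, tuple("played"): 3, tuple("plays"): 2}
for _ in range(3):  # three merge steps: "pl" -> "pla" -> "play"
    best = get_pair_counts(corpus).most_common(1)[0][0]
    corpus = merge_pair(corpus, best)
print(corpus)  # all three words now share the "play" subword
```

After three merges every word begins with the shared subword “play,” which is exactly how related surface forms end up sharing an embedding.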

WordPiece (used by BERT) is similar to BPE but chooses merges that maximize likelihood on the training corpus rather than raw frequency. SentencePiece extends this to be language-agnostic—it operates directly on raw text without requiring pre-tokenization, making it work for languages without clear word boundaries (Chinese, Japanese, Thai).

Benefits:

  • No unknown tokens: Every word can be decomposed into known subwords (worst case: individual characters)
  • Smaller vocabulary: 30k subwords vs 100k+ words
  • Morphology handling: “un-”, “pre-”, “-ing”, “-ed” are learned as subwords, capturing grammatical structure
  • Rare word handling: Rare words decompose into common subwords with meaningful embeddings

Tradeoff: Sequences become longer. A sentence with 20 words might become 25-30 subword tokens. This increases computational cost (longer sequences mean more attention operations), but the tradeoff is widely accepted as worthwhile.

Production: All modern LLMs (GPT, BERT, LLaMA, Claude) use subword tokenization. It’s not optional—it’s the standard. Training your own tokenizer on domain-specific data can improve performance on specialized terminology.

Training Embeddings: Negative Sampling and Contrastive Learning

How are embeddings actually learned? The core principle: words that appear in similar contexts should have similar embeddings.

Word2Vec introduced two training objectives:

  • Skip-gram: Given a word, predict surrounding words (context from target)
  • CBOW (Continuous Bag of Words): Given context words, predict the target word

For skip-gram, the model maximizes the probability of seeing context word c given target word t:

p(c \mid t) = \frac{e^{\vec{c} \cdot \vec{t}}}{\sum_{c' \in \text{Vocab}} e^{\vec{c'} \cdot \vec{t}}}

The numerator is the dot product between context and target embeddings—high if they’re similar. The denominator normalizes over the entire vocabulary to get a probability distribution.

The problem: computing the denominator requires summing over 100k+ words for every training example. This is computationally infeasible.

Negative Sampling solves this by approximating the full softmax with a simpler objective. Instead of normalizing over all words, sample a few negative examples—words that don’t appear in this context—and train the model to distinguish positive pairs (words that actually co-occur) from negative pairs (random words):

\log \sigma(\vec{c} \cdot \vec{t}) + \sum_{i=1}^{k} \mathbb{E}_{c_i \sim P_{\text{negative}}} \left[ \log \sigma(-\vec{c_i} \cdot \vec{t}) \right]

where \sigma is the sigmoid function, k is the number of negative samples (typically 5-20), and P_{\text{negative}} is the distribution for sampling negative words (usually proportional to word frequency raised to the power 0.75).

This means: maximize similarity for positive pairs (cat appears near meow → pull embeddings together) and minimize similarity for negative pairs (cat + quantum → push apart). Remarkably, this simple objective—with only 5-20 negative samples instead of 100k normalizations—learns nearly identical embeddings.
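The objective above can be sketched for a single training pair. This is an illustrative NumPy computation of the loss only (no gradient step); the table sizes, the uniform negative sampler, and the initialization scale are assumptions for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, vocab_size, k = 50, 1000, 5

# Two embedding tables, as in Word2Vec: one for targets, one for contexts.
target_emb = rng.normal(scale=0.1, size=(vocab_size, dim))
context_emb = rng.normal(scale=0.1, size=(vocab_size, dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(t: int, c_pos: int, c_negs: np.ndarray) -> float:
    """Negative of the skip-gram negative-sampling objective for one pair.
    Minimizing it pulls (t, c_pos) together and pushes the negatives apart."""
    pos = np.log(sigmoid(target_emb[t] @ context_emb[c_pos]))
    neg = np.sum(np.log(sigmoid(-(context_emb[c_negs] @ target_emb[t]))))
    return float(-(pos + neg))

# One observed (target, context) pair plus k random negatives.
# Real Word2Vec samples negatives proportionally to frequency^0.75.
t, c_pos = 3, 17
c_negs = rng.integers(0, vocab_size, size=k)
loss = sgns_loss(t, c_pos, c_negs)
print(loss)  # a positive scalar; SGD on this updates only k+2 embedding rows
```

The key efficiency point is visible in the shapes: each update touches k+2 rows of the tables instead of the full vocabulary-sized softmax.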

Contrastive Learning generalizes this principle beyond words. Modern methods like SimCLR (for images) and CLIP (for image-text pairs) use the same idea:

  • Positive pairs: augmented versions of the same image, or matching image-text pairs
  • Negative pairs: different images, or mismatched image-text pairs
  • Objective: maximize similarity for positives, minimize for negatives

A related discriminative principle underlies BERT (masked language modeling scores the correct masked token against the rest of the vocabulary) and GPT (next-token prediction scores the correct next token against all alternatives), though these models use a full softmax rather than sampled negatives.

Production tip: Negative sampling makes embedding training 100-1000× faster than full softmax. It’s the reason Word2Vec could train on billions of words on consumer hardware. Modern scaling wouldn’t be possible without this approximation.

Production: Vector Databases and Semantic Search

Embeddings are useless unless you can search them efficiently. Storing millions of embeddings and finding the closest match to a query embedding is the foundation of semantic search, recommendation, and retrieval-augmented generation (RAG).

Vector databases store embeddings and support fast approximate nearest-neighbor (ANN) search:

  • FAISS (Facebook AI Similarity Search): Open-source library optimized for billion-scale search
  • Pinecone, Weaviate, Milvus: Managed vector database services
  • Qdrant, Chroma: Lightweight open-source alternatives

Indexing strategies trade accuracy for speed:

  • Flat (brute-force): Compute distance to every embedding. Exact but O(n), infeasible for large databases.
  • IVF (Inverted File Index): Cluster embeddings, search only nearest clusters. 10-100× speedup, ~1-5% accuracy loss.
  • HNSW (Hierarchical Navigable Small World): Graph-based index with logarithmic search. Very fast, high accuracy, but larger memory footprint.
  • Product Quantization: Compress embeddings (768 dims → 64 bytes) for memory efficiency.

Scaling: With proper indexing (HNSW + product quantization), billion-scale search completes in < 10ms on commodity hardware.

Production architecture:

  1. Indexing: Precompute embeddings for all documents, build HNSW index (minutes to hours offline)
  2. Query: Embed user query (milliseconds), search index (milliseconds), return top-k results
  3. Reranking: Optional second-stage reranker (cross-encoder) improves quality for top 10-100 results
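The query stage above can be sketched with exact (flat) search in NumPy. This is the O(n) brute-force baseline that IVF and HNSW indexes approximate; the random document vectors and corpus size here are placeholders:

```python
import numpy as np

rng = np.random.default_rng(42)
n_docs, dim, top_k = 10_000, 128, 5

# Precomputed document embeddings, L2-normalized so that
# a dot product equals cosine similarity.
docs = rng.normal(size=(n_docs, dim))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

def search(query: np.ndarray, k: int) -> np.ndarray:
    """Exact nearest-neighbor search: score every document, return top-k indices."""
    q = query / np.linalg.norm(query)
    scores = docs @ q                      # cosine similarity to every document
    top = np.argpartition(-scores, k)[:k]  # k best, in arbitrary order
    return top[np.argsort(-scores[top])]   # sorted best-first

query = rng.normal(size=dim)
hits = search(query, top_k)
print(hits)  # indices of the 5 most similar documents
```

Normalizing once at index time is the standard trick: it turns every query into a single matrix-vector product, and it is the same preprocessing most ANN libraries expect for cosine metrics.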

Real examples:

  • Google Search: Embeddings power semantic matching (finding relevant pages even without keyword matches)
  • Recommendation systems: “Users who liked X also liked Y” via embedding similarity
  • RAG for LLMs: Retrieve relevant documents from vector database before generating answer

Tradeoff: Index build time (hours for billions of documents) vs search time (milliseconds). Once built, indexes support millions of queries per second. Most systems rebuild indexes incrementally (add new documents without full rebuild).

Embedding Dimensionality: Size vs Quality

Embedding dimensionality is a critical engineering decision. How many dimensions should your embeddings have?

Common dimensions:

  • 50-100: Tiny models, mobile deployment, fast but limited expressiveness
  • 300: Word2Vec/GloVe standard, good general-purpose choice
  • 768: BERT standard, modern default for many tasks
  • 1024-1536: Large embedding models (e.g., OpenAI’s text-embedding-ada-002 at 1536 dimensions), high quality but expensive
  • 2048+: Specialized applications, diminishing returns

Tradeoff: Increasing dimensions from 300 to 768 typically gives 5-10% accuracy improvement on downstream tasks, but:

  • Memory: 2.5× more (float32: 300 dims = 1.2KB, 768 dims = 3KB per embedding)
  • Compute: 2.5× more for dot products, distance calculations
  • Search speed: Nearest-neighbor search slows with higher dimensions (curse of dimensionality)
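The memory side of this tradeoff is easy to sanity-check. A back-of-envelope sketch, assuming float32 storage (4 bytes per dimension) and ignoring index overhead:

```python
# Rough memory footprint of a float32 embedding table.
def table_size_gb(n_items: int, dim: int, bytes_per_val: int = 4) -> float:
    """Size in GB of an n_items x dim embedding matrix."""
    return n_items * dim * bytes_per_val / 1e9

# 100 million embeddings at common dimensionalities:
for dim in (300, 768, 1536):
    print(dim, round(table_size_gb(100_000_000, dim), 1), "GB")
# 300 dims ≈ 120 GB, 768 ≈ 307.2 GB, 1536 ≈ 614.4 GB
```

At this scale, product quantization (compressing each vector to tens of bytes) is what makes billion-item indexes fit in memory at all.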

Rule of thumb:

  • Start with 300-384 for word-level tasks with moderate data
  • Use 768 for sentence-level tasks and when using pretrained models (BERT, etc.)
  • Use 1024-1536 when quality matters more than cost (high-value queries, critical applications)
  • Use 128-256 for deployment-constrained environments (mobile, edge, real-time systems)

When to increase:

  • More training data (10M+ examples can support higher dimensions)
  • Complex downstream tasks (question answering, reasoning vs sentiment classification)
  • High-resource deployment (GPU serving, unlimited memory)

When to decrease:

  • Limited data (< 1M examples risk overfitting with 768 dims)
  • Memory constraints (mobile apps, edge devices)
  • Fast retrieval required (billion-scale search is faster with 256 dims than 1024)

Production monitoring: Track embedding statistics during training:

  • Norm: Should be roughly constant across words (norm varies too much → unstable training)
  • Variance: High variance across dimensions → embeddings use full space; low variance → collapse
  • Saturation: If all embeddings cluster in small region, model isn’t learning diversity

Embedding collapse is a failure mode where all embeddings converge to similar values, making them useless. Monitor inter-embedding distances—if they all become nearly identical, something is wrong (bad initialization, learning rate too high, contrastive temperature too low).
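The monitoring above can be sketched as a small health check. This is an illustrative diagnostic, not a standard API; the sampling size and the synthetic “healthy” and “collapsed” tables are assumptions for demonstration:

```python
import numpy as np

def embedding_health(emb: np.ndarray, sample: int = 1000) -> dict:
    """Statistics for detecting embedding collapse: norm spread, per-dimension
    variance, and mean pairwise cosine similarity (near 1.0 means collapse)."""
    rng = np.random.default_rng(0)
    idx = rng.choice(len(emb), size=min(sample, len(emb)), replace=False)
    x = emb[idx]
    norms = np.linalg.norm(x, axis=1)
    unit = x / norms[:, None]
    cos = unit @ unit.T
    n = len(x)
    mean_cos = (cos.sum() - n) / (n * (n - 1))  # average over off-diagonal pairs
    return {
        "norm_mean": float(norms.mean()),
        "norm_std": float(norms.std()),
        "dim_variance": float(x.var(axis=0).mean()),
        "mean_pairwise_cos": float(mean_cos),
    }

# Healthy random embeddings: mean pairwise cosine near 0.
healthy = np.random.default_rng(1).normal(size=(5000, 128))
print(embedding_health(healthy))

# Collapsed embeddings: every vector nearly identical, mean cosine near 1.
collapsed = np.ones((5000, 128)) + 0.01 * np.random.default_rng(2).normal(size=(5000, 128))
print(embedding_health(collapsed))
```

Logging these four numbers once per epoch is usually enough to catch collapse early, long before downstream metrics degrade.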

Engineering Takeaway

Embeddings are the foundation of modern NLP and the bridge between discrete symbols and neural networks. Understanding how embeddings are trained, configured, and deployed is essential for building production systems.

Subword tokenization is the standard, not an option. All modern LLMs use BPE or WordPiece to handle rare words, typos, and morphology. If you’re building a custom NLP system, don’t use word-level embeddings—start with subword tokenization. Train your own tokenizer on domain-specific data to capture specialized terminology. The tradeoff (longer sequences) is universally accepted as worthwhile because it eliminates unknown tokens and captures morphological structure.

Pretrained embeddings bootstrap performance. Word2Vec (2013), GloVe (2014), and FastText (2016) provide general-purpose embeddings trained on billions of words. Using these as input features improves performance on downstream tasks, especially with limited labeled data. Pretrained embeddings capture semantic and syntactic relationships that would take massive data to learn from scratch. For modern systems, use contextual embeddings (BERT, GPT) instead of static embeddings—they capture polysemy and context-dependent meaning.

Contextualized embeddings solved the polysemy problem. Static embeddings assign one vector per word, so “bank” (financial) and “bank” (river) get identical representations. Contextualized embeddings (ELMo, BERT, GPT) generate different vectors depending on surrounding words, disambiguating meaning. Modern production systems exclusively use contextualized embeddings. This is why BERT and GPT representations outperform Word2Vec on every task—they adapt to context.

Vector databases enable semantic search at scale. Storing embeddings in FAISS, Pinecone, Weaviate, or Milvus enables fast approximate nearest-neighbor search over billions of vectors. Proper indexing (HNSW + quantization) achieves < 10ms search latency. This powers semantic search (find documents by meaning, not keywords), recommendation (find similar items/users), and RAG for LLMs (retrieve relevant context before generation). Vector search is now a standard architecture for modern retrieval, typically combined with or replacing pure keyword matching.

Embedding dimensionality is a critical engineering decision. Common choices: 300 (Word2Vec standard), 768 (BERT standard), 1536 (OpenAI embedding-model standard). Increasing dimensions improves quality (~5-10% accuracy gain from 300 to 768) but increases memory (2.5×), compute (2.5×), and search latency. Rule of thumb: 768 for general tasks, 1024-1536 when quality matters, 256-384 for deployment-constrained environments. Monitor embedding norms and variance during training to detect collapse (all embeddings converging to same values).

Fine-tuning adapts embeddings to your domain. Pretrained embeddings capture general language patterns. Fine-tuning on domain-specific data (medical records, legal documents, technical manuals) adapts them to specialized terminology and relationships. This is standard practice in production—don’t use off-the-shelf embeddings for specialized domains. Fine-tune on 10k-100k domain examples to capture domain-specific semantics. The accuracy gain is often 10-20% compared to general-purpose embeddings.

Embedding visualization reveals what models learn. Project high-dimensional embeddings to 2D/3D using t-SNE or UMAP to visualize clustering. If semantically similar words cluster together, embeddings are good. If they’re randomly scattered, something is wrong (bad training, poor data, insufficient capacity). Use visualization during development to debug: check that synonyms cluster, antonyms separate, and analogies work (king - man + woman ≈ queen). In production, monitor embedding distributions to detect drift—if embedding statistics change, retrain.

The lesson: Embeddings are learned continuous representations of discrete symbols. They enable neural networks to process words, users, items, and other discrete objects. Modern embeddings use subwords (not words), context (not static vectors), and vector databases (not keyword search). Understanding embeddings—subword tokenization, negative sampling, dimensionality tradeoffs, vector search—is essential for building production NLP and recommendation systems.


References and Further Reading

Efficient Estimation of Word Representations in Vector Space – Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean (2013) https://arxiv.org/abs/1301.3781

This is the Word2Vec paper that popularized embeddings. Mikolov et al. showed that simple neural language models trained on large corpora learn embeddings that capture semantic and syntactic relationships through vector arithmetic (king - man + woman ≈ queen). The paper introduced skip-gram and CBOW architectures and demonstrated that embeddings transfer across tasks. Reading this explains where modern embeddings came from and why they work.

GloVe: Global Vectors for Word Representation – Jeffrey Pennington, Richard Socher, Christopher Manning (2014) https://nlp.stanford.edu/pubs/glove.pdf

GloVe showed that embeddings can be learned from word co-occurrence statistics rather than neural language models. Pennington et al. demonstrated that the relationship between word vectors and co-occurrence probabilities follows a specific mathematical structure, making embedding learning more interpretable. GloVe embeddings often perform similarly to Word2Vec but train faster and have a clearer statistical foundation.

Deep Contextualized Word Representations – Matthew Peters, Mark Neumann, Mohit Iyyer, et al. (2018) https://arxiv.org/abs/1802.05365

ELMo introduced contextualized embeddings—representations that change based on surrounding context. Peters et al. showed that deep bidirectional language models capture context-dependent meaning better than static embeddings, dramatically improving performance on many NLP tasks. This paper bridged the gap between Word2Vec-style static embeddings and modern Transformer-based contextual representations (BERT, GPT). Understanding ELMo explains why context matters and sets up Transformer language models in Part V.