Chapter 20: Transformers
The Universal Architecture
What a Transformer Is
A Transformer is a neural network architecture built entirely from attention mechanisms (Chapter 19) and feedforward layers. Unlike RNNs, which process sequences recurrently, or CNNs, which process spatial data through convolution, Transformers process all positions in parallel using self-attention to capture dependencies.
The original Transformer (Vaswani et al., 2017) was designed for machine translation and used an encoder-decoder structure:
- Encoder: Processes the input sequence (source language) through stacked self-attention and feedforward layers
- Decoder: Generates the output sequence (target language) through stacked self-attention, cross-attention to the encoder, and feedforward layers
Modern language models (GPT, BERT, LLaMA) use variations:
- Encoder-only (BERT): Stacked self-attention layers that process bidirectional context, used for understanding tasks (classification, question answering)
- Decoder-only (GPT): Stacked causal self-attention layers that generate sequences autoregressively, used for generation tasks (text completion, chatbots)
At its core, a Transformer block consists of:
- Multi-head self-attention: Each position attends to all positions (or all previous positions for causal models)
- Feedforward network: A two-layer MLP applied independently to each position
- Residual connections: Skip connections around each sublayer
- Layer normalization: Normalization after each sublayer
These blocks are stacked (typically 6-96 layers), creating deep networks that learn hierarchical representations.
The Transformer Formula
For each layer l, given input x_l:

h = LayerNorm(x_l + MultiHeadAttention(x_l))
x_{l+1} = LayerNorm(h + FFN(h))

Where:
- MultiHeadAttention applies self-attention with multiple heads
- FFN is a position-wise feedforward network: FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
- Residual connections (the x + Sublayer(x) terms) help gradients flow through deep networks
- LayerNorm stabilizes training
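To make the block concrete, here is a minimal single-head NumPy sketch of the post-norm block above (multi-head attention, biases, and dropout are omitted; all dimensions and weight scales are illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's feature vector to zero mean, unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention (multi-head omitted for brevity)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

def transformer_block(x, params):
    """One post-norm block: LayerNorm(x + Attn(x)), then LayerNorm(h + FFN(h))."""
    h = layer_norm(x + self_attention(x, params["Wq"], params["Wk"], params["Wv"]))
    ffn = np.maximum(0, h @ params["W1"]) @ params["W2"]  # two-layer MLP with ReLU
    return layer_norm(h + ffn)

rng = np.random.default_rng(0)
d, seq_len = 16, 5
params = {k: rng.normal(scale=0.1, size=(d, d)) for k in ["Wq", "Wk", "Wv"]}
params["W1"] = rng.normal(scale=0.1, size=(d, 4 * d))  # FFN typically expands 4x
params["W2"] = rng.normal(scale=0.1, size=(4 * d, d))
out = transformer_block(rng.normal(size=(seq_len, d)), params)
print(out.shape)  # (5, 16): same shape in and out, so blocks can be stacked
```

Because each block maps (seq_len, d) to (seq_len, d), stacking them is just repeated application of `transformer_block`.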
The diagram shows a single Transformer block: self-attention captures dependencies between positions, FFN transforms each position independently, and residual connections with layer normalization stabilize deep networks. Stacking many blocks creates the full Transformer.
Encoder-Decoder vs Encoder-Only vs Decoder-Only
The original Transformer used an encoder-decoder architecture for translation, but modern models use different variants depending on the task. Understanding when to use which is essential.
Encoder-Only (BERT, RoBERTa)
Processes input with bidirectional self-attention—each token can attend to all tokens (past and future). The encoder outputs contextual representations useful for understanding tasks.
Architecture: Input tokens → Positional encoding → N encoder layers → Representations
Use cases:
- Classification: Sentiment analysis, spam detection (use [CLS] token representation)
- Named entity recognition: Tag each token with entity type
- Question answering: Find answer span in context
- Embeddings: Generate high-quality contextualized embeddings
Advantage: Bidirectional context means better representations for understanding
Disadvantage: Cannot generate text (no autoregressive structure)
Decoder-Only (GPT, LLaMA, Claude)
Processes input with causal self-attention—each token can only attend to previous tokens. The decoder generates text autoregressively, predicting one token at a time.
Architecture: Input tokens → Positional encoding → N decoder layers with causal masking → Output logits
Use cases:
- Text generation: Completion, creative writing, code generation
- Language modeling: Next token prediction
- Chat: Conversational AI (prompt + history → response)
- Few-shot learning: In-context learning (examples in prompt)
Advantage: Can both understand and generate, simpler architecture, scales better
Disadvantage: Cannot see future tokens (less context per token during training)
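The causal attention that distinguishes decoder-only models can be sketched in a few lines of NumPy (dimensions are illustrative):

```python
import numpy as np

def causal_attention_weights(Q, K):
    """Attention weights with a causal mask: position i may only attend to j <= i."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above diagonal
    scores[mask] = -np.inf  # future positions get zero weight after softmax
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
w = causal_attention_weights(Q, K)
print(np.allclose(np.triu(w, k=1), 0))  # True: no weight on future tokens
```

Removing the mask (letting every position see every other) turns the same computation into the bidirectional attention used by encoder-only models.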
Encoder-Decoder (T5, BART, Original Transformer)
Combines both: encoder processes input bidirectionally, decoder generates output autoregressively with cross-attention to encoder outputs.
Architecture: Input → Encoder → Encoder outputs ← Decoder (with cross-attention) → Output
Use cases:
- Translation: Source language → target language
- Summarization: Long document → short summary
- Question answering: Question + context → answer
- Any sequence-to-sequence task
Advantage: Best for tasks with distinct input/output (translation, summarization)
Disadvantage: More complex, two separate stacks
Modern Trend: Decoder-Only Dominates
Despite encoder-only and encoder-decoder having their uses, decoder-only models (GPT-3, GPT-4, LLaMA, Claude) dominate modern AI for several reasons:
- Unified architecture: One model handles both understanding and generation
- Simpler training: No encoder-decoder synchronization, just causal prediction
- Better scaling: Decoder-only scales more predictably to billions of parameters
- In-context learning: Can adapt to new tasks via prompting without fine-tuning
- Deployment simplicity: One model for all tasks vs specialized models
Approximately 90% of modern LLMs are decoder-only. BERT-style encoder-only models are still used for specialized understanding tasks (retrieval, classification), but GPT-style decoder-only is the standard.
Engineering decision: Use decoder-only (GPT-style) unless you have specific requirements for bidirectional context without generation (embeddings, classification) or distinct input-output structure (translation with encoder-decoder).
Layer Normalization: Pre-Norm vs Post-Norm
The original Transformer paper placed layer normalization after each sublayer (post-norm):

x_{l+1} = LayerNorm(x_l + Sublayer(x_l))

Modern Transformers place normalization before each sublayer (pre-norm):

x_{l+1} = x_l + Sublayer(LayerNorm(x_l))
This seemingly small change has significant effects on training.
Post-Norm (Original Transformer):
- Normalizes the sum of input and attention output
- Requires careful learning rate warmup (gradual increase from very small)
- Without warmup, training often diverges
- Gradients can explode in early training
Pre-Norm (Modern Standard):
- Normalizes before attention, residual adds raw input
- Trains stably without warmup (can start with full learning rate)
- Enables training much deeper networks (100+ layers)
- More forgiving to hyperparameter choices
Why pre-norm works better:
Pre-norm creates a shorter gradient path. In post-norm, gradients flow through normalization layers, which can amplify or dampen them unpredictably. In pre-norm, the residual connection provides a direct path for gradients to flow unchanged through the network, similar to ResNet’s skip connections.
Empirical results:
- GPT-2: Post-norm with warmup
- GPT-3: Pre-norm, no warmup required
- BERT: Post-norm with warmup
- LLaMA, GPT-4, Claude: Pre-norm (standard for all modern large models)
Connection to ResNet: This mirrors the pre-activation vs post-activation debate in ResNets. Pre-activation (BatchNorm → ReLU → Conv) trains better than post-activation (Conv → BatchNorm → ReLU). The principle is the same: normalization before the main operation stabilizes training.
Production tip: Always use pre-norm unless you’re replicating a specific historical architecture. It’s not optional—it’s essential for training deep Transformers (> 24 layers) without careful hyperparameter tuning.
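The two orderings differ only in where LayerNorm sits relative to the residual add. A toy sketch (the sublayer here is a stand-in; in a real model it would be attention or an FFN):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def post_norm_block(x, sublayer):
    """Original Transformer: normalize the residual sum itself."""
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    """Modern standard: normalize the input, then add the raw residual.
    The identity path x + ... is never normalized away, so gradients
    flow unchanged through arbitrarily many stacked blocks."""
    return x + sublayer(layer_norm(x))

sublayer = lambda x: 0.1 * x  # stand-in for attention or FFN
x = np.random.default_rng(0).normal(size=(3, 8))
print(post_norm_block(x, sublayer).shape, pre_norm_block(x, sublayer).shape)
```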
KV Cache: Efficient Autoregressive Inference
Autoregressive generation is the dominant inference pattern for decoder-only models (GPT, LLaMA), but naive implementation is extremely inefficient.
The Problem:
When generating token t, the model computes attention over all previous tokens 1 through t-1. For each token:
- Token 1: Attend to nothing (just its input embedding)
- Token 2: Attend to token 1 (compute attention with 1 key-value pair)
- Token 3: Attend to tokens 1-2 (compute attention with 2 key-value pairs)
- Token t: Attend to tokens 1 through t-1 (compute attention with t-1 key-value pairs)
Naive implementation: Recompute key and value projections for all previous tokens every time.
- Total cost: O(n²) key-value computations for generating n tokens
KV Cache Solution:
Keys and values for token i don't change when processing a later token t. Cache them!
When generating token t:
- Retrieve cached keys and values for tokens 1 through t-1
- Compute the new key and value for token t
- Append them to the cache
- Compute attention using all cached K, V and the new query
Cost: O(n) key-value computations for generating n tokens (2-3× speedup in practice).
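A minimal NumPy sketch of the cache loop, assuming single-head attention and illustrative dimensions (the class name `KVCache` is invented for this example):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    """Append-only cache of keys and values for one attention layer."""
    def __init__(self, d):
        self.K = np.empty((0, d))
        self.V = np.empty((0, d))

    def step(self, x_t, Wq, Wk, Wv):
        """Process one new token: compute its K,V once, reuse all cached ones."""
        q = x_t @ Wq
        self.K = np.vstack([self.K, x_t @ Wk])  # append, never recompute
        self.V = np.vstack([self.V, x_t @ Wv])
        weights = softmax(q @ self.K.T / np.sqrt(self.K.shape[-1]))
        return weights @ self.V

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(scale=0.3, size=(d, d)) for _ in range(3))
cache = KVCache(d)
for t in range(5):  # generate 5 tokens one at a time
    out = cache.step(rng.normal(size=(1, d)), Wq, Wk, Wv)
print(cache.K.shape)  # (5, 8): one cached key per generated token
```

Each step does O(1) new key-value projections instead of O(t), which is exactly where the O(n²) → O(n) saving comes from.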
Memory Requirements:
Cache size per layer:

cache_bytes_per_layer = 2 × batch_size × seq_len × d_model × bytes_per_value

The factor of 2 is for keys and values. For an FP16 model (2 bytes per value) with:
- Batch size: 1
- Sequence length: 2048 tokens
- Model dimension: 4096 (LLaMA-7B)
- Number of layers: 32

Cache size = 2 × 1 × 2048 × 4096 × 2 bytes ≈ 32 MB per layer
Total = 32 layers × 32 MB ≈ 1 GB for the full model
For LLaMA-70B or GPT-4 scale:
- Model dimension: 8192-12288
- Layers: 80-96
- Cache size: 30-50 GB per request at max context length
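These cache-size numbers can be reproduced directly; the FP16 byte width (2 bytes per value) is an assumption of this sketch:

```python
def kv_cache_bytes(batch, seq_len, d_model, n_layers, bytes_per_val=2):
    """KV cache size: 2 (K and V) x batch x seq x d_model x bytes, per layer."""
    per_layer = 2 * batch * seq_len * d_model * bytes_per_val
    return per_layer, per_layer * n_layers

# LLaMA-7B-like configuration, FP16 (2 bytes per value)
per_layer, total = kv_cache_bytes(batch=1, seq_len=2048, d_model=4096, n_layers=32)
print(per_layer / 2**20, "MiB per layer")  # 32.0 MiB
print(total / 2**30, "GiB total")          # 1.0 GiB
```

Note the formula is linear in both batch size and sequence length, which is why serving many long-context users simultaneously exhausts memory so quickly.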
Memory vs Compute Tradeoff:
Without cache: Low memory, high compute (recompute everything)
With cache: High memory, low compute (reuse cached K, V)
In production, memory is the bottleneck. For large models, KV cache can exceed model weight memory. Batching becomes limited by cache size, not compute.
Optimization: Multi-Query Attention (MQA) and Grouped-Query Attention (GQA)
Standard multi-head attention has separate K, V for each head, multiplying cache size by number of heads.
MQA (used in PaLM, GPT-4): All heads share the same K, V (only query is per-head)
- Reduces cache size by factor of num_heads (8-16×)
- Minimal quality loss (~1-2%)
GQA (LLaMA 2): Groups of heads share K, V
- Intermediate approach: 8 heads → 2 KV pairs (4× reduction)
- Better quality than MQA, still significant memory savings
Production Reality:
- All production LLM serving systems use KV cache
- Cache management is a major engineering challenge (eviction policies, quantization)
- Serving systems optimize batch size × context length product to max out cache memory
- For ChatGPT, Claude, GPT-4: cache size limits how many concurrent users can be served
When cache matters most:
- Long conversations (1000+ tokens of context)
- Batch inference (cache size × batch size)
- Large models (70B+ parameters with many layers)
Production tip: KV cache is mandatory for efficient inference—without it, generating 100 tokens takes 10-100× longer. Modern serving frameworks (vLLM, TensorRT-LLM, TGI) implement KV cache by default, but understanding cache size and memory requirements is critical for deployment planning.
Positional Encodings: Teaching Position to Parallel Models
Since Transformers process all positions in parallel (no inherent ordering), they need explicit position information. Without positional encodings, the model treats input as a bag of words—“dog bit man” and “man bit dog” would be identical.
Positional encodings are added to input embeddings:

x_i = Embedding(token_i) + PE(i)
Several variants exist, each with different properties:
Sinusoidal (Original Transformer):
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Where pos is the position index and i indexes the dimension pairs. Different dimensions use different frequencies, creating a unique encoding for each position.
Advantages:
- Deterministic (no learned parameters)
- Works for any sequence length (extrapolates to unseen lengths)
- Theoretically allows model to learn relative positions
Disadvantages:
- Fixed formula may not be optimal
- Extrapolation beyond training length is imperfect
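A direct NumPy implementation of the sinusoidal formula:

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angle = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)  # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)  # odd dimensions: cosine
    return pe

pe = sinusoidal_encoding(max_len=128, d_model=64)
print(pe.shape)            # (128, 64)
print(pe[0, 0], pe[0, 1])  # 0.0 1.0 -- sin(0) and cos(0) at position 0
```

Because the table is computed from a formula rather than learned, the same function produces encodings for any `max_len`, which is what allows (imperfect) extrapolation.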
Learned Positional Embeddings (BERT, GPT-2):
Train a lookup table: position p gets learned embedding e_p. Each position has its own trainable vector.
Advantages:
- Can learn optimal encodings for the training data
- Often performs slightly better than sinusoidal within training range
Disadvantages:
- Fixed maximum length (if trained on 512 tokens, cannot handle 1000)
- No extrapolation beyond max training length
- More parameters to store
RoPE (Rotary Position Embeddings, LLaMA, GPT-NeoX):
Instead of adding position info to embeddings, RoPE applies rotation matrices to queries and keys based on relative position. This encodes position as rotations in the attention space.
q'_m = R_m q_m,  k'_n = R_n k_n

Where R_m is a rotation matrix for position m. The attention dot product q'_m · k'_n then automatically encodes the relative position m - n.
Advantages:
- Encodes relative positions naturally (attention depends on distance, not absolute position)
- Better extrapolation to longer sequences than learned embeddings
- No added parameters
Disadvantages:
- More complex implementation
- Requires understanding of rotation matrices
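A sketch of RoPE applied to a single vector, rotating consecutive dimension pairs by position-dependent angles (real implementations differ in layout details; this interleaved-pair version is one common convention):

```python
import numpy as np

def rope(x, pos, base=10000):
    """Rotate consecutive dimension pairs of x by angles proportional to pos."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin  # 2D rotation of each (x1, x2) pair
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
# Dot products depend only on the relative offset m - n, not absolute position:
a = rope(q, pos=3) @ rope(k, pos=1)   # offset 2, positions 3 and 1
b = rope(q, pos=10) @ rope(k, pos=8)  # offset 2, positions 10 and 8
print(np.isclose(a, b))  # True
```

The final check is the key property: shifting both query and key by the same amount leaves the attention score unchanged, which is exactly what "encodes relative position" means.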
ALiBi (Attention with Linear Biases, BLOOM, MPT):
Don’t modify embeddings—instead, add a bias to attention scores based on distance:
score(i, j) = (q_i · k_j) / sqrt(d_k) - m · (i - j)

Where m is a head-specific slope (fixed before training, not learned, in the original ALiBi formulation). Positions farther apart get lower attention scores.
Advantages:
- Best extrapolation: Models trained on 512 tokens generalize to 10k+ tokens seamlessly
- Simple: just subtract distance penalty from attention scores
- No added parameters or embedding modifications
Disadvantages:
- Requires modifying attention computation
- Relatively new (less battle-tested than others)
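The ALiBi bias is just a distance penalty added to the score matrix; a minimal sketch (the slope value here is illustrative — the paper assigns each head a fixed slope from a geometric sequence):

```python
import numpy as np

def alibi_bias(n, slope):
    """Bias added to causal attention scores: -slope * (i - j) for j <= i."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return -slope * (i - j)  # 0 on the diagonal, more negative with distance

bias = alibi_bias(4, slope=0.5)
print(bias[3])  # row 3: penalty grows linearly with distance from token 3
```

Because the penalty depends only on distance, a model trained on short sequences applies the same bias rule at any length, which is the source of ALiBi's extrapolation ability.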
Modern Trend:
- GPT-2, BERT: Learned embeddings (2018-2019)
- GPT-3: Learned embeddings (2020)
- LLaMA, GPT-NeoX: RoPE (2021-2023)
- BLOOM, MPT: ALiBi (2022-2023)
- Trend: Moving toward relative position methods (RoPE, ALiBi) for better length extrapolation
Production choice: Use RoPE for general-purpose LLMs (LLaMA uses it). Use ALiBi if you need strong length extrapolation (e.g., train on 2k, deploy on 100k). Learned embeddings are legacy—only use if replicating older architectures.
Context Window Limitations: The Length Bottleneck
Attention scales quadratically with sequence length: O(n²) memory and compute. This creates a fundamental tradeoff between context window size and inference cost.
Typical Context Limits:
- BERT (2018): 512 tokens
- GPT-2 (2019): 1024 tokens
- GPT-3 (2020): 2048 tokens
- GPT-3.5 (2022): 4096 tokens
- GPT-4 (2023): 8192 tokens (standard), 32768 tokens (extended)
- Claude 2 (2023): 100k tokens
- GPT-4 Turbo (2023): 128k tokens
Why limits exist:
- Memory: The attention matrix is n × n. For 100k tokens: 10 billion entries per layer
- Compute: Computing attention scores requires O(n² · d) operations
- KV Cache: Storing cached keys/values for long contexts requires massive memory
Practical implications:
Doubling context length quadruples memory and compute:
- 2k context: baseline
- 4k context: 4× memory/compute
- 8k context: 16× memory/compute
- 32k context: 256× memory/compute vs 2k
For GPT-4 scale models, 32k context costs 16× more than 2k context per request. This is why longer context windows are more expensive to serve.
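The quadratic blow-up above is simple arithmetic:

```python
def relative_attention_cost(context, baseline=2048):
    """Attention is O(n^2): cost of a context length relative to a 2k baseline."""
    return (context / baseline) ** 2

for ctx in [2048, 4096, 8192, 32768]:
    print(ctx, "tokens ->", relative_attention_cost(ctx), "x baseline")
```

This is the whole story behind context pricing: a 16× length increase (2k → 32k) is a 256× increase in attention work, even before KV cache memory is counted.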
Workarounds:
- Sliding window: Keep only recent N tokens in context, discard older tokens
- Hierarchical attention: Compress old context into summary, attend fully to recent context
- Sparse attention: Attend to only a subset of tokens (local + global patterns)
- Retrieval-Augmented Generation (RAG): Store full context in vector database, retrieve relevant portions on-demand
- Compression: Summarize long documents before processing
Production reality:
Despite 100k-200k context windows being possible, most applications need < 4k tokens:
- Chat conversations: 1k-2k tokens
- Code completion: 1k-4k tokens
- Document Q&A: 2k-8k tokens (beyond this, use RAG)
The 100k+ context windows touted in marketing are often unnecessary. Engineering decision: balance context window vs cost. Use RAG for documents beyond 10k tokens instead of loading everything into context.
Cost example: For GPT-4:
- 2k context: baseline cost
- 32k context: 16× more expensive (memory + compute)
- Serving 1000 concurrent users with 32k context requires 16× more GPUs than 2k context
Understanding context limitations is critical for deployment: longer context = higher cost, and most use cases don’t need it.
Parallelism: Why GPUs Love Transformers
Transformers’ key advantage over RNNs is parallelization. RNNs process sequences serially—computing step t requires the hidden state from step t-1. This sequential dependency makes training slow: you can’t parallelize across time steps.
Transformers compute all positions simultaneously. The self-attention matrix is a single matrix multiplication that processes all positions in parallel. The feedforward network applies the same transformation to all positions independently, again parallelizable.
This enables massive speed-ups on GPUs and TPUs, which excel at parallel matrix operations. Training a Transformer on a 1000-token sequence is almost as fast as training on a 100-token sequence (memory and compute scale quadratically, but modern hardware handles this efficiently for reasonable lengths).
The comparison:
- RNN: Process 1000 tokens in 1000 serial steps
- Transformer: Process 1000 tokens in 1 parallel step (ignoring the depth of layers)
This parallelism enabled training on massive datasets (billions of tokens) that would have been intractable with RNNs. It’s a primary reason Transformers replaced RNNs for language modeling.
Scalability: Why Bigger Models Get Smarter
Transformers exhibit remarkable scaling properties: larger models (more parameters, more layers) trained on more data consistently improve performance. This is captured by scaling laws (Kaplan et al., 2020):

L(N, D) = (N_c / N)^α_N + (D_c / D)^α_D

Where:
- L is the test loss
- N is the number of model parameters
- D is the dataset size
- α_N, α_D, N_c, D_c are empirically determined constants
The key finding: performance improves predictably with scale. Doubling model size reduces loss by a consistent amount. Doubling data also reduces loss. There’s no sign of saturation—bigger keeps getting better.
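A sketch of the scaling law in code. The constants are the rough power-law fits reported by Kaplan et al. and are quoted here for illustration only:

```python
def scaling_loss(N, D, N_c=8.8e13, D_c=5.4e13, alpha_N=0.076, alpha_D=0.095):
    """L(N, D) = (N_c/N)^alpha_N + (D_c/D)^alpha_D (Kaplan et al. 2020 form)."""
    return (N_c / N) ** alpha_N + (D_c / D) ** alpha_D

# Doubling parameters at fixed data shrinks the model-size term by 2^-0.076:
small = scaling_loss(N=1e9, D=1e11)
large = scaling_loss(N=2e9, D=1e11)
print(large < small)  # True: more parameters, lower predicted loss
```

The practical value is the predictability: fit the constants on small runs, then read off the expected loss of a model 10× larger before spending the compute.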
This has profound implications:
- Performance is predictable: You can estimate how a 10B parameter model will perform based on experiments with 1B parameter models
- Scaling is a strategy: Instead of clever algorithms, scale up models and data
- Compute becomes the bottleneck: Training large models requires massive compute (thousands of GPUs for weeks)
Models have scaled from GPT-2 (1.5B parameters, 2019) to GPT-3 (175B, 2020) to GPT-4 (rumored ~1T+, 2023). Each increase brought qualitative improvements—new capabilities (few-shot learning, reasoning, coding) emerged at scale.
Why do Transformers scale so well?
- No architectural bottleneck: Unlike RNNs’ fixed-size hidden state, Transformers’ attention scales with model width and depth
- Expressiveness: With enough layers and width, Transformers can represent arbitrarily complex functions
- Optimization: Residual connections and layer normalization enable training very deep networks stably
- Data efficiency: Self-supervised learning (predict masked/next tokens) leverages unlabeled data effectively
Modality Agnostic: Text, Vision, Audio, and Beyond
Transformers were designed for text but generalize to any sequential or structured data. The key insight: any data can be tokenized into sequences and processed with attention.
Vision Transformers (ViT)
Treat images as sequences of patches. Divide a 224×224 image into 16×16 patches (196 patches total), flatten each patch into a vector, and process with a Transformer. This pure-attention approach matches or exceeds CNNs on image classification with sufficient data.
ViTs show that convolution’s spatial inductive bias isn’t necessary with enough data. Transformers learn spatial relationships from scratch through attention. For large-scale vision (millions of images), Transformers now dominate.
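Patch tokenization is a pure reshape; a minimal sketch reproducing the 224×224 → 196 patches arithmetic:

```python
import numpy as np

def patchify(image, patch=16):
    """Split an HxWxC image into flattened, non-overlapping patch tokens."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    patches = image.reshape(gh, patch, gw, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4)          # (gh, gw, patch, patch, c)
    return patches.reshape(gh * gw, patch * patch * c)  # one token per patch

img = np.zeros((224, 224, 3))
tokens = patchify(img)
print(tokens.shape)  # (196, 768): 14x14 patches, each a 16*16*3 = 768-dim vector
```

From here a ViT is just the standard Transformer: each 768-dim patch vector is linearly projected and processed exactly like a word embedding.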
Audio and Speech
Process audio as sequences of spectrograms or raw waveform chunks. Transformers excel at speech recognition (Whisper), music generation, and audio understanding.
Multi-modal Models
Transformers naturally handle multiple modalities by attending across modalities. CLIP (text-image), Flamingo (language-vision), and GPT-4 (text-image-audio) use Transformers to align representations across modalities. Attention between text and image tokens enables cross-modal reasoning.
Video, Time Series, Proteins, Chemistry
Transformers have been applied to video (treat frames as token sequences), time series forecasting, protein structure prediction (AlphaFold uses attention), molecular generation, and graph neural networks. Anywhere there’s structured data with dependencies, Transformers work.
The universality comes from attention being a general-purpose mechanism: it doesn’t assume spatial locality (like convolution) or sequential processing (like recurrence). It’s a flexible way to capture dependencies in any structured data.
Engineering Takeaway
Transformers are the universal architecture for modern AI. Understanding Transformers—encoder/decoder variants, pre-norm, KV cache, positional encodings, and context windows—is essential for building, deploying, and optimizing AI systems.
Decoder-only models dominate production, not encoder-decoder. Despite the original Transformer using encoder-decoder for translation, 90% of modern LLMs (GPT-3, GPT-4, LLaMA, Claude) are decoder-only. Why? Unified architecture (one model for understanding and generation), simpler training (pure causal prediction), better scaling (predictable to billions of parameters), and in-context learning (adapt via prompting). Use decoder-only unless you have specific requirements for bidirectional context without generation (embeddings, classification). Encoder-decoder is legacy except for translation and summarization.
Pre-norm is non-negotiable for deep Transformers. Original Transformers used post-norm (LayerNorm after sublayers) and required careful warmup to train stably. Modern Transformers use pre-norm (LayerNorm before sublayers), enabling stable training without warmup and supporting 100+ layer networks. GPT-3, LLaMA, GPT-4, Claude all use pre-norm. Connection to ResNet: pre-activation enables deeper networks. Never use post-norm for new models—it’s harder to train and offers no benefits. Pre-norm is standard practice.
KV cache is critical for inference efficiency. Autoregressive generation without KV cache recomputes keys/values for all previous tokens every step—O(n²) cost. With KV cache, store computed K,V and reuse them—O(n) cost, 2-3× speedup. Tradeoff: cache requires massive memory (4-50 GB for large models at long contexts). Production reality: KV cache limits concurrent users more than compute. All LLM serving systems (vLLM, TensorRT-LLM) use KV cache by default. Understanding cache memory requirements is critical for deployment planning.
Positional encodings determine length extrapolation. Learned embeddings (BERT, GPT-2) work well within training length but fail beyond. Sinusoidal encodings (original Transformer) extrapolate but imperfectly. RoPE (LLaMA) encodes relative positions via rotations—better extrapolation, no parameters. ALiBi (BLOOM) adds distance bias to attention—best extrapolation (train on 2k, deploy on 100k). Modern trend: RoPE for general LLMs, ALiBi for extreme length extrapolation. Choose based on max context requirements.
Context windows are a cost tradeoff, not just a feature. Attention is O(n²), so doubling context quadruples cost: 32k context costs 16× more than 2k context (memory + compute). Marketing emphasizes 100k-200k context windows, but most applications need < 4k tokens (chat: 1-2k, code: 1-4k, documents: use RAG beyond 10k). Engineering decision: balance context vs cost. Long contexts are expensive to serve—use RAG (retrieve relevant portions) instead of loading entire documents into context. Understanding this tradeoff prevents over-engineering.
Transformers scale predictably with compute and data. Scaling laws (Kaplan et al.): performance improves as a power law with model size and dataset size, no saturation observed. Doubling model size or data consistently reduces loss. This makes performance predictable and scaling a strategy: instead of clever algorithms, scale up. GPT-2 (1.5B) → GPT-3 (175B) → GPT-4 (~1T+) each brought qualitative improvements (reasoning, coding, few-shot learning). Scaling works because Transformers have no architectural bottleneck, unlike RNNs’ fixed hidden state.
Deployment optimization is mandatory at scale. Training is one-time cost, inference is continuous. Production requires: quantization (INT8/INT4 reduces memory 4-8×), KV cache (2-3× speedup), Flash Attention (2-4× speedup, no quality loss), mixed precision (FP16 training), and careful batching (maximize GPU utilization). For GPT-scale models, these aren’t optional—they’re required to meet latency (< 100ms per token) and cost constraints. Modern serving frameworks (vLLM, TGI) implement these by default, but understanding them is critical for capacity planning.
The lesson: Transformers are the universal architecture—all modern LLMs, vision models (ViT), and multimodal systems use them. Understanding architectural choices (decoder-only, pre-norm), inference optimizations (KV cache, efficient attention), and deployment tradeoffs (context length vs cost) is essential for production AI engineering. They’re not just an architecture—they’re how modern AI works.
References and Further Reading
Attention Is All You Need – Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin (2017) https://arxiv.org/abs/1706.03762
This is the paper that introduced Transformers and changed AI forever. Vaswani et al. showed that attention alone (without recurrence or convolution) achieves state-of-the-art results on machine translation. They introduced scaled dot-product attention, multi-head attention, positional encodings, and the encoder-decoder architecture. Every large language model since builds on this foundation. Reading this paper is essential—it’s the most influential paper in modern AI. Understanding the architecture, training details, and empirical results will clarify why Transformers dominate.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale – Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, et al. (2020) https://arxiv.org/abs/2010.11929
Vision Transformers (ViT) extended Transformers to vision, showing that pure attention (without convolution) matches or exceeds CNNs on image classification. Dosovitskiy et al. demonstrated that with sufficient data, Transformers’ flexibility outweighs CNNs’ spatial inductive bias. This paper explains how images are tokenized as patches and why Transformers are becoming universal across modalities. Reading this shows how Transformers generalize beyond text and why they’re replacing specialized architectures.
Scaling Laws for Neural Language Models – Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, et al. (2020) https://arxiv.org/abs/2001.08361
This OpenAI paper quantified how Transformer performance scales with model size, dataset size, and compute. Kaplan et al. showed that loss follows predictable power laws, making scaling a reliable strategy for improving performance. The paper explains why bigger models keep getting better and justifies the trend toward massive models (GPT-3, GPT-4, PaLM). Understanding scaling laws explains the economics and strategy of modern AI: compute and data matter more than algorithmic tricks. Reading this gives you the empirical foundation for why scaling works.