Chapter 17: Recurrent Neural Networks

How Machines Remember

Why Order Matters

Sequences carry meaning in their order. “Dog bites man” means something different from “Man bites dog.” Stock prices today depend on prices yesterday and last week. Speech is unintelligible without temporal context—the sounds /k/, /æ/, /t/ form “cat” only in sequence. A sentence’s meaning emerges from word order, not just word presence.

Standard feedforward neural networks ignore order. They process inputs independently, treating a sentence as a bag of words or an image as a grid of unrelated pixels. For images, this is suboptimal but manageable (CNNs add spatial structure). For sequences, it’s catastrophic. The network can’t distinguish “arrived before departed” from “departed before arrived” if it only sees word counts.

Recurrent Neural Networks (RNNs) are designed for sequential data. They maintain a hidden state—a memory vector that carries information from previous time steps to the current step. At each position in the sequence, the network combines the current input with its memory of the past to produce an output and update its memory. The same parameters are applied at every time step, creating a recurrent structure where information loops back into itself.

RNNs enabled early breakthroughs in machine translation, speech recognition, and text generation. But they have fundamental limitations—particularly with long sequences—that led to their replacement by Transformers (Chapter 20). Understanding RNNs is essential for understanding why Transformers won.

Hidden State: Memory as a Vector

The core idea of RNNs is the hidden state h_t—a fixed-size vector that encodes the network’s memory at time step t. This state is updated at each step based on the current input x_t and the previous hidden state h_{t-1}:

h_t = f(W_{hh} h_{t-1} + W_{xh} x_t + b_h)

Where:

  • h_t is the hidden state at time t (the network’s memory)
  • h_{t-1} is the previous hidden state
  • x_t is the current input
  • W_{hh} are recurrent weights (how past states influence the current state)
  • W_{xh} are input weights (how the current input affects the state)
  • b_h is a bias term
  • f is a nonlinear activation function (typically tanh or ReLU)

The output at each time step is computed from the hidden state:

y_t = g(W_{hy} h_t + b_y)

Where W_{hy} are output weights and g is an output activation (softmax for classification, identity for regression).
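The recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration, not a trainable implementation: the dimensions and random weights are arbitrary choices, and g is taken to be the identity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 3 input features, 4 hidden units, 2 outputs.
input_dim, hidden_dim, output_dim = 3, 4, 2

W_xh = rng.normal(0, 0.1, (hidden_dim, input_dim))   # input weights
W_hh = rng.normal(0, 0.1, (hidden_dim, hidden_dim))  # recurrent weights
W_hy = rng.normal(0, 0.1, (output_dim, hidden_dim))  # output weights
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

def rnn_forward(xs):
    """Apply h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h) over a sequence."""
    h = np.zeros(hidden_dim)            # h_0: empty memory
    outputs = []
    for x in xs:                        # the same weights are reused at every step
        h = np.tanh(W_hh @ h + W_xh @ x + b_h)
        outputs.append(W_hy @ h + b_y)  # y_t = g(W_hy h_t + b_y), g = identity
    return np.array(outputs), h

xs = rng.normal(size=(5, input_dim))    # a toy sequence of 5 time steps
ys, h_final = rnn_forward(xs)
print(ys.shape, h_final.shape)          # (5, 2) (4,)
```

Note that h_final depends on every input in the sequence—this single vector is the network’s entire memory of the past.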

What does the hidden state represent?

For language modeling, h_t might encode:

  • Recent words and their relationships
  • Syntactic structure (are we in a subordinate clause?)
  • Semantic context (what topic is being discussed?)
  • Long-term dependencies (the subject of the sentence, mentioned 10 words ago)

For time series prediction, h_t might encode:

  • Recent trends (increasing, decreasing, stable)
  • Seasonality patterns
  • Anomalies or regime changes

The network learns what to remember by adjusting W_{hh}, W_{xh}, and W_{hy} during training. Information that helps predict future outputs is preserved in the hidden state; irrelevant information is discarded.

Hidden State: Memory as a Vector diagram

The diagram shows an RNN unrolled through time. The same RNN cell (with shared weights) processes each input. The hidden state (red arrows) carries information forward through time, creating a memory of the past.

Fixed-size bottleneck

The hidden state has fixed dimensionality (typically 128, 256, or 512). No matter how long the sequence, all relevant past information must compress into this vector. For short sequences, this works. For long sequences, it becomes a severe bottleneck—information from the distant past must either be forgotten or compressed so heavily that it becomes useless.

This is one reason RNNs struggle with long-range dependencies, as we’ll see next.

Long-Term Dependencies: Why RNNs Struggle

The fundamental problem with vanilla RNNs is that they cannot learn long-range dependencies—patterns where information at time t depends on information from time t - 100 or earlier. This failure has two causes: the vanishing gradient problem and the fixed-size hidden state bottleneck.

Vanishing Gradients

Training RNNs requires backpropagation through time (BPTT): unroll the network across all time steps and backpropagate gradients from the final loss back to early time steps. At each step backward, gradients are multiplied by W_{hh} (the recurrent weight matrix) and the derivative of the activation function.

If the largest eigenvalue of W_{hh} is less than 1, gradients shrink exponentially as they backpropagate through time. After 10 steps, gradients may be multiplied by 0.9^{10} \approx 0.35. After 100 steps, 0.9^{100} \approx 0.000027—effectively zero.

When gradients vanish, early time steps don’t receive meaningful learning signals. The network cannot learn that word 1 affects word 100 because the gradient signal from step 100 never makes it back to step 1. The network becomes effectively “memoryless” beyond a short window (typically 10-20 steps).
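The decay is easy to see numerically. In this toy sketch (the dimensions and the 0.9 scale are illustrative, and the activation derivative is omitted), a gradient vector is repeatedly multiplied by a recurrent matrix whose largest singular value is 0.9—mimicking one backward step per multiplication:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 16

# A recurrent matrix rescaled so its largest singular value is 0.9.
W = rng.normal(size=(hidden_dim, hidden_dim))
W *= 0.9 / np.linalg.norm(W, 2)

grad = rng.normal(size=hidden_dim)     # gradient arriving at the final time step
norms = []
for t in range(100):                   # backpropagate through 100 steps
    grad = W.T @ grad                  # each step multiplies by W_hh^T
    norms.append(np.linalg.norm(grad))

print(f"after 10 steps:  {norms[9]:.4f}")
print(f"after 100 steps: {norms[99]:.2e}")
```

The norm after 100 steps is vanishingly small—whatever learning signal the loss sends backward, almost none of it reaches the earliest inputs.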

Exploding Gradients

The opposite problem: if the largest eigenvalue of W_{hh} exceeds 1, gradients grow exponentially. After 100 steps, they might be multiplied by 1.1^{100} \approx 13{,}780. Gradients explode to NaN (not a number), causing training to diverge.

Exploding gradients are easier to fix than vanishing gradients—gradient clipping (rescaling gradients if their norm exceeds a threshold) prevents explosions. But this doesn’t solve vanishing gradients. The fundamental issue is that vanilla RNNs have difficulty maintaining gradient flow through many time steps.
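Gradient clipping by global norm is simple enough to sketch directly. This is a generic illustration (the threshold of 5.0 is a common but arbitrary choice), not tied to any particular framework:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays if their combined norm exceeds max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads

exploding = [np.full(10, 100.0)]          # a gradient with norm ~316
clipped = clip_by_global_norm(exploding)
print(np.linalg.norm(clipped[0]))          # rescaled to ~5.0

small = clip_by_global_norm([np.ones(3)])  # norm ~1.7: left untouched
print(np.linalg.norm(small[0]))
```

The key property is that clipping preserves the gradient’s direction—only its magnitude is capped—so parameter updates still point the right way, just less violently.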

Information Bottleneck

Even if gradients flowed perfectly, the fixed-size hidden state creates an information bottleneck. To remember information from 100 steps ago, it must survive 100 updates—each potentially overwriting the state. The network must learn to preserve relevant information while incorporating new information, which is difficult without explicit memory mechanisms.

For tasks like language modeling, this manifests as:

  • Forgetting the subject of a sentence after multiple clauses
  • Failing to match opening and closing brackets in code after many tokens
  • Losing track of narrative context in long documents

LSTM and GRU: Protecting Memory

Long Short-Term Memory (LSTM) networks solved the vanishing gradient problem through a clever architectural innovation: gated memory cells. Instead of overwriting the hidden state at every step, LSTMs selectively control what information to keep, what to discard, and what to output using learned gates.

LSTM Architecture

An LSTM cell has two states:

  • Cell state c_t: Long-term memory, flows through time with minimal interference
  • Hidden state h_t: Short-term memory, exposed to the network

The cell state is protected by three gates:

1. Forget Gate decides what to discard from memory:

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)

Outputs values in (0, 1) for each dimension of c_{t-1}. Values near 0 mean “forget this,” values near 1 mean “keep this.”

2. Input Gate decides what new information to store:

i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)
\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)

The input gate i_t controls how much of the candidate update \tilde{c}_t to add to memory.

3. Cell State Update combines forgetting and adding:

c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t

Where \odot is element-wise multiplication. This allows the cell state to flow through time relatively unchanged (if f_t \approx 1 and i_t \approx 0), preserving long-term dependencies.

4. Output Gate decides what to expose:

o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)
h_t = o_t \odot \tanh(c_t)

The hidden state h_t is a filtered version of the cell state, controlling what information is passed to the next layer or output.
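The four equations above map directly to code. A minimal NumPy sketch of one LSTM step follows—the weight shapes and random initialization are illustrative only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step: each gate operates on the concatenation [h_{t-1}, x_t]."""
    W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o = params
    z = np.concatenate([h_prev, x])

    f = sigmoid(W_f @ z + b_f)         # forget gate: what to discard from c_{t-1}
    i = sigmoid(W_i @ z + b_i)         # input gate: how much candidate to add
    c_tilde = np.tanh(W_c @ z + b_c)   # candidate update
    c = f * c_prev + i * c_tilde       # cell state update: forget, then add
    o = sigmoid(W_o @ z + b_o)         # output gate: what to expose
    h = o * np.tanh(c)                 # hidden state is a filtered cell state
    return h, c

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
shape = (hidden_dim, hidden_dim + input_dim)
params = []
for _ in range(4):                      # four weight matrices, four bias vectors
    params += [rng.normal(0, 0.1, shape), np.zeros(hidden_dim)]

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x in rng.normal(size=(5, input_dim)):  # process a toy 5-step sequence
    h, c = lstm_step(x, h, c, params)
print(h.shape, c.shape)                    # (4,) (4,)
```

Notice that c is only ever touched multiplicatively (by f) and additively (by i ⊙ c̃)—there is no full overwrite, which is exactly what lets information persist.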

LSTM and GRU: Protecting Memory diagram

The diagram shows an LSTM cell with three gates controlling information flow. The cell state (red) flows horizontally with minimal interference. Gates (forget, input, output) modulate what enters, stays in, and exits from memory.

Why LSTMs work

The key insight is that gradients can flow through the cell state with minimal attenuation. The forget gate multiplicatively scales c_{t-1}, and if f_t \approx 1, gradients flow backward unchanged. This creates a “highway” for gradients across many time steps, solving the vanishing gradient problem.

LSTMs can learn dependencies spanning hundreds of steps because the cell state provides a protected memory channel. Information added at step 1 can survive to step 100 if the forget gate keeps it and the input gate doesn’t overwrite it.

Gated Recurrent Units (GRUs)

GRUs simplify LSTMs by combining the forget and input gates into an “update gate” and removing the separate cell state:

z_t = \sigma(W_z [h_{t-1}, x_t])
r_t = \sigma(W_r [h_{t-1}, x_t])
\tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}, x_t])
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t

Where z_t is the update gate (controls how much of the past to keep) and r_t is the reset gate (controls how much of the past to use when computing the new state). GRUs have fewer parameters than LSTMs and often perform similarly, making them a popular choice when computational efficiency matters.
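For comparison with the LSTM, here is the same kind of minimal NumPy sketch for one GRU step (bias terms are omitted, matching the equations above; shapes and weights are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, params):
    """One GRU step: a single interpolation replaces the LSTM's two states."""
    W_z, W_r, W_h = params
    zx = np.concatenate([h_prev, x])
    z = sigmoid(W_z @ zx)                            # update gate
    r = sigmoid(W_r @ zx)                            # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x]))
    return (1 - z) * h_prev + z * h_tilde            # blend old and new state

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
params = [rng.normal(0, 0.1, (hidden_dim, hidden_dim + input_dim))
          for _ in range(3)]                          # three weight matrices, not four

h = np.zeros(hidden_dim)
for x in rng.normal(size=(5, input_dim)):
    h = gru_step(x, h, params)
print(h.shape)   # (4,)
```

The parameter savings are visible directly: three weight matrices instead of the LSTM’s four, and one state vector instead of two.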

Bidirectional RNNs: Processing Sequences Both Ways

Standard RNNs process sequences left-to-right, maintaining a hidden state that accumulates information from past time steps. But many sequence tasks benefit from both past and future context. Consider the sentence: “The bank was steep and grassy.” To understand that “bank” means riverbank (not financial institution), you need to see both the words before (“The”) and after (“steep,” “grassy”).

Bidirectional RNNs solve this by running two RNNs in parallel:

  • Forward RNN processes the sequence left-to-right: \overrightarrow{h}_t = f(W_{hh} \overrightarrow{h}_{t-1} + W_{xh} x_t)
  • Backward RNN processes the sequence right-to-left: \overleftarrow{h}_t = f(W_{hh} \overleftarrow{h}_{t+1} + W_{xh} x_t)

At each position t, the outputs are concatenated: h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]. This gives each position access to both past context (from the forward RNN) and future context (from the backward RNN).
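The mechanics reduce to running the same kind of recurrence twice and concatenating. A minimal NumPy sketch (each direction has its own weights; all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
W_xh_f = rng.normal(0, 0.1, (hidden_dim, input_dim))   # forward input weights
W_hh_f = rng.normal(0, 0.1, (hidden_dim, hidden_dim))  # forward recurrent weights
W_xh_b = rng.normal(0, 0.1, (hidden_dim, input_dim))   # backward input weights
W_hh_b = rng.normal(0, 0.1, (hidden_dim, hidden_dim))  # backward recurrent weights

def run(xs, W_xh, W_hh):
    """Plain RNN pass over xs, returning the hidden state at every position."""
    h = np.zeros(hidden_dim)
    states = []
    for x in xs:
        h = np.tanh(W_hh @ h + W_xh @ x)
        states.append(h)
    return states

xs = rng.normal(size=(6, input_dim))
fwd = run(xs, W_xh_f, W_hh_f)                  # left-to-right pass
bwd = run(xs[::-1], W_xh_b, W_hh_b)[::-1]      # right-to-left pass, re-aligned
h_bi = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
print(len(h_bi), h_bi[0].shape)                # 6 (8,)
```

Each concatenated state summarizes everything to its left (from fwd) and everything to its right (from bwd)—which is exactly why this only works when the whole sequence is available upfront.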

Use cases:

Bidirectional RNNs are natural for tasks where you have the full sequence upfront:

  • Named entity recognition: “Paris” could be a person or city—future context (“Paris is the capital of France”) disambiguates
  • Sentiment analysis: “I thought this movie would be terrible, but it was great!”—the final sentiment contradicts early words
  • Speech recognition: phonemes are ambiguous without surrounding context in both directions
  • Protein structure prediction: amino acid folding depends on neighbors on both sides

When not to use bidirectional RNNs:

Any task requiring streaming or real-time processing cannot use bidirectional RNNs—you can’t wait for future tokens that haven’t arrived yet. Language modeling (predicting the next word) is inherently unidirectional because you don’t have access to future words. Voice assistants processing speech as it’s spoken must use unidirectional models.

Connection to Transformers: Bidirectional context is one reason BERT (Chapter 20) outperformed previous language models for understanding tasks. Transformers achieve bidirectional context through self-attention rather than separate forward/backward passes, making them more efficient.

Seq2Seq and Encoder-Decoder Architecture

Many sequence tasks involve mapping one sequence to another: translating English to French, summarizing documents, converting speech to text. Sequence-to-sequence (seq2seq) models handle these by splitting the problem into two phases: encoding and decoding.

Architecture:

The encoder is an RNN (typically LSTM or GRU) that processes the input sequence and compresses it into a fixed-size context vector:

h_1, h_2, \ldots, h_n = \text{Encoder}(x_1, x_2, \ldots, x_n)
c = h_n \quad \text{(final hidden state becomes context)}

The decoder is another RNN that generates the output sequence, conditioned on the context vector:

y_1, y_2, \ldots, y_m = \text{Decoder}(c)

At each decoding step, the decoder uses its previous hidden state, the previously generated token, and the context vector to predict the next token. The decoder essentially answers: “Given the encoded input and what I’ve generated so far, what comes next?”
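The encode-then-decode loop can be sketched with untrained toy weights—the point is the control flow, not the (meaningless) output tokens. Token id 0 plays the role of an end-of-sequence marker; all names and sizes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden_dim = 8, 16   # toy vocabulary of 8 token ids; id 0 = <eos>
E = rng.normal(0, 0.1, (vocab, hidden_dim))              # toy embedding table
W_enc = rng.normal(0, 0.1, (hidden_dim, 2 * hidden_dim))
W_dec = rng.normal(0, 0.1, (hidden_dim, 2 * hidden_dim))
W_out = rng.normal(0, 0.1, (vocab, hidden_dim))

def encode(tokens):
    """Compress the whole input into the final hidden state (the context vector)."""
    h = np.zeros(hidden_dim)
    for t in tokens:
        h = np.tanh(W_enc @ np.concatenate([h, E[t]]))
    return h

def decode(context, max_len=10):
    """Greedy decoding: feed each generated token back in until <eos> or max_len."""
    h, token, out = context, 0, []
    for _ in range(max_len):
        h = np.tanh(W_dec @ np.concatenate([h, E[token]]))
        token = int(np.argmax(W_out @ h))   # pick the most likely next token
        if token == 0:                       # <eos>: stop generating
            break
        out.append(token)
    return out

generated = decode(encode([3, 5, 2, 7]))
print(generated)
```

Note where the bottleneck lives: decode sees only the single vector returned by encode. Every detail of the input sequence must survive inside it.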

The bottleneck problem:

The context vector c is a fixed-size vector (typically 256-512 dimensions) that must encode the entire input sequence—whether it’s 10 words or 100 words. For long sequences, this becomes an information bottleneck. Important details from early in the sequence are lost or overwritten as the encoder processes later tokens.

This bottleneck was a key motivation for the attention mechanism (Chapter 19). Instead of compressing everything into a single context vector, attention allows the decoder to look back at all encoder hidden states and focus on relevant parts of the input at each decoding step. Attention-based seq2seq models (Bahdanau et al., 2014) significantly outperformed vanilla seq2seq on translation tasks.

Applications before Transformers:

Before Transformers dominated, seq2seq models powered:

  • Machine translation (Google Translate, 2016)
  • Text summarization
  • Dialogue systems and chatbots
  • Image captioning (CNN encoder + RNN decoder)

Modern status: Seq2seq established the encoder-decoder pattern that Transformers still use (T5, BART). But RNN-based seq2seq has been replaced by Transformer-based seq2seq, which parallelizes across time steps and handles longer contexts more effectively.

Production Example: Time Series Forecasting with LSTMs

Despite being superseded by Transformers for language, RNNs remain the practical choice for many time series forecasting tasks. Consider a real-time server load prediction system deployed at a cloud infrastructure company:

Task: Predict the next 10 minutes of CPU load based on the previous 100 minutes, updating predictions every minute as new data arrives.

Architecture:

  • Input: Sequence of 100 timesteps, each with 5 features (CPU %, memory %, network I/O, disk I/O, request rate)
  • LSTM layer: 128 hidden units, processes the sequence sequentially
  • Dense output layer: Maps final hidden state to 10 predictions (next 10 minutes)
  • Total parameters: ~200k (~800 KB model size)

Why LSTM instead of Transformer?

  1. Latency: LSTM inference takes < 1ms on CPU. Transformers require 10-50ms for comparable sequence lengths due to attention’s quadratic complexity.
  2. Memory: The model fits in 1 MB. Transformer would require 10-100 MB for similar capacity.
  3. Streaming: The LSTM updates its hidden state as each new data point arrives—no need to buffer the full sequence or recompute attention. Transformers require the full sequence upfront (though streaming variants exist).
  4. Deployment: Runs on edge devices, microcontrollers, or low-memory environments where Transformer inference is impractical.

Performance:

  • Mean Absolute Error: 5-7% (within acceptable range for auto-scaling decisions)
  • Throughput: 1000 predictions/second on a single CPU core
  • Deployment: Kubernetes pods, edge gateways, IoT devices

When Transformers win:

For tasks where accuracy is paramount and latency/memory are less constrained (e.g., batch forecasting, long-horizon predictions, multivariate dependencies), Transformers achieve 10-20% better accuracy. But they’re 10-100× slower and larger.

The tradeoff: RNNs trade off accuracy for speed, size, and streaming capability. For real-time systems with strict latency requirements, memory constraints, or edge deployment, LSTMs remain the engineering choice.

Engineering Takeaway

RNNs were the first architecture to handle sequences, and LSTMs made long-range dependencies learnable. But they’ve largely been replaced by Transformers for language tasks. Understanding when to still use RNNs—and why Transformers won—requires understanding RNNs’ inherent limitations.

When to use RNNs

Despite being superseded for language, RNNs remain useful for:

1. Time series with true sequential dependence: When data arrives one point at a time and order is critical (sensor streams, financial ticks), RNNs (especially LSTMs) are natural. They process sequentially without needing to buffer the entire sequence.

2. Low-dimensional sequences: For short sequences with few features, RNNs are computationally cheaper than Transformers. A 1D time series with 100 time steps is well-suited to RNNs.

3. Streaming applications: RNNs can process data online—update the hidden state as new inputs arrive without reprocessing the entire history. Transformers require the full sequence upfront (though there are streaming variants).

4. Memory-constrained environments: RNNs have smaller memory footprints than Transformers for long sequences because they don’t store attention matrices.

Why Transformers replaced RNNs for language

Transformers (Chapter 20) solved RNNs’ fundamental limitations:

  1. Parallelization: RNNs process sequences serially—step t depends on step t-1. This makes training slow. Transformers process all positions in parallel, enabling GPU acceleration.

  2. Long-range dependencies: Even LSTMs struggle with dependencies beyond a few hundred steps. Transformers use attention to directly connect all positions, making any long-range dependency easy to learn.

  3. No hidden state bottleneck: RNNs compress everything into a fixed-size vector. Transformers let each position attend to all other positions directly, avoiding compression loss.

Production considerations

  • Bidirectional RNNs: Run the RNN forward and backward through the sequence, concatenating hidden states. This gives each position both past and future context, improving accuracy on tasks where you have the full sequence upfront (not streaming).

  • Sequence-to-sequence (seq2seq): Encode the input sequence into a final hidden state (context vector), then decode it into an output sequence. This powered early machine translation systems but has been replaced by Transformer-based approaches.

  • Gradient clipping is mandatory: Always clip gradients by norm when training RNNs to prevent exploding gradients. This is a simple but critical stability trick.

  • LSTMs vs GRUs: LSTMs are more expressive (more parameters) but slower. GRUs are faster and often perform similarly. Try both and choose based on validation performance and compute constraints.

The lesson: RNNs introduced the idea of recurrent memory for sequences, and LSTMs made it practical. But recurrence is inherently sequential and creates bottlenecks. Transformers showed that attention—connecting all positions directly—is a better solution for most sequence tasks. RNNs remain relevant for true streaming applications and low-dimensional time series, but for language and most other sequence tasks, Transformers have won.


References and Further Reading

Long Short-Term Memory – Sepp Hochreiter and Jürgen Schmidhuber (1997) https://www.bioinf.jku.at/publications/older/2604.pdf

This is the paper that introduced LSTMs and solved the vanishing gradient problem in RNNs. Hochreiter and Schmidhuber showed that gated memory cells allow networks to learn dependencies over hundreds of time steps, which vanilla RNNs couldn’t do. Reading this paper (especially the motivation and architecture sections) explains why gates are necessary and how they preserve gradient flow. LSTMs enabled the first wave of successful sequence learning systems.

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation – Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, et al. (2014) https://arxiv.org/abs/1406.1078

This paper introduced GRUs—a simpler alternative to LSTMs that’s faster and often performs similarly. Cho et al. also introduced the encoder-decoder architecture for sequence-to-sequence learning, which became the standard for machine translation until Transformers. Understanding GRUs shows that you don’t always need LSTM’s full complexity, and understanding encoder-decoder shows how RNNs were used to map between sequences.

The Unreasonable Effectiveness of Recurrent Neural Networks – Andrej Karpathy (2015) https://karpathy.github.io/2015/05/21/rnn-effectiveness/

This is one of the most influential blog posts on RNNs. Karpathy demonstrates what character-level RNNs can learn—generating Shakespeare, LaTeX, Linux kernel code—and provides intuition for how RNNs represent and generate sequences. Reading this gives you a feel for what RNNs do internally and why they’re powerful despite their limitations. It also shows failure modes (repetition, forgetting) that motivated the shift to Transformers.