Chapter 37: Multimodal Models

For decades, AI systems were specialists. Computer vision models recognized images. Speech recognition models transcribed audio. Language models generated text. Each modality—vision, audio, language—had its own architecture, its own training pipeline, its own community of researchers. The models did not talk to each other. A vision model could classify an image as “dog,” but it could not explain why. A language model could write about dogs, but it could not see them.

This changed with the realization that all modalities can be represented as tokens—discrete units processed by the same Transformer architecture. Images are patches. Audio is spectrograms. Text is words. All become sequences fed into a unified model. In 2021, OpenAI released CLIP, a model that learned to align images and text by training on 400 million image-caption pairs scraped from the internet. CLIP did not require carefully labeled data—just images with associated text, mined from the web. This weak supervision, at scale, enabled zero-shot transfer: CLIP could classify images it had never seen during training, guided only by text descriptions.

CLIP was the turning point. After CLIP came Flamingo (DeepMind), which interleaved images and text for few-shot visual question answering. Then Whisper (OpenAI), which transcribed speech across 97 languages with unprecedented robustness. Then GPT-4V, which analyzed images, charts, and diagrams alongside text. Then Gemini (Google), trained natively on text, images, audio, and video. Multimodal AI is no longer experimental—it is the default for frontier models.

This chapter explains how multimodal models work, why unifying modalities improves performance, and where current models still fail. Understanding multimodal models is understanding the next generation of AI systems: not text-only assistants, but systems that see, hear, and speak.


Unifying Modalities: Text, Vision, Audio as Tokens

The Transformer architecture processes sequences of tokens. Originally designed for text, Transformers work on any sequence: image patches, audio frames, video clips. The key insight: everything can be tokenized.

Text Tokenization

Text is already discrete. Break sentences into words, subwords (BPE, SentencePiece), or characters. Each token maps to an embedding vector. The Transformer processes these embeddings.

Example: “The cat sat” → tokens ["The", " cat", " sat"] → embeddings → Transformer.
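This mapping can be sketched in a few lines. The three-word vocabulary and 8-dimensional embedding table below are illustrative stand-ins, not a real tokenizer:

```python
import numpy as np

# Toy vocabulary; real tokenizers (BPE, SentencePiece) learn their pieces from data.
vocab = {"The": 0, " cat": 1, " sat": 2}
embedding_table = np.random.randn(len(vocab), 8)   # one 8-dim vector per token id

pieces = ["The", " cat", " sat"]                   # "The cat sat" after splitting
ids = [vocab[p] for p in pieces]                   # [0, 1, 2]
token_embeddings = embedding_table[ids]            # shape (3, 8): input to the Transformer
print(ids, token_embeddings.shape)
```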

Vision Tokenization (Vision Transformer, ViT)

Images are continuous 2D grids of pixels. To tokenize:

  1. Divide the image into patches (e.g., 16×16 pixel squares)
  2. Flatten each patch into a vector
  3. Project each vector into an embedding
  4. Treat the sequence of patch embeddings as tokens

Example: 224×224 image divided into 14×14 = 196 patches of 16×16 pixels. Each patch becomes a token. The Transformer processes 196 tokens representing the image.
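The four steps can be sketched with NumPy, using the dimensions from the example above. The random projection weights stand in for ViT's learned linear projection:

```python
import numpy as np

H = W = 224                               # image resolution
P = 16                                    # patch size
D = 768                                   # embedding dimension (ViT-Base uses 768)

image = np.random.rand(H, W, 3)           # random stand-in for an RGB image

# Steps 1-2: divide into 16x16 patches, flatten each to a 768-long pixel vector.
patches = image.reshape(H // P, P, W // P, P, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * 3)  # (196, 768)

# Step 3: linear projection into the embedding space.
W_proj = np.random.randn(P * P * 3, D) * 0.02
tokens = patches @ W_proj                 # (196, 768): one token per patch

# Step 4: this sequence of 196 tokens is what the Transformer processes.
print(tokens.shape)
```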

Why this works: Patches capture local structure (edges, textures, objects), and the Transformer’s self-attention learns spatial relationships between patches. After training on millions of images, ViT matches or beats convolutional neural networks (CNNs) on image classification.

Audio Tokenization

Audio is a continuous waveform. To tokenize:

  1. Convert waveform to spectrogram (frequency representation over time)
  2. Divide spectrogram into time slices (e.g., 20ms windows)
  3. Treat each time slice as a token embedding

Alternatively, encode raw audio waveforms with a learned encoder (Wav2Vec, HuBERT) that outputs discrete tokens.

Example: 10-second audio clip → 500 spectrogram frames → 500 tokens → Transformer.
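A minimal sketch of this pipeline, assuming a 16 kHz sample rate and non-overlapping 20 ms windows. Real systems such as Whisper use overlapping windows and mel filterbanks instead of the raw magnitude spectrum computed here:

```python
import numpy as np

sr = 16_000
waveform = np.random.randn(10 * sr)        # 10 s of audio (random stand-in)

win = int(0.020 * sr)                      # 20 ms -> 320 samples per frame
n_frames = len(waveform) // win            # 500 non-overlapping frames
frames = waveform[: n_frames * win].reshape(n_frames, win)

# Magnitude spectrum of each frame: one "token" per 20 ms slice.
spectrogram = np.abs(np.fft.rfft(frames, axis=1))   # (500, 161)
print(spectrogram.shape)
```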

Whisper (OpenAI’s speech recognition model) uses this approach: audio → log-mel spectrogram → Transformer encoder → text tokens (transcription).

Unified Architecture

Once all modalities are tokenized, the same Transformer architecture processes them. Text tokens, image patch tokens, and audio tokens all flow through self-attention and feedforward layers. The model learns representations that bridge modalities.

Key advantage: Unified models leverage data from multiple modalities. A model that learns from both text and images develops richer representations than a text-only or vision-only model. Language grounds vision; vision grounds language.


Cross-Modal Grounding: Why Meaning Becomes Richer

Language models trained on text alone learn statistical patterns: which words follow which, which phrases sound natural. But text is disconnected from the physical world. The model reads “red ball” without seeing red or round. It predicts “gravity pulls objects down” without understanding forces or motion.

Grounding means connecting language to perception. Multimodal models learn associations between words and sensory inputs: “red” linked to red pixels, “ball” linked to circular shapes. This grounding improves generalization and enables new capabilities.

CLIP: Contrastive Language-Image Pretraining

CLIP (OpenAI, 2021) trained two encoders—one for images, one for text—to align in a shared embedding space.

Training process:

  1. Collect 400 million (image, text) pairs from the internet
    • Images with captions, alt-text, surrounding text
  2. Encode each image with a vision encoder (ViT)
  3. Encode each caption with a text encoder (Transformer)
  4. Compute similarity between image and text embeddings (cosine similarity)
  5. Optimize contrastive loss:
    • Maximize similarity for correct (image, caption) pairs
    • Minimize similarity for incorrect pairs

Contrastive loss forces alignment: If an image shows a dog, its embedding should be close to “a photo of a dog” and far from “a photo of a cat.” After training on 400M pairs, CLIP learns to associate visual patterns with language descriptions.
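The core of the training step, the symmetric contrastive loss over one batch, can be sketched with random embeddings standing in for the encoder outputs. The 0.07 temperature mirrors CLIP's initial value, though the real model learns it:

```python
import numpy as np

rng = np.random.default_rng(0)
B, D = 4, 32                                  # batch of 4 (image, caption) pairs
img = rng.normal(size=(B, D))                 # stand-in vision-encoder outputs
txt = rng.normal(size=(B, D))                 # stand-in text-encoder outputs

# L2-normalize so dot products are cosine similarities (step 4).
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

temperature = 0.07                            # CLIP learns this; fixed here
logits = img @ txt.T / temperature            # (B, B): image i vs. every caption

def cross_entropy(logits, targets):
    # Row-wise softmax cross-entropy, max-subtracted for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Step 5: correct pairs sit on the diagonal, so the target for row i is i.
targets = np.arange(B)
loss = 0.5 * (cross_entropy(logits, targets)       # image -> caption direction
              + cross_entropy(logits.T, targets))  # caption -> image direction
print(float(loss))
```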

Zero-shot transfer:

CLIP enables zero-shot image classification without fine-tuning. To classify an image into categories {cat, dog, car}:

  1. Encode the image → embedding v
  2. Encode text prompts: “a photo of a cat”, “a photo of a dog”, “a photo of a car” → embeddings t_1, t_2, t_3
  3. Compute similarity: sim(v, t_i) for each category
  4. Predict the category with highest similarity

CLIP achieves competitive accuracy on ImageNet (image classification) despite never being trained on ImageNet labels. It generalizes to new tasks using language as the interface.
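These four steps reduce to a similarity lookup. The sketch below swaps the real encoders for a hypothetical deterministic pseudo-encoder, so only the prompt-building, similarity, and argmax logic mirrors CLIP:

```python
import numpy as np
import zlib

def encode_text(prompt):
    # Stand-in for CLIP's text encoder: a deterministic pseudo-embedding
    # seeded by the prompt. A real encoder produces semantically meaningful vectors.
    g = np.random.default_rng(zlib.crc32(prompt.encode()))
    v = g.normal(size=16)
    return v / np.linalg.norm(v)

labels = ["cat", "dog", "car"]
prompts = [f"a photo of a {label}" for label in labels]
text_emb = np.stack([encode_text(p) for p in prompts])   # (3, 16)

# Pretend the image encoder returned an embedding near the "dog" prompt.
rng = np.random.default_rng(1)
image_emb = text_emb[1] + 0.05 * rng.normal(size=16)
image_emb /= np.linalg.norm(image_emb)

sims = text_emb @ image_emb              # cosine similarity to each prompt
prediction = labels[int(np.argmax(sims))]
print(prediction)
```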

Why grounding matters:

CLIP’s text encoder learns richer representations than a text-only model. Because it is trained to align with images, it learns that “red” relates to color, “ball” relates to shape, “dog” relates to furry four-legged animals. These visual associations improve language understanding.

Conversely, CLIP’s vision encoder learns richer representations than a vision-only model. Because it is trained to align with text, it learns high-level semantic concepts (“dog,” “running,” “outdoors”) instead of just low-level features (edges, textures).

Flamingo: Interleaving Images and Text

Flamingo (DeepMind, 2022) extended multimodal learning to few-shot visual question answering. Given a sequence of interleaved images and text, Flamingo answers questions about the images.

Architecture:

  • Frozen vision encoder (processes images → embeddings)
  • Language model (processes text tokens)
  • Cross-attention layers connecting vision and language
    • Language model attends to image embeddings when generating text
    • “What is in this image?” → model looks at image embeddings → generates caption
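The vision-language connection can be sketched as a single cross-attention layer in which queries come from the text stream and keys/values come from image patches. Flamingo's tanh gating and perceiver resampler are omitted from this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32                                     # shared embedding width (toy size)
text_tokens = rng.normal(size=(5, D))      # 5 text tokens being generated
image_tokens = rng.normal(size=(196, D))   # 196 patch embeddings from the frozen vision encoder

Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.05 for _ in range(3))
Q = text_tokens @ Wq                       # queries come from the language stream
K = image_tokens @ Wk                      # keys come from the image
V = image_tokens @ Wv                      # values come from the image

scores = Q @ K.T / np.sqrt(D)              # (5, 196): each text token scores every patch
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

attended = weights @ V                     # (5, 32): visual information mixed into the text stream
print(attended.shape)
```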

Few-shot learning:

Flamingo can learn new tasks from a few examples in context. Provide 2-3 (image, question, answer) examples, then ask a new question about a new image. The model generalizes from the in-context examples—similar to GPT-3’s few-shot learning, but for vision.

Why this matters:

Flamingo shows that multimodal models can reason visually using language as scaffolding. Language guides attention: “What color is the car?” directs the model to look at the car region and extract color information. Perception and language work together.


Perception + Language: How World Models Form

Multimodal models begin to form world models—internal representations of objects, scenes, and their relationships. These models are not explicit 3D simulations, but statistical associations learned from data.

Object Recognition and Localization

A model trained on images and captions learns to associate words with visual regions:

  • “Dog” → furry four-legged object
  • “Car” → rectangular object with wheels
  • “Tree” → green vertical structure with branches

GPT-4V (GPT-4 with vision) can describe images: “A golden retriever sitting on grass in a park.” It identifies objects (dog, grass), attributes (golden, sitting), and context (park). This requires recognizing objects and understanding their relationships.

Spatial Understanding (Limited)

Current multimodal models struggle with spatial reasoning:

  • Counting: “How many apples are in the image?” Often inaccurate
  • Depth perception: Cannot reliably estimate distance between objects
  • 3D structure: Struggle with occluded objects, viewpoint changes
  • Physical reasoning: Do not understand gravity, support, balance

Example: Show an image of a stack of blocks. Ask: “If I remove the middle block, what happens?” Humans know the top blocks fall. Models struggle—they lack physics understanding.

Temporal Understanding (Very Limited)

Video models process sequences of frames, but temporal reasoning remains weak:

  • Event detection: Models detect “a person runs” but struggle with “a person starts running, then stops”
  • Long-term dynamics: Cannot track objects across many frames reliably
  • Causality: Do not understand which events cause which

Why World Models Are Still Weak

Multimodal models learn correlations from data, not causal mechanisms. They see millions of images of dogs, learn that “dog” correlates with certain pixel patterns, but do not understand what a dog is—an animal with biology, behavior, needs. Language grounds perception statistically, but not conceptually.

What’s missing:

  • Physical interaction: Models observe images/videos but do not interact with objects
  • Embodiment: No body, no sensors, no motor control
  • Long-term memory: No persistent memory across conversations/sessions
  • Causal models: Learn P(Y | X) but not “X causes Y”

Multimodal models are progress toward world models, but far from human-like understanding.


Limitations: Why Sensory Understanding Is Still Weak

Despite impressive capabilities, multimodal models have fundamental limitations:

Data Efficiency Gap

CLIP trained on 400 million image-text pairs. Humans learn object recognition from dozens of examples. A child sees “dog” 10-20 times and generalizes to all dogs. Models need millions of examples for comparable generalization. This inefficiency suggests models are not learning the same way humans do—they memorize statistical patterns, not concepts.

Fragile Generalization

Multimodal models generalize within their training distribution but fail on out-of-distribution inputs:

  • CLIP trained mostly on photos → struggles with sketches, paintings, abstract art
  • GPT-4V trained on natural images → struggles with medical scans, satellite imagery
  • Whisper trained on speech → struggles with music, environmental sounds, accents far from training data

Humans transfer knowledge across domains easily. Models do not.

Lack of Common Sense

Show an image of a person holding an umbrella indoors on a sunny day. Ask: “Is this unusual?” Humans immediately recognize the inconsistency. Models struggle—they lack common sense about when umbrellas are used, what “indoors” means contextually, and what makes a situation unusual.

Inference Costs

Processing images is 10-100x more expensive than processing text.

  • Text token: ~1 embedding lookup, ~1K FLOPs per layer
  • Image patch token: similar per-token cost, but a single image contributes 196-256 tokens at once

Analyzing a single image costs as much as processing 200-500 words of text. For video (30 frames/second), costs explode: 10 seconds of video = 300 frames ≈ 60,000 tokens, the equivalent of tens of thousands of words.
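The arithmetic is plain token counting, using the 16-pixel patch size and 224×224 resolution from earlier in the chapter:

```python
# Back-of-envelope token counts behind these cost figures.
patch, side = 16, 224
image_tokens = (side // patch) ** 2        # 196 tokens per 224x224 image

fps, seconds = 30, 10
frames = fps * seconds                     # 300 frames in 10 s of video
video_tokens = frames * image_tokens       # 58,800 tokens (~60K)
print(image_tokens, frames, video_tokens)
```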

Practical implications:

GPT-4V is expensive to run. Analyzing a 10-image document costs 10x a text-only query. Applications must balance functionality and cost: where is vision worth 10-100x more compute?

Alignment Is Harder

Multimodal models inherit text model alignment challenges (hallucinations, bias) and add new ones:

  • Visual hallucinations: Describing objects not in the image
  • Misidentification: Confusing visually similar objects
  • Cultural bias: Models trained on Western images struggle with non-Western contexts

Aligning multimodal models requires human feedback on vision tasks—more expensive than text-only feedback (humans must review images, not just text).


Engineering Takeaway

Token abstraction enables unification—same architecture handles all modalities

Tokenizing images, audio, and text into sequences allows a single Transformer to process all modalities. This engineering win simplifies model architecture: no need for separate vision networks, audio networks, language networks. One architecture, multiple modalities. This unification accelerates research and deployment—improvements to Transformers benefit all modalities simultaneously.

Contrastive learning scales—weak supervision beats careful labeling

CLIP trained on 400 million (image, text) pairs scraped from the internet. No manual labeling, no curated datasets—just whatever images and text co-occur on the web. Weak supervision at scale beats careful labeling at small scale. This lesson generalizes: use massive noisy data, not small clean data. Scale compensates for noise.

Grounding improves generalization—multimodal models transfer better than text-only

Models that learn from both text and vision develop richer representations. Language grounds vision (semantic concepts), vision grounds language (perceptual meaning). CLIP’s text encoder outperforms text-only models on certain NLP tasks because it has visual grounding. Multimodal training improves all modalities, not just the multimodal tasks.

Inference costs multiply—images are 10-100x more expensive than text

Analyzing one image costs as much as processing 200-500 words. Video is worse: 10 seconds ≈ 60,000 tokens, the equivalent of tens of thousands of words. Applications must justify the cost. When is vision worth 10x more compute? Document understanding (analyzing receipts, forms), visual QA (customer support with screenshots), image generation. For text-only tasks, adding vision is wasteful.

Data alignment is the bottleneck—paired multimodal data is scarcer than text alone

Text-only data is abundant: web pages, books, articles, trillions of tokens. Image-text pairs are scarcer: need images with captions or alt-text. High-quality pairs (descriptive captions, not just “image.jpg”) are rarer still. Video-text alignment is even scarcer. Collecting and cleaning paired multimodal data is a major engineering challenge. This limits how far multimodal models can scale with current methods.

Applications become richer—document understanding, visual assistants, video analysis

Multimodal models enable new applications:

  • Document understanding: Analyze receipts, invoices, forms with mixed text and images
  • Visual assistants: Answer questions about screenshots, charts, diagrams
  • Medical imaging: Describe X-rays, MRIs, pathology slides
  • Accessibility: Generate captions for images, describe scenes for visually impaired users
  • Creative tools: DALL-E (text → image), video editing with language commands

The future of AI applications is multimodal: not text-only chatbots, but assistants that see, hear, and speak.

Gaps remain large—no true spatial reasoning, physics understanding, embodied grounding

Despite progress, multimodal models lack fundamental capabilities:

  • Cannot reliably count objects, estimate depths, understand 3D structure
  • Do not understand physics: gravity, support, collision, causality
  • Lack embodied grounding: never interact with physical world, never use a body to learn sensorimotor associations
  • Temporal reasoning weak: struggle with long-term dynamics, event causality

These gaps mean multimodal models are powerful pattern recognizers, not true world modelers. They excel at classification, description, retrieval—but fail at reasoning, planning, physical understanding.



References and Further Reading

Learning Transferable Visual Models From Natural Language Supervision (CLIP) - Radford et al. (2021), OpenAI

Why it matters: CLIP revolutionized computer vision by showing that language can supervise vision at scale. Instead of training on carefully labeled datasets like ImageNet (1M images, 1000 classes), CLIP trained on 400 million (image, text) pairs scraped from the internet. No manual labeling—just whatever images and captions co-occur on the web. Contrastive learning aligned vision and text encoders in a shared embedding space, enabling zero-shot transfer: CLIP classifies images into categories it never saw during training, guided only by text descriptions. This approach generalized better than supervised learning and enabled applications like text-to-image generation (DALL-E uses CLIP), visual question answering, and image search. CLIP changed computer vision from task-specific models to general-purpose multimodal models.

Flamingo: A Visual Language Model for Few-Shot Learning - Alayrac et al. (2022), DeepMind

Why it matters: Flamingo demonstrated that multimodal models can perform few-shot visual reasoning—learning new tasks from a handful of in-context examples, like GPT-3 for vision. Given 2-3 examples of (image, question, answer) tuples, Flamingo answers questions about new images. The key innovation: cross-attention layers connecting a frozen vision encoder and a language model. The language model attends to visual features when generating text, enabling perception-guided language generation. Flamingo showed that multimodal models are not just better at classification—they can reason, generalize, and solve novel tasks with minimal examples. This set the stage for GPT-4V and other vision-language models that combine perception and reasoning.

Robust Speech Recognition via Large-Scale Weak Supervision (Whisper) - Radford et al. (2022), OpenAI

Why it matters: Whisper trained on 680,000 hours of audio-text pairs scraped from the web, covering 97 languages. No careful curation—just massive scale and weak supervision. Result: Whisper transcribes speech more robustly than models trained on carefully labeled datasets. It handles accents, background noise, code-switching (mixing languages), and domain shifts (podcasts, phone calls, lectures) better than supervised models. The lesson: weak supervision at scale beats strong supervision at small scale. Whisper also demonstrated that multimodal architectures (audio encoder → text decoder) transfer well across languages and domains. It became the de facto standard for speech recognition, powering accessibility tools, transcription services, and voice interfaces. Whisper showed that multimodal learning is not just for vision—audio-text alignment follows the same principles.