Multimodality
Human perception is fundamentally multimodal. You read a chart while hearing someone explain it. You watch a video and simultaneously process the speaker's words, facial expressions, and the visual content on screen. For decades, most AI systems were unimodal: a vision model processed images, a speech model processed audio, a language model processed text, and each operated in isolation. Frontier AI has broken down these walls. Today's leading models accept some combination of text, images, audio, and video as input, and many can also generate across multiple modalities and reason about the relationships between them. This capability, multimodality, is not a cosmetic add-on. It changes what AI can perceive, understand, and do.
What Multimodal Models Can Do
The range of multimodal capabilities in frontier systems is now extensive. Vision-language models can describe photographs in natural language, read and interpret text in images (including handwriting), analyze charts and diagrams, identify objects in complex scenes, and answer questions about visual content — 'What emotion does the person on the left appear to be expressing?' or 'Does this X-ray show any abnormality?' Audio-language capabilities include speech transcription, speaker identification, translation of spoken language, and increasingly, generation of natural-sounding speech. Video understanding extends this to temporal sequences: models can summarize a lecture video, identify the moment in a video when a specific event occurs, or answer questions about what happened in a scene. Multimodal generation — creating content in one modality based on instructions in another — adds further capability: generating images from text descriptions (DALL-E 3, Midjourney, Stable Diffusion), generating audio from text (text-to-speech, music generation), and generating video from text or images (Sora, Runway). Some frontier models both perceive and generate across modalities in a unified system, enabling workflows like 'here is a rough sketch — redraw it as a polished vector illustration in this style.'
Think of each modality — text, image, audio, video — as a distinct information channel, each carrying data that the other channels cannot fully express. A photograph of a damaged building carries spatial, textural, and structural information that a text description would take thousands of words to approximate. Multimodal models can access all these channels simultaneously, giving them a richer picture of the world than any unimodal system.
How Multimodal Architectures Work
The core engineering challenge in multimodality is representation: how do you get a model that speaks 'text' to also understand 'image'? The dominant approach is to encode each modality into vectors in a shared embedding space (a high-dimensional vector representation) and then process these embeddings together in a unified transformer. For images, this typically involves a vision encoder (often a Vision Transformer, or ViT) that divides an image into fixed-size patches and encodes each patch as a vector. These patch embeddings are passed to the language model alongside the text token embeddings, usually after a projection layer maps them to the dimension the language model expects. From the transformer's perspective, an image becomes a sequence of visual tokens, interleaved with text tokens. The same self-attention mechanism that relates words to each other can now relate words to image patches, enabling the model to ground language in vision.
For audio, a similar approach applies: an audio encoder (Whisper-style, for example) converts audio spectrograms into embeddings that are fed into the language model. For video, temporal information is typically handled by sampling frames and encoding them in order, so that the resulting sequence of visual tokens preserves the order of events.
The training data is multimodal: image-caption pairs, interleaved text-and-image documents, video transcriptions, and instruction-following datasets where the task involves multiple modalities. Through this training, the model learns that certain visual patterns correspond to certain textual concepts, not through explicit programming but through the statistical structure of billions of aligned multimodal examples.
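The patch-and-interleave idea can be made concrete with a small sketch. This is an illustration under stated assumptions, not how any particular model is implemented: the function names (`patchify`, `project`, `num_patches`), the tiny 32x32 image, the 8-dimensional embedding size, and the random weights are all hypothetical, and a single random linear projection stands in for a trained vision encoder.

```python
import random


def num_patches(height, width, patch):
    """Number of non-overlapping patch x patch tiles in a height x width image."""
    return (height // patch) * (width // patch)


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch vectors.

    Patches are returned in row-major order: left to right, then top to bottom.
    """
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            flat = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    flat.extend(image[y][x])  # append this pixel's C channel values
            patches.append(flat)
    return patches


def project(vectors, weights):
    """Multiply each length-n vector by an n x d weight matrix (a linear projection)."""
    d = len(weights[0])
    return [
        [sum(v * row[j] for v, row in zip(vec, weights)) for j in range(d)]
        for vec in vectors
    ]


rng = random.Random(0)
H = W = 32       # toy image size (hypothetical)
C = 3            # RGB channels
PATCH = 16       # patch side length
D_MODEL = 8      # toy embedding dimension (hypothetical)

# A random "image" and random projection weights stand in for real data and a trained encoder.
image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
weights = [[rng.gauss(0, 0.02) for _ in range(D_MODEL)] for _ in range(PATCH * PATCH * C)]

patches = patchify(image, PATCH)           # 4 patches of 16 * 16 * 3 = 768 values each
visual_tokens = project(patches, weights)  # 4 visual tokens of dimension D_MODEL

# Random vectors stand in for text token embeddings before and after the image.
text_before = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(3)]
text_after = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(2)]

# The transformer would see one sequence: text tokens, visual tokens, text tokens.
sequence = text_before + visual_tokens + text_after
print(len(patches), len(patches[0]), len(sequence))
```

With these toy sizes the image contributes 4 visual tokens to a 9-token sequence. The same arithmetic shows why images are expensive in context: `num_patches(224, 224, 16)` gives 196, so one 224x224 image split into 16x16 patches already occupies as many positions as a long paragraph of text.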
Multimodal capability opens important applications in medicine (reading radiology images alongside patient records), accessibility (describing visual content for blind users, captioning for deaf users), education (analyzing a student's handwritten work), science (parsing figures and tables in research papers automatically), and creative work (generating and iterating on visual concepts through dialogue). But it also introduces new risks: deepfakes become easier to create, verification of visual evidence becomes harder, and models may inherit biases about how different kinds of people are visually represented in training data — biases that manifest in image generation outputs.
Vision-language models can hallucinate visual details — describing objects that are not in an image, misreading text in photos, or confidently assigning incorrect attributes to visual content. This is distinct from text hallucination but equally dangerous in high-stakes settings. Always verify AI-generated visual descriptions against the original source for important decisions.
A Vision Transformer (ViT) encodes an image for a multimodal language model. What is the primary mechanism by which this works?
A student argues: 'Multimodal AI models are just regular language models with an image converter bolted on — fundamentally they are still language systems.' How accurate is this characterization?
Multimodal Capability Audit
- Using a publicly available multimodal AI system (GPT-4o, Claude, Gemini, or similar), conduct a structured capability audit.
- Test 1 — Image description: Upload a moderately complex image (a busy photograph, a diagram, or a hand-drawn sketch). Ask the model to describe what it sees in detail. Note accuracy and any hallucinated elements.
- Test 2 — Visual reasoning: Find an image containing a chart or table with numbers. Ask the model a question that requires interpreting the data — 'Which category had the highest growth?' Compare its answer to what you can verify by eye.
- Test 3 — Cross-modal instruction: Give the model a written description of a scene and ask it to describe what an image of that scene would look like in detail. Then (if available) generate the image and compare the description to the result.
- Test 4 — Limits: Design a test you expect the model to fail — visual content it should not be able to interpret correctly. Document the failure.
- Write a one-page report: what can this system perceive accurately, where does it hallucinate, and what does that suggest about the gap between capability and reliability?
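One way to keep the audit structured is to record, for each test, what the model claimed and what you verified yourself, then compute the difference. The sketch below is only a suggestion: the `AuditResult` class, its field names, and the sample entries are hypothetical placeholders, not results from any real system.

```python
from dataclasses import dataclass


@dataclass
class AuditResult:
    test: str
    claimed: set    # elements the model said were present
    verified: set   # elements you confirmed by inspecting the source yourself

    @property
    def hallucinated(self):
        """Claims the model made that you could not confirm in the source."""
        return self.claimed - self.verified

    @property
    def missed(self):
        """Things present in the source that the model did not mention."""
        return self.verified - self.claimed

    @property
    def precision(self):
        """Share of the model's claims that were correct (1.0 if it made no claims)."""
        if not self.claimed:
            return 1.0
        return len(self.claimed & self.verified) / len(self.claimed)


# Hypothetical entries for illustration only; replace with your own observations.
results = [
    AuditResult(
        test="image description",
        claimed={"dog", "bench", "bicycle", "umbrella"},
        verified={"dog", "bench", "bicycle", "tree"},
    ),
    AuditResult(
        test="visual reasoning",
        claimed={"Q3 had the highest growth"},
        verified={"Q3 had the highest growth"},
    ),
]

for r in results:
    print(
        f"{r.test}: precision={r.precision:.2f} "
        f"hallucinated={sorted(r.hallucinated)} missed={sorted(r.missed)}"
    )
```

Separating hallucinated elements from missed ones is a deliberate choice: a model that invents details and a model that overlooks them fail in different ways, and the report in the final step reads more clearly when the two are counted apart.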