Multimodal AI¶
✨ Bit: We see, hear, read, and speak in multiple modalities simultaneously. Multimodal AI does the same — one model that understands text, images, audio, and video together. It's the closest AI has come to how humans actually perceive the world.
★ TL;DR¶
- What: AI systems that process and generate across multiple data types (text, image, audio, video) in a unified model
- Why: The real world is multimodal. Text-only AI is limited. Multimodal = richer understanding + more useful outputs
- Key point: All frontier models (GPT-5, Gemini 3, Claude 4, LLaMA 4) are now natively multimodal. This is the default, not the exception.
★ Overview¶
Definition¶
Multimodal AI refers to models that can understand, combine, and generate content across multiple modalities — text, images, audio, video, code, and structured data — within a single system, rather than requiring separate specialized models for each.
Scope¶
Covers: Vision-language models, text-to-video, text-to-audio, and cross-modal understanding. For image generation specifically, see Diffusion Models. For the visual-understanding side of multimodal work, see Computer Vision Fundamentals for AI Builders. For the text-only LLM perspective, see Llms Overview.
Last verified for frontier-model and product examples in this note: 2026-04.
Significance¶
- Every frontier model released since 2024 is multimodal
- Text-to-video market projected at $18.6B by end of 2026
- Enables: visual understanding, document analysis, video generation, voice interaction
- LLaMA 4 = Meta's first natively multimodal LLaMA (not bolted on)
Prerequisites¶
- Llms Overview — the text foundation
- Transformers — attention across modalities
★ Deep Dive¶
The Multimodal Spectrum¶
LEVEL 1: Multi-input (understand)
Text + Image → Answer
"What's in this photo?" → "A cat sitting on a laptop"
Models: GPT-5.4, Gemini 3.1 Pro, Claude Opus 4.6 (vision)
LEVEL 2: Multi-output (generate)
Text → Image + Audio + Video
"Create a video of a sunset" → [video file]
Models: Sora, Veo, DALL-E
LEVEL 3: Omni (understand + generate across all)
Text ↔ Image ↔ Audio ↔ Video (any direction)
"Describe this image, then create a video extending it with music"
Models: GPT-5.4 (omni), Gemini 3.1 Pro (emerging)
LEVEL 4: Real-time interactive (emerging 2026)
Live camera/audio + AI reasoning + real-time response
AR glasses, live video chat with AI
Status: Early experiments
How Multimodal Models Work¶
MULTIMODAL ARCHITECTURE

Image ──► [Vision Encoder] ──► Image tokens ──┐
Text  ──► [Tokenizer]      ──► Text tokens  ──┤
Audio ──► [Audio Encoder]  ──► Audio tokens ──┼──► [LLM Backbone]
Video ──► [Frame Encoder]  ──► Video tokens ──┘

Output: Text / Image tokens / Audio tokens ──► [Decoder for each modality]

KEY IDEA: Convert everything into tokens, process with shared Transformer backbone, decode to target modality.
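A minimal sketch of that key idea in PyTorch, assuming illustrative module names and toy dimensions rather than any real model's architecture: each modality is projected into the same embedding width, concatenated into one token sequence, and run through a shared Transformer.

# Illustrative sketch only: toy dimensions, not a real model's architecture.
import torch
import torch.nn as nn

class TinyMultimodalBackbone(nn.Module):
    def __init__(self, vocab_size=32000, patch_dim=768, audio_dim=128, d_model=512):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)    # text token IDs -> embeddings
        self.vision_proj = nn.Linear(patch_dim, d_model)       # image patch features -> "visual tokens"
        self.audio_proj = nn.Linear(audio_dim, d_model)        # audio frame features -> "audio tokens"
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)   # shared backbone
        self.lm_head = nn.Linear(d_model, vocab_size)          # decode back to text (one possible output head)

    def forward(self, text_ids, image_patches, audio_frames):
        # Every modality becomes tokens in the same d_model space, concatenated into ONE sequence.
        tokens = torch.cat([
            self.vision_proj(image_patches),   # (B, n_img, d_model)
            self.audio_proj(audio_frames),     # (B, n_aud, d_model)
            self.text_embed(text_ids),         # (B, n_txt, d_model)
        ], dim=1)
        hidden = self.backbone(tokens)         # joint attention across all modalities
        return self.lm_head(hidden[:, -1])     # e.g. predict the next text token

# Toy usage: 5 text tokens, 4 image patches, 6 audio frames
model = TinyMultimodalBackbone()
logits = model(torch.randint(0, 32000, (1, 5)), torch.randn(1, 4, 768), torch.randn(1, 6, 128))
print(logits.shape)  # torch.Size([1, 32000])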
Vision-Language Models (VLMs)¶
The most mature multimodal capability — models that understand images:
| Model | Vision Capabilities | Context / Notes |
|---|---|---|
| GPT-5 | Image understanding, chart analysis, OCR, visual reasoning | Integrated |
| Gemini 3.1 Pro | Native multimodal, visual reasoning, document understanding | 1M+ tokens |
| Claude Opus 4.6 / Sonnet 4.6 | Image analysis, chart reading, code from screenshots | 1M tokens |
| LLaMA 4 Scout | First natively multimodal LLaMA, image understanding | 10M tokens |
| Gemma 3 | Lightweight VLM, efficient image processing | Open-weight |
Common use cases:
- Document/receipt understanding (OCR + reasoning)
- Chart and graph analysis
- UI screenshot → code generation
- Medical image analysis
- Visual question answering
Text-to-Video (The 2025-2026 Frontier)¶
| Model | Company | Key Feature | Status |
|---|---|---|---|
| Sora 2 | OpenAI | Enhanced realism, synchronized dialogue, iOS app | Released Sep 2025 |
| Veo 3.1 | Google | 4K output, native audio, 3 reference images for direction | Available on Vertex AI |
| Runway Gen-3 Alpha | Runway | Creator-focused, controllability | Production |
| Kling | Kuaishou | Strong motion, Chinese market leader | Available |
| Pika 2.0 | Pika Labs | Style transfer, effects | Consumer-focused |
What changed in 2025-2026:
- Video generation went from "toy demos" to "legitimate production tool"
- Sora integration into ChatGPT = mainstream access
- Veo 3.1 supports 4K + native audio generation
- Still improving: physics accuracy, human motion, facial consistency
Text-to-Audio & Music¶
| Model | Type | Key Feature |
|---|---|---|
| ElevenLabs | Text-to-Speech | Most natural voice cloning |
| Suno | Text-to-Music | Full song generation from text |
| Udio | Text-to-Music | High-quality music, various genres |
| Bark | Text-to-Speech | Open-source, multilingual (usage sketch after this table) |
| MusicLM / MusicFX | Text-to-Music | Google's music generation |
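For the open-source option above (Bark), a minimal text-to-speech sketch via the Hugging Face transformers wrappers. Assumptions to verify against the current model card: the `suno/bark-small` checkpoint, the optional `voice_preset` speaker prompt, and `scipy` for writing the WAV file.

# pip install transformers torch scipy
# Minimal Bark TTS sketch -- verify details against the current suno/bark model card.
import scipy.io.wavfile
from transformers import AutoProcessor, BarkModel

processor = AutoProcessor.from_pretrained("suno/bark-small")   # "suno/bark" is the larger checkpoint
model = BarkModel.from_pretrained("suno/bark-small")

inputs = processor(
    "Multimodal models turn text, images, and audio into one token stream.",
    voice_preset="v2/en_speaker_6",          # optional speaker prompt shipped with Bark
)
audio = model.generate(**inputs).cpu().numpy().squeeze()
rate = model.generation_config.sample_rate   # Bark generates 24 kHz audio
scipy.io.wavfile.write("bark_out.wav", rate=rate, data=audio)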
The CLIP Model (Foundational for Multimodal)¶
CLIP (Contrastive Language-Image Pre-training):
"A photo of a cat" ──► [Text Encoder] ──► Text embedding
↕ (should be close)
[actual cat photo] ──► [Image Encoder] ──► Image embedding
Trained on 400M image-text pairs from the internet.
Result: Text and images in the SAME vector space.
This enables:
- Zero-shot image classification ("Is this a cat or dog?")
- Image search by text description
- Text search by image
- Foundation for Stable Diffusion's text understanding
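Because text and images share one embedding space, zero-shot classification reduces to "which caption embedding is closest to this image embedding?". A minimal sketch with the Hugging Face CLIP wrappers and the public openai/clip-vit-base-patch32 checkpoint; "cat.jpg" is a placeholder path.

# pip install transformers torch pillow
# Zero-shot classification: score one image against candidate captions in CLIP's shared space.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")   # placeholder path
labels = ["a photo of a cat", "a photo of a dog", "a photo of a laptop"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_per_image.softmax(dim=-1).squeeze()   # image-to-text similarity -> probabilities
for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")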
◆ Types & Classifications¶
| Type | Input → Output | Example Models | Key Challenge |
|---|---|---|---|
| Image Understanding | Image+Text → Text | GPT-5, Gemini 3, Claude 4 | Fine-grained visual reasoning |
| Image Generation | Text → Image | DALL-E 3, SD, Midjourney | Prompt adherence, consistency |
| Video Generation | Text/Image → Video | Sora 2, Veo 3.1 | Physics, temporal consistency |
| Voice/TTS | Text → Speech | ElevenLabs, Bark | Natural prosody, emotion |
| Music Generation | Text → Music | Suno, Udio | Musical structure, lyrics |
| Document AI | Document → Structured Data | GPT-5, Gemini (document mode) | Table extraction, layout |
| Omni | Any → Any | GPT-5 (omni mode) | Maintaining quality across all |
◆ Strengths vs Limitations¶
| ✅ Strengths | ❌ Limitations |
|---|---|
| More natural interaction (like humans do) | Much more compute-intensive than text-only |
| Richer understanding (see + read + hear) | Video generation still has artifacts |
| New creative possibilities | Copyright/deepfake concerns |
| Enables visual reasoning, document understanding | Hallucination extends to visual modalities |
| Single model for multiple tasks | Prompt engineering is harder across modalities |
◆ Quick Reference¶
WHAT'S MATURE (production-ready):
✅ Image understanding (VLMs) — all frontier models
✅ Image generation — Stable Diffusion, DALL-E, Midjourney
✅ Text-to-speech — ElevenLabs
✅ Document AI — Gemini, GPT-5
WHAT'S EMERGING (usable but imperfect):
⚠️ Text-to-video — Sora 2, Veo 3.1 (impressive, but artifacts remain)
⚠️ Music generation — Suno, Udio
⚠️ Real-time multimodal — voice+video chat
WHAT'S EARLY (research/demos):
🔬 3D generation from text
🔬 Real-time interactive video
🔬 Full omni models (any-to-any)
○ Gotchas & Common Mistakes¶
- ⚠️ "Multimodal" doesn't mean "good at everything": A model great at text+image may be mediocre at audio. Check per-modality benchmarks.
- ⚠️ Token cost: Images = many tokens. A single image can consume 1K-5K tokens of context. Video = massively more. (A rough estimator is sketched after this list.)
- ⚠️ Deepfake risk: Realistic video/audio generation = major potential for misuse. Always consider ethical implications.
- ⚠️ Temporal consistency: Video models still struggle with consistent faces, physics, and object permanence across frames.
- ⚠️ Not magic: "Make me a professional 30-second ad" is still too complex for single-prompt generation.
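A rough estimator for the image-token gotcha above, using OpenAI's published high-detail formula for GPT-4-class vision input (85 base tokens plus 170 per 512x512 tile after resizing). Treat the constants as assumptions to re-verify: newer models and other providers count image tokens differently.

# Rough image-token estimator (OpenAI high-detail formula for GPT-4-class vision models).
# Constants may not apply to newer models or other providers -- check current pricing docs.
import math

def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    if detail == "low":
        return 85                                   # low detail: flat cost regardless of size
    # High detail: fit within 2048x2048, scale the shortest side down to 768, count 512px tiles.
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    if min(width, height) > 768:
        scale = 768 / min(width, height)
        width, height = width * scale, height * scale
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

print(estimate_image_tokens(1920, 1080))   # full-HD screenshot -> about 1.1K tokens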
○ Interview Angles¶
- Q: How do multimodal models process images?
  - A: Images are encoded by a vision encoder (like ViT) into a sequence of "visual tokens," similar to text tokens. These are concatenated with text tokens and processed by the same Transformer backbone, so attention over the combined sequence (self-attention, or cross-attention in some architectures) lets the model reason about text and visual information together.
- Q: Why is native multimodality important vs bolting vision onto a text model?
  - A: Bolted-on vision (early GPT-4V approach) processes modalities separately and aligns them afterward, which creates alignment artifacts. Native multimodality (Gemini, LLaMA 4) trains on all modalities from the start, creating deeper cross-modal understanding and more natural integration.
★ Code & Implementation¶
Vision + Text with GPT-4o (Image Analysis)¶
# pip install openai>=1.60 Pillow>=10
# ⚠️ Last tested: 2026-04 | Requires: openai>=1.60, OPENAI_API_KEY env var
import base64
from pathlib import Path
from openai import OpenAI
client = OpenAI()
def analyze_image(image_path: str, question: str = "Describe this image in detail.") -> str:
    """Send an image + question to GPT-4o vision."""
    img_bytes = Path(image_path).read_bytes()
    b64_image = base64.b64encode(img_bytes).decode("utf-8")
    ext = Path(image_path).suffix.lstrip(".").lower()
    if ext == "jpg":
        ext = "jpeg"  # "image/jpg" is not a valid MIME type
    media_type = f"image/{ext}" if ext in ("jpeg", "png", "gif", "webp") else "image/jpeg"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:{media_type};base64,{b64_image}", "detail": "high"}},
                {"type": "text", "text": question},
            ],
        }],
        max_tokens=500,
    )
    return response.choices[0].message.content

# URL-based (no local file needed for testing)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/2/21/Simple_English_Wikipedia_favicon.svg/240px-Simple_English_Wikipedia_favicon.svg.png"}},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }],
    max_tokens=100,
)
print(response.choices[0].message.content)
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | Transformers, Llms Overview, Diffusion Models |
| Leads to | AR/VR AI, Robotics (visual+language understanding), Video AI |
| Compare with | Single-modality models (text-only, image-only) |
| Cross-domain | Computer vision, Audio signal processing, HCI |
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Modality dominance | Model relies heavily on text, ignores image/audio inputs | Unbalanced multimodal training, text bias | Ablation testing per modality, balanced training data |
| OCR/vision hallucination | Model reads text in images that doesn't exist | Visual encoder hallucination | Verification pipeline (sketch below), confidence thresholds, multi-model consensus |
| Audio transcription errors | Speech-to-text fails on accents, noise, domain jargon | Insufficient acoustic diversity in training | Domain-specific fine-tuning, preprocessing (noise reduction) |
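One cheap form of the "verification pipeline / multi-model consensus" mitigation from the OCR row: extract the same field with two different vision models and only auto-accept when the answers agree. The model names are examples and the string comparison is deliberately naive; in practice, normalize numbers and dates before comparing.

# Two-model consensus check for OCR-style extraction from an image URL.
# Model names are examples; swap in whichever vision-capable models you use.
from openai import OpenAI

client = OpenAI()

def extract_field(model: str, image_url: str, field: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text",
                 "text": f"Read the {field} from this document. Reply with the value only, or NONE if absent."},
            ],
        }],
        max_tokens=30,
    )
    return resp.choices[0].message.content.strip()

def extract_with_consensus(image_url: str, field: str) -> dict:
    a = extract_field("gpt-4o", image_url, field)
    b = extract_field("gpt-4o-mini", image_url, field)
    agreed = a.lower() == b.lower()   # naive check; normalize formats before comparing in practice
    return {"value": a if agreed else None, "candidates": [a, b], "needs_review": not agreed}

# Example: extract_with_consensus("https://example.com/receipt.png", "total amount")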
◆ Hands-On Exercises¶
Exercise 1: Build a Multimodal Document Analyzer¶
Goal: Process documents with text + images using a multimodal LLM
Time: 30 minutes
Steps:
1. Prepare 5 documents with text, charts, and tables
2. Send each to a multimodal model (GPT-4o or Gemini)
3. Ask structured questions about both text and visual content (starter loop below)
4. Grade accuracy on text-based vs image-based questions
Expected Output: Accuracy comparison: text questions vs visual questions
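A starter loop for steps 2-3, reusing the analyze_image helper from the Code & Implementation section; file names and questions are placeholders, and grading in step 4 stays manual.

# Starter loop for the exercise, reusing analyze_image() from the code section above.
# Document file names and questions are placeholders.
docs = ["report_with_chart.png", "invoice.jpg", "table_scan.png"]
questions = {
    "text": "Summarize the main point of the written text in one sentence.",
    "visual": "What value does the largest bar or table cell show?",
}

results = []
for path in docs:
    for kind, question in questions.items():
        answer = analyze_image(path, question)
        results.append({"doc": path, "question_type": kind, "answer": answer})
        print(f"{path} [{kind}]: {answer[:80]}")

# Grade each answer by hand (correct / incorrect), then compare accuracy for
# question_type == "text" vs question_type == "visual".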
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 📄 Paper | OpenAI "GPT-4V System Card" (2023) | How multimodal capabilities are evaluated and deployed |
| 📄 Paper | Radford et al. "CLIP" (2021) | Foundational vision-language alignment paper |
| 📘 Book | "AI Engineering" by Chip Huyen (2025), Ch 3 | Multimodal model capabilities and application patterns |
| 🔧 Hands-on | Google Gemini API Docs | Production multimodal API with vision, audio, and video |
★ Sources¶
- OpenAI Sora documentation (2024-2025)
- Google Veo release notes (2025-2026)
- Radford et al., "Learning Transferable Visual Models From Natural Language Supervision" (CLIP, 2021)
- Meta LLaMA 4 multimodal announcement (April 2025)
- Anthropic Claude 4 vision documentation