Computer Vision Fundamentals for AI Builders¶
✨ Bit: You don't need to be a vision PhD to build multimodal AI — but you do need to understand why a ViT-B/16 processes an image as a sequence of 16×16-pixel patches, why CLIP can search images with text, and why doubling the input resolution roughly quadruples compute cost. This note gives you that working knowledge.
★ TL;DR¶
- What: The core concepts behind image understanding — CNNs, Vision Transformers, CLIP, detection, segmentation — that every GenAI builder needs
- Why: Modern AI is multimodal. GPT-4o, Gemini, and Claude all process images. Understanding how pixels become representations is now a core GenAI skill.
- Key point: Vision models turn pixel arrays into semantic embeddings. Those embeddings are then used for classification, retrieval, generation, or as input to language models.
★ Overview¶
Definition¶
Computer vision (CV) is the field of enabling machines to interpret and reason about visual data — images, video, documents, and 3D scenes.
Scope¶
Covers: Core CV tasks, how images become representations, CNN vs ViT architectures, CLIP and contrastive learning, practical code for image classification and similarity search. For text-to-image generation, see Diffusion Models. For the broader multimodal landscape, see Multimodal AI.
Significance¶
- Multimodal is the default: GPT-4o, Gemini 2.5, Claude 3.5 — all flagship models process images natively. Vision is no longer a separate field.
- Production use cases: Document understanding, visual search, screenshot-aware assistants, product image analysis, medical imaging, autonomous systems
- Interview relevance: System design interviews increasingly include visual components ("design an image search system", "how does a multimodal model process screenshots?")
Prerequisites¶
- Multimodal AI — the bigger picture
- Embeddings — vector representations
- Modern Architectures — transformer fundamentals
★ Deep Dive¶
Core Vision Tasks¶
INPUT: Image (H × W × C pixel array)
│
▼
┌──────────────────────────────────────────────────────────┐
│ VISION TASKS │
├──────────────────────────────────────────────────────────┤
│ │
│ CLASSIFICATION "This is a cat" │
│ Image → single label │
│ │
│ DETECTION "Cat at (x1,y1,x2,y2)" │
│ Image → bounding boxes + labels │
│ │
│ SEGMENTATION "These pixels are cat" │
│ Image → per-pixel labels │
│ │
│ OCR / DOCUMENT AI "Invoice #1234, Total: $500" │
│ Image → structured text extraction │
│ │
│ IMAGE-TEXT MATCHING "How similar is this image │
│ (CLIP, SigLIP) to 'a sunset over mountains'?" │
│ │
│ VISUAL QA "How many people are in │
│ (VLMs) this photo?" → "Three" │
│ │
└──────────────────────────────────────────────────────────┘
| Task | Input | Output | Key Models (2026) |
|---|---|---|---|
| Classification | Image | Label(s) + confidence | ViT, EfficientNet, ConvNeXt |
| Object Detection | Image | Bounding boxes + labels | YOLO v9/v10, DETR, RT-DETR |
| Segmentation | Image | Per-pixel mask | SAM 2, Mask2Former |
| OCR / Document AI | Image | Structured text | PaddleOCR, Tesseract, DocTR |
| Image-Text Matching | Image + Text | Similarity score | CLIP, SigLIP, EVA-CLIP |
| Visual QA | Image + Question | Answer text | GPT-4o, Gemini, LLaVA |
From Pixels to Representations¶
RAW IMAGE FEATURE EXTRACTION SEMANTIC EMBEDDING
┌────────────────┐ ┌──────────────────┐ ┌──────────────┐
│ ░░▓▓██░░▓▓██ │ │ Edges → Shapes │ │ [0.23, -0.1, │
│ ░░▓▓██░░▓▓██ │ ─CNN─► │ Shapes → Parts │ ─Pool─► │ 0.87, 0.45, │
│ ██░░▓▓██░░▓▓ │ or ViT │ Parts → Objects │ │ ..., -0.33] │
│ ██░░▓▓██░░▓▓ │ │ Objects → Scene │ │ │
│ │ │ │ │ 768-dim vec │
│ H×W×3 tensor │ │ Feature maps │ │ │
│ (e.g. 224×224×3)│ │ │ └──────────────┘
└────────────────┘ └──────────────────┘
Resolution matters (values = H × W × 3):
  224×224   ≈ 150K input values  (baseline)
  512×512   ≈ 786K input values  (~5× more compute)
  1024×1024 ≈ 3.1M input values  (~21× more compute)
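To make those numbers concrete, here is a minimal sketch using Pillow and NumPy ("photo.jpg" is a placeholder, and compute is approximated as proportional to the number of input values):
# Minimal sketch: how input resolution scales the raw tensor size.
# Assumes "photo.jpg" is any local image; treating compute as proportional
# to pixel count is a simplification.
from PIL import Image
import numpy as np

image = Image.open("photo.jpg").convert("RGB")

for size in (224, 384, 512, 1024):
    arr = np.asarray(image.resize((size, size)))   # H × W × 3 uint8 array
    print(f"{size}×{size}: shape={arr.shape}, "
          f"{arr.size / 1e3:.0f}K values, "
          f"~{(size / 224) ** 2:.1f}× the 224×224 cost")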
CNN vs Vision Transformer (ViT)¶
| Aspect | CNN (ConvNet) | Vision Transformer (ViT) |
|---|---|---|
| How it works | Slides learned filters across image, building local → global features | Splits image into patches, treats each as a "token", processes with transformer |
| Key operation | Convolution (local receptive field) | Self-attention (global receptive field) |
| Inductive bias | Translation invariance, locality | Minimal — learns spatial relationships from data |
| Data efficiency | Better with small datasets (built-in priors) | Needs large datasets or pretraining |
| Scale behavior | Diminishing returns past ~500M params | Scales well to billions of parameters |
| Integration with LLMs | Requires adapter/projection layer | Natural fit — same architecture family |
| Current status | Still excellent for edge/mobile (EfficientNet, MobileNet) | Dominant for research and multimodal (ViT, SigLIP) |
How ViT works (the key idea):
Image (224×224)
│
▼
Split into patches: 14×14 grid of 16×16 pixel patches = 196 patches
│
▼
Flatten each patch: 16×16×3 = 768 values per patch
│
▼
Linear projection: 768 → D (embedding dimension)
│
▼
Add position embeddings: Tell the model where each patch is
│
▼
Prepend [CLS] token: Will hold the aggregate image representation
│
▼
Process through a Transformer encoder (the same encoder architecture used in BERT)
│
▼
[CLS] output = image embedding (768-dim or 1024-dim vector)
Why this matters for GenAI builders: ViT produces patch-level and image-level embeddings that can be directly consumed by language models. This is how multimodal models like LLaVA and GPT-4o work — a ViT encodes the image, a projection layer maps visual tokens into the LLM's embedding space, and the LLM processes visual + text tokens together.
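A minimal sketch of that patch-token view, using the bare Hugging Face ViTModel (no classification head); the checkpoint name is one common choice, and the shapes assume 224×224 input with 16×16 patches:
# Minimal sketch: inspect ViT patch tokens with the bare encoder.
from transformers import ViTImageProcessor, ViTModel
from PIL import Image
import torch

checkpoint = "google/vit-base-patch16-224-in21k"   # one commonly used checkpoint
processor = ViTImageProcessor.from_pretrained(checkpoint)
model = ViTModel.from_pretrained(checkpoint)

image = Image.open("photo.jpg").convert("RGB")     # placeholder image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)              # torch.Size([1, 197, 768]) = [CLS] + 196 patches
cls_embedding = outputs.last_hidden_state[:, 0]      # image-level embedding
patch_embeddings = outputs.last_hidden_state[:, 1:]  # 196 patch tokens (what multimodal LLMs consume)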
CLIP: The Bridge Between Vision and Language¶
┌──────────────────────────────────────────────────────────────┐
│ CLIP ARCHITECTURE │
│ │
│ IMAGE TEXT │
│ "photo of a cat" "a photo of a cat" │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ │
│ │ Image │ │ Text │ │
│ │ Encoder │ │ Encoder │ │
│ │ (ViT) │ │ (Transf) │ │
│ └────┬─────┘ └────┬─────┘ │
│ │ │ │
│ ▼ ▼ │
│ [Image Embedding] [Text Embedding] │
│ 768-dim vector 768-dim vector │
│ │ │ │
│ └───────────┐ ┌─────────────────┘ │
│ ▼ ▼ │
│ Cosine Similarity │
│ (maximize for matching pairs) │
│ │
│ Training: 400M image-text pairs from the internet │
│ Result: Shared embedding space for images AND text │
└──────────────────────────────────────────────────────────────┘
What CLIP enables:
- Zero-shot image classification (no task-specific training!)
- Image search with natural language queries
- Image-text similarity scoring
- Foundation for multimodal models (LLaVA, etc.)
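For intuition, here is a toy sketch of CLIP's symmetric contrastive objective over one batch; the batch size, embedding dimension, and temperature are illustrative placeholders rather than CLIP's real training settings:
# Toy sketch of CLIP's symmetric contrastive (InfoNCE) loss over a batch.
# image_embs / text_embs stand in for the encoder outputs; values are random here.
import torch
import torch.nn.functional as F

batch_size, dim = 8, 512
image_embs = F.normalize(torch.randn(batch_size, dim), dim=-1)   # from the image encoder
text_embs = F.normalize(torch.randn(batch_size, dim), dim=-1)    # from the text encoder

temperature = 0.07                                  # learned in the real model
logits = image_embs @ text_embs.T / temperature     # [batch, batch] similarity matrix
targets = torch.arange(batch_size)                  # matching pair i ↔ i is the "correct class"

loss = (F.cross_entropy(logits, targets) +          # image → text direction
        F.cross_entropy(logits.T, targets)) / 2     # text → image direction
print(loss.item())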
How Multimodal LLMs Process Images¶
The architecture behind GPT-4o / Gemini / LLaVA:
Image ──► [Vision Encoder (ViT)] ──► [Projection Layer] ──► Visual Tokens
│
┌─────┴─────┐
│ │
Text ──► [Tokenizer] ──► Text Tokens ─────────────────► │ LLM │
│ Decoder │
│ │
└─────┬─────┘
│
Generated Text
Key insight: The vision encoder produces "visual tokens" that are
concatenated with text tokens. The LLM processes them together.
Image tokens per image (examples):
- LLaVA: 576 tokens (24×24 grid)
- GPT-4o: ~85 tokens (low detail) to ~1105 tokens (high detail)
- Gemini: Variable, up to ~3000 tokens
More tokens = better detail but higher cost and latency
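A schematic sketch of the projection step; the dimensions and the two-layer MLP are illustrative (LLaVA-1.5 uses a similar MLP projector, but details vary by model):
# Schematic sketch: turning ViT patch embeddings into "visual tokens"
# that live in the LLM's embedding space. All dimensions are illustrative.
import torch
import torch.nn as nn

vit_dim, llm_dim, num_patches = 1024, 4096, 576       # e.g. ViT-L patch tokens → a 4096-dim LLM

projection = nn.Sequential(                            # small MLP projector, LLaVA-1.5 style
    nn.Linear(vit_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

patch_tokens = torch.randn(1, num_patches, vit_dim)    # output of the vision encoder (placeholder)
visual_tokens = projection(patch_tokens)                # [1, 576, 4096]

text_tokens = torch.randn(1, 32, llm_dim)               # embedded text prompt (placeholder)
llm_input = torch.cat([visual_tokens, text_tokens], dim=1)   # what the LLM decoder actually sees
print(llm_input.shape)                                   # torch.Size([1, 608, 4096])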
Object Detection & Segmentation (Quick Tour)¶
| Model Family | Task | Speed | Key Innovation |
|---|---|---|---|
| YOLO v9/v10 | Detection | ⚡ Real-time (< 10ms) | Single-pass prediction, optimized for edge |
| RT-DETR | Detection | ⚡ Real-time | Transformer-based, no NMS needed |
| DETR | Detection | 🐢 Slower | End-to-end transformer detection |
| SAM 2 (Meta) | Segmentation | ⚡ Fast | Segment anything — zero-shot, prompt-based |
| Mask2Former | Segmentation | 🐢 Medium | Unified architecture for all segmentation types |
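As a quick way to try detection, the Hugging Face object-detection pipeline works with DETR-family checkpoints; this is a minimal sketch with an arbitrary confidence threshold ("photo.jpg" is a placeholder):
# Minimal sketch: object detection with a pretrained DETR via the HF pipeline.
# May also need: pip install timm (used by the DETR ResNet backbone).
from transformers import pipeline

detector = pipeline("object-detection", model="facebook/detr-resnet-50")
detections = detector("photo.jpg")      # accepts a path, URL, or PIL image

for det in detections:
    if det["score"] > 0.8:              # arbitrary confidence threshold
        print(f'{det["label"]:<12} {det["score"]:.2f}  {det["box"]}')
# Illustrative output: cat          0.99  {'xmin': 13, 'ymin': 52, 'xmax': 314, 'ymax': 470}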
★ Code & Implementation¶
Image Classification with a Pretrained ViT¶
# pip install transformers>=4.40 torch>=2.0 pillow>=10.0
# ⚠️ Last tested: 2026-04 | Requires: transformers>=4.40
from transformers import ViTForImageClassification, ViTImageProcessor
from PIL import Image
import torch
# Load pretrained ViT (ImageNet-21k → ImageNet-1k fine-tuned)
model_name = "google/vit-base-patch16-224"
processor = ViTImageProcessor.from_pretrained(model_name)
model = ViTForImageClassification.from_pretrained(model_name)
# Classify an image
image = Image.open("photo.jpg") # Any image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
predicted_class = logits.argmax(-1).item()
label = model.config.id2label[predicted_class]
confidence = torch.softmax(logits, dim=-1).max().item()
print(f"Prediction: {label} ({confidence:.1%})")
# Expected output: Prediction: golden_retriever (94.3%)
Zero-Shot Image Classification with CLIP¶
# pip install transformers>=4.40 torch>=2.0 pillow>=10.0
# ⚠️ Last tested: 2026-04 | Requires: transformers>=4.40
from transformers import CLIPModel, CLIPProcessor
from PIL import Image
import torch
# Load CLIP
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
# Zero-shot classification — no training needed!
image = Image.open("product.jpg")
candidate_labels = [
"a photograph of a laptop",
"a photograph of a coffee mug",
"a photograph of a smartphone",
"a photograph of a book",
]
inputs = processor(
text=candidate_labels,
images=image,
return_tensors="pt",
padding=True,
)
with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_per_image.softmax(dim=1)[0]
for label, prob in sorted(zip(candidate_labels, probs), key=lambda x: -x[1]):
    print(f"  {prob:.1%}  {label}")
# Expected output:
# 72.3% a photograph of a laptop
# 15.1% a photograph of a smartphone
# 8.2% a photograph of a book
# 4.4% a photograph of a coffee mug
Image Similarity Search (Visual Retrieval)¶
# pip install transformers>=4.40 torch>=2.0 pillow>=10.0 numpy>=1.24
# ⚠️ Last tested: 2026-04 | Requires: transformers>=4.40
from transformers import CLIPModel, CLIPProcessor
from PIL import Image
import torch
import numpy as np
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
def get_image_embedding(image_path: str) -> np.ndarray:
    """Get CLIP image embedding for similarity search."""
    image = Image.open(image_path)
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        embedding = model.get_image_features(**inputs)
    # Normalize for cosine similarity
    embedding = embedding / embedding.norm(dim=-1, keepdim=True)
    return embedding.numpy()[0]

def get_text_embedding(text: str) -> np.ndarray:
    """Get CLIP text embedding for cross-modal search."""
    inputs = processor(text=[text], return_tensors="pt", padding=True)
    with torch.no_grad():
        embedding = model.get_text_features(**inputs)
    embedding = embedding / embedding.norm(dim=-1, keepdim=True)
    return embedding.numpy()[0]
# Build a visual search index
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
embeddings = np.array([get_image_embedding(p) for p in image_paths])
# Search with text query (cross-modal!)
query = "a red sports car on a highway"
query_emb = get_text_embedding(query)
# Cosine similarity (embeddings are already normalized)
similarities = embeddings @ query_emb
best_match_idx = similarities.argmax()
print(f"Best match: {image_paths[best_match_idx]} (similarity: {similarities[best_match_idx]:.3f})")
# Expected output: Best match: img2.jpg (similarity: 0.312)
# In production, use a vector DB (Pinecone, Qdrant) for scale
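Beyond a few thousand images, the brute-force matrix multiply above should move into an approximate-nearest-neighbor index or vector DB. A minimal FAISS sketch, reusing embeddings, image_paths, and get_text_embedding from the snippet above:
# pip install faiss-cpu
# Minimal sketch: swap the NumPy matmul for a FAISS index as the collection grows.
import faiss
import numpy as np

dim = embeddings.shape[1]
index = faiss.IndexFlatIP(dim)                  # inner product == cosine sim on normalized vectors
index.add(embeddings.astype(np.float32))

query_emb = get_text_embedding("a red sports car on a highway").astype(np.float32)
scores, ids = index.search(query_emb[None, :], 3)    # top-3 matches
for score, idx in zip(scores[0], ids[0]):
    print(f"{image_paths[idx]}  similarity={score:.3f}")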
◆ Quick Reference¶
CV TASK DECISION GUIDE:
"What is in this image?" → Classification (ViT, EfficientNet)
"Where are the objects?" → Detection (YOLO, RT-DETR)
"Which pixels belong to what?" → Segmentation (SAM 2, Mask2Former)
"What does this document say?" → OCR (PaddleOCR, DocTR)
"Find images similar to this text" → CLIP / SigLIP embedding search
"Answer questions about this image" → VLM (GPT-4o, Gemini, LLaVA)
RESOLUTION vs COMPUTE TRADE-OFF:
224×224: Baseline (1×) — standard classification
384×384: 2.9× compute — better detail, common for ViT-L
512×512: 5.3× compute — detection / segmentation
1024×1024: 21× compute — high-res analysis
MODEL SIZE REFERENCE (ViT family):
ViT-B/16: 86M params, 768-dim embedding
ViT-L/14: 304M params, 1024-dim embedding
ViT-H/14: 632M params, 1280-dim embedding
EVA-CLIP: 1B+ params, SOTA image-text alignment
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Resolution mismatch | Model misses small objects or text | Input resized to 224×224, crushing fine detail | Use higher resolution models (384/512), or tile + merge strategy |
| Domain shift | High accuracy on ImageNet, poor on real data | Training data distribution ≠ production data (medical, industrial, satellite) | Fine-tune on domain-specific data, even 500-1000 images helps significantly |
| CLIP hallucination | High similarity score for wrong matches | CLIP latches onto spurious correlations (color, texture) | Use as retrieval + rerank pipeline, not sole decision maker |
| Aspect ratio distortion | Stretched/squished images produce wrong features | Naive resize to square causes information loss | Use padding/letterboxing, or models that handle variable aspect ratios |
| OCR on stylized text | Fails on handwriting, unusual fonts, curved text | OCR models trained primarily on printed text | Use specialized models (TrOCR for handwriting), or VLMs (GPT-4o) |
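For the aspect-ratio failure above, letterboxing is a simple mitigation; here is a minimal Pillow sketch (the target size and padding color are arbitrary choices, and "wide_photo.jpg" is a placeholder):
# Minimal sketch: letterbox an image to a square canvas instead of stretching it.
from PIL import Image

def letterbox(image: Image.Image, size: int = 224, fill=(0, 0, 0)) -> Image.Image:
    """Resize while preserving aspect ratio, then pad to a size×size square."""
    scale = size / max(image.width, image.height)
    resized = image.resize((round(image.width * scale), round(image.height * scale)))
    canvas = Image.new("RGB", (size, size), fill)
    canvas.paste(resized, ((size - resized.width) // 2, (size - resized.height) // 2))
    return canvas

square = letterbox(Image.open("wide_photo.jpg").convert("RGB"))
print(square.size)   # (224, 224), with padding bars instead of distortion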
○ Gotchas & Common Mistakes¶
- ⚠️ High ImageNet accuracy ≠ production robustness: A model with 90% on ImageNet may fail spectacularly on your specific photos (different lighting, angles, backgrounds).
- ⚠️ Resolution is the hidden cost multiplier: Going from 224→512 resolution increases compute ~5× and memory ~5×. Budget this explicitly.
- ⚠️ CLIP is powerful but not precise: CLIP excels at broad semantic matching but struggles with fine-grained distinctions ("two cats" vs "three cats"). Don't use it for counting or spatial reasoning.
- ⚠️ OCR ≠ document understanding: Extracting text (OCR) is different from understanding layout and relationships (Document AI). Don't confuse them.
- ⚠️ Vision tokens are expensive in VLMs: A single image in GPT-4o costs 85-1105 tokens. Processing 10 images per request can cost more than the text portion.
○ Interview Angles¶
- Q: Why are Vision Transformers important for multimodal AI?
- A: ViTs convert images into sequences of patch embeddings using the same transformer architecture as language models. This architectural alignment is what makes multimodal models possible — you can project visual patch tokens into the same embedding space as text tokens, concatenate them, and let a single transformer process both modalities together. This is exactly how models like LLaVA and GPT-4o work: a ViT encodes the image into visual tokens, a projection layer maps them into the LLM's space, and the LLM attends to both visual and text tokens. Before ViTs, integrating CNNs with transformers required more complex adapter architectures.
- Q: How does CLIP enable zero-shot image classification?
- A: CLIP trains a shared embedding space for images and text using contrastive learning on 400M image-text pairs from the internet. During training, matching image-text pairs are pulled together in embedding space while non-matching pairs are pushed apart. At inference, you encode the image with the vision encoder and encode candidate class descriptions ("a photo of a dog", "a photo of a cat") with the text encoder. The class whose text embedding is most similar to the image embedding is the prediction. No task-specific training needed — any text description works as a class label. The limitation is that CLIP's accuracy is lower than fine-tuned models on specific benchmarks, but its flexibility is unmatched.
- Q: Design an image search system for an e-commerce platform.
- A: I'd build a two-stage retrieval + reranking pipeline. Stage 1: Use CLIP (or SigLIP) to encode all product images into embeddings, stored in a vector database (Qdrant or Pinecone). User queries (text or uploaded image) are encoded with the same model, and top-100 candidates retrieved by cosine similarity. Stage 2: A cross-encoder reranker (or VLM) scores each candidate for relevance, considering product metadata (category, price, availability). The embedding index would be refreshed nightly via a batch pipeline. For latency: embedding lookup < 50ms, reranking < 200ms. For cost: CLIP encoding is ~$0.001/image. I'd also add a feedback loop — user clicks improve the reranker over time.
◆ Hands-On Exercises¶
Exercise 1: Zero-Shot Image Classification¶
Goal: Classify images into custom categories without any training
Time: 30 minutes
Steps:
1. Load CLIP using the code above
2. Choose 5 images from different categories (food, animals, landscapes, tech, fashion)
3. Define 10 candidate text labels
4. Run zero-shot classification on all 5 images
5. Evaluate: which misclassifications occur? Why?
Expected Output: Accuracy table, analysis of failure cases (CLIP confusion patterns)
Exercise 2: Build a Visual Search Engine¶
Goal: Build a text-to-image search over a local image collection
Time: 45 minutes
Steps:
1. Collect 50 images (or use a CIFAR-10 sample)
2. Encode all images with CLIP into a NumPy matrix
3. Implement text-query search: encode query → cosine similarity → top-5 results
4. Test with 5 queries: one precise ("red car"), one abstract ("peaceful"), one misleading
5. Evaluate retrieval quality — how does query specificity affect results?
Expected Output: Working search system, quality analysis by query type
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | Multimodal AI, Embeddings, Modern Architectures |
| Leads to | Diffusion Models (image generation), Document AI, Visual search systems |
| Compare with | Pure text NLP (no visual grounding), Traditional CV (pre-deep-learning, feature engineering) |
| Cross-domain | Robotics (perception), Medical imaging (radiology AI), Autonomous driving, Industrial inspection |
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 🎓 Course | Stanford CS231n: Deep Learning for Computer Vision | The definitive CV course — CNNs, detection, segmentation, attention. Watch the lectures. |
| 📄 Paper | Dosovitskiy et al. "An Image is Worth 16×16 Words" (ViT, 2020) | The paper that launched Vision Transformers. Section 3 explains the patch embedding mechanism. |
| 📄 Paper | Radford et al. "Learning Transferable Visual Models" (CLIP, 2021) | How contrastive learning creates a shared vision-language embedding space |
| 🔧 Hands-on | HuggingFace Vision Transformers Tutorial | Practical guide to using ViT for classification, feature extraction, and fine-tuning |
| 📄 Paper | Kirillov et al. "Segment Anything" (SAM, 2023) | Zero-shot segmentation — the CLIP of pixel-level vision |
| 🎥 Video | Yannic Kilcher — "CLIP: Connecting Text and Images" | Excellent visual explanation of contrastive learning and CLIP architecture |
| 📘 Book | "Deep Learning for Vision Systems" by Elgendy (2020) | Practical introduction to CV with Python — good for building intuition |
★ Sources¶
- Dosovitskiy et al. "An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale" (2020)
- Radford et al. "Learning Transferable Visual Models From Natural Language Supervision" (CLIP, 2021)
- Kirillov et al. "Segment Anything" (SAM, 2023)
- Stanford CS231n Lecture Notes — http://cs231n.stanford.edu/
- HuggingFace Transformers Documentation — https://huggingface.co/docs/transformers/
- Multimodal AI