
Computer Vision Fundamentals for AI Builders

Bit: You don't need to be a vision PhD to build multimodal AI — but you do need to understand why a ViT-B/16 processes an image as a sequence of 16×16 pixel patches, why CLIP can search images with text, and why doubling resolution roughly quadruples compute cost. This note gives you that working knowledge.


★ TL;DR

  • What: The core concepts behind image understanding — CNNs, Vision Transformers, CLIP, detection, segmentation — that every GenAI builder needs
  • Why: Modern AI is multimodal. GPT-4o, Gemini, and Claude all process images. Understanding how pixels become representations is now a core GenAI skill.
  • Key point: Vision models turn pixel arrays into semantic embeddings. Those embeddings are then used for classification, retrieval, generation, or as input to language models.

★ Overview

Definition

Computer vision (CV) is the field of enabling machines to interpret and reason about visual data — images, video, documents, and 3D scenes.

Scope

Covers: Core CV tasks, how images become representations, CNN vs ViT architectures, CLIP and contrastive learning, practical code for image classification and similarity search. For text-to-image generation, see Diffusion Models. For the broader multimodal landscape, see Multimodal AI.

Significance

  • Multimodal is the default: GPT-4o, Gemini 2.5, Claude 3.5 — all flagship models process images natively. Vision is no longer a separate field.
  • Production use cases: Document understanding, visual search, screenshot-aware assistants, product image analysis, medical imaging, autonomous systems
  • Interview relevance: System design interviews increasingly include visual components ("design an image search system", "how does a multimodal model process screenshots?")

Prerequisites

Transformer and attention basics, embeddings and vector similarity, and working Python/PyTorch (see the Builds on row under Connections).

★ Deep Dive

Core Vision Tasks

INPUT: Image (H × W × C pixel array)
┌──────────────────────────────────────────────────────────┐
│                   VISION TASKS                            │
├──────────────────────────────────────────────────────────┤
│                                                           │
│  CLASSIFICATION        "This is a cat"                    │
│  Image → single label                                    │
│                                                           │
│  DETECTION            "Cat at (x1,y1,x2,y2)"            │
│  Image → bounding boxes + labels                         │
│                                                           │
│  SEGMENTATION         "These pixels are cat"             │
│  Image → per-pixel labels                                │
│                                                           │
│  OCR / DOCUMENT AI    "Invoice #1234, Total: $500"       │
│  Image → structured text extraction                      │
│                                                           │
│  IMAGE-TEXT MATCHING   "How similar is this image         │
│  (CLIP, SigLIP)        to 'a sunset over mountains'?"   │
│                                                           │
│  VISUAL QA            "How many people are in             │
│  (VLMs)                this photo?" → "Three"             │
│                                                           │
└──────────────────────────────────────────────────────────┘
Task | Input | Output | Key Models (2026)
Classification | Image | Label(s) + confidence | ViT, EfficientNet, ConvNeXt
Object Detection | Image | Bounding boxes + labels | YOLO v9/v10, DETR, RT-DETR
Segmentation | Image | Per-pixel mask | SAM 2, Mask2Former
OCR / Document AI | Image | Structured text | PaddleOCR, Tesseract, DocTR
Image-Text Matching | Image + Text | Similarity score | CLIP, SigLIP, EVA-CLIP
Visual QA | Image + Question | Answer text | GPT-4o, Gemini, LLaVA

From Pixels to Representations

RAW IMAGE                    FEATURE EXTRACTION              SEMANTIC EMBEDDING

┌────────────────┐          ┌──────────────────┐            ┌──────────────┐
│ ░░▓▓██░░▓▓██  │          │  Edges → Shapes  │            │ [0.23, -0.1, │
│ ░░▓▓██░░▓▓██  │  ─CNN─►  │  Shapes → Parts  │  ─Pool─►   │  0.87, 0.45, │
│ ██░░▓▓██░░▓▓  │  or ViT  │  Parts → Objects │            │  ..., -0.33] │
│ ██░░▓▓██░░▓▓  │          │  Objects → Scene │            │              │
│                │          │                  │            │  768-dim vec  │
│ H×W×3 tensor   │          │  Feature maps    │            │              │
│ (e.g. 224×224×3)│         │                  │            └──────────────┘
└────────────────┘          └──────────────────┘
Resolution matters (input values = H×W×3):
  224×224   → ~150K values (baseline)
  512×512   → ~786K values (~5× more compute)
  1024×1024 → ~3.1M values (~21× more compute)
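
To make this concrete, here is a back-of-envelope sketch (my own illustration, assuming a ViT with 16×16 patches). The multipliers above track raw pixel count; for a ViT, the self-attention term grows with the square of the patch count, so high resolutions are even more expensive than the pixel ratio suggests.

# Back-of-envelope sketch: how input resolution drives ViT cost (16×16 patches assumed)
def num_patches(side: int, patch: int = 16) -> int:
    return (side // patch) ** 2

BASE = 224
for side in (224, 384, 512, 1024):
    patches = num_patches(side)                        # ViT sequence length
    pixel_ratio = (side / BASE) ** 2                   # conv / linear-layer cost scaling
    attn_ratio = (patches / num_patches(BASE)) ** 2    # self-attention is O(N²) in patches
    print(f"{side}×{side}: {patches} patches, ~{pixel_ratio:.1f}× pixels, "
          f"~{attn_ratio:.0f}× attention cost")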

CNN vs Vision Transformer (ViT)

Aspect | CNN (ConvNet) | Vision Transformer (ViT)
How it works | Slides learned filters across the image, building local → global features | Splits the image into patches, treats each as a "token", processes with a transformer
Key operation | Convolution (local receptive field) | Self-attention (global receptive field)
Inductive bias | Translation invariance, locality | Minimal — learns spatial relationships from data
Data efficiency | Better with small datasets (built-in priors) | Needs large datasets or pretraining
Scale behavior | Diminishing returns past ~500M params | Scales well to billions of parameters
Integration with LLMs | Requires adapter/projection layer | Natural fit — same architecture family
Current status | Still excellent for edge/mobile (EfficientNet, MobileNet) | Dominant for research and multimodal (ViT, SigLIP)
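
To see the "key operation" row in code, here is a minimal PyTorch sketch (shapes are illustrative, not a full model): a convolution mixes only a local neighborhood per output position, while self-attention over patch tokens produces a full 196×196 interaction matrix.

# Minimal sketch of the two key operations (illustrative shapes, not a full model)
import torch
import torch.nn as nn

# CNN: each output position sees only a 3×3 local neighborhood
conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)
feat = conv(torch.randn(1, 3, 224, 224))        # -> (1, 64, 224, 224)

# ViT: every patch token attends to every other patch token
tokens = torch.randn(1, 196, 768)               # 196 patches from a 224×224 image
attn = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)
out, weights = attn(tokens, tokens, tokens)     # weights: (1, 196, 196) — global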

How ViT works (the key idea):

1. Image (224×224)
2. Split into patches: a 14×14 grid of 16×16 pixel patches = 196 patches
3. Flatten each patch: 16×16×3 = 768 values per patch
4. Linear projection: 768 → D (the embedding dimension)
5. Add position embeddings: tell the model where each patch is
6. Prepend a [CLS] token: it will hold the aggregate image representation
7. Process through a Transformer encoder (same as BERT/GPT!)
8. The [CLS] output = the image embedding (a 768-dim or 1024-dim vector)
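
These steps map almost line-for-line to code. Below is a minimal sketch of the patch-embedding front end, with dimensions following ViT-B/16; it is illustrative, not the exact Hugging Face implementation.

# Minimal sketch of the ViT patch-embedding front end (ViT-B/16 dimensions)
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2   # 14×14 = 196
        # A conv with stride = kernel = patch_size is equivalent to
        # "split into patches, flatten, linear projection"
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                       # x: (B, 3, 224, 224)
        x = self.proj(x)                        # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)        # (B, 196, 768) — one token per patch
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)          # prepend [CLS] -> (B, 197, 768)
        return x + self.pos_embed               # add position embeddings

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 197, 768])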

Why this matters for GenAI builders: ViT produces patch-level and image-level embeddings that can be directly consumed by language models. This is how multimodal models like LLaVA and GPT-4o work — a ViT encodes the image, a projection layer maps visual tokens into the LLM's embedding space, and the LLM processes visual + text tokens together.

CLIP: The Bridge Between Vision and Language

┌──────────────────────────────────────────────────────────────┐
│                     CLIP ARCHITECTURE                         │
│                                                               │
│   IMAGE                              TEXT                     │
│   "photo of a cat"                   "a photo of a cat"      │
│        │                                   │                  │
│        ▼                                   ▼                  │
│   ┌──────────┐                       ┌──────────┐            │
│   │  Image   │                       │  Text    │            │
│   │  Encoder │                       │  Encoder │            │
│   │  (ViT)   │                       │ (Transf) │            │
│   └────┬─────┘                       └────┬─────┘            │
│        │                                   │                  │
│        ▼                                   ▼                  │
│   [Image Embedding]              [Text Embedding]            │
│   768-dim vector                 768-dim vector               │
│        │                                   │                  │
│        └───────────┐   ┌─────────────────┘                  │
│                    ▼   ▼                                      │
│              Cosine Similarity                                │
│              (maximize for matching pairs)                    │
│                                                               │
│   Training: 400M image-text pairs from the internet          │
│   Result: Shared embedding space for images AND text         │
└──────────────────────────────────────────────────────────────┘

What CLIP enables:
  - Zero-shot image classification (no task-specific training!)
  - Image search with natural language queries
  - Image-text similarity scoring
  - Foundation for multimodal models (LLaVA, etc.)
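
The training objective shown in the diagram is a symmetric contrastive (InfoNCE) loss. A minimal sketch, assuming image_emb and text_emb are L2-normalized (batch, dim) tensors from the two encoders, with matching pairs at the same batch index:

# Minimal sketch of CLIP's symmetric contrastive (InfoNCE) loss
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    logits = image_emb @ text_emb.T / temperature   # (B, B) similarity matrix
    targets = torch.arange(len(logits))             # matching pairs on the diagonal
    loss_i = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t = F.cross_entropy(logits.T, targets)     # text -> image direction
    return (loss_i + loss_t) / 2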

How Multimodal LLMs Process Images

The architecture behind GPT-4o / Gemini / LLaVA:

Image ──► [Vision Encoder (ViT)] ──► [Projection Layer] ──► Visual Tokens
                                                          ┌─────┴─────┐
                                                          │           │
Text  ──► [Tokenizer] ──► Text Tokens ─────────────────► │    LLM    │
                                                          │  Decoder  │
                                                          │           │
                                                          └─────┬─────┘
                                                          Generated Text

Key insight: The vision encoder produces "visual tokens" that are
concatenated with text tokens. The LLM processes them together.

Image tokens per image (examples):
  - LLaVA: 576 tokens (24×24 grid)
  - GPT-4o: 85 tokens (low detail) to ~1105 tokens (high detail)
  - Gemini: Variable, up to ~3000 tokens

More tokens = better detail but higher cost and latency
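
A minimal sketch of the projection step, with illustrative dimensions (LLaVA-1.5 uses a two-layer MLP from the vision encoder's hidden size to the LLM's embedding size):

# Minimal sketch of a LLaVA-style vision -> LLM projector (dimensions illustrative)
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096        # e.g. ViT-L/14 features -> LLM embedding size

projector = nn.Sequential(              # LLaVA-1.5 style: two-layer MLP with GELU
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

patch_features = torch.randn(1, 576, vision_dim)   # 24×24 grid from the vision encoder
visual_tokens = projector(patch_features)          # (1, 576, 4096)
# visual_tokens are concatenated with text token embeddings and fed to the LLM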

Object Detection & Segmentation (Quick Tour)

Model Family | Task | Speed | Key Innovation
YOLO v9/v10 | Detection | ⚡ Real-time (< 10 ms) | Single-pass prediction, optimized for edge
RT-DETR | Detection | ⚡ Real-time | Transformer-based, no NMS needed
DETR | Detection | 🐢 Slower | End-to-end transformer detection
SAM 2 (Meta) | Segmentation | ⚡ Fast | Segment anything — zero-shot, prompt-based
Mask2Former | Segmentation | 🐢 Medium | Unified architecture for all segmentation types
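
For detection, the fastest way to try a real-time model is the ultralytics package. A quick sketch (the model name and image path are examples; weights download on first use):

# pip install ultralytics
# Quick detection sketch with a pretrained YOLO model
from ultralytics import YOLO

model = YOLO("yolov8n.pt")              # small, fast variant; swap for newer weights
results = model("street.jpg")           # returns a list of Results objects

for box in results[0].boxes:
    cls_name = model.names[int(box.cls)]
    conf = float(box.conf)
    print(f"{cls_name}: {conf:.2f} at {box.xyxy[0].tolist()}")  # (x1, y1, x2, y2)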

★ Code & Implementation

Image Classification with a Pretrained ViT

# pip install transformers>=4.40 torch>=2.0 pillow>=10.0
# ⚠️ Last tested: 2026-04 | Requires: transformers>=4.40

from transformers import ViTForImageClassification, ViTImageProcessor
from PIL import Image
import torch

# Load pretrained ViT (ImageNet-21k → ImageNet-1k fine-tuned)
model_name = "google/vit-base-patch16-224"
processor = ViTImageProcessor.from_pretrained(model_name)
model = ViTForImageClassification.from_pretrained(model_name)

# Classify an image
image = Image.open("photo.jpg")  # Any image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_class = logits.argmax(-1).item()

label = model.config.id2label[predicted_class]
confidence = torch.softmax(logits, dim=-1).max().item()
print(f"Prediction: {label} ({confidence:.1%})")

# Expected output: Prediction: golden_retriever (94.3%)

Zero-Shot Image Classification with CLIP

# pip install transformers>=4.40 torch>=2.0 pillow>=10.0
# ⚠️ Last tested: 2026-04 | Requires: transformers>=4.40

from transformers import CLIPModel, CLIPProcessor
from PIL import Image
import torch

# Load CLIP
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Zero-shot classification — no training needed!
image = Image.open("product.jpg")
candidate_labels = [
    "a photograph of a laptop",
    "a photograph of a coffee mug",
    "a photograph of a smartphone",
    "a photograph of a book",
]

inputs = processor(
    text=candidate_labels,
    images=image,
    return_tensors="pt",
    padding=True,
)

with torch.no_grad():
    outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=1)[0]

for label, prob in sorted(zip(candidate_labels, probs.tolist()), key=lambda x: -x[1]):
    print(f"  {prob:.1%}  {label}")

# Expected output:
#   72.3%  a photograph of a laptop
#   15.1%  a photograph of a smartphone
#    8.2%  a photograph of a book
#    4.4%  a photograph of a coffee mug

Image Similarity Search (Visual Retrieval)

# pip install transformers>=4.40 torch>=2.0 pillow>=10.0 numpy>=1.24
# ⚠️ Last tested: 2026-04 | Requires: transformers>=4.40

from transformers import CLIPModel, CLIPProcessor
from PIL import Image
import torch
import numpy as np

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def get_image_embedding(image_path: str) -> np.ndarray:
    """Get CLIP image embedding for similarity search."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        embedding = model.get_image_features(**inputs)
    # Normalize for cosine similarity
    embedding = embedding / embedding.norm(dim=-1, keepdim=True)
    return embedding.numpy()[0]

def get_text_embedding(text: str) -> np.ndarray:
    """Get CLIP text embedding for cross-modal search."""
    inputs = processor(text=[text], return_tensors="pt", padding=True)
    with torch.no_grad():
        embedding = model.get_text_features(**inputs)
    embedding = embedding / embedding.norm(dim=-1, keepdim=True)
    return embedding.numpy()[0]

# Build a visual search index
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
embeddings = np.array([get_image_embedding(p) for p in image_paths])

# Search with text query (cross-modal!)
query = "a red sports car on a highway"
query_emb = get_text_embedding(query)

# Cosine similarity (embeddings are already normalized)
similarities = embeddings @ query_emb
best_match_idx = similarities.argmax()
print(f"Best match: {image_paths[best_match_idx]} (similarity: {similarities[best_match_idx]:.3f})")

# Expected output: Best match: img2.jpg (similarity: 0.312)
# In production, use a vector DB (Pinecone, Qdrant) for scale

◆ Quick Reference

CV TASK DECISION GUIDE:
  "What is in this image?"              → Classification (ViT, EfficientNet)
  "Where are the objects?"              → Detection (YOLO, RT-DETR)
  "Which pixels belong to what?"        → Segmentation (SAM 2, Mask2Former)
  "What does this document say?"        → OCR (PaddleOCR, DocTR)
  "Find images similar to this text"    → CLIP / SigLIP embedding search
  "Answer questions about this image"   → VLM (GPT-4o, Gemini, LLaVA)

RESOLUTION vs COMPUTE TRADE-OFF:
  224×224:   Baseline (1×) — standard classification
  384×384:   2.9× compute — better detail, common for ViT-L
  512×512:   5.2× compute — detection / segmentation
  1024×1024: 21× compute  — high-res analysis

MODEL SIZE REFERENCE (ViT family):
  ViT-B/16:   86M params, 768-dim embedding
  ViT-L/14:  304M params, 1024-dim embedding
  ViT-H/14:  632M params, 1280-dim embedding
  EVA-CLIP:    1B+ params, SOTA image-text alignment

◆ Production Failure Modes

Failure | Symptoms | Root Cause | Mitigation
Resolution mismatch | Model misses small objects or text | Input resized to 224×224, crushing fine detail | Use higher-resolution models (384/512), or a tile + merge strategy
Domain shift | High accuracy on ImageNet, poor on real data | Training data distribution ≠ production data (medical, industrial, satellite) | Fine-tune on domain-specific data; even 500-1000 images helps significantly
CLIP hallucination | High similarity score for wrong matches | CLIP latches onto spurious correlations (color, texture) | Use as a retrieval + rerank pipeline, not the sole decision maker
Aspect ratio distortion | Stretched/squished images produce wrong features | Naive resize to square causes information loss | Use padding/letterboxing, or models that handle variable aspect ratios
OCR on stylized text | Fails on handwriting, unusual fonts, curved text | OCR models trained primarily on printed text | Use specialized models (TrOCR for handwriting) or VLMs (GPT-4o)

○ Gotchas & Common Mistakes

  • ⚠️ High ImageNet accuracy ≠ production robustness: A model with 90% on ImageNet may fail spectacularly on your specific photos (different lighting, angles, backgrounds).
  • ⚠️ Resolution is the hidden cost multiplier: Going from 224→512 resolution increases compute ~5× and memory ~5×. Budget this explicitly.
  • ⚠️ CLIP is powerful but not precise: CLIP excels at broad semantic matching but struggles with fine-grained distinctions ("two cats" vs "three cats"). Don't use it for counting or spatial reasoning.
  • ⚠️ OCR ≠ document understanding: Extracting text (OCR) is different from understanding layout and relationships (Document AI). Don't confuse them.
  • ⚠️ Vision tokens are expensive in VLMs: A single image in GPT-4o costs 85-1105 tokens. Processing 10 images per request can cost more than the text portion.

○ Interview Angles

  • Q: Why are Vision Transformers important for multimodal AI?
  • A: ViTs convert images into sequences of patch embeddings using the same transformer architecture as language models. This architectural alignment is what makes multimodal models possible — you can project visual patch tokens into the same embedding space as text tokens, concatenate them, and let a single transformer process both modalities together. This is exactly how models like LLaVA and GPT-4o work: a ViT encodes the image into visual tokens, a projection layer maps them into the LLM's space, and the LLM attends to both visual and text tokens. Before ViTs, integrating CNNs with transformers required more complex adapter architectures.

  • Q: How does CLIP enable zero-shot image classification?

  • A: CLIP trains a shared embedding space for images and text using contrastive learning on 400M image-text pairs from the internet. During training, matching image-text pairs are pulled together in embedding space while non-matching pairs are pushed apart. At inference, you encode the image with the vision encoder and encode candidate class descriptions ("a photo of a dog", "a photo of a cat") with the text encoder. The class whose text embedding is most similar to the image embedding is the prediction. No task-specific training needed — any text description works as a class label. The limitation is that CLIP's accuracy is lower than fine-tuned models on specific benchmarks, but its flexibility is unmatched.

  • Q: Design an image search system for an e-commerce platform.

  • A: I'd build a two-stage retrieval + reranking pipeline. Stage 1: use CLIP (or SigLIP) to encode all product images into embeddings, stored in a vector database (Qdrant or Pinecone). User queries (text or an uploaded image) are encoded with the same model, and the top 100 candidates are retrieved by cosine similarity. Stage 2: a cross-encoder reranker (or VLM) scores each candidate for relevance, factoring in product metadata (category, price, availability). The embedding index would be refreshed nightly via a batch pipeline. For latency: embedding lookup < 50 ms, reranking < 200 ms. For cost: CLIP encoding is roughly $0.001/image. I'd also add a feedback loop — user clicks improve the reranker over time.

◆ Hands-On Exercises

Exercise 1: Zero-Shot Image Classification

Goal: Classify images into custom categories without any training
Time: 30 minutes
Steps:
  1. Load CLIP using the code above
  2. Choose 5 images from different categories (food, animals, landscapes, tech, fashion)
  3. Define 10 candidate text labels
  4. Run zero-shot classification on all 5 images
  5. Evaluate: which misclassifications occur? Why?
Expected Output: Accuracy table, analysis of failure cases (CLIP confusion patterns)

Exercise 2: Build a Visual Search Engine

Goal: Build text-to-image search over a local image collection
Time: 45 minutes
Steps:
  1. Collect 50 images (or use a CIFAR-10 sample)
  2. Encode all images with CLIP into a numpy matrix
  3. Implement text-query search: encode query → cosine similarity → top-5 results
  4. Test with 5 queries: one precise ("red car"), one abstract ("peaceful"), one misleading
  5. Evaluate retrieval quality — how does query specificity affect results?
Expected Output: Working search system, quality analysis by query type


★ Connections

Relationship | Topics
Builds on | Multimodal AI, Embeddings, Modern Architectures
Leads to | Diffusion Models (image generation), Document AI, Visual search systems
Compare with | Pure text NLP (no visual grounding), Traditional CV (pre-deep-learning, feature engineering)
Cross-domain | Robotics (perception), Medical imaging (radiology AI), Autonomous driving, Industrial inspection

Type | Resource | Why
🎓 Course | Stanford CS231n: Deep Learning for Computer Vision | The definitive CV course — CNNs, detection, segmentation, attention. Watch the lectures.
📄 Paper | Dosovitskiy et al., "An Image is Worth 16×16 Words" (ViT, 2020) | The paper that launched Vision Transformers. Section 3 explains the patch-embedding mechanism.
📄 Paper | Radford et al., "Learning Transferable Visual Models" (CLIP, 2021) | How contrastive learning creates a shared vision-language embedding space
🔧 Hands-on | Hugging Face Vision Transformers tutorial | Practical guide to using ViT for classification, feature extraction, and fine-tuning
📄 Paper | Kirillov et al., "Segment Anything" (SAM, 2023) | Zero-shot segmentation — the CLIP of pixel-level vision
🎥 Video | Yannic Kilcher, "CLIP: Connecting Text and Images" | Visual explanation of contrastive learning and the CLIP architecture
📘 Book | "Deep Learning for Vision Systems" by Elgendy (2020) | Practical introduction to CV with Python — good for building intuition

★ Sources

  • Dosovitskiy et al. "An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale" (2020)
  • Radford et al. "Learning Transferable Visual Models From Natural Language Supervision" (CLIP, 2021)
  • Kirillov et al. "Segment Anything" (SAM, 2023)
  • Stanford CS231n Lecture Notes — http://cs231n.stanford.edu/
  • HuggingFace Transformers Documentation — https://huggingface.co/docs/transformers/
  • Multimodal AI