Generative AI¶

✨ Bit: "All models are wrong, but some are generative." — A class of AI that learns to model the distribution of its training data well enough to produce new samples from it. Whether that constitutes "understanding" is one of the most contested questions in the field.

★ TL;DR¶

What: AI systems that create new content (text, images, audio, video, code, 3D models) by learning to model the distribution of training data
Why: Transforming every industry — from software development to drug discovery. The defining technology of 2023–2026+
Key points: Built on the Transformer architecture (2017). Two distinct breakthroughs drove the revolution: (1) training compute scaling (bigger models trained on more data); and (2) post-training alignment (instruction tuning and RLHF, which made raw models actually useful). A third axis, test-time compute scaling, emerged in 2024–2025

★ Overview¶

Definition¶

Generative AI is a class of artificial intelligence models that learn to model the underlying distribution of training data, enabling generation of new content consistent with that distribution. Unlike discriminative AI (which learns to classify or predict labels given inputs), generative AI learns what the data itself looks like — and can produce new examples: text, images, music, video, code, 3D models, and more.
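The contrast between discriminative and generative modelling can be sketched with plain counting. The snippet below is a minimal illustration, assuming a made-up dataset of one-word messages; the names `DATA`, `conditional_table`, and `sample` are hypothetical and exist only for this example.

```python
import random
from collections import Counter, defaultdict

# Toy labelled data: (one-word message, label). Entirely made up for illustration.
DATA = [
    ("win", "spam"), ("win", "spam"), ("prize", "spam"),
    ("meeting", "ham"), ("lunch", "ham"), ("win", "ham"),
]

def conditional_table(pairs):
    """Estimate P(second | first) from co-occurrence counts."""
    counts = defaultdict(Counter)
    for first, second in pairs:
        counts[first][second] += 1
    return {
        first: {second: n / sum(c.values()) for second, n in c.items()}
        for first, c in counts.items()
    }

# Discriminative view: P(label | message), used to classify a given input.
p_label_given_msg = conditional_table(DATA)

# Generative view: P(message | label), used to produce new data.
p_msg_given_label = conditional_table([(label, msg) for msg, label in DATA])

def sample(dist, rng):
    """Draw one outcome from a {outcome: probability} table."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes], k=1)[0]

rng = random.Random(0)
predicted = max(p_label_given_msg["win"], key=p_label_given_msg["win"].get)
generated = sample(p_msg_given_label["spam"], rng)
print("classify 'win' ->", predicted)
print("generate a spam message ->", generated)
```

The same counts serve both views; what differs is which side of the conditional is fixed and which is produced. Real generative models replace the count table with a neural network, but the sampling step plays the same role.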

Scope¶

This document covers the GenAI landscape at a high level. Each major sub-area has its own dedicated document:

Area	Document
Start here	Neural Networks, Linear Algebra for AI, Python for AI, Probability & Statistics for AI, Deep Learning Fundamentals, NLP Fundamentals
How it all works	Transformers, Attention Mechanism, Attention Deep Dive, Modern LLM Architectures, Scaling Laws & Pre-training, State Space Models
How text becomes numbers	Embeddings, Tokenization
Text generation	Large Language Models (LLMs), Reasoning Models & Test-Time Compute, LLM Landscape & Model Selection, Hallucination Detection & Mitigation
Image generation	Diffusion Models
Beyond text	Multimodal AI, Computer Vision Fundamentals for AI Builders
Making models work for you	Retrieval-Augmented Generation (RAG), Graph RAG & Advanced Retrieval, Fine-Tuning LLMs, Advanced Fine-Tuning for LLM Adaptation, Embedding Fine-Tuning, Long-Context Engineering, Structured Outputs & Constrained Generation, AI Agents, Agent Memory Systems, Multi-Agent Architectures, Agent Evaluation & Observability, Prompt Engineering, Function Calling, Structured Output & Tool Use, Context Engineering & Long Context
Alignment & training	Reinforcement Learning for LLM Alignment, Synthetic Data & Data Engineering for LLMs, Knowledge Distillation & Model Compression, Continual Learning & Lifelong AI
Agentic infrastructure	Agentic Protocols & Frameworks
Building with GenAI	GenAI Tools & Infrastructure, Vector Databases, Cloud ML Services & Managed AI Platforms, Distributed Systems Fundamentals for AI, ML Experiment & Data Management
AI applications	Code Generation & AI-Assisted Development, AI Coding Agents, Conversational AI & Dialogue Systems, API Design for AI Applications, AI Product Management Fundamentals, Voice AI & Speech, AI UX Patterns
Going to production	LLMOps & Production Deployment, AI System Design for GenAI Applications, Docker & Kubernetes for GenAI Deployment, Model Serving for LLM Applications, Monitoring & Observability for GenAI Systems, CI/CD for ML and LLM Systems, Cost Optimization for GenAI Systems, Classical ML for GenAI Builders, Latency & Throughput Engineering for AI Systems, LLM Routing & Model Selection, Guardrails & Content Filtering, Data Flywheel Design, Document Parsing & Extraction
Measuring quality	LLM Evaluation & Benchmarks, LLM Evaluation Deep Dive, Retrieval Evaluation, System Design for AI Interviews
Making it fast & cheap	Inference Optimization, GPU & CUDA Programming for AI Engineers, Distributed Inference & Serving Architecture
Keeping it safe	Ethics, Safety & Alignment, AI Regulation for Builders, Adversarial ML & AI Security, OWASP Top 10 for LLM Applications, Prompt Injection Deep Dive, MCP Security & Tool Trust
Research frontiers	Mechanistic Interpretability, Distributed Training & Training Infrastructure, Research Methodology & Paper Reading for AI
Career & job readiness	GenAI Career Roles — Complete Reference (2026), AI Engineer, Generative AI Engineer, LLM Engineer, RAG Engineer, Agentic AI Engineer, ML Engineer, MLOps / LLMOps Engineer

Significance¶

Market: Analyst estimates put the global GenAI market at roughly $161 billion as of April 2026 (definitions and methodologies vary across sources)
Adoption: Growing rapidly across enterprise, with deployment maturity ranging from experimental chatbots to production agentic systems
Job impact: Creating new roles (AI engineers, LLMOps, AI safety researchers) while significantly transforming existing software and knowledge-work roles
Scientific impact: Accelerating drug discovery via generative molecular design (RFDiffusion, ProteinMPNN), materials science, and hypothesis generation at scale

Prerequisites¶

Neural Networks — neurons, layers, backpropagation
Linear Algebra for AI — vectors, matrices, dot products
Python for AI — NumPy, PyTorch, environment setup
Probability & Statistics for AI — distributions, loss functions, sampling
Deep Learning Fundamentals — training loop, optimizers, GPUs
NLP Fundamentals — BERT vs GPT, NER, text classification

★ Deep Dive¶

The GenAI Landscape (2025–2026)¶

Time-sensitive vendor examples in this section were last reviewed in 2026-04. Treat them as a snapshot, not a permanent ranking.

Generative AI
├── 🔤 Text/Language
│   ├── Large Language Models (LLMs)
│   │   ├── Closed-source: GPT-5.5, Gemini 3.1 Pro, Claude Opus 4.7, Grok 4.3
│   │   └── Open-weight: Llama 4 (Scout/Maverick), Qwen 3.5, Mistral, DeepSeek V4, GLM-5.1
│   └── Reasoning modes — extended thinking at inference time, built into frontier models
│       (e.g. GPT-5.5 Thinking, Claude extended thinking, Gemini Deep Think)
│       See: Test-Time Compute scaling
│
├── 🖼️ Image Generation
│   ├── Diffusion models: Stable Diffusion 3.5, FLUX.1.1 Pro, Midjourney v7
│   ├── Autoregressive/Hybrid: GPT Image 1 (OpenAI, replaces DALL-E), Ideogram 3.0
│   ├── GANs (legacy — largely superseded by diffusion; still used in niche tasks)
│   └── Image editing: Inpainting, Outpainting, Style Transfer
│
├── 🎥 Video Generation
│   ├── Veo 3.1 (Google), Runway Gen-4.5, Kling 3.0, Seedance 2.0
│   └── Note: Sora was discontinued March 2026; API winds down September 2026
│
├── 🔊 Audio / Music
│   ├── Text-to-speech: ElevenLabs, Bark, Voxtral TTS (Mistral)
│   ├── Music generation: Suno v5.5, Udio v4, Google Lyria 3
│   └── Voice cloning
│
├── 🧊 3D Generation
│   ├── Text/image-to-3D mesh: Meshy, Rodin, Tripo3D
│   └── Applications: games, product visualisation, VR/AR
│
├── 🧬 Multimodal Input
│   ├── All frontier LLMs now accept multimodal input (text + image + audio + video)
│   ├── Output from LLMs is text — "multimodal" in marketing almost always means input, not output
│   └── True joint output (e.g. video + synchronised audio) exists only in dedicated video models
│
└── 🤖 Agentic AI (2025–2026's defining trend)
    ├── Agents that plan, reason, and act autonomously across multiple steps
    ├── Tool use, function calling, code execution, web browsing, computer use
    ├── Coding agents: Claude Code, OpenAI Codex, GitHub Copilot, Gemini CLI
    └── Multi-agent systems and orchestration frameworks

The Key Paradigm Shifts¶

Era	Paradigm	Example
Pre-2017	Rule-based, statistical, and recurrent neural NLP	N-grams, RNNs, LSTMs
2017	Transformer architecture	"Attention Is All You Need" (Vaswani et al., NeurIPS 2017)
2018–2020	Pre-training + Fine-tuning	BERT (encoder), GPT-2 (decoder); transfer learning goes mainstream
2020–2022	Training compute scaling	GPT-3 (175B params, May 2020); Kaplan et al. scaling laws (2020); Chinchilla compute-optimal scaling laws (Hoffmann et al., 2022)
2022–2023	Instruction tuning + RLHF + Chat interface	InstructGPT → ChatGPT; alignment makes raw models usable
2023–2024	Multimodal input + Reasoning	GPT-4, Gemini; image understanding; chain-of-thought at scale
2024–2025	Test-time compute scaling	o1, DeepSeek-R1; thinking longer at inference rivals larger models
2025–2026	Agentic AI + Hybrid techniques	Autonomous coding agents, multi-step workflows, RAG + LoRA combos
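The training compute scaling row can be made concrete with some back-of-envelope arithmetic. The sketch below assumes two widely used approximations that are not stated in this document: training FLOPs C ≈ 6 × N × D (N parameters, D training tokens), and the rough "20 tokens per parameter" rule of thumb often drawn from the Chinchilla paper. The ~300B training-token figure for GPT-3 is also an assumption taken from the GPT-3 paper.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6   # approximation: C ~ 6 * N * D
TOKENS_PER_PARAM = 20       # rough Chinchilla rule of thumb, not an exact law

def training_flops(params, tokens):
    """Approximate total training compute in FLOPs."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens

def compute_optimal(flops_budget):
    """Split a budget so that D = 20 * N, which gives N = sqrt(C / (6 * 20))."""
    params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return params, TOKENS_PER_PARAM * params

# GPT-3: 175B parameters, assumed ~300B training tokens.
gpt3_flops = training_flops(175e9, 300e9)
n_opt, d_opt = compute_optimal(gpt3_flops)

print(f"GPT-3 training compute: {gpt3_flops:.2e} FLOPs")
print(f"Same budget, compute-optimal split: {n_opt:.2e} params, {d_opt:.2e} tokens")
```

Under these assumptions, GPT-3's budget would be better spent on a model of roughly 51B parameters trained on about 1T tokens. That is the qualitative shift between the Kaplan-era and Chinchilla-era views: smaller models, more data, for the same compute.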

How Generative AI Actually Works (Simplified)¶

Raw data: Massive datasets — internet text, images, code, audio, etc.
Tokenisation: Text is split into tokens (subword units); images into patches. This is how raw data becomes model input. See Tokenization
Architecture: Usually Transformer-based. See Transformers
Pre-training: The model learns to predict the next token (text) or to denoise corrupted inputs (images) on enormous datasets, building a compressed representation of its training data
Post-training / Alignment: Instruction fine-tuning and RLHF/DPO shape the pre-trained model into something that follows instructions and reduces harmful outputs. This stage is as important as pre-training for producing a usable model
Inference: Given an input, the model generates output token-by-token (text) or step-by-step (images). Extended thinking modes allow additional computation at this stage, trading latency for quality

The core insight: Pre-training at scale surfaces broad capabilities; post-training alignment directs them. The combination — not either alone — is what makes modern GenAI systems useful.
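The tokenisation, pre-training, and inference steps above can be sketched at toy scale with a bigram model, which predicts the next token from the previous one only. This is a deliberately simplified stand-in, not how a Transformer works internally: the corpus, the whitespace tokeniser, and the function names are all assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

CORPUS = "the cat sat on the mat . the dog sat on the rug ."

def tokenise(text):
    """Whitespace tokeniser; real systems use subword schemes such as BPE."""
    return text.split()

def train_bigram(tokens):
    """Toy 'pre-training': count which token follows which."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, start, max_tokens, rng):
    """Inference: sample one token at a time, conditioned on the last token."""
    out = [start]
    for _ in range(max_tokens):
        followers = counts.get(out[-1])
        if not followers:
            break  # no observed continuation for this token
        choices = list(followers)
        weights = [followers[c] for c in choices]
        out.append(rng.choices(choices, weights=weights, k=1)[0])
    return out

tokens = tokenise(CORPUS)
model = train_bigram(tokens)
sample = generate(model, "the", 8, random.Random(0))
print(" ".join(sample))
```

An LLM follows the same loop (predict a distribution over the next token, sample, append, repeat) but conditions on the whole preceding context rather than one token, and learns the distribution with gradient descent rather than counting.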

◆ Types & Classifications¶

By Output Modality¶

⚠️ This table classifies by output modality only. Code is a text output, not a separate modality. All frontier LLMs accept multimodal input but produce text as their primary output — multimodal input capability and multimodal output capability are fundamentally different things.

Output Modality	What It Generates	Leading Models (April 2026)
Text	Natural language, code, structured data	GPT-5.5, GPT-5.4, Claude Opus 4.7, Claude Sonnet 4.6, Gemini 3.1 Pro, Grok 4.3, Llama 4 Scout/Maverick, DeepSeek V4, GLM-5.1, Qwen 3.5
Image	Still images from text or image prompts	Midjourney v7, FLUX.1.1 Pro, GPT Image 1 (replaces DALL-E, APIs sunset May 2026), Stable Diffusion 3.5, Adobe Firefly, Ideogram 3.0
Video	Video clips from text or image prompts	Veo 3.1, Runway Gen-4.5, Kling 3.0, Seedance 2.0 (Sora discontinued March 2026; API winds down Sep 2026)
Audio	Music, song, voice, sound effects	Suno v5.5, Udio v4, ElevenLabs, Google Lyria 3, MiniMax Music 2.5
3D	Meshes, textured 3D assets	Meshy, Rodin, Tripo3D

By Architecture¶

Architecture	How It Works	Used For
Transformer — Decoder-only (Autoregressive)	Predicts next token sequentially; attends only to prior context	LLMs: GPT, Claude, Llama, Gemini
Transformer — Encoder-only (Bidirectional)	Attends to full context in both directions; not generative by default	Embeddings, classification, retrieval (BERT, sentence-transformers)
Diffusion Models	Learns to iteratively denoise corrupted inputs; can be conditioned on text	Image and video generation (FLUX, Stable Diffusion, Veo)
GANs	Generator vs. Discriminator adversarial training	Image synthesis; largely superseded by diffusion for generation; still used in domain adaptation and some video tasks
VAEs	Encode inputs to a compressed latent space; decode back; used as a component within diffusion pipelines	Latent compression inside diffusion models (e.g. Stable Diffusion's image encoder/decoder); standalone generation use has largely been replaced
MoE (Mixture of Experts)	Route each token to a subset of specialised sub-networks; only a fraction of parameters active per token	Efficient scaling of large models: Llama 4, DeepSeek V4, Mixtral, Grok 4.3
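The difference between the decoder-only and encoder-only rows comes down to which positions each token is allowed to attend to. The sketch below is a minimal pure-Python illustration, assuming a small matrix of raw attention scores; it omits queries, keys, values, and multiple heads, and the function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, causal):
    """Row i holds how much position i attends to each position j.

    With causal=True, position i may only look at positions j <= i,
    which is the decoder-only (autoregressive) setting.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        row = [
            scores[i][j] if (not causal or j <= i) else float("-inf")
            for j in range(n)
        ]
        weights.append(softmax(row))
    return weights

# Equal raw scores make the effect of the mask easy to read.
scores = [[0.0] * 3 for _ in range(3)]
causal = attention_weights(scores, causal=True)
bidirectional = attention_weights(scores, causal=False)

for name, w in (("causal", causal), ("bidirectional", bidirectional)):
    print(name)
    for row in w:
        print("  ", [round(x, 2) for x in row])
```

With the causal mask, the first token can only attend to itself while the last attends to everything before it; without the mask, every token sees the full sequence. This is why encoder-only models suit embeddings and classification, and decoder-only models suit left-to-right generation.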

By Access Model¶

Type	Examples	Pros	Cons
Closed-source API	GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro	Best frontier performance; no infrastructure burden; fast iteration	Per-token cost at scale; no weight access; vendor lock-in; data leaves your environment
Open-weight	Llama 4, Mistral Large 3, Qwen 3.5, DeepSeek V4, GLM-5.1	Self-hostable; fine-tunable; no per-token API fees; data stays on-premise	Requires GPU infrastructure; operational overhead; performance may trail frontier closed models
True open-source (weights + code + data)	OLMo (AllenAI), Pythia (EleutherAI), some smaller research models	Full reproducibility; auditable training; no licence ambiguity	Typically smaller and weaker than open-weight frontier models; limited commercial polish

Open-weight ≠ open-source: Most "open" frontier models (Llama, Mistral, Qwen) release model weights under a licence, but do not release training code, training data, or full methodology. True open-source AI — where weights, code, and data are all public — remains rare at frontier scale.

◆ Use Cases & Applications¶

Domain	Application	Impact
Software Dev	Code generation, debugging, review, autonomous coding agents	Productivity gains of 30–50% are commonly reported, though results vary by task and study; agentic systems now resolve real GitHub issues end-to-end
Content	Writing, marketing copy, design, video production	Democratising creative work; compressing production timelines from days to hours
Enterprise	Document processing, customer support, knowledge retrieval	Cost reduction; 24/7 availability; consistent quality at scale
Healthcare	Generative molecular design for drug discovery, medical imaging analysis, clinical note generation	Accelerating early-stage research; reducing documentation burden on clinicians
Finance	Risk analysis, report generation, fraud detection, earnings summaries	Speed and consistency; human review still required for high-stakes decisions
Education	Personalised tutoring, adaptive content generation, feedback at scale	Extending quality instruction to under-resourced contexts
Science	Generative protein and molecule design (RFDiffusion, ProteinMPNN); materials discovery; simulation; hypothesis generation	Compressing years of experimental iteration into weeks; note that AlphaFold is predictive AI (structure prediction), not generative
Legal	Contract review, case research, first-draft generation	Hours → minutes for document review; accuracy verification remains essential

○ Gotchas & Common Mistakes¶

⚠️ Hallucinations: Models can state false information fluently and with apparent confidence. This is an inherent tendency of how these models generate text, not a bug that can simply be patched. Always verify factual claims, especially in high-stakes use cases.
⚠️ The "wrapper" trap: Building purely on API calls without understanding model internals limits your ability to debug failures, optimise costs, evaluate quality, and customise behaviour — the core skills in production AI engineering.
⚠️ Prompt engineering is necessary but not sufficient: It is a real and non-trivial skill (see Prompt Engineering), but building reliable production systems requires Fine-Tuning, RAG, Evaluation, and proper LLMOps.
⚠️ Benchmark gaming: Models and their makers are incentivised to optimise for published benchmarks. Independent third-party evaluations and task-specific testing on your own data are more informative than leaderboard positions.
⚠️ Cost blindness: Inference costs dominate training costs for most organisations at scale. At production volume, factors like prompt caching (discounts of roughly 60–90% on cached input tokens, depending on provider), batch APIs, model routing (a cheap model for easy tasks, an expensive model for hard ones), and context window discipline can make the difference between a viable and an unviable system.
⚠️ Multimodal confusion: In LLM marketing, "multimodal" almost always describes input capability (the model can read images, audio, video). The output is still text. True multimodal output — e.g. a model that simultaneously generates video and synchronised audio — is a distinct capability found only in dedicated video generation models.
⚠️ "Emergent capabilities" are partly a training artefact: Capabilities that appear to emerge spontaneously from scale in pre-trained models are often significantly shaped or amplified by deliberate post-training choices — instruction tuning data, RLHF reward signals, synthetic data curation. Attributing everything to pre-training scale alone understates the importance of alignment work.

○ Interview Angles¶

Q: What's the difference between discriminative and generative AI?
A: Discriminative models learn P(y|x): "given this input, what is the most likely label?" They draw a boundary between classes. Generative models learn P(x), P(x|y), or the joint P(x, y): they model what the data itself looks like and can produce new samples. In practice, a spam classifier is discriminative; a model that writes emails is generative.
Q: Why did Transformers enable the GenAI revolution?
A: Three reasons: (1) parallelisable training: unlike RNNs, which step through a sequence one token at a time, Transformers process all tokens of a sequence in parallel during training, making large-scale training feasible on GPUs; (2) the attention mechanism captures long-range dependencies that RNNs struggled with; (3) the architecture scales predictably, so more data and compute reliably yield better models. Together, these properties made training on internet-scale datasets tractable.
Q: What's the difference between fine-tuning and RAG?
A: Fine-tuning modifies the model's weights — baking knowledge or behaviour changes in permanently. RAG (Retrieval-Augmented Generation) leaves weights unchanged and instead retrieves relevant external documents at inference time, injecting them into the context window. Fine-tuning is better for style, format, or behaviour changes; RAG is better for keeping knowledge current and reducing hallucination on factual queries. The two are frequently combined.
Q: What is test-time compute scaling?
A: A third scaling axis, distinct from training compute scaling. Instead of training a bigger model, you allocate more computation at inference time — letting the model think through a problem in multiple steps, verify its own reasoning, or explore multiple solution paths before answering. OpenAI's o1, DeepSeek-R1, and extended thinking modes in Claude and Gemini are examples. The insight: for hard reasoning tasks, thinking longer can outperform training larger.
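One simple form of test-time compute scaling is to sample several independent answers and take a majority vote (often called self-consistency). The simulation below is a sketch under strong assumptions: `noisy_solver` is a hypothetical stand-in for one sampled reasoning path, it is correct 40% of the time, and its wrong answers are scattered rather than systematically biased. It does not describe how o1, DeepSeek-R1, or any specific product works.

```python
import random
from collections import Counter

CORRECT = 42
WRONG = [40, 41, 43, 44, 45, 46]

def noisy_solver(rng, p_correct=0.4):
    """Stand-in for one sampled reasoning path from a model."""
    if rng.random() < p_correct:
        return CORRECT
    return rng.choice(WRONG)  # errors are spread over many wrong answers

def majority_vote(n_samples, rng):
    """Spend more inference compute: sample n answers, return the most common."""
    answers = [noisy_solver(rng) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

def accuracy(n_samples, trials=2000, seed=0):
    """Fraction of trials in which the voted answer is correct."""
    rng = random.Random(seed)
    hits = sum(majority_vote(n_samples, rng) == CORRECT for _ in range(trials))
    return hits / trials

acc_1 = accuracy(1)
acc_15 = accuracy(15)
print(f"1 sample per question:   {acc_1:.2f}")
print(f"15 samples per question: {acc_15:.2f}")
```

The same model becomes noticeably more accurate when it is allowed more samples per question, at a proportional cost in latency and tokens. The gain depends on the assumption that errors disagree with each other; if the model is consistently wrong in the same way, voting does not help.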

★ Connections¶

Relationship	Topics
Builds on	Transformers, Attention Mechanism
Leads to	Large Language Models (LLMs), Diffusion Models, Retrieval-Augmented Generation (RAG), AI Agents, Model Merging
Compare with	Traditional ML (discriminative models), Rule-based AI
Cross-domain	Neuroscience (how brains generate), Information Theory

○ Notes¶

The Current Competitive Landscape (28 April 2026)¶

Model lineups shift weekly. Verify against official release notes before using for decisions.

Company	Latest Generally Available Model	Standout Strength
OpenAI	GPT-5.5 / GPT-5.5 Pro (Apr 23 2026)	Best all-rounder; leads Artificial Analysis Intelligence Index; agentic coding + computer use
Anthropic	Claude Opus 4.7 (Apr 16 2026); Mythos Preview (invite-only, Project Glasswing — cybersecurity research only, not publicly available)	Leads SWE-bench Pro (64.3%) & Verified (87.6%); best for agentic and coding workflows
Google DeepMind	Gemini 3.1 Pro (Feb 19 2026)	Leads GPQA Diamond (94.3%) & ARC-AGI-2 (77.1%); best price-to-performance ratio at the frontier ($2/$12 per million tokens)
xAI	Grok 4.3 (Apr 2026)	Unique 4-agent parallel architecture; only frontier model with native real-time X/Twitter data
Meta	Llama 4 Scout / Maverick (open-weight, Apr 5 2025); Muse Spark (closed proprietary, Apr 8 2026; text output only; marks Meta's break from its open-weight strategy)	Llama 4: best open-weight family (Scout: 10M token context, single-GPU deployable); Muse Spark: first model from Meta Superintelligence Labs under Alexandr Wang
Alibaba / Qwen	Qwen 3.5 (122B MoE)	Top open-weight model for coding; runs on consumer hardware (64GB RAM); 73%+ SWE-bench
Mistral AI	Mistral Small 4 / Large 3	Leading European open alternative; Small 4 at $0.15/M input tokens; Voxtral TTS added Mar 2026
DeepSeek	DeepSeek V4 (1.6T params, Apr 2026)	MIT-licensed; rewrites cost economics at frontier scale; competitive with closed models on reasoning
Zhipu AI (Z.ai)	GLM-5.1 (MIT license, Apr 2026)	First open-weight model to reach #1 on SWE-bench Pro; fully self-hostable
Moonshot AI	Kimi K2.6 (Apr 2026)	Leading swarm orchestrator; ties GPT-5.5 on coding benchmarks

Key Trend: 2025–2026 = The Agentic Era¶

The industry has decisively shifted from "chatbots that respond" to "agents that act." AI systems now plan multi-step tasks, use tools, execute code, browse the web, operate computers, and complete long-horizon workflows with minimal human intervention. Every major model release in 2026 is benchmarked primarily on agentic capability — SWE-bench (autonomous software engineering), Terminal-Bench (command-line tasks), and Finance Agent benchmarks have replaced pure language understanding tests as the key evaluation axes.
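The "agents that act" pattern described above usually reduces to a loop: the model proposes an action, the runtime executes a tool, and the observation is fed back until the model returns a final answer. The sketch below illustrates that loop only. `scripted_model` is a hypothetical stand-in for an LLM call, and the tool names and action format are assumptions made for this example, not any framework's API.

```python
# Toy tool registry; real agents expose search, code execution, file access, etc.
TOOLS = {
    "add": lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
}

def scripted_model(goal, observations):
    """Hypothetical stand-in for an LLM: picks the next action from what it has seen."""
    if len(observations) == 0:
        return {"type": "tool", "name": "add", "args": (2, 3)}
    if len(observations) == 1:
        return {"type": "tool", "name": "multiply", "args": (observations[0], 4)}
    return {"type": "final", "answer": observations[-1]}

def run_agent(goal, model, tools, max_steps=5):
    """Plan-act-observe loop with a hard step budget."""
    observations = []
    for _ in range(max_steps):
        action = model(goal, observations)
        if action["type"] == "final":
            return action["answer"], observations
        result = tools[action["name"]](*action["args"])
        observations.append(result)  # feed the tool result back to the model
    raise RuntimeError("step budget exhausted without a final answer")

answer, trace = run_agent("compute (2 + 3) * 4", scripted_model, TOOLS)
print("answer:", answer, "| tool results:", trace)
```

The step budget matters in practice: each iteration is another model call, which is one reason agentic workloads consume far more tokens than single-turn chat.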

Secondary trends shaping the landscape right now:

Open-weight parity — GLM-5.1 and Kimi K2.6 briefly held the #1 spot on SWE-bench Pro over all closed models in April 2026. The performance gap between open-weight and closed models has collapsed on many benchmarks; the remaining advantage of closed models is primarily in reliability, alignment quality, and tooling ecosystem.
Safety as a product differentiator — Anthropic's Project Glasswing (Mythos Preview, limited to cybersecurity defence use cases) and OpenAI's GPT-5.4 Cyber variant signal that the most capable models are now gated behind formal safety programmes rather than being released generally. This is new in 2026.
Pricing fragmentation: The flat-rate $20/month era is ending. Agentic workflows can consume on the order of 100× the tokens of a chat session. This is forcing tiered compute pricing across Claude Code, GitHub Copilot, and Codex, and making prompt caching and model routing non-optional optimisations for teams at scale.
Chinese labs as genuine frontier competitors — DeepSeek V4, GLM-5.1, Qwen 3.5, and Kimi K2.6 are not budget alternatives. Several have led independent benchmark rankings. The "Chinese models are cheaper but weaker" characterisation is no longer accurate.
Model Context Protocol (MCP) — Anthropic's open standard for connecting agents to external tools and data sources has become the de facto industry protocol, adopted across Claude, GitHub Copilot, Cursor, Gemini CLI, and third-party tooling. The key concept: instead of bespoke integrations per tool, agents use MCP as a universal connector.
Meta's strategic pivot — With Muse Spark (Apr 2026), Meta shipped its first closed proprietary model, breaking from the Llama open-weight strategy that had defined its AI identity since 2023. Llama continues under a separate team, but Meta's frontier research output is now closed. This is a significant structural shift in the open-weight ecosystem.
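To make the "universal connector" idea in the MCP item above more tangible: MCP messages are JSON-RPC 2.0, and tools are discovered and invoked through methods such as `tools/list` and `tools/call`. The sketch below is an in-process toy, not an MCP implementation: it omits the transport, the initialisation handshake, and tool schemas, and the `word_count` tool and `handle` function are hypothetical. Check the message shapes against the current MCP specification before relying on them.

```python
import json

# Hypothetical tool registry; a real MCP server describes each tool with a schema.
TOOLS = {
    "word_count": lambda arguments: str(len(arguments["text"].split())),
}

def handle(request_json):
    """Answer a 'tools/list' or 'tools/call' request with a JSON-RPC 2.0 response."""
    request = json.loads(request_json)
    if request["method"] == "tools/list":
        result = {"tools": [{"name": name} for name in TOOLS]}
    elif request["method"] == "tools/call":
        params = request["params"]
        text = TOOLS[params["name"]](params["arguments"])
        result = {"content": [{"type": "text", "text": text}], "isError": False}
    else:
        # -32601 is the standard JSON-RPC "method not found" error code.
        error = {"code": -32601, "message": "Method not found"}
        return json.dumps({"jsonrpc": "2.0", "id": request["id"], "error": error})
    return json.dumps({"jsonrpc": "2.0", "id": request["id"], "result": result})

call = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "word_count", "arguments": {"text": "agents that act"}},
}
response = json.loads(handle(json.dumps(call)))
print(response["result"]["content"][0]["text"])
```

Because every tool is reached through the same request shape, an agent runtime needs one client implementation rather than a bespoke integration per tool.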

★ Sources¶

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). "Attention Is All You Need." Advances in Neural Information Processing Systems 30 (NeurIPS 2017)
Kaplan, J., et al. (2020). "Scaling Laws for Neural Language Models." arXiv:2001.08361
Hoffmann, J., et al. (2022). "Training Compute-Optimal Large Language Models." arXiv:2203.15556 (Chinchilla)
Ouyang, L., et al. (2022). "Training language models to follow instructions with human feedback." NeurIPS 2022 (InstructGPT / RLHF)
OpenAI model changelog (2025–2026) — GPT-5 series releases; GPT-5.5 announcement Apr 23 2026
Google Gemini release notes — Gemini 3.1 Pro announcement Feb 19 2026
Anthropic Claude model cards — Claude Opus 4.7 announcement Apr 16 2026; Mythos Preview Apr 7 2026
Meta Llama 4 announcement — Scout & Maverick released April 5, 2025; Muse Spark released April 8, 2026
McKinsey "State of AI 2025" report
Sebastian Raschka's LLM year-in-review (2025)