Skip to content

Generative AI

Bit: "All models are wrong, but some are generative." — A class of AI that learns to model the distribution of its training data well enough to produce new samples from it. Whether that constitutes "understanding" is one of the most contested questions in the field.


★ TL;DR

  • What: AI systems that create new content (text, images, audio, video, code, 3D models) by learning to model the distribution of training data
  • Why: Transforming every industry — from software development to drug discovery. The defining technology of 2023–2026+
  • Key points: Built on Transformer architecture (2017). Two distinct scaling breakthroughs drove the revolution: (1) training compute scaling — bigger models on more data; and (2) post-training alignment — instruction tuning and RLHF that made raw models actually useful. A third scaling axis, test-time compute, emerged in 2024–2025

★ Overview

Definition

Generative AI is a class of artificial intelligence models that learn to model the underlying distribution of training data, enabling generation of new content consistent with that distribution. Unlike discriminative AI (which learns to classify or predict labels given inputs), generative AI learns what the data itself looks like — and can produce new examples: text, images, music, video, code, 3D models, and more.

Scope

This document covers the GenAI landscape at a high level. Each major sub-area has its own dedicated document:

Area Document
Start here Neural Networks, Linear Algebra for AI, Python for AI, Probability & Statistics for AI, Deep Learning Fundamentals, NLP Fundamentals
How it all works Transformers, Attention Mechanism, Attention Deep Dive, Modern LLM Architectures, Scaling Laws & Pre-training, State Space Models
How text becomes numbers Embeddings, Tokenization
Text generation Large Language Models (LLMs), Reasoning Models & Test-Time Compute, LLM Landscape & Model Selection, Hallucination Detection & Mitigation
Image generation Diffusion Models
Beyond text Multimodal AI, Computer Vision Fundamentals for AI Builders
Making models work for you Retrieval-Augmented Generation (RAG), Graph RAG & Advanced Retrieval, Fine-Tuning LLMs, Advanced Fine-Tuning for LLM Adaptation, Embedding Fine-Tuning, Long-Context Engineering, Structured Outputs & Constrained Generation, AI Agents, Agent Memory Systems, Multi-Agent Architectures, Agent Evaluation & Observability, Prompt Engineering, Function Calling, Structured Output & Tool Use, Context Engineering & Long Context
Alignment & training Reinforcement Learning for LLM Alignment, Synthetic Data & Data Engineering for LLMs, Knowledge Distillation & Model Compression, Continual Learning & Lifelong AI
Agentic infrastructure Agentic Protocols & Frameworks
Building with GenAI GenAI Tools & Infrastructure, Vector Databases, Cloud ML Services & Managed AI Platforms, Distributed Systems Fundamentals for AI, ML Experiment & Data Management
AI applications Code Generation & AI-Assisted Development, AI Coding Agents, Conversational AI & Dialogue Systems, API Design for AI Applications, AI Product Management Fundamentals, Voice AI & Speech, AI UX Patterns
Going to production LLMOps & Production Deployment, AI System Design for GenAI Applications, Docker & Kubernetes for GenAI Deployment, Model Serving for LLM Applications, Monitoring & Observability for GenAI Systems, CI/CD for ML and LLM Systems, Cost Optimization for GenAI Systems, Classical ML for GenAI Builders, Latency & Throughput Engineering for AI Systems, LLM Routing & Model Selection, Guardrails & Content Filtering, Data Flywheel Design, Document Parsing & Extraction
Measuring quality LLM Evaluation & Benchmarks, LLM Evaluation Deep Dive, Retrieval Evaluation, System Design for AI Interviews
Making it fast & cheap Inference Optimization, GPU & CUDA Programming for AI Engineers, Distributed Inference & Serving Architecture
Keeping it safe Ethics, Safety & Alignment, AI Regulation for Builders, Adversarial ML & AI Security, OWASP Top 10 for LLM Applications, Prompt Injection Deep Dive, MCP Security & Tool Trust
Research frontiers Mechanistic Interpretability, Distributed Training & Training Infrastructure, Research Methodology & Paper Reading for AI
Career & job readiness GenAI Career Roles — Complete Reference (2026), AI Engineer, Generative AI Engineer, LLM Engineer, RAG Engineer, Agentic AI Engineer, ML Engineer, MLOps / LLMOps Engineer

Significance

  • Market: Global GenAI market size is estimated at ~ $161 billion in April 2026 (multiple analyst estimates; definitions and methodologies vary)
  • Adoption: Growing rapidly across enterprise, with deployment maturity ranging from experimental chatbots to production agentic systems
  • Job impact: Creating new roles (AI engineers, LLMOps, AI safety researchers) while significantly transforming existing software and knowledge-work roles
  • Scientific impact: Accelerating drug discovery via generative molecular design (RFDiffusion, ProteinMPNN), materials science, and hypothesis generation at scale

Prerequisites


★ Deep Dive

The GenAI Landscape (2025-2026)

Time-sensitive vendor examples in this section were last reviewed in 2026-04. Treat them as a snapshot, not a permanent ranking.

Generative AI
├── 🔤 Text/Language
│   ├── Large Language Models (LLMs)
│   │   ├── Closed-source: GPT-5.5, Gemini 3.1 Pro, Claude Opus 4.7, Grok 4.3
│   │   └── Open-weight: Llama 4 (Scout/Maverick), Qwen 3.5, Mistral, DeepSeek V4, GLM-5.1
│   └── Reasoning modes — extended thinking at inference time, built into frontier models
│       (e.g. GPT-5.5 Thinking, Claude extended thinking, Gemini Deep Think)
│       See: Test-Time Compute scaling
├── 🖼️ Image Generation
│   ├── Diffusion models: Stable Diffusion 3.5, FLUX.1.1 Pro, Midjourney v7
│   ├── Autoregressive/Hybrid: GPT Image 1 (OpenAI, replaces DALL-E), Ideogram 3.0
│   ├── GANs (legacy — largely superseded by diffusion; still used in niche tasks)
│   └── Image editing: Inpainting, Outpainting, Style Transfer
├── 🎥 Video Generation
│   ├── Veo 3.1 (Google), Runway Gen-4.5, Kling 3.0, Seedance 2.0
│   └── Note: Sora was discontinued March 2026; API winds down September 2026
├── 🔊 Audio / Music
│   ├── Text-to-speech: ElevenLabs, Bark, Voxtral TTS (Mistral)
│   ├── Music generation: Suno v5.5, Udio v4, Google Lyria 3
│   └── Voice cloning
├── 🧊 3D Generation
│   ├── Text/image-to-3D mesh: Meshy, Rodin, Tripo3D
│   └── Applications: games, product visualisation, VR/AR
├── 🧬 Multimodal Input
│   ├── All frontier LLMs now accept multimodal input (text + image + audio + video)
│   ├── Output from LLMs is text — "multimodal" in marketing almost always means input, not output
│   └── True joint output (e.g. video + synchronised audio) exists only in dedicated video models
└── 🤖 Agentic AI (2025–2026's defining trend)
    ├── Agents that plan, reason, and act autonomously across multiple steps
    ├── Tool use, function calling, code execution, web browsing, computer use
    ├── Coding agents: Claude Code, OpenAI Codex, GitHub Copilot, Gemini CLI
    └── Multi-agent systems and orchestration frameworks

The Key Paradigm Shifts

Era Paradigm Example
Pre-2017 Rule-based / Statistical NLP N-grams, RNNs, LSTMs
2017 Transformer architecture "Attention Is All You Need" (Vaswani et al., NeurIPS 2017)
2018–2020 Pre-training + Fine-tuning BERT (encoder), GPT-2 (decoder); transfer learning goes mainstream
2020–2022 Training compute scaling GPT-3 (175B params, May 2020); Kaplan et al. scaling laws (2020); Chinchilla optimal compute laws (2022)
2022–2023 Instruction tuning + RLHF + Chat interface InstructGPT → ChatGPT; alignment makes raw models usable
2023–2024 Multimodal input + Reasoning GPT-4, Gemini; image understanding; chain-of-thought at scale
2024–2025 Test-time compute scaling o1, DeepSeek-R1; thinking longer at inference rivals larger models
2025–2026 Agentic AI + Hybrid techniques Autonomous coding agents, multi-step workflows, RAG + LoRA combos

How Generative AI Actually Works (Simplified)

  1. Raw data: Massive datasets — internet text, images, code, audio, etc.
  2. Tokenisation: Text is split into tokens (subword units); images into patches. This is how raw data becomes model input. See Tokenization
  3. Architecture: Usually Transformer-based. See Transformers
  4. Pre-training: The model learns to predict the next token (text) or denoise corrupted inputs (images) on enormous data — building a compressed representation of the world
  5. Post-training / Alignment: Instruction fine-tuning and RLHF/DPO shape the pre-trained model into something that follows instructions and reduces harmful outputs. This stage is as important as pre-training for producing a usable model
  6. Inference: Given an input, the model generates output token-by-token (text) or step-by-step (images). Extended thinking modes allow additional computation at this stage, trading latency for quality

The core insight: Pre-training at scale surfaces broad capabilities; post-training alignment directs them. The combination — not either alone — is what makes modern GenAI systems useful.


◆ Types & Classifications

By Output Modality

⚠️ This table classifies by output modality only. Code is a text output, not a separate modality. All frontier LLMs accept multimodal input but produce text as their primary output — multimodal input capability and multimodal output capability are fundamentally different things.

Output Modality What It Generates Leading Models (April 2026)
Text Natural language, code, structured data GPT-5.5, GPT-5.4, Claude Opus 4.7, Claude Sonnet 4.6, Gemini 3.1 Pro, Grok 4.3, Llama 4 Scout/Maverick, DeepSeek V4, GLM-5.1, Qwen 3.5
Image Still images from text or image prompts Midjourney v7, FLUX.1.1 Pro, GPT Image 1 (replaces DALL-E, APIs sunset May 2026), Stable Diffusion 3.5, Adobe Firefly, Ideogram 3.0
Video Video clips from text or image prompts Veo 3.1, Runway Gen-4.5, Kling 3.0, Seedance 2.0 (Sora discontinued March 2026; API winds down Sep 2026)
Audio Music, song, voice, sound effects Suno v5.5, Udio v4, ElevenLabs, Google Lyria 3, MiniMax Music 2.5
3D Meshes, textured 3D assets Meshy, Rodin, Tripo3D

By Architecture

Architecture How It Works Used For
Transformer — Decoder-only (Autoregressive) Predicts next token sequentially; attends only to prior context LLMs: GPT, Claude, Llama, Gemini
Transformer — Encoder-only (Bidirectional) Attends to full context in both directions; not generative by default Embeddings, classification, retrieval (BERT, sentence-transformers)
Diffusion Models Learns to iteratively denoise corrupted inputs; can be conditioned on text Image and video generation (FLUX, Stable Diffusion, Veo)
GANs Generator vs. Discriminator adversarial training Image synthesis; largely superseded by diffusion for generation; still used in domain adaptation and some video tasks
VAEs Encode inputs to a compressed latent space; decode back; used as a component within diffusion pipelines Latent compression inside diffusion models (e.g. Stable Diffusion's image encoder/decoder); standalone generation use has largely been replaced
MoE (Mixture of Experts) Route each token to a subset of specialised sub-networks; only a fraction of parameters active per token Efficient scaling of large models: Llama 4, DeepSeek V4, Mixtral, Grok 4.3

By Access Model

Type Examples Pros Cons
Closed-source API GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro Best frontier performance; no infrastructure burden; fast iteration Per-token cost at scale; no weight access; vendor lock-in; data leaves your environment
Open-weight Llama 4, Mistral Large 3, Qwen 3.5, DeepSeek V4, GLM-5.1 Self-hostable; fine-tunable; no per-token API fees; data stays on-premise Requires GPU infrastructure; operational overhead; performance may trail frontier closed models
True open-source (weights + code + data) OLMo (AllenAI), Pythia (EleutherAI), some smaller research models Full reproducibility; auditable training; no licence ambiguity Typically smaller and weaker than open-weight frontier models; limited commercial polish

Open-weight ≠ open-source: Most "open" frontier models (Llama, Mistral, Qwen) release model weights under a licence, but do not release training code, training data, or full methodology. True open-source AI — where weights, code, and data are all public — remains rare at frontier scale.


◆ Use Cases & Applications

Domain Application Impact
Software Dev Code generation, debugging, review, autonomous coding agents Reported 30–50% productivity gains; agentic systems now resolve real GitHub issues end-to-end
Content Writing, marketing copy, design, video production Democratising creative work; compressing production timelines from days to hours
Enterprise Document processing, customer support, knowledge retrieval Cost reduction; 24/7 availability; consistent quality at scale
Healthcare Generative molecular design for drug discovery, medical imaging analysis, clinical note generation Accelerating early-stage research; reducing documentation burden on clinicians
Finance Risk analysis, report generation, fraud detection, earnings summaries Speed and consistency; human review still required for high-stakes decisions
Education Personalised tutoring, adaptive content generation, feedback at scale Extending quality instruction to under-resourced contexts
Science Generative protein and molecule design (RFDiffusion, ProteinMPNN); materials discovery; simulation; hypothesis generation Compressing years of experimental iteration into weeks; note that AlphaFold is predictive AI (structure prediction), not generative
Legal Contract review, case research, first-draft generation Hours → minutes for document review; accuracy verification remains essential

○ Gotchas & Common Mistakes

  • ⚠️ Hallucinations: Models generate false information with full confidence. This is an architectural tendency, not a bug to be patched. Always verify factual claims, especially for high-stakes use cases.
  • ⚠️ The "wrapper" trap: Building purely on API calls without understanding model internals limits your ability to debug failures, optimise costs, evaluate quality, and customise behaviour — the core skills in production AI engineering.
  • ⚠️ Prompt engineering is necessary but not sufficient: It is a real and non-trivial skill (see Prompt Engineering), but building reliable production systems requires Fine-Tuning, RAG, Evaluation, and proper LLMOps.
  • ⚠️ Benchmark gaming: Models and their makers are incentivised to optimise for published benchmarks. Independent third-party evaluations and task-specific testing on your own data are more informative than leaderboard positions.
  • ⚠️ Cost blindness: Inference costs dominate training costs for most organisations at scale. At production volume, factors like prompt caching (60–90% cost reduction), batch APIs, model routing (cheap model for easy tasks, expensive model for hard ones), and context window discipline can make the difference between a viable and unviable system.
  • ⚠️ Multimodal confusion: In LLM marketing, "multimodal" almost always describes input capability (the model can read images, audio, video). The output is still text. True multimodal output — e.g. a model that simultaneously generates video and synchronised audio — is a distinct capability found only in dedicated video generation models.
  • ⚠️ "Emergent capabilities" are partly a training artefact: Capabilities that appear to emerge spontaneously from scale in pre-trained models are often significantly shaped or amplified by deliberate post-training choices — instruction tuning data, RLHF reward signals, synthetic data curation. Attributing everything to pre-training scale alone understates the importance of alignment work.

○ Interview Angles

  • Q: What's the difference between discriminative and generative AI?
  • A: Discriminative models learn P(y|x) — "given this input, what is the most likely label?" They draw a boundary between classes. Generative models learn P(x) or P(x|y) — they model what the data itself looks like and can produce new samples. In practice: a spam classifier is discriminative; a model that writes emails is generative.

  • Q: Why did Transformers enable the GenAI revolution?

  • A: Three reasons: (1) parallelisable training — unlike RNNs, Transformers process all tokens simultaneously, making large-scale training feasible on GPUs; (2) the attention mechanism captures long-range dependencies that RNNs struggled with; (3) the architecture scales predictably — more data and compute reliably yield better models. All three properties together made training on internet-scale datasets tractable.

  • Q: What's the difference between fine-tuning and RAG?

  • A: Fine-tuning modifies the model's weights — baking knowledge or behaviour changes in permanently. RAG (Retrieval-Augmented Generation) leaves weights unchanged and instead retrieves relevant external documents at inference time, injecting them into the context window. Fine-tuning is better for style, format, or behaviour changes; RAG is better for keeping knowledge current and reducing hallucination on factual queries. The two are frequently combined.

  • Q: What is test-time compute scaling?

  • A: A third scaling axis, distinct from training compute scaling. Instead of training a bigger model, you allocate more computation at inference time — letting the model think through a problem in multiple steps, verify its own reasoning, or explore multiple solution paths before answering. OpenAI's o1, DeepSeek-R1, and extended thinking modes in Claude and Gemini are examples. The insight: for hard reasoning tasks, thinking longer can outperform training larger.

★ Connections

Relationship Topics
Builds on Transformers, Attention Mechanism
Leads to Large Language Models (LLMs), Diffusion Models, Retrieval-Augmented Generation (RAG), AI Agents, Model Merging
Compare with Traditional ML (discriminative models), Rule-based AI
Cross-domain Neuroscience (how brains generate), Information Theory

○ Notes

The Current Competitive Landscape (28 April 2026)

Model lineups shift weekly. Verify against official release notes before using for decisions.

Company Latest Generally Available Model Standout Strength
OpenAI GPT-5.5 / GPT-5.5 Pro (Apr 23 2026) Best all-rounder; leads Artificial Analysis Intelligence Index; agentic coding + computer use
Anthropic Claude Opus 4.7 (Apr 16 2026); Mythos Preview (invite-only, Project Glasswing — cybersecurity research only, not publicly available) Leads SWE-bench Pro (64.3%) & Verified (87.6%); best for agentic and coding workflows
Google DeepMind Gemini 3.1 Pro (Feb 19 2026) Leads GPQA Diamond (94.3%) & ARC-AGI-2 (77.1%); best price-to-performance ratio at the frontier ($2/$12 per million tokens)
xAI Grok 4.3 (Apr 2026) Unique 4-agent parallel architecture; only frontier model with native real-time X/Twitter data
Meta Llama 4 Scout / Maverick (open-weight, Apr 5 2025); Muse Spark (closed proprietary, Apr 8 2026 — text output only, marks Meta's break from open-source strategy) Llama 4: best open-weight family (Scout: 10M token context, single-GPU deployable); Muse Spark: first model from Meta Superintelligence Labs under Alexandr Wang
Alibaba / Qwen Qwen 3.5 (122B MoE) Top open-weight model for coding; runs on consumer hardware (64GB RAM); 73%+ SWE-bench
Mistral AI Mistral Small 4 / Large 3 Leading European open alternative; Small 4 at $0.15/M input tokens; Voxtral TTS added Mar 2026
DeepSeek DeepSeek V4 (1.6T params, Apr 2026) MIT-licensed; rewrites cost economics at frontier scale; competitive with closed models on reasoning
Zhipu AI (Z.ai) GLM-5.1 (MIT license, Apr 2026) First open-weight model to reach #1 on SWE-bench Pro; fully self-hostable
Moonshot AI Kimi K2.6 (Apr 2026) Leading swarm orchestrator; ties GPT-5.5 on coding benchmarks

Key Trend: 2025–2026 = The Agentic Era

The industry has decisively shifted from "chatbots that respond" to "agents that act." AI systems now plan multi-step tasks, use tools, execute code, browse the web, operate computers, and complete long-horizon workflows with minimal human intervention. Every major model release in 2026 is benchmarked primarily on agentic capability — SWE-bench (autonomous software engineering), Terminal-Bench (command-line tasks), and Finance Agent benchmarks have replaced pure language understanding tests as the key evaluation axes.

Secondary trends shaping the landscape right now:

  • Open-weight parity — GLM-5.1 and Kimi K2.6 briefly held the #1 spot on SWE-bench Pro over all closed models in April 2026. The performance gap between open-weight and closed models has collapsed on many benchmarks; the remaining advantage of closed models is primarily in reliability, alignment quality, and tooling ecosystem.
  • Safety as a product differentiator — Anthropic's Project Glasswing (Mythos Preview, limited to cybersecurity defence use cases) and OpenAI's GPT-5.4 Cyber variant signal that the most capable models are now gated behind formal safety programmes rather than being released generally. This is new in 2026.
  • Pricing fragmentation — The flat-rate $20/month era is ending. Agentic workflows consume 100× the tokens of a chat session. This is forcing tiered compute pricing across Claude Code, GitHub Copilot, and Codex, and making prompt caching and model routing non-optional optimisations for teams at scale.
  • Chinese labs as genuine frontier competitors — DeepSeek V4, GLM-5.1, Qwen 3.5, and Kimi K2.6 are not budget alternatives. Several have led independent benchmark rankings. The "Chinese models are cheaper but weaker" characterisation is no longer accurate.
  • Model Context Protocol (MCP) — Anthropic's open standard for connecting agents to external tools and data sources has become the de facto industry protocol, adopted across Claude, GitHub Copilot, Cursor, Gemini CLI, and third-party tooling. The key concept: instead of bespoke integrations per tool, agents use MCP as a universal connector.
  • Meta's strategic pivot — With Muse Spark (Apr 2026), Meta shipped its first closed proprietary model, breaking from the Llama open-weight strategy that had defined its AI identity since 2023. Llama continues under a separate team, but Meta's frontier research output is now closed. This is a significant structural shift in the open-weight ecosystem.

★ Sources

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). "Attention Is All You Need." Advances in Neural Information Processing Systems 30 (NeurIPS 2017)
  • Kaplan, J., et al. (2020). "Scaling Laws for Neural Language Models." arXiv:2001.08361
  • Hoffmann, J., et al. (2022). "Training Compute-Optimal Large Language Models." arXiv:2203.15556 (Chinchilla)
  • Ouyang, L., et al. (2022). "Training language models to follow instructions with human feedback." NeurIPS 2022 (InstructGPT / RLHF)
  • OpenAI model changelog (2025–2026) — GPT-5 series releases; GPT-5.5 announcement Apr 23 2026
  • Google Gemini release notes — Gemini 3.1 Pro announcement Feb 19 2026
  • Anthropic Claude model cards — Claude Opus 4.7 announcement Apr 16 2026; Mythos Preview Apr 7 2026
  • Meta Llama 4 announcement — Scout & Maverick released April 5, 2025; Muse Spark released April 8, 2026
  • McKinsey "State of AI 2025" report
  • Sebastian Raschka's LLM year-in-review (2025)