GPU & CUDA Programming for AI Engineers¶
Modern AI depends on GPUs, but high-level frameworks hide the hardware until performance forces you to care. This note is about that moment.
★ TL;DR¶
- What: The hardware and programming concepts behind GPU-accelerated AI workloads.
- Why: Many training and inference bottlenecks make sense only if you understand memory hierarchy, parallel execution, and kernel behavior.
- Key point: In AI systems, moving data efficiently is often harder than doing the math.
★ Overview¶
Definition¶
A GPU is a massively parallel processor optimized for throughput-oriented numeric computation. CUDA is NVIDIA's programming model and toolchain for writing GPU-accelerated programs.
Scope¶
This note gives AI engineers a practical systems view: how GPU execution works, what kernels do, why memory matters, and how that connects to LLM performance.
Significance¶
- CUDA literacy explains why some inference optimizations work and others do not.
- AI infrastructure, inference, and compiler roles depend directly on this layer.
- Even application engineers benefit from understanding hardware-shaped trade-offs.
Prerequisites¶
- Comfort with tensors and basic model training/inference; the Transformers and Inference Optimization notes (see Connections below) provide useful background.
★ Deep Dive¶
GPU Mental Model¶
GPUs are built to run many operations in parallel:
- thousands of lightweight threads
- high memory bandwidth
- hardware specialized for matrix-heavy workloads
They are excellent for dense tensor math and poor at control-heavy, branchy logic.
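A quick way to feel this difference is a toy comparison of a dense matmul on CPU versus GPU. This is an illustrative sketch, assuming a CUDA-capable machine with PyTorch installed; the matrix size is arbitrary.
import time
import torch

# Toy contrast: dense matmul, the kind of work GPUs are built for.
a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

t0 = time.perf_counter()
a @ b
cpu_s = time.perf_counter() - t0

a_gpu, b_gpu = a.cuda(), b.cuda()
torch.cuda.synchronize()              # GPU work is asynchronous; sync before timing
t0 = time.perf_counter()
a_gpu @ b_gpu
torch.cuda.synchronize()              # wait for the kernel to actually finish
gpu_s = time.perf_counter() - t0

print(f"CPU: {cpu_s:.3f}s, GPU: {gpu_s:.4f}s")
Note the synchronize calls: forgetting them is a classic way to "measure" a GPU as impossibly fast, because the Python call returns before the kernel completes.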
Core Concepts¶
| Concept | Meaning | Why It Matters |
|---|---|---|
| Kernel | Function launched on the GPU | Unit of GPU work |
| Thread block | Group of cooperating threads | Shares fast on-chip memory |
| Warp | Hardware scheduling group of threads | Divergence hurts efficiency |
| Global memory | Large but slower GPU memory | Main bottleneck for many LLM workloads |
| Shared memory | Small fast block-local memory | Useful for reuse and tiling |
| Occupancy | Ratio of active warps to the hardware maximum | Higher is not always better, but often useful |
Memory Hierarchy Matters¶
CPU RAM
-> PCIe / NVLink transfer
-> GPU global memory
-> shared memory / registers
-> arithmetic units
AI performance often depends on reducing the cost of moving data between these levels.
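The transfer step at the top of this hierarchy is easy to measure directly. The sketch below (illustrative sizes, assuming a CUDA-capable machine with PyTorch) times a host-to-device copy from pageable versus pinned CPU memory; pinned memory is page-locked, which allows faster DMA transfers.
import torch

# Compare host-to-device copy time for pageable vs pinned host memory.
x_pageable = torch.randn(64_000_000)                # ~256 MB of fp32 in pageable RAM
x_pinned = torch.randn(64_000_000).pin_memory()     # page-locked buffer

def time_h2d(x: torch.Tensor) -> float:
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    x.to("cuda", non_blocking=True)
    end.record()
    torch.cuda.synchronize()                        # wait for the async copy to finish
    return start.elapsed_time(end)                  # milliseconds

print(f"pageable: {time_h2d(x_pageable):.1f} ms")
print(f"pinned:   {time_h2d(x_pinned):.1f} ms")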
Why LLMs Stress GPUs¶
LLM workloads include:
- large matrix multiplies
- attention kernels
- KV-cache reads and writes
- memory-heavy decode loops
Prefill tends to be more compute-heavy. Decode often becomes memory-bound.
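A rough back-of-envelope estimate shows why decode is memory-bound. The numbers below are illustrative A100-class peaks; the model size and batch size 1 are assumptions for the sketch.
# Why single-stream decode is memory-bound: assume an 8B-param model in bf16.
params = 8e9
bytes_moved = params * 2          # every weight is read once per generated token
flops = 2 * params                # ~2 FLOPs per parameter (multiply + add)

# Illustrative A100-class peak figures:
peak_bw = 2.0e12                  # ~2 TB/s HBM bandwidth
peak_compute = 312e12             # ~312 TFLOP/s bf16 tensor-core throughput

time_memory = bytes_moved / peak_bw        # ~8 ms just to stream the weights
time_compute = flops / peak_compute        # ~0.05 ms of arithmetic
print(f"memory-limited: {time_memory*1e3:.2f} ms, compute-limited: {time_compute*1e3:.3f} ms")
At batch size 1 the arithmetic is essentially free next to the weight traffic; batching amortizes the same weight reads across many tokens, which is one reason it changes throughput so dramatically.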
CUDA Performance Ideas You Should Recognize¶
| Idea | Practical Meaning |
|---|---|
| Kernel fusion | Combine steps to reduce memory traffic |
| Tiling | Reuse data in fast memory before reloading |
| Coalesced access | Neighboring threads access neighboring memory for efficiency |
| Asynchronous execution | Overlap transfer and compute where possible |
| CUDA graphs | Reduce launch overhead for repeated execution patterns |
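Several of these ideas are reachable from Python without writing CUDA. A minimal sketch, assuming PyTorch 2.x with a CUDA build; the function and shapes are arbitrary illustrations.
import torch

# Kernel fusion via torch.compile: the elementwise ops below would normally
# run as separate kernels, each making a full pass over global memory.
# The compiler can fuse them into one kernel with a single read/write pass.
def gelu_bias(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    return torch.nn.functional.gelu(x + bias) * 0.5

fused = torch.compile(gelu_bias)
x = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, device="cuda")
out = fused(x, b)        # first call compiles; later calls reuse the fused kernel
Passing mode="reduce-overhead" to torch.compile additionally captures CUDA graphs, which cuts per-kernel launch overhead for repeated execution patterns.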
Minimal CUDA Kernel Example¶
// Each thread computes one element of out = a + b.
__global__ void add_vectors(const float* a, const float* b, float* out, int n) {
    // Global index: which element this thread owns.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {  // guard: the last block may have surplus threads
        out[idx] = a[idx] + b[idx];
    }
}
This example is simple, but it shows the pattern:
- many threads launched at once
- each thread works on one slice of data
- boundary checks matter
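The launch configuration behind this pattern is just ceiling division. A quick sketch of the arithmetic (hypothetical sizes, written in Python for convenience):
# Hypothetical launch configuration for add_vectors above.
n = 1_000_000
threads_per_block = 256                                    # common choice; a multiple of the warp size (32)
blocks = (n + threads_per_block - 1) // threads_per_block  # ceil division -> 3907 blocks
# Host side would launch: add_vectors<<<blocks, threads_per_block>>>(a, b, out, n)
# 3907 * 256 = 1_000_192 threads, so 192 threads fail the `idx < n` test --
# exactly why the boundary check in the kernel matters.
print(blocks)  # 3907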
Practical AI Relevance¶
Understanding CUDA helps explain:
- why FlashAttention is faster
- why KV-cache layout matters
- why quantization can reduce cost dramatically
- why batching changes throughput
- why some models saturate memory before compute
Profiling Mindset¶
Ask:
- Is the workload compute-bound or memory-bound?
- Are kernels too small or too fragmented?
- Is data transfer dominating?
- Is GPU utilization low because of the software stack above it?
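These questions are answerable with standard tools. Here is a minimal sketch using PyTorch's built-in profiler; the model and input shapes are illustrative, and a CUDA device is assumed.
import torch
from torch.profiler import profile, ProfilerActivity

# Illustrative workload: any module and input would do.
model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(64, 4096, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        model(x)

# Sort by GPU time: do a few big kernels dominate, or many tiny ones?
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
For deeper answers (achieved bandwidth, occupancy), Nsight Systems and Nsight Compute, listed under resources below, pick up where framework-level profiling stops.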
◆ Quick Reference¶
| Question | Heuristic |
|---|---|
| Slow decode on large model | suspect memory bandwidth or KV-cache behavior |
| Low GPU utilization | inspect batching, kernel launch shape, or host-side bottlenecks |
| Model fits but still underperforms | profile memory movement, not just FLOPs |
| CPU-heavy preprocessing | may starve the GPU |
| Lots of tiny kernels | fusion or graph capture may help |
○ Gotchas & Common Mistakes¶
- GPU utilization percentages are useful but incomplete.
- Compute throughput and memory throughput are different bottlenecks.
- Kernel-level optimization is usually wasted if the higher-level architecture is wrong.
- CUDA knowledge helps diagnosis, but not every team needs custom kernels.
○ Interview Angles¶
- Q: Why are LLM decode steps often memory-bound?
- A: Each generated token requires repeatedly loading weights and KV-cache state, so memory movement can dominate arithmetic. That is why layout, caching, and serving-engine design matter so much.
- Q: What is the practical value of understanding CUDA for an AI engineer?
- A: It helps you reason about hardware bottlenecks, choose the right optimizations, and communicate effectively with systems or inference teams when performance issues appear.
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | Transformers, Inference Optimization |
| Leads to | Distributed Training for Large Models, advanced inference engineering |
| Compare with | CPU execution, high-level framework-only view |
| Cross-domain | computer architecture, compilers, HPC |
★ Code & Implementation¶
GPU Memory Profiling with PyTorch¶
# pip install torch>=2.0 transformers
# ⚠️ Last tested: 2026-04 | Requires: torch>=2.0, transformers
import torch
def gpu_memory_report():
"""Print current GPU memory usage."""
if not torch.cuda.is_available():
print("No GPU available")
return
allocated = torch.cuda.memory_allocated() / 1e9
reserved = torch.cuda.memory_reserved() / 1e9
max_allocated = torch.cuda.max_memory_allocated() / 1e9
total = torch.cuda.get_device_properties(0).total_memory / 1e9
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"Allocated: {allocated:.2f} GB")
print(f"Reserved: {reserved:.2f} GB")
print(f"Peak: {max_allocated:.2f} GB")
print(f"Total: {total:.2f} GB")
print(f"Free: {total - reserved:.2f} GB")
# Example: profile loading a model
from transformers import AutoModelForCausalLM
gpu_memory_report() # Before
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B", torch_dtype=torch.bfloat16, device_map="cuda"
)
gpu_memory_report() # After
# Expected output (approximate):
# GPU: NVIDIA A100 80GB
# Allocated: 16.06 GB (~8B params × 2 bytes in bf16)
# Reserved: 16.25 GB
# Peak: 16.06 GB
# Total: 80.00 GB
# Free: 63.75 GB
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| CUDA OOM | `RuntimeError: CUDA out of memory` | Model + activations + KV-cache exceed GPU memory | Reduce batch size, enable gradient checkpointing, use quantization |
| Memory-bound decode | Low GPU compute utilization, high memory bandwidth usage | Each token loads full KV-cache, bottlenecked by HBM bandwidth | Use FlashAttention, PagedAttention (vLLM), quantized KV-cache |
| Kernel launch overhead | Many tiny operations, GPU mostly idle | Thousands of small kernels with CPU launch overhead | CUDA graphs, kernel fusion, torch.compile |
| PCIe bottleneck | CPU preprocessing faster than GPU transfer | Large data transfers over PCIe instead of NVLink | Prefetch data, pin memory, overlap transfer with compute |
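As one concrete mitigation from the last row, here is a sketch of double-buffered prefetching: the transfer of batch i+1 overlaps compute on batch i. It assumes CUDA-enabled PyTorch; the module and shapes are illustrative.
import torch

# Double-buffered prefetch: copy runs on a side stream while compute runs on
# the default stream. non_blocking copies are truly async only from pinned memory.
copy_stream = torch.cuda.Stream()
model = torch.nn.Linear(4096, 4096).cuda()
host_batches = [torch.randn(64, 4096).pin_memory() for _ in range(8)]

def prefetch(host_batch):
    with torch.cuda.stream(copy_stream):
        return host_batch.to("cuda", non_blocking=True)

current = prefetch(host_batches[0])
for i in range(len(host_batches)):
    torch.cuda.current_stream().wait_stream(copy_stream)  # copy of `current` must finish
    if i + 1 < len(host_batches):
        nxt = prefetch(host_batches[i + 1])               # starts while compute runs below
    out = model(current)                                  # compute overlaps the next copy
    current = nxt if i + 1 < len(host_batches) else None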
◆ Hands-On Exercises¶
Exercise 1: GPU Memory Estimation¶
Goal: Build intuition for GPU memory requirements
Time: 20 minutes
Steps:
1. Calculate memory for a 7B model in fp32, fp16, int8, and int4
2. With the model loaded in bf16, estimate remaining memory for the KV-cache
3. Calculate the max batch size × sequence length that fits in the remaining memory
4. Compare your estimates with actual usage using the profiling code above
Expected Output: Memory estimation table matching real GPU measurements within 10%
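A starter sketch for step 1 (the per-dtype byte counts are standard; the 7B parameter count is the assumption here):
# Step-1 starter: weight memory for a 7B-parameter model by dtype.
PARAMS = 7e9
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

for dtype, nbytes in BYTES_PER_PARAM.items():
    print(f"{dtype}: {PARAMS * nbytes / 1e9:.1f} GB")
# fp32: 28.0 GB, fp16: 14.0 GB, int8: 7.0 GB, int4: 3.5 GB
Remember this counts weights only; activations, KV-cache, and allocator overhead come on top, which is what steps 2-4 explore.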
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 📄 Docs | NVIDIA CUDA Programming Guide | Official reference for CUDA concepts and programming model |
| 🎓 Course | Stanford CS149: Parallel Computing | Deep dive into GPU parallelism, memory hierarchy, and scheduling |
| 📄 Paper | Dao et al. "FlashAttention" (2022) | Shows how IO-aware kernel design transforms attention performance |
| 🔧 Hands-on | NVIDIA Nsight Systems / Compute | Essential GPU profiling tools for identifying bottlenecks |
| 🎥 Video | Jeremy Howard — "CUDA Programming" (fast.ai) | Practical introduction to CUDA for ML engineers |
★ Sources¶
- NVIDIA CUDA Programming Guide — https://docs.nvidia.com/cuda/
- NVIDIA Nsight Documentation — https://developer.nvidia.com/nsight-systems
- Dao et al. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (2022)
- Inference Optimization