Research And Infrastructure Roles

Use this guide if you want to build the deepest layers of AI systems: training stacks, inference engines, research experiments, and high-performance infrastructure.


Included Roles

| Role | Layer | Best Fit | What Differentiates It |
| --- | --- | --- | --- |
| Inference Optimization Engineer | Layer 3 | systems-minded performance work | latency, throughput, kernels, batching, memory |
| Foundation Model Engineer | Layer 2 | pretraining and adaptation at scale | training data, scaling, alignment, long-run experiments |
| AI Research Scientist | Layer 2 | frontier experimentation and novel methods | hypothesis design and paper-grade rigor |
| Applied AI Scientist | Layer 2 | research translated into practical model gains | strong experimentation plus delivery sense |
| AI Infra / Platform Engineer | Layer 1 | clusters, serving platforms, and reliability | platform abstractions, fleet operation, GPU orchestration |
| AI Compiler / Kernel Engineer | Layer 1 | deepest performance stack | compilers, kernels, hardware-near optimization |

Learning Path

Phase 1: Foundation

Complete Part 1 of the Learning Path first, then commit to the deeper systems and research path.

Phase 2: Shared Core

| # | Topic | Note | Priority | Est. Time |
| --- | --- | --- | --- | --- |
| 1 | Scaling laws and pretraining | scaling-laws-and-pretraining | Must | 4h |
| 2 | Distributed training | distributed-training | Must | 4h |
| 3 | Training infrastructure | training-infrastructure | Must | 3h |
| 4 | GPU and CUDA programming | gpu-cuda-programming | Must | 4h |
| 5 | Distributed inference and serving architecture | distributed-inference-and-serving-architecture | Must | 3h |
| 6 | Mechanistic interpretability | interpretability | Must | 2h |
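
To get a rough feel for the arithmetic behind the scaling-laws topic in the table above, the sketch below uses the common C ≈ 6·N·D approximation for dense-transformer training FLOPs. The model size, token count, cluster size, and per-GPU throughput are illustrative assumptions, not recommendations.

```python
# Back-of-envelope training-compute estimate using the common C ~= 6 * N * D
# approximation (N = parameters, D = training tokens). All numbers below are
# illustrative assumptions chosen only to show the arithmetic.

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training FLOPs for a dense transformer."""
    return 6.0 * n_params * n_tokens

def training_days(total_flops: float, gpus: int, flops_per_gpu: float, mfu: float) -> float:
    """Wall-clock days given cluster size, per-GPU peak FLOP/s, and utilization (MFU)."""
    sustained = gpus * flops_per_gpu * mfu
    return total_flops / sustained / 86_400

if __name__ == "__main__":
    n_params = 7e9      # hypothetical 7B-parameter model
    n_tokens = 1.4e12   # hypothetical 1.4T training tokens
    flops = training_flops(n_params, n_tokens)
    days = training_days(flops, gpus=256, flops_per_gpu=3e14, mfu=0.4)  # assumed hardware
    print(f"~{flops:.2e} FLOPs, ~{days:.1f} days on the assumed cluster")
```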

Phase 3: Role-Specific Emphasis

| Role | High-Leverage Notes | Why |
| --- | --- | --- |
| Inference Optimization Engineer | inference-optimization, latency-and-throughput-engineering, model-serving | performance and serving-path control |
| Foundation Model Engineer | advanced-fine-tuning, continual-learning, synthetic-data-and-data-engineering | adaptation after pretraining |
| AI Research Scientist | research-methodology-and-paper-reading, reasoning-models, multimodal-ai | frontier hypothesis generation and transfer |
| Applied AI Scientist | llm-evaluation-deep-dive, advanced-fine-tuning, hallucination-detection | rigorous iteration on practical model behavior |
| AI Infra / Platform Engineer | docker-and-kubernetes, distributed-systems-for-ai, cost-optimization | fleet and platform operation at scale |
| AI Compiler / Kernel Engineer | gpu-cuda-programming, inference-optimization, distributed-training | hardware-near performance work |

Phase 4: External Skills

| # | Skill | Recommended Focus | Priority |
| --- | --- | --- | --- |
| 1 | C++, CUDA, and systems profiling | especially for infrastructure and optimization roles | Must |
| 2 | Reproducibility discipline | experiment tracking, benchmark hygiene, ablation thinking | Must |
| 3 | Distributed-compute literacy | networking, memory hierarchy, cluster scheduling | Must |
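
The reproducibility-discipline row above is mostly habit, but some of the habits are mechanical. A minimal sketch follows, assuming a plain-Python experiment; the config fields, file name, and result keys are placeholders to adapt.

```python
# Minimal reproducibility habits: fix seeds, and record the exact config and
# git commit alongside every result. Field names and paths are placeholders.
import json, random, subprocess, time

def set_seeds(seed: int) -> None:
    random.seed(seed)
    # If numpy / torch are in use, seed them here as well, e.g.:
    # numpy.random.seed(seed); torch.manual_seed(seed)

def experiment_record(config: dict) -> dict:
    commit = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True
    ).stdout.strip()
    return {"config": config, "git_commit": commit, "started_at": time.time()}

if __name__ == "__main__":
    config = {"seed": 0, "lr": 3e-4, "batch_size": 64}  # illustrative values
    set_seeds(config["seed"])
    record = experiment_record(config)
    # ... run the experiment, then attach results before writing ...
    record["result"] = {"val_loss": None}               # fill in from the run
    with open("runs.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
```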

Skills Breakdown

Common Technical Skills

  • performance intuition around memory, batching, and distributed work (a worked example follows this list)
  • experiment rigor and baseline comparison
  • ability to reason about trade-offs across training and serving systems
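
To make the memory-and-batching bullet concrete, the sketch below estimates how much GPU memory the KV cache of a decoder-only model consumes per sequence, which is often what caps the serving batch size. The model dimensions and memory budget are hypothetical.

```python
# KV-cache memory per sequence for a decoder-only transformer:
# 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes_per_value.
# Model dimensions below are hypothetical, chosen only to show the arithmetic.

def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

if __name__ == "__main__":
    per_seq = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=4096)
    budget_gib = 20                       # assumed memory left after weights
    max_batch = (budget_gib * 1024**3) // per_seq
    print(f"{per_seq / 1024**2:.0f} MiB per sequence -> batch of ~{max_batch}")
```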

Differentiators By Role

  • research roles need stronger hypothesis formation and literature fluency
  • infrastructure roles need stronger operational and systems depth
  • optimization roles sit closest to performance bottlenecks and tooling

Soft Skills

  • patience with ambiguity
  • disciplined measurement over intuition-only claims
  • precise communication of limits, assumptions, and regressions

Portfolio Project Ideas

| Project | Description | Skills Demonstrated | Difficulty |
| --- | --- | --- | --- |
| Serving benchmark harness | compare latency and throughput across two serving setups with clear metrics | inference systems, profiling, experiment rigor | Hard |
| Mini research replication | reproduce a paper result on smaller hardware and document what transfers | research methodology, critical reading, adaptation | Hard |
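
For the serving benchmark harness project, one possible starting point is sketched below: it sends the same payload to two hypothetical HTTP endpoints and reports median and p95 latency plus single-client throughput. The URLs, payload shape, and request counts are assumptions to replace with your own setups.

```python
# Tiny latency/throughput comparison between two serving endpoints.
# Endpoint URLs and payload shape are placeholders; swap in your own setups.
import json, statistics, time, urllib.request

def bench(url: str, payload: dict, n: int = 50, warmup: int = 5) -> dict:
    data = json.dumps(payload).encode()
    latencies = []
    for i in range(warmup + n):
        req = urllib.request.Request(
            url, data=data, headers={"Content-Type": "application/json"}
        )
        t0 = time.perf_counter()
        urllib.request.urlopen(req).read()
        dt = time.perf_counter() - t0
        if i >= warmup:                                  # discard warmup requests
            latencies.append(dt)
    return {
        "p50_s": statistics.median(latencies),
        "p95_s": statistics.quantiles(latencies, n=20)[18],
        "req_per_s": len(latencies) / sum(latencies),    # single-client throughput
    }

if __name__ == "__main__":
    payload = {"prompt": "hello", "max_tokens": 64}      # placeholder payload
    for name, url in [("setup_a", "http://localhost:8000/generate"),
                      ("setup_b", "http://localhost:8001/generate")]:
        print(name, bench(url, payload))
```

A real harness would add concurrent clients and fixed request rates, but even this sequential version forces the benchmark hygiene (warmup, repeated trials, percentiles) the role tables emphasize.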

Interview Preparation

Review distributed-training, gpu-cuda-programming, inference-optimization, and research-methodology-and-paper-reading.

Common themes:

  • Where is the true bottleneck: compute, memory, bandwidth, or orchestration? (see the arithmetic sketch after this list)
  • How do you verify that a claimed gain survives a real baseline comparison?
  • When do you choose architectural change versus systems optimization?
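
The first question above often reduces to roofline-style arithmetic: compare a kernel's arithmetic intensity (FLOPs per byte of memory traffic) with the hardware's compute-to-bandwidth ratio. A minimal sketch follows; the hardware numbers are assumptions, not any specific GPU's spec sheet.

```python
# Roofline-style check: if arithmetic intensity (FLOPs per byte of memory
# traffic) is below the hardware's FLOPs-per-byte balance point, the kernel is
# memory-bound; above it, compute-bound. Hardware numbers below are assumed.

PEAK_FLOPS = 3e14              # assumed peak FLOP/s
PEAK_BW = 2e12                 # assumed memory bandwidth, bytes/s
RIDGE = PEAK_FLOPS / PEAK_BW   # balance point in FLOPs per byte

def classify(flops: float, bytes_moved: float) -> str:
    intensity = flops / bytes_moved
    bound = "memory-bound" if intensity < RIDGE else "compute-bound"
    return f"intensity {intensity:.1f} FLOP/B vs ridge {RIDGE:.0f} -> {bound}"

if __name__ == "__main__":
    # Example: one decode step of a 7B-parameter model at batch size 1 reads
    # every fp16 weight once and does roughly 2 FLOPs per weight.
    n_params = 7e9
    print(classify(flops=2 * n_params, bytes_moved=2 * n_params))
```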