Tags: career, distributed-training, foundation-models, inference, infrastructure, research
# Research And Infrastructure Roles
Use this guide if you want to build the deepest layers of AI systems: training stacks, inference engines, research experiments, and high-performance infrastructure.
## Included Roles

| Role | Layer | Best Fit | What Differentiates It |
| --- | --- | --- | --- |
| Inference Optimization Engineer | Layer 3 | systems-minded performance work | latency, throughput, kernels, batching, memory |
| Foundation Model Engineer | Layer 2 | pretraining and adaptation at scale | training data, scaling, alignment, long-run experiments |
| AI Research Scientist | Layer 2 | frontier experimentation and novel methods | hypothesis design and paper-grade rigor |
| Applied AI Scientist | Layer 2 | research translated into practical model gains | strong experimentation plus delivery sense |
| AI Infra / Platform Engineer | Layer 1 | clusters, serving platforms, and reliability | platform abstractions, fleet operation, GPU orchestration |
| AI Compiler / Kernel Engineer | Layer 1 | deepest performance stack | compilers, kernels, hardware-near optimization |
## Learning Path

### Phase 1: Foundation
Complete Part 1 of the Learning Path first, then commit to the deeper systems and research path.
### Phase 2: Shared Core
### Phase 3: Role-Specific Emphasis

| Role | High-Leverage Notes | Why |
| --- | --- | --- |
| Inference Optimization Engineer | `inference-optimization`, `latency-and-throughput-engineering`, `model-serving` | performance and serving-path control |
| Foundation Model Engineer | `advanced-fine-tuning`, `continual-learning`, `synthetic-data-and-data-engineering` | adaptation after pretraining |
| AI Research Scientist | `research-methodology-and-paper-reading`, `reasoning-models`, `multimodal-ai` | frontier hypothesis generation and transfer |
| Applied AI Scientist | `llm-evaluation-deep-dive`, `advanced-fine-tuning`, `hallucination-detection` | rigorous iteration on practical model behavior |
| AI Infra / Platform Engineer | `docker-and-kubernetes`, `distributed-systems-for-ai`, `cost-optimization` | fleet and platform operation at scale |
| AI Compiler / Kernel Engineer | `gpu-cuda-programming`, `inference-optimization`, `distributed-training` | hardware-near performance work |
### Phase 4: External Skills

| # | Skill | Recommended Focus | Priority |
| --- | --- | --- | --- |
| 1 | C++, CUDA, and systems profiling | especially for infrastructure and optimization roles | Must |
| 2 | Reproducibility discipline | experiment tracking, benchmark hygiene, ablation thinking | Must |
| 3 | Distributed-compute literacy | networking, memory hierarchy, cluster scheduling | Must |
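Reproducibility discipline (skill 2 above) can be practiced in miniature long before you have a tracking stack. A minimal sketch using only the Python standard library; `run_experiment` and its stand-in metric are hypothetical, not part of any named tracking tool:

```python
import hashlib
import json
import random

def run_experiment(config: dict, seed: int = 0) -> dict:
    """Illustrative experiment runner: seeds the RNG and fingerprints the config."""
    random.seed(seed)                        # fix randomness so reruns match
    config_hash = hashlib.sha256(
        json.dumps(config, sort_keys=True).encode()
    ).hexdigest()[:12]                       # short, stable config fingerprint
    metric = sum(random.random() for _ in range(100)) / 100  # stand-in metric
    return {"seed": seed, "config_hash": config_hash, "metric": metric}

# Two runs with the same seed and config must produce identical results.
a = run_experiment({"lr": 3e-4, "batch": 32}, seed=42)
b = run_experiment({"lr": 3e-4, "batch": 32}, seed=42)
assert a == b
```

The habit being illustrated is that every result carries its seed and a config fingerprint, so a reviewer (or a future you) can tell at a glance whether two numbers are actually comparable.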
## Skills Breakdown

### Common Technical Skills

- performance intuition around memory, batching, and distributed work
- experiment rigor and baseline comparison
- ability to reason about trade-offs across training and serving systems

### Differentiators By Role

- research roles need stronger hypothesis formation and literature fluency
- infrastructure roles need stronger operational and systems depth
- optimization roles sit closest to performance bottlenecks and tooling

### Soft Skills

- patience with ambiguity
- disciplined measurement over intuition-only claims
- precise communication of limits, assumptions, and regressions
## Portfolio Project Ideas

| Project | Description | Skills Demonstrated | Difficulty |
| --- | --- | --- | --- |
| Serving benchmark harness | compare latency and throughput across two serving setups with clear metrics | inference systems, profiling, experiment rigor | Hard |
| Mini research replication | reproduce a paper result on smaller hardware and document what transfers | research methodology, critical reading, adaptation | Hard |
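The serving benchmark harness project can start very small. A minimal Python sketch, assuming the two "serving setups" are callables you can invoke in-process; the warmup count and percentile handling are deliberately crude placeholders:

```python
import time

def benchmark(fn, requests, warmup=5):
    """Measure per-request latency (p50/p95) and overall throughput for fn."""
    for r in requests[:warmup]:              # warmup pass: exclude cold-start cost
        fn(r)
    latencies = []
    start = time.perf_counter()
    for r in requests:
        t0 = time.perf_counter()
        fn(r)
        latencies.append(time.perf_counter() - t0)
    total = time.perf_counter() - start
    latencies.sort()
    return {
        "p50_ms": 1000 * latencies[len(latencies) // 2],
        "p95_ms": 1000 * latencies[int(len(latencies) * 0.95) - 1],
        "throughput_rps": len(requests) / total,
    }

# Compare two hypothetical serving setups on the same request set.
fast = lambda r: sum(range(1_000))
slow = lambda r: sum(range(10_000))
print(benchmark(fast, list(range(50))))
print(benchmark(slow, list(range(50))))
```

A real harness would add concurrent clients, tail percentiles from many more samples, and identical input distributions across setups, but even this shape forces the "clear metrics" habit the project description asks for.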
## Interview Preparation

Review `distributed-training`, `gpu-cuda-programming`, `inference-optimization`, and `research-methodology-and-paper-reading`.

Common themes:

- Where is the true bottleneck: compute, memory, bandwidth, or orchestration?
- How do you verify that a claimed gain survives a real baseline comparison?
- When do you choose architectural change versus systems optimization?
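The second theme, verifying that a gain survives a baseline, can be sketched with a crude noise criterion: the improvement must exceed the run-to-run variation of both configurations. This one-sigma check is an illustrative simplification, not a substitute for a proper significance test:

```python
import random
import statistics

def gain_survives_baseline(baseline_runs, candidate_runs):
    """Crude check: the mean gain must exceed the combined standard error
    of both run sets (roughly a one-sigma noise criterion)."""
    gain = statistics.mean(candidate_runs) - statistics.mean(baseline_runs)
    noise = (statistics.stdev(baseline_runs) / len(baseline_runs) ** 0.5
             + statistics.stdev(candidate_runs) / len(candidate_runs) ** 0.5)
    return gain > noise

# Synthetic repeated runs: one config with a real +0.03 effect, one without.
random.seed(0)
baseline = [0.80 + random.gauss(0, 0.01) for _ in range(10)]
better   = [0.83 + random.gauss(0, 0.01) for _ in range(10)]
print(gain_survives_baseline(baseline, better))   # clear gain vs. ~0.006 noise
```

The interview-relevant point is the shape of the argument: a single lucky run is not a gain, and the claimed effect has to be compared against the measured noise floor of repeated runs.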
## Sources