What makes an AI agent different from a chatbot?	A chatbot responds to messages. An agent sets goals, plans multi-step approaches, uses tools, observes results, and iterates. Agents are autonomous; chatbots are reactive.<br><br>Source: agents/ai-agents.md<br>Tags: Foundation, agentic-ai, agents, autonomy, function-calling, genai-techniques, tool-use
How would you prevent an AI agent from getting stuck in a loop?	Max iteration limits, self-reflection prompts ("Am I making progress?"), fallback to human, diverse retry strategies (try different tools/approaches), and logging for debugging.<br><br>Source: agents/ai-agents.md<br>Tags: Foundation, agentic-ai, agents, autonomy, function-calling, genai-techniques, tool-use
What's the ReAct pattern?	Reason + Act. The agent alternates between thinking (reasoning about what to do) and acting (calling tools). After each action, it observes the result and reasons about next steps. This interleaving of thought and action is more reliable than planning everything upfront.<br><br>Source: agents/ai-agents.md<br>Tags: Foundation, agentic-ai, agents, autonomy, function-calling, genai-techniques, tool-use
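A minimal sketch of that reason → act → observe loop. `llm_think` and the `calculator` tool are stand-in stubs (not from the source, not any framework's API):

```python
# Minimal ReAct-style loop (illustrative sketch, not a framework API).
def llm_think(history: list[str]) -> dict:
    """Stub for a model call: first asks for a tool, then gives a final answer."""
    if not any(h.startswith("Observation:") for h in history):
        return {"type": "action", "tool": "calculator", "input": "6 * 7"}
    return {"type": "final", "content": history[-1].removeprefix("Observation: ")}

TOOLS = {
    "calculator": lambda expr: str(eval(expr)),  # toy tool; never eval untrusted input
}

def react_agent(task: str, max_steps: int = 5) -> str:
    history = [f"Task: {task}"]
    for _ in range(max_steps):                        # iteration cap prevents infinite loops
        step = llm_think(history)                     # Reason: decide what to do next
        if step["type"] == "final":
            return step["content"]
        observation = TOOLS[step["tool"]](step["input"])   # Act: call the chosen tool
        history.append(f"Observation: {observation}")       # Observe: feed result back
    return "Stopped: step limit reached"

print(react_agent("What is 6 * 7?"))   # -> 42
```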
How does agent memory work?	Four types: short-term (conversation context), long-term (vector DB storing facts/preferences across sessions), episodic (summaries of past task executions for learning), and procedural (learned strategies and tool patterns). In practice, most production agents use short-term + simple long-term memory with vector retrieval.<br><br>Source: agents/ai-agents.md<br>Tags: Foundation, agentic-ai, agents, autonomy, function-calling, genai-techniques, tool-use
Why divide by √d_k in attention?	Without it, for large d_k, dot products become huge → softmax saturates → near-zero gradients. Scaling keeps variance at ~1.<br><br>Source: foundations/attention-mechanism.md<br>Tags: Foundation, attention, foundations, genai-foundations, multi-head, self-attention, transformers
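A small NumPy sketch of scaled dot-product attention; the random Q, K, V matrices are stand-ins for real projections:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention; dividing by sqrt(d_k) keeps softmax inputs well-ranged."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                         # (n_q, n_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # softmax over keys
    return weights @ V                                      # weighted mix of value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 64)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)          # (4, 64)
```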
What's the difference between MHA, MQA, and GQA?	MHA: separate K,V per head (most expressive, slowest). MQA: one shared K,V (fastest, some quality loss). GQA: groups of heads share K,V (good balance). LLaMA 2+ uses GQA.<br><br>Source: foundations/attention-mechanism.md<br>Tags: Foundation, attention, foundations, genai-foundations, multi-head, self-attention, transformers
How does Flash Attention improve efficiency without changing the math?	It tiles the computation to fit in SRAM (fast cache), avoiding materialization of the full n×n attention matrix in slow HBM (GPU memory). Same result, ~2-4x faster.<br><br>Source: foundations/attention-mechanism.md<br>Tags: Foundation, attention, foundations, genai-foundations, multi-head, self-attention, transformers
What makes CI/CD for LLM systems different from regular CI/CD?	The output behavior is probabilistic and influenced by prompts, models, and datasets, so the pipeline needs evaluation gates, cost checks, and rollout safety beyond normal software tests.<br><br>Source: production/cicd-for-ml.md<br>Tags: C, automation, cicd, deployment, llmops, mlops, production, testing
What would you gate before shipping a model change?	Quality against a regression set, safety checks, latency, token cost, and rollback readiness. If the change affects retrieval or agents, I would also inspect representative traces.<br><br>Source: production/cicd-for-ml.md<br>Tags: C, automation, cicd, deployment, llmops, mlops, production, testing
Where does classical ML fit in a GenAI system?	Classical ML handles the narrow, structured, cost-sensitive decisions around the LLM core. The most common patterns are: (1) request routing — classifying which model tier handles each request, saving 80%+ on API costs, (2) reranking — scoring and sorting retrieved documents faster than LLM-based reranking, (3) quality gates — fast pass/fail classifiers on LLM output before returning to users, (4) anomaly detection — flagging unusual requests or outputs for human review. The key insight is that production GenAI systems are hybrids: the LLM handles generation and reasoning, while classical models handle the structured decisions that need to be fast, cheap, and deterministic.<br><br>Source: production/classical-ml-for-genai.md<br>Tags: C, classical-ml, production, ranking, routing, sklearn, xgboost
How would you design a model routing system?	I'd start by defining 3 tiers: cached responses (free), small model (cheap), and large model (expensive). Feature engineering would extract request complexity indicators: length, question count, topic embedding, tool requirements, and context size. I'd train a logistic regression initially (simple, interpretable, fast to iterate) and upgrade to XGBoost once I have enough labeled data (1000+ examples). The labeling strategy: run all requests through the large model for a week, then label each request with the cheapest tier that achieved acceptable quality (measured by user satisfaction or LLM-judge score). Critical: add a confidence threshold — if the router is < 70% confident, default to the expensive model. This avoids the worst failure mode (misrouting a complex request to a cheap model).<br><br>Source: production/classical-ml-for-genai.md<br>Tags: C, classical-ml, production, ranking, routing, sklearn, xgboost
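A hedged scikit-learn sketch of such a router; the feature columns, toy labels, and 0.7 threshold are illustrative assumptions, not values from the source:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: [prompt_tokens, num_questions, needs_tools(0/1)]
X = np.array([[40, 1, 0], [900, 3, 1], [60, 1, 0], [1500, 4, 1], [80, 2, 0], [1200, 2, 1]])
y = np.array([0, 1, 0, 1, 0, 1])   # 0 = cheap model tier, 1 = expensive model tier

router = LogisticRegression().fit(X, y)

def route(features, threshold=0.7):
    proba = router.predict_proba([features])[0]
    tier = int(proba.argmax())
    # If the router is unsure, fail safe to the expensive model.
    return tier if proba[tier] >= threshold else 1

print(route([100, 1, 0]))    # likely 0 (cheap tier)
print(route([700, 2, 1]))    # falls back to 1 if confidence < 0.7
```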
How would you choose between SageMaker, Vertex AI, and Azure AI Foundry?	I would start with the existing cloud footprint, governance requirements, workload type, and team skills. The best choice is usually the platform that fits the organization's operating context, not the one with the longest feature list.<br><br>Source: tools-and-infra/cloud-ml-services.md<br>Tags: C, azure-ai-foundry, cloud, infrastructure, mlops, sagemaker, tools-and-infra, vertex-ai
When would you avoid a full managed platform?	When the team needs extreme portability, a highly custom serving stack, or the platform overhead outweighs the operational value for the size of the workload.<br><br>Source: tools-and-infra/cloud-ml-services.md<br>Tags: C, azure-ai-foundry, cloud, infrastructure, mlops, sagemaker, tools-and-infra, vertex-ai
When would you use RAG vs just a long context window?	Long context when: few documents, need cross-references, latency isn't critical, and you can afford the token cost. RAG when: many documents (more than fit in the context window), need real-time data, cost-sensitive, or need to scale to millions of docs. In practice, combine both: cache stable reference docs in context, use RAG for dynamic query-specific retrieval.<br><br>Source: techniques/context-engineering.md<br>Tags: Foundation, context-caching, context-window, genai, long-context, prompt-caching, rag-vs-context, techniques

What is context engineering?	Context engineering is the practice of strategically constructing the full input to an LLM — system prompt, cached reference docs, RAG results, conversation history, and examples — to maximize output quality within the token budget. It's becoming more important than prompt engineering because the quality bottleneck is often WHAT information the model has access to, not HOW you phrase the question.<br><br>Source: techniques/context-engineering.md<br>Tags: Foundation, context-caching, context-window, genai, long-context, prompt-caching, rag-vs-context, techniques
What are the biggest cost levers in a GenAI application?	Model routing, context control, caching, retrieval discipline, and serving choices. Small prompt tweaks help, but architecture decisions usually dominate the savings.<br><br>Source: production/cost-optimization.md<br>Tags: B, C, caching, cost, llmops, optimization, production, routing, token-cost
What metric is better than cost per request?	Cost per successful task, because it reflects whether the spend actually produced value. A cheap request path that often fails can be more expensive overall.<br><br>Source: production/cost-optimization.md<br>Tags: B, C, caching, cost, llmops, optimization, production, routing, token-cost
What optimizer do you use for training Transformers and why?	AdamW. It's Adam with decoupled weight decay, which provides better regularization for Transformers. Adam adapts the learning rate per-parameter using running estimates of gradient mean and variance.<br><br>Source: prerequisites/deep-learning-fundamentals.md<br>Tags: Foundation, deep-learning, genai-prerequisite, gpu, optimizer, prerequisites, regularization, training
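A minimal PyTorch usage sketch; the layer, learning rate, and weight decay values are placeholders:

```python
import torch
from torch import nn

model = nn.Linear(512, 512)   # stand-in for a Transformer block
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

x = torch.randn(8, 512)
loss = model(x).pow(2).mean()
loss.backward()        # compute gradients
optimizer.step()       # Adam update + decoupled weight decay
optimizer.zero_grad()
```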
How would you handle GPU memory limitations when training?	(1) Reduce batch size + gradient accumulation, (2) Mixed precision (BF16), (3) Gradient checkpointing, (4) LoRA/QLoRA (train small adapters not full model), (5) DeepSpeed ZeRO / FSDP (distribute across GPUs).<br><br>Source: prerequisites/deep-learning-fundamentals.md<br>Tags: Foundation, deep-learning, genai-prerequisite, gpu, optimizer, prerequisites, regularization, training
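A sketch of technique (1), gradient accumulation, assuming a toy model and an effective batch of 32 built from micro-batches of 8:

```python
import torch
from torch import nn

model = nn.Linear(256, 256)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 4

for step in range(8):
    x = torch.randn(8, 256)                        # micro-batch that fits in memory
    loss = model(x).pow(2).mean() / accum_steps    # scale so gradients average correctly
    loss.backward()                                # gradients accumulate across calls
    if (step + 1) % accum_steps == 0:
        optimizer.step()                           # one update per effective batch of 32
        optimizer.zero_grad()
```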
What is the difference between scaling replicas and sharding a model?	Replica scaling duplicates the full serving stack to handle more requests, while model sharding splits one model across multiple devices because it is too large or too expensive to serve as one unit.<br><br>Source: inference/distributed-inference-and-serving-architecture.md<br>Tags: C, architecture, distributed-inference, inference, kv-cache, scaling, serving
Why is KV-cache locality important?	Because rebuilding or transferring cache state is costly. Good locality helps keep follow-up tokens and turns efficient instead of repeatedly paying for the same context.<br><br>Source: inference/distributed-inference-and-serving-architecture.md<br>Tags: C, architecture, distributed-inference, inference, kv-cache, scaling, serving
Why do AI systems need distributed-systems knowledge?	Because production AI is composed of many interacting services with partial failure, variable latency, and expensive state transitions. Reliability depends on queueing, retry policy, caching, and graceful degradation.<br><br>Source: tools-and-infra/distributed-systems-for-ai.md<br>Tags: C, consistency, distributed-systems, infrastructure, queues, scaling, tools-and-infra
What is backpressure in an AI system?	It is the mechanism that prevents fast upstream components from overwhelming a slower downstream stage such as inference or retrieval. Without it, latency and failure rates can cascade through the system.<br><br>Source: tools-and-infra/distributed-systems-for-ai.md<br>Tags: C, consistency, distributed-systems, infrastructure, queues, scaling, tools-and-infra
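A minimal asyncio sketch of backpressure via a bounded queue; the queue size and sleep time are arbitrary stand-ins for a real inference stage:

```python
import asyncio

async def producer(queue: asyncio.Queue):
    for i in range(20):
        await queue.put(f"request-{i}")    # blocks when the queue is full

async def inference_worker(queue: asyncio.Queue):
    while True:
        req = await queue.get()
        await asyncio.sleep(0.05)          # simulate a slow model call
        queue.task_done()

async def main():
    queue = asyncio.Queue(maxsize=4)       # the bound is what creates backpressure
    worker = asyncio.create_task(inference_worker(queue))
    await producer(queue)
    await queue.join()
    worker.cancel()

asyncio.run(main())
```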
When would you choose Kubernetes for a GenAI system?	When the system has multiple independently scaled services, controlled rollouts, background jobs, observability requirements, or self-hosted inference that needs GPU scheduling. For smaller systems, a managed platform or simple container deployment may be better.<br><br>Source: production/docker-and-kubernetes.md<br>Tags: C, containers, deployment, docker, kubernetes, llmops, production
What is the main Docker benefit for AI teams?	Reproducibility. It removes "works on my machine" failures across notebooks, CI, and production while standardizing dependencies and rollout behavior.<br><br>Source: production/docker-and-kubernetes.md<br>Tags: C, containers, deployment, docker, kubernetes, llmops, production
What are embeddings and why do they matter for GenAI?	Embeddings map data to dense vectors where semantic similarity becomes geometric distance. They're the foundation of RAG (find relevant documents), semantic search (find by meaning), and even the first layer of every LLM. Without embeddings, modern AI can't represent or compare meaning.<br><br>Source: foundations/embeddings.md<br>Tags: Foundation, embeddings, foundations, genai-foundations, representation, similarity, vectors
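A small sketch of "similarity as geometry" using cosine similarity; the vectors are random stand-ins for real embedding-model output:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(3, 384))                      # pretend corpus embeddings
query_vec = doc_vecs[1] + 0.1 * rng.normal(size=384)      # query close to doc 1

scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
print(int(np.argmax(scores)))                             # 1 -- nearest neighbour is the related doc
```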
What's the difference between word embeddings and sentence embeddings?	Word embeddings (Word2Vec, GloVe) encode individual words — "bank" always gets the same vector. Sentence embeddings (SBERT, text-embedding-3) encode entire sentences with context — "river bank" and "bank robbery" get very different vectors. Modern systems use sentence/paragraph embeddings.<br><br>Source: foundations/embeddings.md<br>Tags: Foundation, embeddings, foundations, genai-foundations, representation, similarity, vectors
When would you fine-tune vs use RAG?	Fine-tune for: output format changes, domain-specific reasoning/style, consistent behavior. RAG for: up-to-date knowledge, source attribution, private data access. Best practice in 2026: **combine both** — LoRA for behavior, RAG for facts.<br><br>Source: techniques/fine-tuning.md<br>Tags: Foundation, fine-tuning, genai-techniques, lora, peft, qlora, techniques, training
Explain how LoRA reduces memory requirements.	Instead of updating the full d×d weight matrix, LoRA freezes it and learns a low-rank update ΔW = B·A, where B is d×r and A is r×d. With r=16 on a 4096-dim model, you train ~0.78% of that matrix's parameters. QLoRA goes further by quantizing the frozen base model to 4-bit, reducing memory from ~280GB to ~35GB for a 70B model.<br><br>Source: techniques/fine-tuning.md<br>Tags: Foundation, fine-tuning, genai-techniques, lora, peft, qlora, techniques, training
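The parameter-count arithmetic behind that ~0.78% figure, as a quick check:

```python
# LoRA trainable-parameter fraction for one 4096x4096 projection with rank 16.
d, r = 4096, 16
full = d * d                 # 16,777,216 trainable params if fine-tuned directly
lora = d * r + r * d         # 131,072 params in the two low-rank factors B and A
print(f"trainable fraction: {lora / full:.2%}")   # ~0.78%
```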
How does function calling work in LLMs?	You define tools with names, descriptions, and parameter schemas. The LLM receives the user message + tool definitions, decides if a tool should be called, and generates a JSON object with the function name and arguments. YOUR code executes the function and feeds the result back to the LLM for final response generation. The LLM never actually runs the function.<br><br>Source: techniques/function-calling-and-structured-output.md<br>Tags: Foundation, function-calling, genai, grounding, json-mode, mcp, structured-output, techniques, tool-use
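A hedged end-to-end sketch of that loop; `call_llm` is a stub standing in for a real chat-completions request, and `get_weather` is a made-up tool:

```python
import json

tools = [{
    "name": "get_weather",
    "description": "Get current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

def get_weather(city: str) -> str:
    return f"18C and cloudy in {city}"   # toy implementation; your code runs this, not the LLM

def call_llm(messages, tools):
    # Stub: a real model would decide whether to emit a tool call and with what arguments.
    return {"tool_call": {"name": "get_weather", "arguments": json.dumps({"city": "Oslo"})}}

messages = [{"role": "user", "content": "What's the weather in Oslo?"}]
reply = call_llm(messages, tools)
if "tool_call" in reply:
    call = reply["tool_call"]
    result = {"get_weather": get_weather}[call["name"]](**json.loads(call["arguments"]))
    messages.append({"role": "tool", "name": call["name"], "content": result})
    # A second model call would now generate the final user-facing answer from `result`.
print(messages[-1])
```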
What is MCP and why does it matter?	Model Context Protocol is an open standard for connecting LLMs to external tools. Before MCP, every tool needed custom integration for each model. MCP provides a universal interface — any MCP-compatible tool works with any MCP-compatible client. It's becoming the "USB standard" for AI tool integration.<br><br>Source: techniques/function-calling-and-structured-output.md<br>Tags: Foundation, function-calling, genai, grounding, json-mode, mcp, structured-output, techniques, tool-use
How does quantization make LLMs run on consumer hardware?	By representing model weights in fewer bits (INT4 = 4 bits vs FP16 = 16 bits), memory drops 4x. LLaMA 70B goes from 140GB (needs 2× A100) to 35GB (fits on 1× RTX 4090). Modern quantization methods (AWQ, GPTQ) preserve quality by protecting important weights and using calibration data.<br><br>Source: inference/inference-optimization.md<br>Tags: C, genai, inference, kv-cache, pd-disaggregation, performance, quantization, radix-attention, serving, sglang, speculative-decoding
Explain speculative decoding.	A small draft model rapidly generates N candidate tokens. The large target model verifies all N in a single parallel forward pass (since verification is parallelizable). Accepted tokens are kept, rejected ones trigger regeneration. Provably lossless (same distribution as target model) with 2-3x speedup.<br><br>Source: inference/inference-optimization.md<br>Tags: C, genai, inference, kv-cache, pd-disaggregation, performance, quantization, radix-attention, serving, sglang, speculative-decoding
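A toy, single-token version of the acceptance rule (real systems verify N drafted tokens in one target forward pass); the two distributions are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
p_draft  = np.array([0.70, 0.15, 0.10, 0.05])   # draft model's next-token distribution
p_target = np.array([0.40, 0.35, 0.15, 0.10])   # target model's distribution

token = rng.choice(4, p=p_draft)                # draft proposes a token
if rng.random() < min(1.0, p_target[token] / p_draft[token]):
    accepted = token                            # keep the cheap draft token
else:
    residual = np.maximum(p_target - p_draft, 0)
    residual /= residual.sum()
    accepted = rng.choice(4, p=residual)        # resample so output matches p_target exactly
print(accepted)
```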
What is PagedAttention and why does vLLM use it?	Like OS virtual memory paging. KV cache is stored in non-contiguous memory blocks ("pages") instead of one contiguous block. This eliminates fragmentation, allows dynamic memory allocation, and enables sharing cache between requests with common prefixes. Result: 2-4x higher throughput.<br><br>Source: inference/inference-optimization.md<br>Tags: C, genai, inference, kv-cache, pd-disaggregation, performance, quantization, radix-attention, serving, sglang, speculative-decoding
How would you evaluate a RAG system?	Component-level: Retrieval quality (context precision + recall) — are the right chunks found? Generation quality (faithfulness + answer relevancy) — is the answer grounded and on-topic? Use RAGAS for automated metrics, plus a golden test set of 50+ question-answer pairs with human-verified ground truth.<br><br>Source: evaluation/evaluation-and-benchmarks.md<br>Tags: E, Foundation, benchmarks, evaluation, genai, humaneval, mmlu, ragas, testing
Why are traditional benchmarks becoming less useful?	Saturation (top models all score >90%), contamination (benchmark data in training sets), and gap between benchmark performance and real-world utility. The field is moving to dynamic benchmarks (LiveBench), harder tests (SWE-bench, ARC-AGI-2), and domain-specific evaluation.<br><br>Source: evaluation/evaluation-and-benchmarks.md<br>Tags: E, Foundation, benchmarks, evaluation, genai, humaneval, mmlu, ragas, testing
Explain the training pipeline of a modern LLM.	Pre-training (next-token prediction on internet text) → SFT (supervised fine-tuning on instruction-response pairs) → RLHF/DPO (learning from human preference comparisons) → Safety alignment<br><br>Source: llms/llms-overview.md<br>Tags: Foundation, claude, gemini, genai, gpt, language-models, llama, llm, llms
What's the difference between dense and MoE architectures?	Dense: every parameter processes every token (e.g., GPT-3, LLaMA 3). MoE: tokens are routed to a subset of "expert" sub-networks (e.g., LLaMA 4 Maverick). MoE gives more total capacity with less compute per token.<br><br>Source: llms/llms-overview.md<br>Tags: Foundation, claude, gemini, genai, gpt, language-models, llama, llm, llms
How do you choose between using an API (GPT/Claude) vs hosting an open model (LLaMA)?	API: faster to start, best performance, no infra. Self-host: data stays private, no vendor lock-in, customizable. Cost crossover: at ~1M+ tokens/day, self-hosting often becomes cheaper.<br><br>Source: llms/llms-overview.md<br>Tags: Foundation, claude, gemini, genai, gpt, language-models, llama, llm, llms
What is the difference between latency and throughput, and when do they conflict?	Latency is the time a single request takes from submission to completion. Throughput is the total work the system handles per unit time (req/s or tokens/s). They conflict because optimizing throughput often means larger batch sizes and higher GPU utilization, which increases queuing time and thus individual request latency. In production, you typically set a latency SLA first (e.g., P95 < 2s), then maximize throughput within that constraint. The mathematical relationship is captured by Little's Law: L = λW — at a given concurrency level (L), you can trade latency (W) for throughput (λ) and vice versa.<br><br>Source: production/latency-and-throughput-engineering.md<br>Tags: C, latency, llmops, performance, production, queueing, throughput
Why is P95/P99 latency more important than average latency in AI systems?	Because user experience is defined by the slowest interaction, not the average one. If your P50 is 800ms but P99 is 5000ms, then 1 in 100 users waits 5+ seconds — and those users remember. It gets worse with fan-out: a page showing 10 AI-powered components has a ~40% chance that at least one component exceeds P95 latency (0.95^10 ≈ 0.60, so ~40% chance of at least one slow response). Furthermore, tail latency often reveals systemic issues (garbage collection, memory pressure, noisy neighbors) that averages hide. In system design interviews, always say "I'd measure P95 and P99" — it signals production experience.<br><br>Source: production/latency-and-throughput-engineering.md<br>Tags: C, latency, llmops, performance, production, queueing, throughput
How would you capacity-plan an LLM serving system for 50 requests/second?	I'd use Little's Law. If average latency is 2 seconds and each GPU handles 8 concurrent requests at 80% utilization (to avoid queue explosion), then: Required concurrency = λ × W = 50 × 2 = 100 concurrent slots. Effective capacity per GPU = 8 × 0.8 = 6.4. GPUs needed = 100 / 6.4 = 16 GPUs. I'd add 20% headroom for traffic bursts, so 19-20 GPUs. Then I'd validate with load testing, watching P99 latency and queue depth to confirm the model holds under realistic traffic patterns.<br><br>Source: production/latency-and-throughput-engineering.md<br>Tags: C, latency, llmops, performance, production, queueing, throughput
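The same estimate as executable arithmetic:

```python
# Little's Law capacity estimate from the card above.
arrival_rate = 50        # requests/second (lambda)
avg_latency = 2.0        # seconds per request (W)
slots_per_gpu = 8        # concurrent requests one GPU can hold
target_util = 0.8        # keep utilization below 100% to avoid queue blow-up

concurrency = arrival_rate * avg_latency             # L = lambda * W = 100
gpus = concurrency / (slots_per_gpu * target_util)   # 100 / 6.4 = 15.6 -> 16
with_headroom = gpus * 1.2                           # ~19 for burst tolerance
print(round(gpus), round(with_headroom))             # 16 19
```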
Why is the dot product central to the attention mechanism?	Attention computes Q·Kᵀ where Q = query and K = key. The dot product measures how "related" each query is to each key — high dot product = high attention. This is then softmaxed into weights that determine how much each value V contributes to the output.<br><br>Source: prerequisites/linear-algebra-for-ai.md<br>Tags: Foundation, dot-product, genai-prerequisite, linear-algebra, matrices, prerequisites, tensors, vectors
Why are GPUs used for deep learning?	Neural networks are fundamentally matrix multiplications. GPUs have thousands of cores designed for parallel math operations. A CPU might do matrix multiply sequentially; a GPU does thousands of multiply-adds simultaneously.<br><br>Source: prerequisites/linear-algebra-for-ai.md<br>Tags: Foundation, dot-product, genai-prerequisite, linear-algebra, matrices, prerequisites, tensors, vectors
What is the minimum metadata you would track for an ML run?	Parameters, metrics, code version, data version, artifacts, and environment details. Without that set, comparing or reproducing results is unreliable. For GenAI specifically, I'd also track prompt versions, eval set versions, and token costs.<br><br>Source: tools-and-infra/ml-experiment-and-data-management.md<br>Tags: C, data-versioning, dvc, experiment-tracking, infrastructure, lineage, mlflow, tools-and-infra, wandb
Why is data versioning essential for ML reproducibility?	Because code alone does not determine model behavior. You need the exact dataset state, splits, and lineage to reproduce or explain results. In GenAI, this extends to RAG corpus snapshots, preference data, and evaluation sets.<br><br>Source: tools-and-infra/ml-experiment-and-data-management.md<br>Tags: C, data-versioning, dvc, experiment-tracking, infrastructure, lineage, mlflow, tools-and-infra, wandb
What is the difference between inference optimization and model serving?	Inference optimization focuses on making the core generation path more efficient, for example quantization or KV-cache improvements. Model serving covers the full production runtime around that path, including APIs, routing, scheduling, scaling, and failure handling.<br><br>Source: production/model-serving.md<br>Tags: B, C, inference, llmops, production, serving, tgi, triton, vllm
When would you self-host instead of using a managed API?	When privacy, volume economics, model customization, or latency control outweigh the extra operational burden. Otherwise, managed APIs are usually the faster path.<br><br>Source: production/model-serving.md<br>Tags: B, C, inference, llmops, production, serving, tgi, triton, vllm
What is Mixture of Experts and why does LLaMA 4 use it?	MoE has multiple "expert" FFN sub-networks per layer with a learned router. For each token, only top-K experts (e.g., 2 of 16) are activated. This gives the model capacity of the total parameters but computational cost of only the active experts. LLaMA 4 uses it to achieve 400B total params with only 17B active — massive capacity at manageable cost.<br><br>Source: foundations/modern-architectures.md<br>Tags: Foundation, architecture, flash-attention, foundations, genai, gqa, mixture-of-experts, moe, rope
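A toy top-2 routing layer to make the mechanism concrete; the dimensions and expert count are illustrative, and load balancing is omitted:

```python
import torch
from torch import nn

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.k = k

    def forward(self, x):                                         # x: (tokens, d_model)
        weights, idx = self.router(x).softmax(-1).topk(self.k, dim=-1)  # top-k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                          # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out                                                # only k experts ran per token

print(TinyMoE()(torch.randn(5, 64)).shape)                        # torch.Size([5, 64])
```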
What is GQA and how does it save memory?	Grouped-Query Attention shares K and V heads across groups of Q heads. With 64 Q heads and 8 KV heads, the KV cache is 8x smaller than full MHA. This is critical for serving long-context models — KV cache can otherwise consume more memory than the model weights.<br><br>Source: foundations/modern-architectures.md<br>Tags: Foundation, architecture, flash-attention, foundations, genai, gqa, mixture-of-experts, moe, rope
Why is observability harder for LLM systems than for normal APIs?	Because correctness is not binary. The system can return a 200 response and still be wrong, unsafe, or unhelpful. You need traceable context, output quality signals, and user feedback, not just uptime metrics.<br><br>Source: production/monitoring-observability.md<br>Tags: B, C, evals, llmops, monitoring, observability, production, tracing
What is the minimum telemetry for a production RAG system?	Request id, model and prompt version, retrieval documents and scores, token usage, latency, validation status, and user feedback. That gives you enough context to debug both system and semantic failures.<br><br>Source: production/monitoring-observability.md<br>Tags: B, C, evals, llmops, monitoring, observability, production, tracing
What's the difference between BERT and GPT?	BERT is an ENCODER that sees all tokens bidirectionally (optimized for understanding — classification, NER, embeddings). GPT is a DECODER that sees only past tokens (optimized for generation — text, code, chat). Both use Transformers, but BERT predicts masked tokens while GPT predicts the next token.<br><br>Source: prerequisites/nlp-fundamentals.md<br>Tags: Foundation, bert, genai-prerequisite, natural-language-processing, ner, nlp, prerequisites, sentiment, text-classification
Has GenAI made traditional NLP obsolete?	Mostly, yes. LLMs handle most NLP tasks via prompting, often better than task-specific models. However, BERT-based models survive for: (1) embeddings (BGE, E5), (2) sub-100ms classification at scale, (3) BM25/TF-IDF for initial retrieval in RAG. The field has consolidated around "one model, many tasks."<br><br>Source: prerequisites/nlp-fundamentals.md<br>Tags: Foundation, bert, genai-prerequisite, natural-language-processing, ner, nlp, prerequisites, sentiment, text-classification
What does backpropagation actually compute?	The gradient of the loss function with respect to every weight in the network, using the chain rule of calculus. These gradients tell us how to adjust each weight to reduce the error.<br><br>Source: prerequisites/neural-networks.md<br>Tags: Foundation, activation, backpropagation, cnn, genai-prerequisite, neural-networks, perceptron, prerequisites, rnn
Why do we need activation functions?	Without non-linear activation, any number of layers collapses to a single linear transformation (y = Wx + b). Non-linearity lets the network approximate any function, not just lines/planes.<br><br>Source: prerequisites/neural-networks.md<br>Tags: Foundation, activation, backpropagation, cnn, genai-prerequisite, neural-networks, perceptron, prerequisites, rnn
What is temperature in LLM generation?	Temperature scales the logits before softmax. Low temperature (→0) makes the distribution sharper (confident picks), high temperature makes it flatter (random picks). Mathematically: P = softmax(logits / T).<br><br>Source: prerequisites/probability-and-statistics.md<br>Tags: Foundation, bayes, distributions, genai-prerequisite, loss-functions, prerequisites, probability, sampling, statistics
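The formula as a quick experiment with made-up logits:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])
for T in (0.1, 1.0, 2.0):
    print(T, np.round(softmax(logits / T), 3))
# Low T concentrates probability on the top logit; high T flattens the distribution.
```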
What loss function do LLMs use and why?	Cross-entropy loss. It measures how different the model's predicted probability distribution is from the true distribution (where the correct next token has probability 1). Minimizing cross-entropy pushes the model to assign high probability to the correct token.<br><br>Source: prerequisites/probability-and-statistics.md<br>Tags: Foundation, bayes, distributions, genai-prerequisite, loss-functions, prerequisites, probability, sampling, statistics
What's the difference between zero-shot, few-shot, and chain-of-thought prompting?	Zero-shot: just instructions, no examples. Few-shot: include examples of desired input→output pairs. CoT: ask model to show reasoning steps. Each adds more guidance and typically improves quality.<br><br>Source: techniques/prompt-engineering.md<br>Tags: Foundation, chain-of-thought, few-shot, genai-techniques, prompt-engineering, prompting, techniques
How would you handle prompt injection in a production system?	Input sanitization, separate system/user prompts, output validation, don't include raw user input in system prompts. Use the model's built-in system prompt separation. For critical apps, add a second LLM call to verify the first output makes sense.<br><br>Source: techniques/prompt-engineering.md<br>Tags: Foundation, chain-of-thought, few-shot, genai-techniques, prompt-engineering, prompting, techniques
Why is Python dominant in AI if it is slower than C++?	Python gives fast iteration and a huge ecosystem, while the expensive numerical work runs underneath in optimized C, C++, CUDA, or vendor kernels. Python is the control layer, not the performance bottleneck.<br><br>Source: prerequisites/python-for-ai.md<br>Tags: Foundation, environment, genai-prerequisite, numpy, prerequisites, python, pytorch, transformers
What Python tools matter most for GenAI work?	NumPy for array thinking, PyTorch for tensors and training, Hugging Face libraries for models and tokenizers, plus environment management so your CUDA and package versions stay reproducible.<br><br>Source: prerequisites/python-for-ai.md<br>Tags: Foundation, environment, genai-prerequisite, numpy, prerequisites, python, pytorch, transformers
What is the most common beginner mistake when starting AI Python work?	Treating the environment as an afterthought. Many early failures come from incompatible package versions, wrong CUDA installs, or tensors ending up on different devices.<br><br>Source: prerequisites/python-for-ai.md<br>Tags: Foundation, environment, genai-prerequisite, numpy, prerequisites, python, pytorch, transformers
How would you improve a RAG pipeline that's giving wrong answers?	Debug in order: (1) Check if correct chunks are retrieved (retrieval eval), (2) If not, fix chunking strategy or embedding model, (3) If chunks are good but answer is wrong, fix the prompt or use a better LLM. Also consider adding re-ranking.<br><br>Source: techniques/rag.md<br>Tags: Foundation, embeddings, genai-techniques, rag, retrieval, techniques, vector-db
When would you choose RAG over fine-tuning?	RAG when: need up-to-date info, knowledge changes frequently, need source attribution. Fine-tuning when: need different output style/format, domain-specific reasoning, or model behavior changes. Best: combine both (Hybrid RAG).<br><br>Source: techniques/rag.md<br>Tags: Foundation, embeddings, genai-techniques, rag, retrieval, techniques, vector-db
Explain the difference between semantic and keyword search in RAG.	Semantic (vector) search finds conceptually similar content even with different words ("car" matches "automobile"). Keyword (BM25) search finds exact term matches. Hybrid combines both — best overall because semantic misses exact terms and BM25 misses synonyms.<br><br>Source: techniques/rag.md<br>Tags: Foundation, embeddings, genai-techniques, rag, retrieval, techniques, vector-db
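A deliberately simplistic hybrid-scoring sketch; the keyword and vector scorers are toy stand-ins for BM25 and an embedding model, and the 0.5 blend weight is an assumption:

```python
import numpy as np

docs = ["the car broke down", "automobile repair shop", "river bank fishing"]
doc_vecs = np.eye(3)                      # pretend embeddings (one-hot for the demo)
query, query_vec = "car repair", doc_vecs[1]

def keyword_score(q, d):
    q_terms, d_terms = set(q.split()), set(d.split())
    return len(q_terms & d_terms) / len(q_terms)

def vector_score(qv, dv):
    return float(qv @ dv)                 # cosine, since the toy vectors are unit-norm

alpha = 0.5                               # blend weight is a tunable assumption
scores = [alpha * keyword_score(query, d) + (1 - alpha) * vector_score(query_vec, v)
          for d, v in zip(docs, doc_vecs)]
print(int(np.argmax(scores)))             # doc 1 wins: keyword hit on "repair" + vector match
```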
Explain the Chinchilla scaling laws.	For a fixed compute budget, there's an optimal ratio of model size to training data. Chinchilla showed the optimal is ~20 tokens per parameter. GPT-3 (175B params, 300B tokens) was massively undertrained — a 70B model on 1.4T tokens would match it. This led to LLaMA's approach: smaller models, much more data. In 2025-2026, industry "over-trains" beyond Chinchilla-optimal because inference cost (running the model) matters more than training cost (one-time).<br><br>Source: foundations/scaling-laws-and-pretraining.md<br>Tags: D, Foundation, chinchilla, compute, data-mix, foundations, genai, pre-training, scaling-laws, training
How is a large language model pre-trained?	(1) Collect trillions of tokens from internet, books, code. (2) Clean and deduplicate aggressively. (3) Train a BPE tokenizer. (4) Set data mix ratios (web, code, books, math). (5) Train using next-token prediction on 10K-100K GPUs for 2-6 months using distributed parallelism (data, tensor, pipeline). (6) Monitor loss curves, handle spikes, checkpoint regularly. Cost: $10M-$500M+ per run.<br><br>Source: foundations/scaling-laws-and-pretraining.md<br>Tags: D, Foundation, chinchilla, compute, data-mix, foundations, genai, pre-training, scaling-laws, training
Why do LLMs use sub-word tokenization instead of word-level?	Word-level requires an impossibly large vocabulary (every word in every language) and can't handle misspellings, new words, or code. Sub-word splits rare words into common pieces ("unhappiness" → ["un", "happiness"]) while keeping frequent words whole. Fixed vocab size (~32K-128K), handles any input.<br><br>Source: foundations/tokenization.md<br>Tags: Foundation, bpe, foundations, genai-foundations, llm-internals, sentencepiece, tokenization, wordpiece
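A quick check with the Hugging Face `transformers` tokenizer (assumes the package is installed and downloads the GPT-2 tokenizer files; the exact split depends on the learned vocabulary):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.tokenize("unhappiness"))   # a rare word is split into smaller known pieces
print(tok.tokenize("the"))           # a frequent word stays a single token
```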
Why is tokenization a source of bias?	Languages with less representation in training data get worse tokenization — more tokens per word. This means non-English users spend more money, get slower responses, and use more of their context window for the same content. Larger vocabularies (LLaMA 3's 128K vs LLaMA 2's 32K) help mitigate this.<br><br>Source: foundations/tokenization.md<br>Tags: Foundation, bpe, foundations, genai-foundations, llm-internals, sentencepiece, tokenization, wordpiece
Why do Transformers use scaled dot-product attention (divide by √d_k)?	Without scaling, dot products grow large with high dimensions, pushing softmax into regions with tiny gradients. Dividing by √d_k keeps gradients healthy.<br><br>Source: foundations/transformers.md<br>Tags: Foundation, architecture, attention, deep-learning, foundations, genai-foundations, transformers
What's the computational complexity of self-attention?	O(n²·d) where n is sequence length and d is dimension. This quadratic scaling with n is the main bottleneck for long sequences.<br><br>Source: foundations/transformers.md<br>Tags: Foundation, architecture, attention, deep-learning, foundations, genai-foundations, transformers
Why decoder-only for generation instead of encoder-decoder?	Simpler architecture, easier to scale, and with enough data the decoder learns to "encode" implicitly. Also, causal masking naturally fits left-to-right generation.<br><br>Source: foundations/transformers.md<br>Tags: Foundation, architecture, attention, deep-learning, foundations, genai-foundations, transformers
