What makes an AI agent different from a chatbot?	A chatbot responds to messages. An agent sets goals, plans multi-step approaches, uses tools, observes results, and iterates. Agents are autonomous; chatbots are reactive.<br><br>Source: agents/ai-agents.md<br>Tags: Foundation, agentic-ai, agents, autonomy, function-calling, genai-techniques, tool-use
How would you prevent an AI agent from getting stuck in a loop?	Max iteration limits, self-reflection prompts ("Am I making progress?"), fallback to human, diverse retry strategies (try different tools/approaches), and logging for debugging.<br><br>Source: agents/ai-agents.md<br>Tags: Foundation, agentic-ai, agents, autonomy, function-calling, genai-techniques, tool-use
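The iteration-limit guard described above can be sketched in a few lines. This is a minimal illustration with hypothetical names (`run_agent`, `step_fn`, `"escalate_to_human"` are all made up for the example), not a production agent framework:

```python
# Sketch: cap agent iterations and fall back to a human when the cap trips.
def run_agent(task, step_fn, max_iters=5):
    """step_fn returns (done, result); both names are illustrative."""
    history = []                       # keep a log for debugging stuck runs
    for _ in range(max_iters):
        done, result = step_fn(task, history)
        history.append(result)
        if done:
            return result
    return "escalate_to_human"         # loop guard tripped

# A toy step function that never finishes, to show the guard firing:
stuck = lambda task, hist: (False, "retrying...")
print(run_agent("book a flight", stuck))  # → escalate_to_human
```

A real agent would also vary its retry strategy between iterations instead of repeating the same failing call.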
What's the ReAct pattern?	Reason + Act. The agent alternates between thinking (reasoning about what to do) and acting (calling tools). After each action, it observes the result and reasons about next steps. This interleaving of thought and action is more reliable than planning everything upfront.<br><br>Source: agents/ai-agents.md<br>Tags: Foundation, agentic-ai, agents, autonomy, function-calling, genai-techniques, tool-use
How does agent memory work?	Four types: short-term (conversation context), long-term (vector DB storing facts/preferences across sessions), episodic (summaries of past task executions for learning), and procedural (learned strategies and tool patterns). In practice, most production agents use short-term + simple long-term memory with vector retrieval.<br><br>Source: agents/ai-agents.md<br>Tags: Foundation, agentic-ai, agents, autonomy, function-calling, genai-techniques, tool-use
Why divide by √d_k in attention?	Without it, for large d_k, dot products become huge → softmax saturates → near-zero gradients. Scaling keeps variance at ~1.<br><br>Source: foundations/attention-mechanism.md<br>Tags: Foundation, attention, foundations, genai-foundations, multi-head, self-attention, transformers
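A quick NumPy check of the claim: dot products of unit-variance vectors have variance ≈ d_k, and dividing by √d_k brings it back to ≈ 1 (the sample sizes and seed here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
q = rng.standard_normal((10_000, d_k))
k = rng.standard_normal((10_000, d_k))

scores = (q * k).sum(axis=1)              # 10k raw dot products
print(scores.var())                       # ≈ d_k = 512: grows with dimension
print((scores / np.sqrt(d_k)).var())      # ≈ 1.0 after scaling
```

With variance ~1, the softmax inputs stay in a range where gradients don't vanish.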
What's the difference between MHA, MQA, and GQA?	MHA: separate K,V per head (most expressive, slowest). MQA: one shared K,V (fastest, some quality loss). GQA: groups of heads share K,V (good balance). LLaMA 2+ uses GQA.<br><br>Source: foundations/attention-mechanism.md<br>Tags: Foundation, attention, foundations, genai-foundations, multi-head, self-attention, transformers
How does Flash Attention improve efficiency without changing the math?	It tiles the computation to fit in SRAM (fast cache), avoiding materialization of the full n×n attention matrix in slow HBM (GPU memory). Same result, ~2-4x faster.<br><br>Source: foundations/attention-mechanism.md<br>Tags: Foundation, attention, foundations, genai-foundations, multi-head, self-attention, transformers
When would you use RAG vs just a long context window?	Long context when: few documents, need cross-references, latency isn't critical, and you can afford the token cost. RAG when: many documents (more than context window), need real-time data, cost-sensitive, or need to scale to millions of docs. In practice, combine both: cache stable reference docs in context, use RAG for dynamic query-specific retrieval.<br><br>Source: techniques/context-engineering.md<br>Tags: Foundation, context-caching, context-window, genai, long-context, prompt-caching, rag-vs-context, techniques
What is context engineering?	Context engineering is the practice of strategically constructing the full input to an LLM — system prompt, cached reference docs, RAG results, conversation history, and examples — to maximize output quality within the token budget. It's becoming more important than prompt engineering because the quality bottleneck is often WHAT information the model has access to, not HOW you phrase the question.<br><br>Source: techniques/context-engineering.md<br>Tags: Foundation, context-caching, context-window, genai, long-context, prompt-caching, rag-vs-context, techniques
What optimizer do you use for training Transformers and why?	AdamW. It's Adam with decoupled weight decay, which provides better regularization for Transformers. Adam adapts the learning rate per-parameter using running estimates of gradient mean and variance.<br><br>Source: prerequisites/deep-learning-fundamentals.md<br>Tags: Foundation, deep-learning, genai-prerequisite, gpu, optimizer, prerequisites, regularization, training
How would you handle GPU memory limitations when training?	(1) Reduce batch size + gradient accumulation, (2) Mixed precision (BF16), (3) Gradient checkpointing, (4) LoRA/QLoRA (train small adapters not full model), (5) DeepSpeed ZeRO / FSDP (distribute across GPUs).<br><br>Source: prerequisites/deep-learning-fundamentals.md<br>Tags: Foundation, deep-learning, genai-prerequisite, gpu, optimizer, prerequisites, regularization, training
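Technique (1) above works because accumulated micro-batch gradients equal the full-batch gradient. A NumPy sketch with a toy linear model and squared loss (shapes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.standard_normal((32, 4)), rng.standard_normal(32)
w = np.zeros(4)

def grad(Xb, yb, w):                    # d/dw of mean((Xb @ w - yb)^2)
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

big = grad(X, y, w)                     # one 32-sample batch (too big for "VRAM")
acc = sum(grad(X[i:i+8], y[i:i+8], w) for i in range(0, 32, 8)) / 4
print(np.allclose(big, acc))            # → True: 4 micro-batches of 8 match
```

In a framework like PyTorch this corresponds to calling the backward pass several times before a single optimizer step.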
What are embeddings and why do they matter for GenAI?	Embeddings map data to dense vectors where semantic similarity becomes geometric distance. They're the foundation of RAG (find relevant documents), semantic search (find by meaning), and even the first layer of every LLM. Without embeddings, modern AI can't represent or compare meaning.<br><br>Source: foundations/embeddings.md<br>Tags: Foundation, embeddings, foundations, genai-foundations, representation, similarity, vectors
What's the difference between word embeddings and sentence embeddings?	Word embeddings (Word2Vec, GloVe) encode individual words — "bank" always gets the same vector. Sentence embeddings (SBERT, text-embedding-3) encode entire sentences with context — "river bank" and "bank robbery" get very different vectors. Modern systems use sentence/paragraph embeddings.<br><br>Source: foundations/embeddings.md<br>Tags: Foundation, embeddings, foundations, genai-foundations, representation, similarity, vectors
When would you fine-tune vs use RAG?	Fine-tune for: output format changes, domain-specific reasoning/style, consistent behavior. RAG for: up-to-date knowledge, source attribution, private data access. Best practice in 2026: **combine both** — LoRA for behavior, RAG for facts.<br><br>Source: techniques/fine-tuning.md<br>Tags: Foundation, fine-tuning, genai-techniques, lora, peft, qlora, techniques, training
Explain how LoRA reduces memory requirements.	Instead of updating the full d×d weight matrix, LoRA decomposes it into two small matrices of rank r (d×r and r×d). With r=16 on a 4096-dim model, you train 0.78% of parameters. QLoRA goes further by quantizing the frozen base model to 4-bit, reducing memory from ~280GB to ~35GB for a 70B model.<br><br>Source: techniques/fine-tuning.md<br>Tags: Foundation, fine-tuning, genai-techniques, lora, peft, qlora, techniques, training
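The 0.78% figure in the card is simple arithmetic — the two rank-r adapter matrices replace the d×d update:

```python
d, r = 4096, 16
full = d * d                       # frozen base weight matrix (d x d)
lora = d * r + r * d               # adapter matrices A (d x r) and B (r x d)
print(f"{lora / full:.2%}")        # → 0.78% of the parameters are trainable
```

The fraction is 2r/d, so it shrinks further as model dimension grows or rank is reduced.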
How does function calling work in LLMs?	You define tools with names, descriptions, and parameter schemas. The LLM receives the user message + tool definitions, decides if a tool should be called, and generates a JSON object with the function name and arguments. YOUR code executes the function and feeds the result back to the LLM for final response generation. The LLM never actually runs the function.<br><br>Source: techniques/function-calling-and-structured-output.md<br>Tags: Foundation, function-calling, genai, grounding, json-mode, mcp, structured-output, techniques, tool-use
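The dispatch step can be sketched as below. The tool name, JSON shape, and `get_weather` function are all hypothetical — the point is that the model only *emits* the JSON, and our code runs the function:

```python
import json

def get_weather(city):                        # illustrative tool implementation
    return {"city": city, "temp_c": 21}

TOOLS = {"get_weather": get_weather}          # registry of callable tools

# What the LLM generates (it never executes anything itself):
model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'

call = json.loads(model_output)
result = TOOLS[call["name"]](**call["arguments"])
print(result)   # this result is fed back to the LLM for the final response
```

Real APIs wrap this exchange in tool/function message roles, but the control flow is the same.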
What is MCP and why does it matter?	Model Context Protocol is an open standard for connecting LLMs to external tools. Before MCP, every tool needed custom integration for each model. MCP provides a universal interface — any MCP-compatible tool works with any MCP-compatible client. It's becoming the "USB standard" for AI tool integration.<br><br>Source: techniques/function-calling-and-structured-output.md<br>Tags: Foundation, function-calling, genai, grounding, json-mode, mcp, structured-output, techniques, tool-use
How would you evaluate a RAG system?	Component-level: Retrieval quality (context precision + recall) — are the right chunks found? Generation quality (faithfulness + answer relevancy) — is the answer grounded and on-topic? Use RAGAS for automated metrics, plus a golden test set of 50+ question-answer pairs with human-verified ground truth.<br><br>Source: evaluation/evaluation-and-benchmarks.md<br>Tags: E, Foundation, benchmarks, evaluation, genai, humaneval, mmlu, ragas, testing
Why are traditional benchmarks becoming less useful?	Saturation (top models all score >90%), contamination (benchmark data in training sets), and gap between benchmark performance and real-world utility. The field is moving to dynamic benchmarks (LiveBench), harder tests (SWE-bench, ARC-AGI-2), and domain-specific evaluation.<br><br>Source: evaluation/evaluation-and-benchmarks.md<br>Tags: E, Foundation, benchmarks, evaluation, genai, humaneval, mmlu, ragas, testing
Explain the training pipeline of a modern LLM.	Pre-training (next-token prediction on internet text) → SFT (supervised fine-tuning on instruction-response pairs) → RLHF/DPO (learning from human preference comparisons) → Safety alignment<br><br>Source: llms/llms-overview.md<br>Tags: Foundation, claude, gemini, genai, gpt, language-models, llama, llm, llms
What's the difference between dense and MoE architectures?	Dense: every parameter processes every token (e.g., GPT-3, LLaMA 3). MoE: tokens are routed to a subset of "expert" sub-networks (e.g., LLaMA 4 Maverick). MoE gives more total capacity with less compute per token.<br><br>Source: llms/llms-overview.md<br>Tags: Foundation, claude, gemini, genai, gpt, language-models, llama, llm, llms
How do you choose between using an API (GPT/Claude) vs hosting an open model (LLaMA)?	API: faster to start, best performance, no infra. Self-host: data stays private, no vendor lock-in, customizable. Cost crossover: at ~1M+ tokens/day, self-hosting often becomes cheaper.<br><br>Source: llms/llms-overview.md<br>Tags: Foundation, claude, gemini, genai, gpt, language-models, llama, llm, llms
Why is the dot product central to the attention mechanism?	Attention computes Q·Kᵀ where Q = query and K = key. The dot product measures how "related" each query is to each key — high dot product = high attention. This is then softmaxed into weights that determine how much each value V contributes to the output.<br><br>Source: prerequisites/linear-algebra-for-ai.md<br>Tags: Foundation, dot-product, genai-prerequisite, linear-algebra, matrices, prerequisites, tensors, vectors
Why are GPUs used for deep learning?	Neural networks are fundamentally matrix multiplications. GPUs have thousands of cores designed for parallel math. A CPU executes a handful of operations at a time; a GPU performs thousands of multiply-adds simultaneously.<br><br>Source: prerequisites/linear-algebra-for-ai.md<br>Tags: Foundation, dot-product, genai-prerequisite, linear-algebra, matrices, prerequisites, tensors, vectors
What is Mixture of Experts and why does LLaMA 4 use it?	MoE has multiple "expert" FFN sub-networks per layer with a learned router. For each token, only top-K experts (e.g., 2 of 16) are activated. This gives the model capacity of the total parameters but computational cost of only the active experts. LLaMA 4 uses it to achieve 400B total params with only 17B active — massive capacity at manageable cost.<br><br>Source: foundations/modern-architectures.md<br>Tags: Foundation, architecture, flash-attention, foundations, genai, gqa, mixture-of-experts, moe, rope
What is GQA and how does it save memory?	Grouped-Query Attention shares K and V heads across groups of Q heads. With 64 Q heads and 8 KV heads, the KV cache is 8x smaller than full MHA. This is critical for serving long-context models — KV cache can otherwise consume more memory than the model weights.<br><br>Source: foundations/modern-architectures.md<br>Tags: Foundation, architecture, flash-attention, foundations, genai, gqa, mixture-of-experts, moe, rope
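Back-of-envelope arithmetic for the 8x claim, using the standard KV-cache estimate (2x for K and V, FP16 = 2 bytes); the 80-layer / 128-dim-head numbers below are illustrative of a roughly 70B-class model, not any specific one:

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, bytes_per=2):
    # 2 tensors (K and V) per layer, bytes_per=2 assumes FP16 storage
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per / 1e9

mha = kv_cache_gb(layers=80, kv_heads=64, head_dim=128, seq_len=32_768)
gqa = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, seq_len=32_768)
print(f"MHA ≈ {mha:.0f} GB vs GQA ≈ {gqa:.0f} GB ({mha/gqa:.0f}x smaller)")
```

At ~86 GB per 32K-token sequence, full MHA caching would dwarf the weights; dropping to 8 KV heads makes long-context serving feasible.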
What's the difference between BERT and GPT?	BERT is an ENCODER that sees all tokens bidirectionally (optimized for understanding — classification, NER, embeddings). GPT is a DECODER that sees only past tokens (optimized for generation — text, code, chat). Both use Transformers, but BERT predicts masked tokens while GPT predicts the next token.<br><br>Source: prerequisites/nlp-fundamentals.md<br>Tags: Foundation, bert, genai-prerequisite, natural-language-processing, ner, nlp, prerequisites, sentiment, text-classification
Has GenAI made traditional NLP obsolete?	Mostly, yes. LLMs handle most NLP tasks via prompting, often better than task-specific models. However, BERT-based models survive for: (1) embeddings (BGE, E5), (2) sub-100ms classification at scale, (3) BM25/TF-IDF for initial retrieval in RAG. The field has consolidated around "one model, many tasks."<br><br>Source: prerequisites/nlp-fundamentals.md<br>Tags: Foundation, bert, genai-prerequisite, natural-language-processing, ner, nlp, prerequisites, sentiment, text-classification
What does backpropagation actually compute?	The gradient of the loss function with respect to every weight in the network, using the chain rule of calculus. These gradients tell us how to adjust each weight to reduce the error.<br><br>Source: prerequisites/neural-networks.md<br>Tags: Foundation, activation, backpropagation, cnn, genai-prerequisite, neural-networks, perceptron, prerequisites, rnn
Why do we need activation functions?	Without non-linear activation, any number of layers collapses to a single linear transformation (y = Wx + b). Non-linearity lets the network approximate any function, not just lines/planes.<br><br>Source: prerequisites/neural-networks.md<br>Tags: Foundation, activation, backpropagation, cnn, genai-prerequisite, neural-networks, perceptron, prerequisites, rnn
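The collapse is easy to verify numerically: two linear "layers" compose into one matrix, and inserting a ReLU breaks the equivalence (the small matrices here are chosen by hand so the ReLU visibly clips a negative):

```python
import numpy as np

W1 = np.array([[1., -1.], [0., 1.]])
W2 = np.array([[1., 1.], [1., 0.]])
x = np.array([1., 2.])

# No activation: two layers collapse into one linear map.
two_layers = W2 @ (W1 @ x)
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))   # → True

# With ReLU between the layers, the outputs differ.
relu = lambda v: np.maximum(v, 0)
print(np.allclose(W2 @ relu(W1 @ x), one_layer))  # → False
```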
What is temperature in LLM generation?	Temperature scales the logits before softmax. Low temperature (→0) makes the distribution sharper (confident picks), high temperature makes it flatter (random picks). Mathematically: P = softmax(logits / T).<br><br>Source: prerequisites/probability-and-statistics.md<br>Tags: Foundation, bayes, distributions, genai-prerequisite, loss-functions, prerequisites, probability, sampling, statistics
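The formula P = softmax(logits / T) in a few lines of NumPy (the logits here are arbitrary example values):

```python
import numpy as np

def sample_probs(logits, T):
    z = np.asarray(logits) / T
    e = np.exp(z - z.max())        # subtract max for numerical stability
    return e / e.sum()

logits = [2.0, 1.0, 0.5]
print(sample_probs(logits, T=0.1).round(3))   # sharp: nearly one-hot
print(sample_probs(logits, T=2.0).round(3))   # flat: closer to uniform
```

As T → 0 this approaches greedy argmax decoding; as T grows, all tokens become nearly equally likely.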
What loss function do LLMs use and why?	Cross-entropy loss. It measures how different the model's predicted probability distribution is from the true distribution (where the correct next token has probability 1). Minimizing cross-entropy pushes the model to assign high probability to the correct token.<br><br>Source: prerequisites/probability-and-statistics.md<br>Tags: Foundation, bayes, distributions, genai-prerequisite, loss-functions, prerequisites, probability, sampling, statistics
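When the true distribution is one-hot, cross-entropy reduces to −log P(correct token); a two-line illustration with a made-up predicted distribution:

```python
import math

probs = {"cat": 0.7, "dog": 0.2, "car": 0.1}   # model's predicted next-token probs
loss_good = -math.log(probs["cat"])            # true next token is "cat": ≈ 0.357
loss_bad = -math.log(probs["car"])             # if truth had been "car": ≈ 2.303
print(loss_good, loss_bad)                     # low loss when P(correct) is high
```

Minimizing the average of this quantity over trillions of tokens is the entire pre-training objective.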
What's the difference between zero-shot, few-shot, and chain-of-thought prompting?	Zero-shot: just instructions, no examples. Few-shot: include examples of desired input→output pairs. CoT: ask model to show reasoning steps. Each adds more guidance and typically improves quality.<br><br>Source: techniques/prompt-engineering.md<br>Tags: Foundation, chain-of-thought, few-shot, genai-techniques, prompt-engineering, prompting, techniques
How would you handle prompt injection in a production system?	Input sanitization, strict system/user role separation (never paste raw user input into the system prompt), and output validation. For critical apps, add a second LLM call to verify the first output makes sense.<br><br>Source: techniques/prompt-engineering.md<br>Tags: Foundation, chain-of-thought, few-shot, genai-techniques, prompt-engineering, prompting, techniques
Why is Python dominant in AI if it is slower than C++?	Python gives fast iteration and a huge ecosystem, while the expensive numerical work runs underneath in optimized C, C++, CUDA, or vendor kernels. Python is the control layer, not the performance bottleneck.<br><br>Source: prerequisites/python-for-ai.md<br>Tags: Foundation, environment, genai-prerequisite, numpy, prerequisites, python, pytorch, transformers
What Python tools matter most for GenAI work?	NumPy for array thinking, PyTorch for tensors and training, Hugging Face libraries for models and tokenizers, plus environment management so your CUDA and package versions stay reproducible.<br><br>Source: prerequisites/python-for-ai.md<br>Tags: Foundation, environment, genai-prerequisite, numpy, prerequisites, python, pytorch, transformers
What is the most common beginner mistake when starting AI Python work?	Treating the environment as an afterthought. Many early failures come from incompatible package versions, wrong CUDA installs, or tensors ending up on different devices.<br><br>Source: prerequisites/python-for-ai.md<br>Tags: Foundation, environment, genai-prerequisite, numpy, prerequisites, python, pytorch, transformers
How would you improve a RAG pipeline that's giving wrong answers?	Debug in order: (1) Check if correct chunks are retrieved (retrieval eval), (2) If not, fix chunking strategy or embedding model, (3) If chunks are good but answer is wrong, fix the prompt or use a better LLM. Also consider adding re-ranking.<br><br>Source: techniques/rag.md<br>Tags: Foundation, embeddings, genai-techniques, rag, retrieval, techniques, vector-db
When would you choose RAG over fine-tuning?	RAG when: need up-to-date info, knowledge changes frequently, need source attribution. Fine-tuning when: need different output style/format, domain-specific reasoning, or model behavior changes. Best: combine both (Hybrid RAG).<br><br>Source: techniques/rag.md<br>Tags: Foundation, embeddings, genai-techniques, rag, retrieval, techniques, vector-db
Explain the difference between semantic and keyword search in RAG.	Semantic (vector) search finds conceptually similar content even with different words ("car" matches "automobile"). Keyword (BM25) search finds exact term matches. Hybrid combines both — best overall because semantic misses exact terms and BM25 misses synonyms.<br><br>Source: techniques/rag.md<br>Tags: Foundation, embeddings, genai-techniques, rag, retrieval, techniques, vector-db
Explain the Chinchilla scaling laws.	For a fixed compute budget, there's an optimal ratio of model size to training data. Chinchilla showed the optimal is ~20 tokens per parameter. GPT-3 (175B params, 300B tokens) was massively undertrained — a 70B model on 1.4T tokens would match it. This led to LLaMA's approach: smaller models, much more data. In 2025-2026, industry "over-trains" beyond Chinchilla-optimal because inference cost (running the model) matters more than training cost (one-time).<br><br>Source: foundations/scaling-laws-and-pretraining.md<br>Tags: D, Foundation, chinchilla, compute, data-mix, foundations, genai, pre-training, scaling-laws, training
How is a large language model pre-trained?	(1) Collect trillions of tokens from internet, books, code. (2) Clean and deduplicate aggressively. (3) Train a BPE tokenizer. (4) Set data mix ratios (web, code, books, math). (5) Train using next-token prediction on 10K-100K GPUs for 2-6 months using distributed parallelism (data, tensor, pipeline). (6) Monitor loss curves, handle spikes, checkpoint regularly. Cost: $10M-$500M+ per run.<br><br>Source: foundations/scaling-laws-and-pretraining.md<br>Tags: D, Foundation, chinchilla, compute, data-mix, foundations, genai, pre-training, scaling-laws, training
Why do LLMs use sub-word tokenization instead of word-level?	Word-level requires an impossibly large vocabulary (every word in every language) and can't handle misspellings, new words, or code. Sub-word splits rare words into common pieces ("unhappiness" → ["un", "happiness"]) while keeping frequent words whole. Fixed vocab size (~32K-128K), handles any input.<br><br>Source: foundations/tokenization.md<br>Tags: Foundation, bpe, foundations, genai-foundations, llm-internals, sentencepiece, tokenization, wordpiece
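A toy greedy longest-match tokenizer shows the "unhappiness" → ["un", "happiness"] split. This is an illustration of the sub-word idea, not real BPE (which learns merges from corpus frequencies); the tiny vocabulary is made up:

```python
VOCAB = {"un", "happi", "ness", "happiness",
         "h", "a", "p", "i", "n", "e", "s", "u"}  # single chars as fallback

def tokenize(word):
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):     # try the longest match first
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

print(tokenize("unhappiness"))   # → ['un', 'happiness']
```

Because every single character is in the vocabulary, any input can be tokenized — the property word-level vocabularies lack.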
Why is tokenization a source of bias?	Languages with less representation in training data get worse tokenization — more tokens per word. This means non-English users spend more money, get slower responses, and use more of their context window for the same content. Larger vocabularies (LLaMA 3's 128K vs LLaMA 2's 32K) help mitigate this.<br><br>Source: foundations/tokenization.md<br>Tags: Foundation, bpe, foundations, genai-foundations, llm-internals, sentencepiece, tokenization, wordpiece
Why do Transformers use scaled dot-product attention (divide by √d_k)?	Without scaling, dot products grow large with high dimensions, pushing softmax into regions with tiny gradients. Dividing by √d_k keeps gradients healthy.<br><br>Source: foundations/transformers.md<br>Tags: Foundation, architecture, attention, deep-learning, foundations, genai-foundations, transformers
What's the computational complexity of self-attention?	O(n²·d) where n is sequence length and d is dimension. This quadratic scaling with n is the main bottleneck for long sequences.<br><br>Source: foundations/transformers.md<br>Tags: Foundation, architecture, attention, deep-learning, foundations, genai-foundations, transformers
Why decoder-only for generation instead of encoder-decoder?	Simpler architecture, easier to scale, and with enough data the decoder learns to "encode" implicitly. Also, causal masking naturally fits left-to-right generation.<br><br>Source: foundations/transformers.md<br>Tags: Foundation, architecture, attention, deep-learning, foundations, genai-foundations, transformers
