What makes an AI agent different from a chatbot?	A chatbot responds to messages. An agent sets goals, plans multi-step approaches, uses tools, observes results, and iterates. Agents are autonomous; chatbots are reactive.<br><br>Source: agents/ai-agents.md<br>Tags: Foundation, agentic-ai, agents, autonomy, function-calling, genai-techniques, tool-use
How would you prevent an AI agent from getting stuck in a loop?	Max iteration limits, self-reflection prompts ("Am I making progress?"), fallback to human, diverse retry strategies (try different tools/approaches), and logging for debugging.<br><br>Source: agents/ai-agents.md<br>Tags: Foundation, agentic-ai, agents, autonomy, function-calling, genai-techniques, tool-use
What's the ReAct pattern?	Reason + Act. The agent alternates between thinking (reasoning about what to do) and acting (calling tools). After each action, it observes the result and reasons about next steps. This interleaving of thought and action is more reliable than planning everything upfront.<br><br>Source: agents/ai-agents.md<br>Tags: Foundation, agentic-ai, agents, autonomy, function-calling, genai-techniques, tool-use
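A minimal sketch of that Reason/Act/Observe loop, assuming a hypothetical `llm()` callable that returns either a tool request or a final answer and a `TOOLS` dict of plain Python callables (not any specific framework's API):

```python
# Minimal ReAct-style loop. `llm` and `TOOLS` are hypothetical stand-ins.
def react_agent(task: str, llm, TOOLS: dict, max_steps: int = 8) -> str:
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):                       # hard iteration cap (see loop-prevention card)
        step = llm("\n".join(transcript))            # Reason: model thinks and picks an action
        if step["type"] == "final":                  # done: return the answer
            return step["answer"]
        observation = TOOLS[step["tool"]](**step["args"])   # Act: run the chosen tool
        transcript.append(f"Thought: {step['thought']}")
        transcript.append(f"Action: {step['tool']}({step['args']})")
        transcript.append(f"Observation: {observation}")    # Observe: result feeds the next reasoning step
    return "Stopped: max steps reached"              # fallback instead of looping forever
```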
How does agent memory work?	Four types: short-term (conversation context), long-term (vector DB storing facts/preferences across sessions), episodic (summaries of past task executions for learning), and procedural (learned strategies and tool patterns). In practice, most production agents use short-term + simple long-term memory with vector retrieval.<br><br>Source: agents/ai-agents.md<br>Tags: Foundation, agentic-ai, agents, autonomy, function-calling, genai-techniques, tool-use
What should an AI engineer do when working on a potentially regulated use case?	Classify the use case early against EU AI Act Annex III categories, document system purpose and limitations, build evaluation and monitoring into the workflow, and pull in legal or compliance partners before launch. An engineer who flags a high-risk classification before code is written saves far more time than one who raises it post-launch.<br><br>Source: ethics-and-safety/ai-regulation.md<br>Tags: E, compliance, digital-omnibus, ethics-and-safety, eu-ai-act, governance, nist, regulation, responsible-ai
Why does the NIST AI RMF matter if it is voluntary?	It provides an operational structure for governance and risk management that maps directly to engineering workflows (Govern → Map → Measure → Manage). Many enterprises use it to organize trustworthy-AI programs, and US federal procurement increasingly references it. It's also a good baseline before your jurisdiction mandates something more specific.<br><br>Source: ethics-and-safety/ai-regulation.md<br>Tags: E, compliance, digital-omnibus, ethics-and-safety, eu-ai-act, governance, nist, regulation, responsible-ai
What is the significance of August 2, 2026 for engineering teams?	It's the date the EU AI Act becomes fully applicable for Annex III high-risk AI systems. After this date, deploying a non-compliant high-risk AI system in the EU exposes the deployer to fines of up to €15M or 3% of global annual turnover (the €35M / 7% ceiling applies to prohibited practices). Engineering teams should treat this as a hard product deadline: conformity assessments, human oversight mechanisms, logging (Art. 12), and testing records (Art. 9) must all be in place.<br><br>Source: ethics-and-safety/ai-regulation.md<br>Tags: E, compliance, digital-omnibus, ethics-and-safety, eu-ai-act, governance, nist, regulation, responsible-ai
Why is prompt injection a security problem and not only a quality problem?	Because malicious instructions can manipulate system behavior, trigger data leakage, or cause unauthorized actions through tools and downstream systems. That makes it part of the application's security surface.<br><br>Source: ethics-and-safety/adversarial-ml-and-ai-security.md<br>Tags: E, adversarial-ml, ethics-and-safety, jailbreaks, prompt-injection, red-teaming, security
What is the first rule for AI security in agent systems?	Minimize and constrain what the agent can do. Least privilege, validation, and human approval for sensitive actions matter more than clever prompting alone.<br><br>Source: ethics-and-safety/adversarial-ml-and-ai-security.md<br>Tags: E, adversarial-ml, ethics-and-safety, jailbreaks, prompt-injection, red-teaming, security
Why divide by √d_k in attention?	Without it, for large d_k, dot products become huge → softmax saturates → near-zero gradients. Scaling keeps variance at ~1.<br><br>Source: foundations/attention-mechanism.md<br>Tags: Foundation, attention, foundations, genai-foundations, multi-head, self-attention, transformers
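A quick NumPy illustration of the saturation effect, using synthetic unit-variance queries and keys:

```python
import numpy as np

np.random.seed(0)
d_k = 512
q = np.random.randn(d_k)          # unit-variance query
K = np.random.randn(10, d_k)      # 10 unit-variance keys

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

raw = K @ q                        # dot products have std ~ sqrt(d_k) ~ 23
scaled = raw / np.sqrt(d_k)        # variance brought back to ~1

print(softmax(raw).round(3))       # nearly one-hot: softmax saturated, gradients vanish
print(softmax(scaled).round(3))    # smooth distribution: healthy gradients
```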
What's the difference between MHA, MQA, and GQA?	MHA: separate K,V per head (most expressive, slowest). MQA: one shared K,V (fastest, some quality loss). GQA: groups of heads share K,V (good balance). LLaMA 2+ uses GQA.<br><br>Source: foundations/attention-mechanism.md<br>Tags: Foundation, attention, foundations, genai-foundations, multi-head, self-attention, transformers
How does Flash Attention improve efficiency without changing the math?	It tiles the computation to fit in SRAM (fast cache), avoiding materialization of the full n×n attention matrix in slow HBM (GPU memory). Same result, ~2-4x faster.<br><br>Source: foundations/attention-mechanism.md<br>Tags: Foundation, attention, foundations, genai-foundations, multi-head, self-attention, transformers
When would you use RAG vs just a long context window?	Long context when: few documents, need cross-references, latency isn't critical, and you can afford the token cost. RAG when: many documents (more than context window), need real-time data, cost-sensitive, or need to scale to millions of docs. In practice, combine both: cache stable reference docs in context, use RAG for dynamic query-specific retrieval.<br><br>Source: techniques/context-engineering.md<br>Tags: Foundation, context-caching, context-window, genai, long-context, prompt-caching, rag-vs-context, techniques
What is context engineering?	Context engineering is the practice of strategically constructing the full input to an LLM — system prompt, cached reference docs, RAG results, conversation history, and examples — to maximize output quality within the token budget. It's becoming more important than prompt engineering because the quality bottleneck is often WHAT information the model has access to, not HOW you phrase the question.<br><br>Source: techniques/context-engineering.md<br>Tags: Foundation, context-caching, context-window, genai, long-context, prompt-caching, rag-vs-context, techniques
What optimizer do you use for training Transformers and why?	AdamW. It's Adam with decoupled weight decay, which provides better regularization for Transformers. Adam adapts the learning rate per-parameter using running estimates of gradient mean and variance.<br><br>Source: prerequisites/deep-learning-fundamentals.md<br>Tags: Foundation, deep-learning, genai-prerequisite, gpu, optimizer, prerequisites, regularization, training
How would you handle GPU memory limitations when training?	(1) Reduce batch size + gradient accumulation, (2) Mixed precision (BF16), (3) Gradient checkpointing, (4) LoRA/QLoRA (train small adapters not full model), (5) DeepSpeed ZeRO / FSDP (distribute across GPUs).<br><br>Source: prerequisites/deep-learning-fundamentals.md<br>Tags: Foundation, deep-learning, genai-prerequisite, gpu, optimizer, prerequisites, regularization, training
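A minimal PyTorch sketch of (1), gradient accumulation: tiny micro-batches, one optimizer update every `accum_steps` steps so the effective batch stays large. Toy model and data stand in for the real LLM and dataloader:

```python
import torch
import torch.nn as nn

# Toy setup; in practice `model` is your LLM and `data` your dataloader.
model = nn.Linear(16, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
data = [(torch.randn(2, 16), torch.randint(0, 4, (2,))) for _ in range(32)]

accum_steps = 8                                   # effective batch = micro_batch * accum_steps
optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = loss_fn(model(x), y) / accum_steps     # scale so accumulated gradients average correctly
    loss.backward()                               # gradients add up across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()                          # one update per effective batch
        optimizer.zero_grad()
```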
What are embeddings and why do they matter for GenAI?	Embeddings map data to dense vectors where semantic similarity becomes geometric distance. They're the foundation of RAG (find relevant documents), semantic search (find by meaning), and even the first layer of every LLM. Without embeddings, modern AI can't represent or compare meaning.<br><br>Source: foundations/embeddings.md<br>Tags: Foundation, embeddings, foundations, genai-foundations, representation, similarity, vectors
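A tiny NumPy illustration of "similarity becomes geometric distance", with made-up 4-dim vectors (real embedding models output hundreds to thousands of dimensions):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings; a real model would produce these from text.
cat   = np.array([0.9, 0.8, 0.1, 0.0])
dog   = np.array([0.8, 0.9, 0.2, 0.1])
stock = np.array([0.1, 0.0, 0.9, 0.8])

print(cosine(cat, dog))    # high: semantically close
print(cosine(cat, stock))  # low: semantically distant
```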
What's the difference between word embeddings and sentence embeddings?	Word embeddings (Word2Vec, GloVe) encode individual words — "bank" always gets the same vector. Sentence embeddings (SBERT, text-embedding-3) encode entire sentences with context — "river bank" and "bank robbery" get very different vectors. Modern systems use sentence/paragraph embeddings.<br><br>Source: foundations/embeddings.md<br>Tags: Foundation, embeddings, foundations, genai-foundations, representation, similarity, vectors
How does RLHF work?	Generate multiple responses → humans rank them by preference → train a reward model on those rankings → use RL (PPO) to fine-tune the LLM to maximize the reward model's score. This teaches the model nuanced preferences (helpful, harmless, honest) that explicit rules can't capture.<br><br>Source: ethics-and-safety/ethics-safety-alignment.md<br>Tags: E, alignment, bias, ethics, ethics-and-safety, genai, hallucination, responsible-ai, rlhf, safety
How would you handle hallucination in a production system?	Layer defenses: (1) RAG for factual grounding, (2) Force citations/sources, (3) Low temperature for factual tasks, (4) Output validation (check claims against a knowledge base), (5) Human-in-the-loop for high-stakes decisions.<br><br>Source: ethics-and-safety/ethics-safety-alignment.md<br>Tags: E, alignment, bias, ethics, ethics-and-safety, genai, hallucination, responsible-ai, rlhf, safety
What's the difference between RLHF and DPO?	Both learn from human preference pairs (A is better than B). RLHF first trains a separate reward model, then uses RL to optimize. DPO skips the reward model and directly optimizes the LLM on preference pairs — simpler, cheaper, similar quality.<br><br>Source: ethics-and-safety/ethics-safety-alignment.md<br>Tags: E, alignment, bias, ethics, ethics-and-safety, genai, hallucination, responsible-ai, rlhf, safety
When would you fine-tune vs use RAG?	Fine-tune for: output format changes, domain-specific reasoning/style, consistent behavior. RAG for: up-to-date knowledge, source attribution, private data access. Best practice in 2026: **combine both** — LoRA for behavior, RAG for facts.<br><br>Source: techniques/fine-tuning.md<br>Tags: Foundation, fine-tuning, genai-techniques, lora, peft, qlora, techniques, training
Explain how LoRA reduces memory requirements.	Instead of updating the full d×d weight matrix, LoRA freezes it and trains a low-rank update ΔW = B·A, where B is d×r and A is r×d for a small rank r. With r=16 on a 4096-dim model, you train 0.78% of parameters. QLoRA goes further by quantizing the frozen base model to 4-bit, reducing memory from ~280GB to ~35GB for a 70B model.<br><br>Source: techniques/fine-tuning.md<br>Tags: Foundation, fine-tuning, genai-techniques, lora, peft, qlora, techniques, training
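A rough sketch of the low-rank update and the parameter-count arithmetic behind the 0.78% figure (shapes only, not a specific library's implementation):

```python
import torch
import torch.nn as nn

d, r = 4096, 16
W = nn.Linear(d, d, bias=False)                  # frozen base weight (d x d)
W.weight.requires_grad_(False)

A = nn.Parameter(torch.randn(r, d) * 0.01)       # trainable, r x d
B = nn.Parameter(torch.zeros(d, r))              # trainable, d x r (zero init: update starts at 0)

def lora_forward(x):
    return W(x) + x @ A.T @ B.T                  # y = Wx + B(Ax): low-rank delta added to frozen weight

print(lora_forward(torch.randn(2, d)).shape)     # torch.Size([2, 4096])
trainable = A.numel() + B.numel()                # 2 * d * r = 131,072
print(trainable / W.weight.numel())              # ~0.0078, i.e. the 0.78% figure
```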
How does function calling work in LLMs?	You define tools with names, descriptions, and parameter schemas. The LLM receives the user message + tool definitions, decides if a tool should be called, and generates a JSON object with the function name and arguments. YOUR code executes the function and feeds the result back to the LLM for final response generation. The LLM never actually runs the function.<br><br>Source: techniques/function-calling-and-structured-output.md<br>Tags: Foundation, function-calling, genai, grounding, json-mode, mcp, structured-output, techniques, tool-use
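A hedged sketch of the round-trip. The tool schema is OpenAI-style (exact field names vary by provider), and `fake_llm` is a scripted stand-in for the real model API so the flow is visible end to end:

```python
import json

# Tool definition sent alongside the user message (schema shape varies by provider).
tools = [{
    "name": "get_weather",
    "description": "Get current weather for a city",
    "parameters": {"type": "object",
                   "properties": {"city": {"type": "string"}},
                   "required": ["city"]},
}]

def get_weather(city: str) -> str:
    return f"18°C and cloudy in {city}"

def fake_llm(user_msg, tools, tool_result=None):
    # Stand-in for the model: first call "decides" to use the tool,
    # second call turns the tool result into a final answer.
    if tool_result is None:
        return {"tool": "get_weather", "arguments": json.dumps({"city": "Oslo"})}
    return {"text": f"Right now it is {tool_result}."}

def handle(user_msg, call_llm):
    decision = call_llm(user_msg, tools)                         # model picks a tool + JSON args
    if "tool" in decision:
        args = json.loads(decision["arguments"])
        result = get_weather(**args)                             # YOUR code executes the function
        decision = call_llm(user_msg, tools, tool_result=result) # result fed back for the final answer
    return decision["text"]

print(handle("What's the weather in Oslo?", fake_llm))
```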
What is MCP and why does it matter?	Model Context Protocol is an open standard for connecting LLMs to external tools. Before MCP, every tool needed custom integration for each model. MCP provides a universal interface — any MCP-compatible tool works with any MCP-compatible client. It's becoming the "USB standard" for AI tool integration.<br><br>Source: techniques/function-calling-and-structured-output.md<br>Tags: Foundation, function-calling, genai, grounding, json-mode, mcp, structured-output, techniques, tool-use
What is the most effective way to reduce hallucination in enterprise assistants?	Ground the answer on retrieval or tool outputs, require evidence in the response path, and add a post-generation verification step with abstention when confidence is low.<br><br>Source: llms/hallucination-detection.md<br>Tags: B, E, factuality, groundedness, hallucination, llm, llms, reliability
How do detection and mitigation differ?	Detection estimates whether an answer is unsupported. Mitigation changes the system so unsupported answers happen less often or are blocked before the user sees them.<br><br>Source: llms/hallucination-detection.md<br>Tags: B, E, factuality, groundedness, hallucination, llm, llms, reliability
How would you evaluate a RAG system?	Component-level: Retrieval quality (context precision + recall) — are the right chunks found? Generation quality (faithfulness + answer relevancy) — is the answer grounded and on-topic? Use RAGAS for automated metrics, plus a golden test set of 50+ question-answer pairs with human-verified ground truth.<br><br>Source: evaluation/evaluation-and-benchmarks.md<br>Tags: E, Foundation, benchmarks, evaluation, genai, humaneval, mmlu, ragas, testing
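A small sketch of the retrieval side of such a suite: context precision and recall against a golden set of relevant chunk IDs. RAGAS layers LLM-judged faithfulness and answer relevancy on top; this is just the set arithmetic:

```python
def retrieval_metrics(retrieved_ids: list[str], golden_ids: set[str]) -> dict:
    hits = [cid for cid in retrieved_ids if cid in golden_ids]
    precision = len(hits) / len(retrieved_ids) if retrieved_ids else 0.0  # share of retrieved chunks that are relevant
    recall = len(hits) / len(golden_ids) if golden_ids else 0.0           # share of relevant chunks that were found
    return {"context_precision": precision, "context_recall": recall}

# Example: 3 of 5 retrieved chunks are relevant; 3 of the 4 golden chunks were found.
print(retrieval_metrics(["c1", "c2", "c7", "c9", "c4"], {"c1", "c2", "c4", "c8"}))
# {'context_precision': 0.6, 'context_recall': 0.75}
```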
Why are traditional benchmarks becoming less useful?	Saturation (top models all score >90%), contamination (benchmark data in training sets), and gap between benchmark performance and real-world utility. The field is moving to dynamic benchmarks (LiveBench), harder tests (SWE-bench, ARC-AGI-2), and domain-specific evaluation.<br><br>Source: evaluation/evaluation-and-benchmarks.md<br>Tags: E, Foundation, benchmarks, evaluation, genai, humaneval, mmlu, ragas, testing
Why are benchmarks not enough for production LLM evaluation?	Benchmarks measure generic capability, but production systems depend on domain data, UX constraints, retrieval quality, safety needs, and business outcomes. You need task-specific evaluation tied to real failure modes.<br><br>Source: evaluation/llm-evaluation-deep-dive.md<br>Tags: B, E, deepeval, evaluation, judges, llm-as-judge, ragas, regression
What would you measure in a RAG eval suite?	Retrieval quality, groundedness, answer usefulness, latency, and cost. I would also include adversarial and ambiguous queries because those reveal brittle behavior quickly.<br><br>Source: evaluation/llm-evaluation-deep-dive.md<br>Tags: B, E, deepeval, evaluation, judges, llm-as-judge, ragas, regression
Explain the training pipeline of a modern LLM.	Pre-training (next-token prediction on internet text) → SFT (supervised fine-tuning on instruction-response pairs) → RLHF/DPO (learning from human preference comparisons) → Safety alignment<br><br>Source: llms/llms-overview.md<br>Tags: Foundation, claude, gemini, genai, gpt, language-models, llama, llm, llms
What's the difference between dense and MoE architectures?	Dense: every parameter processes every token (e.g., LLaMA 3, Mistral 7B). MoE: tokens are routed to a subset of "expert" sub-networks (e.g., Mixtral, DeepSeek-V3, LLaMA 4 Maverick). MoE gives more total capacity with less compute per token.<br><br>Source: llms/llms-overview.md<br>Tags: Foundation, claude, gemini, genai, gpt, language-models, llama, llm, llms
How do you choose between using an API (GPT/Claude) vs hosting an open model (LLaMA)?	API: faster to start, best performance, no infra. Self-host: data stays private, no vendor lock-in, customizable. Cost crossover: at ~1M+ tokens/day, self-hosting often becomes cheaper.<br><br>Source: llms/llms-overview.md<br>Tags: Foundation, claude, gemini, genai, gpt, language-models, llama, llm, llms
Why is the dot product central to the attention mechanism?	Attention computes Q·Kᵀ where Q = query and K = key. The dot product measures how "related" each query is to each key — high dot product = high attention. This is then softmaxed into weights that determine how much each value V contributes to the output.<br><br>Source: prerequisites/linear-algebra-for-ai.md<br>Tags: Foundation, dot-product, genai-prerequisite, linear-algebra, matrices, prerequisites, tensors, vectors
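The whole single-head attention computation in a few NumPy lines (toy shapes, no masking or batching):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n, d_k = 4, 8                        # 4 tokens, 8-dim head (toy sizes)
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))

scores = Q @ K.T / np.sqrt(d_k)      # how related each query is to each key
weights = softmax(scores, axis=-1)   # rows sum to 1: attention weights
output = weights @ V                 # each output row is a weighted mix of the value vectors
print(output.shape)                  # (4, 8)
```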
Why are GPUs used for deep learning?	Neural networks are fundamentally matrix multiplications. GPUs have thousands of cores designed for parallel math operations. A CPU works through a matrix multiply with a handful of cores; a GPU performs thousands of multiply-adds simultaneously.<br><br>Source: prerequisites/linear-algebra-for-ai.md<br>Tags: Foundation, dot-product, genai-prerequisite, linear-algebra, matrices, prerequisites, tensors, vectors
What is Tool Poisoning in MCP and how do you defend against it?	Tool Poisoning is a form of indirect prompt injection where an attacker embeds malicious instructions in a tool's description or metadata. When an LLM connects to the MCP server and reads tool definitions, it treats those instructions as authoritative. The model might then exfiltrate data, bypass controls, or perform unauthorized actions. Defense: (1) audit all tool descriptions before approval, (2) use regex and LLM-based scanners for injection patterns, (3) sandbox tool execution so even a compromised tool can't access sensitive data, (4) pin tool definition hashes to detect rug-pull updates.<br><br>Source: ethics-and-safety/mcp-security.md<br>Tags: E, adversarial, agentic-security, ethics-and-safety, genai-safety, mcp, security, supply-chain, tool-poisoning
How would you design a security layer for MCP tools in an enterprise?	Zero Trust architecture with five layers. (1) Allowlisting — maintain a curated registry of approved MCP servers. (2) Description audit — automated scanning of all tool descriptions for injection patterns, with human review for new servers. (3) Least privilege — each tool gets minimum OAuth scopes and file path restrictions. (4) Sandbox — run MCP servers in containers with network egress rules and no access to host secrets. (5) Monitoring — structured logs of every tool invocation with parameters, anomaly detection for unusual patterns like accessing credentials or sending data to external URLs.<br><br>Source: ethics-and-safety/mcp-security.md<br>Tags: E, adversarial, agentic-security, ethics-and-safety, genai-safety, mcp, security, supply-chain, tool-poisoning
How does MCP security relate to OWASP LLM06 (Excessive Agency)?	LLM06 warns about giving AI systems too much capability without guardrails. MCP directly manifests this — each tool grants new capabilities to the agent. The risk compounds: an agent with a file-reading tool, a network tool, and an email tool has the attack surface of all three combined. Mitigation: treat each tool as a separate capability grant, enforce least privilege per tool (not per server), and require explicit user approval for sensitive operations.<br><br>Source: ethics-and-safety/mcp-security.md<br>Tags: E, adversarial, agentic-security, ethics-and-safety, genai-safety, mcp, security, supply-chain, tool-poisoning
What is Mixture of Experts and why does LLaMA 4 use it?	MoE has multiple "expert" FFN sub-networks per layer with a learned router. For each token, only top-K experts (e.g., 2 of 16) are activated. This gives the model capacity of the total parameters but computational cost of only the active experts. LLaMA 4 uses it to achieve 400B total params with only 17B active — massive capacity at manageable cost.<br><br>Source: foundations/modern-architectures.md<br>Tags: Foundation, architecture, flash-attention, foundations, genai, gqa, mixture-of-experts, moe, rope
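A simplified top-K routing sketch, assuming a learned linear router and single-linear "experts" as stand-ins for gated FFNs; real MoE layers also add load-balancing losses and capacity limits:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, n_experts, top_k = 64, 16, 2
router = nn.Linear(d, n_experts)                                       # learned router
experts = nn.ModuleList([nn.Linear(d, d) for _ in range(n_experts)])   # toy experts (real ones are FFNs)

def moe_layer(x):                                    # x: (tokens, d)
    gate = F.softmax(router(x), dim=-1)              # routing probabilities per token
    weights, idx = gate.topk(top_k, dim=-1)          # keep only the top-K experts per token
    rows = []
    for t in range(x.size(0)):                       # plain loop for clarity; real kernels batch this
        rows.append(sum(w * experts[int(e)](x[t]) for w, e in zip(weights[t], idx[t])))
    return torch.stack(rows)                         # only 2 of the 16 experts ran for each token

print(moe_layer(torch.randn(5, d)).shape)            # torch.Size([5, 64])
```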
What is GQA and how does it save memory?	Grouped-Query Attention shares K and V heads across groups of Q heads. With 64 Q heads and 8 KV heads, the KV cache is 8x smaller than full MHA. This is critical for serving long-context models — KV cache can otherwise consume more memory than the model weights.<br><br>Source: foundations/modern-architectures.md<br>Tags: Foundation, architecture, flash-attention, foundations, genai, gqa, mixture-of-experts, moe, rope
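Back-of-envelope arithmetic behind that claim, using roughly 70B-class shapes in FP16 (illustrative numbers, not a specific model's exact config):

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes_per_value
def kv_cache_gb(kv_heads, layers=80, head_dim=128, seq_len=32_000, bytes_per=2):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per / 1e9

print(kv_cache_gb(kv_heads=64))  # full MHA: ~84 GB for one 32K-token sequence
print(kv_cache_gb(kv_heads=8))   # GQA with 8 KV heads: ~10.5 GB, 8x smaller
```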
What's the difference between BERT and GPT?	BERT is an ENCODER that sees all tokens bidirectionally (optimized for understanding — classification, NER, embeddings). GPT is a DECODER that sees only past tokens (optimized for generation — text, code, chat). Both use Transformers, but BERT predicts masked tokens while GPT predicts the next token.<br><br>Source: prerequisites/nlp-fundamentals.md<br>Tags: Foundation, bert, genai-prerequisite, natural-language-processing, ner, nlp, prerequisites, sentiment, text-classification
Has GenAI made traditional NLP obsolete?	Mostly, yes. LLMs handle most NLP tasks via prompting, often better than task-specific models. However, traditional and smaller models survive for: (1) embeddings (BGE, E5, which are BERT-family encoders), (2) sub-100ms classification at scale, (3) BM25/TF-IDF for the keyword leg of retrieval in RAG. The field has consolidated around "one model, many tasks."<br><br>Source: prerequisites/nlp-fundamentals.md<br>Tags: Foundation, bert, genai-prerequisite, natural-language-processing, ner, nlp, prerequisites, sentiment, text-classification
What does backpropagation actually compute?	The gradient of the loss function with respect to every weight in the network, using the chain rule of calculus. These gradients tell us how to adjust each weight to reduce the error.<br><br>Source: prerequisites/neural-networks.md<br>Tags: Foundation, activation, backpropagation, cnn, genai-prerequisite, neural-networks, perceptron, prerequisites, rnn
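A tiny PyTorch autograd check of exactly that quantity, the gradient of the loss with respect to a weight, via the chain rule:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)
loss = (w * x - 1.0) ** 2         # loss = (wx - 1)^2
loss.backward()                   # chain rule: dloss/dw = 2 * (wx - 1) * x = 2 * 5 * 3
print(w.grad)                     # tensor(30.)
```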
Why do we need activation functions?	Without non-linear activation, any number of layers collapses to a single linear transformation (y = Wx + b). Non-linearity lets the network approximate any function, not just lines/planes.<br><br>Source: prerequisites/neural-networks.md<br>Tags: Foundation, activation, backpropagation, cnn, genai-prerequisite, neural-networks, perceptron, prerequisites, rnn
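A quick NumPy check of the "collapses to a single linear transformation" claim: two stacked linear layers with no activation equal one linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((8, 4)), rng.standard_normal(8)
W2, b2 = rng.standard_normal((3, 8)), rng.standard_normal(3)
x = rng.standard_normal(4)

two_layers = W2 @ (W1 @ x + b1) + b2          # "deep" network without non-linearity
W, b = W2 @ W1, W2 @ b1 + b2                  # the single equivalent linear layer
print(np.allclose(two_layers, W @ x + b))     # True: the extra layer bought nothing
```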
What changed in the OWASP LLM Top 10 from 2023 to 2025, and why does it matter?	The 2025 edition introduced two new categories — LLM07 (System Prompt Leakage) and LLM08 (Vector and Embedding Weaknesses) — while removing "Insecure Plugin Design" and "Model Theft" as standalone categories (both were absorbed into adjacent risks). The additions reflect real production patterns: most GenAI systems now use system prompts to define behavior (making prompt leakage a genuine attack surface) and RAG pipelines (making vector store poisoning a primary threat). The 2023 list was designed for isolated API-based LLM calls; the 2025 list assumes a more complex agentic architecture.<br><br>Source: ethics-and-safety/owasp-llm-top-10.md<br>Tags: E, ethics-and-safety, genai-security, llm-top-10, owasp, prompt-injection, risks, security, vector-security
Which OWASP 2025 categories matter most for an agentic RAG system?	Four categories are critical. LLM01 (Prompt Injection) — because agents execute tool calls based on model outputs, a single injected instruction can trigger real-world actions. LLM06 (Excessive Agency) — agents need tight permission scoping and irreversible-action gates. LLM07 (System Prompt Leakage) — agent system prompts typically contain tool schemas, personas, and routing logic that could be exploited. LLM08 (Vector and Embedding Weaknesses) — the RAG corpus is the most exploitable ingestion surface for an agentic system that retrieves before acting.<br><br>Source: ethics-and-safety/owasp-llm-top-10.md<br>Tags: E, ethics-and-safety, genai-security, llm-top-10, owasp, prompt-injection, risks, security, vector-security
How do you use the OWASP list in a real security review?	Use it as a structured checklist per system component, not per-category. For each component (LLM API, RAG pipeline, agent loop, output surface, data ingestion), identify which categories apply, then ask the specific design question for each: "Can user input reach the system prompt?" (LLM01/LLM07), "Does this component retrieve from an unvalidated source?" (LLM08), "Can model output execute or render unsafely?" (LLM05). Combine with threat modeling for domain-specific risks not covered by OWASP.<br><br>Source: ethics-and-safety/owasp-llm-top-10.md<br>Tags: E, ethics-and-safety, genai-security, llm-top-10, owasp, prompt-injection, risks, security, vector-security
What is temperature in LLM generation?	Temperature scales the logits before softmax. Low temperature (→0) makes the distribution sharper (confident picks), high temperature makes it flatter (random picks). Mathematically: P = softmax(logits / T).<br><br>Source: prerequisites/probability-and-statistics.md<br>Tags: Foundation, bayes, distributions, genai-prerequisite, loss-functions, prerequisites, probability, sampling, statistics
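The formula in code, on made-up logits:

```python
import numpy as np

def sample_probs(logits, T):
    z = np.array(logits) / T               # temperature scales the logits before softmax
    z = z - z.max()                        # numerical stability
    e = np.exp(z)
    return (e / e.sum()).round(3)

logits = [2.0, 1.0, 0.2]
print(sample_probs(logits, T=0.2))   # sharp, near-deterministic: [0.993 0.007 0.   ]
print(sample_probs(logits, T=1.0))   # the raw softmax
print(sample_probs(logits, T=2.0))   # flatter, more random
```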
What loss function do LLMs use and why?	Cross-entropy loss. It measures how different the model's predicted probability distribution is from the true distribution (where the correct next token has probability 1). Minimizing cross-entropy pushes the model to assign high probability to the correct token.<br><br>Source: prerequisites/probability-and-statistics.md<br>Tags: Foundation, bayes, distributions, genai-prerequisite, loss-functions, prerequisites, probability, sampling, statistics
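What that loss looks like for a single next-token prediction on a toy 4-token vocabulary:

```python
import numpy as np

probs = np.array([0.1, 0.2, 0.6, 0.1])   # model's predicted distribution over the vocab
true_token = 2                            # the actual next token

loss = -np.log(probs[true_token])         # one-hot target: cross-entropy reduces to -log p(correct token)
print(loss)                               # ~0.51; it would be 0 if the model put probability 1 on token 2
```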
What's the difference between zero-shot, few-shot, and chain-of-thought prompting?	Zero-shot: just instructions, no examples. Few-shot: include examples of desired input→output pairs. CoT: ask model to show reasoning steps. Each adds more guidance and typically improves quality.<br><br>Source: techniques/prompt-engineering.md<br>Tags: Foundation, chain-of-thought, few-shot, genai-techniques, prompt-engineering, prompting, techniques
How would you handle prompt injection in a production system?	Input sanitization, separate system/user prompts, output validation, don't include raw user input in system prompts. Use the model's built-in system prompt separation. For critical apps, add a second LLM call to verify the first output makes sense.<br><br>Source: techniques/prompt-engineering.md<br>Tags: Foundation, chain-of-thought, few-shot, genai-techniques, prompt-engineering, prompting, techniques
What is prompt injection and how would you defend against it?	Prompt injection is when user input overrides system instructions — like SQL injection but for LLMs. I'd defend with 4 layers: (1) Input scanning with regex + classifier to catch obvious attacks. (2) Prompt architecture — use clear delimiters (XML tags) to separate instructions from untrusted user data. (3) Output validation — check that responses don't leak system prompts or follow injected instructions. (4) Architectural controls — least-privilege tool access, human-in-the-loop for sensitive actions, and separate LLM instances for different trust levels. The critical insight is that no single defense is sufficient — defense-in-depth is the only viable strategy.<br><br>Source: ethics-and-safety/prompt-injection-deep-dive.md<br>Tags: E, defense, ethics-and-safety, jailbreak, production, prompt-injection, red-teaming, security
Why is Python dominant in AI if it is slower than C++?	Python gives fast iteration and a huge ecosystem, while the expensive numerical work runs underneath in optimized C, C++, CUDA, or vendor kernels. Python is the control layer, not the performance bottleneck.<br><br>Source: prerequisites/python-for-ai.md<br>Tags: Foundation, environment, genai-prerequisite, numpy, prerequisites, python, pytorch, transformers
What Python tools matter most for GenAI work?	NumPy for array thinking, PyTorch for tensors and training, Hugging Face libraries for models and tokenizers, plus environment management so your CUDA and package versions stay reproducible.<br><br>Source: prerequisites/python-for-ai.md<br>Tags: Foundation, environment, genai-prerequisite, numpy, prerequisites, python, pytorch, transformers
What is the most common beginner mistake when starting AI Python work?	Treating the environment as an afterthought. Many early failures come from incompatible package versions, wrong CUDA installs, or tensors ending up on different devices.<br><br>Source: prerequisites/python-for-ai.md<br>Tags: Foundation, environment, genai-prerequisite, numpy, prerequisites, python, pytorch, transformers
How would you improve a RAG pipeline that's giving wrong answers?	Debug in order: (1) Check if correct chunks are retrieved (retrieval eval), (2) If not, fix chunking strategy or embedding model, (3) If chunks are good but answer is wrong, fix the prompt or use a better LLM. Also consider adding re-ranking.<br><br>Source: techniques/rag.md<br>Tags: Foundation, embeddings, genai-techniques, rag, retrieval, techniques, vector-db
When would you choose RAG over fine-tuning?	RAG when: need up-to-date info, knowledge changes frequently, need source attribution. Fine-tuning when: need different output style/format, domain-specific reasoning, or model behavior changes. Best: combine both, with fine-tuning for behavior/style and RAG for knowledge.<br><br>Source: techniques/rag.md<br>Tags: Foundation, embeddings, genai-techniques, rag, retrieval, techniques, vector-db
Explain the difference between semantic and keyword search in RAG.	Semantic (vector) search finds conceptually similar content even with different words ("car" matches "automobile"). Keyword (BM25) search finds exact term matches. Hybrid combines both — best overall because semantic misses exact terms and BM25 misses synonyms.<br><br>Source: techniques/rag.md<br>Tags: Foundation, embeddings, genai-techniques, rag, retrieval, techniques, vector-db
Explain the Chinchilla scaling laws.	For a fixed compute budget, there's an optimal ratio of model size to training data. Chinchilla showed the optimal is ~20 tokens per parameter. GPT-3 (175B params, 300B tokens) was massively undertrained — a 70B model on 1.4T tokens would match it. This led to LLaMA's approach: smaller models, much more data. In 2025-2026, industry "over-trains" beyond Chinchilla-optimal because inference cost (running the model) matters more than training cost (one-time).<br><br>Source: foundations/scaling-laws-and-pretraining.md<br>Tags: D, Foundation, chinchilla, compute, data-mix, foundations, genai, pre-training, scaling-laws, training
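The 20-tokens-per-parameter rule as arithmetic, using the numbers quoted above:

```python
def chinchilla_optimal_tokens(params):
    return 20 * params                              # ~20 training tokens per parameter

print(chinchilla_optimal_tokens(175e9) / 1e12)      # GPT-3: ~3.5T tokens would be optimal; it saw ~0.3T
print(chinchilla_optimal_tokens(70e9) / 1e12)       # 70B model: ~1.4T tokens, the Chinchilla recipe
```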
How is a large language model pre-trained?	(1) Collect trillions of tokens from internet, books, code. (2) Clean and deduplicate aggressively. (3) Train a BPE tokenizer. (4) Set data mix ratios (web, code, books, math). (5) Train with next-token prediction on 10K-100K GPUs for 2-6 months, using distributed parallelism (data, tensor, pipeline). (6) Monitor loss curves, handle spikes, checkpoint regularly. Cost: $10M-$500M+ per run.<br><br>Source: foundations/scaling-laws-and-pretraining.md<br>Tags: D, Foundation, chinchilla, compute, data-mix, foundations, genai, pre-training, scaling-laws, training
Why do LLMs use sub-word tokenization instead of word-level?	Word-level requires an impossibly large vocabulary (every word in every language) and can't handle misspellings, new words, or code. Sub-word splits rare words into common pieces ("unhappiness" → ["un", "happiness"]) while keeping frequent words whole. Fixed vocab size (~32K-128K), handles any input.<br><br>Source: foundations/tokenization.md<br>Tags: Foundation, bpe, foundations, genai-foundations, llm-internals, sentencepiece, tokenization, wordpiece
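A quick check with OpenAI's tiktoken library (any BPE tokenizer works; the exact splits depend on the vocabulary, so treat the output as illustrative):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for word in ["the", "unhappiness", "floccinaucinihilipilification"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(word, "->", pieces)   # frequent words stay whole; rare words split into sub-word pieces
```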
Why is tokenization a source of bias?	Languages with less representation in training data get worse tokenization — more tokens per word. This means non-English users spend more money, get slower responses, and use more of their context window for the same content. Larger vocabularies (LLaMA 3's 128K vs LLaMA 2's 32K) help mitigate this.<br><br>Source: foundations/tokenization.md<br>Tags: Foundation, bpe, foundations, genai-foundations, llm-internals, sentencepiece, tokenization, wordpiece
Why do Transformers use scaled dot-product attention (divide by √d_k)?	Without scaling, dot products grow large with high dimensions, pushing softmax into regions with tiny gradients. Dividing by √d_k keeps gradients healthy.<br><br>Source: foundations/transformers.md<br>Tags: Foundation, architecture, attention, deep-learning, foundations, genai-foundations, transformers
What's the computational complexity of self-attention?	O(n²·d) where n is sequence length and d is dimension. This quadratic scaling with n is the main bottleneck for long sequences.<br><br>Source: foundations/transformers.md<br>Tags: Foundation, architecture, attention, deep-learning, foundations, genai-foundations, transformers
Why decoder-only for generation instead of encoder-decoder?	Simpler architecture, easier to scale, and with enough data the decoder learns to "encode" implicitly. Also, causal masking naturally fits left-to-right generation.<br><br>Source: foundations/transformers.md<br>Tags: Foundation, architecture, attention, deep-learning, foundations, genai-foundations, transformers
