What makes an AI agent different from a chatbot?	A chatbot responds to messages. An agent sets goals, plans multi-step approaches, uses tools, observes results, and iterates. Agents are autonomous; chatbots are reactive.<br><br>Source: agents/ai-agents.md<br>Tags: Foundation, agentic-ai, agents, autonomy, function-calling, genai-techniques, tool-use
How would you prevent an AI agent from getting stuck in a loop?	Max iteration limits, self-reflection prompts ("Am I making progress?"), fallback to human, diverse retry strategies (try different tools/approaches), and logging for debugging.<br><br>Source: agents/ai-agents.md<br>Tags: Foundation, agentic-ai, agents, autonomy, function-calling, genai-techniques, tool-use
What's the ReAct pattern?	Reason + Act. The agent alternates between thinking (reasoning about what to do) and acting (calling tools). After each action, it observes the result and reasons about next steps. This interleaving of thought and action is more reliable than planning everything upfront.<br><br>Source: agents/ai-agents.md<br>Tags: Foundation, agentic-ai, agents, autonomy, function-calling, genai-techniques, tool-use
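A minimal sketch of the loop in Python, assuming a hypothetical text-in/text-out `llm` callable, a `tools` dict of plain functions, and an illustrative "Action: name[arg]" output format (not any specific framework's API):

```python
def parse_action(step: str) -> tuple[str, str]:
    # Expects a line like "Action: search[how tall is K2]" in the model output.
    line = next(l for l in step.splitlines() if l.startswith("Action:"))
    name, _, rest = line.removeprefix("Action:").strip().partition("[")
    return name, rest.rstrip("]")

def react_agent(task: str, llm, tools: dict, max_steps: int = 10) -> str:
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        # Reason: ask the model for a thought plus the next action.
        step = llm("\n".join(history) + "\nThought and Action (or FINAL: answer):")
        if "FINAL:" in step:
            return step.split("FINAL:", 1)[1].strip()
        name, arg = parse_action(step)
        observation = tools[name](arg)  # Act: run the chosen tool
        history += [step, f"Observation: {observation}"]  # Observe, then loop
    return "Stopped: max steps reached"
```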
How does agent memory work?	Four types: short-term (conversation context), long-term (vector DB storing facts/preferences across sessions), episodic (summaries of past task executions for learning), and procedural (learned strategies and tool patterns). In practice, most production agents use short-term + simple long-term memory with vector retrieval.<br><br>Source: agents/ai-agents.md<br>Tags: Foundation, agentic-ai, agents, autonomy, function-calling, genai-techniques, tool-use
How do modern coding agents handle large codebases that don't fit in context?	Three techniques. (1) Repository indexing — parse ASTs and dependency graphs to understand code structure without reading every file. (2) Progressive context loading — only pull in files relevant to the current step, not the entire repo. (3) Context compaction — periodically summarize the conversation history to free up tokens. The best agents combine all three: index the repo upfront, retrieve relevant files via codebase RAG, and compact history when approaching the context limit.<br><br>Source: applications/ai-coding-agents.md<br>Tags: A, agentic-ai, ai-product, applications, codex, coding-agents, cursor, developer-tools, devin, windsurf
What's the most common failure mode of coding agents and how do you mitigate it?	Infinite edit loops — the agent encounters an error, makes a change that doesn't fix it, sees the same error, and repeats. Mitigation: (1) Track state diffs between iterations — if the agent's edit doesn't change the test output, intervene. (2) Set hard max iteration limits (typically 10-20 steps). (3) Have the agent explicitly explain its hypothesis before each edit so you can catch circular reasoning.<br><br>Source: applications/ai-coding-agents.md<br>Tags: A, agentic-ai, ai-product, applications, codex, coding-agents, cursor, developer-tools, devin, windsurf
When would you choose a cloud sandbox agent vs an IDE-integrated agent?	Cloud sandbox (like Devin) for tasks that are well-defined, can run unattended, and benefit from isolation — ticket-based bug fixes, migrations, boilerplate generation. IDE-integrated (like Cursor) for tasks requiring rapid human feedback — feature development, debugging, and any work where you need to steer the agent in real-time. The tradeoff is autonomy vs control.<br><br>Source: applications/ai-coding-agents.md<br>Tags: A, agentic-ai, ai-product, applications, codex, coding-agents, cursor, developer-tools, devin, windsurf
How do you decide whether an AI feature is worth building?	I start with the user workflow and measurable outcome, then test whether AI materially improves that workflow at an acceptable quality, trust, and cost level. If it does not, I narrow the scope or avoid the feature.<br><br>Source: applications/ai-product-management-fundamentals.md<br>Tags: A, ai-product-management, applications, evaluation, product, strategy
What is the most important metric for an AI product?	There is rarely one metric. I want a small stack that includes task success, user trust or escalation, latency, and cost per successful task.<br><br>Source: applications/ai-product-management-fundamentals.md<br>Tags: A, ai-product-management, applications, evaluation, product, strategy
What should an AI engineer do when working on a potentially regulated use case?	Classify the use case early against EU AI Act Annex III categories, document system purpose and limitations, build evaluation and monitoring into the workflow, and pull in legal or compliance partners before launch. An engineer who flags a high-risk classification before code is written saves far more time than one who raises it post-launch.<br><br>Source: ethics-and-safety/ai-regulation.md<br>Tags: E, compliance, digital-omnibus, ethics-and-safety, eu-ai-act, governance, nist, regulation, responsible-ai
Why does the NIST AI RMF matter if it is voluntary?	It provides an operational structure for governance and risk management that maps directly to engineering workflows (Govern → Map → Measure → Manage). Many enterprises use it to organize trustworthy-AI programs, and US federal procurement increasingly references it. It's also a good baseline before your jurisdiction mandates something more specific.<br><br>Source: ethics-and-safety/ai-regulation.md<br>Tags: E, compliance, digital-omnibus, ethics-and-safety, eu-ai-act, governance, nist, regulation, responsible-ai
What is the significance of August 2, 2026 for engineering teams?	It's the date the EU AI Act becomes fully applicable for Annex III high-risk AI systems. After this date, deploying a non-compliant high-risk AI system in the EU exposes the deployer to fines of up to €15M or 3% of global annual turnover (the higher €35M / 7% tier is reserved for prohibited practices). Engineering teams should treat this as a hard product deadline: conformity assessments, human oversight mechanisms, logging (Art. 12), and testing records (Art. 9) must all be in place.<br><br>Source: ethics-and-safety/ai-regulation.md<br>Tags: E, compliance, digital-omnibus, ethics-and-safety, eu-ai-act, governance, nist, regulation, responsible-ai
When would you choose RAG over fine-tuning?	When the knowledge changes often, needs citations, or comes from private documents. Fine-tuning is better when the behavior itself must change consistently.<br><br>Source: production/ai-system-design.md<br>Tags: A, ai-architecture, genai, llmops, production, system-design
What are the minimum production components for a GenAI assistant?	Auth, prompt assembly, model invocation, safety checks, observability, and evaluation. If the task depends on facts outside the model, add retrieval.<br><br>Source: production/ai-system-design.md<br>Tags: A, ai-architecture, genai, llmops, production, system-design
How would you design the UX for an AI research assistant?	Three core principles. Speed: stream responses token-by-token with a skeleton loading state. Trust: every claim gets an inline citation with a link to the source document — clicking opens the relevant passage highlighted. Control: users can regenerate, edit the response, or thumbs-down with a reason. I'd add progressive disclosure — a TL;DR summary with expandable details underneath. For uncertainty, I'd use a confidence indicator and have the AI explicitly say "I'm not sure about this" rather than hallucinating confidently.<br><br>Source: applications/ai-ux-patterns.md<br>Tags: B, ai-product, applications, design, feedback, streaming, trust, ui, ux
How do you handle the trust problem with AI-generated content?	Trust is built through transparency and verifiability. Three patterns: (1) Citation cards — every factual claim links to its source; users can verify. (2) Explicit uncertainty — "I'm not confident about this" is better than false confidence. (3) Graceful correction — make it trivially easy to edit, regenerate, or flag wrong answers. The key insight: users don't need AI to be perfect, they need to know *when* to trust it and when to double-check.<br><br>Source: applications/ai-ux-patterns.md<br>Tags: B, ai-product, applications, design, feedback, streaming, trust, ui, ux
Streaming responses seem simple — what are the hard engineering tradeoffs?	Three non-obvious challenges. (1) Partial markdown — streaming mid-table or mid-code-block means your frontend must handle incomplete syntax gracefully without breaking the layout. (2) Cancellation — users abort early; you need to cleanly close SSE connections and stop generation to avoid wasted cost. (3) Error recovery — if the stream breaks after 50 tokens, resume or restart gracefully rather than leaving a half-rendered response. At scale: batch DOM updates at ~50ms intervals to avoid 100+ React re-renders per second, and cache common prompt prefixes server-side.<br><br>Source: applications/ai-ux-patterns.md<br>Tags: B, ai-product, applications, design, feedback, streaming, trust, ui, ux
What should an AI API return besides the answer?	Usually a request id, status or finish reason, and optionally citations or usage metadata depending on the product. Those fields make debugging, billing, and trust much easier.<br><br>Source: applications/api-design-for-ai.md<br>Tags: A, ai-architecture, api, applications, async, rest, streaming, webhooks
When would you choose an async job API?	When the workflow is too long or variable for an interactive request, such as large document pipelines, multi-step agent tasks, or offline generation jobs.<br><br>Source: applications/api-design-for-ai.md<br>Tags: A, ai-architecture, api, applications, async, rest, streaming, webhooks
Why has DPO become more popular than PPO for alignment?	DPO reformulates the RLHF objective so that the optimal policy can be extracted directly from preference pairs, without needing a separate reward model or the unstable PPO training loop. This makes it dramatically simpler to implement — you just need ranked pairs of "chosen" and "rejected" responses, a reference model, and a standard classification-like loss. PPO requires training a reward model, running policy rollouts, computing advantages, and maintaining a value function — all of which introduce instability and hyperparameter sensitivity. In practice, DPO achieves comparable alignment quality to PPO with 3-5× less infrastructure complexity. The tradeoff is that DPO is an offline method (it uses a fixed dataset), while PPO can potentially explore and self-improve through online generation. This is where GRPO bridges the gap — it gets PPO-like self-improvement with DPO-like simplicity.<br><br>Source: techniques/advanced-fine-tuning.md<br>Tags: B, D, dpo, fine-tuning, grpo, llm-training, techniques, trl, unsloth
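The objective itself is compact. A minimal PyTorch sketch of the DPO loss, assuming the per-sequence log-probabilities under the policy and the frozen reference model are computed elsewhere:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    # Log-ratios measure how far the policy has moved from the reference.
    chosen = policy_chosen_logps - ref_chosen_logps
    rejected = policy_rejected_logps - ref_rejected_logps
    # Classification-like loss: widen the chosen-vs-rejected margin, scaled by beta.
    return -F.logsigmoid(beta * (chosen - rejected)).mean()
```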
When would you choose GRPO over DPO?	GRPO shines in two scenarios. First, when you have a verifiable reward function — like math problems (answer is correct or not), code generation (tests pass or fail), or structured output (valid JSON or not). DPO needs someone to label which response is "better," but GRPO can generate its own training signal. Second, when you want the model to explore and find better solutions than what's in your training data. DPO is limited to the quality of your preference pairs — the model can only learn to prefer responses already in the dataset. GRPO generates new candidates and improves on them, enabling genuine self-improvement. DeepSeek used GRPO for training reasoning models (DeepSeek-R1) specifically because math reasoning benefits from this verifiable-reward, generate-and-rank approach.<br><br>Source: techniques/advanced-fine-tuning.md<br>Tags: B, D, dpo, fine-tuning, grpo, llm-training, techniques, trl, unsloth
How do you prevent capability regression during fine-tuning?	Three practical strategies. First, always maintain a regression test suite that covers core capabilities you care about preserving — general knowledge, instruction following, safety, and language quality. Run this after every training run, not just the final one. Second, keep training short (1-2 epochs for DPO) and use a higher β value (0.2-0.3) to keep the model closer to the reference. Third, use LoRA/QLoRA rather than full fine-tuning — by only modifying a small number of parameters, you inherently limit how much the model can drift from its base capabilities. If you detect regression, you can blend LoRA weights at inference time to find the optimal balance between new capability and preserved performance.<br><br>Source: techniques/advanced-fine-tuning.md<br>Tags: B, D, dpo, fine-tuning, grpo, llm-training, techniques, trl, unsloth
Why is prompt injection a security problem and not only a quality problem?	Because malicious instructions can manipulate system behavior, trigger data leakage, or cause unauthorized actions through tools and downstream systems. That makes it part of the application's security surface.<br><br>Source: ethics-and-safety/adversarial-ml-and-ai-security.md<br>Tags: E, adversarial-ml, ethics-and-safety, jailbreaks, prompt-injection, red-teaming, security
What is the first rule for AI security in agent systems?	Minimize and constrain what the agent can do. Least privilege, validation, and human approval for sensitive actions matter more than clever prompting alone.<br><br>Source: ethics-and-safety/adversarial-ml-and-ai-security.md<br>Tags: E, adversarial-ml, ethics-and-safety, jailbreaks, prompt-injection, red-teaming, security
How would you evaluate an agent beyond task success?	I would score the trajectory: tool selection, tool arguments, retry behavior, safety, latency, cost, and whether the final answer actually used the evidence produced during execution. I'd build a composite score weighing task completion (40%), tool precision (30%), and cost efficiency (30%), and track regression across prompt/model changes.<br><br>Source: agents/agent-evaluation.md<br>Tags: B, agent-eval, agents, evaluation, genai-techniques, observability, tracing
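A sketch of that composite score with the weights above, assuming each component is already normalized to [0, 1] upstream:

```python
def agent_score(task_success: float, tool_precision: float,
                cost_efficiency: float) -> float:
    # Weights from the answer: 40% completion, 30% tool precision, 30% cost.
    return 0.4 * task_success + 0.3 * tool_precision + 0.3 * cost_efficiency
```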
Why is observability mandatory for agents?	Because agent failures are sequential. Without traces, you only see the bad final answer, not the exact step where the planner, tool call, or verifier went wrong. A refund agent that loops 8 times and then gives the right answer looks "correct" in a success-only metric but costs 4x and takes 30 seconds.<br><br>Source: agents/agent-evaluation.md<br>Tags: B, agent-eval, agents, evaluation, genai-techniques, observability, tracing
How do you handle non-determinism in agent evaluation?	Three approaches: (1) Pin temperature=0 and use seed parameters for reproducible runs, (2) Run each eval task 3-5 times and report median + variance, (3) Use majority-vote scoring where a task "passes" only if 3/5 runs succeed. For production monitoring, track distributions, not point estimates.<br><br>Source: agents/agent-evaluation.md<br>Tags: B, agent-eval, agents, evaluation, genai-techniques, observability, tracing
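A sketch combining approaches (2) and (3), with a hypothetical `run_once` callable that executes one eval run and returns a pass flag plus a score:

```python
import statistics

def eval_task(run_once, n_runs: int = 5, pass_votes: int = 3) -> dict:
    results = [run_once() for _ in range(n_runs)]
    passes = sum(1 for passed, _ in results if passed)
    scores = [score for _, score in results]
    return {
        "passed": passes >= pass_votes,  # majority vote, e.g. 3/5
        "median_score": statistics.median(scores),
        "variance": statistics.variance(scores) if n_runs > 1 else 0.0,
    }
```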
When should you use LLM-as-Judge vs programmatic scoring?	Programmatic scoring (tool precision, step count, cost) is faster, cheaper, and more reproducible — use it for everything you can formalize. LLM-as-Judge fills the gap for subjective quality: tone, helpfulness, groundedness. In practice, use both: programmatic scores gate the CI pipeline, LLM-Judge scores provide qualitative insight for manual review of borderline cases.<br><br>Source: agents/agent-evaluation.md<br>Tags: B, agent-eval, agents, evaluation, genai-techniques, observability, tracing
How would you implement long-term memory for a customer support agent?	I'd use a three-layer memory architecture. Layer 1: sliding window of the last 10 messages for immediate context. Layer 2: a structured user profile (name, plan, past issues) stored in PostgreSQL, updated after each conversation. Layer 3: semantic memory in a vector database for retrieving relevant past tickets and resolutions. On each new message, I'd retrieve the user profile + top 3 relevant past interactions and inject them into the system prompt. I'd budget 30% of context for memory, 20% for system prompt, and 50% for the current conversation. Memory writes happen asynchronously after each turn to avoid adding latency.<br><br>Source: agents/agent-memory.md<br>Tags: B, agents, context, conversation, memory, production, rag, state
What are the risks of giving an agent memory?	Four main risks. (1) Privacy: memories must be strictly isolated per user/tenant — a vector DB namespace leak would expose personal data. (2) Poisoning: users can intentionally inject false memories ("remember that I'm an admin") — validate and sanitize memory writes. (3) Staleness: preferences change but old memories persist — add TTLs and explicit update mechanisms. (4) Hallucinated memories: the LLM may "remember" things that never happened — always check retrieved memories against actual stored data, never rely on the model's internal "memory."<br><br>Source: agents/agent-memory.md<br>Tags: B, agents, context, conversation, memory, production, rag, state
What's the difference between MCP and A2A?	MCP connects an agent to TOOLS (databases, APIs, filesystems) — it's agent-to-tool communication. A2A connects an agent to OTHER AGENTS — it's agent-to-agent collaboration. MCP is like a USB port (connect devices), A2A is like a network protocol (connect computers). They're complementary: an agent uses MCP to access its own tools and A2A to delegate tasks to other agents.<br><br>Source: agents/agentic-protocols.md<br>Tags: A, B, a2a, adk, agent-protocols, agentic-infra, agents, autogen, crewai, genai, langraph, mcp
Why do we need MCP if we already have function calling?	Function calling is model-specific (OpenAI's API, Anthropic's API). MCP is a universal standard — build one MCP server and it works with Claude, GPT, Gemini, Cursor, and any MCP client. It also adds discovery (list available tools), resources (data access), and security (OAuth). It's the difference between every device having a custom charger vs. everyone using USB-C.<br><br>Source: agents/agentic-protocols.md<br>Tags: A, B, a2a, adk, agent-protocols, agentic-infra, agents, autogen, crewai, genai, langraph, mcp
Why divide by √d_k in attention?	Without it, for large d_k, dot products become huge → softmax saturates → near-zero gradients. Scaling keeps variance at ~1.<br><br>Source: foundations/attention-mechanism.md<br>Tags: Foundation, attention, foundations, genai-foundations, multi-head, self-attention, transformers
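A quick PyTorch demonstration: with unit-variance inputs, the raw dot product has variance ≈ d_k, and dividing by √d_k restores variance ≈ 1:

```python
import torch

def scaled_dot_product_attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # the sqrt(d_k) scaling
    return torch.softmax(scores, dim=-1) @ v

q, k = torch.randn(1000, 512), torch.randn(1000, 512)
print((q @ k.T).var())                 # ~512: softmax would saturate
print(((q @ k.T) / 512 ** 0.5).var())  # ~1: healthy gradients
```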
What's the difference between MHA, MQA, and GQA?	MHA: separate K,V per head (most expressive, slowest). MQA: one shared K,V (fastest, some quality loss). GQA: groups of heads share K,V (good balance). LLaMA 2+ uses GQA.<br><br>Source: foundations/attention-mechanism.md<br>Tags: Foundation, attention, foundations, genai-foundations, multi-head, self-attention, transformers
How does Flash Attention improve efficiency without changing the math?	It tiles the computation to fit in SRAM (fast cache), avoiding materialization of the full n×n attention matrix in slow HBM (GPU memory). Same result, ~2-4x faster.<br><br>Source: foundations/attention-mechanism.md<br>Tags: Foundation, attention, foundations, genai-foundations, multi-head, self-attention, transformers
What is Multi-Query Attention and why does it matter?	In standard Multi-Head Attention, each attention head has its own K and V projections — meaning the KV-cache scales linearly with the number of heads. Multi-Query Attention shares a single K, V pair across all query heads. This reduces KV-cache by the number of heads (e.g., 64× for LLaMA with 64 heads), enabling much longer context windows and higher batch sizes during inference. Grouped-Query Attention (GQA) is the practical middle ground — using 8 KV groups instead of 64 or 1 — giving most of the memory savings with minimal quality loss. This is what LLaMA 3 uses.<br><br>Source: foundations/attention-deep-dive.md<br>Tags: D, attention, flash-attention, foundations, gqa, kv-cache, mha, mqa, transformers
What makes CI/CD for LLM systems different from regular CI/CD?	The output behavior is probabilistic and influenced by prompts, models, and datasets, so the pipeline needs evaluation gates, cost checks, and rollout safety beyond normal software tests.<br><br>Source: production/cicd-for-ml.md<br>Tags: C, automation, cicd, deployment, llmops, mlops, production, testing
What would you gate before shipping a model change?	Quality against a regression set, safety checks, latency, token cost, and rollback readiness. If the change affects retrieval or agents, I would also inspect representative traces.<br><br>Source: production/cicd-for-ml.md<br>Tags: C, automation, cicd, deployment, llmops, mlops, production, testing
Where does classical ML fit in a GenAI system?	Classical ML handles the narrow, structured, cost-sensitive decisions around the LLM core. The most common patterns are: (1) request routing — classifying which model tier handles each request, saving 80%+ on API costs, (2) reranking — scoring and sorting retrieved documents faster than LLM-based reranking, (3) quality gates — fast pass/fail classifiers on LLM output before returning to users, (4) anomaly detection — flagging unusual requests or outputs for human review. The key insight is that production GenAI systems are hybrids: the LLM handles generation and reasoning, while classical models handle the structured decisions that need to be fast, cheap, and deterministic.<br><br>Source: production/classical-ml-for-genai.md<br>Tags: C, classical-ml, production, ranking, routing, sklearn, xgboost
How would you design a model routing system?	I'd start by defining 3 tiers: cached responses (free), small model (cheap), and large model (expensive). Feature engineering would extract request complexity indicators: length, question count, topic embedding, tool requirements, and context size. I'd train a logistic regression initially (simple, interpretable, fast to iterate) and upgrade to XGBoost once I have enough labeled data (1000+ examples). The labeling strategy: run all requests through the large model for a week, then label each request with the cheapest tier that achieved acceptable quality (measured by user satisfaction or LLM-judge score). Critical: add a confidence threshold — if the router is < 70% confident, default to the expensive model. This avoids the worst failure mode (misrouting a complex request to a cheap model).<br><br>Source: production/classical-ml-for-genai.md<br>Tags: C, classical-ml, production, ranking, routing, sklearn, xgboost
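A sketch of the router with the confidence fallback; `X_train` and `y_train` stand in for the featurized requests and tier labels (0=cache, 1=small, 2=large) collected during the labeling week:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

router = LogisticRegression(max_iter=1000)
router.fit(X_train, y_train)  # assumed: features + tier labels from labeling

def route(features: np.ndarray, threshold: float = 0.70) -> int:
    probs = router.predict_proba(features.reshape(1, -1))[0]
    tier = int(np.argmax(probs))
    # Below the confidence threshold, fail safe to the expensive model (tier 2).
    return tier if probs[tier] >= threshold else 2
```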
How would you choose between SageMaker, Vertex AI, and Azure AI Foundry?	I would start with the existing cloud footprint, governance requirements, workload type, and team skills. The best choice is usually the platform that fits the organization's operating context, not the one with the longest feature list.<br><br>Source: tools-and-infra/cloud-ml-services.md<br>Tags: C, azure-ai-foundry, cloud, infrastructure, mlops, sagemaker, tools-and-infra, vertex-ai
When would you avoid a full managed platform?	When the team needs extreme portability, a highly custom serving stack, or the platform overhead outweighs the operational value for the size of the workload.<br><br>Source: tools-and-infra/cloud-ml-services.md<br>Tags: C, azure-ai-foundry, cloud, infrastructure, mlops, sagemaker, tools-and-infra, vertex-ai
How do modern AI coding agents work?	They follow a plan-act-observe loop: (1) understand the task by reading the codebase context, (2) plan which files to change, (3) implement changes across multiple files, (4) run tests and linters to verify, (5) iterate on failures, (6) present a diff for human review. Tools like Antigravity and Cursor provide IDE integration, while Gemini CLI and Claude Code work from the terminal. The key differentiator in 2026 is MCP support — agents can connect to databases, APIs, and external tools.<br><br>Source: applications/code-generation.md<br>Tags: A, antigravity, applications, claude-code, code-generation, coding-agents, copilot, cursor, devin, gemini-cli, genai, windsurf
Compare Copilot, Cursor, and Antigravity.	Copilot is a platform (extension + agent) best for GitHub-native workflows — evolved from autocomplete to multi-agent orchestration. Cursor is a VS Code fork with AI deeply integrated (Composer for multi-file edits, Supermaven for autocomplete) — best for developers who want AI-enhanced traditional editing. Antigravity is agent-first — designed around delegating to autonomous agents with a Manager View for orchestrating multiple agents simultaneously — best for developers who want to direct rather than write code.<br><br>Source: applications/code-generation.md<br>Tags: A, antigravity, applications, claude-code, code-generation, coding-agents, copilot, cursor, devin, gemini-cli, genai, windsurf
Why are Vision Transformers important for multimodal AI?	ViTs convert images into sequences of patch embeddings using the same transformer architecture as language models. This architectural alignment is what makes multimodal models possible — you can project visual patch tokens into the same embedding space as text tokens, concatenate them, and let a single transformer process both modalities together. This is exactly how models like LLaVA and GPT-4o work: a ViT encodes the image into visual tokens, a projection layer maps them into the LLM's space, and the LLM attends to both visual and text tokens. Before ViTs, integrating CNNs with transformers required more complex adapter architectures.<br><br>Source: multimodal/computer-vision-fundamentals.md<br>Tags: computer-vision, cv, detection, images, multimodal, vit
How does CLIP enable zero-shot image classification?	CLIP trains a shared embedding space for images and text using contrastive learning on 400M image-text pairs from the internet. During training, matching image-text pairs are pulled together in embedding space while non-matching pairs are pushed apart. At inference, you encode the image with the vision encoder and encode candidate class descriptions ("a photo of a dog", "a photo of a cat") with the text encoder. The class whose text embedding is most similar to the image embedding is the prediction. No task-specific training needed — any text description works as a class label. The limitation is that CLIP's accuracy is lower than fine-tuned models on specific benchmarks, but its flexibility is unmatched.<br><br>Source: multimodal/computer-vision-fundamentals.md<br>Tags: computer-vision, cv, detection, images, multimodal, vit
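A sketch of the inference side using the Hugging Face transformers CLIP classes; the checkpoint name and file path are illustrative defaults:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # hypothetical local file
labels = ["a photo of a dog", "a photo of a cat", "a photo of a truck"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
# Image-text similarity scores, softmaxed into class probabilities.
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```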
Design an image search system for an e-commerce platform.	I'd build a two-stage retrieval + reranking pipeline. Stage 1: Use CLIP (or SigLIP) to encode all product images into embeddings, stored in a vector database (Qdrant or Pinecone). User queries (text or uploaded image) are encoded with the same model, and the top-100 candidates retrieved by cosine similarity. Stage 2: A cross-encoder reranker (or VLM) scores each candidate for relevance, considering product metadata (category, price, availability). The embedding index would be refreshed nightly via a batch pipeline. For latency: embedding lookup < 50ms, reranking < 200ms. For cost: CLIP encoding is ~$0.001/image. I'd also add a feedback loop — user clicks improve the reranker over time.<br><br>Source: multimodal/computer-vision-fundamentals.md<br>Tags: computer-vision, cv, detection, images, multimodal, vit
When would you use RAG vs just a long context window?	Long context when: few documents, need cross-references, latency isn't critical, and you can afford the token cost. RAG when: many documents (more than context window), need real-time data, cost-sensitive, or need to scale to millions of docs. In practice, combine both: cache stable reference docs in context, use RAG for dynamic query-specific retrieval.<br><br>Source: techniques/context-engineering.md<br>Tags: Foundation, context-caching, context-window, genai, long-context, prompt-caching, rag-vs-context, techniques
What is context engineering?	Context engineering is the practice of strategically constructing the full input to an LLM — system prompt, cached reference docs, RAG results, conversation history, and examples — to maximize output quality within the token budget. It's becoming more important than prompt engineering because the quality bottleneck is often WHAT information the model has access to, not HOW you phrase the question.<br><br>Source: techniques/context-engineering.md<br>Tags: Foundation, context-caching, context-window, genai, long-context, prompt-caching, rag-vs-context, techniques
What is catastrophic forgetting?	When a neural network trained on task A is subsequently trained on task B, it tends to lose its ability to perform task A. This happens because gradient updates for B overwrite the weights optimized for A. It's fundamental to how neural networks learn — they don't have separate memory systems like human brains.<br><br>Source: techniques/continual-learning.md<br>Tags: catastrophic-forgetting, continual-learning, genai, knowledge-update, lifelong-learning, techniques
How do production LLMs handle knowledge updates without continual learning?	Three main approaches: (1) RAG — retrieve latest information at inference time without changing model weights, (2) Periodic retraining from scratch on updated data, (3) Modular adapters (LoRA) for new capabilities. True continual learning is still mostly a research challenge.<br><br>Source: techniques/continual-learning.md<br>Tags: catastrophic-forgetting, continual-learning, genai, knowledge-update, lifelong-learning, techniques
How is conversational AI different from a basic chatbot?	A basic chatbot generates locally plausible replies — it answers the current message without tracking state. A conversational AI system manages dialogue state across turns (tracking intent, confirmed slots, pending questions), handles ambiguity through clarification, recovers from misunderstandings, uses tools to take real actions, and knows when to escalate to a human. The key difference is that a conversational system has explicit state management (what has been said, what's confirmed, what's pending) rather than relying purely on the LLM's context window to "remember" everything.<br><br>Source: applications/conversational-ai.md<br>Tags: A, agents, applications, chatbots, conversational-ai, dialogue, state, voice
Design a customer support chatbot for an e-commerce company.	I'd start by defining the scope: order status, returns/refunds, product questions, and escalation to human agents. The architecture would be a LangGraph-based conversation flow with: (1) an intent classifier node that routes to specialized sub-flows, (2) structured state tracking order IDs, customer info, and issue type, (3) tool integrations for order lookup, return initiation, and ticket creation, (4) a summarization memory layer for conversations > 10 turns, (5) guardrails for PII handling and policy compliance. For latency, I'd target TTFT < 500ms with streaming. For evaluation, I'd track task completion rate, turns-to-resolution, escalation rate, and CSAT scores. The critical design decision is the escalation policy — I'd implement confidence-based routing where the bot hands off proactively when confidence drops below 0.7, rather than waiting for the user to ask for a human.<br><br>Source: applications/conversational-ai.md<br>Tags: A, agents, applications, chatbots, conversational-ai, dialogue, state, voice
What should a conversational system remember and forget?	This is a product decision, not a technical one. Remember: user's stated goal, confirmed facts (slots), tool results, and explicit preferences. Forget: rejected alternatives, small talk, verbose explanations, and intermediate reasoning steps. The implementation I'd use is a hybrid: structured state for confirmed facts (a Pydantic model with intent, slots, phase), periodic summarization for conversation flow, and the last 4-6 turns verbatim for immediate context. Critical rule: never "remember" something that was said in a summary that wasn't in the original messages — that's how summary drift causes hallucinated memories.<br><br>Source: applications/conversational-ai.md<br>Tags: A, agents, applications, chatbots, conversational-ai, dialogue, state, voice
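A sketch of the structured-state half, using the Pydantic shape mentioned above; field names are illustrative:

```python
from pydantic import BaseModel, Field

class DialogueState(BaseModel):
    # Confirmed facts survive summarization; everything else may be forgotten.
    intent: str | None = None
    slots: dict[str, str] = Field(default_factory=dict)  # e.g. order_id, email
    phase: str = "gathering"  # gathering -> confirming -> executing
    summary: str = ""         # periodic summary of older turns
    recent_turns: list[str] = Field(default_factory=list)  # last 4-6 verbatim
```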
What are the biggest cost levers in a GenAI application?	Model routing, context control, caching, retrieval discipline, and serving choices. Small prompt tweaks help, but architecture decisions usually dominate the savings.<br><br>Source: production/cost-optimization.md<br>Tags: B, C, caching, cost, llmops, optimization, production, routing, token-cost
What metric is better than cost per request?	Cost per successful task, because it reflects whether the spend actually produced value. A cheap request path that often fails can be more expensive overall.<br><br>Source: production/cost-optimization.md<br>Tags: B, C, caching, cost, llmops, optimization, production, routing, token-cost
How would you build a system that improves from user feedback?	I'd design a 4-stage data flywheel. Stage 1: Collect both explicit signals (thumbs up/down, user edits) and implicit signals (regeneration, session abandonment) from every interaction. Stage 2: Curate — user corrections become the highest-quality training data; thumbs-up responses become positive examples; thumbs-down + regeneration patterns reveal failure modes. Stage 3: Improve iteratively — start with prompt refinements (days), then retrieval improvements (weeks), then embedding fine-tuning (weeks), then model fine-tuning quarterly. Stage 4: Measure impact with A/B tests — compare flywheel-improved version vs control. I'd target 2-5% quality improvement per month, compounding over time.<br><br>Source: production/data-flywheel-design.md<br>Tags: B, continuous-improvement, data-flywheel, feedback-loops, llmops, production
What optimizer do you use for training Transformers and why?	AdamW. It's Adam with decoupled weight decay, which provides better regularization for Transformers. Adam adapts the learning rate per-parameter using running estimates of gradient mean and variance.<br><br>Source: prerequisites/deep-learning-fundamentals.md<br>Tags: Foundation, deep-learning, genai-prerequisite, gpu, optimizer, prerequisites, regularization, training
How would you handle GPU memory limitations when training?	(1) Reduce batch size + gradient accumulation, (2) Mixed precision (BF16), (3) Gradient checkpointing, (4) LoRA/QLoRA (train small adapters not full model), (5) DeepSpeed ZeRO / FSDP (distribute across GPUs).<br><br>Source: prerequisites/deep-learning-fundamentals.md<br>Tags: Foundation, deep-learning, genai-prerequisite, gpu, optimizer, prerequisites, regularization, training
How do diffusion models generate images?	During training, the model learns to predict noise added to images at various levels. During generation, start from pure noise and iteratively denoise over many steps, guided by a text prompt using classifier-free guidance.<br><br>Source: multimodal/diffusion-models.md<br>Tags: dall-e, diffusion, genai, image-generation, multimodal, stable-diffusion
Why did diffusion models replace GANs?	More stable training (no adversarial min-max game), no mode collapse, higher diversity, better controllability (CFG, ControlNet), and the quality caught up and surpassed GANs.<br><br>Source: multimodal/diffusion-models.md<br>Tags: dall-e, diffusion, genai, image-generation, multimodal, stable-diffusion
What is classifier-free guidance?	During training, randomly drop the text condition. At inference, run both conditional and unconditional passes, then amplify the difference. This lets you control how strongly the model follows the prompt (guidance scale parameter).<br><br>Source: multimodal/diffusion-models.md<br>Tags: dall-e, diffusion, genai, image-generation, multimodal, stable-diffusion
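The inference-time combination is a single line; a sketch, with the common default guidance scale shown as an assumption:

```python
import torch

def cfg_noise(eps_uncond: torch.Tensor, eps_cond: torch.Tensor,
              guidance_scale: float = 7.5) -> torch.Tensor:
    # Amplify the direction the text condition pushes the noise prediction,
    # relative to the unconditional prediction.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```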
What is the difference between scaling replicas and sharding a model?	Replica scaling duplicates the full serving stack to handle more requests, while model sharding splits one model across multiple devices because it is too large or too expensive to serve as one unit.<br><br>Source: inference/distributed-inference-and-serving-architecture.md<br>Tags: C, architecture, distributed-inference, inference, kv-cache, scaling, serving
Why is KV-cache locality important?	Because rebuilding or transferring cache state is costly. Good locality helps keep follow-up tokens and turns efficient instead of repeatedly paying for the same context.<br><br>Source: inference/distributed-inference-and-serving-architecture.md<br>Tags: C, architecture, distributed-inference, inference, kv-cache, scaling, serving
Why do AI systems need distributed-systems knowledge?	Because production AI is composed of many interacting services with partial failure, variable latency, and expensive state transitions. Reliability depends on queueing, retry policy, caching, and graceful degradation.<br><br>Source: tools-and-infra/distributed-systems-for-ai.md<br>Tags: C, consistency, distributed-systems, infrastructure, queues, scaling, tools-and-infra
What is backpressure in an AI system?	It is the mechanism that prevents fast upstream components from overwhelming a slower downstream stage such as inference or retrieval. Without it, latency and failure rates can cascade through the system.<br><br>Source: tools-and-infra/distributed-systems-for-ai.md<br>Tags: C, consistency, distributed-systems, infrastructure, queues, scaling, tools-and-infra
What's the difference between data parallelism and tensor parallelism?	Data parallelism replicates the entire model on each GPU and splits the batch — each GPU processes different data, then gradients are synchronized via all-reduce. This scales throughput linearly for models that fit on a single GPU. Tensor parallelism splits individual layer computations across GPUs — for example, a large matrix multiplication is split column-wise across 4 GPUs, each computing 1/4 of the result. This enables layers that are too large for one GPU but requires extremely fast inter-GPU communication (NVLink, not Ethernet) because activations must be synchronized at every layer boundary. In practice, tensor parallelism is used intra-node (within a server with NVLink) while data parallelism is used inter-node (across servers).<br><br>Source: research-frontiers/distributed-training.md<br>Tags: D, checkpointing, clusters, deepspeed, distributed-training, fsdp, research, research-frontiers, tensor-parallelism, training-infrastructure, zero
Explain ZeRO optimization stages.	ZeRO addresses the memory inefficiency of standard DDP, where each GPU holds a full copy of model weights, optimizer state, and gradients. ZeRO eliminates this redundancy in 3 stages. Stage 1 shards only the optimizer state (Adam momentum + variance) — this alone saves ~60% of optimizer memory with minimal communication overhead, making it the best first step. Stage 2 additionally shards gradients via reduce-scatter instead of all-reduce. Stage 3 (equivalent to FSDP) shards everything including model weights — each GPU holds only 1/N of parameters and uses all-gather to reconstruct weights before each forward pass. The tradeoff is progressive: each stage saves more memory but adds more communication. For fine-tuning, Stage 2 is usually the sweet spot; for training models that truly don't fit, Stage 3 is necessary.<br><br>Source: research-frontiers/distributed-training.md<br>Tags: D, checkpointing, clusters, deepspeed, distributed-training, fsdp, research, research-frontiers, tensor-parallelism, training-infrastructure, zero
When would you choose Kubernetes for a GenAI system?	When the system has multiple independently scaled services, controlled rollouts, background jobs, observability requirements, or self-hosted inference that needs GPU scheduling. For smaller systems, a managed platform or simple container deployment may be better.<br><br>Source: production/docker-and-kubernetes.md<br>Tags: C, containers, deployment, docker, kubernetes, llmops, production
What is the main Docker benefit for AI teams?	Reproducibility. It removes "works on my machine" failures across notebooks, CI, and production while standardizing dependencies and rollout behavior.<br><br>Source: production/docker-and-kubernetes.md<br>Tags: C, containers, deployment, docker, kubernetes, llmops, production
How would you build a document processing pipeline for a RAG system?	I'd build a 5-stage pipeline. (1) Format detection to route PDFs, DOCX, HTML to appropriate parsers. (2) Text extraction — PyMuPDF for digital PDFs, pdfplumber for table-heavy PDFs, Tesseract+layout detection for scans. (3) Structure preservation — keep headings, lists, and table structure using markdown formatting. (4) Document-aware chunking — split at section boundaries with 200-500 token chunks and 50-token overlap, keeping section headers as metadata. (5) Metadata enrichment — attach source file, page number, section heading to each chunk. I'd evaluate quality by sampling 50 chunks and manually checking if they preserve the meaning of the original content.<br><br>Source: production/document-parsing-and-extraction.md<br>Tags: B, chunking, document-parsing, extraction, ocr, pdf, production, rag
How would you improve retrieval quality in a RAG system?	I'd follow a priority ladder. First, measure baseline retrieval quality (Precision@5, Recall@5) to quantify the gap. Second, check chunking — are chunks the right size (200-500 tokens) with enough context? Third, try hybrid search (semantic + keyword with BM25). Fourth, add a cross-encoder reranker on top-20 results. If the domain is specialized (medical, legal), I'd fine-tune the embedding model on 5K-10K domain-specific (query, document) pairs using contrastive learning — this typically gives 10-30% improvement on domain queries. I'd evaluate each change independently to measure its contribution.<br><br>Source: techniques/embedding-fine-tuning.md<br>Tags: B, contrastive-learning, embeddings, fine-tuning, production, rag, retrieval, techniques
What are embeddings and why do they matter for GenAI?	Embeddings map data to dense vectors where semantic similarity becomes geometric distance. They're the foundation of RAG (find relevant documents), semantic search (find by meaning), and even the first layer of every LLM. Without embeddings, modern AI can't represent or compare meaning.<br><br>Source: foundations/embeddings.md<br>Tags: Foundation, embeddings, foundations, genai-foundations, representation, similarity, vectors
What's the difference between word embeddings and sentence embeddings?	Word embeddings (Word2Vec, GloVe) encode individual words — "bank" always gets the same vector. Sentence embeddings (SBERT, text-embedding-3) encode entire sentences with context — "river bank" and "bank robbery" get very different vectors. Modern systems use sentence/paragraph embeddings.<br><br>Source: foundations/embeddings.md<br>Tags: Foundation, embeddings, foundations, genai-foundations, representation, similarity, vectors
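A sketch with sentence-transformers; the MiniLM checkpoint is just a common lightweight choice:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["river bank", "bank robbery", "money deposited at the bank"]
embeddings = model.encode(sentences, normalize_embeddings=True)
# With normalized vectors, cosine similarity is just the dot product.
print(embeddings @ embeddings.T)
```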
How does RLHF work?	Generate multiple responses → humans rank them by preference → train a reward model on those rankings → use RL (PPO) to fine-tune the LLM to maximize the reward model's score. This teaches the model nuanced preferences (helpful, harmless, honest) that explicit rules can't capture.<br><br>Source: ethics-and-safety/ethics-safety-alignment.md<br>Tags: E, alignment, bias, ethics, ethics-and-safety, genai, hallucination, responsible-ai, rlhf, safety
How would you handle hallucination in a production system?	Layer defenses: (1) RAG for factual grounding, (2) Force citations/sources, (3) Low temperature for factual tasks, (4) Output validation (check claims against a knowledge base), (5) Human-in-the-loop review for high-stakes decisions.<br><br>Source: ethics-and-safety/ethics-safety-alignment.md<br>Tags: E, alignment, bias, ethics, ethics-and-safety, genai, hallucination, responsible-ai, rlhf, safety
What's the difference between RLHF and DPO?	Both learn from human preference pairs (A is better than B). RLHF first trains a separate reward model, then uses RL to optimize. DPO skips the reward model and directly optimizes the LLM on preference pairs — simpler, cheaper, similar quality.<br><br>Source: ethics-and-safety/ethics-safety-alignment.md<br>Tags: E, alignment, bias, ethics, ethics-and-safety, genai, hallucination, responsible-ai, rlhf, safety
When would you fine-tune vs use RAG?	Fine-tune for: output format changes, domain-specific reasoning/style, consistent behavior. RAG for: up-to-date knowledge, source attribution, private data access. Best practice in 2026: **combine both** — LoRA for behavior, RAG for facts.<br><br>Source: techniques/fine-tuning.md<br>Tags: Foundation, fine-tuning, genai-techniques, lora, peft, qlora, techniques, training
Explain how LoRA reduces memory requirements.	Instead of updating the full d×d weight matrix, LoRA decomposes it into two small matrices of rank r (d×r and r×d). With r=16 on a 4096-dim model, you train 0.78% of parameters. QLoRA goes further by quantizing the frozen base model to 4-bit, reducing memory from ~280GB to ~35GB for a 70B model.<br><br>Source: techniques/fine-tuning.md<br>Tags: Foundation, fine-tuning, genai-techniques, lora, peft, qlora, techniques, training
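A minimal LoRA layer sketch in PyTorch showing the two low-rank factors; initialization and scaling conventions follow the original paper, but details vary by library:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the full d×d weight
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # r×d, small init
        self.B = nn.Parameter(torch.zeros(d_out, r))        # d×r, zero init
        self.scale = alpha / r

    def forward(self, x):
        # Base output plus the trainable low-rank update.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

# Parameter math from the answer: 2·d·r / d² = 2·4096·16 / 4096² ≈ 0.78%
```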
How does function calling work in LLMs?	You define tools with names, descriptions, and parameter schemas. The LLM receives the user message + tool definitions, decides if a tool should be called, and generates a JSON object with the function name and arguments. YOUR code executes the function and feeds the result back to the LLM for final response generation. The LLM never actually runs the function.<br><br>Source: techniques/function-calling-and-structured-output.md<br>Tags: Foundation, function-calling, genai, grounding, json-mode, mcp, structured-output, techniques, tool-use
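A sketch of the two halves: an OpenAI-style tool schema (the exact shape varies by provider) and the application-side dispatch that actually executes the call; `get_weather` is a hypothetical stub:

```python
import json

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def get_weather(city: str) -> str:
    # Hypothetical stub; a real version would call a weather API.
    return f"Sunny in {city}"

def handle_tool_call(name: str, arguments_json: str) -> str:
    # YOUR code runs the tool; the model only emits a name and JSON arguments.
    args = json.loads(arguments_json)
    if name == "get_weather":
        return get_weather(**args)
    raise ValueError(f"Unknown tool: {name}")
```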
What is MCP and why does it matter?	Model Context Protocol is an open standard for connecting LLMs to external tools. Before MCP, every tool needed custom integration for each model. MCP provides a universal interface — any MCP-compatible tool works with any MCP-compatible client. It's becoming the "USB standard" for AI tool integration.<br><br>Source: techniques/function-calling-and-structured-output.md<br>Tags: Foundation, function-calling, genai, grounding, json-mode, mcp, structured-output, techniques, tool-use
Why are LLM decode steps often memory-bound?	Each generated token requires repeatedly loading weights and KV-cache state, so memory movement can dominate arithmetic. That is why layout, caching, and serving-engine design matter so much.<br><br>Source: inference/gpu-cuda-programming.md<br>Tags: D, ai-infra, cuda, gpu, inference, kernels, memory, performance
What is the practical value of understanding CUDA for an AI engineer?	It helps you reason about hardware bottlenecks, choose the right optimizations, and communicate effectively with systems or inference teams when performance issues appear.<br><br>Source: inference/gpu-cuda-programming.md<br>Tags: D, ai-infra, cuda, gpu, inference, kernels, memory, performance
How would you architect a production RAG system?	LLM via API (with fallback), vector DB (Qdrant/Pinecone) with hybrid search, LangChain/LlamaIndex for orchestration, LangSmith for tracing, RAGAS for eval. Add caching layer for repeated queries, rate limiting, and graceful degradation when LLM is unavailable.<br><br>Source: tools-and-infra/tools-overview.md<br>Tags: genai, infrastructure, langchain, llamaindex, serving, tools, tools-and-infra, vector-db
When would you self-host vs use an API?	Self-host when: high volume (cost), privacy requirements (data governance), latency needs (no network hop), or need fine-tuned open models. Use API when: low volume, need best quality (GPT-5/Claude 4), fast iteration, no GPU infra.<br><br>Source: tools-and-infra/tools-overview.md<br>Tags: genai, infrastructure, langchain, llamaindex, serving, tools, tools-and-infra, vector-db
What's the difference between discriminative and generative AI?	Discriminative models learn P(y|x) — "given this input, what is the most likely label?" They draw a boundary between classes. Generative models learn P(x) or P(x|y) — they model what the data itself looks like and can produce new samples. In practice: a spam classifier is discriminative; a model that writes emails is generative.<br><br>Source: genai.md<br>Tags: ai, deep-learning, genai, machine-learning, root
Why did Transformers enable the GenAI revolution?	Three reasons: (1) parallelisable training — unlike RNNs, Transformers process all tokens simultaneously, making large-scale training feasible on GPUs; (2) the attention mechanism captures long-range dependencies that RNNs struggled with; (3) the architecture scales predictably — more data and compute reliably yield better models. All three properties together made training on internet-scale datasets tractable.<br><br>Source: genai.md<br>Tags: ai, deep-learning, genai, machine-learning, root
What's the difference between fine-tuning and RAG?	Fine-tuning modifies the model's weights — baking knowledge or behaviour changes in permanently. RAG (Retrieval-Augmented Generation) leaves weights unchanged and instead retrieves relevant external documents at inference time, injecting them into the context window. Fine-tuning is better for style, format, or behaviour changes; RAG is better for keeping knowledge current and reducing hallucination on factual queries. The two are frequently combined.<br><br>Source: genai.md<br>Tags: ai, deep-learning, genai, machine-learning, root
What is test-time compute scaling?	A third scaling axis, distinct from training compute scaling. Instead of training a bigger model, you allocate more computation *at inference time* — letting the model think through a problem in multiple steps, verify its own reasoning, or explore multiple solution paths before answering. OpenAI's o1, DeepSeek-R1, and extended thinking modes in Claude and Gemini are examples. The insight: for hard reasoning tasks, thinking longer can outperform training larger.<br><br>Source: genai.md<br>Tags: ai, deep-learning, genai, machine-learning, root
What is Graph RAG and when would you use it over standard RAG?	Graph RAG combines knowledge graphs with RAG. Standard RAG retrieves text chunks by similarity — great for "what does X mean?" but fails at "how are X and Y connected?" or "summarize all instances of Z." Graph RAG extracts entities and relationships into a knowledge graph, enabling multi-hop reasoning and aggregation. Use it when: data is entity-heavy (people, organizations, events), questions require relationship traversal, or you need thematic summary across large document sets.<br><br>Source: techniques/graph-rag.md<br>Tags: A, B, agentic-rag, genai, graph-rag, graphrag, knowledge-graph, multi-hop-reasoning, techniques
What is Agentic RAG?	Agentic RAG gives retrieval an autonomous agent that can dynamically choose retrieval strategies (vector search, graph query, SQL, web search), self-correct when results are poor, decompose complex questions into sub-queries, and verify answers before returning. It transforms RAG from a fixed pipeline into an adaptive reasoning loop. This is the emerging standard for enterprise AI in 2026.<br><br>Source: techniques/graph-rag.md<br>Tags: A, B, agentic-rag, genai, graph-rag, graphrag, knowledge-graph, multi-hop-reasoning, techniques
Design a guardrail system for a healthcare chatbot.	Three-layer approach. Input: PII detection (redact SSN, DOB before model sees them), injection detection, and topic filter (reject non-health queries). Model: system prompt with strict medical disclaimer rules, temperature=0 for consistency, structured output for treatment recommendations. Output: medical claim classifier (flag unverified treatment claims), PII leakage check, mandatory disclaimer injection. I'd add a HIPAA compliance layer that logs all interactions without PII for audit. Latency budget: < 200ms total guardrail overhead. For high-risk responses (medication, diagnosis), add a human-review queue.<br><br>Source: production/guardrails-and-content-filtering.md<br>Tags: B, content-filtering, guardrails, llmops, moderation, production, safety
What is the most effective way to reduce hallucination in enterprise assistants?	Ground the answer on retrieval or tool outputs, require evidence in the response path, and add a post-generation verification step with abstention when confidence is low.<br><br>Source: llms/hallucination-detection.md<br>Tags: B, E, factuality, groundedness, hallucination, llm, llms, reliability
How do detection and mitigation differ?	Detection estimates whether an answer is unsupported. Mitigation changes the system so unsupported answers happen less often or are blocked before the user sees them.<br><br>Source: llms/hallucination-detection.md<br>Tags: B, E, factuality, groundedness, hallucination, llm, llms, reliability
How does quantization make LLMs run on consumer hardware?	By representing model weights in fewer bits (INT4 = 4 bits vs FP16 = 16 bits), memory drops 4x. LLaMA 70B goes from 140GB (needs 2× A100) to 35GB (fits on a single 48GB GPU such as an A6000, or 2× RTX 4090). Modern quantization methods (AWQ, GPTQ) preserve quality by protecting important weights and using calibration data.<br><br>Source: inference/inference-optimization.md<br>Tags: C, genai, inference, kv-cache, pd-disaggregation, performance, quantization, radix-attention, serving, sglang, speculative-decoding
Explain speculative decoding.	A small draft model rapidly generates N candidate tokens. The large target model verifies all N in a single parallel forward pass (since verification is parallelizable). Accepted tokens are kept, rejected ones trigger regeneration. Provably lossless (same distribution as target model) with 2-3x speedup.<br><br>Source: inference/inference-optimization.md<br>Tags: C, genai, inference, kv-cache, pd-disaggregation, performance, quantization, radix-attention, serving, sglang, speculative-decoding
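A simplified greedy sketch of the verify step, assuming HF-style causal LMs and batch size 1; the real algorithm accepts tokens probabilistically via rejection sampling (and samples one bonus token), which is what makes it exactly lossless:

```python
import torch

@torch.no_grad()
def speculative_step(draft_model, target_model, ctx: torch.Tensor, n: int = 4):
    draft = ctx
    for _ in range(n):  # n cheap sequential draft steps
        logits = draft_model(draft).logits[:, -1]
        draft = torch.cat([draft, logits.argmax(-1, keepdim=True)], dim=-1)
    proposed = draft[:, ctx.size(1):]
    # One expensive pass scores every proposed position in parallel.
    target_logits = target_model(draft).logits[:, ctx.size(1) - 1 : -1]
    agree = (proposed == target_logits.argmax(-1)).cumprod(dim=-1)
    n_accept = int(agree.sum())  # keep the longest agreeing prefix
    return torch.cat([ctx, proposed[:, :n_accept]], dim=-1)
```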
What is PagedAttention and why does vLLM use it?	Like OS virtual memory paging. KV cache is stored in non-contiguous memory blocks ("pages") instead of one contiguous block. This eliminates fragmentation, allows dynamic memory allocation, and enables sharing cache between requests with common prefixes. Result: 2-4x higher throughput.<br><br>Source: inference/inference-optimization.md<br>Tags: C, genai, inference, kv-cache, pd-disaggregation, performance, quantization, radix-attention, serving, sglang, speculative-decoding
How does knowledge distillation work?	A large "teacher" model's soft probability outputs (including relationships between classes) are used as training targets for a smaller "student" model. The student learns to match the teacher's full output distribution using KL divergence loss, not just the correct answer. This transfers "dark knowledge" — the teacher's implicit understanding of which concepts are similar.<br><br>Source: techniques/distillation-and-compression.md<br>Tags: B, D, compression, distillation, efficiency, genai, pruning, teacher-student, techniques
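A minimal sketch of the classic (Hinton-style) distillation loss, combining a temperature-softened KL term against the teacher with the usual hard-label cross-entropy:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 2.0, alpha: float = 0.5):
    # Soft targets: match the teacher's full distribution (the "dark knowledge").
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T  # rescale gradients for the temperature
    hard = F.cross_entropy(student_logits, labels)  # still learn the right answer
    return alpha * soft + (1 - alpha) * hard
```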
How is DeepSeek-R1-Distill created?	DeepSeek-R1 (671B MoE) generates reasoning chains for thousands of problems. These (input, reasoning_chain + answer) pairs become fine-tuning data for smaller models like Qwen-14B. The small model literally learns to REASON like R1 by mimicking its step-by-step thinking.<br><br>Source: techniques/distillation-and-compression.md<br>Tags: B, D, compression, distillation, efficiency, genai, pruning, teacher-student, techniques
How would you evaluate a RAG system?	Component-level: Retrieval quality (context precision + recall) — are the right chunks found? Generation quality (faithfulness + answer relevancy) — is the answer grounded and on-topic? Use RAGAS for automated metrics, plus a golden test set of 50+ question-answer pairs with human-verified ground truth.<br><br>Source: evaluation/evaluation-and-benchmarks.md<br>Tags: E, Foundation, benchmarks, evaluation, genai, humaneval, mmlu, ragas, testing
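A sketch of the programmatic half, computing context precision and recall at k against the human-verified golden set:

```python
def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5):
    # `retrieved` is ranked chunk ids; `relevant` is ground-truth chunk ids.
    top_k = retrieved[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```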
Why are traditional benchmarks becoming less useful?	Saturation (top models all score >90%), contamination (benchmark data in training sets), and gap between benchmark performance and real-world utility. The field is moving to dynamic benchmarks (LiveBench), harder tests (SWE-bench, ARC-AGI-2), and domain-specific evaluation.<br><br>Source: evaluation/evaluation-and-benchmarks.md<br>Tags: E, Foundation, benchmarks, evaluation, genai, humaneval, mmlu, ragas, testing
Why are benchmarks not enough for production LLM evaluation?	Benchmarks measure generic capability, but production systems depend on domain data, UX constraints, retrieval quality, safety needs, and business outcomes. You need task-specific evaluation tied to real failure modes.<br><br>Source: evaluation/llm-evaluation-deep-dive.md<br>Tags: B, E, deepeval, evaluation, judges, llm-as-judge, ragas, regression
What would you measure in a RAG eval suite?	Retrieval quality, groundedness, answer usefulness, latency, and cost. I would also include adversarial and ambiguous queries because those reveal brittle behavior quickly.<br><br>Source: evaluation/llm-evaluation-deep-dive.md<br>Tags: B, E, deepeval, evaluation, judges, llm-as-judge, ragas, regression
Which LLM would you choose for a production RAG system?	Depends on constraints. For highest quality: Claude Opus 4.6 (1M context, best at following complex instructions with citations). For cost efficiency: GPT-5.4 mini (near GPT-5.4 quality at a fraction of the cost). For data privacy: LLaMA 4 Scout self-hosted (10M context, fits on 1 H100). For multimodal RAG: Gemini 3.1 Pro (native vision for image documents). In practice, use a cheaper model for retrieval/routing and a powerful model for generation.<br><br>Source: llms/llm-landscape.md<br>Tags: claude-4, gemini-3, genai, gpt-5, llama-4, llm-comparison, llms, model-selection, open-vs-closed
Open source vs closed source — when?	Closed (GPT-5.4, Claude) when: you need cutting-edge capability, have budget, want zero-ops, and your data policies allow API calls. Open (LLaMA 4, Gemma 4, DeepSeek) when: data must stay on-premise (healthcare, finance, government), you need fine-tuning beyond what APIs allow, or cost at scale is prohibitive. Trend in 2026: Gemma 4 31B and LLaMA 4 Scout are competitive with mid-tier closed models while being fully self-hostable.<br><br>Source: llms/llm-landscape.md<br>Tags: claude-4, gemini-3, genai, gpt-5, llama-4, llm-comparison, llms, model-selection, open-vs-closed
How would you reduce LLM costs by 80% without losing quality?	Model routing. I'd start by analyzing traffic: typically 70-80% of requests are simple (classification, extraction, formatting) and can be handled by a cheap model like GPT-4o-mini or Gemini Flash at 1/100th the cost of GPT-4. I'd implement a classifier-based router trained on labeled examples of easy/medium/hard queries. For the remaining 20-30% of complex requests, I'd use a mid-tier model, reserving expensive models (GPT-4, Opus) for only the hardest 2-5%. I'd monitor quality per route with automated evals and adjust thresholds weekly. Expected savings: 5-10× reduction in average cost per request.<br><br>Source: production/llm-routing-and-model-selection.md<br>Tags: B, cost, latency, llmops, model-selection, production, routing
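A minimal routing skeleton; the heuristic classifier and model names are placeholders — a production router would use a trained classifier as described above:

```python
from dataclasses import dataclass

@dataclass
class Route:
    model: str
    cost_per_1k_tokens: float

# Hypothetical tiers and prices, for illustration only.
ROUTES = {
    "easy":   Route("cheap-small-model", 0.0002),
    "medium": Route("mid-tier-model",    0.003),
    "hard":   Route("frontier-model",    0.03),
}

def classify_difficulty(query: str) -> str:
    # Placeholder heuristic; replace with a classifier trained on labeled queries.
    if any(k in query.lower() for k in ("prove", "design", "debug", "multi-step")):
        return "hard"
    if len(query) < 200:
        return "easy"
    return "medium"

def route(query: str) -> Route:
    return ROUTES[classify_difficulty(query)]

print(route("What's the capital of France?").model)  # cheap-small-model
```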
How would you take an LLM prototype to production?	(1) Create an eval suite (50+ golden examples), (2) Add input/output guardrails, (3) Implement observability (Langfuse/LangSmith), (4) Set up cost alerting, (5) Abstract the LLM provider behind a gateway for fallbacks, (6) CI/CD pipeline that runs eval suite on every prompt/code change, (7) Canary deployment with quality monitoring.<br><br>Source: production/llmops.md<br>Tags: A, B, ci-cd, deployment, genai, llmops, monitoring, observability, production
How do you handle LLM quality degradation in production?	Continuous monitoring via automated evals, user feedback (👍/👎), drift detection. When quality drops: check if the provider updated the model, run regression analysis against golden set, roll back prompts if needed, or switch to a backup model.<br><br>Source: production/llmops.md<br>Tags: A, B, ci-cd, deployment, genai, llmops, monitoring, observability, production
Explain the training pipeline of a modern LLM.	Pre-training (next-token prediction on internet text) → SFT (supervised fine-tuning on instruction-response pairs) → RLHF/DPO (learning from human preference comparisons) → Safety alignment<br><br>Source: llms/llms-overview.md<br>Tags: Foundation, claude, gemini, genai, gpt, language-models, llama, llm, llms
What's the difference between dense and MoE architectures?	Dense: every parameter processes every token (e.g., GPT-4, Claude). MoE: tokens are routed to a subset of "expert" sub-networks (e.g., LLaMA 4 Maverick). MoE gives more total capacity with less compute per token.<br><br>Source: llms/llms-overview.md<br>Tags: Foundation, claude, gemini, genai, gpt, language-models, llama, llm, llms
How do you choose between using an API (GPT/Claude) vs hosting an open model (LLaMA)?	API: faster to start, best performance, no infra. Self-host: data stays private, no vendor lock-in, customizable. Cost crossover: at ~1M+ tokens/day, self-hosting often becomes cheaper.<br><br>Source: llms/llms-overview.md<br>Tags: Foundation, claude, gemini, genai, gpt, language-models, llama, llm, llms
What is the difference between latency and throughput, and when do they conflict?	Latency is the time a single request takes from submission to completion. Throughput is the total work the system handles per unit time (req/s or tokens/s). They conflict because optimizing throughput often means larger batch sizes and higher GPU utilization, which increases queuing time and thus individual request latency. In production, you typically set a latency SLA first (e.g., P95 < 2s), then maximize throughput within that constraint. The mathematical relationship is captured by Little's Law: L = λW — at a given concurrency level (L), you can trade latency (W) for throughput (λ) and vice versa.<br><br>Source: production/latency-and-throughput-engineering.md<br>Tags: C, latency, llmops, performance, production, queueing, throughput
Why is P95/P99 latency more important than average latency in AI systems?	Because user experience is defined by the slowest interaction, not the average one. If your P50 is 800ms but P99 is 5000ms, then 1 in 100 users waits 5+ seconds — and those users remember. It gets worse with fan-out: a page showing 10 AI-powered components has a ~40% chance that at least one of them exceeds its P95 latency (1 − 0.95¹⁰ ≈ 0.40). Furthermore, tail latency often reveals systemic issues (garbage collection, memory pressure, noisy neighbors) that averages hide. In system design interviews, always say "I'd measure P95 and P99" — it signals production experience.<br><br>Source: production/latency-and-throughput-engineering.md<br>Tags: C, latency, llmops, performance, production, queueing, throughput
How would you capacity-plan an LLM serving system for 50 requests/second?	I'd use Little's Law. If average latency is 2 seconds and each GPU handles 8 concurrent requests at 80% utilization (to avoid queue explosion), then: Required concurrency = λ × W = 50 × 2 = 100 concurrent slots. Effective capacity per GPU = 8 × 0.8 = 6.4. GPUs needed = 100 / 6.4 ≈ 15.6 → 16 GPUs. I'd add 20% headroom for traffic bursts, so 19-20 GPUs. Then I'd validate with load testing, watching P99 latency and queue depth to confirm the model holds under realistic traffic patterns.<br><br>Source: production/latency-and-throughput-engineering.md<br>Tags: C, latency, llmops, performance, production, queueing, throughput
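The same arithmetic as a reusable helper (parameter values are the ones assumed above):

```python
import math

def gpus_needed(req_per_s, avg_latency_s, slots_per_gpu,
                target_util=0.8, headroom=0.2):
    concurrency = req_per_s * avg_latency_s  # Little's Law: L = lambda * W
    effective = slots_per_gpu * target_util  # usable slots per GPU
    return math.ceil(concurrency / effective * (1 + headroom))

print(gpus_needed(50, 2.0, 8))  # -> 19
```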
Why is the dot product central to the attention mechanism?	Attention computes Q·Kᵀ where Q = query and K = key. The dot product measures how "related" each query is to each key — high dot product = high attention. This is then softmaxed into weights that determine how much each value V contributes to the output.<br><br>Source: prerequisites/linear-algebra-for-ai.md<br>Tags: Foundation, dot-product, genai-prerequisite, linear-algebra, matrices, prerequisites, tensors, vectors
Why are GPUs used for deep learning?	Neural networks are fundamentally matrix multiplications. GPUs have thousands of cores designed for parallel math operations. A CPU might do matrix multiply sequentially; a GPU does thousands of multiply-adds simultaneously.<br><br>Source: prerequisites/linear-algebra-for-ai.md<br>Tags: Foundation, dot-product, genai-prerequisite, linear-algebra, matrices, prerequisites, tensors, vectors
When would you use RAG vs long context?	It depends on corpus size, cost tolerance, and update frequency. If the source material is < 50K tokens and relatively static (e.g., a product manual), I'd stuff it directly into context — simpler architecture, no retrieval failures. For 50K-500K tokens, I'd use a hybrid: retrieve the top 5-10 most relevant chunks via RAG, then include them in a long-context prompt with supporting background. For > 500K tokens (entire codebases, large doc collections), RAG is necessary — no model reliably processes that much context. I'd also consider cost: a 200K-token context costs $0.50-$15 per request at API prices, vs $0.01-$0.10 for RAG with small chunks.<br><br>Source: techniques/long-context-engineering.md<br>Tags: B, chunking, context-window, long-context, needle-in-haystack, production, rag, techniques
What is Tool Poisoning in MCP and how do you defend against it?	Tool Poisoning is a form of indirect prompt injection where an attacker embeds malicious instructions in a tool's description or metadata. When an LLM connects to the MCP server and reads tool definitions, it treats those instructions as authoritative. The model might then exfiltrate data, bypass controls, or perform unauthorized actions. Defense: (1) audit all tool descriptions before approval, (2) use regex and LLM-based scanners for injection patterns, (3) sandbox tool execution so even a compromised tool can't access sensitive data, (4) pin tool definition hashes to detect rug-pull updates.<br><br>Source: ethics-and-safety/mcp-security.md<br>Tags: E, adversarial, agentic-security, ethics-and-safety, genai-safety, mcp, security, supply-chain, tool-poisoning
How would you design a security layer for MCP tools in an enterprise?	Zero Trust architecture with five layers. (1) Allowlisting — maintain a curated registry of approved MCP servers. (2) Description audit — automated scanning of all tool descriptions for injection patterns, with human review for new servers. (3) Least privilege — each tool gets minimum OAuth scopes and file path restrictions. (4) Sandbox — run MCP servers in containers with network egress rules and no access to host secrets. (5) Monitoring — structured logs of every tool invocation with parameters, anomaly detection for unusual patterns like accessing credentials or sending data to external URLs.<br><br>Source: ethics-and-safety/mcp-security.md<br>Tags: E, adversarial, agentic-security, ethics-and-safety, genai-safety, mcp, security, supply-chain, tool-poisoning
How does MCP security relate to OWASP LLM06 (Excessive Agency)?	LLM06 warns about giving AI systems too much capability without guardrails. MCP directly manifests this — each tool grants new capabilities to the agent. The risk compounds: an agent with a file-reading tool, a network tool, and an email tool has the attack surface of all three combined. Mitigation: treat each tool as a separate capability grant, enforce least privilege per tool (not per server), and require explicit user approval for sensitive operations.<br><br>Source: ethics-and-safety/mcp-security.md<br>Tags: E, adversarial, agentic-security, ethics-and-safety, genai-safety, mcp, security, supply-chain, tool-poisoning
What is the minimum metadata you would track for an ML run?	Parameters, metrics, code version, data version, artifacts, and environment details. Without that set, comparing or reproducing results is unreliable. For GenAI specifically, I'd also track prompt versions, eval set versions, and token costs.<br><br>Source: tools-and-infra/ml-experiment-and-data-management.md<br>Tags: C, data-versioning, dvc, experiment-tracking, infrastructure, lineage, mlflow, tools-and-infra, wandb
Why is data versioning essential for ML reproducibility?	Because code alone does not determine model behavior. You need the exact dataset state, splits, and lineage to reproduce or explain results. In GenAI, this extends to RAG corpus snapshots, preference data, and evaluation sets.<br><br>Source: tools-and-infra/ml-experiment-and-data-management.md<br>Tags: C, data-versioning, dvc, experiment-tracking, infrastructure, lineage, mlflow, tools-and-infra, wandb
What is superposition in neural networks?	Neural networks represent more concepts (features) than they have neurons. Features are encoded as DIRECTIONS in activation space, not individual neurons. Multiple features share the same neurons through superposition — similar to how compressed audio encodes many frequencies in fewer data points. Sparse autoencoders can decompose these back into individual features.<br><br>Source: research-frontiers/interpretability.md<br>Tags: D, circuits, genai, interpretability, mech-interp, research-frontiers, sparse-autoencoders, superposition
Why does mechanistic interpretability matter for AI safety?	We need to understand what models are doing internally — not just what they output. Mech-interp can detect deceptive behavior (features that activate during strategic dishonesty), verify alignment (the model genuinely follows safety training, not just surface compliance), and enable targeted interventions (edit specific behaviors without retraining).<br><br>Source: research-frontiers/interpretability.md<br>Tags: D, circuits, genai, interpretability, mech-interp, research-frontiers, sparse-autoencoders, superposition
When would you use model merging instead of fine-tuning on combined data?	Merging when: (1) you already have multiple fine-tuned models and want to avoid retraining, (2) you don't have access to the original training data, (3) you want to quickly iterate on combinations (merging takes minutes, fine-tuning takes hours). Fine-tuning on combined data when: (1) you need guaranteed quality, (2) you have the data and compute budget, (3) the task combination is complex enough that weight averaging won't capture interactions.<br><br>Source: techniques/model-merging.md<br>Tags: B, D, dare, fine-tuning, genai, mergekit, model-merging, open-weight, slerp, techniques, ties
How do you evaluate whether a merge was successful?	Three levels. First, run each parent model's original eval suite against the merge — the merge should retain ≥90% of each parent's task-specific performance. Second, run general benchmarks (MMLU-Pro, HumanEval) to ensure no broad capability loss. Third, red-team for safety regressions, especially if one parent was a safety-tuned model. If any parent's capability drops below acceptable threshold, adjust merge weights or switch to TIES/DARE to reduce interference.<br><br>Source: techniques/model-merging.md<br>Tags: B, D, dare, fine-tuning, genai, mergekit, model-merging, open-weight, slerp, techniques, ties
What is the difference between inference optimization and model serving?	Inference optimization focuses on making the core generation path more efficient, for example quantization or KV-cache improvements. Model serving covers the full production runtime around that path, including APIs, routing, scheduling, scaling, and failure handling.<br><br>Source: production/model-serving.md<br>Tags: B, C, inference, llmops, production, serving, tgi, triton, vllm
When would you self-host instead of using a managed API?	When privacy, volume economics, model customization, or latency control outweigh the extra operational burden. Otherwise, managed APIs are usually the faster path.<br><br>Source: production/model-serving.md<br>Tags: B, C, inference, llmops, production, serving, tgi, triton, vllm
What is Mixture of Experts and why does LLaMA 4 use it?	MoE has multiple "expert" FFN sub-networks per layer with a learned router. For each token, only top-K experts (e.g., 2 of 16) are activated. This gives the model capacity of the total parameters but computational cost of only the active experts. LLaMA 4 uses it to achieve 400B total params with only 17B active — massive capacity at manageable cost.<br><br>Source: foundations/modern-architectures.md<br>Tags: Foundation, architecture, flash-attention, foundations, genai, gqa, mixture-of-experts, moe, rope
What is GQA and how does it save memory?	Grouped-Query Attention shares K and V heads across groups of Q heads. With 64 Q heads and 8 KV heads, the KV cache is 8x smaller than full MHA. This is critical for serving long-context models — KV cache can otherwise consume more memory than the model weights.<br><br>Source: foundations/modern-architectures.md<br>Tags: Foundation, architecture, flash-attention, foundations, genai, gqa, mixture-of-experts, moe, rope
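A sketch of the KV-cache arithmetic, using a LLaMA-2-70B-like config (80 layers, head_dim 128, 64 query heads, 8 KV heads) as an assumed example:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per=2):
    """Two tensors (K and V) per layer; bytes_per=2 for FP16/BF16."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per / 1e9

full_mha = kv_cache_gb(80, 64, 128, seq_len=32_768)  # KV heads = Q heads
gqa      = kv_cache_gb(80, 8, 128, seq_len=32_768)   # 8 KV heads
print(f"MHA: {full_mha:.1f} GB, GQA: {gqa:.1f} GB")  # ~85.9 GB vs ~10.7 GB, 8x smaller
```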
Why is observability harder for LLM systems than for normal APIs?	Because correctness is not binary. The system can return a 200 response and still be wrong, unsafe, or unhelpful. You need traceable context, output quality signals, and user feedback, not just uptime metrics.<br><br>Source: production/monitoring-observability.md<br>Tags: B, C, evals, llmops, monitoring, observability, production, tracing
What is the minimum telemetry for a production RAG system?	Request id, model and prompt version, retrieval documents and scores, token usage, latency, validation status, and user feedback. That gives you enough context to debug both system and semantic failures.<br><br>Source: production/monitoring-observability.md<br>Tags: B, C, evals, llmops, monitoring, observability, production, tracing
When would you choose multi-agent over single-agent?	I'd choose multi-agent when three conditions are met: (1) the task has natural decomposition boundaries where different sub-tasks benefit from different tool access, system prompts, or contexts — for example, a research agent with web search and a coding agent with a sandbox; (2) the quality improvement from specialization is measurable and significant, not incremental; and (3) the latency and cost multiplier (3-7× more expensive) is acceptable for the use case. I'd always benchmark a single-agent baseline first. If one agent with well-designed tools achieves 80%+ of the quality, the coordination overhead of multi-agent isn't justified. The exception is adversarial review: having a critic agent that challenges the primary agent's output catches errors that self-review misses.<br><br>Source: agents/multi-agent-architectures.md<br>Tags: A, B, agents, coordination, genai-techniques, multi-agent, orchestration
Design a multi-agent system for automated code review.	I'd use a pipeline pattern with 3 agents. First, a Code Analyzer agent with access to static analysis tools (linting, complexity metrics, type checking) processes the diff and produces a structured analysis. Second, a Logic Reviewer agent with access to the codebase context (via RAG over the repo) evaluates correctness, identifies potential bugs, and checks for security issues. Third, a Summary Agent synthesizes both analyses into a human-readable review with actionable suggestions, severity levels, and specific line references. State management: each agent writes to a shared ReviewState dict with typed fields (analysis, logic_issues, suggestions). I'd add a max_cost guard ($0.50/review), trajectory logging via LangSmith, and a confidence score—if any agent is < 70% confident, flag for human review instead of auto-approving.<br><br>Source: agents/multi-agent-architectures.md<br>Tags: A, B, agents, coordination, genai-techniques, multi-agent, orchestration
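A skeleton of that pipeline with a shared typed state; the agent bodies are stubs, and the `ReviewState` fields follow the description above:

```python
from dataclasses import dataclass, field

@dataclass
class ReviewState:
    diff: str
    analysis: dict = field(default_factory=dict)
    logic_issues: list = field(default_factory=list)
    suggestions: list = field(default_factory=list)
    confidence: float = 1.0

def code_analyzer(state: ReviewState) -> ReviewState:
    state.analysis = {"lint": "placeholder"}  # would call static-analysis tools
    return state

def logic_reviewer(state: ReviewState) -> ReviewState:
    state.logic_issues = []  # would call an LLM with RAG over the repo
    return state

def summarizer(state: ReviewState) -> ReviewState:
    state.suggestions = ["placeholder"]  # would synthesize the final review
    return state

def run_review(diff: str) -> ReviewState:
    state = ReviewState(diff=diff)
    for agent in (code_analyzer, logic_reviewer, summarizer):
        state = agent(state)
        if state.confidence < 0.7:  # low confidence -> flag for human review
            break
    return state

print(run_review("--- a/foo.py\n+++ b/foo.py").suggestions)
```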
How do multimodal models process images?	Images are encoded by a vision encoder (like ViT) into a sequence of "visual tokens," similar to text tokens. These are concatenated with text tokens and processed by the same Transformer backbone. Cross-attention allows the model to reason about both text and visual information together.<br><br>Source: multimodal/multimodal-ai.md<br>Tags: genai, multimodal, sora, text-to-audio, text-to-video, veo, vision-language
Why is native multimodality important vs bolting vision onto a text model?	Bolted-on vision (early GPT-4V approach) processes modalities separately and aligns them — creating artifacts. Native multimodality (Gemini, LLaMA 4) trains on all modalities from the start, creating deeper cross-modal understanding and more natural integration.<br><br>Source: multimodal/multimodal-ai.md<br>Tags: genai, multimodal, sora, text-to-audio, text-to-video, veo, vision-language
What's the difference between BERT and GPT?	BERT is an ENCODER that sees all tokens bidirectionally (optimized for understanding — classification, NER, embeddings). GPT is a DECODER that sees only past tokens (optimized for generation — text, code, chat). Both use Transformers, but BERT predicts masked tokens while GPT predicts the next token.<br><br>Source: prerequisites/nlp-fundamentals.md<br>Tags: Foundation, bert, genai-prerequisite, natural-language-processing, ner, nlp, prerequisites, sentiment, text-classification
Has GenAI made traditional NLP obsolete?	Mostly, yes. LLMs handle most NLP tasks via prompting, often better than task-specific models. However, pre-LLM NLP survives in: (1) BERT-family encoders for embeddings (BGE, E5), (2) sub-100ms classification at scale, (3) BM25/TF-IDF for first-stage retrieval in RAG. The field has consolidated around "one model, many tasks."<br><br>Source: prerequisites/nlp-fundamentals.md<br>Tags: Foundation, bert, genai-prerequisite, natural-language-processing, ner, nlp, prerequisites, sentiment, text-classification
What does backpropagation actually compute?	The gradient of the loss function with respect to every weight in the network, using the chain rule of calculus. These gradients tell us how to adjust each weight to reduce the error.<br><br>Source: prerequisites/neural-networks.md<br>Tags: Foundation, activation, backpropagation, cnn, genai-prerequisite, neural-networks, perceptron, prerequisites, rnn
Why do we need activation functions?	Without non-linear activation, any number of layers collapses to a single linear transformation (y = Wx + b). Non-linearity lets the network approximate any function, not just lines/planes.<br><br>Source: prerequisites/neural-networks.md<br>Tags: Foundation, activation, backpropagation, cnn, genai-prerequisite, neural-networks, perceptron, prerequisites, rnn
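A few lines of NumPy demonstrating the collapse: composing two linear layers is exactly one linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x + b1) + b2
# Same map as a single layer: W = W2 W1, b = W2 b1 + b2
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_layers, one_layer))  # True — no non-linearity, no real depth
```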
What changed in the OWASP LLM Top 10 from 2023 to 2025, and why does it matter?	The 2025 edition introduced two new categories — LLM07 (System Prompt Leakage) and LLM08 (Vector and Embedding Weaknesses) — while removing "Insecure Plugin Design" and "Model Theft" as standalone categories (both were absorbed into adjacent risks). The additions reflect real production patterns: most GenAI systems now use system prompts to define behavior (making prompt leakage a genuine attack surface) and RAG pipelines (making vector store poisoning a primary threat). The 2023 list was designed for isolated API-based LLM calls; the 2025 list assumes a more complex agentic architecture.<br><br>Source: ethics-and-safety/owasp-llm-top-10.md<br>Tags: E, ethics-and-safety, genai-security, llm-top-10, owasp, prompt-injection, risks, security, vector-security
Which OWASP 2025 categories matter most for an agentic RAG system?	Four categories are critical. LLM01 (Prompt Injection) — because agents execute tool calls based on model outputs, a single injected instruction can trigger real-world actions. LLM06 (Excessive Agency) — agents need tight permission scoping and irreversible-action gates. LLM07 (System Prompt Leakage) — agent system prompts typically contain tool schemas, personas, and routing logic that could be exploited. LLM08 (Vector and Embedding Weaknesses) — the RAG corpus is the most exploitable ingestion surface for an agentic system that retrieves before acting.<br><br>Source: ethics-and-safety/owasp-llm-top-10.md<br>Tags: E, ethics-and-safety, genai-security, llm-top-10, owasp, prompt-injection, risks, security, vector-security
How do you use the OWASP list in a real security review?	Use it as a structured checklist per system component, not per-category. For each component (LLM API, RAG pipeline, agent loop, output surface, data ingestion), identify which categories apply, then ask the specific design question for each: "Can user input reach the system prompt?" (LLM01/LLM07), "Does this component retrieve from an unvalidated source?" (LLM08), "Can model output execute or render unsafely?" (LLM05). Combine with threat modeling for domain-specific risks not covered by OWASP.<br><br>Source: ethics-and-safety/owasp-llm-top-10.md<br>Tags: E, ethics-and-safety, genai-security, llm-top-10, owasp, prompt-injection, risks, security, vector-security
What is temperature in LLM generation?	Temperature scales the logits before softmax. Low temperature (→0) makes the distribution sharper (near-deterministic picks); high temperature makes it flatter (→∞ approaches uniform sampling). Mathematically: P = softmax(logits / T).<br><br>Source: prerequisites/probability-and-statistics.md<br>Tags: Foundation, bayes, distributions, genai-prerequisite, loss-functions, prerequisites, probability, sampling, statistics
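The formula in code, with illustrative logits:

```python
import numpy as np

def softmax_with_temperature(logits, T):
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.1]
for T in (0.1, 1.0, 2.0):
    print(T, softmax_with_temperature(logits, T).round(3))
# T=0.1 -> nearly one-hot; T=2.0 -> much flatter
```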
What loss function do LLMs use and why?	Cross-entropy loss. It measures how different the model's predicted probability distribution is from the true distribution (where the correct next token has probability 1). Minimizing cross-entropy pushes the model to assign high probability to the correct token.<br><br>Source: prerequisites/probability-and-statistics.md<br>Tags: Foundation, bayes, distributions, genai-prerequisite, loss-functions, prerequisites, probability, sampling, statistics
What's the difference between zero-shot, few-shot, and chain-of-thought prompting?	Zero-shot: just instructions, no examples. Few-shot: include examples of desired input→output pairs. CoT: ask model to show reasoning steps. Each adds more guidance and typically improves quality.<br><br>Source: techniques/prompt-engineering.md<br>Tags: Foundation, chain-of-thought, few-shot, genai-techniques, prompt-engineering, prompting, techniques
How would you handle prompt injection in a production system?	Input sanitization, separating system and user prompts, output validation, and never embedding raw user input in the system prompt. Use the model's built-in system/user prompt separation. For critical apps, add a second LLM call to verify that the first output makes sense.<br><br>Source: techniques/prompt-engineering.md<br>Tags: Foundation, chain-of-thought, few-shot, genai-techniques, prompt-engineering, prompting, techniques
What is prompt injection and how would you defend against it?	Prompt injection is when user input overrides system instructions — like SQL injection but for LLMs. I'd defend with 4 layers: (1) Input scanning with regex + classifier to catch obvious attacks. (2) Prompt architecture — use clear delimiters (XML tags) to separate instructions from untrusted user data. (3) Output validation — check that responses don't leak system prompts or follow injected instructions. (4) Architectural controls — least-privilege tool access, human-in-the-loop for sensitive actions, and separate LLM instances for different trust levels. The critical insight is that no single defense is sufficient — defense-in-depth is the only viable strategy.<br><br>Source: ethics-and-safety/prompt-injection-deep-dive.md<br>Tags: E, defense, ethics-and-safety, jailbreak, production, prompt-injection, red-teaming, security
Why is Python dominant in AI if it is slower than C++?	Python gives fast iteration and a huge ecosystem, while the expensive numerical work runs underneath in optimized C, C++, CUDA, or vendor kernels. Python is the control layer, not the performance bottleneck.<br><br>Source: prerequisites/python-for-ai.md<br>Tags: Foundation, environment, genai-prerequisite, numpy, prerequisites, python, pytorch, transformers
What Python tools matter most for GenAI work?	NumPy for array thinking, PyTorch for tensors and training, Hugging Face libraries for models and tokenizers, plus environment management so your CUDA and package versions stay reproducible.<br><br>Source: prerequisites/python-for-ai.md<br>Tags: Foundation, environment, genai-prerequisite, numpy, prerequisites, python, pytorch, transformers
What is the most common beginner mistake when starting AI Python work?	Treating the environment as an afterthought. Many early failures come from incompatible package versions, wrong CUDA installs, or tensors ending up on different devices.<br><br>Source: prerequisites/python-for-ai.md<br>Tags: Foundation, environment, genai-prerequisite, numpy, prerequisites, python, pytorch, transformers
What is test-time compute scaling and why does it matter?	Instead of scaling model size (pre-training compute), you scale compute at inference — let the model "think longer" on harder problems. This is more efficient because you allocate compute per-problem (easy = cheap, hard = expensive) rather than baking it all into a massive model. o1/o3 showed this can match or exceed much larger standard models.<br><br>Source: llms/reasoning-models.md<br>Tags: D, chain-of-thought, deepseek-r1, genai, llms, o1, o3, reasoning, test-time-compute, thinking
How is DeepSeek-R1 trained?	Uses GRPO (Group Relative Policy Optimization). Generate multiple reasoning chains for a problem, rank them group-relatively, and reinforce better paths. Remarkably, reasoning behavior (self-correction, re-evaluation) emerged purely from RL without supervised reasoning data.<br><br>Source: llms/reasoning-models.md<br>Tags: D, chain-of-thought, deepseek-r1, genai, llms, o1, o3, reasoning, test-time-compute, thinking
When would you NOT use a reasoning model?	Simple tasks (chat, translation, summarization), latency-critical applications (real-time), cost-sensitive high-volume scenarios, and creative tasks where "thinking" adds no value. Reasoning models are for problems where correct step-by-step logic matters.<br><br>Source: llms/reasoning-models.md<br>Tags: D, chain-of-thought, deepseek-r1, genai, llms, o1, o3, reasoning, test-time-compute, thinking
What is GRPO and how is it different from PPO?	GRPO eliminates the value/critic network that PPO requires. Instead of estimating expected rewards, GRPO generates multiple responses per prompt and uses the group mean reward as a baseline. This cuts memory by ~50% and provides lower-variance advantage estimates. DeepSeek-R1 used GRPO to achieve state-of-the-art reasoning by rewarding correct final answers (RLVR) rather than training a separate reward model.<br><br>Source: techniques/rl-alignment.md<br>Tags: B, D, alignment, dpo, genai, grpo, ppo, preference-optimization, reward-model, rlhf, techniques
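The group-relative baseline in a few lines (no critic network; `eps` guards against zero variance):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize rewards within the group of samples
    generated for the same prompt — the group mean is the baseline."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# e.g. 4 sampled responses to one prompt, scored by a verifier (1 = correct)
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # [ 1. -1. -1.  1.]
```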
Compare RLHF and DPO.	RLHF trains a separate reward model on human preferences, then uses PPO to optimize the LLM against it — complex (4 models in memory), expensive, but powerful. DPO mathematically rearranges the RLHF objective into a direct classification loss on preference pairs — simpler (2 models), cheaper, more stable, and achieves comparable results on many tasks. DPO is the go-to for open-source model alignment.<br><br>Source: techniques/rl-alignment.md<br>Tags: B, D, alignment, dpo, genai, grpo, ppo, preference-optimization, reward-model, rlhf, techniques
What is RLVR?	Reinforcement Learning from Verifiable Rewards. Instead of using a learned reward model (which can be gamed), use objectively verifiable rewards: does the code pass tests? Does the math answer match? This is more robust for reasoning tasks and is what powers DeepSeek-R1's math capabilities.<br><br>Source: techniques/rl-alignment.md<br>Tags: B, D, alignment, dpo, genai, grpo, ppo, preference-optimization, reward-model, rlhf, techniques
How do you read an AI paper efficiently?	I start by extracting the core claim and evaluation setup, then inspect baselines, ablations, and limitations. I try to determine what is durable knowledge versus benchmark-specific optimization.<br><br>Source: research-frontiers/research-methodology-and-paper-reading.md<br>Tags: D, experiments, methodology, papers, reproducibility, research, research-frontiers
Why do ablations matter?	Because they test which parts of the method actually drive the gains. Without ablations, it is hard to know whether the headline method or some side choice caused the result.<br><br>Source: research-frontiers/research-methodology-and-paper-reading.md<br>Tags: D, experiments, methodology, papers, reproducibility, research, research-frontiers
How would you evaluate a RAG system?	I'd evaluate in two stages. Stage 1: Retrieval quality — I'd create a labeled dataset of 100+ queries with known relevant documents, then measure Precision@5, Recall@5, MRR, and nDCG. If retrieval metrics are poor (< 0.5 precision), I'd fix retrieval before touching the LLM. Stage 2: End-to-end — using RAGAS metrics (context relevance, faithfulness, answer correctness) to evaluate the full pipeline. I'd run this on every prompt/model change as a regression check. For production, I'd add online metrics: user feedback (thumbs up/down), citation click-through rate, and "I don't know" rate.<br><br>Source: evaluation/retrieval-evaluation.md<br>Tags: B, evaluation, metrics, mrr, ndcg, rag, retrieval, search
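Minimal implementations of the Stage-1 metrics (nDCG omitted for brevity; document IDs are illustrative):

```python
def precision_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & set(relevant)) / k

def recall_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(retrieved, relevant):
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d3", "d7", "d1", "d9", "d2"]
relevant = {"d1", "d2"}
print(precision_at_k(retrieved, relevant, 5))  # 0.4
print(recall_at_k(retrieved, relevant, 5))     # 1.0
print(mrr(retrieved, relevant))                # 0.333... (first hit at rank 3)
```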
How would you improve a RAG pipeline that's giving wrong answers?	Debug in order: (1) Check if correct chunks are retrieved (retrieval eval), (2) If not, fix chunking strategy or embedding model, (3) If chunks are good but answer is wrong, fix the prompt or use a better LLM. Also consider adding re-ranking.<br><br>Source: techniques/rag.md<br>Tags: Foundation, embeddings, genai-techniques, rag, retrieval, techniques, vector-db
When would you choose RAG over fine-tuning?	RAG when: need up-to-date info, knowledge changes frequently, need source attribution. Fine-tuning when: need different output style/format, domain-specific reasoning, or model behavior changes. Best: combine both (Hybrid RAG).<br><br>Source: techniques/rag.md<br>Tags: Foundation, embeddings, genai-techniques, rag, retrieval, techniques, vector-db
Explain the difference between semantic and keyword search in RAG.	Semantic (vector) search finds conceptually similar content even with different words ("car" matches "automobile"). Keyword (BM25) search finds exact term matches. Hybrid combines both — best overall because semantic misses exact terms and BM25 misses synonyms.<br><br>Source: techniques/rag.md<br>Tags: Foundation, embeddings, genai-techniques, rag, retrieval, techniques, vector-db
Explain the Chinchilla scaling laws.	For a fixed compute budget, there's an optimal ratio of model size to training data. Chinchilla showed the optimal is ~20 tokens per parameter. GPT-3 (175B params, 300B tokens) was massively undertrained — a 70B model on 1.4T tokens would match it. This led to LLaMA's approach: smaller models, much more data. In 2025-2026, industry "over-trains" beyond Chinchilla-optimal because inference cost (running the model) matters more than training cost (one-time).<br><br>Source: foundations/scaling-laws-and-pretraining.md<br>Tags: D, Foundation, chinchilla, compute, data-mix, foundations, genai, pre-training, scaling-laws, training
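The two rules of thumb in code (D ≈ 20N, plus the standard training-compute approximation C ≈ 6ND):

```python
def chinchilla_optimal_tokens(n_params):
    return 20 * n_params            # ~20 tokens per parameter

def training_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens  # standard C ~= 6*N*D approximation

N = 70e9
D = chinchilla_optimal_tokens(N)    # 1.4e12 tokens — the LLaMA recipe
print(f"optimal tokens: {D:.1e}, training FLOPs: {training_flops(N, D):.2e}")
```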
How is a large language model pre-trained?	(1) Collect trillions of tokens from the web, books, and code. (2) Clean and deduplicate aggressively. (3) Train a BPE tokenizer. (4) Set data mix ratios (web, code, books, math). (5) Train with next-token prediction on 10K-100K GPUs for 2-6 months using distributed parallelism (data, tensor, pipeline). (6) Monitor loss curves, handle spikes, checkpoint regularly. Cost: $10M-$500M+ per run.<br><br>Source: foundations/scaling-laws-and-pretraining.md<br>Tags: D, Foundation, chinchilla, compute, data-mix, foundations, genai, pre-training, scaling-laws, training
Why are state space models interesting as an alternative to transformers?	The core motivation is computational complexity. Transformer attention is O(n²) in sequence length, making million-token contexts extremely expensive. SSMs like Mamba achieve O(n) — linear time — by processing sequences through a recurrent state that's updated at each step. The breakthrough in Mamba was making the state transition input-dependent (selective), allowing the model to learn what to remember and what to forget. In practice, pure SSMs still trail transformers slightly on tasks requiring precise recall of specific tokens, so hybrid architectures (mixing Mamba layers with attention layers) are emerging as the practical direction.<br><br>Source: foundations/state-space-models.md<br>Tags: D, architecture, foundations, linear-attention, mamba, research, sequence-modeling, ssm
What's the difference between JSON Mode and Structured Outputs?	JSON Mode only guarantees the output is syntactically valid JSON — it could be any shape. Structured Outputs enforce a specific JSON Schema using constrained decoding, guaranteeing the output has exactly the right fields, types, and structure. In production, always use Structured Outputs because you need to parse the result programmatically.<br><br>Source: techniques/structured-outputs.md<br>Tags: A, constrained-decoding, function-calling, genai-technique, json-mode, pydantic, schema, structured-output, techniques
How does constrained decoding work under the hood?	The JSON Schema is converted into a finite state machine. At each token generation step, the FSM determines which tokens are legal given the current state. All illegal tokens are masked to zero probability. The model samples only from valid tokens. This means schema violations are mathematically impossible — it's not retry-based, it's enforced during generation.<br><br>Source: techniques/structured-outputs.md<br>Tags: A, constrained-decoding, function-calling, genai-technique, json-mode, pydantic, schema, structured-output, techniques
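A toy illustration, assuming a hand-written FSM for the grammar `{"ok": true|false}` over a 7-token vocabulary (real implementations compile a JSON Schema into the FSM automatically):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ['{', '}', '"', ':', 'true', 'false', 'ok']  # toy token set

# Legal next tokens per FSM state; 7 is the accepting state.
TRANSITIONS = {
    0: {'{': 1}, 1: {'"': 2}, 2: {'ok': 3}, 3: {'"': 4},
    4: {':': 5}, 5: {'true': 6, 'false': 6}, 6: {'}': 7},
}

def constrained_sample(state, logits):
    legal = TRANSITIONS[state]
    mask = np.array([tok in legal for tok in VOCAB])
    probs = np.where(mask, np.exp(logits), 0.0)  # illegal tokens -> probability 0
    probs /= probs.sum()
    tok = VOCAB[rng.choice(len(VOCAB), p=probs)]
    return tok, legal[tok]

state, out = 0, []
while state != 7:
    tok, state = constrained_sample(state, rng.normal(size=len(VOCAB)))
    out.append(tok)
print(''.join(out))  # always a valid instance, e.g. {"ok":false}
```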
Does structured output guarantee correct answers?	No — it guarantees correct **structure**, not correct **content**. A model can output `{"sentiment": "positive"}` for a clearly negative review. Structured output is a formatting guarantee, not a factuality guarantee. You still need semantic validation, ground-truth checks, and domain-specific validators.<br><br>Source: techniques/structured-outputs.md<br>Tags: A, constrained-decoding, function-calling, genai-technique, json-mode, pydantic, schema, structured-output, techniques
How would you train a domain-specific LLM?	(1) Collect domain documents, (2) Generate synthetic instruction-response pairs using a teacher model, (3) Quality-filter using domain experts + LLM-as-judge, (4) Format in ShareGPT/ChatML, (5) Fine-tune with LoRA/QLoRA, (6) Evaluate against domain-specific benchmarks.<br><br>Source: techniques/synthetic-data-and-data-engineering.md<br>Tags: B, data-curation, data-engineering, genai, self-instruct, synthetic-data, techniques, training-data
What's the risk of training on synthetic data?	Model collapse — progressive quality degradation across generations. Also bias amplification (synthetic data inherits teacher's biases) and benchmark contamination. Mitigate by mixing with real data, strong quality filtering, and using diverse teacher models.<br><br>Source: techniques/synthetic-data-and-data-engineering.md<br>Tags: B, data-curation, data-engineering, genai, self-instruct, synthetic-data, techniques, training-data
What is the most common mistake in AI system-design interviews?	Skipping clarification and jumping straight into tools. Good answers start with requirements, success metrics, and failure tolerance before architecture.<br><br>Source: evaluation/system-design-for-ai-interviews.md<br>Tags: ai-architecture, evaluation, interviews, mlops, system-design
What should you always mention in a GenAI system design?	Evaluation, observability, safety boundaries, and cost. Those are the recurring points that separate prototypes from real systems.<br><br>Source: evaluation/system-design-for-ai-interviews.md<br>Tags: ai-architecture, evaluation, interviews, mlops, system-design
Why do LLMs use sub-word tokenization instead of word-level?	Word-level requires an impossibly large vocabulary (every word in every language) and can't handle misspellings, new words, or code. Sub-word splits rare words into common pieces ("unhappiness" → ["un", "happiness"]) while keeping frequent words whole. Fixed vocab size (~32K-128K), handles any input.<br><br>Source: foundations/tokenization.md<br>Tags: Foundation, bpe, foundations, genai-foundations, llm-internals, sentencepiece, tokenization, wordpiece
Why is tokenization a source of bias?	Languages with less representation in training data get worse tokenization — more tokens per word. This means non-English users spend more money, get slower responses, and use more of their context window for the same content. Larger vocabularies (LLaMA 3's 128K vs LLaMA 2's 32K) help mitigate this.<br><br>Source: foundations/tokenization.md<br>Tags: Foundation, bpe, foundations, genai-foundations, llm-internals, sentencepiece, tokenization, wordpiece
Why do Transformers use scaled dot-product attention (divide by √d_k)?	Without scaling, dot products grow large with high dimensions, pushing softmax into regions with tiny gradients. Dividing by √d_k keeps gradients healthy.<br><br>Source: foundations/transformers.md<br>Tags: Foundation, architecture, attention, deep-learning, foundations, genai-foundations, transformers
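Scaled dot-product attention in a few lines of NumPy; shapes are illustrative (5 tokens, d_k = 64):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # scaling keeps softmax out of saturation
    return softmax(scores) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 64)) for _ in range(3))
print(attention(Q, K, V).shape)  # (5, 64)
```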
What's the computational complexity of self-attention?	O(n²·d) where n is sequence length and d is dimension. This quadratic scaling with n is the main bottleneck for long sequences.<br><br>Source: foundations/transformers.md<br>Tags: Foundation, architecture, attention, deep-learning, foundations, genai-foundations, transformers
Why decoder-only for generation instead of encoder-decoder?	Simpler architecture, easier to scale, and with enough data the decoder learns to "encode" implicitly. Also, causal masking naturally fits left-to-right generation.<br><br>Source: foundations/transformers.md<br>Tags: Foundation, architecture, attention, deep-learning, foundations, genai-foundations, transformers
How does approximate nearest neighbor search work?	ANN algorithms like HNSW build a layered graph where similar vectors are connected. Search enters at the sparse top layer and greedily navigates toward the query vector, descending through progressively denser layers. It's O(log n) vs O(n) for brute force, with ~95-99% recall.<br><br>Source: tools-and-infra/vector-databases.md<br>Tags: A, B, chroma, embeddings, genai-infra, pinecone, qdrant, similarity-search, tools-and-infra, vector-db
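For contrast, the exact O(n) brute-force search that HNSW approximates (corpus size and dimension are illustrative):

```python
import numpy as np

def exact_top_k(query, vectors, k=5):
    """O(n) brute-force cosine search — the exact baseline HNSW approximates
    in roughly O(log n) with ~95-99% recall."""
    q = query / np.linalg.norm(query)
    X = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = X @ q
    idx = np.argpartition(-sims, k)[:k]       # top-k candidates, unordered
    return idx[np.argsort(-sims[idx])]        # sorted by similarity, descending

rng = np.random.default_rng(0)
db = rng.normal(size=(10_000, 384))  # e.g. sentence-embedding dimension
print(exact_top_k(rng.normal(size=384), db))
```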
How would you choose between Pinecone and self-hosting Qdrant?	Pinecone: zero ops, serverless pricing, fast start. Qdrant self-host: lower cost at scale, data stays on your infra, more control over indexing. Decision factors: team size, data sensitivity, query volume, and operational expertise.<br><br>Source: tools-and-infra/vector-databases.md<br>Tags: A, B, chroma, embeddings, genai-infra, pinecone, qdrant, similarity-search, tools-and-infra, vector-db
How would you build a real-time voice AI agent?	Option 1 (simplest): OpenAI Realtime API — WebSocket-based, speech-to-speech, handles turn-taking and interruptions natively. Option 2 (customizable): Pipeline of Deepgram STT → LLM (with function calling for tools) → ElevenLabs TTS, with a VAD layer for turn management. Option 3 (Google ecosystem): ADK + Gemini Live for multi-agent voice systems. Key challenges: latency optimization, interruption handling, and graceful error recovery.<br><br>Source: applications/voice-ai.md<br>Tags: A, applications, genai, realtime-api, speech-to-text, stt, text-to-speech, tts, voice-ai, whisper
