What makes an AI agent different from a chatbot?	A chatbot responds to messages. An agent sets goals, plans multi-step approaches, uses tools, observes results, and iterates. Agents are autonomous; chatbots are reactive.<br><br>Source: agents/ai-agents.md<br>Tags: Foundation, agentic-ai, agents, autonomy, function-calling, genai-techniques, tool-use
How would you prevent an AI agent from getting stuck in a loop?	Max iteration limits, self-reflection prompts ("Am I making progress?"), fallback to human, diverse retry strategies (try different tools/approaches), and logging for debugging.<br><br>Source: agents/ai-agents.md<br>Tags: Foundation, agentic-ai, agents, autonomy, function-calling, genai-techniques, tool-use
What's the ReAct pattern?	Reason + Act. The agent alternates between thinking (reasoning about what to do) and acting (calling tools). After each action, it observes the result and reasons about next steps. This interleaving of thought and action is more reliable than planning everything upfront.<br><br>Source: agents/ai-agents.md<br>Tags: Foundation, agentic-ai, agents, autonomy, function-calling, genai-techniques, tool-use
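A minimal sketch of that loop, assuming a hypothetical `call_llm` that returns a (thought, action, argument) tuple — real agents get this from the model's tool-call output; the stub below exists only so the control flow runs:

```python
# Minimal ReAct-style loop. `call_llm` and the tool registry are stand-ins.
def call_llm(trajectory: str):
    # Placeholder "policy": search once, then finish.
    if "Observation:" in trajectory:
        return ("I have what I need", "finish", "final answer based on the observation")
    return ("I should look this up", "search", "user question")

TOOLS = {"search": lambda q: f"top results for {q!r}"}  # toy tool

def react_agent(task: str, max_steps: int = 5) -> str:
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        thought, action, arg = call_llm("\n".join(history))              # Reason
        history.append(f"Thought: {thought}")
        if action == "finish":
            return arg
        observation = TOOLS[action](arg)                                 # Act
        history.append(f"Action: {action}({arg})\nObservation: {observation}")  # Observe
    return "Stopped: max iterations reached"

print(react_agent("What is the capital of France?"))
```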
How does agent memory work?	Four types: short-term (conversation context), long-term (vector DB storing facts/preferences across sessions), episodic (summaries of past task executions for learning), and procedural (learned strategies and tool patterns). In practice, most production agents use short-term + simple long-term memory with vector retrieval.<br><br>Source: agents/ai-agents.md<br>Tags: Foundation, agentic-ai, agents, autonomy, function-calling, genai-techniques, tool-use
How do modern coding agents handle large codebases that don't fit in context?	Three techniques. (1) Repository indexing — parse ASTs and dependency graphs to understand code structure without reading every file. (2) Progressive context loading — only pull in files relevant to the current step, not the entire repo. (3) Context compaction — periodically summarize the conversation history to free up tokens. The best agents combine all three: index the repo upfront, retrieve relevant files via codebase RAG, and compact history when approaching the context limit.<br><br>Source: applications/ai-coding-agents.md<br>Tags: A, agentic-ai, ai-product, applications, codex, coding-agents, cursor, developer-tools, devin, windsurf
What's the most common failure mode of coding agents and how do you mitigate it?	Infinite edit loops — the agent encounters an error, makes a change that doesn't fix it, sees the same error, and repeats. Mitigation: (1) Track state diffs between iterations — if the agent's edit doesn't change the test output, intervene. (2) Set hard max iteration limits (typically 10-20 steps). (3) Have the agent explicitly explain its hypothesis before each edit so you can catch circular reasoning.<br><br>Source: applications/ai-coding-agents.md<br>Tags: A, agentic-ai, ai-product, applications, codex, coding-agents, cursor, developer-tools, devin, windsurf
When would you choose a cloud sandbox agent vs an IDE-integrated agent?	Cloud sandbox (like Devin) for tasks that are well-defined, can run unattended, and benefit from isolation — ticket-based bug fixes, migrations, boilerplate generation. IDE-integrated (like Cursor) for tasks requiring rapid human feedback — feature development, debugging, and any work where you need to steer the agent in real-time. The tradeoff is autonomy vs control.<br><br>Source: applications/ai-coding-agents.md<br>Tags: A, agentic-ai, ai-product, applications, codex, coding-agents, cursor, developer-tools, devin, windsurf
How do you decide whether an AI feature is worth building?	I start with the user workflow and measurable outcome, then test whether AI materially improves that workflow at an acceptable quality, trust, and cost level. If it does not, I narrow the scope or avoid the feature.<br><br>Source: applications/ai-product-management-fundamentals.md<br>Tags: A, ai-product-management, applications, evaluation, product, strategy
What is the most important metric for an AI product?	There is rarely one metric. I want a small stack that includes task success, user trust or escalation, latency, and cost per successful task.<br><br>Source: applications/ai-product-management-fundamentals.md<br>Tags: A, ai-product-management, applications, evaluation, product, strategy
When would you choose RAG over fine-tuning?	When the knowledge changes often, needs citations, or comes from private documents. Fine-tuning is better when the behavior itself must change consistently.<br><br>Source: production/ai-system-design.md<br>Tags: A, ai-architecture, genai, llmops, production, system-design
What are the minimum production components for a GenAI assistant?	Auth, prompt assembly, model invocation, safety checks, observability, and evaluation. If the task depends on facts outside the model, add retrieval.<br><br>Source: production/ai-system-design.md<br>Tags: A, ai-architecture, genai, llmops, production, system-design
What should an AI API return besides the answer?	Usually a request id, status or finish reason, and optionally citations or usage metadata depending on the product. Those fields make debugging, billing, and trust much easier.<br><br>Source: applications/api-design-for-ai.md<br>Tags: A, ai-architecture, api, applications, async, rest, streaming, webhooks
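A sketch of what such a response envelope might look like; the field names below are illustrative assumptions, not a standard:

```python
# Illustrative response envelope for an AI endpoint (field names are assumptions).
from pydantic import BaseModel, Field
from typing import Optional

class Usage(BaseModel):
    prompt_tokens: int
    completion_tokens: int

class Citation(BaseModel):
    source_id: str
    snippet: str

class AnswerResponse(BaseModel):
    request_id: str                                   # for debugging and support tickets
    finish_reason: str                                # e.g. "stop", "length", "content_filter"
    answer: str
    citations: list[Citation] = Field(default_factory=list)  # trust / attribution
    usage: Optional[Usage] = None                     # enables per-request cost accounting
```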
When would you choose an async job API?	When the workflow is too long or variable for an interactive request, such as large document pipelines, multi-step agent tasks, or offline generation jobs.<br><br>Source: applications/api-design-for-ai.md<br>Tags: A, ai-architecture, api, applications, async, rest, streaming, webhooks
What's the difference between MCP and A2A?	MCP connects an agent to TOOLS (databases, APIs, filesystems) — it's agent-to-tool communication. A2A connects an agent to OTHER AGENTS — it's agent-to-agent collaboration. MCP is like a USB port (connect devices), A2A is like a network protocol (connect computers). They're complementary: an agent uses MCP to access its own tools and A2A to delegate tasks to other agents.<br><br>Source: agents/agentic-protocols.md<br>Tags: A, B, a2a, adk, agent-protocols, agentic-infra, agents, autogen, crewai, genai, langraph, mcp
Why do we need MCP if we already have function calling?	Function calling is model-specific (OpenAI's API, Anthropic's API). MCP is a universal standard — build one MCP server and it works with Claude, GPT, Gemini, Cursor, and any MCP client. It also adds discovery (list available tools), resources (data access), and security (OAuth). It's the difference between every device having a custom charger vs. everyone using USB-C.<br><br>Source: agents/agentic-protocols.md<br>Tags: A, B, a2a, adk, agent-protocols, agentic-infra, agents, autogen, crewai, genai, langraph, mcp
Why divide by √d_k in attention?	Without it, for large d_k, dot products become huge → softmax saturates → near-zero gradients. Scaling keeps variance at ~1.<br><br>Source: foundations/attention-mechanism.md<br>Tags: Foundation, attention, foundations, genai-foundations, multi-head, self-attention, transformers
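A quick NumPy check of the variance argument (d_k and the random vectors are arbitrary):

```python
# Why dot products need the 1/sqrt(d_k) scaling: variance grows with d_k.
import numpy as np

d_k = 512
rng = np.random.default_rng(0)
q = rng.standard_normal((1000, d_k))
k = rng.standard_normal((1000, d_k))

raw = (q * k).sum(axis=1)        # unscaled dot products
scaled = raw / np.sqrt(d_k)      # scaled dot products

print(raw.std())     # ~sqrt(512) ≈ 22.6 → softmax saturates, gradients vanish
print(scaled.std())  # ~1 → softmax stays in a well-behaved range
```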
What's the difference between MHA, MQA, and GQA?	MHA: separate K,V per head (most expressive, slowest). MQA: one shared K,V (fastest, some quality loss). GQA: groups of heads share K,V (good balance). LLaMA 2+ uses GQA.<br><br>Source: foundations/attention-mechanism.md<br>Tags: Foundation, attention, foundations, genai-foundations, multi-head, self-attention, transformers
How does Flash Attention improve efficiency without changing the math?	It tiles the computation to fit in SRAM (fast cache), avoiding materialization of the full n×n attention matrix in slow HBM (GPU memory). Same result, ~2-4x faster.<br><br>Source: foundations/attention-mechanism.md<br>Tags: Foundation, attention, foundations, genai-foundations, multi-head, self-attention, transformers
How do modern AI coding agents work?	They follow a plan-act-observe loop: (1) understand the task by reading the codebase context, (2) plan which files to change, (3) implement changes across multiple files, (4) run tests and linters to verify, (5) iterate on failures, (6) present a diff for human review. Tools like Antigravity and Cursor provide IDE integration, while Gemini CLI and Claude Code work from the terminal. The key differentiator in 2026 is MCP support — agents can connect to databases, APIs, and external tools.<br><br>Source: applications/code-generation.md<br>Tags: A, antigravity, applications, claude-code, code-generation, coding-agents, copilot, cursor, devin, gemini-cli, genai, windsurf
Compare Copilot, Cursor, and Antigravity.	Copilot is a platform (extension + agent) best for GitHub-native workflows — evolved from autocomplete to multi-agent orchestration. Cursor is a VS Code fork with AI deeply integrated (Composer for multi-file edits, Supermaven for autocomplete) — best for developers who want AI-enhanced traditional editing. Antigravity is agent-first — designed around delegating to autonomous agents with a Manager View for orchestrating multiple agents simultaneously — best for developers who want to direct rather than write code.<br><br>Source: applications/code-generation.md<br>Tags: A, antigravity, applications, claude-code, code-generation, coding-agents, copilot, cursor, devin, gemini-cli, genai, windsurf
When would you use RAG vs just a long context window?	Long context when: few documents, need cross-references, latency isn't critical, and you can afford the token cost. RAG when: many documents (more than context window), need real-time data, cost-sensitive, or need to scale to millions of docs. In practice, combine both: cache stable reference docs in context, use RAG for dynamic query-specific retrieval.<br><br>Source: techniques/context-engineering.md<br>Tags: Foundation, context-caching, context-window, genai, long-context, prompt-caching, rag-vs-context, techniques
What is context engineering?	Context engineering is the practice of strategically constructing the full input to an LLM — system prompt, cached reference docs, RAG results, conversation history, and examples — to maximize output quality within the token budget. It's becoming more important than prompt engineering because the quality bottleneck is often WHAT information the model has access to, not HOW you phrase the question.<br><br>Source: techniques/context-engineering.md<br>Tags: Foundation, context-caching, context-window, genai, long-context, prompt-caching, rag-vs-context, techniques
How is conversational AI different from a basic chatbot?	A basic chatbot generates locally plausible replies — it answers the current message without tracking state. A conversational AI system manages dialogue state across turns (tracking intent, confirmed slots, pending questions), handles ambiguity through clarification, recovers from misunderstandings, uses tools to take real actions, and knows when to escalate to a human. The key difference is that a conversational system has explicit state management (what has been said, what's confirmed, what's pending) rather than relying purely on the LLM's context window to "remember" everything.<br><br>Source: applications/conversational-ai.md<br>Tags: A, agents, applications, chatbots, conversational-ai, dialogue, state, voice
Design a customer support chatbot for an e-commerce company.	I'd start by defining the scope: order status, returns/refunds, product questions, and escalation to human agents. The architecture would be a LangGraph-based conversation flow with: (1) an intent classifier node that routes to specialized sub-flows, (2) structured state tracking order IDs, customer info, and issue type, (3) tool integrations for order lookup, return initiation, and ticket creation, (4) a summarization memory layer for conversations > 10 turns, (5) guardrails for PII handling and policy compliance. For latency, I'd target TTFT < 500ms with streaming. For evaluation, I'd track task completion rate, turns-to-resolution, escalation rate, and CSAT scores. The critical design decision is the escalation policy — I'd implement confidence-based routing where the bot hands off proactively when confidence drops below 0.7, rather than waiting for the user to ask for a human.<br><br>Source: applications/conversational-ai.md<br>Tags: A, agents, applications, chatbots, conversational-ai, dialogue, state, voice
What should a conversational system remember and forget?	This is a product decision, not a technical one. Remember: user's stated goal, confirmed facts (slots), tool results, and explicit preferences. Forget: rejected alternatives, small talk, verbose explanations, and intermediate reasoning steps. The implementation I'd use is a hybrid: structured state for confirmed facts (a Pydantic model with intent, slots, phase), periodic summarization for conversation flow, and the last 4-6 turns verbatim for immediate context. Critical rule: never "remember" something that was said in a summary that wasn't in the original messages — that's how summary drift causes hallucinated memories.<br><br>Source: applications/conversational-ai.md<br>Tags: A, agents, applications, chatbots, conversational-ai, dialogue, state, voice
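A minimal sketch of the structured-state idea described above, with illustrative field names rather than any particular framework's schema:

```python
# Hybrid dialogue state: typed slots + summary + recent turns (names are illustrative).
from pydantic import BaseModel, Field
from typing import Literal

class DialogueState(BaseModel):
    intent: str | None = None                                   # e.g. "initiate_return"
    slots: dict[str, str] = Field(default_factory=dict)         # confirmed facts only
    phase: Literal["collecting", "confirming", "executing", "done"] = "collecting"
    summary: str = ""                                           # periodic summary of older turns
    recent_turns: list[str] = Field(default_factory=list)       # last 4-6 turns verbatim

state = DialogueState(intent="initiate_return")
state.slots["order_id"] = "A123"   # only write facts the user has confirmed
state.phase = "confirming"
```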
What optimizer do you use for training Transformers and why?	AdamW. It's Adam with decoupled weight decay, which provides better regularization for Transformers. Adam adapts the learning rate per-parameter using running estimates of gradient mean and variance.<br><br>Source: prerequisites/deep-learning-fundamentals.md<br>Tags: Foundation, deep-learning, genai-prerequisite, gpu, optimizer, prerequisites, regularization, training
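A typical AdamW setup in PyTorch; the hyperparameters below are illustrative, not prescriptive:

```python
# AdamW: Adam's per-parameter adaptive learning rates + decoupled weight decay.
import torch

model = torch.nn.Linear(768, 768)   # stand-in for a Transformer block
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.95),      # running estimates of gradient mean / variance
    weight_decay=0.1,       # decoupled weight decay (the "W" in AdamW)
)

x = torch.randn(8, 768)
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```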
How would you handle GPU memory limitations when training?	(1) Reduce batch size + gradient accumulation, (2) Mixed precision (BF16), (3) Gradient checkpointing, (4) LoRA/QLoRA (train small adapters not full model), (5) DeepSpeed ZeRO / FSDP (distribute across GPUs).<br><br>Source: prerequisites/deep-learning-fundamentals.md<br>Tags: Foundation, deep-learning, genai-prerequisite, gpu, optimizer, prerequisites, regularization, training
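A short sketch combining items (1) and (2) — gradient accumulation plus BF16 autocast — with a toy model and made-up batch sizes:

```python
# Gradient accumulation + BF16 mixed precision sketch (toy model, toy data).
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

accum_steps = 8    # effective batch = micro-batch size × 8
dataloader = [torch.randn(4, 1024, device=device) for _ in range(16)]  # stand-in batches

for step, batch in enumerate(dataloader):
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        loss = model(batch).pow(2).mean() / accum_steps   # scale so accumulated grads average
    loss.backward()                                       # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```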
What are embeddings and why do they matter for GenAI?	Embeddings map data to dense vectors where semantic similarity becomes geometric distance. They're the foundation of RAG (find relevant documents), semantic search (find by meaning), and even the first layer of every LLM. Without embeddings, modern AI can't represent or compare meaning.<br><br>Source: foundations/embeddings.md<br>Tags: Foundation, embeddings, foundations, genai-foundations, representation, similarity, vectors
What's the difference between word embeddings and sentence embeddings?	Word embeddings (Word2Vec, GloVe) encode individual words — "bank" always gets the same vector. Sentence embeddings (SBERT, text-embedding-3) encode entire sentences with context — "river bank" and "bank robbery" get very different vectors. Modern systems use sentence/paragraph embeddings.<br><br>Source: foundations/embeddings.md<br>Tags: Foundation, embeddings, foundations, genai-foundations, representation, similarity, vectors
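A hedged example using the sentence-transformers library (the model name is just one common choice) to show context-sensitive similarity:

```python
# Context-sensitive similarity: "river bank" should sit closer to "shoreline"
# than to "bank robbery" in embedding space.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")
texts = ["river bank erosion", "bank robbery suspect", "flooded shoreline"]
emb = model.encode(texts, normalize_embeddings=True)  # unit vectors → dot = cosine

print(float(np.dot(emb[0], emb[2])))  # "river bank" vs "shoreline": expected higher
print(float(np.dot(emb[0], emb[1])))  # "river bank" vs "bank robbery": expected lower
```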
When would you fine-tune vs use RAG?	Fine-tune for: output format changes, domain-specific reasoning/style, consistent behavior. RAG for: up-to-date knowledge, source attribution, private data access. Best practice in 2026: **combine both** — LoRA for behavior, RAG for facts.<br><br>Source: techniques/fine-tuning.md<br>Tags: Foundation, fine-tuning, genai-techniques, lora, peft, qlora, techniques, training
Explain how LoRA reduces memory requirements.	Instead of updating the full d×d weight matrix, LoRA freezes it and learns a low-rank update ΔW = BA via two small matrices of rank r (d×r and r×d). With r=16 on a 4096-dim model, you train 0.78% of the parameters. QLoRA goes further by quantizing the frozen base model to 4-bit, reducing memory from ~280GB to ~35GB for a 70B model.<br><br>Source: techniques/fine-tuning.md<br>Tags: Foundation, fine-tuning, genai-techniques, lora, peft, qlora, techniques, training
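A quick check of the quoted fraction:

```python
# LoRA trainable-parameter fraction for one d×d weight at rank r.
d, r = 4096, 16
full = d * d              # 16,777,216 params in the frozen d×d weight
lora = d * r + r * d      # 131,072 trainable params (A: d×r, B: r×d)
print(f"{lora / full:.2%}")   # → 0.78%
```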
How does function calling work in LLMs?	You define tools with names, descriptions, and parameter schemas. The LLM receives the user message + tool definitions, decides if a tool should be called, and generates a JSON object with the function name and arguments. YOUR code executes the function and feeds the result back to the LLM for final response generation. The LLM never actually runs the function.<br><br>Source: techniques/function-calling-and-structured-output.md<br>Tags: Foundation, function-calling, genai, grounding, json-mode, mcp, structured-output, techniques, tool-use
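A generic round trip, using the common OpenAI-style tool schema shape (adapt to your provider); the weather tool and the hard-coded tool call are made up:

```python
# Function-calling round trip: define schema → model picks tool → your code runs it.
import json

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def get_weather(city: str) -> str:
    return json.dumps({"city": city, "temp_c": 21})   # YOUR code executes the tool

# 1. Send user message + `tools` to the model.
# 2. Model replies with a tool call, e.g. name + JSON arguments (simulated here).
tool_call = {"name": "get_weather", "arguments": '{"city": "Paris"}'}
# 3. Execute it and send the result back to the model for the final answer.
result = get_weather(**json.loads(tool_call["arguments"]))
print(result)
```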
What is MCP and why does it matter?	Model Context Protocol is an open standard for connecting LLMs to external tools. Before MCP, every tool needed custom integration for each model. MCP provides a universal interface — any MCP-compatible tool works with any MCP-compatible client. It's becoming the "USB standard" for AI tool integration.<br><br>Source: techniques/function-calling-and-structured-output.md<br>Tags: Foundation, function-calling, genai, grounding, json-mode, mcp, structured-output, techniques, tool-use
What is Graph RAG and when would you use it over standard RAG?	Graph RAG combines knowledge graphs with RAG. Standard RAG retrieves text chunks by similarity — great for "what does X mean?" but fails at "how are X and Y connected?" or "summarize all instances of Z." Graph RAG extracts entities and relationships into a knowledge graph, enabling multi-hop reasoning and aggregation. Use it when: data is entity-heavy (people, organizations, events), questions require relationship traversal, or you need thematic summary across large document sets.<br><br>Source: techniques/graph-rag.md<br>Tags: A, B, agentic-rag, genai, graph-rag, graphrag, knowledge-graph, multi-hop-reasoning, techniques
What is Agentic RAG?	Agentic RAG gives retrieval an autonomous agent that can dynamically choose retrieval strategies (vector search, graph query, SQL, web search), self-correct when results are poor, decompose complex questions into sub-queries, and verify answers before returning. It transforms RAG from a fixed pipeline into an adaptive reasoning loop. This is the emerging standard for enterprise AI in 2026.<br><br>Source: techniques/graph-rag.md<br>Tags: A, B, agentic-rag, genai, graph-rag, graphrag, knowledge-graph, multi-hop-reasoning, techniques
How would you evaluate a RAG system?	Component-level: Retrieval quality (context precision + recall) — are the right chunks found? Generation quality (faithfulness + answer relevancy) — is the answer grounded and on-topic? Use RAGAS for automated metrics, plus a golden test set of 50+ question-answer pairs with human-verified ground truth.<br><br>Source: evaluation/evaluation-and-benchmarks.md<br>Tags: E, Foundation, benchmarks, evaluation, genai, humaneval, mmlu, ragas, testing
Why are traditional benchmarks becoming less useful?	Saturation (top models all score >90%), contamination (benchmark data in training sets), and gap between benchmark performance and real-world utility. The field is moving to dynamic benchmarks (LiveBench), harder tests (SWE-bench, ARC-AGI-2), and domain-specific evaluation.<br><br>Source: evaluation/evaluation-and-benchmarks.md<br>Tags: E, Foundation, benchmarks, evaluation, genai, humaneval, mmlu, ragas, testing
How would you take an LLM prototype to production?	(1) Create an eval suite (50+ golden examples), (2) Add input/output guardrails, (3) Implement observability (Langfuse/LangSmith), (4) Set up cost alerting, (5) Abstract the LLM provider behind a gateway for fallbacks, (6) CI/CD pipeline that runs eval suite on every prompt/code change, (7) Canary deployment with quality monitoring.<br><br>Source: production/llmops.md<br>Tags: A, B, ci-cd, deployment, genai, llmops, monitoring, observability, production
How do you handle LLM quality degradation in production?	Continuous monitoring via automated evals, user feedback (👍/👎), drift detection. When quality drops: check if the provider updated the model, run regression analysis against golden set, roll back prompts if needed, or switch to a backup model.<br><br>Source: production/llmops.md<br>Tags: A, B, ci-cd, deployment, genai, llmops, monitoring, observability, production
Explain the training pipeline of a modern LLM.	Pre-training (next-token prediction on internet text) → SFT (supervised fine-tuning on instruction-response pairs) → RLHF/DPO (learning from human preference comparisons) → Safety alignment<br><br>Source: llms/llms-overview.md<br>Tags: Foundation, claude, gemini, genai, gpt, language-models, llama, llm, llms
What's the difference between dense and MoE architectures?	Dense: every parameter processes every token (e.g., GPT-3, LLaMA 3). MoE: tokens are routed to a subset of "expert" sub-networks (e.g., Mixtral, LLaMA 4 Maverick). MoE gives more total capacity with less compute per token.<br><br>Source: llms/llms-overview.md<br>Tags: Foundation, claude, gemini, genai, gpt, language-models, llama, llm, llms
How do you choose between using an API (GPT/Claude) vs hosting an open model (LLaMA)?	API: faster to start, best performance, no infra. Self-host: data stays private, no vendor lock-in, customizable. Cost crossover: at ~1M+ tokens/day, self-hosting often becomes cheaper.<br><br>Source: llms/llms-overview.md<br>Tags: Foundation, claude, gemini, genai, gpt, language-models, llama, llm, llms
Why is the dot product central to the attention mechanism?	Attention computes Q·Kᵀ where Q = query and K = key. The dot product measures how "related" each query is to each key — high dot product = high attention. This is then softmaxed into weights that determine how much each value V contributes to the output.<br><br>Source: prerequisites/linear-algebra-for-ai.md<br>Tags: Foundation, dot-product, genai-prerequisite, linear-algebra, matrices, prerequisites, tensors, vectors
Why are GPUs used for deep learning?	Neural networks are fundamentally matrix multiplications. GPUs have thousands of cores designed for parallel math operations. A CPU might do matrix multiply sequentially; a GPU does thousands of multiply-adds simultaneously.<br><br>Source: prerequisites/linear-algebra-for-ai.md<br>Tags: Foundation, dot-product, genai-prerequisite, linear-algebra, matrices, prerequisites, tensors, vectors
What is Mixture of Experts and why does LLaMA 4 use it?	MoE has multiple "expert" FFN sub-networks per layer with a learned router. For each token, only top-K experts (e.g., 2 of 16) are activated. This gives the model capacity of the total parameters but computational cost of only the active experts. LLaMA 4 uses it to achieve 400B total params with only 17B active — massive capacity at manageable cost.<br><br>Source: foundations/modern-architectures.md<br>Tags: Foundation, architecture, flash-attention, foundations, genai, gqa, mixture-of-experts, moe, rope
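A toy illustration of top-K routing with made-up dimensions:

```python
# Top-K expert routing: a learned router scores experts per token; only the
# top-K experts' FFNs run for that token (dimensions here are illustrative).
import numpy as np

n_experts, top_k, d = 16, 2, 8
rng = np.random.default_rng(0)
router_w = rng.standard_normal((d, n_experts))   # learned router weights
token = rng.standard_normal(d)

scores = token @ router_w
chosen = np.argsort(scores)[-top_k:]             # indices of the top-K experts
weights = np.exp(scores[chosen]) / np.exp(scores[chosen]).sum()
print(chosen, weights)   # only these 2 of 16 experts process this token
```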
What is GQA and how does it save memory?	Grouped-Query Attention shares K and V heads across groups of Q heads. With 64 Q heads and 8 KV heads, the KV cache is 8x smaller than full MHA. This is critical for serving long-context models — KV cache can otherwise consume more memory than the model weights.<br><br>Source: foundations/modern-architectures.md<br>Tags: Foundation, architecture, flash-attention, foundations, genai, gqa, mixture-of-experts, moe, rope
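A back-of-envelope calculation of the effect (all figures are illustrative, not a specific model):

```python
# KV-cache size ≈ layers × 2 (K and V) × kv_heads × head_dim × seq_len × bytes.
layers, head_dim, seq_len, bytes_fp16 = 80, 128, 32_768, 2

def kv_cache_gb(kv_heads: int) -> float:
    return layers * 2 * kv_heads * head_dim * seq_len * bytes_fp16 / 1e9

print(kv_cache_gb(64))  # full MHA (64 KV heads): ~86 GB per sequence
print(kv_cache_gb(8))   # GQA with 8 KV heads: ~11 GB — 8x smaller
```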
When would you choose multi-agent over single-agent?	I'd choose multi-agent when three conditions are met: (1) the task has natural decomposition boundaries where different sub-tasks benefit from different tool access, system prompts, or contexts — for example, a research agent with web search and a coding agent with a sandbox; (2) the quality improvement from specialization is measurable and significant, not incremental; and (3) the latency and cost multiplier (3-7× more expensive) is acceptable for the use case. I'd always benchmark a single-agent baseline first. If one agent with well-designed tools achieves 80%+ of the quality, the coordination overhead of multi-agent isn't justified. The exception is adversarial review: having a critic agent that challenges the primary agent's output catches errors that self-review misses.<br><br>Source: agents/multi-agent-architectures.md<br>Tags: A, B, agents, coordination, genai-techniques, multi-agent, orchestration
Design a multi-agent system for automated code review.	I'd use a pipeline pattern with 3 agents. First, a Code Analyzer agent with access to static analysis tools (linting, complexity metrics, type checking) processes the diff and produces a structured analysis. Second, a Logic Reviewer agent with access to the codebase context (via RAG over the repo) evaluates correctness, identifies potential bugs, and checks for security issues. Third, a Summary Agent synthesizes both analyses into a human-readable review with actionable suggestions, severity levels, and specific line references. State management: each agent writes to a shared ReviewState dict with typed fields (analysis, logic_issues, suggestions). I'd add a max_cost guard ($0.50/review), trajectory logging via LangSmith, and a confidence score—if any agent is < 70% confident, flag for human review instead of auto-approving.<br><br>Source: agents/multi-agent-architectures.md<br>Tags: A, B, agents, coordination, genai-techniques, multi-agent, orchestration
What's the difference between BERT and GPT?	BERT is an ENCODER that sees all tokens bidirectionally (optimized for understanding — classification, NER, embeddings). GPT is a DECODER that sees only past tokens (optimized for generation — text, code, chat). Both use Transformers, but BERT predicts masked tokens while GPT predicts the next token.<br><br>Source: prerequisites/nlp-fundamentals.md<br>Tags: Foundation, bert, genai-prerequisite, natural-language-processing, ner, nlp, prerequisites, sentiment, text-classification
Has GenAI made traditional NLP obsolete?	Mostly, yes. LLMs handle most NLP tasks via prompting, often better than task-specific models. What survives: (1) BERT-style models for embeddings (BGE, E5), (2) lightweight classifiers where sub-100ms latency at scale matters, (3) classic BM25/TF-IDF for initial retrieval in RAG. The field has consolidated around "one model, many tasks."<br><br>Source: prerequisites/nlp-fundamentals.md<br>Tags: Foundation, bert, genai-prerequisite, natural-language-processing, ner, nlp, prerequisites, sentiment, text-classification
What does backpropagation actually compute?	The gradient of the loss function with respect to every weight in the network, using the chain rule of calculus. These gradients tell us how to adjust each weight to reduce the error.<br><br>Source: prerequisites/neural-networks.md<br>Tags: Foundation, activation, backpropagation, cnn, genai-prerequisite, neural-networks, perceptron, prerequisites, rnn
Why do we need activation functions?	Without non-linear activation, any number of layers collapses to a single linear transformation (y = Wx + b). Non-linearity lets the network approximate any function, not just lines/planes.<br><br>Source: prerequisites/neural-networks.md<br>Tags: Foundation, activation, backpropagation, cnn, genai-prerequisite, neural-networks, perceptron, prerequisites, rnn
What is temperature in LLM generation?	Temperature scales the logits before softmax. Low temperature (→0) makes the distribution sharper (confident picks), high temperature makes it flatter (random picks). Mathematically: P = softmax(logits / T).<br><br>Source: prerequisites/probability-and-statistics.md<br>Tags: Foundation, bayes, distributions, genai-prerequisite, loss-functions, prerequisites, probability, sampling, statistics
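A small demo of the formula:

```python
# Temperature scaling: P = softmax(logits / T).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])
print(softmax(logits / 0.1))   # low T → sharp, near-argmax distribution
print(softmax(logits / 1.0))   # T = 1 → unchanged distribution
print(softmax(logits / 2.0))   # high T → flatter, more random sampling
```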
What loss function do LLMs use and why?	Cross-entropy loss. It measures how different the model's predicted probability distribution is from the true distribution (where the correct next token has probability 1). Minimizing cross-entropy pushes the model to assign high probability to the correct token.<br><br>Source: prerequisites/probability-and-statistics.md<br>Tags: Foundation, bayes, distributions, genai-prerequisite, loss-functions, prerequisites, probability, sampling, statistics
What's the difference between zero-shot, few-shot, and chain-of-thought prompting?	Zero-shot: just instructions, no examples. Few-shot: include examples of desired input→output pairs. CoT: ask model to show reasoning steps. Each adds more guidance and typically improves quality.<br><br>Source: techniques/prompt-engineering.md<br>Tags: Foundation, chain-of-thought, few-shot, genai-techniques, prompt-engineering, prompting, techniques
How would you handle prompt injection in a production system?	Input sanitization, separate system/user prompts, output validation, don't include raw user input in system prompts. Use the model's built-in system prompt separation. For critical apps, add a second LLM call to verify the first output makes sense.<br><br>Source: techniques/prompt-engineering.md<br>Tags: Foundation, chain-of-thought, few-shot, genai-techniques, prompt-engineering, prompting, techniques
Why is Python dominant in AI if it is slower than C++?	Python gives fast iteration and a huge ecosystem, while the expensive numerical work runs underneath in optimized C, C++, CUDA, or vendor kernels. Python is the control layer, not the performance bottleneck.<br><br>Source: prerequisites/python-for-ai.md<br>Tags: Foundation, environment, genai-prerequisite, numpy, prerequisites, python, pytorch, transformers
What Python tools matter most for GenAI work?	NumPy for array thinking, PyTorch for tensors and training, Hugging Face libraries for models and tokenizers, plus environment management so your CUDA and package versions stay reproducible.<br><br>Source: prerequisites/python-for-ai.md<br>Tags: Foundation, environment, genai-prerequisite, numpy, prerequisites, python, pytorch, transformers
What is the most common beginner mistake when starting AI Python work?	Treating the environment as an afterthought. Many early failures come from incompatible package versions, wrong CUDA installs, or tensors ending up on different devices.<br><br>Source: prerequisites/python-for-ai.md<br>Tags: Foundation, environment, genai-prerequisite, numpy, prerequisites, python, pytorch, transformers
How would you improve a RAG pipeline that's giving wrong answers?	Debug in order: (1) Check if correct chunks are retrieved (retrieval eval), (2) If not, fix chunking strategy or embedding model, (3) If chunks are good but answer is wrong, fix the prompt or use a better LLM. Also consider adding re-ranking.<br><br>Source: techniques/rag.md<br>Tags: Foundation, embeddings, genai-techniques, rag, retrieval, techniques, vector-db
When would you choose RAG over fine-tuning?	RAG when: need up-to-date info, knowledge changes frequently, need source attribution. Fine-tuning when: need different output style/format, domain-specific reasoning, or model behavior changes. Best: combine both — RAG for facts, fine-tuning for behavior.<br><br>Source: techniques/rag.md<br>Tags: Foundation, embeddings, genai-techniques, rag, retrieval, techniques, vector-db
Explain the difference between semantic and keyword search in RAG.	Semantic (vector) search finds conceptually similar content even with different words ("car" matches "automobile"). Keyword (BM25) search finds exact term matches. Hybrid combines both — best overall because semantic misses exact terms and BM25 misses synonyms.<br><br>Source: techniques/rag.md<br>Tags: Foundation, embeddings, genai-techniques, rag, retrieval, techniques, vector-db
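One common way to combine the two rankings is reciprocal rank fusion (RRF); a sketch with made-up document IDs:

```python
# Hybrid retrieval via reciprocal rank fusion: merge a BM25 ranking with a
# vector-search ranking so documents found by both rise to the top.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["doc_7", "doc_2", "doc_9"]   # exact-term matches
vector_hits = ["doc_2", "doc_5", "doc_7"]   # semantic matches
print(rrf([bm25_hits, vector_hits]))         # doc_2 and doc_7 ranked first
```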
Explain the Chinchilla scaling laws.	For a fixed compute budget, there's an optimal ratio of model size to training data. Chinchilla showed the optimal is ~20 tokens per parameter. GPT-3 (175B params, 300B tokens) was massively undertrained — a 70B model on 1.4T tokens would match it. This led to LLaMA's approach: smaller models, much more data. In 2025-2026, industry "over-trains" beyond Chinchilla-optimal because inference cost (running the model) matters more than training cost (one-time).<br><br>Source: foundations/scaling-laws-and-pretraining.md<br>Tags: D, Foundation, chinchilla, compute, data-mix, foundations, genai, pre-training, scaling-laws, training
How is a large language model pre-trained?	(1) Collect trillions of tokens from internet, books, code. (2) Clean and deduplicate aggressively. (3) Train a BPE tokenizer. (4) Set data mix ratios (web, code, books, math). (5) Train using next-token prediction on 10K-100K GPUs for 2-6 months using distributed parallelism (data, tensor, pipeline). (6) Monitor loss curves, handle spikes, checkpoint regularly. Cost: $10M-$500M+ per run.<br><br>Source: foundations/scaling-laws-and-pretraining.md<br>Tags: D, Foundation, chinchilla, compute, data-mix, foundations, genai, pre-training, scaling-laws, training
What's the difference between JSON Mode and Structured Outputs?	JSON Mode only guarantees the output is syntactically valid JSON — it could be any shape. Structured Outputs enforce a specific JSON Schema using constrained decoding, guaranteeing the output has exactly the right fields, types, and structure. In production, always use Structured Outputs because you need to parse the result programmatically.<br><br>Source: techniques/structured-outputs.md<br>Tags: A, constrained-decoding, function-calling, genai-technique, json-mode, pydantic, schema, structured-output, techniques
How does constrained decoding work under the hood?	The JSON Schema is converted into a finite state machine. At each token generation step, the FSM determines which tokens are legal given the current state. All illegal tokens are masked to zero probability. The model samples only from valid tokens. This means schema violations are mathematically impossible — it's not retry-based, it's enforced during generation.<br><br>Source: techniques/structured-outputs.md<br>Tags: A, constrained-decoding, function-calling, genai-technique, json-mode, pydantic, schema, structured-output, techniques
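A toy illustration of the masking step (the "FSM" here is just a hard-coded allowed set, not a real schema compiler):

```python
# Constrained decoding in miniature: tokens the schema FSM forbids get -inf
# added to their logits, so they end up with zero probability.
import numpy as np

vocab = ['{', '}', '"', 'sentiment', ':', 'positive', 'negative', 'hello']
logits = np.random.randn(len(vocab))

allowed = {'{'}   # FSM state: a JSON object must start with '{'
mask = np.array([0.0 if tok in allowed else -np.inf for tok in vocab])

probs = np.exp(logits + mask)
probs /= probs.sum()
print(dict(zip(vocab, probs.round(3))))   # all probability mass on '{'
```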
Does structured output guarantee correct answers?	No — it guarantees correct **structure**, not correct **content**. A model can output `{"sentiment": "positive"}` for a clearly negative review. Structured output is a formatting guarantee, not a factuality guarantee. You still need semantic validation, ground-truth checks, and domain-specific validators.<br><br>Source: techniques/structured-outputs.md<br>Tags: A, constrained-decoding, function-calling, genai-technique, json-mode, pydantic, schema, structured-output, techniques
Why do LLMs use sub-word tokenization instead of word-level?	Word-level requires an impossibly large vocabulary (every word in every language) and can't handle misspellings, new words, or code. Sub-word splits rare words into common pieces ("unhappiness" → ["un", "happiness"]) while keeping frequent words whole. Fixed vocab size (~32K-128K), handles any input.<br><br>Source: foundations/tokenization.md<br>Tags: Foundation, bpe, foundations, genai-foundations, llm-internals, sentencepiece, tokenization, wordpiece
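A quick demo with the tiktoken library (the encoding name is one common choice; actual splits vary by tokenizer, so the "unhappiness" split above is illustrative):

```python
# Inspect how a BPE tokenizer splits words into sub-word pieces.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for word in ["hello", "unhappiness", "transformers"]:
    ids = enc.encode(word)
    print(word, "->", [enc.decode([i]) for i in ids])
```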
Why is tokenization a source of bias?	Languages with less representation in training data get worse tokenization — more tokens per word. This means non-English users spend more money, get slower responses, and use more of their context window for the same content. Larger vocabularies (LLaMA 3's 128K vs LLaMA 2's 32K) help mitigate this.<br><br>Source: foundations/tokenization.md<br>Tags: Foundation, bpe, foundations, genai-foundations, llm-internals, sentencepiece, tokenization, wordpiece
Why do Transformers use scaled dot-product attention (divide by √d_k)?	Without scaling, dot products grow large with high dimensions, pushing softmax into regions with tiny gradients. Dividing by √d_k keeps gradients healthy.<br><br>Source: foundations/transformers.md<br>Tags: Foundation, architecture, attention, deep-learning, foundations, genai-foundations, transformers
What's the computational complexity of self-attention?	O(n²·d) where n is sequence length and d is dimension. This quadratic scaling with n is the main bottleneck for long sequences.<br><br>Source: foundations/transformers.md<br>Tags: Foundation, architecture, attention, deep-learning, foundations, genai-foundations, transformers
Why decoder-only for generation instead of encoder-decoder?	Simpler architecture, easier to scale, and with enough data the decoder learns to "encode" implicitly. Also, causal masking naturally fits left-to-right generation.<br><br>Source: foundations/transformers.md<br>Tags: Foundation, architecture, attention, deep-learning, foundations, genai-foundations, transformers
How does approximate nearest neighbor search work?	ANN algorithms like HNSW build a layered graph where similar vectors are connected. Search starts from an entry point in the sparse top layer and greedily navigates toward the query vector, descending layer by layer through the graph. It's roughly O(log n) vs O(n) for brute force, with ~95-99% recall.<br><br>Source: tools-and-infra/vector-databases.md<br>Tags: A, B, chroma, embeddings, genai-infra, pinecone, qdrant, similarity-search, tools-and-infra, vector-db
How would you choose between Pinecone and self-hosting Qdrant?	Pinecone: zero ops, serverless pricing, fast start. Qdrant self-host: lower cost at scale, data stays on your infra, more control over indexing. Decision factors: team size, data sensitivity, query volume, and operational expertise.<br><br>Source: tools-and-infra/vector-databases.md<br>Tags: A, B, chroma, embeddings, genai-infra, pinecone, qdrant, similarity-search, tools-and-infra, vector-db
How would you build a real-time voice AI agent?	Option 1 (simplest): OpenAI Realtime API — WebSocket-based, speech-to-speech, handles turn-taking and interruptions natively. Option 2 (customizable): Pipeline of Deepgram STT → LLM (with function calling for tools) → ElevenLabs TTS, with a VAD layer for turn management. Option 3 (Google ecosystem): ADK + Gemini Live for multi-agent voice systems. Key challenges: latency optimization, interruption handling, and graceful error recovery.<br><br>Source: applications/voice-ai.md<br>Tags: A, applications, genai, realtime-api, speech-to-text, stt, text-to-speech, tts, voice-ai, whisper
