Context Engineering & Long Context¶

✨ Bit: In 2023, you could feed an LLM ~4,000 tokens (~3 pages). In 2025, Gemini accepts 1,000,000 tokens (~750,000 words — that's 10 novels). This changes EVERYTHING about how we build AI applications. RAG? Sometimes you just paste the entire database.

★ TL;DR¶

What: The art and science of deciding WHAT information goes into an LLM's context window, and using long context + caching to do it efficiently
Why: The context window IS the LLM's working memory. What you put in it determines everything about the output quality.
Key point: Context engineering is replacing "prompt engineering" as THE critical skill. It's not just about the prompt — it's about the system prompt + retrieved docs + examples + tool results + conversation history, all managed within a token budget.

★ Overview¶

Definition¶

Context window: The maximum number of tokens an LLM can process in a single call (input + output combined)
Context engineering: Strategically constructing the full context (system prompt, examples, retrieved data, conversation history) to maximize output quality within token limits
Context caching / Prompt caching: Reusing pre-computed token representations across API calls to save cost and latency

Scope¶

Covers context strategy and optimization. For retrieval-specific techniques, see Rag. For prompting techniques, see Prompt Engineering.

★ Deep Dive¶

Context Window Evolution¶

MODEL             │ CONTEXT WINDOW  │ ≈ PAGES │ YEAR
══════════════════╪═════════════════╪═════════╪══════
GPT-3             │     4,096       │     3   │ 2020
GPT-3.5           │    16,384       │    12   │ 2023
GPT-4             │   128,000       │    96   │ 2023
Claude 3          │   200,000       │   150   │ 2024
Gemini 1.5 Pro    │ 1,000,000       │   750   │ 2024
GPT-5.4           │ 1,000,000       │   750   │ 2026
Claude Opus 4.6   │ 1,000,000       │   750   │ 2026
Gemini 3.1 Pro    │ 1,000,000+      │   750+  │ 2026
LLaMA 4 Scout     │10,000,000       │ 7,500   │ 2025

WHAT FITS IN 1M TOKENS:
  10 novels         │ 30 hours of transcripts
  Entire codebase   │ 1000s of documents
  Full legal case   │ Year of emails

RAG vs Long Context vs Context Engineering¶

THE DEBATE (2025-2026):

  RAG:                          LONG CONTEXT:
  "Retrieve relevant chunks"    "Just stuff everything in"

  PROS:                         PROS:
  ✅ Works with ANY context     ✅ No retrieval pipeline
  ✅ Scales to billions of docs ✅ Model sees FULL context
  ✅ Always up-to-date          ✅ Better cross-referencing
  ✅ Cheaper per query          ✅ Simpler architecture

  CONS:                         CONS:
  ❌ Retrieval failures         ❌ Expensive per query
  ❌ Chunking artifacts         ❌ Limited to context size
  ❌ Complex pipeline           ❌ "Lost in the middle" effect
  ❌ Can miss connections       ❌ Slower (more tokens to process)

VERDICT: It's not either/or. Context engineering uses BOTH.

  ┌─────────────────────────────────────────────┐
  │  CONTEXT ENGINEERING = Strategic combination │
  │                                             │
  │  System prompt (always present)             │
  │  + Cached context (heavy docs, reusable)    │
  │  + RAG results (query-specific chunks)      │
  │  + Conversation history (recent turns)      │
  │  + Examples (few-shot, if needed)            │
  │  + Tool results (function call outputs)     │
  │  = Optimized context window                 │
  └─────────────────────────────────────────────┘

Context Caching (Prompt Caching)¶

THE COST PROBLEM:
  You have a 50-page manual in your system prompt.
  Every API call re-processes all 50 pages.
  1000 queries/day × 50 pages = MASSIVE token bill.

SOLUTION: Cache the repeated part.

  WITHOUT CACHING:
    Call 1: [System + 50 pages + user question 1]  → process ALL
    Call 2: [System + 50 pages + user question 2]  → process ALL
    Call 3: [System + 50 pages + user question 3]  → process ALL
    Cost: 100% × 3 = 300% tokens

  WITH CACHING:
    Call 1: [System + 50 pages ← CACHE THIS] + [question 1]
    Call 2: [CACHED] + [question 2]  → only process new part
    Call 3: [CACHED] + [question 3]  → only process new part
    Cost: 100% + 10% + 10% = 120% tokens → 60% SAVINGS!

PROVIDER SUPPORT (2026):
  Anthropic:  "Prompt caching" — explicit cache_control blocks
  Google:     "Context caching" — cache API for Gemini
  OpenAI:     Automatic caching for repeated prefixes

  Pricing:    Cached tokens cost 75-90% less than uncached
  Latency:    Cached tokens processed ~2-5x faster

The "Lost in the Middle" Problem¶

PROBLEM: LLMs pay most attention to the START and END
         of the context, often "forgetting" the MIDDLE.

  [System prompt - high attention]
  [Document 1 - moderate attention]
  [Document 2 - low attention]     ← "lost in the middle"
  [Document 3 - low attention]     ← important info here?
  [Document 4 - moderate attention]
  [User question - high attention]

MITIGATIONS:
  1. Put MOST IMPORTANT info at start and end
  2. Use structured formats (headers, bullets)
  3. Explicitly reference: "Based on Document 3..."
  4. Shorter contexts when possible (quality > quantity)
  5. Modern models (Gemini 3.1 Pro, Claude Opus 4.6) handle this MUCH better

Context Engineering in Practice¶

# ⚠️ Last tested: 2026-04
# ═══ Context Engineering Example ═══

def build_context(user_query: str, conversation_history: list) -> list:
    messages = []

    # 1. System prompt (always first, high attention)
    messages.append({
        "role": "system",
        "content": SYSTEM_PROMPT  # Company rules, persona, constraints
    })

    # 2. Cached reference docs (expensive, reuse across calls)
    # Use provider-specific caching (Anthropic cache_control, etc.)
    messages.append({
        "role": "system",
        "content": PRODUCT_MANUAL,  # 50-page manual, cached
        "cache_control": {"type": "ephemeral"}  # Anthropic syntax
    })

    # 3. RAG results (query-specific, fresh each call)
    relevant_chunks = retrieve_from_vector_db(user_query, top_k=5)
    messages.append({
        "role": "system",
        "content": f"Relevant context:\n{format_chunks(relevant_chunks)}"
    })

    # 4. Conversation history (sliding window)
    # Keep last N turns to stay within token budget
    recent_history = conversation_history[-10:]  # Last 10 turns
    messages.extend(recent_history)

    # 5. User's actual question (last, high attention)
    messages.append({
        "role": "user",
        "content": user_query
    })

    return messages

◆ Quick Reference¶

WHEN TO USE WHAT:
  Small doc set (< 100 pages)  → Long context (just paste it)
  Large doc set (1000s of docs) → RAG (retrieve relevant chunks)
  Repeated context across calls → Context caching (save $$$)
  Mixed scenario               → Cache + RAG + long context

TOKEN BUDGET PLANNING:
  Total budget = model's context window
  Reserve for output: ~2K-4K tokens
  System prompt: 500-2000 tokens
  Cached context: up to 50% of remaining
  RAG results: 20-40% of remaining
  Conversation history: 10-20%
  User query: typically < 500 tokens

COST COMPARISON (per 1M input tokens, approximate):
  Standard tokens:  $3-15
  Cached tokens:    $0.30-1.50 (10x cheaper!)
  Short context:    Fast, cheap
  Long context:     Slow, expensive, but comprehensive

○ Gotchas & Common Mistakes¶

⚠️ More context ≠ better answers: Irrelevant context DILUTES quality. Be strategic about what goes in.
⚠️ Lost in the middle: Important info gets ignored if buried in the middle. Structure and position matter.
⚠️ Cache invalidation: When your cached docs update, the cache must be refreshed. Plan for this.
⚠️ Token counting is tricky: Different models count tokens differently. Always check with the tokenizer.
⚠️ Context window ≠ effective context: A 1M-token window doesn't mean the model is equally good at using ALL 1M tokens. Effective context is usually shorter.

○ Interview Angles¶

Q: When would you use RAG vs just a long context window?
A: Long context when: few documents, need cross-references, latency isn't critical, and you can afford the token cost. RAG when: many documents (more than context window), need real-time data, cost-sensitive, or need to scale to millions of docs. In practice, combine both: cache stable reference docs in context, use RAG for dynamic query-specific retrieval.
Q: What is context engineering?
A: Context engineering is the practice of strategically constructing the full input to an LLM — system prompt, cached reference docs, RAG results, conversation history, and examples — to maximize output quality within the token budget. It's becoming more important than prompt engineering because the quality bottleneck is often WHAT information the model has access to, not HOW you phrase the question.

★ Code & Implementation¶

Dynamic Context Window Manager¶

# pip install openai>=1.60 tiktoken>=0.6
# ⚠️ Last tested: 2026-04 | Requires: openai>=1.60, tiktoken>=0.6, OPENAI_API_KEY
import tiktoken
from openai import OpenAI
from dataclasses import dataclass, field

client = OpenAI()
enc    = tiktoken.encoding_for_model("gpt-4o")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

@dataclass
class ContextManager:
    """Manages context window budget across system prompt, history, and retrieved docs."""
    model: str         = "gpt-4o-mini"
    context_limit: int = 8192       # safe limit (model max - output buffer)
    system_prompt: str = "You are a helpful assistant."
    history:       list[dict] = field(default_factory=list)

    @property
    def _system_tokens(self) -> int:
        return count_tokens(self.system_prompt)

    def add_user_message(self, content: str, context_docs: list[str] | None = None) -> list[dict]:
        """Build a messages list that fits within the context limit."""
        # Build the user message with retrieved context
        if context_docs:
            context_str = "\n\n".join(f"[Doc {i+1}]: {d}" for i, d in enumerate(context_docs))
            full_content = f"Context:\n{context_str}\n\nQuestion: {content}"
        else:
            full_content = content

        # Truncate history to fit in budget
        budget = self.context_limit - self._system_tokens - count_tokens(full_content) - 200
        trimmed_history = []
        history_tokens = 0
        for msg in reversed(self.history):
            t = count_tokens(msg["content"])
            if history_tokens + t > budget:
                break
            trimmed_history.insert(0, msg)
            history_tokens += t

        messages = (
            [{"role": "system", "content": self.system_prompt}]
            + trimmed_history
            + [{"role": "user", "content": full_content}]
        )
        used_tokens = self._system_tokens + history_tokens + count_tokens(full_content)
        print(f"Context: {used_tokens}/{self.context_limit} tokens ({len(trimmed_history)} history msgs)")
        return messages

    def chat(self, user_input: str, context_docs: list[str] | None = None) -> str:
        messages = self.add_user_message(user_input, context_docs)
        resp = client.chat.completions.create(
            model=self.model, messages=messages, max_tokens=500
        )
        answer = resp.choices[0].message.content
        self.history.append({"role": "user",      "content": user_input})
        self.history.append({"role": "assistant",  "content": answer})
        return answer

# Example
cm = ContextManager(system_prompt="You are a concise ML expert.")
r  = cm.chat("What is RAG?", context_docs=["RAG combines retrieval with generation to ground LLM answers."])
print(r)

★ Connections¶

Relationship	Topics
Builds on	Rag, Prompt Engineering, Tokenization
Leads to	Llmops (cost management), Better AI applications
Compare with	Traditional search, Knowledge bases
Cross-domain	Information retrieval, Memory management, Caching systems

◆ Production Failure Modes¶

Failure	Symptoms	Root Cause	Mitigation
Lost-in-the-middle	Model ignores information in the middle of long contexts	Attention distribution bias (U-shaped)	Place critical info at start/end, use structural markers
Context window waste	128K tokens used when 8K suffices, causing latency/cost spikes	No context budget management	Token counting, dynamic context assembly, cache control
Instruction-context conflict	System prompt and retrieved context give contradictory guidance	No priority hierarchy between instruction types	Explicit priority layers, context deduplication
Prompt injection via context	User-supplied context contains adversarial instructions	Untrusted content injected into prompt	Input sanitization, delimiter enforcement, separate user/system context

◆ Hands-On Exercises¶

Exercise 1: Build a Token-Budget-Aware Prompt Builder¶

Goal: Create a prompt assembly system that respects token limits Time: 30 minutes Steps: 1. Implement a PromptBuilder class with system/context/user sections 2. Add token counting with tiktoken 3. Implement priority-based truncation (system > user > context) 4. Test with inputs that exceed 4K token budget Expected Output: Prompt that never exceeds budget, with truncation logging

Exercise 2: Test the Lost-in-the-Middle Effect¶

Goal: Empirically demonstrate and mitigate lost-in-the-middle Time: 30 minutes Steps: 1. Create a 20-fact context window 2. Place a target fact at positions 1, 5, 10, 15, 20 3. Ask the LLM about the target fact at each position 4. Plot accuracy by position 5. Re-test with structural markers (XML tags, section headers) Expected Output: U-shaped accuracy curve and improvement with markers

★ Recommended Resources¶

Type	Resource	Why
🔧 Hands-on	Anthropic Prompt Engineering Guide	Best practical guide to context window management
📘 Book	"AI Engineering" by Chip Huyen (2025), Ch 5	Covers prompt and context design patterns systematically
🎥 Video	Simon Willison — "Context Engineering"	Practical insights on managing LLM context

★ Sources¶

Google, "Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens" (2024)
Anthropic, "Prompt Caching" documentation — https://docs.anthropic.com
Liu et al., "Lost in the Middle: How Language Models Use Long Contexts" (2023)
Simon Willison, "Context Engineering" blog posts (2025)