5.1

Manage conversation context to preserve critical information across long interactions

In a long customer support session, progressive summarization of the history destroys accuracy: numbers like "$47.20 refund" become "a large amount," order IDs vanish, and the model has nothing solid to reason from. The fix is a persistent case facts block: a small structured object containing transactional data (amounts, dates, order numbers, statuses) that is extracted once, kept outside the summarized history, and injected verbatim into every prompt. Tool outputs compound the problem independently: a single order lookup can return 40+ fields when only 5 matter, so trim each result to the relevant fields before it accumulates in context. A third hazard is position. The model reliably processes content at the beginning and end of long inputs but may overlook what is buried in the middle (the "lost in the middle" effect), so place key findings first and use explicit section headers to counteract it.

  • Extract transactional facts (amounts, dates, order numbers, statuses) into a persistent case-facts block kept outside the summarized history and injected into every prompt
  • For multi-issue sessions, persist structured issue data (IDs, amounts, statuses) in a separate context layer so distinct issues remain unambiguous
  • Trim verbose tool outputs to only the fields relevant to the current task before appending them to context
  • Place key findings at the beginning of aggregated inputs; use explicit section headers throughout to counteract the "lost in the middle" position effect
  • Require subagents to include metadata (dates, source locations, methodological context) in structured outputs; upstream agents should return key-facts structures rather than verbose reasoning chains when downstream context budgets are limited
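The trimming and ordering steps in the list above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the allow-list `RELEVANT_FIELDS`, the helper names, and the sample lookup fields are all hypothetical.

```python
from typing import Any

# Hypothetical allow-list: the handful of fields the current task needs.
RELEVANT_FIELDS = ("order_id", "status", "total", "delivered_on", "return_deadline")


def trim_tool_output(raw: dict[str, Any], keep: tuple[str, ...] = RELEVANT_FIELDS) -> dict[str, Any]:
    """Keep only the allow-listed fields that are present in the raw result."""
    return {name: raw[name] for name in keep if name in raw}


def build_aggregated_input(key_findings: list[str], sections: dict[str, str]) -> str:
    """Put key findings first, then each supporting section under its own header."""
    parts = ["## KEY FINDINGS"]
    parts += [f"- {finding}" for finding in key_findings]
    for title, body in sections.items():
        parts.append(f"\n## {title.upper()}")
        parts.append(body)
    return "\n".join(parts)


# Made-up lookup result: three useful fields mixed with noise.
raw_lookup = {
    "order_id": "ORD-8841",
    "status": "return_initiated",
    "total": "$47.20",
    "warehouse_bin": "C-14",
    "carrier_scan_count": 9,
    "marketing_opt_in": False,
}
trimmed = trim_tool_output(raw_lookup)
prompt_input = build_aggregated_input(
    ["Refund of $47.20 is pending on ORD-8841"],
    {"Order lookup": str(trimmed)},
)
```

An allow-list is used rather than a deny-list so that new fields added to the tool response stay out of context until someone decides they matter.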
Task 5.1 diagram — context preservation in long interactions
  • Progressive summarization of numerical values, percentages, and dates — condensing "$47.20" to "large amount" destroys precision needed later
  • Relying on model memory across turns without re-injecting the case facts block — the model's attention does not guarantee prior facts remain accessible
  • Including full raw tool output in context (40+ fields) without trimming to the 5 that matter — token cost grows faster than the useful signal
  • Placing key findings in the middle of a long aggregated input rather than at the start: the "lost in the middle" effect makes the model more likely to overlook them
  • Having upstream agents return verbose content and reasoning chains when downstream agents have limited context budgets
{
  "case_facts": {
    "order_id":        "ORD-8841",
    "claim_amount":    "$47.20",      // exact dollar — never "a large amount"
    "return_deadline": "2026-03-15",  // exact ISO date — never "soon"
    "issue":           "screen cracked in transit",
    "status":          "return_initiated"
  }
}
// Injected verbatim into every prompt — kept outside the summarized history.
// Placing it inside the summary risks truncating the exact values above.
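One way to inject the case facts block above into every turn is sketched below in Python. The function name `build_prompt`, the section headers, and the sample summary and message are assumptions made for illustration. The essential point is that the facts are serialized verbatim from a structured object and never pass through the summarizer.

```python
import json

CASE_FACTS = {
    "order_id": "ORD-8841",
    "claim_amount": "$47.20",
    "return_deadline": "2026-03-15",
    "issue": "screen cracked in transit",
    "status": "return_initiated",
}


def build_prompt(case_facts: dict, history_summary: str, user_message: str) -> str:
    """Assemble one turn's prompt with the case facts ahead of the summary."""
    facts_block = json.dumps({"case_facts": case_facts}, indent=2)
    return (
        "## CASE FACTS (authoritative, do not paraphrase)\n"
        f"{facts_block}\n\n"
        "## CONVERSATION SUMMARY\n"
        f"{history_summary}\n\n"
        "## CUSTOMER MESSAGE\n"
        f"{user_message}"
    )


prompt = build_prompt(
    CASE_FACTS,
    history_summary="Customer reported a damaged item and asked about a refund.",
    user_message="How much am I getting back, and by when do I need to ship it?",
)
```

The summary can be as lossy as it likes, because the exact amount and deadline come from the structured block on every turn.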
Claude Docs (official): Prompting best practices — Long context tips · Position effects, putting data before instructions, mitigating "lost in the middle"
Anthropic Academy on Skilljar (optional): Claude Code in Action — Module 3 + Building with the Claude API — Module 7 · Controlling context (CC) · Prompt caching · Rules of prompt caching · Prompt caching in action (BCA)
Peace Of Code (YouTube, optional): Ep 18 — Why AI Agents Forget: Context Engineering
5.2

Design effective escalation and ambiguity resolution patterns

A customer support agent achieves only 55% first-contact resolution because it escalates routine refunds that a policy lookup would resolve in seconds, while letting "I want to speak to a manager" slip past without an immediate human handoff. The fix is not smarter sentiment detection or a confidence score. It is three explicit hard triggers, encoded with few-shot examples in the system prompt: (1) the customer explicitly requests a human (route immediately, with no investigation first); (2) policy is silent or ambiguous on the specific request (for example, a competitor price-match request falls outside a policy that only addresses own-site adjustments); and (3) the agent cannot make meaningful progress. Separately, when a tool lookup returns multiple customer records, the agent must ask for an additional identifier and never select by heuristic.

  • Honor customer requests for a human agent immediately — no prior investigation attempt — by encoding this trigger explicitly in the system prompt
  • When a frustrated customer has not explicitly asked for a human and the issue is within the agent's capability, acknowledge the frustration and offer a resolution; escalate if the customer then states or reiterates a preference for a human
  • Escalate when policy is silent or ambiguous on the specific request — not just when the case seems complex in general
  • Ask for an additional identifier (email, order number, ZIP) when a tool lookup returns multiple matching customer records
  • Encode escalation criteria as explicit rules with few-shot examples showing when to escalate versus resolve autonomously
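As a rough sketch, the explicit rules and few-shot examples described above could be assembled into a system prompt like this. The rule wording, the example messages, and the `ESCALATE` / `RESOLVE` labels are illustrative assumptions, not an official prompt.

```python
ESCALATION_RULES = [
    "Escalate immediately, with no investigation first, when the customer explicitly asks for a human.",
    "Escalate when policy is silent or ambiguous on the specific request.",
    "Escalate when you cannot make meaningful progress on the issue.",
]

# (customer message, expected action, reason): illustrative examples only.
FEW_SHOT_EXAMPLES = [
    ("I want to speak to a manager.", "ESCALATE", "explicit request for a human"),
    ("Can you match the price I saw on a competitor's site?", "ESCALATE",
     "policy only covers own-site price adjustments"),
    ("My item arrived damaged and I'd like a refund.", "RESOLVE",
     "routine refund covered by policy"),
]


def build_system_prompt() -> str:
    """Render numbered rules followed by worked escalate-vs-resolve examples."""
    lines = ["You are a customer support agent.", "", "Escalation rules:"]
    lines += [f"{i}. {rule}" for i, rule in enumerate(ESCALATION_RULES, start=1)]
    lines += ["", "Examples:"]
    for message, action, reason in FEW_SHOT_EXAMPLES:
        lines.append(f'Customer: "{message}"')
        lines.append(f"Action: {action} ({reason})")
    return "\n".join(lines)


system_prompt = build_system_prompt()
```

Keeping the rules and examples as data makes it easy to add a new example whenever a misrouted conversation is found in review.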
Task 5.2 diagram — escalation decision flow
  • "Self-reported confidence score" — LLM self-reported confidence is poorly calibrated and does not correlate with actual case complexity
  • "Sentiment analysis as escalation signals" — negative sentiment does not reliably indicate whether the case exceeds the agent's capability
  • "Separate classifier model before prompt optimization" — over-engineered; requires labeled data and ML infrastructure when prompt optimization hasn't been tried
  • Attempting investigation before routing an explicit "I want a human" request — the trigger is clear and must be honored without delay
  • Selecting a customer record by heuristic when multiple matches exist — always request a disambiguating identifier instead
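The disambiguation rule can be enforced in the code that wraps the lookup tool, so the model never sees an arbitrarily chosen record. The sketch below assumes a hypothetical `customer_id` key on each match and a hypothetical `LookupOutcome` type.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LookupOutcome:
    action: str                      # "proceed", "ask_identifier", or "not_found"
    customer_id: Optional[str] = None
    question: Optional[str] = None


def resolve_customer(matches: list[dict]) -> LookupOutcome:
    """Proceed only on a unique match; never guess among several."""
    if not matches:
        return LookupOutcome(action="not_found")
    if len(matches) == 1:
        return LookupOutcome(action="proceed", customer_id=matches[0]["customer_id"])
    return LookupOutcome(
        action="ask_identifier",
        question=(
            "I found more than one account under that name. "
            "Could you share the email, order number, or ZIP code on the account?"
        ),
    )


outcome = resolve_customer([{"customer_id": "C-1"}, {"customer_id": "C-2"}])
```

With two matches, `outcome.action` is `"ask_identifier"` and no customer ID is returned, so downstream code has nothing to act on until the customer supplies another identifier.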
Anthropic Academy on Skilljar (optional): Building with the Claude API — Module 4 + Claude 101 — Module 4 · Being clear and direct · Being specific (BCA) · Claude in action: use-cases by role (101)
Peace Of Code (YouTube, optional): Ep 20 — When AI Needs a Human
5.3

Implement error propagation strategies across multi-agent systems

A web search subagent times out mid-research. The wrong response returns a generic "search unavailable" status — or worse, an empty result set marked as successful — stripping the coordinator of everything it needs to recover. The right response returns structured error context: failure_type, attempted_query, partial_results, and alternatives. This data transforms a dead end into a decision point: the coordinator can retry, reroute to a mirror, proceed with partial results annotated for coverage gaps, or escalate — none of which is possible from a generic status. The critical distinction is between an access failure (timeout — retry decision needed) and a valid empty result (query succeeded but matched nothing); conflating them produces bad retry logic.

  • Return structured error context — failure type, what was attempted, partial results, alternatives — not a generic "search unavailable" status
  • Distinguish access failures (timeout, permission denied → retry decision) from valid empty results (successful query, zero matches) in all error reporting
  • Implement local recovery for transient failures; propagate only errors the subagent cannot resolve, always including what was attempted and any partial results
  • Annotate synthesis output with coverage notes indicating which topic areas have gaps due to unavailable sources — the report should state its own limits
Task 5.3 diagram — structured error propagation flow
  • "Generic search unavailable status" — hides the failure type, the attempted query, and partial results; the coordinator cannot make an intelligent recovery decision
  • "Return empty result set marked as successful" — suppresses the error entirely; the coordinator proceeds as if the search succeeded, producing incomplete output with no signal
  • "Terminate the entire workflow on a single failure" — one subagent failure rarely justifies abandoning all other findings already collected
  • Propagating an error the subagent could have resolved locally: transient failures (a dropped connection, a brief timeout) should be retried in the subagent before anything is escalated
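Local recovery followed by structured propagation might look like the following sketch. The function name, the `status` and `attempts` keys, and the retry count are assumptions. The `search` callable stands in for whatever search tool the subagent actually uses, and the sketch leaves `partial` empty because this stand-in either returns everything or raises.

```python
import time
from typing import Callable


def search_with_local_recovery(
    search: Callable[[str], list[str]],
    query: str,
    max_attempts: int = 3,
    backoff_seconds: float = 0.0,  # 0 keeps the demo instant; use a real delay in practice
) -> dict:
    """Retry transient timeouts locally; propagate a structured error only if they persist."""
    for attempt in range(1, max_attempts + 1):
        try:
            results = search(query)
            # A successful call with zero matches is a valid empty result, not an error.
            return {"status": "ok", "query": query, "results": results}
        except TimeoutError:
            if attempt < max_attempts:
                time.sleep(backoff_seconds * attempt)
    return {
        "status": "error",
        "failure_type": "timeout",
        "attempted": query,
        "attempts": max_attempts,
        "partial": [],
        "alternatives": ["cached_data", "mirror"],
    }


def always_times_out(query: str) -> list[str]:
    raise TimeoutError("search backend did not respond")


error = search_with_local_recovery(always_times_out, "market share analysis 2026")
empty = search_with_local_recovery(lambda q: [], "obscure topic")
```

Here `error` carries the failure type and the attempted query, while `empty` is a successful result with zero matches. The coordinator can tell the two apart without parsing a status string.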
{
  "failure_type": "timeout",                    // access failure — not "search unavailable"
  "attempted":    "market share analysis 2026",  // exact query — coordinator can retry it
  "partial":      ["result_a"],                   // what was found before the failure
  "alternatives": ["cached_data", "mirror"]       // other approaches coordinator can try
}
// Return this structure — not a generic status string.
// A timeout and a valid empty result are not the same signal.
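On the coordinator side, the structure above is what makes a recovery decision possible. The policy below is one hypothetical ordering (retry, then reroute, then proceed with partial results, then escalate). The `retries_left` parameter and the action names are assumptions for illustration.

```python
def plan_recovery(result: dict, retries_left: int = 0) -> dict:
    """Map a subagent result to a coordinator action (illustrative policy)."""
    if "failure_type" not in result:
        # The query ran. Zero matches is a valid answer, so there is nothing to retry.
        return {"action": "use_results" if result.get("results") else "accept_empty"}

    topic = result.get("attempted", "unknown query")
    if result["failure_type"] == "timeout" and retries_left > 0:
        return {"action": "retry", "query": topic}
    if result.get("alternatives"):
        return {"action": "reroute", "target": result["alternatives"][0]}
    if result.get("partial"):
        note = (
            f"Coverage gap: '{topic}' ended in {result['failure_type']}; "
            f"only {len(result['partial'])} partial result(s) were used."
        )
        return {"action": "proceed_with_partial", "coverage_note": note}
    return {"action": "escalate", "reason": f"No data and no alternatives for '{topic}'."}


timeout = {
    "failure_type": "timeout",
    "attempted": "market share analysis 2026",
    "partial": ["result_a"],
    "alternatives": ["cached_data", "mirror"],
}
first_choice = plan_recovery(timeout, retries_left=1)
second_choice = plan_recovery(timeout)
last_resort = plan_recovery({**timeout, "alternatives": []})
```

The coverage note produced in the partial-results branch is what the synthesis step would attach to the final report so the report states its own limits.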
Anthropic Academy on Skilljar (optional): Building with the Claude API — Modules 5 + 10 · Sending tool results · Handling message blocks (M5) · Agents and tools · Environment inspection (M10)
5.4

Manage context effectively in large codebase exploration

An extended codebase exploration session degrades noticeably: the model begins citing "typical class hierarchies" instead of the actual inheritance structure it read 40 turns ago. Isolation is the fix — a main coordinator that holds only high-level findings delegates verbose discovery to named subagents ("find all test files," "trace the refund flow dependencies"), keeping their raw output in their isolated context while returning only a compact summary to the coordinator. Key findings must be persisted to a findings.md scratchpad on disk and re-injected in later turns to counteract drift; use /compact when the context window fills with verbose discovery output. For crashes, each subagent exports state to a known location and the coordinator loads a manifest on resume.

  • Spawn subagents with specific, scoped questions so verbose file listings and traces stay in their isolated contexts — not the main coordinator's
  • Summarize key findings from each exploration phase before spawning the next phase, injecting the summary into new subagent prompts
  • Maintain a scratchpad file (findings.md) and re-read it for subsequent questions to counteract context degradation
  • Use /compact to reduce context usage when extended sessions accumulate verbose discovery output
  • Design crash recovery with structured state exports (manifests) that the coordinator loads and injects into agent prompts on resume
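The scratchpad pattern needs very little code. The sketch below assumes a line-per-finding format and hypothetical helper names. The finding text is invented for the example, and a temporary directory stands in for the project root.

```python
import tempfile
from pathlib import Path


def record_finding(scratchpad: Path, phase: str, finding: str) -> None:
    """Append one finding under its phase label so it survives context compaction."""
    with scratchpad.open("a", encoding="utf-8") as handle:
        handle.write(f"- [{phase}] {finding}\n")


def build_subagent_prompt(scratchpad: Path, question: str) -> str:
    """Re-read the scratchpad from disk and place it ahead of the scoped question."""
    findings = scratchpad.read_text(encoding="utf-8") if scratchpad.exists() else "(none yet)\n"
    return f"## FINDINGS SO FAR\n{findings}\n## QUESTION\n{question}"


with tempfile.TemporaryDirectory() as tmp:
    pad = Path(tmp) / "findings.md"
    record_finding(pad, "phase-1", "RefundService extends BasePaymentService")
    prompt = build_subagent_prompt(pad, "Trace the refund flow dependencies.")
```

Because the prompt is rebuilt from the file each time, a finding recorded in an early phase reaches later subagents word for word, no matter how much of the conversation has since been compacted.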
Task 5.4 diagram — hub-and-spoke codebase context architecture
  • Loading the coordinator's context with raw file dumps from subagents — verbose discovery belongs in isolated subagent contexts, not the coordinator's
  • Ignoring the scratchpad pattern and relying on model memory alone — without written persistence, findings degrade and the model reverts to "typical patterns"
  • Spawning phase N+1 subagents without first summarizing phase N findings — each phase should inject a compact summary, not the full prior context
  • Treating a crash as unrecoverable — structured manifests allow the coordinator to reload state and resume without restarting the entire exploration
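A manifest-based recovery scheme could be sketched as follows. The file layout (`manifest.json` plus one JSON file per subagent), the function names, and the state fields are assumptions. The sketch also ignores concurrent writers, which a real system would need to handle with locking or per-agent manifest entries.

```python
import json
import tempfile
from pathlib import Path


def export_state(state_dir: Path, agent_name: str, state: dict) -> None:
    """Write one subagent's state file and register it in the shared manifest."""
    state_dir.mkdir(parents=True, exist_ok=True)
    state_file = state_dir / f"{agent_name}.json"
    state_file.write_text(json.dumps(state, indent=2), encoding="utf-8")

    manifest_path = state_dir / "manifest.json"
    manifest = json.loads(manifest_path.read_text(encoding="utf-8")) if manifest_path.exists() else {}
    manifest[agent_name] = {"state_file": state_file.name, "status": state.get("status", "unknown")}
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")


def load_resume_context(state_dir: Path) -> dict:
    """On resume, load every state file the manifest lists."""
    manifest_path = state_dir / "manifest.json"
    if not manifest_path.exists():
        return {}
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    return {
        name: json.loads((state_dir / entry["state_file"]).read_text(encoding="utf-8"))
        for name, entry in manifest.items()
    }


with tempfile.TemporaryDirectory() as tmp:
    state_dir = Path(tmp) / "state"
    export_state(state_dir, "test-finder", {"status": "done", "findings": ["tests live under tests/unit"]})
    export_state(state_dir, "refund-tracer", {"status": "in_progress", "findings": []})
    resumed = load_resume_context(state_dir)
```

After a crash, the coordinator would inject `resumed` into the relevant agent prompts and restart only the subagents whose status is not `"done"`.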
Claude Docs (official): Best practices for Claude Code · Manage context with subagents, scratchpad files, and /compact
Anthropic Academy on Skilljar (optional): Claude Code in Action — Modules 2 + 3 + Building with the Claude API — Module 6 · Adding context · Controlling context (CC) · Text chunking strategies · The full RAG flow · BM25 lexical search · A Multi-Index RAG pipeline (BCA)
Peace Of Code (YouTube, optional): Ep 18 — Why AI Agents Forget: Context Engineering
5.5

Design human review workflows and confidence calibration

Routing every extracted document to a human reviewer is unsustainable at scale, but one aggregate accuracy number — 97% correct — can hide a document type or field that is failing at 71%. The reliable path is field-level confidence scores calibrated against a labeled validation set (raw model confidence is not calibrated out of the box), routing extractions below the threshold or with ambiguous/contradictory source documents to human review. The auto-accepted pile is not safe to leave unmonitored: stratified random sampling by document type and field ensures novel error patterns surface before they propagate across thousands of records.
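A short worked example shows how a 97% aggregate can coexist with a 71% segment. The document counts below are hypothetical, chosen only so the arithmetic lands on the figures quoted above.

```python
# Hypothetical volumes: handwritten forms are 5% of the total.
segments = {
    "typed_forms":       {"total": 9_500, "correct": 9_345},
    "handwritten_forms": {"total":   500, "correct":   355},
}

total_docs = sum(s["total"] for s in segments.values())
total_correct = sum(s["correct"] for s in segments.values())

aggregate = total_correct / total_docs
per_segment = {name: s["correct"] / s["total"] for name, s in segments.items()}
```

Here `aggregate` is 9,700 / 10,000 = 0.97, while `per_segment["handwritten_forms"]` is 355 / 500 = 0.71. The small segment barely moves the headline number, which is why the breakdown has to be computed explicitly.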

  • Output field-level confidence scores (not a single document-level score); routing decisions need granularity at the field level, not the document level
  • Calibrate review thresholds using a labeled validation set, not raw model output — model confidence is not calibrated by default
  • Route extractions with low confidence or from ambiguous/contradictory source documents to human review, prioritizing limited reviewer capacity where it matters most
  • Implement stratified random sampling of high-confidence auto-accepted extractions for ongoing error-rate measurement and novel pattern detection
  • Analyze accuracy by document type and by field before reducing human review on any segment — never rely on an aggregate metric alone
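Calibration and routing can be sketched together. This is one simple approach under stated assumptions, not the only way to calibrate: for each field it picks the lowest confidence cutoff at which the labeled validation items at or above that cutoff meet a target accuracy. The field names, validation pairs, and target are invented for illustration.

```python
from typing import Optional


def calibrate_threshold(validation: list[tuple[float, bool]], target_accuracy: float) -> Optional[float]:
    """Return the lowest cutoff whose at-or-above set meets the target accuracy."""
    for cutoff in sorted({confidence for confidence, _ in validation}):
        accepted = [correct for confidence, correct in validation if confidence >= cutoff]
        if accepted and sum(accepted) / len(accepted) >= target_accuracy:
            return cutoff
    return None  # no cutoff is safe: send this field to review every time


def route_extraction(fields: dict[str, dict], thresholds: dict[str, float],
                     source_conflict: bool = False) -> dict[str, str]:
    """Route each field independently; ambiguous sources send every field to review."""
    routes = {}
    for name, field in fields.items():
        cutoff = thresholds.get(name)
        needs_review = source_conflict or cutoff is None or field["confidence"] < cutoff
        routes[name] = "human_review" if needs_review else "auto_accept"
    return routes


# Labeled validation pairs for one field: (model confidence, was the extraction correct?)
invoice_total_validation = [(0.99, True), (0.95, True), (0.90, False), (0.85, True), (0.60, False)]
thresholds = {"invoice_total": calibrate_threshold(invoice_total_validation, target_accuracy=0.95)}

routes = route_extraction(
    {
        "invoice_total": {"value": "47.20", "confidence": 0.97},
        "due_date": {"value": "2026-03-15", "confidence": 0.99},
    },
    thresholds,
)
```

Note that `due_date` goes to human review despite its 0.99 confidence, because no threshold has been calibrated for that field. Treating uncalibrated fields as unsafe is a deliberate default in this sketch.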
Task 5.5 diagram — confidence-based human review routing
  • Using a single document-level confidence score to route review — field-level granularity is required; a document can be high-confidence on most fields but fail on one
  • Trusting raw model confidence as a calibrated signal — it is not; calibrate against a labeled validation set first
  • Stopping human review of high-confidence extractions based on an aggregate accuracy metric — one failing segment (e.g., handwritten forms at 71%) can be hidden inside a 97% aggregate
  • Sampling uniformly rather than by stratum — document types and field types have different error profiles; uniform sampling under-samples the segments that fail
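A stratified sampler for the auto-accepted pile can be written with the standard library. The sketch below uses equal allocation (the same number of records from every stratum), which is one choice among several. The record keys `doc_type` and `field` and the sample data are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(records: list[dict], per_stratum: int, seed: int = 0) -> list[dict]:
    """Draw up to `per_stratum` records from every (doc_type, field) stratum."""
    rng = random.Random(seed)  # seeded so an audit sample can be reproduced
    strata = defaultdict(list)
    for record in records:
        strata[(record["doc_type"], record["field"])].append(record)

    sample = []
    for key in sorted(strata):
        group = strata[key]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample


# 95 typed and 5 handwritten auto-accepted extractions of the same field.
records = (
    [{"doc_type": "typed", "field": "total", "id": i} for i in range(95)]
    + [{"doc_type": "handwritten", "field": "total", "id": 95 + i} for i in range(5)]
)
sample = stratified_sample(records, per_stratum=5)
handwritten_sampled = sum(1 for r in sample if r["doc_type"] == "handwritten")
```

A uniform 10% draw from these 100 records would include about half a handwritten record on average. The stratified draw reviews all five, so the rare segment cannot hide.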
Anthropic Academy on Skilljar (optional): Building with the Claude API — Module 3 + Introduction to Agent Skills — Module 6 · Model-based grading · Code-based grading (BCA) · Troubleshooting skills (IAS)
Peace Of Code (YouTube, optional): Ep 20 — When AI Needs a Human
5.6

Preserve information provenance and handle uncertainty in multi-source synthesis

Multi-source synthesis breaks silently when a summarization step drops source URLs, document names, and dates — the final report presents claims with no way to trace them back. Every subagent must output structured claim-source mappings (source URL, document name, relevant excerpt, publication date) and the synthesis agent must preserve and merge them rather than flattening them into anonymous prose. When two credible sources report conflicting statistics — one shows revenue growth of 14%, another 9% — both values are included and annotated with their sources and dates; the coordinator decides reconciliation. A publication date in the mapping prevents a Q1/Q4 difference from being misread as a factual contradiction.

  • Require subagents to output structured claim-source mappings (source URL, document name, relevant excerpt, date) in every structured result
  • Preserve and merge source attribution through synthesis steps — never flatten mappings into prose that loses the originating source
  • When two credible sources conflict, annotate both values with source and date; do not arbitrarily select one — let the coordinator decide reconciliation
  • Require publication or data-collection dates in all structured outputs so temporal differences between sources are not misinterpreted as factual contradictions
  • Structure synthesis reports with separate sections distinguishing well-established findings from contested ones; render content type-appropriately (financial data as tables, news as prose, technical findings as structured lists)
Task 5.6 diagram — provenance preservation through multi-source synthesis
  • Dropping source attribution during summarization — once URLs, doc names, and excerpts are removed from a compressed finding, they cannot be reconstructed downstream
  • Arbitrarily selecting one value when two credible sources conflict — the synthesis agent should annotate both with provenance and let the coordinator or human decide
  • Omitting publication dates from structured outputs — a Q1 vs. Q4 figure may be a temporal update, not a contradiction; without dates the coordinator cannot tell
  • Converting all content to a uniform format — financial data, news prose, and technical findings each need a different rendering; one format loses structure or nuance
{
  "claim":      "Revenue grew 14% in Q1 2026",
  "source_url": "https://corp.example.com/reports/q1-2026.pdf",
  "doc_name":   "Q1 2026 Earnings Report",
  "excerpt":    "Total revenue increased 14.2% year-over-year...",
  "date":       "2026-04-15"   // required — distinguishes update from contradiction
}
// Without the "date" field, a Q1 vs Q4 figure looks like a conflict.
// Synthesis agent merges these mappings — it never strips them into prose.
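A merge step that keeps every source attached and separates temporal differences from real conflicts might look like this. The `metric`, `value`, and `period` keys are hypothetical additions to the mapping shown above, and the second source URL is a placeholder.

```python
from collections import defaultdict


def merge_claims(mappings: list[dict]) -> list[dict]:
    """Group claim-source mappings by metric and keep every source attached."""
    by_metric = defaultdict(list)
    for mapping in mappings:
        by_metric[mapping["metric"]].append(mapping)

    merged = []
    for metric, sources in by_metric.items():
        values_by_period = defaultdict(set)
        for source in sources:
            values_by_period[source["period"]].add(source["value"])

        all_values = set().union(*values_by_period.values())
        if any(len(values) > 1 for values in values_by_period.values()):
            status = "contested"             # same period, different values
        elif len(all_values) > 1:
            status = "temporal_difference"   # different periods, not a contradiction
        else:
            status = "consistent"
        merged.append({"metric": metric, "status": status, "sources": sources})
    return merged


mappings = [
    {"metric": "revenue_growth", "value": "14%", "period": "Q1 2026",
     "source_url": "https://corp.example.com/reports/q1-2026.pdf", "date": "2026-04-15"},
    {"metric": "revenue_growth", "value": "9%", "period": "Q1 2026",
     "source_url": "https://news.example.com/analysis", "date": "2026-04-20"},
]
merged = merge_claims(mappings)
```

The merge labels the disagreement but does not resolve it: both values and both sources are passed along, and reconciliation is left to the coordinator or a human.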
Anthropic Academy on Skilljar (optional): Building with the Claude API — Modules 6 + 7 · Text embeddings · The full RAG flow · A Multi-Index RAG pipeline (M6) · Citations (M7)

Further Reading — Claude Docs

Official Anthropic documentation for the concepts in this domain.

Task 5.1 — Long-context prompting Prompting best practices — Long context tips →
Task 5.4 — Large codebase exploration Best practices for Claude Code →

Further Reading — Anthropic Academy on Skilljar

Optional self-paced courses covering Domain 5 topics in depth.

Anthropic Academy on Skilljar →

Tasks 5.1 + 5.4 — Context management & codebase exploration Claude Code in Action →
Tasks 5.2 · 5.3 · 5.5 · 5.6 — Escalation, errors, review & provenance Building with the Claude API →

Further Viewing — Peace Of Code (YouTube)

Watch if a topic is still unclear after reading.

CCA Full Course — Peace Of Code →

Tasks 5.1, 5.4 — Context engineering & memory management Ep 18 — Why AI Agents Forget: Context Engineering →
Task 5.3 — Subagent error propagation & context boundaries Ep 19 — Subagent Error Propagation & Context Management →
Tasks 5.2, 5.5 — Human-in-the-loop & escalation patterns Ep 20 — When AI Needs a Human →