CCA Study Guide – Prompt Engineering & Structured Output

4.1

Design prompts with explicit criteria to improve precision and reduce false positives

A code-review agent keeps flagging harmless style nits, so a teammate tells it to "be conservative" and "only report high-confidence findings" — and precision barely moves. Vague instructions like those don't work because they leave the real question unanswered: which issues count. The fix is explicit categorical criteria — "flag comments only when the claimed behavior contradicts the actual code behavior," not "check that comments are accurate" — naming exactly which issues to report (bugs, security) versus skip (minor style, local patterns) instead of relying on a confidence dial.

Precision matters because false positives are contagious: a single high-false-positive category undermines developer trust in every category, even the accurate ones. Two levers follow from that. Define explicit severity criteria with a concrete code example for each level so classification stays consistent run to run. And when one category is too noisy to trust, temporarily disable it to restore confidence in the rest while you improve its prompt — then re-enable it.
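
One way the two levers could look in code is sketched below: each category carries an explicit reporting rule, and a noisy category can be switched off while its prompt is reworked. The category names, the rule wording, and the `build_review_prompt` helper are illustrative assumptions, not part of any official API.

```python
from dataclasses import dataclass


@dataclass
class Category:
    """One review category with an explicit, categorical reporting rule."""
    name: str
    report_when: str      # the concrete condition that makes an issue reportable
    enabled: bool = True  # set to False while a noisy category's prompt is reworked


# Hypothetical categories; the rule text is illustrative only.
CATEGORIES = [
    Category("bugs", "the code produces a wrong result or crashes on a reachable input"),
    Category("security", "untrusted input reaches a query, shell command, or file path unchecked"),
    Category("comments", "the claimed behavior contradicts the actual code behavior"),
    Category("style", "a formatting rule is violated", enabled=False),  # too noisy: disabled
]


def build_review_prompt(categories):
    """Render enabled categories as explicit rules and list the disabled ones as skips."""
    lines = ["Report an issue only if it matches one of these rules:"]
    for cat in categories:
        if cat.enabled:
            lines.append(f"- {cat.name}: flag only when {cat.report_when}.")
    skipped = [cat.name for cat in categories if not cat.enabled]
    if skipped:
        lines.append("Do not report issues in these categories: " + ", ".join(skipped) + ".")
    return "\n".join(lines)


prompt = build_review_prompt(CATEGORIES)
print(prompt)
```

Re-enabling a category is then a one-line change once its rule has been tightened.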

Key Behaviors

Write specific review criteria that define which issues to report (bugs, security) vs skip (minor style, local patterns) — categorical rules, not confidence-based filtering
Explicit criteria ("flag only when claimed behavior contradicts actual code behavior") beat vague ones ("check that comments are accurate")
General pleas — "be conservative", "only report high-confidence findings" — do not improve precision the way specific categorical criteria do
Define explicit severity criteria with a concrete code example per level to achieve consistent classification
A high-false-positive category erodes trust across all categories; temporarily disable it to restore trust while you improve its prompt

How It Works

Domain 4 Task 4.1 — explicit categorical criteria vs vague confidence-based instructions comparison

Distractor Traps (common wrong answers)

Telling the model to "be conservative" or "only report high-confidence findings" as a precision fix — it doesn't improve precision like categorical criteria do
Relying on confidence-based filtering instead of writing explicit criteria for which issues to report vs skip
Leaving severity levels undefined (no concrete code example per level), so classification drifts between runs
Keeping a noisy, high-false-positive category enabled — it erodes developer trust in the accurate categories too

Recommended Material

Claude Docs (official): Prompting best practices — Be clear and direct · specific, explicit criteria beat vague instructions

Anthropic Academy on Skilljar (optional): Building with the Claude API — Module 4 (Prompt Engineering) · Being clear and direct · Being specific · Structure with XML tags

Peace Of Code (YouTube, optional): Ep 14 — Prompt Engineering: Explicit Criteria & False Positives

4.2

Apply few-shot prompting to improve output consistency and quality

When detailed written instructions still produce inconsistent output, the most effective fix is to show 2–4 worked examples — few-shot prompting — rather than writing more prose. The examples do two jobs at once: they demonstrate the exact output format you want (location, issue, severity, suggested fix), and they show how to handle the ambiguous cases that confuse the model — which tool to pick for a vague request, or a branch-level test-coverage gap. The trick is to include the reasoning for why one action was chosen over a plausible alternative, so the example teaches judgment, not just an answer.

Good examples make the model generalize to novel patterns instead of matching only the cases you wrote down. They also cut hallucination in extraction: examples that cover varied document structures (inline citations vs bibliographies, methodology sections vs embedded details) and varied formats (informal measurements, for instance) teach the model to find real values wherever they appear, which addresses required fields coming back empty or null. In code review, examples that distinguish acceptable code patterns from genuine issues keep false positives down without blocking generalization.
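
The input, reasoning, output anatomy can be sketched as a small prompt builder. The two examples, the XML-style tag names, and the `render_few_shot` helper are assumptions made for illustration. The second example is deliberately a "report nothing" case so the model sees an acceptable pattern as well as a genuine issue.

```python
import json

# Hypothetical few-shot examples: each pairs an input with the reasoning that
# separates the chosen answer from a plausible alternative, then the output.
EXAMPLES = [
    {
        "input": "def total(xs): return sum(xs) / len(xs)",
        "reasoning": "Named 'total' but returns a mean, and an empty list divides by zero. "
                     "This is a bug rather than a naming nit because callers get wrong values.",
        "output": [{"location": "total():1", "issue": "returns mean, fails on empty input",
                    "severity": "high", "suggested_fix": "return sum(xs)"}],
    },
    {
        "input": "for i in range(len(items)): print(items[i])",
        "reasoning": "Index-based iteration is unidiomatic but correct. "
                     "It is an acceptable local pattern, so nothing is reported.",
        "output": [],
    },
]


def render_few_shot(examples, new_input):
    """Render each example as input, reasoning, output, then append the new input."""
    parts = []
    for ex in examples:
        parts.append(
            "<example>\n"
            f"<input>{ex['input']}</input>\n"
            f"<reasoning>{ex['reasoning']}</reasoning>\n"
            f"<output>{json.dumps(ex['output'])}</output>\n"
            "</example>"
        )
    parts.append(f"<input>{new_input}</input>")
    return "\n\n".join(parts)


few_shot_prompt = render_few_shot(EXAMPLES, "x = eval(user_text)")
print(few_shot_prompt)
```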

Key Behaviors

Create 2–4 targeted few-shot examples for ambiguous scenarios, each showing the reasoning for why one action was chosen over plausible alternatives
Include examples that demonstrate the specific desired output format — location, issue, severity, suggested fix — to achieve consistency
Few-shot is the most effective technique when detailed instructions alone produce inconsistent results; examples generalize judgment to novel patterns, not just pre-specified cases
Provide examples distinguishing acceptable code patterns from genuine issues to reduce false positives while still enabling generalization
Use examples covering varied document structures (inline citations vs bibliographies) and formats to address empty/null extraction and reduce hallucination

How It Works

Domain 4 Task 4.2 — anatomy of a few-shot example (input, reasoning, output) and what few-shot examples buy you

Distractor Traps (common wrong answers)

Adding more prose instructions when output is inconsistent, instead of providing concrete few-shot examples
Writing examples that only match pre-specified cases, so the model can't generalize to novel patterns
Showing only the chosen answer without the reasoning that distinguishes it from plausible alternatives
Omitting examples of varied document formats, leaving required fields extracted as empty or null

Recommended Material

Claude Docs (official): Prompting best practices — Use examples (multishot) · a few well-crafted examples for consistent, generalizable output

Anthropic Academy on Skilljar (optional): Building with the Claude API — Module 4 (Prompt Engineering) · Providing examples · Structure with XML tags · Being specific

Peace Of Code (YouTube, optional): Ep 15 — Few-Shot Prompting Explained

4.3

Enforce structured output using tool use and JSON schemas

When you need machine-parseable output, the most reliable approach is tool use with a JSON schema: you define an extraction tool whose input schema is the shape you want, and you read the structured data straight out of the tool_use response. Because the schema is enforced, this eliminates JSON syntax errors entirely. How hard you force it depends on tool_choice: "auto" lets the model return plain text instead of calling a tool (no guarantee), "any" requires it to call some tool (use this when several extraction schemas exist and the document type is unknown), and {"type": "tool", "name": "extract_metadata"} forces one specific tool to run — handy when a particular extraction must happen before enrichment steps.
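
The three settings can be summarized as a small decision helper. In the Messages API, `tool_choice` is passed as an object, so "auto" and "any" appear here as `{"type": "auto"}` and `{"type": "any"}`. The `pick_tool_choice` function and its parameter names are hypothetical.

```python
def pick_tool_choice(must_be_structured, required_tool=None):
    """Map a requirement onto one of the three tool_choice shapes described above."""
    if required_tool is not None:
        # One specific extraction must run, e.g. before enrichment steps.
        return {"type": "tool", "name": required_tool}
    if must_be_structured:
        # Some tool must be called, but the model picks which schema fits.
        return {"type": "any"}
    # The model may answer in plain text, so structure is not guaranteed.
    return {"type": "auto"}


print(pick_tool_choice(True, "extract_metadata"))
```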

A strict schema fixes syntax, not semantics: the JSON can be valid yet wrong — line items that don't sum to the total, or a value in the wrong field (that's Task 4.4's job). Design the schema to match reality: mark fields optional/nullable when a source may not contain them, which stops the model fabricating values to satisfy required fields; add enum values like "unclear" for ambiguous cases and an "other" + detail-string pattern for extensible categories; and put format-normalization rules in the prompt alongside the schema to absorb inconsistent source formatting.
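
A sketch of those schema choices follows, with a nullable field, an "unclear" enum value, and an "other" plus detail pair. The field names `payment_status` and `category_detail` and the `check_other_detail` helper are assumptions for illustration. The helper covers a rule this flat schema does not enforce by itself.

```python
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        # Nullable: the source may not contain a PO number at all.
        "po_number": {"type": ["string", "null"]},
        # "unclear" gives the model an honest answer for ambiguous documents.
        "payment_status": {"type": "string", "enum": ["paid", "unpaid", "unclear"]},
        # "other" plus a free-text detail keeps the category list extensible.
        "category": {"type": "string", "enum": ["hardware", "services", "other"]},
        "category_detail": {"type": ["string", "null"]},
    },
    "required": ["invoice_number", "payment_status", "category"],
}


def check_other_detail(record):
    """Return problems with the 'other' + detail pattern for one extracted record."""
    problems = []
    if record.get("category") == "other" and not record.get("category_detail"):
        problems.append("category is 'other' but category_detail is empty")
    return problems


print(check_other_detail({"category": "other", "category_detail": None}))
```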

Key Behaviors

Define extraction tools with JSON schemas as input parameters, then read the structured data from the tool_use response
tool_choice: "any" guarantees structured output when multiple extraction schemas exist and the document type is unknown; forcing {"type":"tool","name":"extract_metadata"} runs a specific extraction before enrichment
Strict schemas via tool use eliminate JSON syntax errors but not semantic errors (line items that don't sum, values in the wrong field)
Make fields optional/nullable when the source may lack them — prevents the model fabricating values to satisfy required fields
Add enum "unclear" for ambiguous cases and "other" + detail for extensible categories; include format-normalization rules in the prompt alongside the schema

How It Works

Domain 4 Task 4.3 — tool_choice auto vs any vs forced, plus the syntax-vs-semantic-error distinction and schema design

Distractor Traps (common wrong answers)

Using tool_choice: "auto" when you need guaranteed structured output — the model may return text instead of calling a tool
Assuming a strict JSON schema also prevents semantic errors — it only eliminates syntax errors
Marking fields required when the source may not contain them — forces the model to fabricate values
Asking for JSON in the prompt alone, without tool use — reintroduces the syntax errors tool use was meant to remove

Faded Example (Messages API — tool with JSON schema + tool_choice)

tools = [{
    "name": "extract_metadata",
    "description": "Extract invoice fields from the document",
    "input_schema": {
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},
            "category": {"type": "string", "enum": ["hardware", "services", "other"]},
            "category_detail": {"type": ["string", "null"]},  # free text when category is "other"
            "po_number": {"type": ["string", "null"]}  # nullable: source may omit it
        },
        "required": ["invoice_number"]
    }
}]

# Force this exact tool so the extraction runs before enrichment:
tool_choice = {"type": "tool", "name": "extract_metadata"}
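
Reading the structured data back out might look like the sketch below. The `response` dict is a hand-written stand-in for the JSON body of a Messages API response, with invented values, and `read_tool_input` is a hypothetical helper. With the Python SDK the content blocks are objects, so attribute access would replace the dict lookups.

```python
# Stand-in for a parsed response body; the id and field values are invented.
response = {
    "stop_reason": "tool_use",
    "content": [
        {
            "type": "tool_use",
            "id": "toolu_example",
            "name": "extract_metadata",
            "input": {"invoice_number": "INV-1001", "category": "hardware", "po_number": None},
        },
    ],
}


def read_tool_input(response, tool_name):
    """Return the input dict of the first tool_use block for tool_name, or None."""
    for block in response["content"]:
        if block.get("type") == "tool_use" and block.get("name") == tool_name:
            return block["input"]
    return None


metadata = read_tool_input(response, "extract_metadata")
print(metadata)
```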

Recommended Material

Claude Docs (official): Structured outputs · schema-compliant output via JSON outputs and strict tool use

Anthropic Academy on Skilljar (optional): Building with the Claude API — Modules 2 + 5 · Structured data · Structured data exercise (M2) · Tool schemas · Tool functions (M5)

Peace Of Code (YouTube, optional): Ep 16 — Structured Output & JSON Schema

4.4

Implement validation, retry, and feedback loops for extraction quality

Tool use removes syntax errors, but the extracted values can still be wrong — so you validate, then retry with the specific errors fed back. A correction request includes the original document, the failed extraction, and the exact validation errors, which guides the model to fix itself. To catch semantic problems in the first place, design the validation into the schema: extract "calculated_total" alongside "stated_total" so a mismatch is flagged automatically, and add a "conflict_detected" boolean for inconsistent source data.

Retry has a hard limit: it works for format mismatches and structural output errors, but it is ineffective when the required information is simply absent from the source — no amount of re-asking conjures data that lives only in an external document you didn't provide. Recognize that case and supply the source or escalate instead of looping. Finally, close the feedback loop over time: add a "detected_pattern" field that records which code construct triggered each finding, so when developers dismiss findings you can analyze the patterns and fix the false-positive sources at their root.
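
The extract, validate, retry-with-feedback loop can be sketched as follows. The `extract` callable stands in for the model call and is hypothetical, as are the tag names in the correction message and the `fake_extract` stub used to exercise the loop. When validation still fails after the last attempt, the sketch raises so the caller can escalate rather than keep looping.

```python
import json


def validate(extraction):
    """Compare the model's own sum with the total printed on the document."""
    errors = []
    calculated = extraction["calculated_total"]
    stated = extraction["stated_total"]
    if abs(calculated - stated) > 0.005:
        errors.append(f"calculated_total {calculated:.2f} does not match stated_total {stated:.2f}")
    return errors


def build_correction_message(document, failed_extraction, errors):
    """Bundle the original document, the failed extraction, and the exact errors."""
    return (
        "<document>\n" + document + "\n</document>\n"
        "<failed_extraction>\n" + json.dumps(failed_extraction, indent=2) + "\n</failed_extraction>\n"
        "<validation_errors>\n" + "\n".join(f"- {e}" for e in errors) + "\n</validation_errors>\n"
        "Correct the extraction so that every validation error is resolved."
    )


def extract_with_retry(document, extract, max_attempts=3):
    """Call extract(document, correction) until validation passes or attempts run out."""
    correction = None
    for _ in range(max_attempts):
        extraction = extract(document, correction)
        errors = validate(extraction)
        if not errors:
            return extraction
        correction = build_correction_message(document, extraction, errors)
    raise ValueError("validation still failing after retries; supply the source or escalate")


# Stub model call: wrong on the first attempt, corrected once feedback arrives.
attempts = []

def fake_extract(document, correction):
    attempts.append(correction)
    calculated = 25.0 if correction is None else 30.0
    return {"stated_total": 30.0, "calculated_total": calculated}


result = extract_with_retry("Widget 10.00\nGadget 20.00\nTotal 30.00", fake_extract)
print(result)
```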

Key Behaviors

On failure, send a follow-up request with the original document, the failed extraction, and the specific validation errors for self-correction
Retry succeeds for format mismatches and structural output errors; it is ineffective when the information exists only in an external document not provided
Build self-correcting validation: extract "calculated_total" alongside "stated_total" to flag discrepancies, and add "conflict_detected" booleans for inconsistent source data
Add a "detected_pattern" field to findings so dismissed findings can be analyzed for false-positive patterns over time
Semantic validation errors (values don't sum, wrong field) are distinct from schema syntax errors, which tool use already eliminates

How It Works

Domain 4 Task 4.4 — extract / validate / retry-with-error-feedback loop, the info-absent dead end, and the detected_pattern feedback field

Distractor Traps (common wrong answers)

Retrying blindly without appending the specific validation errors that tell the model what to correct
Retrying when the required information is simply absent from the source — no retry can produce data that isn't there
Treating semantic errors (values don't sum, wrong field) as syntax errors — tool use fixes syntax, not logic
Not tracking which constructs trigger findings (detected_pattern), so false-positive patterns go unanalyzed

Recommended Material

Anthropic Academy on Skilljar (optional): Building with the Claude API — Module 3 (Prompt Evaluation) · Prompt evaluation · A typical eval workflow · Model-based grading · Code-based grading

4.5

Design efficient batch processing strategies

A manager wants to cut costs by moving everything to the Message Batches API for its 50% savings, but the right answer is to match each API to its workload. The batch API processes requests asynchronously, with results arriving within a window of up to 24 hours and no guaranteed latency SLA. That fits latency-tolerant, non-blocking work (overnight reports, weekly audits, nightly test generation) and is wrong for a blocking pre-merge check a developer waits on. Keep the synchronous API for anything blocking: a batch may often finish sooner than the 24-hour limit, but "often faster" is not a guarantee. One more constraint: the batch API does not support multi-turn tool calling within a single request, so it can't execute a tool mid-request and feed results back.

Operate it deliberately. Each request carries a custom_id that correlates request and response when results return hours later and lets you resubmit only the failed documents, chunking any that exceeded the context limit. Size your submission cadence to your SLA: worst-case latency is the submission interval plus the processing window, so submitting every 4 hours with a 24-hour window gives 4 + 24 = 28 hours, which fits inside a 30-hour end-to-end SLA. And before you run large volumes, refine the prompt on a small sample set first to maximize first-pass success and avoid expensive iterative resubmission.
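
The cadence arithmetic is small enough to write down. This sketch assumes a document can arrive just after a submission, so it waits one full interval before it is even submitted. Both function names are hypothetical.

```python
def worst_case_hours(submit_interval_hours, processing_window_hours=24):
    """Longest wait: one full submission interval, then the full processing window."""
    return submit_interval_hours + processing_window_hours


def max_submit_interval(sla_hours, processing_window_hours=24):
    """Largest submission interval that still fits inside the SLA."""
    return sla_hours - processing_window_hours


print(worst_case_hours(4))      # 28
print(max_submit_interval(30))  # 6
```

Under these assumptions a 4-hour cadence leaves 2 hours of slack against a 30-hour SLA, and 6 hours is the longest interval that still meets it.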

Key Behaviors

Match the API to latency needs: synchronous for blocking pre-merge checks, batch for overnight/weekly analysis
Message Batches API: 50% cost savings, up to a 24-hour processing window, no guaranteed latency SLA; no multi-turn tool calling within a single request
Calculate submission frequency from the SLA: submitting every 4 hours with 24-hour batch processing gives a 28-hour worst case, which meets a 30-hour SLA
Handle failures by resubmitting only the failed documents (identified by custom_id), e.g. chunking documents that exceeded context limits
Refine the prompt on a sample set before batch-processing large volumes to maximize first-pass success and reduce resubmission cost
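
Resubmitting only the failures comes down to a lookup by custom_id. In this sketch the `results` list is a hand-written stand-in for parsed batch results, where each entry carries a `custom_id` and a `result` whose `type` is `succeeded` or a failure state. The document store and helper names are hypothetical.

```python
# Stand-in for parsed batch results; ids and outcomes are invented.
results = [
    {"custom_id": "doc-001", "result": {"type": "succeeded"}},
    {"custom_id": "doc-002", "result": {"type": "errored"}},
    {"custom_id": "doc-003", "result": {"type": "expired"}},
]

# Hypothetical store mapping each custom_id back to its source document.
documents = {
    "doc-001": "first document text",
    "doc-002": "second document text",
    "doc-003": "third document text",
}


def failed_ids(results):
    """Collect the custom_id of every request that did not succeed."""
    return [r["custom_id"] for r in results if r["result"]["type"] != "succeeded"]


def build_resubmission(documents, results):
    """Pair each failed custom_id with its document so only those are resubmitted."""
    return [{"custom_id": cid, "document": documents[cid]} for cid in failed_ids(results)]


resubmission = build_resubmission(documents, results)
print(resubmission)
```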

How It Works

Domain 4 Task 4.5 — synchronous API vs Message Batches API comparison across cost, latency, use case, tool calling, and custom_id correlation

Distractor Traps (common wrong answers)

Switching a blocking pre-merge check to batch processing — there is no guaranteed SLA, so "often faster" isn't acceptable
Using "batch with status polling" for a blocking workflow to dodge the latency problem
Claiming you must avoid batch because of "result ordering issues" — custom_id correlates request and response
Adding a "timeout fallback to real-time" instead of simply matching each API to its appropriate use case

Recommended Material

Claude Docs (official): Batch processing (Message Batches API) · 50% cost savings, async ≤24h window, custom_id correlation

Anthropic Academy on Skilljar (optional): Building with the Claude API — Modules 3 + 7 · A typical eval workflow · Generating test datasets · Running the eval (M3) · Prompt caching · Rules of prompt caching (M7)

Peace Of Code (YouTube, optional): Ep 17 — Batch API & Multi-Pass Review

4.6

Design multi-instance and multi-pass review architectures

A single-pass review of a 14-file pull request gives inconsistent results — detailed feedback on some files, superficial on others, obvious bugs missed, and contradictory verdicts that flag a pattern in one file while approving identical code in another. Two architectural moves fix this. First, use a second, independent Claude instance to review the generated code: the model that wrote the code retains its own reasoning context and is therefore less likely to question its own decisions, so a fresh instance with no generation context catches subtle issues that self-review instructions or extended thinking cannot.

Second, split a large review into focused passes: analyze each file individually for local issues, then run a separate integration pass examining cross-file data flow. This avoids the attention dilution and contradictory findings that come from processing everything at once — and a bigger context window does not fix it, because the problem is attention quality, not capacity. Optionally add a verification pass where the model self-reports a confidence level alongside each finding, enabling calibrated routing — auto-apply high-confidence findings, send low-confidence ones to a human.
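
The per-file plus integration split can be sketched as a pass planner. The `review` callable is a hypothetical wrapper around a model call that starts from a fresh context each time, and `fake_review` is a stub used only to show the shape of the run.

```python
def plan_review_passes(files):
    """One local pass per file, then a single integration pass over all files."""
    passes = [{"kind": "local", "files": [path]} for path in sorted(files)]
    passes.append({"kind": "integration", "files": sorted(files)})
    return passes


def run_review(files, review):
    """Run every planned pass; review(kind, sources) returns a list of findings."""
    findings = []
    for review_pass in plan_review_passes(files):
        sources = {path: files[path] for path in review_pass["files"]}
        # Each call is assumed to use a fresh context with no generation history.
        findings.extend(review(review_pass["kind"], sources))
    return findings


# Stub reviewer: records which pass saw which files instead of calling a model.
def fake_review(kind, sources):
    return [{"pass": kind, "files": sorted(sources)}]


files = {"api.py": "...", "db.py": "...", "cli.py": "..."}
findings = run_review(files, fake_review)
print(findings)
```

For the 14-file pull request in the scenario above, this plan would produce 15 passes: 14 local and 1 integration.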

Key Behaviors

Use a second independent Claude instance to review generated code without the generator's reasoning context
A model retains reasoning context from generation, so it is less likely to question its own decisions; an independent instance beats self-review instructions or extended thinking
Split large multi-file reviews into per-file passes for local issues plus a separate integration pass for cross-file data flow
Per-file + integration passes avoid the attention dilution and contradictory findings of a single all-files pass
Run verification passes where the model self-reports confidence alongside each finding to enable calibrated review routing
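
Calibrated routing on the self-reported confidence might be sketched like this. The 0.8 threshold, the numeric confidence scale, and the finding fields are assumptions; a real threshold would need to be tuned against how often findings at each confidence level turn out to be correct.

```python
def route_findings(findings, threshold=0.8):
    """Split findings into auto-apply and human-review lists by reported confidence."""
    auto_apply, human_review = [], []
    for finding in findings:
        if finding["confidence"] >= threshold:
            auto_apply.append(finding)
        else:
            human_review.append(finding)
    return auto_apply, human_review


# Invented findings from a verification pass.
verified = [
    {"issue": "unchecked None dereference", "confidence": 0.95},
    {"issue": "possible race on shared cache", "confidence": 0.55},
]

auto_apply, human_review = route_findings(verified)
print(auto_apply, human_review)
```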

How It Works

Domain 4 Task 4.6 — independent reviewer vs self-review, plus multi-pass (per-file local + cross-file integration) architecture

Distractor Traps (common wrong answers)

Switching to a larger context window to fix attention dilution — capacity isn't the problem, attention quality is
Running multiple full-PR passes and requiring consensus — suppresses real bugs that are only caught intermittently
Forcing developers to split PRs into smaller submissions — shifts the burden without improving the system
Reviewing with the same session that generated the code, or relying on extended thinking, instead of an independent instance

Recommended Material

Claude Docs (official): Building Effective AI Agents · evaluator-optimizer and parallelization patterns (independent review)

Anthropic Academy on Skilljar (optional): Building with the Claude API — Modules 3 + 10 · Model-based grading · Code-based grading (M3) · Parallelization workflows · Chaining workflows (M10)

Peace Of Code (YouTube, optional): Ep 17 — Batch API & Multi-Pass Review

Further Viewing — Peace Of Code (YouTube)

Watch if a topic is still unclear.

CCA Full Course — Peace Of Code

Explicit criteria & false positives · Ep 14 — Prompt Engineering: Explicit Criteria & False Positives

Few-shot prompting · Ep 15 — Few-Shot Prompting Explained

Structured output & JSON schema · Ep 16 — Structured Output & JSON Schema

Batch API & multi-pass review · Ep 17 — Batch API & Multi-Pass Review
