Prompt Engineering & Structured Output
CCA Foundations · Exam Prep Guide · ImagineX Consulting
Design prompts with explicit criteria to improve precision and reduce false positives
A code-review agent keeps flagging harmless style nits, so a teammate tells it to "be conservative" and "only report high-confidence findings" — and precision barely moves. Vague instructions like those don't work because they leave the real question unanswered: which issues count. The fix is explicit categorical criteria — "flag comments only when the claimed behavior contradicts the actual code behavior," not "check that comments are accurate" — naming exactly which issues to report (bugs, security) versus skip (minor style, local patterns) instead of relying on a confidence dial.
Precision matters because false positives are contagious: a single high-false-positive category undermines developer trust in every category, even the accurate ones. Two levers follow from that. Define explicit severity criteria with a concrete code example for each level so classification stays consistent run to run. And when one category is too noisy to trust, temporarily disable it to restore confidence in the rest while you improve its prompt — then re-enable it.
Key Behaviors
- Write specific review criteria that define which issues to report (bugs, security) vs skip (minor style, local patterns) — categorical rules, not confidence-based filtering
- Explicit criteria ("flag only when claimed behavior contradicts actual code behavior") beat vague ones ("check that comments are accurate")
- General pleas ("be conservative", "only report high-confidence findings") do not improve precision the way specific categorical criteria do
- Define explicit severity criteria with a concrete code example per level to achieve consistent classification
- A high-false-positive category erodes trust across all categories; temporarily disable it to restore trust while you improve its prompt
How It Works
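As a minimal sketch of the ideas above, a review prompt can be assembled from categorical rules rather than a confidence instruction. The `Category` class, the `build_review_prompt` helper, and the three sample categories are hypothetical illustrations, not part of any library or of the exam material.

```python
from dataclasses import dataclass


@dataclass
class Category:
    name: str
    report_when: str        # explicit rule for when to flag
    severity_example: str   # concrete example anchoring the severity level
    enabled: bool = True    # set False to pause a noisy category


def build_review_prompt(categories: list[Category], skip: list[str]) -> str:
    """Assemble a review prompt from categorical rules, not a confidence dial."""
    lines = ["Report ONLY findings that match one of these categories:"]
    for cat in categories:
        if not cat.enabled:
            continue  # disabled categories are left out of the prompt entirely
        lines.append(f"- {cat.name}: {cat.report_when}")
        lines.append(f"  Example: {cat.severity_example}")
    lines.append("Do NOT report:")
    lines.extend(f"- {item}" for item in skip)
    return "\n".join(lines)


# Hypothetical categories; the third is paused while its prompt is improved.
categories = [
    Category("bug", "flag when the code produces a wrong result or crashes",
             "critical: index past the end of a list inside a loop"),
    Category("comment-accuracy",
             "flag only when the claimed behavior contradicts the actual code behavior",
             "minor: docstring says 'returns a list' but the function returns a dict"),
    Category("identifier-clarity", "flag names that mislead about a value's type",
             "minor: variable `count` holds a string", enabled=False),
]
prompt = build_review_prompt(categories, skip=["minor style", "local patterns"])
print(prompt)
```

Flipping `enabled` back to `True` re-enables the category once its criteria are fixed, without touching the rest of the prompt.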
Distractor Traps (common wrong answers)
- Telling the model to "be conservative" or "only report high-confidence findings" as a precision fix; vague pleas don't improve precision the way categorical criteria do
- Relying on confidence-based filtering instead of writing explicit criteria for which issues to report vs skip
- Leaving severity levels undefined (no concrete code example per level), so classification drifts between runs
- Keeping a noisy, high-false-positive category enabled — it erodes developer trust in the accurate categories too
Apply few-shot prompting to improve output consistency and quality
When detailed written instructions still produce inconsistent output, the most effective fix is to show 2–4 worked examples — few-shot prompting — rather than writing more prose. The examples do two jobs at once: they demonstrate the exact output format you want (location, issue, severity, suggested fix), and they show how to handle the ambiguous cases that confuse the model — which tool to pick for a vague request, or a branch-level test-coverage gap. The trick is to include the reasoning for why one action was chosen over a plausible alternative, so the example teaches judgment, not just an answer.
Good examples make the model generalize to novel patterns instead of matching only the cases you wrote down. They also cut hallucination in extraction: examples covering varied document structures (inline citations vs bibliographies, methodology sections vs embedded details) and awkward inputs such as informal measurements teach the model to extract real values where it would otherwise return required fields empty or null. In code review, examples that distinguish acceptable code patterns from genuine issues keep false positives down while still enabling generalization.
Key Behaviors
- Create 2–4 targeted few-shot examples for ambiguous scenarios, each showing the reasoning for why one action was chosen over plausible alternatives
- Include examples that demonstrate the specific desired output format — location, issue, severity, suggested fix — to achieve consistency
- Few-shot is the most effective technique when detailed instructions alone produce inconsistent results; examples generalize judgment to novel patterns, not just pre-specified cases
- Provide examples distinguishing acceptable code patterns from genuine issues to reduce false positives while still enabling generalization
- Use examples covering varied document structures (inline citations vs bibliographies) and formats to address empty/null extraction and reduce hallucination
How It Works
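A small sketch of how such a prompt might be assembled. The two examples, their field names, and the `build_few_shot_prompt` helper are invented for illustration; a real set would use 2 to 4 cases drawn from your own ambiguous scenarios. Each example carries its reasoning and the exact output format (location, issue, severity, suggested fix), and one shows an acceptable pattern that should produce no finding.

```python
import json

# Hypothetical examples: one genuine issue, one acceptable pattern.
EXAMPLES = [
    {
        "code": "def second(items): return items[2]",
        "reasoning": "Index 2 is the third element, so the function returns the "
                     "wrong item. That is a bug, not a style choice, so report it.",
        "finding": {"location": "second:1", "issue": "off-by-one index",
                    "severity": "high", "suggested_fix": "return items[1]"},
    },
    {
        "code": "n = len(rows)  # short name used across this module",
        "reasoning": "The short name follows a local pattern. Flagging it would "
                     "be a false positive, so report nothing.",
        "finding": None,
    },
]


def build_few_shot_prompt(examples: list[dict], new_code: str) -> str:
    """Render worked examples (with reasoning) ahead of the new input."""
    parts = []
    for n, ex in enumerate(examples, start=1):
        parts += [
            f"Example {n}",
            f"Code: {ex['code']}",
            f"Reasoning: {ex['reasoning']}",
            f"Finding: {json.dumps(ex['finding'])}",
            "",
        ]
    parts += ["Now review", f"Code: {new_code}", "Reasoning:"]
    return "\n".join(parts)


prompt = build_few_shot_prompt(EXAMPLES, "total = price * qty")
print(prompt)
```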
Distractor Traps (common wrong answers)
- Adding more prose instructions when output is inconsistent, instead of providing concrete few-shot examples
- Writing examples that only match pre-specified cases, so the model can't generalize to novel patterns
- Showing only the chosen answer without the reasoning that distinguishes it from plausible alternatives
- Omitting examples of varied document formats, leaving required fields extracted as empty or null
Enforce structured output using tool use and JSON schemas
When you need machine-parseable output, the most reliable approach is tool use with a JSON schema: you define an extraction tool whose input schema is the shape you want, and you read the structured data straight out of the tool_use response. Because the schema is enforced, this eliminates JSON syntax errors entirely. How hard you force it depends on tool_choice: "auto" lets the model return plain text instead of calling a tool (no guarantee), "any" requires it to call some tool (use this when several extraction schemas exist and the document type is unknown), and {"type": "tool", "name": "extract_metadata"} forces one specific tool to run — handy when a particular extraction must happen before enrichment steps.
A strict schema fixes syntax, not semantics: the JSON can be valid yet wrong — line items that don't sum to the total, or a value in the wrong field (that's Task 4.4's job). Design the schema to match reality: mark fields optional/nullable when a source may not contain them, which stops the model fabricating values to satisfy required fields; add enum values like "unclear" for ambiguous cases and an "other" + detail-string pattern for extensible categories; and put format-normalization rules in the prompt alongside the schema to absorb inconsistent source formatting.
Key Behaviors
- Define extraction tools with JSON schemas as input parameters, then read the structured data from the tool_use response
- tool_choice: "any" guarantees structured output when multiple extraction schemas exist and the document type is unknown; forcing {"type": "tool", "name": "extract_metadata"} runs a specific extraction before enrichment
- Strict schemas via tool use eliminate JSON syntax errors but not semantic errors (line items that don't sum, values in the wrong field)
- Make fields optional/nullable when the source may lack them — prevents the model fabricating values to satisfy required fields
- Add an enum value "unclear" for ambiguous cases and "other" + detail for extensible categories; include format-normalization rules in the prompt alongside the schema
How It Works
Distractor Traps (common wrong answers)
- Using tool_choice: "auto" when you need guaranteed structured output; the model may return text instead of calling a tool
- Assuming a strict JSON schema also prevents semantic errors; it only eliminates syntax errors
- Marking fields required when the source may not contain them — forces the model to fabricate values
- Asking for JSON in the prompt alone, without tool use — reintroduces the syntax errors tool use was meant to remove
Faded Example (Messages API — tool with JSON schema + tool_choice)
tools = [{
    "name": "extract_metadata",
    "description": "Extract invoice fields from the document",
    "input_schema": {
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},
            "category": {"type": "string", "enum": ["hardware", "services", "other"]},
            "po_number": {"type": ["string", "null"]}  # nullable: source may omit it
        },
        "required": ["invoice_number"]
    }
}]
# Force this exact tool so the extraction runs before enrichment:
tool_choice = {"type": "tool", "name": "extract_metadata"}
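Once the call returns, the structured data sits in the `input` of a `tool_use` content block. The sketch below reads it from a response that has already been parsed into a plain dict. `read_tool_input` is a hypothetical helper, and `sample_response` is a hand-written stand-in for a response body, not real API output.

```python
def read_tool_input(response: dict, tool_name: str) -> dict:
    """Return the `input` of the first tool_use block for `tool_name`."""
    for block in response.get("content", []):
        if block.get("type") == "tool_use" and block.get("name") == tool_name:
            return block["input"]
    # With tool_choice "auto" the model may answer in plain text instead.
    raise ValueError(f"no tool_use block found for {tool_name!r}")


# Hand-written stand-in for a parsed response body (assumed shape).
sample_response = {
    "stop_reason": "tool_use",
    "content": [
        {
            "type": "tool_use",
            "id": "toolu_example",
            "name": "extract_metadata",
            "input": {"invoice_number": "INV-001", "category": "other",
                      "po_number": None},
        },
    ],
}

fields = read_tool_input(sample_response, "extract_metadata")
print(fields["invoice_number"], fields["po_number"])
```

Raising on a missing block, rather than returning an empty dict, keeps a text-only reply from being mistaken for an extraction with no fields.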
Implement validation, retry, and feedback loops for extraction quality
Tool use removes syntax errors, but the extracted values can still be wrong — so you validate, then retry with the specific errors fed back. A correction request includes the original document, the failed extraction, and the exact validation errors, which guides the model to fix itself. To catch semantic problems in the first place, design the validation into the schema: extract "calculated_total" alongside "stated_total" so a mismatch is flagged automatically, and add a "conflict_detected" boolean for inconsistent source data.
Retry has a hard limit: it works for format mismatches and structural output errors, but it is ineffective when the required information is simply absent from the source — no amount of re-asking conjures data that lives only in an external document you didn't provide. Recognize that case and supply the source or escalate instead of looping. Finally, close the feedback loop over time: add a "detected_pattern" field that records which code construct triggered each finding, so when developers dismiss findings you can analyze the patterns and fix the false-positive sources at their root.
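To make the feedback loop concrete, here is a rough sketch of analyzing dismissals by "detected_pattern". The finding records, the `dismissed` flag, and the pattern names are made up for illustration; how dismissals are actually recorded depends on your review tooling.

```python
from collections import Counter


def dismissal_rates(findings: list[dict]) -> dict[str, float]:
    """Share of findings dismissed by developers, per detected_pattern."""
    total = Counter(f["detected_pattern"] for f in findings)
    dismissed = Counter(f["detected_pattern"] for f in findings if f["dismissed"])
    return {pattern: dismissed[pattern] / total[pattern] for pattern in total}


# Hypothetical review history.
findings = [
    {"detected_pattern": "broad-except", "dismissed": True},
    {"detected_pattern": "broad-except", "dismissed": True},
    {"detected_pattern": "broad-except", "dismissed": False},
    {"detected_pattern": "sql-string-concat", "dismissed": False},
]

rates = dismissal_rates(findings)
noisiest = max(rates, key=rates.get)
print(noisiest, rates[noisiest])
```

A pattern with a high dismissal rate is a candidate for tighter criteria in the prompt, or for a temporary pause while the criteria are reworked.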
Key Behaviors
- On failure, send a follow-up request with the original document, the failed extraction, and the specific validation errors for self-correction
- Retry succeeds for format mismatches and structural output errors; it is ineffective when the information exists only in an external document not provided
- Build self-correcting validation: extract "calculated_total" alongside "stated_total" to flag discrepancies, and add "conflict_detected" booleans for inconsistent source data
- Add a "detected_pattern" field to findings so dismissed findings can be analyzed for false-positive patterns over time
- Semantic validation errors (values don't sum, wrong field) are distinct from schema syntax errors, which tool use already eliminates
How It Works
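A minimal sketch of the validate-then-retry step, assuming amounts are stored as integer cents and that the extraction carries the "calculated_total", "stated_total", and "conflict_detected" fields described above. The `validate` and `build_retry_message` helpers, the `amount_cents` field, and the sample invoice are hypothetical.

```python
import json


def validate(extraction: dict) -> list[str]:
    """Return semantic errors; an empty list means the extraction passed."""
    errors = []
    line_sum = sum(item["amount_cents"] for item in extraction["line_items"])
    if line_sum != extraction["calculated_total"]:
        errors.append(
            f"line items sum to {line_sum} but calculated_total is "
            f"{extraction['calculated_total']}"
        )
    mismatch = extraction["calculated_total"] != extraction["stated_total"]
    if mismatch and not extraction["conflict_detected"]:
        errors.append(
            "stated_total differs from calculated_total but conflict_detected is false"
        )
    return errors


def build_retry_message(document: str, failed: dict, errors: list[str]) -> str:
    """Bundle the three things the correction request needs."""
    return "\n\n".join([
        "Original document:\n" + document,
        "Previous extraction:\n" + json.dumps(failed, indent=2),
        "Validation errors to fix:\n" + "\n".join(f"- {e}" for e in errors),
    ])


# Hypothetical source document and a failed extraction of it.
document = "Widget 10.00\nService 25.00\nTotal 40.00"
failed = {
    "line_items": [{"amount_cents": 1000}, {"amount_cents": 2500}],
    "calculated_total": 3000,
    "stated_total": 4000,
    "conflict_detected": False,
}
errors = validate(failed)
retry_message = build_retry_message(document, failed, errors)
print(retry_message)
```

In a real loop you would cap the number of retries and stop early when the errors point at information that is not in the document at all, since re-asking cannot recover it.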
Distractor Traps (common wrong answers)
- Retrying blindly without appending the specific validation errors that tell the model what to correct
- Retrying when the required information is simply absent from the source — no retry can produce data that isn't there
- Treating semantic errors (values don't sum, wrong field) as syntax errors — tool use fixes syntax, not logic
- Not tracking which constructs trigger findings (detected_pattern), so false-positive patterns go unanalyzed
Design efficient batch processing strategies
A manager wants to cut costs by moving everything to the Message Batches API for its 50% savings, but the right answer is to match each API to its workload. The batch API processes asynchronously within a window of up to 24 hours, with no guaranteed latency SLA, so it fits latency-tolerant, non-blocking work (overnight reports, weekly audits, nightly test generation) and is wrong for a blocking pre-merge check a developer waits on. Keep the synchronous API for anything blocking; "often faster" is not a guarantee. One more constraint: the batch API does not support multi-turn tool calling within a single request, so it can't execute a tool mid-request and feed results back.
Operate it deliberately. Each request carries a custom_id that correlates request and response when results return hours later and lets you resubmit only the failed documents, chunking any that exceeded the context limit. Size your submission cadence to your SLA: a document that just misses one submission waits up to 4 hours for the next and then up to 24 hours for processing, so submitting every 4 hours gives a worst case of 28 hours, which fits a 30-hour end-to-end SLA. And before you run large volumes, refine the prompt on a small sample set first to maximize first-pass success and avoid expensive iterative resubmission.
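The cadence arithmetic can be written out as a short sketch. It assumes every batch finishes within the 24-hour processing window and that a document arriving just after a submission waits one full interval for the next; both function names are hypothetical.

```python
def worst_case_hours(submit_interval_h: float, processing_window_h: float = 24) -> float:
    """Longest wait: just missed a submission, then the full processing window."""
    return submit_interval_h + processing_window_h


def max_submit_interval(sla_h: float, processing_window_h: float = 24) -> float:
    """Largest submission interval that still fits inside the SLA."""
    interval = sla_h - processing_window_h
    if interval <= 0:
        raise ValueError("SLA is shorter than the processing window; batch cannot meet it")
    return interval


print(worst_case_hours(4))        # 4 h wait + 24 h processing
print(max_submit_interval(30))    # slack left after the 24 h window
```

Submitting every 4 hours against a 30-hour SLA leaves 2 hours of headroom; an SLA at or under 24 hours cannot be met by batch at all under these assumptions.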
Key Behaviors
- Match the API to latency needs: synchronous for blocking pre-merge checks, batch for overnight/weekly analysis
- Message Batches API: 50% cost savings, up to a 24-hour processing window, no guaranteed latency SLA; no multi-turn tool calling within a single request
- Calculate submission frequency from the SLA, e.g. submitting every 4 hours with 24-hour batch processing gives a 28-hour worst case, which fits a 30-hour SLA
- Handle failures by resubmitting only the failed documents (identified by custom_id), e.g. chunking documents that exceeded context limits
- Refine the prompt on a sample set before batch-processing large volumes to maximize first-pass success and reduce resubmission cost
How It Works
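A sketch of resubmitting only the failures, keyed by custom_id. The result entries below are simplified hand-written stand-ins (a `custom_id` plus a `result` with a `type`), the request dicts are reduced to an id and a document, and the sketch assumes every failure was caused by an oversized document. The helper names and sample data are hypothetical.

```python
def chunk(text: str, size: int) -> list[str]:
    """Split text into pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_resubmission(documents: dict[str, str], results: list[dict],
                       chunk_size: int) -> list[dict]:
    """Create new requests for failed documents only, chunked to fit."""
    retry = []
    for entry in results:
        if entry["result"]["type"] == "succeeded":
            continue  # already done; never pay for it twice
        doc_id = entry["custom_id"]
        for n, piece in enumerate(chunk(documents[doc_id], chunk_size)):
            # Derive a new custom_id so each chunk can be traced to its source.
            retry.append({"custom_id": f"{doc_id}-part-{n}", "document": piece})
    return retry


# Hypothetical inputs: two documents, one of which failed.
documents = {"doc-1": "a" * 10, "doc-2": "b" * 25}
results = [
    {"custom_id": "doc-1", "result": {"type": "succeeded"}},
    {"custom_id": "doc-2", "result": {"type": "errored"}},
]

retry_requests = build_resubmission(documents, results, chunk_size=10)
print([r["custom_id"] for r in retry_requests])
```

Because correlation goes through custom_id, the order in which results come back does not matter.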
Distractor Traps (common wrong answers)
- Switching a blocking pre-merge check to batch processing — there is no guaranteed SLA, so "often faster" isn't acceptable
- Using "batch with status polling" for a blocking workflow to dodge the latency problem
- Claiming you must avoid batch because of "result ordering issues"; custom_id correlates request and response
- Adding a "timeout fallback to real-time" instead of simply matching each API to its appropriate use case
Recommended Material
custom_id correlation
Design multi-instance and multi-pass review architectures
A single-pass review of a 14-file pull request gives inconsistent results — detailed feedback on some files, superficial on others, obvious bugs missed, and contradictory verdicts that flag a pattern in one file while approving identical code in another. Two architectural moves fix this. First, use a second, independent Claude instance to review the generated code: the model that wrote the code retains its own reasoning context and is therefore less likely to question its own decisions, so a fresh instance with no generation context catches subtle issues that self-review instructions or extended thinking cannot.
Second, split a large review into focused passes: analyze each file individually for local issues, then run a separate integration pass examining cross-file data flow. This avoids the attention dilution and contradictory findings that come from processing everything at once — and a bigger context window does not fix it, because the problem is attention quality, not capacity. Optionally add a verification pass where the model self-reports a confidence level alongside each finding, enabling calibrated routing — auto-apply high-confidence findings, send low-confidence ones to a human.
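The routing step at the end might look like the sketch below. The 0.8 threshold, the numeric `confidence` field, and the sample findings are assumptions for illustration; the text only says that confidence is self-reported per finding.

```python
def route_findings(findings: list[dict], threshold: float = 0.8):
    """Split findings into auto-apply and human-review queues by confidence."""
    auto_apply, human_review = [], []
    for finding in findings:
        if finding["confidence"] >= threshold:
            auto_apply.append(finding)
        else:
            human_review.append(finding)
    return auto_apply, human_review


# Hypothetical output of a verification pass.
findings = [
    {"issue": "unclosed file handle", "confidence": 0.95},
    {"issue": "possible race in cache refresh", "confidence": 0.55},
]

auto_apply, human_review = route_findings(findings)
print(len(auto_apply), len(human_review))
```

Confidence here decides who handles a finding, not whether the finding is reported at all; the reporting rules still come from explicit criteria.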
Key Behaviors
- Use a second independent Claude instance to review generated code without the generator's reasoning context
- A model retains reasoning context from generation, so it is less likely to question its own decisions; an independent instance beats self-review instructions or extended thinking
- Split large multi-file reviews into per-file passes for local issues plus a separate integration pass for cross-file data flow
- Per-file + integration passes avoid the attention dilution and contradictory findings of a single all-files pass
- Run verification passes where the model self-reports confidence alongside each finding to enable calibrated review routing
How It Works
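One possible shape for the per-file plus integration structure, with the model call injected as a plain function so the orchestration can be shown without a real client. `multi_pass_review`, the `Reviewer` type, the prompts, and `fake_review` are all hypothetical; in practice each `review` call would be a fresh request that carries none of the generator's context.

```python
from typing import Callable

Reviewer = Callable[[str], list[str]]  # prompt in, list of findings out


def multi_pass_review(files: dict[str, str], review: Reviewer) -> dict[str, list[str]]:
    """Run one focused pass per file, then one cross-file integration pass."""
    report = {}
    for path, source in files.items():
        # Local pass: the prompt contains this file and nothing else.
        report[path] = review(
            f"Review ONLY this file for local issues.\nFile: {path}\n{source}"
        )
    combined = "\n\n".join(f"File: {p}\n{s}" for p, s in files.items())
    # Integration pass: all files, but the task is limited to cross-file data flow.
    report["<integration>"] = review(
        "Review ONLY cross-file data flow between these files.\n" + combined
    )
    return report


calls = []


def fake_review(prompt: str) -> list[str]:
    """Stand-in reviewer that records prompts instead of calling a model."""
    calls.append(prompt)
    return []


files = {"a.py": "def load(): return 1", "b.py": "from a import load"}
report = multi_pass_review(files, fake_review)
print(list(report))
```

Keeping each local prompt down to a single file is what counters attention dilution; the integration pass sees everything but is asked only one question.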
Distractor Traps (common wrong answers)
- Switching to a larger context window to fix attention dilution — capacity isn't the problem, attention quality is
- Running multiple full-PR passes and requiring consensus — suppresses real bugs that are only caught intermittently
- Forcing developers to split PRs into smaller submissions — shifts the burden without improving the system
- Reviewing with the same session that generated the code, or relying on extended thinking, instead of an independent instance
Further Reading — Claude Docs
Official Anthropic documentation for the concepts in this domain.
Further Reading — Anthropic Academy on Skilljar
Optional self-paced courses.
Further Viewing — Peace Of Code (YouTube)
Watch if a topic is still unclear.
CCA Full Course — Peace Of Code