CCA Flash Cards – Prompt Engineering & Structured Output

QuestionTask 4.1

Why don't instructions like "be conservative" or "only report high-confidence findings" improve precision — and what does?

Space to flip

AnswerTask 4.1

They leave "which issues count" undefined. Explicit categorical criteria do the work — e.g. "flag comments only when claimed behavior contradicts actual code behavior," not the vague "check that comments are accurate."

← → navigate

QuestionTask 4.1

How do you write review criteria that actually raise precision?

Space to flip

AnswerTask 4.1

Define which issues to report (bugs, security) vs skip (minor style, local patterns) as categorical rules — not confidence-based filtering.

← → navigate

QuestionTask 4.1

How do you get consistent severity classification across runs?

Space to flip

AnswerTask 4.1

Define explicit severity criteria with a concrete code example for each level, so the model classifies the same way every run instead of guessing what "critical" means.

← → navigate

QuestionTask 4.1

A review category has a high false-positive rate. What's the impact, and the fix?

Space to flip

AnswerTask 4.1

A high-FP category undermines developer trust in every category — even accurate ones. Temporarily disable it to restore trust while you improve its prompt, then re-enable.

← → navigate

QuestionTask 4.2

Detailed written instructions still produce inconsistent output. Most effective fix?

Space to flip

AnswerTask 4.2

Few-shot prompting — give 2–4 worked examples. It's the most effective technique when detailed instructions alone produce inconsistent, inconsistently-formatted output.

← → navigate

QuestionTask 4.2

What should a good few-shot example for an ambiguous case contain?

Space to flip

AnswerTask 4.2

The reasoning for why one action was chosen over plausible alternatives — so it teaches judgment for ambiguous cases (tool selection, branch-level coverage gaps), not just an answer.

← → navigate

QuestionTask 4.2

How do you use few-shot examples to lock in output format?

Space to flip

AnswerTask 4.2

Include examples demonstrating the exact desired format — e.g. location · issue · severity · suggested fix — so output stays consistent.

← → navigate

QuestionTask 4.2

Why do few-shot examples reduce false positives without overfitting?

Space to flip

AnswerTask 4.2

Examples distinguishing acceptable code patterns from genuine issues cut false positives, while still letting the model generalize to novel patterns rather than matching only pre-specified cases.

← → navigate

QuestionTask 4.2

How do few-shot examples reduce hallucination in extraction tasks?

Space to flip

AnswerTask 4.2

Examples covering varied document structures (inline citations vs bibliographies, methodology vs embedded details) and varied formats address empty/null extraction of required fields and informal measurements.

← → navigate

QuestionTask 4.3

Most reliable way to guarantee schema-compliant structured output?

Space to flip

AnswerTask 4.3

Tool use with a JSON schema. Define an extraction tool whose input schema is the shape you want and read the data from the tool_use response — this eliminates JSON syntax errors.

← → navigate

QuestionTask 4.3

Distinguish tool_choice: "auto" vs "any" vs forced.

Space to flip

AnswerTask 4.3

"auto" — model MAY call a tool or return text (no guarantee)
"any" — MUST call a tool, model picks which (unknown doc type, multiple schemas)
{"type":"tool","name":"extract_metadata"} — forces one specific tool

← → navigate

QuestionTask 4.3

A strict JSON schema via tool use eliminates which errors — and which does it NOT?

Space to flip

AnswerTask 4.3

It eliminates syntax errors. It does not prevent semantic errors — line items that don't sum to the total, or a value in the wrong field.

← → navigate

QuestionTask 4.3

How do you stop the model fabricating values for fields the source doesn't contain?

Space to flip

AnswerTask 4.3

Mark those schema fields optional / nullable. Required fields that may be absent force the model to invent values to satisfy them.

← → navigate

QuestionTask 4.3

Two schema-design patterns for messy real-world categories + formats?

Space to flip

AnswerTask 4.3

Add enum "unclear" for ambiguous cases and "other" + a detail string for extensible categories. Put format-normalization rules in the prompt alongside the strict schema.

← → navigate

QuestionTask 4.4

What goes into a retry-with-error-feedback request?

Space to flip

AnswerTask 4.4

The original document, the failed extraction, and the specific validation errors — appended to the prompt so the model self-corrects.

← → navigate

QuestionTask 4.4

When does retry succeed, and when is it ineffective?

Space to flip

AnswerTask 4.4

Succeeds: format mismatches and structural output errors. Ineffective: when the required information is simply absent from the source (e.g. it lives only in an external document you didn't provide).

← → navigate

QuestionTask 4.4

How do you design a schema that flags semantic errors itself?

Space to flip

AnswerTask 4.4

Extract "calculated_total" alongside "stated_total" to flag discrepancies, and add a "conflict_detected" boolean for inconsistent source data.

← → navigate

QuestionTask 4.4

What is the detected_pattern field for?

Space to flip

AnswerTask 4.4

It records which code construct triggered each finding. When developers dismiss findings, you analyze these patterns to find and fix false-positive sources. (Semantic errors ≠ syntax errors, which tool use already removes.)

← → navigate

QuestionTask 4.5

Key specs of the Message Batches API?

Space to flip

AnswerTask 4.5

50% cost savings
up to a 24-hour processing window
no guaranteed latency SLA
no multi-turn tool calling within a single request

← → navigate

QuestionTask 4.5

Which workloads suit batch vs the synchronous API?

Space to flip

AnswerTask 4.5

Batch: non-blocking, latency-tolerant work — overnight reports, weekly audits, nightly test generation. Sync: blocking workflows like pre-merge checks a developer waits on.

← → navigate

QuestionTask 4.5

How do you size batch submission frequency to an SLA?

Space to flip

AnswerTask 4.5

Calculate from the processing window: with 24-hour batch processing, submitting every 4 hours guarantees a 30-hour end-to-end SLA.

← → navigate

QuestionTask 4.5

What does custom_id enable, and what should you do before processing large volumes?

Space to flip

AnswerTask 4.5

custom_id correlates request/response pairs and lets you resubmit only failed documents (e.g. chunk ones that exceeded context limits). First, refine the prompt on a sample set to maximize first-pass success.

← → navigate

QuestionTask 4.6

Why is a second independent Claude instance better at reviewing code than self-review?

Space to flip

AnswerTask 4.6

The model that generated the code retains its reasoning context and won't question its own decisions. An independent instance (no generation context) catches subtle issues — more effectively than self-review instructions or extended thinking.

← → navigate

QuestionTask 4.6

How do you restructure a large multi-file review that gives inconsistent, contradictory results?

Space to flip

AnswerTask 4.6

Split into focused passes: analyze each file individually for local issues, then run a separate integration pass for cross-file data flow. Avoids attention dilution and contradictory findings.

← → navigate

QuestionTask 4.6

What does a verification pass with self-reported confidence enable?

Space to flip

AnswerTask 4.6

The model reports a confidence level alongside each finding, enabling calibrated review routing — auto-apply high-confidence findings, send low-confidence ones to a human.

← → navigate

QuestionTask 4.6

Trap check: three wrong fixes for an inconsistent 14-file single-pass review?

Space to flip

AnswerTask 4.6

"Use a larger context window" — the problem is attention quality, not capacity
"Require consensus across multiple full-PR passes" — suppresses intermittently-caught bugs
"Make developers split the PR" — shifts the burden, doesn't fix the system

← → navigate

Keyboard: ← → navigate · Space flip · S shuffle · R restart · G got it · V review again