Why don't instructions like "be conservative" or "only report high-confidence findings" improve precision — and what does?
Space to flipThey leave "which issues count" undefined. Explicit categorical criteria do the work — e.g. "flag comments only when claimed behavior contradicts actual code behavior," not the vague "check that comments are accurate."
← → navigateHow do you write review criteria that actually raise precision?
Space to flipDefine which issues to report (bugs, security) vs skip (minor style, local patterns) as categorical rules — not confidence-based filtering.
← → navigateHow do you get consistent severity classification across runs?
Space to flipDefine explicit severity criteria with a concrete code example for each level, so the model classifies the same way every run instead of guessing what "critical" means.
← → navigateA review category has a high false-positive rate. What's the impact, and the fix?
Space to flipA high-FP category undermines developer trust in every category — even accurate ones. Temporarily disable it to restore trust while you improve its prompt, then re-enable.
← → navigateDetailed written instructions still produce inconsistent output. Most effective fix?
Space to flipFew-shot prompting — give 2–4 worked examples. It's the most effective technique when detailed instructions alone produce inconsistent, inconsistently-formatted output.
← → navigateWhat should a good few-shot example for an ambiguous case contain?
Space to flipThe reasoning for why one action was chosen over plausible alternatives — so it teaches judgment for ambiguous cases (tool selection, branch-level coverage gaps), not just an answer.
← → navigateHow do you use few-shot examples to lock in output format?
Space to flipInclude examples demonstrating the exact desired format — e.g. location · issue · severity · suggested fix — so output stays consistent.
← → navigateWhy do few-shot examples reduce false positives without overfitting?
Space to flipExamples distinguishing acceptable code patterns from genuine issues cut false positives, while still letting the model generalize to novel patterns rather than matching only pre-specified cases.
← → navigateHow do few-shot examples reduce hallucination in extraction tasks?
Space to flipExamples covering varied document structures (inline citations vs bibliographies, methodology vs embedded details) and varied formats address empty/null extraction of required fields and informal measurements.
← → navigateMost reliable way to guarantee schema-compliant structured output?
Space to flipTool use with a JSON schema. Define an extraction tool whose input schema is the shape you want and read the data from the tool_use response — this eliminates JSON syntax errors.
Distinguish tool_choice: "auto" vs "any" vs forced.
"auto" — model MAY call a tool or return text (no guarantee)"any" — MUST call a tool, model picks which (unknown doc type, multiple schemas){"type":"tool","name":"extract_metadata"} — forces one specific toolA strict JSON schema via tool use eliminates which errors — and which does it NOT?
Space to flipIt eliminates syntax errors. It does not prevent semantic errors — line items that don't sum to the total, or a value in the wrong field.
← → navigateHow do you stop the model fabricating values for fields the source doesn't contain?
Space to flipMark those schema fields optional / nullable. Required fields that may be absent force the model to invent values to satisfy them.
← → navigateTwo schema-design patterns for messy real-world categories + formats?
Space to flipAdd enum "unclear" for ambiguous cases and "other" + a detail string for extensible categories. Put format-normalization rules in the prompt alongside the strict schema.
What goes into a retry-with-error-feedback request?
Space to flipThe original document, the failed extraction, and the specific validation errors — appended to the prompt so the model self-corrects.
← → navigateWhen does retry succeed, and when is it ineffective?
Space to flipSucceeds: format mismatches and structural output errors. Ineffective: when the required information is simply absent from the source (e.g. it lives only in an external document you didn't provide).
← → navigateHow do you design a schema that flags semantic errors itself?
Space to flipExtract "calculated_total" alongside "stated_total" to flag discrepancies, and add a "conflict_detected" boolean for inconsistent source data.
What is the detected_pattern field for?
It records which code construct triggered each finding. When developers dismiss findings, you analyze these patterns to find and fix false-positive sources. (Semantic errors ≠ syntax errors, which tool use already removes.)
← → navigateKey specs of the Message Batches API?
Space to flipWhich workloads suit batch vs the synchronous API?
Space to flipBatch: non-blocking, latency-tolerant work — overnight reports, weekly audits, nightly test generation. Sync: blocking workflows like pre-merge checks a developer waits on.
← → navigateHow do you size batch submission frequency to an SLA?
Space to flipCalculate from the processing window: with 24-hour batch processing, submitting every 4 hours guarantees a 30-hour end-to-end SLA.
← → navigateWhat does custom_id enable, and what should you do before processing large volumes?
custom_id correlates request/response pairs and lets you resubmit only failed documents (e.g. chunk ones that exceeded context limits). First, refine the prompt on a sample set to maximize first-pass success.
Why is a second independent Claude instance better at reviewing code than self-review?
Space to flipThe model that generated the code retains its reasoning context and won't question its own decisions. An independent instance (no generation context) catches subtle issues — more effectively than self-review instructions or extended thinking.
← → navigateHow do you restructure a large multi-file review that gives inconsistent, contradictory results?
Space to flipSplit into focused passes: analyze each file individually for local issues, then run a separate integration pass for cross-file data flow. Avoids attention dilution and contradictory findings.
← → navigateWhat does a verification pass with self-reported confidence enable?
Space to flipThe model reports a confidence level alongside each finding, enabling calibrated review routing — auto-apply high-confidence findings, send low-confidence ones to a human.
← → navigateTrap check: three wrong fixes for an inconsistent 14-file single-pass review?
Space to flipKeyboard: ← → navigate · Space flip · S shuffle · R restart · G got it · V review again