Filter:
QuestionTask 4.1

Why don't instructions like "be conservative" or "only report high-confidence findings" improve precision — and what does?

Space to flip
AnswerTask 4.1

They leave "which issues count" undefined. Explicit categorical criteria do the work — e.g. "flag comments only when claimed behavior contradicts actual code behavior," not the vague "check that comments are accurate."

← → navigate
QuestionTask 4.1

How do you write review criteria that actually raise precision?

Space to flip
AnswerTask 4.1

Define which issues to report (bugs, security) vs skip (minor style, local patterns) as categorical rules — not confidence-based filtering.

← → navigate
QuestionTask 4.1

How do you get consistent severity classification across runs?

Space to flip
AnswerTask 4.1

Define explicit severity criteria with a concrete code example for each level, so the model classifies the same way every run instead of guessing what "critical" means.

← → navigate
QuestionTask 4.1

A review category has a high false-positive rate. What's the impact, and the fix?

Space to flip
AnswerTask 4.1

A high-FP category undermines developer trust in every category — even accurate ones. Temporarily disable it to restore trust while you improve its prompt, then re-enable.

← → navigate
QuestionTask 4.2

Detailed written instructions still produce inconsistent output. Most effective fix?

Space to flip
AnswerTask 4.2

Few-shot prompting — give 2–4 worked examples. It's the most effective technique when detailed instructions alone produce inconsistent, inconsistently-formatted output.

← → navigate
QuestionTask 4.2

What should a good few-shot example for an ambiguous case contain?

Space to flip
AnswerTask 4.2

The reasoning for why one action was chosen over plausible alternatives — so it teaches judgment for ambiguous cases (tool selection, branch-level coverage gaps), not just an answer.

← → navigate
QuestionTask 4.2

How do you use few-shot examples to lock in output format?

Space to flip
AnswerTask 4.2

Include examples demonstrating the exact desired format — e.g. location · issue · severity · suggested fix — so output stays consistent.

← → navigate
QuestionTask 4.2

Why do few-shot examples reduce false positives without overfitting?

Space to flip
AnswerTask 4.2

Examples distinguishing acceptable code patterns from genuine issues cut false positives, while still letting the model generalize to novel patterns rather than matching only pre-specified cases.

← → navigate
QuestionTask 4.2

How do few-shot examples reduce hallucination in extraction tasks?

Space to flip
AnswerTask 4.2

Examples covering varied document structures (inline citations vs bibliographies, methodology vs embedded details) and varied formats address empty/null extraction of required fields and informal measurements.

← → navigate
QuestionTask 4.3

Most reliable way to guarantee schema-compliant structured output?

Space to flip
AnswerTask 4.3

Tool use with a JSON schema. Define an extraction tool whose input schema is the shape you want and read the data from the tool_use response — this eliminates JSON syntax errors.

← → navigate
QuestionTask 4.3

Distinguish tool_choice: "auto" vs "any" vs forced.

Space to flip
AnswerTask 4.3
  • "auto" — model MAY call a tool or return text (no guarantee)
  • "any" — MUST call a tool, model picks which (unknown doc type, multiple schemas)
  • {"type":"tool","name":"extract_metadata"} — forces one specific tool
← → navigate
QuestionTask 4.3

A strict JSON schema via tool use eliminates which errors — and which does it NOT?

Space to flip
AnswerTask 4.3

It eliminates syntax errors. It does not prevent semantic errors — line items that don't sum to the total, or a value in the wrong field.

← → navigate
QuestionTask 4.3

How do you stop the model fabricating values for fields the source doesn't contain?

Space to flip
AnswerTask 4.3

Mark those schema fields optional / nullable. Required fields that may be absent force the model to invent values to satisfy them.

← → navigate
QuestionTask 4.3

Two schema-design patterns for messy real-world categories + formats?

Space to flip
AnswerTask 4.3

Add enum "unclear" for ambiguous cases and "other" + a detail string for extensible categories. Put format-normalization rules in the prompt alongside the strict schema.

← → navigate
QuestionTask 4.4

What goes into a retry-with-error-feedback request?

Space to flip
AnswerTask 4.4

The original document, the failed extraction, and the specific validation errors — appended to the prompt so the model self-corrects.

← → navigate
QuestionTask 4.4

When does retry succeed, and when is it ineffective?

Space to flip
AnswerTask 4.4

Succeeds: format mismatches and structural output errors. Ineffective: when the required information is simply absent from the source (e.g. it lives only in an external document you didn't provide).

← → navigate
QuestionTask 4.4

How do you design a schema that flags semantic errors itself?

Space to flip
AnswerTask 4.4

Extract "calculated_total" alongside "stated_total" to flag discrepancies, and add a "conflict_detected" boolean for inconsistent source data.

← → navigate
QuestionTask 4.4

What is the detected_pattern field for?

Space to flip
AnswerTask 4.4

It records which code construct triggered each finding. When developers dismiss findings, you analyze these patterns to find and fix false-positive sources. (Semantic errors ≠ syntax errors, which tool use already removes.)

← → navigate
QuestionTask 4.5

Key specs of the Message Batches API?

Space to flip
AnswerTask 4.5
  • 50% cost savings
  • up to a 24-hour processing window
  • no guaranteed latency SLA
  • no multi-turn tool calling within a single request
← → navigate
QuestionTask 4.5

Which workloads suit batch vs the synchronous API?

Space to flip
AnswerTask 4.5

Batch: non-blocking, latency-tolerant work — overnight reports, weekly audits, nightly test generation. Sync: blocking workflows like pre-merge checks a developer waits on.

← → navigate
QuestionTask 4.5

How do you size batch submission frequency to an SLA?

Space to flip
AnswerTask 4.5

Calculate from the processing window: with 24-hour batch processing, submitting every 4 hours guarantees a 30-hour end-to-end SLA.

← → navigate
QuestionTask 4.5

What does custom_id enable, and what should you do before processing large volumes?

Space to flip
AnswerTask 4.5

custom_id correlates request/response pairs and lets you resubmit only failed documents (e.g. chunk ones that exceeded context limits). First, refine the prompt on a sample set to maximize first-pass success.

← → navigate
QuestionTask 4.6

Why is a second independent Claude instance better at reviewing code than self-review?

Space to flip
AnswerTask 4.6

The model that generated the code retains its reasoning context and won't question its own decisions. An independent instance (no generation context) catches subtle issues — more effectively than self-review instructions or extended thinking.

← → navigate
QuestionTask 4.6

How do you restructure a large multi-file review that gives inconsistent, contradictory results?

Space to flip
AnswerTask 4.6

Split into focused passes: analyze each file individually for local issues, then run a separate integration pass for cross-file data flow. Avoids attention dilution and contradictory findings.

← → navigate
QuestionTask 4.6

What does a verification pass with self-reported confidence enable?

Space to flip
AnswerTask 4.6

The model reports a confidence level alongside each finding, enabling calibrated review routing — auto-apply high-confidence findings, send low-confidence ones to a human.

← → navigate
QuestionTask 4.6

Trap check: three wrong fixes for an inconsistent 14-file single-pass review?

Space to flip
AnswerTask 4.6
  • "Use a larger context window" — the problem is attention quality, not capacity
  • "Require consensus across multiple full-PR passes" — suppresses intermittently-caught bugs
  • "Make developers split the PR" — shifts the burden, doesn't fix the system
← → navigate

Keyboard: navigate · Space flip · S shuffle · R restart · G got it · V review again