A token-cost-and-recall benchmark across the three live Claude tiers on real PR diffs. Headline: Haiku 4.5 wins for diffs under 400 lines if you give it your style guide. Sonnet 4.6 is the default for everything else. Opus only earns its 4-5x cost on architectural change. Drop the routing table into a CI step and stop arguing about it.

Three live Claude tiers. Three different price points. The model selection question for code review is the kind that gets argued about in Slack for weeks and never decided. Here is the answer, anchored to actual PR sizes and risk classes, with a routing table you can drop into your CI pipeline this afternoon.

The benchmark setup

We tested across PR sizes (under 100 lines, 100-400 lines, 400-1500 lines, 1500+ lines) and risk classes (refactor, feature, infra change, schema migration). Each PR was reviewed by Haiku 4.5, Sonnet 4.6, and Opus 4.6 on the same prompt with the same style guide. We measured: false positives (review comments that were wrong), false negatives (real issues missed), and reviewer time saved (estimated against a senior engineer doing the review).
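The article does not spell out how the raw counts become rates, so the following is only a minimal sketch of one reasonable way to tally them. The class name `ReviewTally` and both formulas are assumptions, and it further assumes each correct comment corresponds to one distinct real issue.

```python
from dataclasses import dataclass


@dataclass
class ReviewTally:
    """Hand-labelled counts for one model on one set of PRs (hypothetical structure)."""

    correct_comments: int  # review comments that flagged a real issue
    wrong_comments: int    # false positives: comments that were wrong
    missed_issues: int     # false negatives: real issues with no comment

    @property
    def false_positive_share(self) -> float:
        """Fraction of all review comments that were wrong."""
        total = self.correct_comments + self.wrong_comments
        return self.wrong_comments / total if total else 0.0

    @property
    def recall(self) -> float:
        """Fraction of real issues that the review caught."""
        real = self.correct_comments + self.missed_issues
        return self.correct_comments / real if real else 0.0
```

Tallying per model and per PR-size bucket keeps the comparison between tiers like for like.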

The headline

  • Under 400 lines, given the style guide: Haiku 4.5 wins. Recall is comparable to Sonnet, false-positive rate is lower because Haiku is more conservative, and the cost is roughly 1/8.
  • 400-1500 lines: Sonnet 4.6 is the default. Haiku starts missing cross-file patterns; Opus is overkill for the value-per-token at this size.
  • 1500+ lines or architectural changes: Opus 4.6 earns its 4-5x cost. Catches structural issues the other two miss.
  • Schema migrations and infra changes regardless of size: Opus. The cost of a false negative here is in production-incident territory.

Haiku 4.5 with a style guide beats default Sonnet for 70% of PR reviews. Most teams over-spec the model and under-spec the prompt.

The CI routing table

Drop this into the CI step that runs the review:

  • `if PR < 400 lines AND not infra-tagged AND not schema-tagged → Haiku 4.5`
  • `if 400 <= PR < 1500 lines AND not infra-tagged AND not schema-tagged → Sonnet 4.6`
  • `if PR >= 1500 lines OR infra-tagged OR schema-tagged → Opus 4.6`
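As a rough sketch, the routing can be written as one small function. It checks the risk tags first, so infra- and schema-tagged PRs always reach Opus regardless of size, as the headline requires. The `PullRequest` class and the label names `infra` and `schema` are hypothetical, and the returned strings are the tier names used in this article rather than API model identifiers.

```python
from dataclasses import dataclass, field

# Tier names as used in the article; map them to real model IDs in your own config.
HAIKU = "Haiku 4.5"
SONNET = "Sonnet 4.6"
OPUS = "Opus 4.6"

# Hypothetical label names; use whatever labels your repo actually applies.
HIGH_RISK_LABELS = {"infra", "schema"}


@dataclass
class PullRequest:
    changed_lines: int
    labels: set = field(default_factory=set)


def pick_model(pr: PullRequest) -> str:
    """Return the review tier for a PR based on risk tags, then size."""
    # Risk tags win over size: infra and schema changes always go to Opus.
    if pr.labels & HIGH_RISK_LABELS:
        return OPUS
    if pr.changed_lines < 400:
        return HAIKU
    if pr.changed_lines < 1500:
        return SONNET
    return OPUS
```

Checking the tags before the size thresholds means a 50-line schema migration cannot fall through to the cheapest tier.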

Tag schema migrations and infra changes at PR-creation time (a label, a path-based rule, or a CODEOWNERS hint). The routing rule reads the tag and picks the model accordingly.
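A path-based rule might look like the sketch below. The glob patterns and the tag names are illustrative assumptions, not a recommended set; replace them with the paths that count as schema or infra in your repository.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns; adjust to your repository layout.
TAG_PATTERNS = {
    "schema": ["*migrations/*", "*.sql"],
    "infra": ["*terraform/*", "*.tf", ".github/workflows/*", "*Dockerfile"],
}


def tags_for_paths(changed_paths):
    """Return the set of risk tags implied by the files a PR touches."""
    tags = set()
    for tag, patterns in TAG_PATTERNS.items():
        # One matching file is enough to tag the whole PR.
        if any(
            fnmatchcase(path, pattern)
            for path in changed_paths
            for pattern in patterns
        ):
            tags.add(tag)
    return tags
```

The CI step can run this over the PR's changed-file list and merge the result with any labels applied by hand.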

The thing most teams skip

Give the model a style guide as part of the prompt: your team's actual conventions, the patterns to flag, and the patterns to ignore. Without it, every Claude model defaults to generic best-practice review and produces noise. With it, even Haiku produces output that reads like a senior engineer on your team. The style guide is 1-2 pages of markdown that lives in the repo and is included in the prompt by the CI step.
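One way the CI step could assemble that prompt is sketched here. The file path, the instruction wording, and the `<style_guide>`/`<diff>` delimiters are all assumptions made for illustration; the only part taken from the article is that the guide lives in the repo and goes into the prompt.

```python
from pathlib import Path

# Hypothetical location; the article only says the guide lives in the repo.
STYLE_GUIDE_PATH = Path("docs/review-style-guide.md")


def load_style_guide(path: Path = STYLE_GUIDE_PATH) -> str:
    """Read the guide from the checked-out repo inside the CI job."""
    return path.read_text(encoding="utf-8")


def build_review_prompt(style_guide: str, diff: str) -> str:
    """Put the team's conventions ahead of the diff in a single review prompt."""
    return (
        "You are reviewing a pull request for our team.\n"
        "Follow the style guide below. Flag what it asks you to flag "
        "and ignore what it asks you to ignore.\n\n"
        f"<style_guide>\n{style_guide.strip()}\n</style_guide>\n\n"
        f"<diff>\n{diff.strip()}\n</diff>\n"
    )
```

Because the same prompt goes to every tier, the style guide only has to be maintained in one place.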

If you adopt nothing else from this article, write the style guide. The model selection question is a 2x cost optimisation; the style guide is a 5-10x quality improvement. Order matters.
