May 03, 2026 · 7 min read

GPT-5.5, Opus 4.7, or both: the new model decision

Two flagship coding models in eight days. The honest answer to "which one" is not what either vendor is selling, and the question you should actually be asking is shaped differently.

Eight days separated Opus 4.7 (16 April) from GPT-5.5 (24 April). Two vendor pages. Two leaderboard claims. Three group chats already arguing about which one is "the one". You're going to be asked at standup which model your team should adopt, and the right answer is not on either vendor's marketing page.

Both models are real upgrades. Both have done well on the benchmarks each vendor chose to publish. Both will sit on your shortlist for the next two quarters. The question that matters isn't which one wins; it's which shape of work each one wins at, and whether your team's work has enough variety that the right answer is "both, routed".

"Pick one" is the wrong frame. You're going to route per task, or you're going to pay for the wrong model on half your work.

What each vendor actually shipped

The pitches, drained of marketing. We're using the vendor pages directly so you can audit the framing yourself.

Opus 4.7 (Anthropic, 16 April 2026). Headline feature is "adaptive thinking": the model decides how much reasoning to spend on a task, scaling time-to-answer with task complexity. Anthropic positions it as a sustained-reasoning model for ambiguous, long-horizon work. The Anthropic page reports Opus 4.7 resolving "3x more production tasks than Opus 4.6" on Rakuten-SWE-Bench. SWE-bench Verified scores around the high eighties have been quoted in coverage, including by CNBC. Treat that figure as vendor-cited; the leaderboard at BenchLM moves week to week.

GPT-5.5 (OpenAI, 24 April 2026). Headline feature is an agentic-coding focus: faster, more direct execution, tuned for tool use and command-line agency. OpenAI's launch claims 82.7% on Terminal-Bench 2.0 and 58.6% on SWE-bench Pro. Pricing is $5/$30 per million tokens, double GPT-5.4's $2.50/$15. The 58.6% figure is from OpenAI's own card; we haven't seen independent reproduction yet, so flag it as vendor-claimed when you cite it.

Simon Willison's read of the difference, after spending time with both, lands a useful framing: "5.4 is to 5.5 as Claude Sonnet is to Claude Opus". Capability tiers, not direct competitors. The two flagships overlap less than the headlines suggest.

Why the benchmarks don't settle it

Two reasons, both familiar by now.

Different benchmarks measure different things. SWE-bench Verified rewards multi-step issue resolution against real GitHub repositories. SWE-bench Pro is a harder variant. Terminal-Bench 2.0 rewards command-line task completion. None of these is "coding ability". They're all coding-shaped, and they reward different work.

Default settings are not the settings you'd run. Willison's pelican-on-a-bicycle test, the one he runs against every flagship model, caught GPT-5.5 producing a serviceable but flawed SVG on default reasoning. Switching to reasoning_effort: xhigh fixed the output, but at the cost of nearly four minutes of latency and roughly 9,300 reasoning tokens for a single picture. The benchmark was cheap; the actual answer was expensive. Ethan Mollick's "jagged frontier" line, which Willison cites, captures the same shape: model quality is task-shape-specific, not a scalar, and the cliffs in capability are unpredictable in advance.
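If you want to see that trade-off on your own prompts rather than take the pelican's word for it, the measurement is a few lines. A minimal sketch, assuming an OpenAI-style Python client where reasoning effort is a plain request parameter; the model id and the xhigh value are lifted from the coverage above, not verified against the live API:

```python
# Run the same prompt at default and elevated reasoning effort, and log what
# each answer actually cost. The model id and the "xhigh" value are assumptions
# taken from the post, not verified API values.
import time
from openai import OpenAI

client = OpenAI()
PROMPT = "Generate an SVG of a pelican riding a bicycle."

def run(effort=None):
    start = time.monotonic()
    resp = client.chat.completions.create(
        model="gpt-5.5",  # hypothetical model id
        messages=[{"role": "user", "content": PROMPT}],
        **({"reasoning_effort": effort} if effort else {}),
    )
    elapsed = time.monotonic() - start
    details = getattr(resp.usage, "completion_tokens_details", None)
    reasoning = getattr(details, "reasoning_tokens", None) if details else None
    print(f"effort={effort or 'default'}  latency={elapsed:.1f}s  reasoning_tokens={reasoning}")
    return resp.choices[0].message.content

cheap = run()              # the default-settings answer
expensive = run("xhigh")   # the four-minute, ~9,300-token answer, if the post is right
```

Two numbers per run, read next to the output quality, is enough to tell you whether the extra reasoning was worth it for your task.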

The benchmark trap

Reading a vendor's launch post and inferring "this model is better at coding" is a category error. The model is better at the slice of coding that benchmark measures, on the settings the vendor evaluated. Your Tuesday will look different. Build a three-prompt mini-suite of your own work shapes and run it on every flagship release. Twenty minutes of evaluation will tell you more than a hundred leaderboard rows.
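Concretely, the suite can be this small. A minimal sketch, assuming both vendors expose OpenAI-compatible chat endpoints (check that before trusting it); the model ids, placeholder base URL, and prompt filenames are stand-ins for your own:

```python
# Three prompts (one per work shape), two models, outputs saved to disk for an
# honest side-by-side read. Everything named here is a placeholder.
from pathlib import Path
from openai import OpenAI

MODELS = {
    "gpt-5.5":  OpenAI(),                                       # default OpenAI endpoint
    "opus-4.7": OpenAI(base_url="https://example.invalid/v1"),  # swap in your provider's endpoint or SDK
}

PROMPTS = {
    "greenfield": Path("prompts/greenfield_endpoint.md"),
    "refactor":   Path("prompts/refactor_real_module.md"),
    "debugging":  Path("prompts/hairy_bug_report.md"),
}

for shape, prompt_path in PROMPTS.items():
    prompt = prompt_path.read_text()
    for model_id, client in MODELS.items():
        resp = client.chat.completions.create(
            model=model_id,
            messages=[{"role": "user", "content": prompt}],
        )
        out = Path(f"runs/{shape}--{model_id}.md")
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(resp.choices[0].message.content)
        print(f"wrote {out}")
```

The output files are the point: read them next to each other the way you'd read two PRs, not the way a leaderboard reads a diff.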

Four shapes of work, and which model tends to win

The decision is per-task, not per-model. Below is the working frame we've ended up with after living with both since launch. It's an opinion, formed from the work, not a benchmark recap. Calibrate against your own three-prompt suite before you commit to any of it.

1. Greenfield feature work

A new module, a new endpoint, a clean slate. The constraints are the spec; the codebase doesn't push back yet. GPT-5.5 has tended to win here in our reading: faster, more direct, less likely to over-think a straightforward implementation. The agentic-coding emphasis pays off when the work is "build the obvious thing, well".

2. Deep refactors in an old codebase

You're moving a load-bearing abstraction across a hundred files. Every step has implications. Most of the work is judging which assumptions still hold and which were quietly load-bearing. Opus 4.7 has been our pick for this kind of work. Adaptive thinking lets it spend the reasoning time the task actually needs, and the sustained-reasoning emphasis is exactly the right shape for code archaeology.

3. Debugging the hairy bug

The reproduction is intermittent. The stack trace is a lie. Three subsystems are involved. Opus 4.7 again, in our reading, particularly when the bug requires holding several simultaneous hypotheses and discarding them as evidence comes in. GPT-5.5 with extended reasoning catches up but pays for it in tokens and latency.

4. Glue code, scripts, and small CLIs

"Write me a script that reads CSV X, joins against database Y, and emits report Z." Bounded scope, clear contract, low ambiguity. GPT-5.5 again, almost always. The work is well-suited to its strengths and the cost is lower per task. Sending Opus 4.7 at this is overkill that you'll feel on the bill.

The cost shape changed too

The conversation up to now has mostly been about capability. The bill is starting to matter. GPT-5.5's $5 input / $30 output per million tokens is double GPT-5.4's pricing. Opus 4.7 sits in the same tier (Anthropic's tier-pricing convention puts Opus above Sonnet by a similar multiple). Running either flagship on every task is a real budget line in 2026 in a way it wasn't in 2024.

This is what makes the routing decision concrete. If your team uses GPT-5.5 for greenfield and glue (call it sixty percent of the work) and Opus 4.7 for refactors and hairy debugging (the other forty percent), you've kept both flagships in the toolbox and aimed each one at the work it's actually shaped for. If you commit to one, you'll either send it work it's the wrong shape for, or pay flagship rates for work the cheaper tier could have handled.
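To put rough numbers on "you'll feel it on the bill", here's the back-of-the-envelope version. The GPT prices are from the launch posts above; Opus 4.7's price is an assumption (same tier as GPT-5.5, since Anthropic's exact figure isn't quoted here), and the monthly token volumes are invented for illustration:

```python
# Monthly cost of flagship-everywhere vs routed-by-shape, with made-up volumes.
# Prices are USD per million tokens (input, output); Opus pricing is assumed.
PRICES = {
    "gpt-5.5":  (5.00, 30.00),
    "gpt-5.4":  (2.50, 15.00),
    "opus-4.7": (5.00, 30.00),   # assumption: same tier as GPT-5.5
}

def cost(model, input_mtok, output_mtok):
    p_in, p_out = PRICES[model]
    return p_in * input_mtok + p_out * output_mtok

# Hypothetical month: 200M input / 40M output tokens of coding work.
flagship_only = cost("opus-4.7", 200, 40)

routed = (
    cost("gpt-5.4", 80, 16)      # daily glue on the cheaper tier
    + cost("gpt-5.5", 40, 8)     # greenfield on the agentic flagship
    + cost("opus-4.7", 80, 16)   # refactors and hairy debugging
)

print(f"everything on one flagship: ${flagship_only:,.0f}/mo")   # $2,200
print(f"routed by work shape:       ${routed:,.0f}/mo")          # $1,760
```

The exact numbers don't matter; the shape does. With these assumptions the saving comes almost entirely from the glue you stop sending to a flagship.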

What the three-prompt suite should actually test

The obvious failure mode of "build your own evaluation" is to test the wrong things. If your three prompts are all greenfield CRUD endpoints, you've built a benchmark that says GPT-5.5 wins, every time, and tells you nothing about the work that's hard. Three principles that have helped us:

  • One prompt per work shape, not three flavours of one shape. If your easy prompt is a script and your gnarly prompt is a more complicated script, the suite tells you almost nothing. Greenfield, refactor, and debugging are three different shapes; pick one of each.
  • Use real code from your real repository. Synthetic benchmarks are clean and unrepresentative. Your codebase has the specific weirdness that matters: the badly-named module, the test that takes forever, the abstraction that turned out wrong. The model that handles that weirdness well is the model that helps you on Tuesday.
  • Score the output the way you'd score a colleague's PR, not the way a benchmark scores a diff. Did the model understand the constraint? Did it pick a sensible level of abstraction? Did it propose tests? Did it ask a clarifying question when one was warranted? These are the questions a senior reviewer asks, and they're the ones that distinguish flagship-tier from cheaper-tier work.

Twenty minutes of running this against a new flagship release is a tax that pays back inside a sprint. The first time you run it against a model you'd assumed was strong and watch it stumble on your refactor prompt is the moment your routing strategy stops being theoretical.
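One way to keep the "score it like a colleague's PR" part honest is to write the reviewer questions down before you run anything. A minimal sketch using the questions from the list above; the 0-2 scale and the equal weighting are our choices, and they're yours to argue with:

```python
# Score a model's output the way a senior reviewer scores a PR.
# Each question gets 0 (no), 1 (partly), or 2 (yes).
RUBRIC = [
    "Understood the constraint, not a plausible nearby one",
    "Picked a sensible level of abstraction",
    "Proposed tests",
    "Asked a clarifying question when one was warranted",
]

def score(answers: dict[str, int]) -> float:
    """answers maps each rubric question to 0, 1, or 2."""
    return sum(answers[q] for q in RUBRIC) / (2 * len(RUBRIC))

# Filled in by hand after reading one model's output on one prompt.
example = {
    RUBRIC[0]: 2,
    RUBRIC[1]: 2,
    RUBRIC[2]: 2,
    RUBRIC[3]: 0,
}
print(f"{score(example):.0%}")   # -> 75%
```

Hand-scored, deliberately: the point is to force the senior-reviewer questions, not to automate them away.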

How to decide, this month

Three concrete moves.

  1. Build a three-prompt suite. One easy task from your week (a glue script). One gnarly one (a real refactor you've been putting off). One weird one (a domain-specific edge case nobody else's benchmark covers). Run it on both models. Save the outputs. Compare them honestly, not against the vendor's claims.
  2. Set a routing default. Pick the cheaper tier (GPT-5.4 or Sonnet) for daily glue. Reserve the flagship tier (5.5 or Opus 4.7) for the work shapes you've calibrated it for. The default is the lever; flagship-by-default is what runs your bill up.
  3. Re-run the suite every release. The frontier moves fast, and the suite takes twenty minutes to run. Doing it on every flagship release is a tax worth paying.

None of this is novel methodology. It's the working developer's version of evaluation, and it's the move that matters more than every leaderboard read combined.

The "one model to rule them all" framing already lost. You're going to route per task, with a cheaper tier as your default and the flagships reserved for the work they're shaped for. Build the three-prompt suite. Calibrate. Route. Pick one and you're paying for the wrong model on half your work; pick both and you're doing the actual job.