Two reviewers caught what no test could
At one of our recent verification phase boundaries, four reviewers read the same code change in parallel. None of them saw the others' reports until their own was filed. Two flagged the same defect.
The defect was small. A validation channel was wired up so that it appeared to be running, and its tests all passed. In production, though, the call site had disabled the input the channel was checking. The channel itself was sound; its caller had silently turned off the data it needed.
Automated tests caught nothing. They couldn't. Each test exercised the channel directly, never through the call site that had disabled its input. The bug lived between two pieces of correct code.
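To make the shape of that gap concrete, here is a minimal sketch in Python. The names are hypothetical, not the code under review: a validator that is correct in isolation, a caller that has quietly stopped supplying the input it checks, and a unit test that passes because it exercises the validator directly rather than through the caller.

```python
# Hypothetical sketch of the bug class, not the actual code under review.

def validate_record(record: dict, required_fields: list[str] | None) -> bool:
    """Correct in isolation: rejects records missing any required field."""
    if not required_fields:
        # Nothing to check, so validation passes vacuously.
        return True
    return all(field in record for field in required_fields)

def ingest(record: dict) -> bool:
    """The call site. It has drifted from its specification and no longer
    supplies the required-field list, so the validator runs as a no-op."""
    required_fields = None  # should be the schema's required fields
    return validate_record(record, required_fields)

# The test exercises the validator directly, never through ingest(),
# so it passes while production takes the vacuous path.
def test_validator_rejects_missing_field():
    assert validate_record({"id": 1}, ["id", "checksum"]) is False
```

Both functions are individually correct; the defect only exists in the composition, which is exactly the seam the direct tests never cross.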
Two reviewers, working in parallel and to different briefs, surfaced the same defect. The design of the review made that convergence likely.
What just happened
A convergent finding is the strongest single signal a review process can produce. The same flaw, surfaced by independent eyes, with no coordination between them. For the bug to remain hidden, it would have had to fall into the same blind spot in every reviewer's frame. Two reviewers caught it; the probability that a third would have missed it is low. The probability that the engineers downstream of the review would have caught it is even lower.
Independence is load-bearing. Our reviewers do not see each other's reports until their own is filed. No early "did anyone else flag X?". No reading the room. Each one reads the change as if they were the only set of eyes on it. The integration step that follows is structured to reconcile findings, not to converge opinions while they are still forming.
This is not a rare event. It is the design of the panel.
Four reviewers, four failure classes
The panel has four standing roles.
The Logician reads for soundness. Does each step in the argument follow from the step before it, and does the overall claim survive the steps.
The Formal Verification Engineer reads for the machinery that proves the system right. Are the proof obligations real, are the extraction boundaries clean, are the trusted assumptions accounted for.
The Software Engineer reads for the production reality. Error paths, scale, the gap between the lab and the deploy.
The Doctoral CS Researcher reads for theoretical foundations. This looks familiar, this is a pattern with a name, this has a known failure mode.
Each role looks for a different class of failure. The Logician catches the logical step that does not follow. The Formal Verification Engineer catches the assumption that was trusted but should have been proved. The Software Engineer catches the integration that is correct in isolation but broken under deployment. The Doctoral CS Researcher catches the pattern that looks novel but is actually an old failure mode in fresh clothes.
In the case from the opening, the Logician and the Software Engineer converged. The Logician noticed that the validation channel's correctness argument assumed an input the production call site was no longer providing. The Software Engineer noticed that production integration tests of the call site never exercised the channel under realistic conditions. Two different inquiries, the same conclusion: this check is not actually running.
The pair makes sense. The bug had a logical face (an assumption that no longer held) and an operational face (a call site that had drifted from its specification). The Logician sees the first face. The Software Engineer sees the second. Either alone would have surfaced it. Both, independently, made the finding certain.
What makes it reproducible
Three properties hold the panel together.
Independence. Reviewers work in parallel. They do not see each other's reports until their own is filed. We use a shared submission point. No live chat, no "what did you find" before reports are in.
Composition. Each role has a defined remit. We do not draft a panel from whoever is free. The four roles cover four classes of failure, and a panel that is missing one role is documented as such before the review begins. The remit is what makes the independence productive. Without it, two reviewers might miss the same thing for the same reason.
Cadence. The panel sits at every phase boundary, not "when in doubt." It is a fixed cost, not a discretionary one. Discretionary review is the failure mode where the panel runs when someone is anxious and skips when someone is confident. The bugs that cost the most are the ones the team was confident about.
The cost is real. Each phase boundary takes engineering time from people doing other work. The pay-off is a record of bugs the panel has caught in recent passes that would otherwise have shipped:
- An FFI memory-management defect invisible at unit scale and fatal at production scale.
- A defensive validator inert in production because the call site had disabled its input.
- An error path that returned a success status to clients (sketched below).
- A constant in a lookup table set to the wrong value in a way the existing tests did not reach.
- A stepwise check whose interaction with the surrounding loop the specification missed.
- A safety boundary that admitted evidence outside the scope it was meant to enforce.
Six findings of this severity in recent review passes. Each was a BLOCKING issue: the phase did not ship until the finding was resolved. Each of the underlying defects had passed every automated test the team had written for the code in question.
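As one illustration of what these findings look like in code, here is a hedged sketch of the third item in the list, the error path that reported success. The names and structure are hypothetical, not the code the panel reviewed.

```python
# Hypothetical sketch of the "error path returns success" finding.
from dataclasses import dataclass

@dataclass
class Response:
    status: int
    body: str

def process(payload: str) -> str:
    if not payload:
        raise ValueError("empty payload")
    return payload.upper()

def handle_request(payload: str) -> Response:
    try:
        return Response(status=200, body=process(payload))
    except ValueError as exc:
        # The failure is logged internally, but the client still sees 200.
        print(f"error: {exc}")
        return Response(status=200, body="")  # defect: should be a failure status
```

Tests that assert on the body of the happy path and on the log line of the failure path both pass; only a test that asserts on the status code of the failure path would have caught it.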
Why we publish this
For technical buyers. This is the review discipline an AI vendor working in regulated contexts should be able to demonstrate. If a vendor cannot describe what their review panel looks like, who is on it, what classes of failure each role hunts, and how convergence is structured to surface bugs the rest of the team missed, the vendor's review is informal. Informal review catches the things people thought to look for. The bugs that ship are the ones nobody thought to look for.
For people thinking about defensibility. This discipline is reproducible. It is also expensive. It is the kind of moat that compounds the longer it is run. The aggregate findings are a record. The record is the proof that the discipline works.
What's next
The rest of the series unpacks the other disciplines that travel with the panel. The five-minute audit step that prevents months of no-op work. The bug class that unit tests cannot see by construction. The three-tier accounting we use to make our trusted assumptions explicit. A retrospective on how the disciplines, applied together, compressed our team's design estimates by an order of magnitude across multiple stages. Each will arrive in this series.