Part 1 · Methodology series · 6 min read

Two reviewers caught what no test could

At one of our recent verification phase boundaries, four reviewers read the same code change in parallel. Each of them worked blind to the others' reports until their own was filed. Two flagged the same defect.

The defect was small. A validation channel was wired so it appeared to be running, and its tests all passed. In production the call site was disabling the input the channel was checking. The channel itself was sound. The channel's caller had silently turned off the data it needed.

Automated tests stayed green. They had to. Each test exercised the channel directly, always stopping short of the call site that had disabled its input. The bug lived between two pieces of correct code.
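
A minimal sketch of this bug class in Python may help. Every name in it (`validate_amount`, `handle_request`, the `amount` field) is hypothetical and chosen for illustration; it is not the code from the review. The validator is correct and its direct tests pass, while the caller has stopped forwarding the field the validator checks.

```python
from typing import Optional


def validate_amount(amount: Optional[float]) -> bool:
    """Hypothetical validation channel: reject negative amounts.

    A missing amount is treated as "nothing to check", which is a
    reasonable default when the function is read in isolation.
    """
    if amount is None:
        return True
    return amount >= 0


def handle_request(payload: dict) -> str:
    """Hypothetical production call site.

    The validator is still called, so the check appears to run, but the
    caller no longer forwards the field the validator inspects.
    """
    amount = None  # drifted: this used to be payload.get("amount")
    if not validate_amount(amount):
        return "rejected"
    return "accepted"


# Direct tests of the channel pass, and stop short of the call site.
assert validate_amount(-5.0) is False
assert validate_amount(10.0) is True

# Only a test driven through the call site exposes the gap.
print(handle_request({"amount": -5.0}))  # prints "accepted"
```

Both functions are defensible on their own. The defect only exists in the pairing, which is why tests aimed at either one stay green.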

Two reviewers, working in parallel and to different briefs, surfaced the same defect. The design of the review made that convergence likely.

What just happened

A convergent finding is the strongest single signal a review process can produce: the same flaw, surfaced by independent eyes working in isolation from each other. For the bug to stay hidden, it would have had to sit in a blind spot shared by every reviewer on the panel. Two of the four caught it without seeing each other's work, so the finding is unlikely to be an artifact of one reader's frame. The probability that the engineers downstream of the review would have caught it is lower still.

Independence is load-bearing. Our reviewers file blind, each report submitted before anyone sees another. The submission process seals each reviewer off from asking whether someone else flagged the same issue, or from reading the room. Each one reads the change as if they were the only set of eyes on it. The integration step that follows reconciles findings; opinions converging during the review would defeat its purpose.
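
One way to picture blind filing is a submission box that accepts reports but reveals none until every expected report is in. The sketch below is an assumption about how such a box could behave, not a description of our tooling; `SealedReviewBox` and its methods are hypothetical names, and it enforces a slightly stricter rule than the one described above (nobody reads anything until everyone has filed).

```python
from dataclasses import dataclass, field


@dataclass
class SealedReviewBox:
    """Hypothetical submission point: accepts reports, reveals none early."""

    expected_roles: frozenset
    _reports: dict = field(default_factory=dict, init=False, repr=False)

    def file(self, role: str, report: str) -> None:
        """Store one report per role without exposing any other report."""
        if role not in self.expected_roles:
            raise ValueError(f"unknown role: {role}")
        if role in self._reports:
            raise ValueError(f"{role} has already filed")
        self._reports[role] = report

    def is_sealed(self) -> bool:
        """The box stays sealed until every expected role has filed."""
        return set(self._reports) != set(self.expected_roles)

    def open(self) -> dict:
        """Return all reports for integration, or refuse while sealed."""
        if self.is_sealed():
            missing = sorted(self.expected_roles - set(self._reports))
            raise PermissionError(f"still sealed; waiting on: {missing}")
        return dict(self._reports)


box = SealedReviewBox(frozenset({"logician", "software_engineer"}))
box.file("logician", "correctness argument assumes an input that is gone")
try:
    box.open()
except PermissionError as err:
    print(err)  # still sealed; waiting on: ['software_engineer']
```

The point of the design is that there is no read path at all before the last report lands, so a reviewer cannot check whether someone else has already flagged the same issue.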

This is the design of the panel.

Four reviewers, four failure classes

The panel has four standing roles.

The Logician reads for soundness. Does each step in the argument follow from the step before it, and does the overall claim survive the steps?

The Formal Verification Engineer reads for the machinery that proves the system right. Are the proof obligations real, are the extraction boundaries clean, are the trusted assumptions accounted for?

The Software Engineer reads for the production reality. Error paths, scale, the gap between the lab and the deploy.

The Doctoral CS Researcher reads for theoretical foundations. This looks familiar, this is a pattern with a name, this has a known failure mode.

Each role looks for a different class of failure. The Logician catches the logical step that breaks the chain. The Formal Verification Engineer catches the assumption that was trusted but should have been proved. The Software Engineer catches the integration that is correct in isolation but broken under deployment. The Doctoral CS Researcher catches the pattern that looks novel but is actually an old failure mode in fresh clothes.

In the case from the opening, the Logician and the Software Engineer converged. The Logician noticed that the validation channel's correctness argument assumed an input the production call site was no longer providing. The Software Engineer noticed that production integration tests of the call site never exercised the channel under realistic conditions. Two different inquiries, the same conclusion: this check only appears to run.

The pair makes sense. The bug had a logical face (an assumption that had quietly lapsed) and an operational face (a call site that had drifted from its specification). The Logician sees the first face. The Software Engineer sees the second. Either alone would have surfaced it. Both, arriving independently, made the finding hard to dismiss.
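
Once reports are unsealed, spotting convergence can be sketched as a grouping step. The code below assumes each finding has been tagged with a normalized component label during integration; that tagging, the `Finding` type, and the labels themselves are hypothetical, and the third entry is a placeholder rather than a real finding.

```python
from collections import defaultdict
from typing import NamedTuple


class Finding(NamedTuple):
    role: str
    component: str  # hypothetical label assigned when reports are reconciled
    note: str


def convergent_findings(findings):
    """Return components flagged by two or more distinct roles."""
    roles_by_component = defaultdict(set)
    for finding in findings:
        roles_by_component[finding.component].add(finding.role)
    return {
        component: sorted(roles)
        for component, roles in roles_by_component.items()
        if len(roles) >= 2
    }


reports = [
    Finding("logician", "validation_channel_call_site",
            "correctness argument assumes an input the caller no longer provides"),
    Finding("software_engineer", "validation_channel_call_site",
            "integration tests never exercise the channel realistically"),
    Finding("formal_verification_engineer", "placeholder_component",
            "illustrative entry only"),
]

print(convergent_findings(reports))
# {'validation_channel_call_site': ['logician', 'software_engineer']}
```

Counting distinct roles rather than raw findings matters here: two notes from the same reviewer are one frame, not two.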

What makes it reproducible

Three properties hold the panel together.

Independence. Reviewers work in parallel. Each report is filed blind, sealed before anyone sees another. We use a shared submission point; reports land silently, ahead of any discussion.

Composition. Each role has a defined remit. We staff the panel by role rather than by availability. The four roles cover four classes of failure, and a panel short one role is documented as such before the review begins. The remit is what makes the independence productive. Lacking it, two reviewers might miss the same thing for the same reason.
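
Staffing by role lends itself to a simple pre-review check. This is a sketch under assumed names (`REQUIRED_ROLES`, `missing_roles`, and the roster format are all hypothetical): it lists the roles a roster leaves unfilled so the gap can be documented before the review begins.

```python
# The four standing roles, as hypothetical identifiers.
REQUIRED_ROLES = (
    "logician",
    "formal_verification_engineer",
    "software_engineer",
    "doctoral_cs_researcher",
)


def missing_roles(roster: dict) -> list:
    """Return required roles with no reviewer assigned.

    `roster` maps a role identifier to a reviewer name, or to None
    (or no entry at all) when the seat is empty.
    """
    return [role for role in REQUIRED_ROLES if not roster.get(role)]


roster = {
    "logician": "reviewer_a",
    "formal_verification_engineer": "reviewer_b",
    "software_engineer": "reviewer_c",
    "doctoral_cs_researcher": None,  # seat empty for this phase boundary
}

print(missing_roles(roster))  # ['doctoral_cs_researcher']
```

An empty result means the panel is fully staffed; a non-empty one is the list to record alongside the review.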

Cadence. The panel sits at every phase boundary on a fixed schedule. Discretionary review is the failure mode where the panel runs when someone is anxious and skips when someone is confident. The bugs that cost the most are the ones the team was confident about.

Each phase boundary takes engineering time from people doing other work. The pay-off is a record of bugs the panel caught that would otherwise have shipped.

Recent review passes produced six findings of this severity. Each was a BLOCKING issue, and the phase shipped only once the finding was resolved. Each had passed every automated test the team had written for the code in question.

Why we publish this

For technical buyers. This is the review discipline an AI vendor working in regulated contexts should be able to demonstrate. When a vendor can only gesture at what their review panel looks like, who is on it, what classes of failure each role hunts, and how convergence is structured to surface bugs the rest of the team missed, the vendor's review is informal. Informal review catches the things people thought to look for. The bugs that ship are the ones nobody thought to look for.

For people thinking about defensibility. This discipline is reproducible. It is also expensive. It is the kind of moat that compounds the longer it is run. The aggregate findings are a record. The record is the proof that the discipline works.

What's next

The rest of the series unpacks the other disciplines that travel with the panel. The five-minute audit step that prevents months of no-op work. The bug class that unit tests structurally overlook. The three-tier accounting we use to make our trusted assumptions explicit. A retrospective on how the disciplines, applied together, compressed our team's design estimates by an order of magnitude across multiple stages. Each will arrive in this series.

Next arc: Vertical Use-Cases, Part 1 (Medical DDx)

Subscribe to the rest of the series at shellfinity.substack.com.

Evaluating verified AI for regulated work? See our medical deployment and join the early-access waitlist on the home page.

Direct correspondence: daniel@shellfinity.com.