Case study · Verified medical AI

When the transcript invents what wasn't said

The Associated Press ran an investigation in October 2024 that deserved more attention than it got. Whisper, OpenAI's speech-to-text model, hallucinates. Across the audio corpora researchers studied, it generated content absent from the audio in roughly one segment out of a hundred. Sometimes whole sentences. Sometimes demographic descriptions of people the audio left undescribed. In clinical deployments, sometimes fabricated medical advice the model produced on its own.

Whisper is the engine inside ambient-listening tools several hospitals are already deploying. Nabla is the most-named example; others exist. The output flows into the EHR, then into billing, then into the next visit, where the prior note is treated as ground truth by the next clinician, the next AI system, whoever opens the chart.

A clinician notices a missing note. Nobody notices a fabricated one.

The bug is structural

This is not an implementation flaw in the usual sense of that phrase. It is a property of the architecture, present because the design never addressed it.

Speech-to-text models in the current dominant class predict each token from a probability distribution learned during training and shaped by the language of the training corpus. At inference, the audio is supposed to condition the prediction. Mostly it does. When the audio is clear and within the distribution the model was trained on, the prediction tracks the audio. When the audio is unclear, briefly silent, or at the edge of that distribution, the model still emits a prediction. The prediction has to come from somewhere. It comes from training.

The architecture lacks any mechanism for refusal. Every path through the forward pass runs to a committed token. The model has one move available, to emit; the option to say "the input underdetermines this token; stay silent" lives outside its forward pass. Output is mandatory. Anchoring is optional.
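
A toy sketch makes the forced-emission point concrete. It assumes, purely for illustration, that a decoder's score for each token splits into an audio term and a prior term learned from training text. Real speech models do not decompose this cleanly, and every name and number below is hypothetical.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode_step(audio_logits, prior_logits, vocab):
    """Pick the most probable token. No branch returns 'nothing'."""
    combined = [a + p for a, p in zip(audio_logits, prior_logits)]
    probs = softmax(combined)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return vocab[best], probs[best]

# Hypothetical four-word vocabulary and a prior that favors a common word.
vocab = ["take", "stop", "the", "aspirin"]
prior_logits = [0.5, 0.1, 2.0, 0.3]

clear_audio = [0.0, 0.0, 0.0, 8.0]   # audio strongly supports "aspirin"
silent_audio = [0.0, 0.0, 0.0, 0.0]  # audio supports nothing in particular

clear_token, clear_prob = decode_step(clear_audio, prior_logits, vocab)
silent_token, silent_prob = decode_step(silent_audio, prior_logits, vocab)

print(clear_token)   # the audio decides
print(silent_token)  # the prior decides, and a token is emitted anyway
```

With clear audio the sketch emits "aspirin". With silent audio it still emits a token, and that token is whichever word the prior favors. Nothing in `decode_step` can return "no token".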

You can add training data. You can tune thresholds. Both leave the gap open. The gap sits between "produce an output" and "produce only outputs the input justifies," two different problems, addressed by different architectures.

"We tested it on a thousand encounters" is the wrong evidence to ask for. The bug class is shaped by the architecture; the test plan is shaped by the audio researchers thought to record. The two shapes overlap only by luck.

What a verified engine refuses to commit

The architectural property that closes this kind of gap is straightforward to state, and considerably less straightforward to ship. The engine must refuse to commit output the input leaves unanchored. Every output element gets traced to evidence. Output elements that resist tracing get a different name: unabsorbed evidence. The engine acknowledges them; it stops short of certifying them.

The Shellfinity engine produces, for each ruling, what we call an exclusion certificate. It names what was ruled in, what was ruled out, and what evidence the input was missing that prevented further commitment. The certificate is the artifact that proves the architecture held.
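
One way to picture the three parts of such a certificate is as a small record. The sketch below is hypothetical: the field names, the `rule` function, and the placeholder conditions and findings are assumptions for illustration, not the Shellfinity engine's actual format or logic.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

@dataclass(frozen=True)
class ExclusionCertificate:
    ruled_in: Tuple[str, ...]          # candidates the input supports
    ruled_out: Tuple[str, ...]         # candidates the input contradicts
    missing_evidence: Tuple[str, ...]  # findings whose absence blocked a ruling

def rule(requirements: Dict[str, Tuple[str, ...]],
         findings: Dict[str, Optional[bool]]) -> ExclusionCertificate:
    """Three-way split: rule in, rule out, or record what was missing."""
    ruled_in, ruled_out, missing = [], [], []
    for candidate, needed in requirements.items():
        observed = [findings.get(name) for name in needed]
        if any(value is False for value in observed):
            ruled_out.append(candidate)      # a required finding is absent
        elif all(value is True for value in observed):
            ruled_in.append(candidate)       # every required finding is present
        else:
            # Undetermined: no ruling, only a record of the unknown findings.
            for name in needed:
                if findings.get(name) is None and name not in missing:
                    missing.append(name)
    return ExclusionCertificate(tuple(ruled_in), tuple(ruled_out), tuple(missing))

# Placeholder names only; True = observed present, False = observed absent.
requirements = {
    "condition_a": ("finding_x", "finding_y"),
    "condition_b": ("finding_z",),
    "condition_c": ("finding_x", "finding_w"),
}
findings = {"finding_x": True, "finding_y": True, "finding_z": False}

cert = rule(requirements, findings)
print(cert)
```

In this sketch `condition_c` appears in neither list. The input never mentioned `finding_w`, so the record names the gap instead of guessing.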

Transcription sits outside our scope. The question this engine answers is one level upstream: when a system is asked to make a clinical commitment, what stops it from emitting a commitment the input leaves unjustified? In most clinical AI shipping today, the stopping is done outside the system. Human review. Flag detection. Vendor reputation. Each is doing work, and the work matters, and every bit of it sits outside the architecture itself.

A formal-verification approach puts the refusal inside the system. The invariant (every commitment carries an anchor) holds because the proof says it holds. Whatever the system already refused to emit stays off the operator's desk.
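
The invariant can be illustrated with a type that cannot be constructed without an anchor. This is only a runtime stand-in under assumed names (`Commitment`, `try_commit`, and the anchor strings are hypothetical). A formally verified engine would establish the property by proof rather than by a check that fires at run time.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

class UnanchoredCommitment(ValueError):
    """Raised when a commitment is attempted with no supporting evidence."""

@dataclass(frozen=True)
class Commitment:
    claim: str
    anchors: Tuple[str, ...]  # identifiers of the input evidence behind the claim

    def __post_init__(self):
        # The invariant: every commitment carries at least one anchor.
        if not self.anchors:
            raise UnanchoredCommitment(f"no anchor for claim: {self.claim!r}")

def try_commit(claim: str, anchors: Iterable[str]) -> Optional[Commitment]:
    """Return a Commitment, or None when the input offers no anchor."""
    try:
        return Commitment(claim, tuple(anchors))
    except UnanchoredCommitment:
        return None

anchored = try_commit("claim_1", ["input_span_12"])
refused = try_commit("claim_2", [])

print(anchored)
print(refused)
```

The refusal happens inside `try_commit`, before anything reaches a reviewer: the second call yields `None` rather than an unanchored claim.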

The procurement question this reshapes

Most procurement processes for clinical AI ask a benchmark question. Sensitivity, specificity, accuracy on the published validation set. Useful numbers. Silent on this class of failure.

The load-bearing question is architectural. What mechanism in your system prevents it from emitting content beyond what the input justifies?

A vendor that can answer with a structural claim (an invariant, a verification target, a refusal mechanism that holds because the proof says so) is a vendor whose worst case is bounded by their architecture. A vendor that answers with a process ("our reviewers catch flagged cases," "we monitor for hallucination patterns") is a vendor whose worst case is bounded by their team. Both are bounded. The architectural bound is one you can model. The team bound is one you have to staff against. Different things, different planning horizons.

Ask it anyway. The answers sort vendors into those two categories.

A note on why this is in our series

We've been writing a methodology series on how we ship a formally verified evaluation engine. Panel review. Surface audit. Perf-stress. Tier accounting. Each addresses a class of failure that the discipline-free approach lets through.

This case study sits adjacent to that series rather than inside it. Same shape of argument, more specific target: the failure class where a system commits to content the input leaves unjustified, and the architectural property that closes the gap. Whether it's worth closing is for you to decide. Whether it can be closed is a verification problem, and that's what the series is about.

Differential-diagnosis engine in pre-launch polish. Demo on request.

Subscribe for the rest of the series at shellfinity.substack.com.

Working on verified medical AI? Read the methodology series, or email daniel@shellfinity.com to see the differential-diagnosis demo and discuss pilot interest.
