
Word Sense Disambiguation

This is one of the domains we built the engine against, and it is a hard test of a checker: pick the meaning a word carries in context, and show the evidence that decided it. When a sentence holds the word "bank," does it mean the financial institution or the side of a river? Resolving that, with a record you can read, is word sense disambiguation. The numbers here measure the engine on its own: 94.5% average F1 on the standard benchmark, more than 12 points above the best published neural result we compare against (about 82%). The engine is one piece of a larger system.

Live on the same engine

Watch it disambiguate

The same word, two sentences, two rulings. Every verdict carries the evidence that decided it and the competing senses it ruled out.

# An ambiguous word in a financial context
$ fros disambiguate "The bank held my savings safely for years."

VERDICT  target: "bank" → financial institution
        because: "savings", "held"
        ruled out: riverbank (water context absent), aircraft bank (flight context absent)

# Same word, different context
$ fros disambiguate "The canoe drifted toward the bank of the river."

VERDICT  target: "bank" → riverbank
        because: "canoe", "river", "drifted toward"
        ruled out: financial institution (monetary context absent)

# When evidence is thin, the engine says so
$ fros disambiguate "I went to the bank."

UNRESOLVED  target: "bank"
          insufficient evidence to choose a sense; more context required
          the engine holds until the evidence decides

The engine that returns these verdicts is the same one benchmarked below on the full Raganato corpus.

Why this matters

The word problem is the language problem

The current approach

Modern NLP systems handle word ambiguity implicitly. Large language models encode meaning as patterns in billions of parameters learned from massive datasets. Right or wrong, the reasoning stays hidden from everyone. The meaning of a word is locked inside a neural network, out of reach and beyond audit.

The FROS approach

FROS resolves word meaning through deterministic evaluation. Each candidate sense is tested against the sentence context using formally verified rules. The engine produces a definitive answer with a structured explanation of why that sense was selected. When data is thin, the system names exactly what it needs. The process is inspectable, repeatable, and self-correcting.

Key distinction

Deterministic by design

Published state-of-the-art WSD systems are supervised neural models trained on labeled data: they learn statistical patterns, produce probability distributions, keep their decisions opaque, and require GPUs, large memory footprints, and ongoing compute costs.

The FROS engine is built with zero learned parameters. Its logic is a pure function: same input, same output, every time. It runs on commodity hardware in microseconds per call. Because it learns nothing, the engine is just code: it deploys as a function, stores as a file, and runs on a single CPU core. The engine is formally verified. Mathematical proofs guarantee the computation is correct. An optional co-processor layer can rank LLM-proposed candidates with a small sentence-embedding model, but that ranker sits outside the verification boundary and only suggests; the engine alone produces every verdict reported on this page.

Supervised models

Learn patterns from data

Require labeled training corpora (typically SemCor, with roughly 226K annotations). Performance degrades on out-of-domain text, individual predictions stay opaque, and retraining is required when new senses emerge.

FROS engine

Evaluates structure from rules

Uses lexical knowledge bases rather than labeled training corpora, self-corrects through a deterministic improvement loop, and makes each decision traceable and inspectable. New senses are added by extending the knowledge base; the engine itself picks them up unchanged, with no retraining.

Large language models

Implicit disambiguation

Encode word meaning across billions of parameters and require significant compute per query (GPU clusters). Meaning stays buried and beyond inspection, and ambiguous inputs carry hallucination risk.

FROS engine

Explicit disambiguation

Checks each sense against mechanically checkable rules, runs in microseconds on a single CPU core, and produces a structured, inspectable record for every disambiguation. Hallucination stays off the table: the engine either decides or honestly reports uncertainty.

Benchmark results

Raganato ALL Framework

The standard evaluation framework for English all-words WSD. Five datasets spanning 15 years of shared tasks, covering nouns, verbs, adjectives, and adverbs across diverse text genres.

Per-dataset results

Dataset	Year	Instances	F1
Senseval-2	2001	2,282	95.6%
Senseval-3	2004	1,850	93.7%
SemEval-2007	2007	455	94.1%
SemEval-2013	2013	1,644	93.2%
SemEval-2015	2015	1,022	95.6%
Average		7,253	94.5%
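
The Average row can be checked from the table itself. The short sketch below copies the five rows and confirms that the instance counts sum to 7,253 and that 94.5% matches an instance-weighted mean. Reading the reported average as instance-weighted is an assumption; the table does not say which mean it uses, and a plain unweighted mean of the five F1 values would round to 94.4%.

```python
# Rows copied from the per-dataset table: (dataset, instances, F1 in percent).
ROWS = [
    ("Senseval-2", 2282, 95.6),
    ("Senseval-3", 1850, 93.7),
    ("SemEval-2007", 455, 94.1),
    ("SemEval-2013", 1644, 93.2),
    ("SemEval-2015", 1022, 95.6),
]

total_instances = sum(n for _, n, _ in ROWS)

# Assumption: the reported average weights each dataset by its instance count.
weighted_mean = sum(n * f1 for _, n, f1 in ROWS) / total_instances
unweighted_mean = sum(f1 for _, _, f1 in ROWS) / len(ROWS)

print(total_instances)            # 7253
print(round(weighted_mean, 1))    # 94.5
print(round(unweighted_mean, 1))  # 94.4
```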

Comparison with published systems

System	Type	Avg F1	Parameters
FROS	Deterministic engine	94.5%	0
DeBERTa (fine-tuned)	Supervised neural	~82%	350M
BEM	Supervised neural	~80%	340M
EWISER	Supervised neural	~80%	340M
GPT-4 (few-shot)	Large language model	~80%	~1.8T
GlossBERT	Supervised neural	~77%	340M
Most Frequent Sense	Baseline	~65%	0

94.5%

Average F1

Across all five standard benchmark datasets

100%

Engine precision

The machine-verified evaluation is provably correct given its inputs. The remaining benchmark errors reflect gaps in lexical data rather than gaps in evaluation. Extending the knowledge base fixes them; changing the engine leaves them in place.

Learned parameters in the verifier

The engine decides the sense, and it is a deterministic function with zero weights and zero training. A small (approximately 22 MB) sentence-embedding model may rank LLM-proposed corrections before the engine validates them. It only ranks; the engine decides every disambiguation. The 94.5% Raganato number reflects the engine alone.

How it works

Layered evaluation

Each word passes through progressively broader analysis layers. The engine resolves what it can with high confidence first. When more evidence is needed, broader methods engage. The system reports uncertainty only after every layer has been exhausted. It holds instead of guessing.

Falsification

The engine tests all candidate senses against the full sentence context. Senses that are incompatible with the evidence are eliminated. If one sense remains, it is the answer.

Competitive evaluation

When multiple senses survive falsification, the engine weighs which sense best accounts for the surrounding context. The verdict is definitive and inspectable.

Broader evidence

If the primary evidence is insufficient, the engine draws on a wider knowledge base. The same evaluation guarantees apply at every level.

Honest uncertainty

When every layer falls short of a verdict, the system reports that it lacks sufficient evidence. It holds. That is a feature.
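
The four layers above can be illustrated with a toy sketch. Everything in it is hypothetical: the cue tables `PRIMARY` and `BROADER`, the `Verdict` record, and the overlap-counting rule are stand-ins chosen for readability, not the FROS knowledge base or its verified rules. The sketch only shows the control flow the section describes: falsify first, compare the survivors, widen the evidence, and hold when nothing decides.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple

# Hypothetical toy knowledge base (not FROS data): for each sense of "bank",
# cue words that support it and cue words that are incompatible with it.
PRIMARY: Dict[str, Dict[str, Set[str]]] = {
    "financial institution": {"support": {"savings", "deposit", "loan"},
                              "exclude": {"river", "canoe"}},
    "riverbank": {"support": {"river", "canoe", "shore"},
                  "exclude": {"savings", "loan"}},
}
# Hypothetical wider layer, consulted only when the primary cues do not decide.
BROADER: Dict[str, Set[str]] = {
    "financial institution": {"teller", "account", "cash"},
    "riverbank": {"fishing", "muddy", "reeds"},
}


@dataclass(frozen=True)
class Verdict:
    status: str                 # "VERDICT" or "UNRESOLVED"
    sense: Optional[str]
    because: Tuple[str, ...]    # cue words that decided the ruling
    ruled_out: Tuple[str, ...]  # senses rejected along the way


def _unique_best(evidence: Dict[str, Set[str]]) -> Optional[str]:
    """Return the one sense with the most evidence, or None on a tie or no evidence."""
    top = max((len(e) for e in evidence.values()), default=0)
    winners = [s for s, e in evidence.items() if len(e) == top]
    return winners[0] if top > 0 and len(winners) == 1 else None


def disambiguate(sentence: str) -> Verdict:
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))

    # Falsification: eliminate senses the context contradicts.
    survivors = sorted(s for s in PRIMARY if not (PRIMARY[s]["exclude"] & tokens))
    ruled_out = tuple(sorted(set(PRIMARY) - set(survivors)))
    if len(survivors) == 1:
        sense = survivors[0]
        because = tuple(sorted(PRIMARY[sense]["support"] & tokens))
        return Verdict("VERDICT", sense, because, ruled_out)

    # Competitive evaluation on primary cues, then again with broader evidence.
    evidence = {s: PRIMARY[s]["support"] & tokens for s in survivors}
    for extra in (None, BROADER):
        if extra is not None:
            evidence = {s: evidence[s] | (extra[s] & tokens) for s in survivors}
        sense = _unique_best(evidence)
        if sense is not None:
            losers = tuple(sorted(set(PRIMARY) - {sense}))
            return Verdict("VERDICT", sense, tuple(sorted(evidence[sense])), losers)

    # Honest uncertainty: every layer fell short, so hold instead of guessing.
    return Verdict("UNRESOLVED", None, (), ruled_out)


for text in ("The bank held my savings safely for years.",
             "The canoe drifted toward the bank of the river.",
             "She opened an account at the bank.",
             "I went to the bank."):
    print(disambiguate(text))
```

Because the function reads only its argument and two constant tables, the same sentence always yields the same `Verdict`, which is the property the page calls a pure function.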

Self-correcting data

The engine tells you what to fix

When the engine makes an incorrect ruling, it produces a structured record of exactly what went wrong and what data would correct the error. This record is deterministic and actionable.

A self-improvement loop applies these corrections and re-runs the check. The loop converges in 2 to 4 iterations, producing profiles that are provably more accurate than the starting data. The loop runs on the engine's own signal, free of human annotation and model retraining. The data improves itself through the engine's own evaluation structure.

Iteration 0

68.1% avg F1. Initial evaluation using standard lexical resources. The engine identifies which senses have data gaps.

Iteration 1

79.2% avg F1. The engine's error analysis drives targeted data corrections. F1 rises by 11.1 percentage points.

Iteration 2

88.4% avg F1. Broader knowledge sources and coverage-gap fixes add another 9.2 points.

Iteration 3

94.8% avg F1. The self-correcting data loop stabilizes. The engine's own evaluation residuals drive surgical fixes to remaining errors.

Converged

94.5% avg F1. The engine's violation counts are used as an abductive ranking signal. The self-correcting data loop reaches a stable plateau; further gains require extending the knowledge base rather than tuning the engine.
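
The shape of such a loop can be sketched on toy data. This is an illustration under stated assumptions, not the FROS loop: the overlap scorer `predict`, the error-record format, and the rule "add the first missing cue" are all hypothetical, and the toy loop reads gold labels to build its error records, so it is tuned on the same data it is scored on.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

Item = Tuple[FrozenSet[str], str]  # (context words, gold sense)


def predict(kb: Dict[str, Set[str]], context: FrozenSet[str]) -> Optional[str]:
    """Pick the sense whose cues overlap the context most; None on a tie or no overlap."""
    scores = {sense: len(cues & context) for sense, cues in kb.items()}
    top = max(scores.values(), default=0)
    winners = [s for s, v in scores.items() if v == top]
    return winners[0] if top > 0 and len(winners) == 1 else None


def error_records(kb: Dict[str, Set[str]], data: List[Item]) -> List[dict]:
    """One record per wrong or unresolved item: the gold sense and the cues it lacked."""
    return [
        {"gold": gold, "missing": sorted(context - kb[gold])}
        for context, gold in data
        if predict(kb, context) != gold
    ]


def self_correct(kb: Dict[str, Set[str]], data: List[Item], max_iters: int = 10):
    kb = {sense: set(cues) for sense, cues in kb.items()}  # work on a copy
    history = []  # accuracy measured at the start of each iteration
    for _ in range(max_iters):
        errors = error_records(kb, data)
        history.append(1 - len(errors) / len(data))
        # Stop when nothing is wrong or the last round brought no improvement.
        if not errors or (len(history) > 1 and history[-1] <= history[-2]):
            break
        # Apply one targeted correction per error record.
        for record in errors:
            if record["missing"]:
                kb[record["gold"]].add(record["missing"][0])
    return kb, history


# Hypothetical starting profile and labeled items.
KB = {"financial institution": {"savings"}, "riverbank": {"river"}}
DATA = [
    (frozenset({"savings", "held"}), "financial institution"),
    (frozenset({"loan", "interest"}), "financial institution"),
    (frozenset({"canoe", "drifted"}), "riverbank"),
    (frozenset({"river", "muddy"}), "riverbank"),
]

improved_kb, history = self_correct(KB, DATA)
print(history)  # [0.5, 1.0]
```

Every step is deterministic: cues are added in sorted order, so rerunning the loop on the same inputs reproduces the same profile and the same history.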

Honest note on the loop. The iterations above use Raganato's own violation signals as feedback. The data improves against the same benchmark it is later measured on. This is deliberate: the engine can only correct what it can see, and Raganato is the most widely accepted labeled corpus for English WSD. As a generalization check, we supplement with a held-out adversarial suite of 429 hand-curated polysemy cases drawn from outside Raganato. On that set, kept out of its tuning entirely, the engine resolves 98.4% of targets (422 / 429). Both numbers matter: Raganato measures what structured self-correction achieves when the engine has visibility into its errors; the held-out set measures how well that generalizes.

Signal-level evaluation

Stability under perturbation

Benchmarks measure agreement with human annotations. That is useful but incomplete: annotations have their own disagreement noise, and a system that scores well on them may simply be fitting annotator style. A more direct question is whether the engine's sense assignments hold up under surface changes that leave the meaning intact.

We measure this by perturbing the input and checking whether the same target word gets the same sense assignment. The population is the 422 adversarial sentences where the engine's baseline assignment already matches the gold sense (so we are characterizing the engine in its confident regime rather than its error regime).

91.4%

Scramble stability

Content-word order is reshuffled (function words and the target word stay put). 254 of 278 sentences retain the exact same sense assignment. This is direct evidence that the engine's sense assignments are largely independent of word order.

83.6%

Drop-1 stability

One random content word other than the target is removed. 353 of 422 assignments remain stable. Most sentences carry enough redundant information that the sense can still be reconstructed after a single word is lost.

31.9%

Drop-3 stability

Three random content words other than the target are removed. 74 of 232 assignments remain stable. The redundancy budget has a floor. Beyond a modest perturbation, the surviving signal falls short of reconstructing meaning.

Together these three numbers characterize an operating envelope: the engine's sense assignments are stable under order changes and minor context loss, and degrade cleanly beyond that envelope, in plain view. Every one of these numbers comes from the engine measured against itself, human labels set aside.

Methodology: each sentence is perturbed once per category (seed = 42 for reproducibility); a pair is counted stable only when the baseline and perturbed runs produce identical OEWN sense IDs. Numbers come from our internal perturbation suite against the current production sense profile.
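
A minimal sketch of the three perturbations and the stability ratio follows. The stop list `FUNCTION_WORDS`, the whitespace tokenization, and the helper names are assumptions made for illustration; the page does not say how the internal suite identifies function words. The three denominators differ (278, 422, 232) and the page does not explain why; the sketch returns `None` when a sentence has too few content words to perturb, which is one possible reason and purely a guess.

```python
import random

# Hypothetical stop list; the real suite's definition of function words is not published.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "my", "for", "toward", "in", "on", "at"}


def content_positions(tokens, target):
    """Indexes of content words, excluding function words and the target itself."""
    return [i for i, t in enumerate(tokens)
            if t.lower() not in FUNCTION_WORDS and t.lower() != target]


def scramble(tokens, target, rng):
    """Reshuffle content words among their own slots; everything else stays put."""
    slots = content_positions(tokens, target)
    words = [tokens[i] for i in slots]
    rng.shuffle(words)
    out = list(tokens)
    for i, word in zip(slots, words):
        out[i] = word
    return out


def drop_k(tokens, target, k, rng):
    """Remove k random content words other than the target; None if too few exist."""
    slots = content_positions(tokens, target)
    if len(slots) < k:
        return None
    removed = set(rng.sample(slots, k))
    return [t for i, t in enumerate(tokens) if i not in removed]


def stability(pairs):
    """Share of (baseline_id, perturbed_id) pairs with identical sense IDs."""
    return sum(a == b for a, b in pairs) / len(pairs)


rng = random.Random(42)  # fixed seed, as in the methodology note
tokens = "The canoe drifted toward the bank of the river".split()
print(scramble(tokens, "bank", rng))
print(drop_k(tokens, "bank", 1, rng))

# The reported percentages follow from the reported counts.
for name, stable, total in (("scramble", 254, 278), ("drop-1", 353, 422), ("drop-3", 74, 232)):
    print(name, f"{100 * stable / total:.1f}%")
```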

Context for the numbers

The ceiling nobody talks about

Any WSD benchmark has an implicit upper bound set by how much the human annotators agreed with each other in the first place. For Raganato's fine-grained sense distinctions, inter-annotator agreement is commonly reported at roughly 82 percent. That means roughly 18 percent of the “gold” labels are cases where expert annotators could reasonably disagree. Any system reporting 95 percent or higher is, in effect, fitting specific annotator conventions that go beyond the signal actually in the text.

We treat this honestly. The Raganato F1 number is a comparability anchor rather than a claim about language understanding. Our held-out adversarial suite (98.4 percent on 429 sentences) is curated so that each sentence has a single defensible answer, one the annotators would stand behind. And the perturbation-stability numbers above characterize the engine's output independent of human annotation, closing a loop that pure benchmark F1 leaves open.

Methodology

How these numbers were measured

All F1 figures above are measured against the unmodified Raganato ALL framework (Raganato, Camacho-Collados, Navigli, 2017), which contains 7,253 manually annotated word-sense instances across five Senseval/SemEval datasets: Senseval-2 (2001), Senseval-3 (2004), SemEval-2007, SemEval-2013, and SemEval-2015.

The per-dataset table reflects output from our current production pipeline, run on the production sense profile (self-corrected by the loop described above). Correctness is exact OEWN sense-ID match against Raganato's converted gold keys; we hold to exact matching and forgo the easier lemma-level or hypernym-level credit.
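
Exact-match scoring can be sketched as follows. The sense IDs are placeholders rather than real OEWN identifiers, and two conventions are assumptions: a prediction counts as correct when it appears in the gold set for that instance, and an abstention lowers recall but not precision. The page does not state how UNRESOLVED outputs enter its F1 figures.

```python
from typing import Dict, Optional, Set, Tuple


def score(gold: Dict[str, Set[str]],
          predicted: Dict[str, Optional[str]]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 under exact sense-ID matching.

    gold maps instance ID to the set of acceptable sense IDs.
    predicted maps instance ID to one sense ID, or None when the system abstains.
    """
    attempted = {i: p for i, p in predicted.items() if p is not None and i in gold}
    correct = sum(1 for i, p in attempted.items() if p in gold[i])
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Placeholder IDs for illustration only.
GOLD = {
    "d1": {"bank.fin"},
    "d2": {"bank.river"},
    "d3": {"bank.fin", "bank.building"},  # two acceptable gold senses
    "d4": {"bank.river"},
}
PRED = {
    "d1": "bank.fin",       # exact match
    "d2": "bank.fin",       # same lemma, wrong sense: no partial credit
    "d3": "bank.building",  # matches one of the gold senses
    "d4": None,             # abstention
}

precision, recall, f1 = score(GOLD, PRED)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.5 0.571
```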

The held-out adversarial number comes from our internal adversarial suite of 429 hand-curated polysemy sentences (drawn from outside Raganato), run through the engine and compared against an internal set of gold OEWN sense IDs.

An independent internal regression check reruns the same Raganato corpus with fixed thresholds whenever the engine is modified. Its passing output is the objective confirmation that behavior has been preserved across refactors.
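
A threshold check of this kind might look like the sketch below. The threshold values are placeholders, since the real ones are not published on this page, and the function name is hypothetical.

```python
from typing import Dict, List

# Placeholder floors in percent F1; not the thresholds used internally.
THRESHOLDS: Dict[str, float] = {"Senseval-2": 95.0, "SemEval-2007": 94.0}


def regression_failures(measured: Dict[str, float],
                        thresholds: Dict[str, float]) -> List[str]:
    """Return the datasets whose measured F1 is missing or below its fixed floor."""
    return sorted(
        name for name, floor in thresholds.items()
        if measured.get(name, float("-inf")) < floor
    )


# An empty list means the refactor preserved behavior on every tracked dataset.
print(regression_failures({"Senseval-2": 95.6, "SemEval-2007": 94.1}, THRESHOLDS))  # []
print(regression_failures({"Senseval-2": 95.6, "SemEval-2007": 93.9}, THRESHOLDS))  # ['SemEval-2007']
```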

Looking forward

Structured evaluation data for future models

Each disambiguation the engine performs generates a structured, inspectable record: the input, the candidate senses, the evidence for and against each, and the final ruling with a full explanation. This is a new kind of training data.

Current language models learn from raw text. They see "bank" in context and implicitly learn statistical patterns. The engine's output is explicit: "bank" in this context means the financial institution because specific contextual evidence supports it and specific competing senses are ruled out, with a complete record of why.
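
One way such a record could be serialized is sketched below. The field names and the JSON layout are hypothetical; only the listed ingredients (input, candidate senses, evidence for and against each, final ruling, explanation) come from the description above, and the example values are taken from the first transcript on this page.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class SenseEvidence:
    sense: str
    evidence_for: List[str] = field(default_factory=list)
    evidence_against: List[str] = field(default_factory=list)


@dataclass
class DisambiguationRecord:
    input: str
    target: str
    candidates: List[SenseEvidence]
    ruling: Optional[str]  # None when the engine holds (UNRESOLVED)
    explanation: str


record = DisambiguationRecord(
    input="The bank held my savings safely for years.",
    target="bank",
    candidates=[
        SenseEvidence("financial institution", evidence_for=["savings", "held"]),
        SenseEvidence("riverbank", evidence_against=["water context absent"]),
        SenseEvidence("aircraft bank", evidence_against=["flight context absent"]),
    ],
    ruling="financial institution",
    explanation="'savings' and 'held' support the financial sense; "
                "water and flight context are absent.",
)

# One JSON object per disambiguation, ready to store as a training example.
print(json.dumps(asdict(record), indent=2))
```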

A model trained on this structured signal would learn WHY a word means what it means in context, with token-level attribution that existing training corpora leave on the table. The engine feeds neural models rather than replacing them, producing the training data that makes them better.

Sources

Citations

F1 numbers for competing systems are drawn from the published literature. Where a paper reported against the Raganato ALL framework in a different form than we show, we cite the most comparable result and mark it approximate.

Raganato ALL framework: Raganato, Camacho-Collados, Navigli. “Word Sense Disambiguation: A Unified Evaluation Framework and Empirical Comparison.” EACL 2017. The 7,253-instance corpus and the evaluation protocol used throughout this page.
DeBERTa (fine-tuned WSD), ≈ 82%: reported in follow-on WSD work using DeBERTa encodings fine-tuned on SemCor. See Barba, Procopio, Navigli, “ConSeC: Word Sense Disambiguation as Continuous Sense Comprehension,” EMNLP 2021, for representative DeBERTa-class F1 on Raganato ALL.
BEM (Bi-Encoder Model), ≈ 80%: Blevins & Zettlemoyer, “Moving Down the Long Tail of Word Sense Disambiguation with Gloss Informed Bi-encoders,” ACL 2020.
EWISER, ≈ 80%: Bevilacqua & Navigli, “Breaking Through the 80% Glass Ceiling: Raising the State of the Art in Word Sense Disambiguation by Incorporating Knowledge Graph Information,” ACL 2020.
GlossBERT, ≈ 77%: Huang, Sun, Qiu, Huang, “GlossBERT: BERT for Word Sense Disambiguation with Gloss Knowledge,” EMNLP 2019.
GPT-4 few-shot, ≈ 80%: community-reported result; comparable to other zero-/few-shot large-language-model evaluations on Raganato ALL in the low-80s range. We mark this approximate because a single canonical paper-grade comparison on the full framework has yet to appear in the literature at the time of writing.
Most-Frequent-Sense baseline, ≈ 65%: the standard WordNet first-sense heuristic used by every paper on the benchmark as a lower-bound reference.
