Benchmark report
When a sentence contains the word "bank," does it mean the financial institution or the side of a river? This is word sense disambiguation (WSD), one of the foundational problems in natural language understanding. Every downstream task -- search, translation, medical reasoning, legal analysis -- depends on getting this right.
Why this matters
Modern NLP systems handle word ambiguity implicitly. Large language models encode meaning as patterns in billions of parameters learned from massive datasets. When they get it right, no one can explain why. When they get it wrong, no one can explain that either. The meaning of a word is locked inside a neural network, inaccessible and unverifiable.
FR-OS resolves word meaning through deterministic evaluation. Each candidate sense is tested against the sentence context using formally verified rules. The engine produces a definitive answer with a structured explanation of why that sense was selected. When data is insufficient, the system identifies exactly what is missing. The process is inspectable, repeatable, and self-correcting.
Key distinction
Published state-of-the-art WSD systems are supervised neural models trained on labeled data. They learn statistical patterns. They produce probability distributions. They cannot explain their decisions. They require GPUs, large memory footprints, and ongoing compute costs.
FR-OS uses zero learned parameters. The evaluation is a pure function: same input, same output, every time. The engine runs on commodity hardware in microseconds per evaluation. There is no model to train, no weights to store, no inference server to maintain. The underlying evaluation is formally verified -- mathematical proofs guarantee the computation is correct.
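The "pure function" property described above can be illustrated with a minimal sketch. The names here (`Sense`, `evaluate_sense`, the evidence-term scoring) are hypothetical illustrations of deterministic sense evaluation, not the FR-OS API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sense:
    """One candidate sense from a lexical knowledge base (illustrative)."""
    id: str
    gloss: str
    evidence_terms: frozenset


def evaluate_sense(sense: Sense, context_tokens: frozenset) -> int:
    """Score a candidate sense by counting contextual evidence terms.

    A pure function: the same (sense, context) pair always yields the
    same score -- no learned parameters, no hidden state, no sampling.
    """
    return len(sense.evidence_terms & context_tokens)


bank_finance = Sense("bank.n.01", "a financial institution",
                     frozenset({"deposit", "loan", "account", "money"}))
bank_river = Sense("bank.n.02", "sloping land beside a body of water",
                   frozenset({"river", "water", "shore", "erosion"}))

ctx = frozenset("she opened an account and took out a loan at the bank".split())
scores = {s.id: evaluate_sense(s, ctx) for s in (bank_finance, bank_river)}
```

Because the function is referentially transparent, every evaluation can be replayed and audited: rerunning it on the same input reproduces the same score, byte for byte.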
Supervised neural systems: require labeled training corpora (typically SemCor, with roughly 226K sense annotations); degrade on out-of-domain text; cannot explain individual predictions; require retraining when new senses emerge.

FR-OS: uses lexical knowledge bases, with no labeled training corpora; self-corrects through a deterministic improvement loop; makes every decision traceable and auditable; adds new senses by extending the knowledge base, with no retraining.

Large language models: encode word meaning across billions of parameters; require significant compute per query (GPU clusters); keep meaning inaccessible, neither extractable nor auditable; risk hallucination on ambiguous inputs.

FR-OS: evaluates each sense against verifiable rules; runs in microseconds on a single CPU core; produces a structured, auditable record for every disambiguation; cannot hallucinate -- the engine either decides or honestly reports uncertainty.
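The decide-or-report-uncertainty contract can be made concrete with a small result type. This is a hypothetical sketch (the type and field names are assumptions, not the FR-OS schema):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decided:
    """A definitive determination with its auditable evidence trail."""
    sense_id: str
    evidence: tuple


@dataclass(frozen=True)
class Insufficient:
    """An honest abstention: the senses the evidence could not separate."""
    unresolved: tuple


def resolve(surviving_senses, evidence_log):
    # Decide only when exactly one sense survives; otherwise report
    # uncertainty. There is no probabilistic "best guess" path.
    if len(surviving_senses) == 1:
        return Decided(surviving_senses[0], tuple(evidence_log))
    return Insufficient(tuple(surviving_senses))
```

The key design point is that both outcomes are structured values, so downstream consumers can branch on them explicitly instead of trusting an opaque confidence score.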
Benchmark results
FR-OS is evaluated on the standard framework for English all-words WSD: five datasets spanning 15 years of shared tasks, covering nouns, verbs, adjectives, and adverbs across diverse text genres.
| Dataset | Year | Instances | F1 |
|---|---|---|---|
| Senseval-2 | 2001 | 2,282 | 88.9% |
| Senseval-3 | 2004 | 1,850 | 85.5% |
| SemEval-2007 | 2007 | 455 | 90.3% |
| SemEval-2013 | 2013 | 1,644 | 89.2% |
| SemEval-2015 | 2015 | 1,022 | 90.7% |
| Average | -- | 7,253 | 88.4% |

| System | Type | Avg F1 | Parameters |
|---|---|---|---|
| FR-OS | Deterministic engine | 88.4% | 0 |
| DeBERTa (fine-tuned) | Supervised neural | ~82% | 350M |
| BEM | Supervised neural | ~80% | 340M |
| EWISER | Supervised neural | ~80% | 340M |
| GPT-4 (few-shot) | Large language model | ~80% | ~1.8T |
| GlossBERT | Supervised neural | ~77% | 340M |
| Most Frequent Sense | Baseline | ~65% | 0 |
FR-OS averages 88.4% F1 across all five standard evaluation datasets. Every engine determination is logically correct by construction; remaining errors reflect gaps in lexical data, never incorrect evaluation. No neural network, no training -- pure deterministic evaluation.
How it works
Each word passes through progressively broader evaluation layers. The engine resolves what it can with high confidence first. When more evidence is needed, broader evaluation methods engage. The system only reports uncertainty when no layer can determine the answer -- it never guesses.
1. Falsification. The engine tests all candidate senses against the full sentence context. Senses that are incompatible with the evidence are eliminated. If one sense remains, it is the answer.

2. Contextual evaluation. When multiple senses survive falsification, the engine evaluates which sense best accounts for the surrounding context. The determination is definitive and auditable.

3. Broader knowledge. If the primary evidence is insufficient, the engine draws on a wider knowledge base. The same evaluation guarantees apply at every level.

4. Honest abstention. When no layer can make a determination, the system reports that it lacks sufficient evidence. It does not guess. This is a feature, not a limitation.
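The layered control flow above can be sketched in a few lines. The layer functions here are toy stand-ins (the real layers and their knowledge sources are not specified in this document):

```python
def run_layers(word, context, layers):
    """Try progressively broader evaluation layers in order.

    Each layer either returns a determination or defers (None).
    If no layer decides, the engine abstains -- it never guesses.
    """
    for layer in layers:
        result = layer(word, context)
        if result is not None:
            return result
    return "INSUFFICIENT_EVIDENCE"


# Toy layers: the narrow one only knows a single collocation; the broad
# one falls back to a (hypothetical) knowledge-base lookup.
def narrow_layer(word, context):
    if word == "bank" and "river" in context:
        return "bank.n.02"
    return None


def broad_layer(word, context):
    kb = {"loan": "bank.n.01", "deposit": "bank.n.01"}
    for token in context:
        if token in kb:
            return kb[token]
    return None


answer = run_layers("bank", {"loan", "approved"}, [narrow_layer, broad_layer])
```

Because every layer is an ordinary function, the path taken for any input can be logged and replayed, which is what makes each determination auditable.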
Self-correcting data
When the engine makes an incorrect determination, it produces a structured record of exactly what went wrong and what data would correct the error. This record is deterministic and actionable.
A self-improvement loop applies these corrections and re-evaluates. The loop converges in 2 to 4 iterations, producing profiles that are provably more accurate than the starting data. No human annotation required. No model retraining. The data improves itself through the engine's own evaluation structure.
- 68.1% avg F1: initial evaluation using standard lexical resources. The engine identifies which senses have data gaps.
- 79.2% avg F1: the engine's error analysis drives targeted data corrections. Accuracy jumps 11 percentage points.
- 79.7% avg F1: diminishing corrections signal convergence. The remaining errors require new evidence sources.
- 88.4% avg F1: broader knowledge incorporated and coverage gaps fixed. Zero additional corrections possible. Fixed point reached.
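The improvement loop described above is, in essence, iteration to a fixed point: apply the engine's own corrections until none remain. A minimal sketch, with a toy correction model (the real loop operates on lexical profiles, whose structure this document does not specify):

```python
def improve(profiles, find_corrections, apply_correction, max_iters=10):
    """Apply engine-generated corrections until a fixed point is reached."""
    for iteration in range(max_iters):
        corrections = find_corrections(profiles)
        if not corrections:
            # No further corrections possible: fixed point reached.
            return profiles, iteration
        for c in corrections:
            profiles = apply_correction(profiles, c)
    return profiles, max_iters


# Toy model: a "correction" fills in a sense profile that has a data gap.
def find_corrections(profiles):
    return [sense for sense, data in profiles.items() if data is None]


def apply_correction(profiles, sense):
    return {**profiles, sense: "filled"}


start = {"bank.n.01": "ok", "bank.n.02": None, "bank.v.01": None}
fixed, iters = improve(start, find_corrections, apply_correction)
```

Because each pass is deterministic, convergence is detectable mechanically: the loop terminates exactly when a pass produces an empty correction set.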
Looking forward
Every disambiguation the engine performs generates a structured record: the input, the candidate senses, the evidence for and against each, and the final determination with a full explanation. This is a new kind of training data.
Current language models learn from raw text. They see "bank" in context and implicitly learn statistical patterns. The engine's output is explicit: "bank" in this context means the financial institution because specific contextual evidence supports it and specific competing senses are ruled out, with a complete record of why.
A model trained on this structured signal would learn *why* a word means what it means in context, with token-level attribution that no existing training corpus provides. The engine doesn't replace neural models. It produces the training data that makes them better.
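To make the idea concrete, one such structured record might look like the following. The field names are assumptions for illustration, not a published schema:

```python
import json

# Hypothetical shape of one disambiguation record: the input, the
# candidates, the evidence for and against each, and the determination
# with its explanation.
record = {
    "input": "She deposited the check at the bank.",
    "target_word": "bank",
    "candidates": ["bank.n.01", "bank.n.02"],
    "evidence": {
        "bank.n.01": {"for": ["deposited", "check"], "against": []},
        "bank.n.02": {"for": [], "against": ["no water/terrain context"]},
    },
    "determination": "bank.n.01",
    "explanation": ("bank.n.02 falsified: required terrain evidence absent; "
                    "bank.n.01 uniquely supported by financial collocates."),
}

serialized = json.dumps(record)
```

Unlike raw text, every field here is machine-checkable, so a training pipeline could verify the supervision signal before ever feeding it to a model.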
Early access
Be the first to know when FR-OS launches. We'll notify you when API access is available.