Estimate Sentence Difficulty


How it works

The algorithm is written in Rust and compiled to WebAssembly. Broadly, it works by:

  • Splitting the sentence into words, discarding punctuation and spacing.
  • Using a precomputed Bloom filter of all words to work out which "words" are actually in the dictionary.
  • Categorising each word into a CEFR category or "other" (mostly technical language) using more precomputed Bloom filters.
  • Assigning a score to each category and averaging the scores to get a vocabulary difficulty score.
  • Calculating a grammar score using punctuation, tense heuristics, the length of the sentence, etc.
  • Combining the grammar score with the vocabulary difficulty score and a few other minor heuristics to output a sentence difficulty.
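
As a rough illustration of the steps above, here is a minimal sketch in Rust. It is not the actual implementation: the `Bloom` and `Estimator` types, the `tokenize` function, the per-level scores, the grammar heuristic (sentence length plus comma count) and the 0.7/0.3 weighting are all assumptions made up for the example, and the tense heuristics are left out entirely.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Minimal Bloom filter: false positives are possible, false negatives are not.
pub struct Bloom {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u64,
}

impl Bloom {
    pub fn new(num_bits: u64, num_hashes: u64) -> Self {
        let words = ((num_bits + 63) / 64) as usize;
        Bloom { bits: vec![0; words], num_bits, num_hashes }
    }

    /// Bit index for the i-th probe, derived from two base hashes (double hashing).
    fn index(&self, word: &str, i: u64) -> u64 {
        let mut first = DefaultHasher::new();
        word.hash(&mut first);
        let mut second = DefaultHasher::new();
        (word, 1u8).hash(&mut second);
        let step = second.finish() | 1; // force a non-zero step
        first.finish().wrapping_add(i.wrapping_mul(step)) % self.num_bits
    }

    pub fn insert(&mut self, word: &str) {
        for i in 0..self.num_hashes {
            let idx = self.index(word, i);
            self.bits[(idx / 64) as usize] |= 1 << (idx % 64);
        }
    }

    pub fn contains(&self, word: &str) -> bool {
        (0..self.num_hashes).all(|i| {
            let idx = self.index(word, i);
            (self.bits[(idx / 64) as usize] & (1 << (idx % 64))) != 0
        })
    }
}

/// Split a sentence into lowercase words, discarding punctuation and spacing.
pub fn tokenize(sentence: &str) -> Vec<String> {
    sentence
        .split(|c: char| !c.is_alphabetic())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect()
}

pub struct Estimator {
    /// Filter containing every known word.
    pub dictionary: Bloom,
    /// One filter per CEFR level, easiest first, each with a hypothetical score.
    pub levels: Vec<(Bloom, f32)>,
    /// Hypothetical score for dictionary words found in no CEFR filter ("other").
    pub other_score: f32,
}

impl Estimator {
    /// Average category score over the words that are in the dictionary.
    pub fn vocabulary_score(&self, words: &[String]) -> f32 {
        let mut total = 0.0;
        let mut count = 0u32;
        for word in words.iter().filter(|w| self.dictionary.contains(w.as_str())) {
            let level = self.levels.iter().find(|(filter, _)| filter.contains(word.as_str()));
            total += level.map_or(self.other_score, |(_, score)| *score);
            count += 1;
        }
        if count == 0 { 0.0 } else { total / count as f32 }
    }

    /// Placeholder grammar heuristic: longer sentences and more clauses score higher.
    pub fn grammar_score(&self, sentence: &str, word_count: usize) -> f32 {
        let clause_marks = sentence.matches(|c: char| c == ',' || c == ';').count();
        word_count as f32 / 5.0 + clause_marks as f32 * 0.5
    }

    /// Combine both scores with made-up weights.
    pub fn difficulty(&self, sentence: &str) -> f32 {
        let words = tokenize(sentence);
        let vocabulary = self.vocabulary_score(&words);
        let grammar = self.grammar_score(sentence, words.len());
        0.7 * vocabulary + 0.3 * grammar
    }
}
```

Because a Bloom filter only stores bits, each lookup is a handful of hashes and memory reads, which fits the goal of being fast rather than exact. The trade-off is the occasional false positive, so a word may sometimes land in the wrong category.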

There's no "AI" or neural networks, just classical algorithms.

The design goal is to be extremely fast and good enough for roughly deciding which sentences to show to the user. A little variation in the estimates isn't a problem: it organically pushes the user to learn a little beyond their current knowledge level.

It can categorise an entire text-filled website in less than a millisecond. It also tends to run about 1.5x faster once it's called in a hot loop; I don't know whether that's down to the JS side, the WASM side, or simply CPU caches and branch prediction.

If you'd like to see it in action, try out Nuenki!