Some Context
I built Nuenki to translate everyday web content into your target language, but only the bits you can actually learn from. It processes webpages, picks out sentences at your knowledge level, then translates them as you browse.
The central technical problem is economics: high-quality translation is expensive, particularly when you're translating parts of every single website you visit, every single day. One solution is client-aware caching: the extension knows which translations are already in the server's cache, so it can be more liberal about translating sentences that won't incur additional costs.
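In rough terms, the client-side decision looks something like this (a minimal sketch; the names are illustrative rather than Nuenki's actual API):

// Minimal sketch of client-aware caching (illustrative names, not the real API).
// Sentences already in the server's translation cache are free to request,
// so the extension only has to be selective about uncached ones.
function shouldTranslate(sentence, cachedTranslations, budget) {
    if (cachedTranslations.query(sentence)) {
        return true;                          // cache hit: no extra translation cost
    }
    return budget.allowExpensiveTranslation(sentence); // cache miss: spend budget carefully
}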
Bloom Filters
Nuenki uses Bloom filters, regularly downloaded from the server to the client, to track cached translations. They're probabilistic data structures that compress large sets into a compact form that can be queried for membership. They're space-efficient and fast, but can return false positives. That's a non-issue for caching, but worth keeping in mind, particularly because their size grows non-linearly as you reduce the false positive rate.
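For a sense of scale, the standard sizing formula for a Bloom filter holding n items at a target false positive rate p is m = -n·ln(p) / (ln 2)² bits. Plugging in a cache of roughly 50,000 entries (the German cache used in the benchmarks below) gives a back-of-the-envelope figure like this; the calculation is illustrative, not Nuenki's production parameters:

// Standard Bloom filter sizing: m = -n * ln(p) / (ln 2)^2 bits for n items
// at false-positive rate p, with k = (m / n) * ln 2 hash functions.
function bloomFilterBits(n, p) {
    return Math.ceil(-n * Math.log(p) / Math.log(2) ** 2);
}

bloomFilterBits(50_000, 0.01);   // ~479,000 bits, about 60 KB at a 1% error rate
bloomFilterBits(50_000, 0.001);  // ~719,000 bits: 10x fewer false positives costs 1.5x the space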
Optimising the Bloom filters quickly became a key part of optimising Nuenki as a whole. On top of querying them for every sentence on every webpage, Nuenki also uses them extensively in its sentence-difficulty estimation.
Javascript Bloom Filters
I began with the npm bloom-filters library, using its classic Bloom filter with an acceptable error rate of 1%. I'll be benchmarking against Nuenki's German cache, which has approximately 50,000 entries. Everything was tested in a browser extension's background script running on Firefox.
import { BloomFilter } from 'bloom-filters';

// AbstractBloomFilter is the shared wrapper interface the benchmarks run against.
class BasicFilterWrapper extends AbstractBloomFilter {
    constructor(items, errorRate) {
        super();
        this.filter = BloomFilter.from(items, errorRate);
    }

    query(item) {
        return this.filter.has(item);
    }
}
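For reference, a per-query timing harness on the Javascript side might look roughly like this (a simplified sketch, with performance.now() as an assumed timer; not the exact benchmark code):

// Simplified sketch of the timing harness: time each query individually and
// keep the per-query latencies (in milliseconds) for the distribution charts.
function benchmark(filter, sentences) {
    const timings = [];
    for (const sentence of sentences) {
        const start = performance.now();
        filter.query(sentence);
        timings.push(performance.now() - start);
    }
    return timings;
}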
The mean was 0.094ms and the median 0.080ms, distributed as follows. The chart uses a random sample of 500 datapoints from the original million.
This performance is adequate for small webpages but scales poorly. A thousand queries - not uncommon - will take 94ms.
Cuckoo Filters: 1.03x faster
Cuckoo filters work similarly to Bloom filters. They're substantially newer - first described in 2014 - and on top of supporting operations that Bloom filters don't, like deletion, they also tend to be faster and more space-efficient.
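Conceptually, a cuckoo filter stores a short fingerprint of each item in one of two candidate buckets, so a lookup only checks two locations. A rough sketch of the idea (illustrative only, not the bloom-filters library's implementation; `hash` stands in for any 32-bit hash function):

// Conceptual cuckoo filter lookup. Real implementations use a power-of-two
// bucket count so the partner index can be derived from either bucket.
function cuckooHas(buckets, item, hash) {
    const h = hash(item);
    const fingerprint = h >>> 24;                          // top 8 bits as a short fingerprint
    const i1 = h % buckets.length;                         // first candidate bucket
    const i2 = (i1 ^ hash(fingerprint)) % buckets.length;  // second candidate bucket
    // The item is (probably) present if its fingerprint sits in either bucket.
    return buckets[i1].includes(fingerprint) || buckets[i2].includes(fingerprint);
}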
I used the same library as before to achieve a 1.03x performance improvement on the same dataset.
import { CuckooFilter } from 'bloom-filters';

class CuckooFilterWrapper extends AbstractBloomFilter {
    constructor(items, errorRate) {
        super();
        this.filter = CuckooFilter.from(items, errorRate);
    }

    query(item) {
        return this.filter.has(item);
    }
}
The cuckoo filter had a mean of 0.091ms and a median of 0.080ms, distributed as follows. Cuckoo filters have a noticeably different distribution.
Rust via WASM: 310.02x faster
The next optimisation step was to leave Javascript behind and use Rust's fastbloom library via WebAssembly. I generated the Bloom filters in a separate Rust project, then loaded them at runtime. This meant going back to standard Bloom filters rather than cuckoo filters, due to a lack of library support.
This improved performance by 310.0x, including WASM overhead. All the instrumentation remained on the Javascript side. I was surprised by how fast the WASM calls were - I expected the FFI (Foreign Function Interface) overhead to limit the performance improvements.
use fastbloom::BloomFilter;
use once_cell::sync::Lazy;
use rmp_serde;
use serde::Deserialize;
use wasm_bindgen::prelude::*;

// The serialised filter is baked into the WASM binary at compile time.
const RAW_BLOOM_DATA: &'static [u8; 67037] = include_bytes!("translation_filters.bin");

// MessagePack layout produced by the separate Rust project that generates the
// filters: the raw 64-bit blocks plus the number of hash functions.
#[derive(Deserialize)]
struct TranslationFilterDump {
    filter: Vec<u64>,
    hashes: u32,
}

// Deserialised lazily, on the first query.
static TRANSLATION_FILTER: Lazy<BloomFilter> = Lazy::new(load_translation_filter);

fn load_translation_filter() -> BloomFilter {
    let filter_dump: TranslationFilterDump = rmp_serde::from_slice(RAW_BLOOM_DATA).unwrap();
    BloomFilter::from_vec(filter_dump.filter)
        .seed(&42)
        .hashes(filter_dump.hashes)
}

#[wasm_bindgen]
pub fn basic_contains(text: &str) -> bool {
    TRANSLATION_FILTER.contains(text)
}
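On the Javascript side, the generated bindings make the query an ordinary function call. With wasm-pack's web target the usage looks roughly like this (the module path and sentence are placeholders, not Nuenki's actual layout):

// Illustrative usage of the wasm-bindgen output (wasm-pack --target web);
// requires an ES module context for top-level await.
import init, { basic_contains } from "./pkg/translation_filters.js";

await init();                                   // compile and instantiate the WASM module once
const cached = basic_contains("Das ist ein Beispielsatz.");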
It had a mean of 0.000302ms (302ns) and a median of 0.000300ms (300ns), distributed as follows.
Zooming into that distribution:
The following optimisation flags were in place, but didn't have a noticeable effect on performance:
[profile.release]
lto = "fat"
strip = true
codegen-units = 1
panic = "abort"
opt-level = 3
Compressed Bloom Filters
I experimented with compressing the input to the Bloom filter using lz4. Unsurprisingly, this only reduced performance: a Bloom filter already hashes its input down to a constant size, so compressing the input before hashing it will practically always be slower than hashing it directly. Queries took 1.0μs longer on average, making them 4.4x slower than the uncompressed version.
Conclusion
While the optimisations were fairly straightforward, aspects of them surprised me. I was particularly surprised by how low the WASM overhead is for simple calls, and by just how much faster Rust was than JS. I assumed the JIT would be able to optimise the repeated number-crunching workload to be far closer to Rust's performance. Perhaps Chrome's JIT would have fared better?
I was also surprised that Rust's compilation flags didn't measurably impact performance - any changes seemed to be smaller than the statistical noise of each run. Perhaps wasm-pack is overriding them?
In the future I intend to go over the sentence-difficulty heuristics Nuenki uses and how I optimised them to take mere microseconds, as well as the data structures involved in quickly processing Furigana and Pinyin.