TL;DR: Here's the data
Model | Overall Score | Coherence Mean | Coherence IQR | Coherence P90 | Idiomaticity Mean | Idiomaticity IQR | Idiomaticity P90 | Accuracy Mean | Accuracy IQR | Accuracy P90 | Latency Mean (ms) | Latency P90 (ms)
---|---|---|---|---|---|---|---|---|---|---|---|---
gpt-4o-2024-08-06 | 9.08 | 8.80 | 1 | 8 | 8.09 | 2 | 6 | 8.96 | 1 | 8 | 2088 | 4818
claude-3-5-sonnet-20241022 | 8.84 | 8.55 | 1 | 7 | 8.40 | 1 | 7 | 8.94 | 1 | 8 | 1771 | 2923
llama-3.3-70b-versatile | 8.59 | 8.46 | 1 | 7 | 7.68 | 2 | 6 | 8.67 | 1 | 7 | 404 | 681
gemini-2.0-flash-exp | 8.22 | 8.67 | 1 | 7 | 8.11 | 2 | 6 | 8.91 | 2 | 8 | 849 | 1798
gemma2-9b-it | 7.80 | 7.96 | 2 | 6 | 7.58 | 2 | 4 | 8.39 | 2 | 6 | 474 | 723
llama-3.1-8b-instant | 6.29 | 6.22 | 6 | 2 | 6.90 | 4 | 4 | 7.66 | 2 | 4 | 358 | 721
mistral-small-latest | 5.07 | 6.60 | 5 | 2 | 7.05 | 3 | 4 | 7.34 | 3 | 3 | 377 | 596
Nuenki needs to do a lot of translation, quickly (to avoid noticeable latency while browsing) and at high quality - learning from a model's mistakes can do more harm than good. Right now it uses DeepL for text you can see (since it's fast), and Claude for text you can't (because it's higher quality and cheaper).
Can we do better?
Claude often refuses to translate, and the problem is worsening as Anthropic tightens its "alignment" ever further. DeepL is also quite expensive, with translation costs making up the majority of Nuenki's subscription price. If I can find a model of comparable quality to Sonnet that can be served through Groq for low-latency translation, it would make DeepL obsolete.
Creating a benchmarking methodology
Benchmarking language translation is difficult. In an ideal world we would have a large sample of human translators compare and evaluate the different models across a broad range of sentences, but I don't have the resources for that.
Using LLMs to rate other LLMs is an attractive idea, but in my experience they have a tendency to miss nuance and grammatical subtleties - not to mention the possibility that they simply hallucinate. Their RLHF'd optimism is a problem, too: beyond being irritating, I've found that it leads to a bias towards high ratings. "Grammatical trainwreck; 8/10" was quite a common occurrence before I explicitly told them to be harsher.
Coherency
Coherency tries to mitigate this bias by asking LLMs to do less. To get the coherency metric, I translate from English to the target language, translate the result back to English, and repeat that round trip three times. Then I ask an LLM judge to rate how close the original English sentence and the sextuple-translated one are. The LLM judge is chosen from Sonnet, GPT-4o, and Gemini by hashing the original sentence, so the pseudorandom pick is deterministic across runs and languages.
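To make that concrete, here's a minimal sketch of the round-trip pipeline in JavaScript. The translate and judgeCoherency helpers are placeholders for the real API calls, and the hash-based judge selection is just one way of getting a deterministic pseudorandom pick - not necessarily the exact implementation Nuenki uses.

```javascript
const crypto = require("crypto");

// The three judge models used in the benchmark.
const JUDGES = ["claude-3-5-sonnet-20241022", "gpt-4o-2024-08-06", "gemini-2.0-flash-exp"];

// Hash the original sentence so the judge pick is pseudorandom but
// deterministic across runs and languages.
const pickJudge = (sentence) => {
  const digest = crypto.createHash("sha256").update(sentence).digest();
  return JUDGES[digest.readUInt32BE(0) % JUDGES.length];
};

// translate(model, text, from, to) and judgeCoherency(judge, original, roundTripped)
// are placeholders for the actual translation and judging calls.
const coherencyScore = async (model, sentence, targetLang, translate, judgeCoherency) => {
  let english = sentence;
  // Three full round trips: English -> target -> English, i.e. six translations.
  for (let i = 0; i < 3; i++) {
    const foreign = await translate(model, english, "en", targetLang);
    english = await translate(model, foreign, targetLang, "en");
  }
  // The judge only sees the original and the sextuple-translated sentence.
  return judgeCoherency(pickJudge(sentence), sentence, english);
};
```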
Accuracy and Idiomaticity
These are the result of asking an LLM judge (chosen using the same hashing process) to rate a translation for idiomaticity and accuracy. Like coherency, the judging is blinded: the judge doesn't know which model produced the translation. Due to the LLM biases discussed above, these make up a smaller part of the overall score than coherency.
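As a sketch of what a blinded rating request might look like, reusing the pickJudge helper from above - the prompt wording and the askJudge helper are illustrative assumptions, not the exact prompt used:

```javascript
// Blinded rating: the judge sees the source sentence and the candidate
// translation, but never which model produced it. askJudge() is a placeholder
// for the real chat-completion call.
const rateTranslation = async (sentence, translation, targetLang, askJudge) => {
  const prompt =
    `Rate this ${targetLang} translation of an English sentence on two axes, each 0-10. ` +
    `Be harsh.\n` +
    `English: ${sentence}\n` +
    `Translation: ${translation}\n` +
    `Reply with JSON only: {"accuracy": n, "idiomaticity": n}`;
  return JSON.parse(await askJudge(pickJudge(sentence), prompt));
};
```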
Distribution Matters
Beyond examining the mean, we must also consider how narrow the distribution is, via the interquartile range (IQR) and the score of the worst 10% of translations (P90). An LLM that provides consistently OK translations is more useful than one that gives excellent translations 75% of the time and misleads the learner the other 25%.
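For reference, here's how those three columns could be derived from a list of per-sentence scores. The nearest-rank percentile is a simple approximation, and "P90" follows the tables' convention: the score at the boundary of the worst 10% of translations, matching the bottom_10 field used in the scoring code below.

```javascript
// Nearest-rank percentile over an ascending-sorted array: p = 0.1 gives the
// score at the boundary of the worst 10% of translations.
const percentile = (sortedAsc, p) => {
  const idx = Math.min(sortedAsc.length - 1, Math.floor(p * sortedAsc.length));
  return sortedAsc[idx];
};

const distributionStats = (scores) => {
  const sorted = [...scores].sort((a, b) => a - b);
  const mean = sorted.reduce((sum, s) => sum + s, 0) / sorted.length;
  return {
    mean,
    iqr: percentile(sorted, 0.75) - percentile(sorted, 0.25), // spread of the middle 50%
    bottom_10: percentile(sorted, 0.10), // the "P90" column
  };
};
```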
Sample Size
Most of the LLMs were tested on 21 sentences per language, or 210 sentences in total. llama-3.3-70b, gpt-4o, and Sonnet were tested on an additional 10 sentences once it was clear they were the front-runners, as I wanted to be more confident about replacing Sonnet with 4o and Llama. The 10 languages chosen are a subset of Nuenki's supported languages, largely determined by usage. The sentences range from casual ones found on the internet ("You made it to the light!") to ones designed to target idiomatic language ("My name is Alex"), deliberately difficult grammar ("that this is how things have been done does not make it any less so"), and weirdly formatted wildcards ("Eggs US – Price – Chart").
The Overall Score
The overall score is weighted based on what I think is most important - the coherence, P90 quality, and refusal rate - while also including idiomaticity and accuracy.
```javascript
const computeOverallScore = (entry) => {
    const { coherence, idiomatic, accuracy, refusal_rate_percent } = entry;

    // Weighted mean: coherence counts double relative to idiomaticity and accuracy.
    const mean_val = ((coherence.mean * 2) + idiomatic.mean + accuracy.mean) / 4;

    // Penalise refusals: 0% refusals gives x1.0, 100% refusals gives x0.5.
    const refusal_adjust = 0.5 + ((100 - refusal_rate_percent) / 200);

    // Penalise a weak worst-10% tail, again weighting coherence double.
    const p90_avg = ((coherence.bottom_10 * 2) + idiomatic.bottom_10 + accuracy.bottom_10) / 4;
    const p90_mult = 0.75 + (p90_avg / 40);

    // 1.13 is a fixed scaling constant to offset the multipliers above.
    return p90_mult * refusal_adjust * mean_val * 1.13;
};
```
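For illustration, here's the function applied to numbers like the gpt-4o row. The refusal rate isn't listed in the table, so the value below is a placeholder; the published 9.08 presumably reflects a small non-zero refusal rate.

```javascript
// Hypothetical entry using the gpt-4o row's means and P90s; the refusal rate
// is a placeholder since it isn't listed in the table.
const example = {
  coherence: { mean: 8.80, bottom_10: 8 },
  idiomatic: { mean: 8.09, bottom_10: 6 },
  accuracy: { mean: 8.96, bottom_10: 8 },
  refusal_rate_percent: 0,
};
console.log(computeOverallScore(example).toFixed(2)); // ~9.18 with zero refusals
```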
The Data
Model | Overall Score | Coherence Mean | Coherence IQR | Coherence P90 | Idiomaticity Mean | Idiomaticity IQR | Idiomaticity P90 | Accuracy Mean | Accuracy IQR | Accuracy P90 | Latency Mean (ms) | Latency P90 (ms)
---|---|---|---|---|---|---|---|---|---|---|---|---
gpt-4o-2024-08-06 | 9.08 | 8.80 | 1 | 8 | 8.09 | 2 | 6 | 8.96 | 1 | 8 | 2088 | 4818
claude-3-5-sonnet-20241022 | 8.84 | 8.55 | 1 | 7 | 8.40 | 1 | 7 | 8.94 | 1 | 8 | 1771 | 2923
llama-3.3-70b-versatile | 8.59 | 8.46 | 1 | 7 | 7.68 | 2 | 6 | 8.67 | 1 | 7 | 404 | 681
gemini-2.0-flash-exp | 8.22 | 8.67 | 1 | 7 | 8.11 | 2 | 6 | 8.91 | 2 | 8 | 849 | 1798
gemma2-9b-it | 7.80 | 7.96 | 2 | 6 | 7.58 | 2 | 4 | 8.39 | 2 | 6 | 474 | 723
llama-3.1-8b-instant | 6.29 | 6.22 | 6 | 2 | 6.90 | 4 | 4 | 7.66 | 2 | 4 | 358 | 721
mistral-small-latest | 5.07 | 6.60 | 5 | 2 | 7.05 | 3 | 4 | 7.34 | 3 | 3 | 377 | 596
There's some fascinating stuff here. Beginning with the All Languages tab, gpt-4o is better than Sonnet, with llama-3.3-70b marginally worse and Gemini acceptable. What I find really interesting, though, is the variance between languages. Llama is substantially better at Chinese than Claude, while 4o remains better still.
Gemini has a very high (45%) refusal rate with Esperanto - perhaps it knows its knowledge is limited?
German is another outlier, with Gemma 9b impressively close to substantially larger models. It demonstrates why it's important to test across a broad range of languages: since I'm learning German, it would be tempting to focus on German translation, as it's a language I have some knowledge of - but that would give me an unrepresentative impression that generalises poorly to other languages.
As for latency, gpt-4o remains substantially slower than Claude. I previously found this during some basic tests when I first added Claude, and it's a major barrier to introducing it as another translation source. Mistral is impressively fast, even if it is pretty awful at translation - it beats Groq, whose niche is using hundreds of custom processors to do very low-latency inference.
Future improvements
This methodology is useful, but imperfect. Crowdsourcing is the obvious solution, but it inherently involves a large crowd - and I doubt there are many people particularly interested in this.
I'd also like to introduce DeepSeek R1 as a rating LLM. The rating prompt encourages some chain of thought before giving the answer, but a dedicated CoT LLM might be better at deeply analysing vocabulary and grammar choices. DeepSeek is currently limiting new credit deposits, so it'll have to wait until another time.
As for Nuenki, I intend to begin introducing llama-3.3-70b via Groq in a limited capacity, keeping an eye out for any regressions.
If you liked this, please consider giving Nuenki a try! It's free, people seem to quite like it, and I'd really appreciate feedback.