Model | Overall Score | Coherence Mean | Coherence Std Dev | Coherence P90 | Idiomaticity Mean | Idiomaticity Std Dev | Idiomaticity P90 | Accuracy Mean | Accuracy Std Dev | Accuracy P90 | Latency Median (ms) | Latency P90 (ms)
---|---|---|---|---|---|---|---|---|---|---|---|---
NuenkiHybrid | 8.41 | 8.69 | 0.84 | 8 | 8.44 | 1.03 | 7 | 9.01 | 0.71 | 8 | 0 | 0 |
optimus-alpha | 8.32 | 8.73 | 1.14 | 8 | 8.40 | 1.43 | 7 | 8.97 | 1.03 | 8 | 1182 | 2581 |
openai/gpt-4.1 | 8.27 | 8.73 | 1.25 | 8 | 8.30 | 1.52 | 7 | 8.95 | 1.08 | 8 | 1285 | 2035 |
quasar-alpha | 8.26 | 8.66 | 1.32 | 7 | 8.37 | 1.47 | 7 | 8.98 | 0.98 | 8 | 832 | 1280 |
gpt-4o-2024-08-06 | 8.15 | 8.75 | 1.05 | 7 | 8.05 | 1.63 | 6 | 8.90 | 1.08 | 8 | 803 | 1288 |
x-ai/grok-3-beta | 8.14 | 8.64 | 1.43 | 7 | 8.19 | 1.55 | 6 | 8.88 | 1.13 | 8 | 1398 | 6440 |
claude-3-5-sonnet-20241022 | 8.10 | 8.53 | 1.13 | 7 | 8.38 | 1.37 | 7 | 8.91 | 0.97 | 8 | 1387 | 3112 |
openai/gpt-4.1-mini | 8.09 | 8.66 | 1.35 | 7 | 8.07 | 1.59 | 6 | 8.87 | 1.03 | 7 | 1083 | 1594 |
gemini-2.5-flash-preview-04-17 | 8.07 | 8.57 | 1.58 | 7 | 8.19 | 1.62 | 6 | 8.92 | 1.00 | 8 | 4349 | 7869 |
deepl | 8.02 | 8.59 | 1.45 | 7 | 8.07 | 1.70 | 6 | 8.67 | 1.51 | 7 | 226 | 322 |
gemma-3-27b-it | 7.95 | 8.48 | 1.36 | 7 | 8.31 | 1.41 | 7 | 8.92 | 1.08 | 8 | 1115 | 1605 |
gemini-2.0-flash-exp | 7.86 | 8.66 | 1.35 | 7 | 8.24 | 1.56 | 6 | 9.00 | 0.91 | 8 | 514 | 702 |
lingvanex | 7.76 | 8.69 | 1.42 | 7 | 7.39 | 2.07 | 4 | 8.52 | 1.54 | 6 | 206 | 270 |
Llama 4 Scout | 7.76 | 8.58 | 1.37 | 7 | 7.79 | 1.70 | 6 | 8.75 | 1.18 | 7 | 246 | 331
llama-3.3-70b-versatile | 7.73 | 8.36 | 1.57 | 7 | 7.72 | 1.77 | 6 | 8.64 | 1.35 | 7 | 334 | 1687 |
openai/gpt-4.1-nano | 7.68 | 8.25 | 1.57 | 6 | 7.74 | 1.84 | 6 | 8.62 | 1.34 | 7 | 826 | 1294 |
gemma2-9b-it | 7.32 | 7.87 | 1.94 | 6 | 7.71 | 1.94 | 5 | 8.44 | 1.58 | 7 | 407 | 489 |
qwen-2.5-32b | 6.76 | 8.23 | 1.71 | 6 | 7.00 | 2.09 | 4 | 8.02 | 1.96 | 5 | 326 | 472 |
llama-3.1-8b-instant | 5.88 | 6.03 | 2.89 | 2 | 6.84 | 2.31 | 3 | 7.59 | 2.19 | 4 | 266 | 818 |
mistral-small-latest | 5.69 | 6.54 | 2.74 | 2 | 7.17 | 2.23 | 4 | 7.46 | 2.57 | 3 | 603 | 1659 |
mistral-saba-24b | 5.19 | 5.50 | 2.86 | 2 | 6.93 | 2.36 | 3 | 7.18 | 2.70 | 2 | 255 | 345 |
Building a better translator
While doing this LLM translation quality research, I've noticed that models are often idiomatic in about 70% of their choices, but fail in at least one part of the sentence. With coherence now practically a "solved" metric and idiomaticity the real target, I wanted to see whether I could combine the best models to produce a translation greater than the max of its parts.
I built that, and made it open source. It turns out that you can! While its coherence is slightly lower than its peers' (more on that in a moment), it is the most idiomatic model, and it is also far more consistent, with a much lower standard deviation across all three metrics. It works by taking the top 3-4 models for a given language (based on this research), translating with each of them, then having a judge model (currently GPT-4.1) consider the strengths and weaknesses of each translation and merge them into an idiomatic whole.
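The fan-out-then-judge pattern above can be sketched in a few lines. This is a minimal illustration, not Nuenki's actual implementation: the prompt wording and the `call_llm` helper are placeholders for whatever API client you use.

```python
def hybrid_translate(text, target_lang, candidate_models, judge_model, call_llm):
    """Translate with several models, then have a judge merge the results.

    `call_llm(model, prompt)` is assumed to return the model's text response.
    """
    # 1. Fan out: get a candidate translation from each top model.
    candidates = {
        m: call_llm(m, f"Translate into {target_lang}: {text}")
        for m in candidate_models
    }
    # 2. Judge: ask a strong model to weigh each candidate and merge them.
    listing = "\n".join(f"[{m}] {t}" for m, t in candidates.items())
    prompt = (
        f"Here are candidate translations of '{text}' into {target_lang}:\n"
        f"{listing}\n"
        "Consider the strengths and weaknesses of each, then produce a "
        "single merged translation that is maximally idiomatic."
    )
    return call_llm(judge_model, prompt)
```

The fan-out calls are independent, so in practice they would be issued concurrently; the judge call is the only serial step, which is why the hybrid's latency tracks the judge model rather than the sum of its parts.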
Benchmark saturation and going beyond coherence ratings
This benchmark has clearly reached saturation. The differences between models sit within a narrow range, and we're reaching the point where gains in idiomaticity are negatively correlated with coherence: beyond a certain point, coherence becomes a proxy for how literal the translation is rather than how accurate it is.
I was recently discussing this with someone else who benchmarks LLM translation. He approached it by directly comparing translations against each other and applying a Bradley-Terry model, which is conceptually similar to Elo. I think that's the best way forward for future evaluations.
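For the curious, a Bradley-Terry model can be fit from pairwise win counts with the standard minorization-maximization update. This is a generic sketch of the technique, not the other benchmarker's actual code:

```python
from collections import defaultdict

def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from pairwise win counts.

    wins[(a, b)] = number of times translation a beat translation b.
    Returns a dict of strengths normalized to sum to 1; under the model,
    P(a beats b) = p[a] / (p[a] + p[b]).
    """
    items = sorted({x for pair in wins for x in pair})
    p = {i: 1.0 for i in items}
    total = defaultdict(float)      # n_ij: comparisons between i and j
    win_count = defaultdict(float)  # W_i: total wins of i
    for (a, b), n in wins.items():
        total[(a, b)] += n
        total[(b, a)] += n
        win_count[a] += n
    for _ in range(iters):
        # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
        new_p = {}
        for i in items:
            denom = sum(
                total[(i, j)] / (p[i] + p[j])
                for j in items
                if j != i and total[(i, j)] > 0
            )
            new_p[i] = win_count[i] / denom if denom else p[i]
        s = sum(new_p.values())
        p = {i: v / s for i, v in new_p.items()}
    return p
```

Unlike mean scores on a 1-10 scale, the fitted strengths only encode relative preference, so two near-saturated models that still differ on head-to-head comparisons stay distinguishable.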
Using hybrid translation in Nuenki
Hybrid translation is far too slow for the low-latency translation Nuenki does as you browse the web, so I'm going to keep it to the translation utility. Paying users get additional access to the utility, should you find it useful enough that the rate limits become a problem.