Model | Overall Score | Coherence Mean | Coherence Std Dev | Coherence P90 | Idiomaticity Mean | Idiomaticity Std Dev | Idiomaticity P90 | Accuracy Mean | Accuracy Std Dev | Accuracy P90 | Latency Median (ms) | Latency P90 (ms)
---|---|---|---|---|---|---|---|---|---|---|---|---
optimus-alpha | 8.32 | 8.73 | 1.14 | 8 | 8.40 | 1.43 | 7 | 8.97 | 1.03 | 8 | 1182 | 2581 |
openai/gpt-4.1 | 8.27 | 8.73 | 1.25 | 8 | 8.30 | 1.52 | 7 | 8.95 | 1.08 | 8 | 1285 | 2035 |
quasar-alpha | 8.26 | 8.66 | 1.32 | 7 | 8.37 | 1.47 | 7 | 8.98 | 0.98 | 8 | 832 | 1280 |
gpt-4o-2024-08-06 | 8.15 | 8.75 | 1.05 | 7 | 8.05 | 1.63 | 6 | 8.90 | 1.08 | 8 | 803 | 1288 |
x-ai/grok-3-beta | 8.14 | 8.64 | 1.43 | 7 | 8.19 | 1.55 | 6 | 8.88 | 1.13 | 8 | 1398 | 6440 |
claude-3-5-sonnet-20241022 | 8.10 | 8.53 | 1.13 | 7 | 8.38 | 1.37 | 7 | 8.91 | 0.97 | 8 | 1387 | 3112 |
openai/gpt-4.1-mini | 8.09 | 8.66 | 1.35 | 7 | 8.07 | 1.59 | 6 | 8.87 | 1.03 | 7 | 1083 | 1594 |
deepl | 8.02 | 8.59 | 1.45 | 7 | 8.07 | 1.70 | 6 | 8.67 | 1.51 | 7 | 226 | 322 |
gemma-3-27b-it | 7.95 | 8.48 | 1.36 | 7 | 8.31 | 1.41 | 7 | 8.92 | 1.08 | 8 | 1115 | 1605 |
gemini-2.0-flash-exp | 7.86 | 8.66 | 1.35 | 7 | 8.24 | 1.56 | 6 | 9.00 | 0.91 | 8 | 514 | 702 |
lingvanex | 7.76 | 8.69 | 1.42 | 7 | 7.39 | 2.07 | 4 | 8.52 | 1.54 | 6 | 206 | 270 |
Llama 4 Scout | 7.76 | 8.58 | 1.37 | 7 | 7.79 | 1.70 | 6 | 8.75 | 1.18 | 7 | 246 | 331
llama-3.3-70b-versatile | 7.73 | 8.36 | 1.57 | 7 | 7.72 | 1.77 | 6 | 8.64 | 1.35 | 7 | 334 | 1687 |
openai/gpt-4.1-nano | 7.68 | 8.25 | 1.57 | 6 | 7.74 | 1.84 | 6 | 8.62 | 1.34 | 7 | 826 | 1294 |
gemma2-9b-it | 7.32 | 7.87 | 1.94 | 6 | 7.71 | 1.94 | 5 | 8.44 | 1.58 | 7 | 407 | 489 |
qwen-2.5-32b | 6.76 | 8.23 | 1.71 | 6 | 7.00 | 2.09 | 4 | 8.02 | 1.96 | 5 | 326 | 472 |
llama-3.1-8b-instant | 5.88 | 6.03 | 2.89 | 2 | 6.84 | 2.31 | 3 | 7.59 | 2.19 | 4 | 266 | 818 |
mistral-small-latest | 5.69 | 6.54 | 2.74 | 2 | 7.17 | 2.23 | 4 | 7.46 | 2.57 | 3 | 603 | 1659 |
mistral-saba-24b | 5.19 | 5.50 | 2.86 | 2 | 6.93 | 2.36 | 3 | 7.18 | 2.70 | 2 | 255 | 345 |
## GPT-4.1 and Quasar Alpha
OpenAI recently released GPT-4.1, alongside "Mini" and "Nano" variants. GPT-4.1 had previously been teased as the anonymous Quasar Alpha and Optimus Alpha models on OpenRouter. I evaluated them at the time and noted how good they were at translation. Now that it's fully released and I've evaluated all three versions, I'm not quite sure what the difference between Quasar and Optimus is: both score very close to the full GPT-4.1, and nowhere near Mini and Nano.
Either way, the data shows that the full GPT-4.1 remains excellent at translation, while the smaller variants are decent but nothing special. Mini is only slightly worse than Sonnet 3.5, which is what Nuenki currently uses to translate sentences as you browse. As always, there's a lot of variance between languages, so I encourage you to go through the per-language results and see.
## Hybrid Translation and Grok
I've been working on an open-source translation system that combines several models to produce a translation that will hopefully beat any single model. It's still a work in progress - I'll probably release a demo tomorrow. Anyway, I decided to evaluate Grok for that system, though I wouldn't consider using it in the Nuenki extension for privacy reasons.
It's actually rather good at translation, as you can see from the data - I'm pleasantly surprised.
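As a rough illustration of the shape of that hybrid system, here's a minimal sketch: the model list, `translate_with()`, and `pick_best()` are placeholders invented for the example, not the actual implementation.

```python
# Rough sketch of a hybrid translation step: query several models in
# parallel, then let a judge pick (or merge) the best candidate.
# translate_with() and pick_best() are illustrative stubs, not real APIs.

from concurrent.futures import ThreadPoolExecutor

CANDIDATE_MODELS = ["openai/gpt-4.1", "x-ai/grok-3-beta", "deepl"]


def translate_with(model: str, text: str, target_lang: str) -> str:
    """Send the text to one provider/model and return its translation (stub)."""
    raise NotImplementedError


def pick_best(text: str, target_lang: str, candidates: dict[str, str]) -> str:
    """Choose (or synthesise) a final translation from the candidates (stub)."""
    raise NotImplementedError


def hybrid_translate(text: str, target_lang: str) -> str:
    # Ask every candidate model in parallel, then judge between them.
    with ThreadPoolExecutor() as pool:
        futures = {
            model: pool.submit(translate_with, model, text, target_lang)
            for model in CANDIDATE_MODELS
        }
        candidates = {model: fut.result() for model, fut in futures.items()}
    return pick_best(text, target_lang, candidates)
```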
## Methodology
I went over it in more detail in the first post, but effectively:
- Coherence is calculated by translating from English -> target language -> English 3 times, then having 3 blinded LLMs rate how close the original text is to the new text.
- Idiomaticity is calculated by translating from English to the target language, then having 3 other (blinded) LLMs rate how idiomatic the translation is.
- Accuracy is calculated by translating from English to the target language, then having 3 other (blinded) LLMs rate how accurate the translation is.
- This is repeated across a sample of ~40 sentences, and the overall score is a weighted combination of the metrics. I changed the weighting algorithm to use the standard deviation rather than the IQR, and adjusted the ratios to put greater emphasis on idiomaticity and slightly less on refusal rate. A simplified sketch of the pipeline follows.
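This is a simplified sketch rather than the actual benchmark code: `translate()` and `judge()` are stand-ins for the real API calls, and the weights in `overall()` are illustrative guesses rather than the real values.

```python
# Simplified sketch of the evaluation pipeline. translate() and judge() are
# stand-ins for real API calls; the prompts and weights are illustrative
# rather than the benchmark's exact implementation.

import statistics
from typing import Callable

Translate = Callable[[str, str], str]            # (text, target_lang) -> translation
Judge = Callable[[str, str, str], list[float]]   # (metric, original, candidate) -> scores from 3 blinded LLMs


def score_sentence(sentence: str, target_lang: str,
                   translate: Translate, judge: Judge) -> dict[str, list[float]]:
    # Coherence: English -> target -> English three times; judges rate how
    # close the round-tripped text is to the original.
    coherence = []
    for _ in range(3):
        back = translate(translate(sentence, target_lang), "English")
        coherence.extend(judge("coherence", sentence, back))

    # Idiomaticity and accuracy: judged directly on a forward translation.
    forward = translate(sentence, target_lang)
    idiomaticity = judge("idiomaticity", sentence, forward)
    accuracy = judge("accuracy", sentence, forward)

    return {"coherence": coherence, "idiomaticity": idiomaticity, "accuracy": accuracy}


def summarise(scores: dict[str, list[float]]) -> dict[str, dict[str, float]]:
    """Collapse judge scores across the ~40 sentences into Mean / Std Dev columns."""
    return {
        metric: {
            "mean": statistics.mean(values),
            "std_dev": statistics.stdev(values),
        }
        for metric, values in scores.items()
    }


def overall(summary: dict[str, dict[str, float]],
            weights: dict[str, float]) -> float:
    # Weighted combination of metric means, e.g. weights like
    # {"coherence": 0.3, "idiomaticity": 0.45, "accuracy": 0.25} (illustrative only).
    # The real benchmark also folds in standard deviation and refusal rate,
    # which this sketch omits.
    return sum(weights[metric] * stats["mean"] for metric, stats in summary.items())
```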
## Future of Nuenki Translation
I intend to integrate GPT-4.1 as a translation source in the next few days. Right now Nuenki uses a mixture of DeepL, Llama 3.3, and Sonnet depending on the circumstances. Sonnet is quite an old model now, so GPT-4.1 will replace it for some languages, though Sonnet will remain as a fallback in case of downtime. It's worth noting that the overall score heavily penalises refusals; looking at translation performance alone, Sonnet is still quite a good model.
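For illustration, per-language routing with a fallback might look something like this. The language assignments and `call_model()` are invented for the example; they aren't Nuenki's actual configuration or API.

```python
# Illustrative sketch of per-language model routing with a fallback.
# The assignments below are hypothetical, not the real configuration.

PRIMARY_MODEL = {
    "German": "openai/gpt-4.1",
    "French": "openai/gpt-4.1",
    "Japanese": "claude-3-5-sonnet-20241022",
}
DEFAULT_MODEL = "deepl"
FALLBACK_MODEL = "claude-3-5-sonnet-20241022"


def call_model(model: str, text: str, target_lang: str) -> str:
    """Send the translation request to the given provider (stub)."""
    raise NotImplementedError


def translate_sentence(text: str, target_lang: str) -> str:
    primary = PRIMARY_MODEL.get(target_lang, DEFAULT_MODEL)
    try:
        return call_model(primary, text, target_lang)
    except Exception:
        # If the primary provider errors or is down, fall back to Sonnet.
        return call_model(FALLBACK_MODEL, text, target_lang)
```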