The best translator is a hybrid translator

Languages evaluated (results below aggregate across all of them): Chinese, Esperanto, French, German, Hungarian, Italian, Japanese, Korean, Spanish, Swedish, Ukrainian, Vietnamese.
| Model | Overall Score | Coherence Mean | Coherence Std Dev | Coherence P90 | Idiomaticity Mean | Idiomaticity Std Dev | Idiomaticity P90 | Accuracy Mean | Accuracy Std Dev | Accuracy P90 | Latency Median (ms) | Latency P90 (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Nuenki Hybrid | 8.41 | 8.69 | 0.84 | 8 | 8.44 | 1.03 | 7 | 9.01 | 0.71 | 8 | 0 | 0 |
| optimus-alpha | 8.32 | 8.73 | 1.14 | 8 | 8.40 | 1.43 | 7 | 8.97 | 1.03 | 8 | 1182 | 2581 |
| openai/gpt-4.1 | 8.27 | 8.73 | 1.25 | 8 | 8.30 | 1.52 | 7 | 8.95 | 1.08 | 8 | 1285 | 2035 |
| quasar-alpha | 8.26 | 8.66 | 1.32 | 7 | 8.37 | 1.47 | 7 | 8.98 | 0.98 | 8 | 832 | 1280 |
| gpt-4o-2024-08-06 | 8.15 | 8.75 | 1.05 | 7 | 8.05 | 1.63 | 6 | 8.90 | 1.08 | 8 | 803 | 1288 |
| x-ai/grok-3-beta | 8.14 | 8.64 | 1.43 | 7 | 8.19 | 1.55 | 6 | 8.88 | 1.13 | 8 | 1398 | 6440 |
| claude-3-5-sonnet-20241022 | 8.10 | 8.53 | 1.13 | 7 | 8.38 | 1.37 | 7 | 8.91 | 0.97 | 8 | 1387 | 3112 |
| openai/gpt-4.1-mini | 8.09 | 8.66 | 1.35 | 7 | 8.07 | 1.59 | 6 | 8.87 | 1.03 | 7 | 1083 | 1594 |
| gemini-2.5-flash-preview-04-17 | 8.07 | 8.57 | 1.58 | 7 | 8.19 | 1.62 | 6 | 8.92 | 1.00 | 8 | 4349 | 7869 |
| deepl | 8.02 | 8.59 | 1.45 | 7 | 8.07 | 1.70 | 6 | 8.67 | 1.51 | 7 | 226 | 322 |
| gemma-3-27b-it | 7.95 | 8.48 | 1.36 | 7 | 8.31 | 1.41 | 7 | 8.92 | 1.08 | 8 | 1115 | 1605 |
| gemini-2.0-flash-exp | 7.86 | 8.66 | 1.35 | 7 | 8.24 | 1.56 | 6 | 9.00 | 0.91 | 8 | 514 | 702 |
| lingvanex | 7.76 | 8.69 | 1.42 | 7 | 7.39 | 2.07 | 4 | 8.52 | 1.54 | 6 | 206 | 270 |
| Llama 4 Scout | 7.76 | 8.58 | 1.37 | 7 | 7.79 | 1.70 | 6 | 8.75 | 1.18 | 7 | 246 | 331 |
| llama-3.3-70b-versatile | 7.73 | 8.36 | 1.57 | 7 | 7.72 | 1.77 | 6 | 8.64 | 1.35 | 7 | 334 | 1687 |
| openai/gpt-4.1-nano | 7.68 | 8.25 | 1.57 | 6 | 7.74 | 1.84 | 6 | 8.62 | 1.34 | 7 | 826 | 1294 |
| gemma2-9b-it | 7.32 | 7.87 | 1.94 | 6 | 7.71 | 1.94 | 5 | 8.44 | 1.58 | 7 | 407 | 489 |
| qwen-2.5-32b | 6.76 | 8.23 | 1.71 | 6 | 7.00 | 2.09 | 4 | 8.02 | 1.96 | 5 | 326 | 472 |
| llama-3.1-8b-instant | 5.88 | 6.03 | 2.89 | 2 | 6.84 | 2.31 | 3 | 7.59 | 2.19 | 4 | 266 | 818 |
| mistral-small-latest | 5.69 | 6.54 | 2.74 | 2 | 7.17 | 2.23 | 4 | 7.46 | 2.57 | 3 | 603 | 1659 |
| mistral-saba-24b | 5.19 | 5.50 | 2.86 | 2 | 6.93 | 2.36 | 3 | 7.18 | 2.70 | 2 | 255 | 345 |

Building a better translator

While doing this LLM translation quality research, I've noticed that models will often be idiomatic in about 70% of their choices, but fail in at least one part of the sentence. With coherence now a practically "solved" metric and idiomaticity the target, I wanted to see if I could combine the best models to produce a translation greater than the max of its parts.

I built that, and made it open source. It turns out that you can! While its coherence is slightly lower than its peers (more on that in a moment), it is the most idiomatic model while also being far more consistent, with a much lower standard deviation across all three metrics. It works by taking the top 3-4 models for a given language (based on this research), translating with them, then having a judge model (currently GPT-4.1) consider the strengths and weaknesses of each translation and merge them together in an idiomatic way.
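The flow described above can be sketched roughly as follows. The prompts, function names, and `call_model` interface here are illustrative assumptions, not the actual open-source implementation:

```python
# Sketch of the hybrid approach: fan out to the top models for the
# language, then have a judge model merge the candidate translations.
# The prompt wording and the call_model interface are hypothetical.

def build_judge_prompt(source: str, candidates: dict[str, str]) -> str:
    """Prompt asking the judge model to merge several candidates."""
    lines = [f"Source text: {source}", "Candidate translations:"]
    for model, text in candidates.items():
        lines.append(f"- {model}: {text}")
    lines.append(
        "Weigh the strengths and weaknesses of each candidate, then "
        "produce one translation that combines their best, most "
        "idiomatic choices."
    )
    return "\n".join(lines)


def hybrid_translate(source: str, top_models: list[str], call_model,
                     judge: str = "gpt-4.1") -> str:
    """top_models: the 3-4 strongest models for this language pair.
    call_model(model_name, prompt) -> str is whatever LLM client you use."""
    candidates = {
        m: call_model(m, f"Translate this idiomatically: {source}")
        for m in top_models
    }
    return call_model(judge, build_judge_prompt(source, candidates))
```

The candidate calls are independent and can run concurrently, but the judge call is serial on top of the slowest candidate, which is a big part of why the hybrid's latency is so much worse than any single model's.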

Benchmark saturation and going beyond coherence and rating

This benchmark has clearly reached saturation. The differences between models are compressed into a narrow range, and it's reaching the point where increases in idiomaticity are negatively correlated with coherence, because beyond a certain point coherence becomes more of a proxy for how literal the translation is than for how accurate it is.

I was recently discussing this with someone else who's benchmarking LLM translation, and he approached it by directly comparing translations against each other and fitting a Bradley-Terry model, which is conceptually similar to Elo. I think that's the best way forward for future evaluations.
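As a generic illustration of that technique (not his benchmark code): a Bradley-Terry model assigns each system a latent strength p_i such that the probability system i beats system j is p_i / (p_i + p_j), and the strengths can be fitted from a matrix of pairwise preferences with the classic minorization-maximization update.

```python
# Minimal Bradley-Terry fit from pairwise preferences via the standard
# MM (minorization-maximization) update.
# wins[i][j] = number of times system i was preferred over system j.
import math


def bradley_terry(wins, iters=200):
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new.append(total_wins / denom if denom else p[i])
        scale = n / sum(new)
        p = [x * scale for x in new]  # normalize so strengths are comparable
    return p


def to_elo_like(p, base=1500.0):
    """Map strengths onto an Elo-style scale (400 * log10 of strength)."""
    return [base + 400.0 * math.log10(x) for x in p]


# Example: system 0 preferred over system 1 in 8 of 10 comparisons,
# so its fitted strength ends up 4x that of system 1.
strengths = bradley_terry([[0, 8], [2, 0]])
```

The appeal over absolute 1-10 scoring is that judges only ever answer "which of these two is better?", which stays informative even when every translation would score an 8.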

Using hybrid translation in Nuenki

Hybrid translation is far too slow for the low-latency translation Nuenki does as you browse the web, so I'm going to keep it to the translation utility. Paying users also get additional access to the utility, should you find it useful enough to run into the rate limits.