Model | Overall Score | Coherence Mean | Coherence Std Dev | Coherence P90 | Idiomaticity Mean | Idiomaticity Std Dev | Idiomaticity P90 | Accuracy Mean | Accuracy Std Dev | Accuracy P90 | Latency Median (ms) | Latency P90 (ms)
---|---|---|---|---|---|---|---|---|---|---|---|---
NuenkiHybrid | 8.41 | 8.69 | 0.84 | 8 | 8.44 | 1.03 | 7 | 9.01 | 0.71 | 8 | 0 | 0 |
optimus-alpha | 8.32 | 8.73 | 1.14 | 8 | 8.40 | 1.43 | 7 | 8.97 | 1.03 | 8 | 1182 | 2581 |
openai/gpt-4.1 | 8.27 | 8.73 | 1.25 | 8 | 8.30 | 1.52 | 7 | 8.95 | 1.08 | 8 | 1285 | 2035 |
quasar-alpha | 8.26 | 8.66 | 1.32 | 7 | 8.37 | 1.47 | 7 | 8.98 | 0.98 | 8 | 832 | 1280 |
gpt-4o-2024-08-06 | 8.15 | 8.75 | 1.05 | 7 | 8.05 | 1.63 | 6 | 8.90 | 1.08 | 8 | 803 | 1288 |
x-ai/grok-3-beta | 8.14 | 8.64 | 1.43 | 7 | 8.19 | 1.55 | 6 | 8.88 | 1.13 | 8 | 1398 | 6440 |
claude-3-5-sonnet-20241022 | 8.10 | 8.53 | 1.13 | 7 | 8.38 | 1.37 | 7 | 8.91 | 0.97 | 8 | 1387 | 3112 |
openai/gpt-4.1-mini | 8.09 | 8.66 | 1.35 | 7 | 8.07 | 1.59 | 6 | 8.87 | 1.03 | 7 | 1083 | 1594 |
gemini-2.5-flash-preview-04-17 | 8.07 | 8.57 | 1.58 | 7 | 8.19 | 1.62 | 6 | 8.92 | 1.00 | 8 | 4349 | 7869 |
deepl | 8.02 | 8.59 | 1.45 | 7 | 8.07 | 1.70 | 6 | 8.67 | 1.51 | 7 | 226 | 322 |
gemma-3-27b-it | 7.95 | 8.48 | 1.36 | 7 | 8.31 | 1.41 | 7 | 8.92 | 1.08 | 8 | 1115 | 1605 |
gemini-2.0-flash-exp | 7.86 | 8.66 | 1.35 | 7 | 8.24 | 1.56 | 6 | 9.00 | 0.91 | 8 | 514 | 702 |
lingvanex | 7.76 | 8.69 | 1.42 | 7 | 7.39 | 2.07 | 4 | 8.52 | 1.54 | 6 | 206 | 270 |
Llama 4 Scout | 7.76 | 8.58 | 1.37 | 7 | 7.79 | 1.70 | 6 | 8.75 | 1.18 | 7 | 246 | 331
llama-3.3-70b-versatile | 7.73 | 8.36 | 1.57 | 7 | 7.72 | 1.77 | 6 | 8.64 | 1.35 | 7 | 334 | 1687 |
openai/gpt-4.1-nano | 7.68 | 8.25 | 1.57 | 6 | 7.74 | 1.84 | 6 | 8.62 | 1.34 | 7 | 826 | 1294 |
gemma2-9b-it | 7.32 | 7.87 | 1.94 | 6 | 7.71 | 1.94 | 5 | 8.44 | 1.58 | 7 | 407 | 489 |
qwen-2.5-32b | 6.76 | 8.23 | 1.71 | 6 | 7.00 | 2.09 | 4 | 8.02 | 1.96 | 5 | 326 | 472 |
llama-3.1-8b-instant | 5.88 | 6.03 | 2.89 | 2 | 6.84 | 2.31 | 3 | 7.59 | 2.19 | 4 | 266 | 818 |
mistral-small-latest | 5.69 | 6.54 | 2.74 | 2 | 7.17 | 2.23 | 4 | 7.46 | 2.57 | 3 | 603 | 1659 |
mistral-saba-24b | 5.19 | 5.50 | 2.86 | 2 | 6.93 | 2.36 | 3 | 7.18 | 2.70 | 2 | 255 | 345 |
Building a better translator
While doing this LLM translation quality research, I've noticed that models are often idiomatic in about 70% of their choices, but fail in at least one part of the sentence. With coherence now practically a "solved" metric and idiomaticity the real target, I wanted to see whether I could combine the best models to produce a translation greater than the max of its parts.
I built that, and made it open source. It turns out that you can! While its coherence is slightly lower than its peers' (more on that in a moment), it is the most idiomatic model, and it is also far more consistent, with a much lower standard deviation across all three metrics. It works by taking the top 3-4 models for a given language (based on this research), translating with each of them, then having a judge model (currently GPT-4.1) consider the strengths and weaknesses of each translation and merge them into an idiomatic whole.
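The fan-out-then-judge pattern above can be sketched in a few lines. This is a minimal illustration, not Nuenki's actual implementation: the prompt wording and the `call_llm` helper are placeholders for whatever API client you use.

```python
def hybrid_translate(text, target_lang, candidate_models, judge_model, call_llm):
    """Translate with several models, then have a judge merge the results.

    `call_llm(model, prompt)` is assumed to return the model's text response.
    """
    # 1. Fan out: get a candidate translation from each top model.
    candidates = {
        m: call_llm(m, f"Translate into {target_lang}: {text}")
        for m in candidate_models
    }
    # 2. Judge: ask a strong model to weigh each candidate and merge them.
    listing = "\n".join(f"[{m}] {t}" for m, t in candidates.items())
    prompt = (
        f"Here are candidate translations of '{text}' into {target_lang}:\n"
        f"{listing}\n"
        "Consider the strengths and weaknesses of each, then produce a "
        "single merged translation that is maximally idiomatic."
    )
    return call_llm(judge_model, prompt)
```

The fan-out calls are independent, so in practice they would be issued concurrently; the judge call is the only serial step, which is why the hybrid's latency tracks the judge model rather than the sum of its parts.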
Benchmark saturation and going beyond coherence ratings
This benchmark has clearly reached saturation. The differences between models sit within a narrow range, and we're reaching the point where gains in idiomaticity are negatively correlated with coherence: beyond a certain point, coherence becomes a proxy for how literal the translation is rather than how accurate it is.
I was recently discussing this with someone else who benchmarks LLM translation. He approached it by directly comparing translations against each other and applying a Bradley-Terry model, which is conceptually similar to Elo. I think that's the best way forward for future evaluations.
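For the curious, a Bradley-Terry model can be fit from pairwise win counts with the standard minorization-maximization update. This is a generic sketch of the technique, not the other benchmarker's actual code:

```python
from collections import defaultdict

def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from pairwise win counts.

    wins[(a, b)] = number of times translation a beat translation b.
    Returns a dict of strengths normalized to sum to 1; under the model,
    P(a beats b) = p[a] / (p[a] + p[b]).
    """
    items = sorted({x for pair in wins for x in pair})
    p = {i: 1.0 for i in items}
    total = defaultdict(float)      # n_ij: comparisons between i and j
    win_count = defaultdict(float)  # W_i: total wins of i
    for (a, b), n in wins.items():
        total[(a, b)] += n
        total[(b, a)] += n
        win_count[a] += n
    for _ in range(iters):
        # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
        new_p = {}
        for i in items:
            denom = sum(
                total[(i, j)] / (p[i] + p[j])
                for j in items
                if j != i and total[(i, j)] > 0
            )
            new_p[i] = win_count[i] / denom if denom else p[i]
        s = sum(new_p.values())
        p = {i: v / s for i, v in new_p.items()}
    return p
```

Unlike mean scores on a 1-10 scale, the fitted strengths only encode relative preference, so two near-saturated models that still differ on head-to-head comparisons stay distinguishable.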
Using hybrid translation in Nuenki
Hybrid translation is far too slow for the low-latency translation Nuenki does as you browse the web, so I'm going to keep it to the translation utility. Paying users get additional access to the utility, should you find it useful enough that the rate limits become a problem.