Model | Overall Score | Coherence Mean | Coherence Std Dev | Coherence P90 | Idiomaticity Mean | Idiomaticity Std Dev | Idiomaticity P90 | Accuracy Mean | Accuracy Std Dev | Accuracy P90 | Latency Median (ms) | Latency P90 (ms)
---|---|---|---|---|---|---|---|---|---|---|---|---
optimus-alpha | 8.32 | 8.73 | 1.14 | 8 | 8.40 | 1.43 | 7 | 8.97 | 1.03 | 8 | 1182 | 2581 |
openai/gpt-4.1 | 8.27 | 8.73 | 1.25 | 8 | 8.30 | 1.52 | 7 | 8.95 | 1.08 | 8 | 1285 | 2035 |
quasar-alpha | 8.26 | 8.66 | 1.32 | 7 | 8.37 | 1.47 | 7 | 8.98 | 0.98 | 8 | 832 | 1280 |
gpt-4o-2024-08-06 | 8.15 | 8.75 | 1.05 | 7 | 8.05 | 1.63 | 6 | 8.90 | 1.08 | 8 | 803 | 1288 |
x-ai/grok-3-beta | 8.14 | 8.64 | 1.43 | 7 | 8.19 | 1.55 | 6 | 8.88 | 1.13 | 8 | 1398 | 6440 |
claude-3-5-sonnet-20241022 | 8.10 | 8.53 | 1.13 | 7 | 8.38 | 1.37 | 7 | 8.91 | 0.97 | 8 | 1387 | 3112 |
openai/gpt-4.1-mini | 8.09 | 8.66 | 1.35 | 7 | 8.07 | 1.59 | 6 | 8.87 | 1.03 | 7 | 1083 | 1594 |
deepl | 8.02 | 8.59 | 1.45 | 7 | 8.07 | 1.70 | 6 | 8.67 | 1.51 | 7 | 226 | 322 |
gemma-3-27b-it | 7.95 | 8.48 | 1.36 | 7 | 8.31 | 1.41 | 7 | 8.92 | 1.08 | 8 | 1115 | 1605 |
gemini-2.0-flash-exp | 7.86 | 8.66 | 1.35 | 7 | 8.24 | 1.56 | 6 | 9.00 | 0.91 | 8 | 514 | 702 |
lingvanex | 7.76 | 8.69 | 1.42 | 7 | 7.39 | 2.07 | 4 | 8.52 | 1.54 | 6 | 206 | 270 |
Llama 4 Scout | 7.76 | 8.58 | 1.37 | 7 | 7.79 | 1.70 | 6 | 8.75 | 1.18 | 7 | 246 | 331
llama-3.3-70b-versatile | 7.73 | 8.36 | 1.57 | 7 | 7.72 | 1.77 | 6 | 8.64 | 1.35 | 7 | 334 | 1687 |
openai/gpt-4.1-nano | 7.68 | 8.25 | 1.57 | 6 | 7.74 | 1.84 | 6 | 8.62 | 1.34 | 7 | 826 | 1294 |
gemma2-9b-it | 7.32 | 7.87 | 1.94 | 6 | 7.71 | 1.94 | 5 | 8.44 | 1.58 | 7 | 407 | 489 |
qwen-2.5-32b | 6.76 | 8.23 | 1.71 | 6 | 7.00 | 2.09 | 4 | 8.02 | 1.96 | 5 | 326 | 472 |
llama-3.1-8b-instant | 5.88 | 6.03 | 2.89 | 2 | 6.84 | 2.31 | 3 | 7.59 | 2.19 | 4 | 266 | 818 |
mistral-small-latest | 5.69 | 6.54 | 2.74 | 2 | 7.17 | 2.23 | 4 | 7.46 | 2.57 | 3 | 603 | 1659 |
mistral-saba-24b | 5.19 | 5.50 | 2.86 | 2 | 6.93 | 2.36 | 3 | 7.18 | 2.70 | 2 | 255 | 345 |
## GPT-4.1 and Quasar Alpha
OpenAI recently released GPT-4.1, alongside "Mini" and "Nano" variants. GPT-4.1 had previously been teased as the anonymous Quasar Alpha and Optimus Alpha models on OpenRouter. I evaluated them at the time and noted how good they were at translation. Now that it's fully released and I've evaluated all three versions, I'm not quite sure what the difference between Quasar and Optimus is: both score very close to the full GPT-4.1, and nowhere near Mini and Nano.
Either way, the data shows that the full GPT-4.1 remains excellent at translation, while the smaller variants are decent but nothing special. Mini is only slightly worse than Sonnet 3.5, which is what Nuenki currently uses to translate sentences as you browse. As always, there's a lot of variance between languages, so I encourage you to go through the per-language results and see.
## Hybrid Translation and Grok
I've been working on an open-source translation system that combines several models to produce a translation that will hopefully beat any single model. It's still a work in progress - I'll probably release a demo tomorrow. Anyway, I decided to evaluate Grok for that system, though I wouldn't consider using it in the Nuenki extension for privacy reasons.
It's actually rather good at translation, as you can see from the data - I'm pleasantly surprised.
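As a rough illustration of the shape of that hybrid system, here's a minimal sketch: the model list, `translate_with()`, and `pick_best()` are placeholders invented for the example, not the actual implementation.

```python
# Rough sketch of a hybrid translation step: query several models in
# parallel, then let a judge pick (or merge) the best candidate.
# translate_with() and pick_best() are illustrative stubs, not real APIs.

from concurrent.futures import ThreadPoolExecutor

CANDIDATE_MODELS = ["openai/gpt-4.1", "x-ai/grok-3-beta", "deepl"]


def translate_with(model: str, text: str, target_lang: str) -> str:
    """Send the text to one provider/model and return its translation (stub)."""
    raise NotImplementedError


def pick_best(text: str, target_lang: str, candidates: dict[str, str]) -> str:
    """Choose (or synthesise) a final translation from the candidates (stub)."""
    raise NotImplementedError


def hybrid_translate(text: str, target_lang: str) -> str:
    # Ask every candidate model in parallel, then judge between them.
    with ThreadPoolExecutor() as pool:
        futures = {
            model: pool.submit(translate_with, model, text, target_lang)
            for model in CANDIDATE_MODELS
        }
        candidates = {model: fut.result() for model, fut in futures.items()}
    return pick_best(text, target_lang, candidates)
```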
## Methodology
I went over it in more detail in the first post, but effectively:
- Coherence is calculated by translating from English -> target language -> English 3 times, then having 3 blinded LLMs rate how close the original text is to the new text.
- Idiomaticity is calculated by translating from English to the target language, then having 3 other (blinded) LLMs rate how idiomatic the translation is.
- Accuracy is calculated by translating from English to the target language, then having 3 other (blinded) LLMs rate how accurate the translation is.
- This is repeated across a sample of ~40 sentences, and the overall score is a weighted combination of the metrics. I changed the weighting algorithm to use the standard deviation rather than the IQR, and adjusted the ratios to put greater emphasis on idiomaticity and slightly less on refusal rate. A simplified sketch of the pipeline follows.
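This is a simplified sketch rather than the actual benchmark code: `translate()` and `judge()` are stand-ins for the real API calls, and the weights in `overall()` are illustrative guesses rather than the real values.

```python
# Simplified sketch of the evaluation pipeline. translate() and judge() are
# stand-ins for real API calls; the prompts and weights are illustrative
# rather than the benchmark's exact implementation.

import statistics
from typing import Callable

Translate = Callable[[str, str], str]            # (text, target_lang) -> translation
Judge = Callable[[str, str, str], list[float]]   # (metric, original, candidate) -> scores from 3 blinded LLMs


def score_sentence(sentence: str, target_lang: str,
                   translate: Translate, judge: Judge) -> dict[str, list[float]]:
    # Coherence: English -> target -> English three times; judges rate how
    # close the round-tripped text is to the original.
    coherence = []
    for _ in range(3):
        back = translate(translate(sentence, target_lang), "English")
        coherence.extend(judge("coherence", sentence, back))

    # Idiomaticity and accuracy: judged directly on a forward translation.
    forward = translate(sentence, target_lang)
    idiomaticity = judge("idiomaticity", sentence, forward)
    accuracy = judge("accuracy", sentence, forward)

    return {"coherence": coherence, "idiomaticity": idiomaticity, "accuracy": accuracy}


def summarise(scores: dict[str, list[float]]) -> dict[str, dict[str, float]]:
    """Collapse judge scores across the ~40 sentences into Mean / Std Dev columns."""
    return {
        metric: {
            "mean": statistics.mean(values),
            "std_dev": statistics.stdev(values),
        }
        for metric, values in scores.items()
    }


def overall(summary: dict[str, dict[str, float]],
            weights: dict[str, float]) -> float:
    # Weighted combination of metric means, e.g. weights like
    # {"coherence": 0.3, "idiomaticity": 0.45, "accuracy": 0.25} (illustrative only).
    # The real benchmark also folds in standard deviation and refusal rate,
    # which this sketch omits.
    return sum(weights[metric] * stats["mean"] for metric, stats in summary.items())
```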
## Future of Nuenki Translation
I intend to integrate GPT-4.1 as a translation source in the next few days. Right now Nuenki uses a mixture of DeepL, Llama 3.3, and Sonnet depending on the circumstances. Sonnet is quite an old model now, so GPT-4.1 will replace it for some languages, though Sonnet will remain as a fallback in case of downtime. It's worth noting that the overall score heavily penalises refusals; looking at translation performance alone, Sonnet is still quite a good model.
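For illustration, per-language routing with a fallback might look something like this. The language assignments and `call_model()` are invented for the example; they aren't Nuenki's actual configuration or API.

```python
# Illustrative sketch of per-language model routing with a fallback.
# The assignments below are hypothetical, not the real configuration.

PRIMARY_MODEL = {
    "German": "openai/gpt-4.1",
    "French": "openai/gpt-4.1",
    "Japanese": "claude-3-5-sonnet-20241022",
}
DEFAULT_MODEL = "deepl"
FALLBACK_MODEL = "claude-3-5-sonnet-20241022"


def call_model(model: str, text: str, target_lang: str) -> str:
    """Send the translation request to the given provider (stub)."""
    raise NotImplementedError


def translate_sentence(text: str, target_lang: str) -> str:
    primary = PRIMARY_MODEL.get(target_lang, DEFAULT_MODEL)
    try:
        return call_model(primary, text, target_lang)
    except Exception:
        # If the primary provider errors or is down, fall back to Sonnet.
        return call_model(FALLBACK_MODEL, text, target_lang)
```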