Llama 4 performs worse than Llama 3 at translation

(The table below aggregates scores across all languages; the benchmark also has per-language breakdowns for Chinese, Esperanto, French, German, Hungarian, Italian, Japanese, Korean, Spanish, Swedish, Ukrainian, and Vietnamese.)
| Model | Overall Score | Coherence (Mean / IQR / P90) | Idiomaticity (Mean / IQR / P90) | Accuracy (Mean / IQR / P90) | Latency (ms) Median / P90 |
|---|---|---|---|---|---|
| quasar-alpha | 9.12 | 8.66 / 1 / 7 | 8.37 / 1 / 7 | 8.98 / 1 / 8 | 832 / 1280 |
| gpt-4o-2024-08-06 | 8.91 | 8.75 / 1 / 7 | 8.05 / 2 / 6 | 8.90 / 1 / 8 | 803 / 1288 |
| deepl | 8.81 | 8.59 / 1 / 7 | 8.07 / 2 / 6 | 8.67 / 0 / 7 | 226 / 322 |
| claude-3-5-sonnet-20241022 | 8.77 | 8.53 / 1 / 7 | 8.38 / 1 / 7 | 8.91 / 0 / 8 | 1387 / 3112 |
| gemma-3-27b-it | 8.59 | 8.48 / 1 / 7 | 8.31 / 1 / 7 | 8.92 / 1 / 8 | 1115 / 1605 |
| llama-3.3-70b-versatile | 8.54 | 8.36 / 1 / 7 | 7.72 / 2 / 6 | 8.64 / 1 / 7 | 334 / 1687 |
| lingvanex | 8.46 | 8.69 / 2 / 7 | 7.39 / 3 / 4 | 8.52 / 1 / 6 | 206 / 270 |
| Llama 4 Scout | 8.35 | 8.58 / 1 / 7 | 7.79 / 2 / 6 | 8.75 / 1 / 7 | 246 / 331 |
| gemini-2.0-flash-exp | 8.26 | 8.66 / 1 / 7 | 8.24 / 1 / 6 | 9.00 / 1 / 8 | 514 / 702 |
| gemma2-9b-it | 7.91 | 7.87 / 2 / 6 | 7.71 / 2 / 5 | 8.44 / 2 / 7 | 407 / 489 |
| qwen-2.5-32b | 6.74 | 8.23 / 1 / 6 | 7.00 / 3 / 4 | 8.02 / 2 / 5 | 326 / 472 |
| llama-3.1-8b-instant | 6.12 | 6.03 / 6 / 2 | 6.84 / 3 / 3 | 7.59 / 2 / 4 | 266 / 818 |
| mistral-small-latest | 5.25 | 6.54 / 5 / 2 | 7.17 / 3 / 4 | 7.46 / 2 / 3 | 603 / 1659 |
| mistral-saba-24b | 4.92 | 5.50 / 5 / 2 | 6.93 / 3 / 3 | 7.18 / 3 / 2 | 255 / 345 |

Llama 4

Last night, Meta released the Llama 4 series of models. Unlike previous Llama releases, Llama 4 uses a mixture-of-experts architecture. The largest model, Behemoth, uses 288B active parameters out of 2T total parameters, while Maverick uses 17B out of 400B, and Scout uses 17B out of 109B.
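To make the active-vs-total distinction concrete, here's a minimal sketch of top-k expert routing. Everything in it - the layer sizes, the number of experts, the value of k, and the router itself - is an illustrative placeholder, not Meta's actual architecture.

```python
# Minimal sketch of top-k expert routing in a mixture-of-experts layer.
# Sizes and routing are illustrative placeholders, not Llama 4's real design.
import numpy as np

rng = np.random.default_rng(0)

d_model = 64        # hidden size (illustrative)
n_experts = 16      # total experts held in memory
top_k = 2           # experts actually run for each token

# Every expert counts towards *total* parameters, but only the top_k chosen
# per token count towards *active* parameters.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route a single token vector x through its top_k experts."""
    logits = x @ router                      # score every expert
    chosen = np.argsort(logits)[-top_k:]     # keep only the best-scoring experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                 # softmax over the chosen experts
    # FLOPs per token scale with top_k, not with n_experts.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)  # (64,)
```

The point is simply that all of the experts must be held in memory, but only the routed-to experts contribute compute per token, which is why MoE models can be fast at inference despite enormous total parameter counts.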

A mixture-of-experts architecture could, in theory, be a good fit for my use case of low-latency translation, thanks to its improved inference speed. Unfortunately, in my testing that speed advantage is overshadowed by atrocious translation performance. The Scout model is worse than Llama-3.3-70B despite having substantially more total parameters. Perhaps mixture of experts doesn't work well for language translation, but even so, Scout is worse than Gemma-3-27b, which has a third of its total parameters!

Not quite as bad at low-resource languages and coherence

Meta advertised multilingual support for over 200 languages, which implies it might be using the dataset behind Facebook's nllb-200 low-resource translation models. It's not quite as bad as its peers at Esperanto - notably falling short on idiomaticity, but not coherence - perhaps as a result of more training data but a smaller parameter count. This is similar to what I observed, and speculated upon, with Lingvanex and DeepL. It also beats Llama-3.3-70b at Hungarian, while being even worse relative to its peers at relatively high-resource languages like German.

Methodology

I went over it in more detail in the first post, but effectively (a rough code sketch of the pipeline follows this list):

  • Coherence is calculated by translating from English -> target language -> English 3 times, then having 3 blinded LLMs rate how close the original text is to the new text.
  • Idiomaticity is calculated by translating from English to the target language, then having 3 other (blinded) LLMs rate how idiomatic the translation is.
  • Accuracy is calculated by translating from English to the target language, then having 3 other (blinded) LLMs rate how accurate the translation is.
  • This is repeated across a sample of ~40 sentences, and the overall score is weighted towards coherence and interquartile range as the most reliable metrics.
  • Inference is done using Groq for open models, the Anthropic API, Openrouter (for Quasar Alpha), the DeepL API, the Lingvanex API, and the OpenAI API.
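For concreteness, here's a rough sketch of that evaluation loop. The translate() and judge() callables stand in for the real model/API calls, and the weights in overall_score() are hypothetical - the post only says the score is weighted towards coherence and interquartile range, not how.

```python
# Rough sketch of the benchmark loop described above. translate() and judge()
# are stand-ins for real model/API calls; the weights are hypothetical.
from statistics import mean
from typing import Callable

Translator = Callable[[str, str, str], str]   # (text, source_lang, target_lang) -> text
Judge = Callable[[str, str, str], float]      # (criterion, original, candidate) -> 0-10 score

def coherence(sentence: str, lang: str, translate: Translator, judges: list[Judge]) -> float:
    """English -> target -> English round trip, repeated 3 times and scored blind."""
    scores = []
    for _ in range(3):
        roundtrip = translate(translate(sentence, "en", lang), lang, "en")
        scores += [judge("closeness", sentence, roundtrip) for judge in judges]
    return mean(scores)

def one_way(criterion: str, sentence: str, lang: str,
            translate: Translator, judges: list[Judge]) -> float:
    """Idiomaticity / accuracy: one English -> target translation, scored blind."""
    translated = translate(sentence, "en", lang)
    return mean(judge(criterion, sentence, translated) for judge in judges)

def overall_score(sentences: list[str], lang: str,
                  translate: Translator, judges: list[Judge]) -> float:
    per_sentence = []
    for s in sentences:  # ~40 sentences in the real benchmark
        c = coherence(s, lang, translate, judges)
        i = one_way("idiomaticity", s, lang, translate, judges)
        a = one_way("accuracy", s, lang, translate, judges)
        # Hypothetical weighting, biased towards coherence as described above.
        per_sentence.append(0.5 * c + 0.25 * i + 0.25 * a)
    return mean(per_sentence)
```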

Conclusion

The Llama 4 release has been disappointing in many respects - these are massive models that underperform smaller ones across multiple dimensions - and language translation appears to be a particular weak spot. I'm unlikely to use it in Nuenki, as it seems to be a downgrade from Llama 3.3 in every respect, and Quasar Alpha appears to beat it in every way, though we don't yet know how large Quasar Alpha is or whether it can be hosted on Groq. While Llama 4's translation speed via Groq is pretty good, translation quality is really important when you're using translations to learn and immerse yourself, and there are simply better alternatives.