Llama 4 performs worse than Llama 3 at translation

(The table below aggregates scores across all languages; the benchmark also has per-language breakdowns for Chinese, Esperanto, French, German, Hungarian, Italian, Japanese, Korean, Spanish, Swedish, Ukrainian, and Vietnamese.)
| Model | Overall Score | Coherence (Mean / IQR / P90) | Idiomaticity (Mean / IQR / P90) | Accuracy (Mean / IQR / P90) | Latency (ms) Median / P90 |
|---|---|---|---|---|---|
| quasar-alpha | 9.12 | 8.66 / 1 / 7 | 8.37 / 1 / 7 | 8.98 / 1 / 8 | 832 / 1280 |
| gpt-4o-2024-08-06 | 8.91 | 8.75 / 1 / 7 | 8.05 / 2 / 6 | 8.90 / 1 / 8 | 803 / 1288 |
| deepl | 8.81 | 8.59 / 1 / 7 | 8.07 / 2 / 6 | 8.67 / 0 / 7 | 226 / 322 |
| claude-3-5-sonnet-20241022 | 8.77 | 8.53 / 1 / 7 | 8.38 / 1 / 7 | 8.91 / 0 / 8 | 1387 / 3112 |
| gemma-3-27b-it | 8.59 | 8.48 / 1 / 7 | 8.31 / 1 / 7 | 8.92 / 1 / 8 | 1115 / 1605 |
| llama-3.3-70b-versatile | 8.54 | 8.36 / 1 / 7 | 7.72 / 2 / 6 | 8.64 / 1 / 7 | 334 / 1687 |
| lingvanex | 8.46 | 8.69 / 2 / 7 | 7.39 / 3 / 4 | 8.52 / 1 / 6 | 206 / 270 |
| Llama 4 Scout | 8.35 | 8.58 / 1 / 7 | 7.79 / 2 / 6 | 8.75 / 1 / 7 | 246 / 331 |
| gemini-2.0-flash-exp | 8.26 | 8.66 / 1 / 7 | 8.24 / 1 / 6 | 9.00 / 1 / 8 | 514 / 702 |
| gemma2-9b-it | 7.91 | 7.87 / 2 / 6 | 7.71 / 2 / 5 | 8.44 / 2 / 7 | 407 / 489 |
| qwen-2.5-32b | 6.74 | 8.23 / 1 / 6 | 7.00 / 3 / 4 | 8.02 / 2 / 5 | 326 / 472 |
| llama-3.1-8b-instant | 6.12 | 6.03 / 6 / 2 | 6.84 / 3 / 3 | 7.59 / 2 / 4 | 266 / 818 |
| mistral-small-latest | 5.25 | 6.54 / 5 / 2 | 7.17 / 3 / 4 | 7.46 / 2 / 3 | 603 / 1659 |
| mistral-saba-24b | 4.92 | 5.50 / 5 / 2 | 6.93 / 3 / 3 | 7.18 / 3 / 2 | 255 / 345 |

Llama 4

Last night, Meta released the Llama 4 series of models. Unlike previous Llama releases, Llama 4 uses a mixture-of-experts architecture. The largest model, Behemoth, uses 288B active parameters out of 2T total parameters, while Maverick uses 17B out of 400B, and Scout uses 17B out of 109B.
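To make the active-vs-total distinction concrete, here's a minimal sketch of top-k expert routing. Everything in it - the layer sizes, the number of experts, the value of k, and the router itself - is an illustrative placeholder, not Meta's actual architecture.

```python
# Minimal sketch of top-k expert routing in a mixture-of-experts layer.
# Sizes and routing are illustrative placeholders, not Llama 4's real design.
import numpy as np

rng = np.random.default_rng(0)

d_model = 64        # hidden size (illustrative)
n_experts = 16      # total experts held in memory
top_k = 2           # experts actually run for each token

# Every expert counts towards *total* parameters, but only the top_k chosen
# per token count towards *active* parameters.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route a single token vector x through its top_k experts."""
    logits = x @ router                      # score every expert
    chosen = np.argsort(logits)[-top_k:]     # keep only the best-scoring experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                 # softmax over the chosen experts
    # FLOPs per token scale with top_k, not with n_experts.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)  # (64,)
```

The point is simply that all of the experts must be held in memory, but only the routed-to experts contribute compute per token, which is why MoE models can be fast at inference despite enormous total parameter counts.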

A mixture-of-experts architecture could, in theory, be a good fit for my use case of low-latency translation, thanks to its improved inference speed. Unfortunately, in my testing that speed advantage is overshadowed by atrocious translation performance. The Scout model is worse than Llama-3.3-70B despite having substantially more total parameters. Perhaps mixture of experts doesn't work well for language translation, but even so, Scout is worse than Gemma-3-27b, which has a third of its total parameters!

Not quite as bad at low-resource languages and coherence

Meta advertised multilingual support for over 200 languages, which implies it might be using the dataset behind Facebook's nllb-200 low-resource translation models. It's not quite as bad as its peers at Esperanto - notably falling short on idiomaticity, but not coherence - perhaps as a result of more training data but a smaller parameter count. This is similar to what I observed, and speculated upon, with Lingvanex and DeepL. It also beats Llama-3.3-70b at Hungarian, while being even worse relative to its peers at relatively high-resource languages like German.

Methodology

I went over it in more detail in the first post, but effectively (a rough code sketch of the pipeline follows this list):

  • Coherence is calculated by translating from English -> target language -> English 3 times, then having 3 blinded LLMs rate how close the original text is to the new text.
  • Idiomaticity is calculated by translating from English to the target language, then having 3 other (blinded) LLMs rate how idiomatic the translation is.
  • Accuracy is calculated by translating from English to the target language, then having 3 other (blinded) LLMs rate how accurate the translation is.
  • This is repeated across a sample of ~40 sentences, and the overall score is weighted towards coherence and interquartile range as the most reliable metrics.
  • Inference is done using Groq for open models, the Anthropic API, Openrouter (for Quasar Alpha), the DeepL API, the Lingvanex API, and the OpenAI API.
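For concreteness, here's a rough sketch of that evaluation loop. The translate() and judge() callables stand in for the real model/API calls, and the weights in overall_score() are hypothetical - the post only says the score is weighted towards coherence and interquartile range, not how.

```python
# Rough sketch of the benchmark loop described above. translate() and judge()
# are stand-ins for real model/API calls; the weights are hypothetical.
from statistics import mean
from typing import Callable

Translator = Callable[[str, str, str], str]   # (text, source_lang, target_lang) -> text
Judge = Callable[[str, str, str], float]      # (criterion, original, candidate) -> 0-10 score

def coherence(sentence: str, lang: str, translate: Translator, judges: list[Judge]) -> float:
    """English -> target -> English round trip, repeated 3 times and scored blind."""
    scores = []
    for _ in range(3):
        roundtrip = translate(translate(sentence, "en", lang), lang, "en")
        scores += [judge("closeness", sentence, roundtrip) for judge in judges]
    return mean(scores)

def one_way(criterion: str, sentence: str, lang: str,
            translate: Translator, judges: list[Judge]) -> float:
    """Idiomaticity / accuracy: one English -> target translation, scored blind."""
    translated = translate(sentence, "en", lang)
    return mean(judge(criterion, sentence, translated) for judge in judges)

def overall_score(sentences: list[str], lang: str,
                  translate: Translator, judges: list[Judge]) -> float:
    per_sentence = []
    for s in sentences:  # ~40 sentences in the real benchmark
        c = coherence(s, lang, translate, judges)
        i = one_way("idiomaticity", s, lang, translate, judges)
        a = one_way("accuracy", s, lang, translate, judges)
        # Hypothetical weighting, biased towards coherence as described above.
        per_sentence.append(0.5 * c + 0.25 * i + 0.25 * a)
    return mean(per_sentence)
```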

Conclusion

The Llama 4 release has been disappointing in many respects - these are massive models that underperform smaller ones across multiple dimensions - and language translation appears to be a particular weak spot. I'm unlikely to use it in Nuenki, as it seems to be a downgrade from Llama 3.3 in every respect, and Quasar Alpha appears to beat it in every way, though we don't yet know how large Quasar Alpha is or whether it can be hosted on Groq. While Llama 4's translation speed via Groq is pretty good, translation quality is really important when you're using translations to learn and immerse yourself, and there are simply better alternatives.