Gemma 3's reasoning improvements don't translate to translations

Scores below are averaged across all languages tested: Chinese, Esperanto, French, German, Hungarian, Italian, Japanese, Spanish, Swedish, and Ukrainian.
| Model | Overall Score | Coherence (Mean / IQR / P90) | Idiomaticity (Mean / IQR / P90) | Accuracy (Mean / IQR / P90) |
|---|---|---|---|---|
| gpt-4o-2024-08-06 | 9.08 | 8.80 / 1 / 8 | 8.09 / 2 / 6 | 8.96 / 1 / 8 |
| claude-3-5-sonnet-20241022 | 8.84 | 8.55 / 1 / 7 | 8.40 / 1 / 7 | 8.94 / 1 / 8 |
| llama-3.3-70b-versatile | 8.59 | 8.46 / 1 / 7 | 7.68 / 2 / 6 | 8.67 / 1 / 7 |
| gemma-3-27b-it | 8.50 | 8.49 / 1 / 7 | 8.25 / 1 / 6 | 8.83 / 1 / 8 |
| gemini-2.0-flash-exp | 8.22 | 8.67 / 1 / 7 | 8.11 / 2 / 6 | 8.91 / 2 / 8 |
| gemma2-9b-it | 7.80 | 7.96 / 2 / 6 | 7.58 / 2 / 4 | 8.39 / 2 / 6 |
| llama-3.1-8b-instant | 6.29 | 6.22 / 6 / 2 | 6.90 / 4 / 4 | 7.66 / 2 / 4 |
| mistral-small-latest | 5.07 | 6.60 / 5 / 2 | 7.05 / 3 / 4 | 7.34 / 3 / 3 |
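
For reference, aggregates like these are straightforward to compute from per-sample judge ratings. The sketch below assumes each model/metric pair yields a list of 0-10 scores (the actual judging pipeline is described in the previous post), and my reading of the P90 column as a plain percentile is an assumption:

```python
import numpy as np

def aggregate(scores: list[float]) -> dict[str, float]:
    """Summarise a list of 0-10 judge ratings for one model/metric.

    Mean: average rating; IQR: spread of the middle 50% of ratings;
    P90: a percentile cut-off of the rating distribution (assumed here).
    """
    s = np.asarray(scores)
    q1, q3 = np.percentile(s, [25, 75])
    return {
        "mean": round(float(s.mean()), 2),
        "iqr": round(float(q3 - q1), 2),
        "p90": round(float(np.percentile(s, 90)), 2),
    }

# Hypothetical ratings for one model on one metric:
print(aggregate([9, 8, 10, 7, 9, 9, 8, 10, 6, 9]))
```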

Nuenki needs to do a lot of translation, quickly (to avoid noticeable latency while browsing) and at high quality - for a language learner, absorbing a model's mistakes can do more harm than good. In a previous blog post I investigated which LLMs were best at translation, and found that Claude was consistently good, with Llama 3.3-70b proving good enough, or better, at some languages.

Gemma 3's release

Gemma2-9b was an interesting model when I last evaluated it: poor at most languages, but surprisingly good at a few, e.g. German, particularly for its size. Gemma 3's benchmark scores look impressive, but with the recent advances in applying RL to models, benchmark ability is becoming increasingly detached from linguistic - particularly multilingual - skill. This is especially true for small models. It's worth noting that the last benchmark used Groq for Gemma inference, while this time I used the official Google API.

[Figure: bar chart of Gemma 3's Chatbot Arena Elo score, showing it slightly behind R1]

Consistent but unremarkable progress

Gemma-3-27b performed better than gemma2-9b. However, the average gap wasn't as large as I expected, particularly considering it's also a much larger model. To its credit, though, it did beat gemini-2.0-flash-exp. There was substantial variability between languages: its improvement in French was small, and it was marginally worse than gemma2-9b at German, but it improved considerably at Hungarian and was superb at Italian and Spanish.

27b's tiny regression at German is particularly interesting. Gemma2-9b was always notably good at German - not just in this quantitative benchmark, but in qualitative feedback from an Austrian friend I'd asked about it. The regression itself is not statistically significant at this sample size, but the lack of improvement is.
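
As a rough illustration of that kind of significance claim, here's how one might compare the two models' per-sentence German scores. This is a sketch with made-up data, using a Mann-Whitney U test as one reasonable choice, not necessarily the test used here:

```python
from scipy.stats import mannwhitneyu

# Hypothetical per-sentence German ratings (0-10 scale) for each model.
gemma2_9b = [9, 8, 9, 10, 8, 9, 7, 9, 8, 9]
gemma3_27b = [8, 9, 8, 9, 9, 8, 8, 9, 7, 9]

# Two-sided test: can we distinguish the two score distributions at all?
stat, p = mannwhitneyu(gemma2_9b, gemma3_27b, alternative="two-sided")
print(f"U={stat}, p={p:.3f}")  # a large p means no detectable difference
```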

LLM progress has been moving away from ever-more, ever-more-optimised pretraining towards reinforcement learning and post-training. While this is good for most tasks, it doesn't seem to generalise to the models' core lexical abilities, which have been improving at a relatively slower pace. Nevertheless, Gemma 3's substantial improvements in select languages mean that I'll look to integrate it into Nuenki once it's added to Groq.
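
Integration itself is straightforward, since Groq exposes an OpenAI-compatible endpoint. Here's a minimal sketch of the kind of translation call involved - the prompt and the use of llama-3.3-70b-versatile (from the table above) as a stand-in until a Gemma 3 model id is available are illustrative, not Nuenki's actual implementation:

```python
import os
from openai import OpenAI

# Groq's API is OpenAI-compatible; only the base URL and key differ.
client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",
)

def translate(text: str, target_language: str,
              model: str = "llama-3.3-70b-versatile") -> str:
    """Translate a sentence, returning only the translation."""
    response = client.chat.completions.create(
        model=model,  # swap in a Gemma 3 id once Groq hosts one
        temperature=0.2,  # low temperature for stable translations
        messages=[
            {"role": "system",
             "content": f"Translate the user's text into {target_language}. "
                        "Reply with the translation only."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content.strip()

print(translate("The weather is lovely today.", "German"))
```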

Future changes and methodology improvements

This quick test has a number of issues - I'm not controlling for the inference provider or for model size. That doesn't compromise the core point: Groq optimises its hosted models for speed while Google optimises for quality, and gemma-3-27b is the larger model, so both factors should favour Gemma 3 - which only reinforces the conclusion that the performance gap isn't as large as I'd hoped for. Still, it isn't particularly scientific, and that's something I'd like to change by building a proper translation benchmark with larger sample sizes, better (crowdsourced?) evaluation, and a more rigorous, reproducible approach. If you're interested in the methodology I used, my previous blog post goes into much more detail.

Conclusion

While Gemma 3's translation performance is a little disappointing relative to its benchmarks, it has made progress in some areas and is more useful than its predecessors. Also, while we're here: if you're interested in language learning and spend a lot of time on your computer, try out Nuenki! It turns everyday browsing - reading blog posts, scrolling through HN, Reddit, or Twitter - into constant, consistent language immersion.