Gemma 3's reasoning improvements don't translate to translations

Scores below are averaged across all languages tested: Chinese, Esperanto, French, German, Hungarian, Italian, Japanese, Spanish, Swedish, and Ukrainian.
| Model | Overall Score | Coherence (Mean / IQR / P90) | Idiomaticity (Mean / IQR / P90) | Accuracy (Mean / IQR / P90) |
|---|---|---|---|---|
| gpt-4o-2024-08-06 | 9.08 | 8.80 / 1 / 8 | 8.09 / 2 / 6 | 8.96 / 1 / 8 |
| claude-3-5-sonnet-20241022 | 8.84 | 8.55 / 1 / 7 | 8.40 / 1 / 7 | 8.94 / 1 / 8 |
| llama-3.3-70b-versatile | 8.59 | 8.46 / 1 / 7 | 7.68 / 2 / 6 | 8.67 / 1 / 7 |
| gemma-3-27b-it | 8.50 | 8.49 / 1 / 7 | 8.25 / 1 / 6 | 8.83 / 1 / 8 |
| gemini-2.0-flash-exp | 8.22 | 8.67 / 1 / 7 | 8.11 / 2 / 6 | 8.91 / 2 / 8 |
| gemma2-9b-it | 7.80 | 7.96 / 2 / 6 | 7.58 / 2 / 4 | 8.39 / 2 / 6 |
| llama-3.1-8b-instant | 6.29 | 6.22 / 6 / 2 | 6.90 / 4 / 4 | 7.66 / 2 / 4 |
| mistral-small-latest | 5.07 | 6.60 / 5 / 2 | 7.05 / 3 / 4 | 7.34 / 3 / 3 |
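
For reference, aggregates like these are straightforward to compute from per-sample judge ratings. The sketch below assumes each model/metric pair yields a list of 0-10 scores (the actual judging pipeline is described in the previous post), and my reading of the P90 column as a plain percentile is an assumption:

```python
import numpy as np

def aggregate(scores: list[float]) -> dict[str, float]:
    """Summarise a list of 0-10 judge ratings for one model/metric.

    Mean: average rating; IQR: spread of the middle 50% of ratings;
    P90: a percentile cut-off of the rating distribution (assumed here).
    """
    s = np.asarray(scores)
    q1, q3 = np.percentile(s, [25, 75])
    return {
        "mean": round(float(s.mean()), 2),
        "iqr": round(float(q3 - q1), 2),
        "p90": round(float(np.percentile(s, 90)), 2),
    }

# Hypothetical ratings for one model on one metric:
print(aggregate([9, 8, 10, 7, 9, 9, 8, 10, 6, 9]))
```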

Nuenki needs to do a lot of translation, quickly (to avoid noticeable latency while browsing) and at high quality - for a language learner, absorbing a model's mistakes can do more harm than good. In a previous blog post I investigated which LLMs were best at translation, and found that Claude was consistently good, with Llama 3.3-70b proving good enough, or better, at some languages.

Gemma 3's release

Gemma2-9b was an interesting model when I last evaluated it: poor at most languages, but surprisingly good at a few, e.g. German, particularly for its size. Gemma 3's benchmark scores look impressive, but with the recent advances in applying RL to models, benchmark ability is becoming increasingly detached from linguistic - particularly multilingual - skill. This is especially true for small models. It's worth noting that the last benchmark used Groq for Gemma inference, while this time I used the official Google API.

[Figure: bar chart of Gemma 3's Chatbot Arena Elo score, showing it slightly behind R1]

Consistent but unremarkable progress

Gemma-3-27b performed better than gemma2-9b. However, the average gap wasn't as large as I expected, particularly considering it's also a much larger model. To its credit, though, it did beat gemini-2.0-flash-exp. There was substantial variability between languages: its improvement in French was small, and it was marginally worse than gemma2-9b at German, but it improved considerably at Hungarian and was superb at Italian and Spanish.

27b's tiny regression at German is particularly interesting. Gemma2-9b was always notably good at German - not just in this quantitative benchmark, but in qualitative feedback from an Austrian friend I'd asked about it. The regression itself is not statistically significant at this sample size, but the lack of improvement is.
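
As a rough illustration of that kind of significance claim, here's how one might compare the two models' per-sentence German scores. This is a sketch with made-up data, using a Mann-Whitney U test as one reasonable choice, not necessarily the test used here:

```python
from scipy.stats import mannwhitneyu

# Hypothetical per-sentence German ratings (0-10 scale) for each model.
gemma2_9b = [9, 8, 9, 10, 8, 9, 7, 9, 8, 9]
gemma3_27b = [8, 9, 8, 9, 9, 8, 8, 9, 7, 9]

# Two-sided test: can we distinguish the two score distributions at all?
stat, p = mannwhitneyu(gemma2_9b, gemma3_27b, alternative="two-sided")
print(f"U={stat}, p={p:.3f}")  # a large p means no detectable difference
```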

LLM progress has been moving away from ever-more, ever-more-optimised pretraining towards reinforcement learning and post-training. While this is good for most tasks, it doesn't seem to generalise to the models' core lexical abilities, which have been improving at a relatively slower pace. Nevertheless, Gemma 3's substantial improvements in select languages mean that I'll look to integrate it into Nuenki once it's added to Groq.
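
Integration itself is straightforward, since Groq exposes an OpenAI-compatible endpoint. Here's a minimal sketch of the kind of translation call involved - the prompt and the use of llama-3.3-70b-versatile (from the table above) as a stand-in until a Gemma 3 model id is available are illustrative, not Nuenki's actual implementation:

```python
import os
from openai import OpenAI

# Groq's API is OpenAI-compatible; only the base URL and key differ.
client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",
)

def translate(text: str, target_language: str,
              model: str = "llama-3.3-70b-versatile") -> str:
    """Translate a sentence, returning only the translation."""
    response = client.chat.completions.create(
        model=model,  # swap in a Gemma 3 id once Groq hosts one
        temperature=0.2,  # low temperature for stable translations
        messages=[
            {"role": "system",
             "content": f"Translate the user's text into {target_language}. "
                        "Reply with the translation only."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content.strip()

print(translate("The weather is lovely today.", "German"))
```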

Future changes and methodology improvements

This quick test has a number of issues - I'm not controlling for the inference provider or for model size. That doesn't compromise the core point: Groq optimises its hosted models for speed while Google optimises for quality, and gemma-3-27b is the larger model, so both factors should favour Gemma 3 - which only reinforces the conclusion that the performance gap isn't as large as I'd hoped for. Still, it isn't particularly scientific, and that's something I'd like to change by building a proper translation benchmark with larger sample sizes, better (crowdsourced?) evaluation, and a more rigorous, reproducible approach. If you're interested in the methodology I used, my previous blog post goes into much more detail.

Conclusion

While Gemma 3's translation performance is a little disappointing relative to its benchmarks, it has made progress in some areas and is more useful than its predecessors. Also, while we're here: if you're interested in language learning and spend a lot of time on your computer, try out Nuenki! It turns everyday browsing - reading blog posts, scrolling through HN, Reddit, or Twitter - into constant, consistent language immersion.