| Model | Overall Score | Coherence Mean | Coherence IQR | Coherence P90 | Idiomaticity Mean | Idiomaticity IQR | Idiomaticity P90 | Accuracy Mean | Accuracy IQR | Accuracy P90 | Latency Median (ms) | Latency P90 (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gpt-4o-2024-08-06 | 8.92 | 8.75 | 1 | 7 | 8.07 | 2 | 6 | 8.90 | 1 | 8 | 803 | 1288 |
| deepl | 8.81 | 8.59 | 1 | 7 | 8.09 | 2 | 6 | 8.67 | 1 | 7 | 226 | 322 |
| claude-3-5-sonnet-20241022 | 8.77 | 8.53 | 1 | 7 | 8.39 | 1 | 7 | 8.92 | 1 | 8 | 1387 | 3112 |
| gemma-3-27b-it | 8.59 | 8.48 | 1 | 7 | 8.31 | 1 | 7 | 8.92 | 1 | 8 | 1115 | 1605 |
| llama-3.3-70b-versatile | 8.54 | 8.36 | 1 | 7 | 7.72 | 2 | 6 | 8.62 | 1 | 7 | 334 | 1687 |
| lingvanex | 8.48 | 8.69 | 2 | 7 | 7.42 | 3 | 4 | 8.54 | 1 | 6 | 206 | 270 |
| gemini-2.0-flash-exp | 8.26 | 8.66 | 1 | 7 | 8.24 | 1 | 6 | 9.00 | 1 | 8 | 514 | 702 |
| gemma2-9b-it | 7.90 | 7.87 | 2 | 6 | 7.70 | 2 | 5 | 8.43 | 2 | 7 | 407 | 489 |
| llama-3.1-8b-instant | 6.18 | 6.03 | 6 | 2 | 6.89 | 3 | 4 | 7.59 | 2 | 4 | 266 | 818 |
| mistral-small-latest | 5.25 | 6.54 | 5 | 2 | 7.18 | 3 | 4 | 7.45 | 2 | 3 | 603 | 1659 |
Nuenki needs to do a lot of translation, quickly (to avoid noticeable latency while browsing) and at high quality, since learning from mistaken translations can do more harm than good. In previous blog posts (1, 2) I compared general-purpose LLMs, but I'd recently heard about Lingvanex, a specialised translation service, and wanted to give it a try. I also added DeepL for context while I was there.
Lingvanex
Lingvanex's website promises "advanced natural processing solutions", or in other words, on-premise unlimited-usage ML translation. They also offer an API with reasonable pricing.
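For illustration, here is roughly what calling an HTTP translation API of this kind looks like. The endpoint URL, field names, and response shape below are placeholders of my own, not Lingvanex's documented API, so check their API reference before using anything like this.

```python
import os
import requests

# Hypothetical endpoint and payload shape -- placeholders for illustration,
# not Lingvanex's documented API. Consult their docs for the real URL,
# field names, and authentication scheme.
API_URL = "https://example.com/translate"

def translate(text: str, source: str, target: str) -> str:
    """Translate `text` from `source` to `target` via a REST translation API."""
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {os.environ['TRANSLATE_API_KEY']}"},
        json={"from": source, "to": target, "text": text},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["translation"]  # field name is also an assumption

print(translate("The quick brown fox jumps over the lazy dog.", "en", "de"))
```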
Excellent latency
The latency numbers should be taken with a grain of salt, as they were all measured from a single geographic location (the UK), but Lingvanex clearly has latency comparable to DeepL's. It's also very consistent, with a low P90 (the worst 10% of requests).
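For reference, median and P90 figures like those in the table can be produced from raw per-request timings along these lines. This is a minimal sketch: the measurement loop is my assumption rather than Nuenki's benchmark harness, and `translate_fn` stands in for any client (such as the hypothetical one sketched above).

```python
import time
import statistics

def measure_latency(translate_fn, samples: list[str]) -> tuple[float, float]:
    """Time each translation request and return (median, P90) in milliseconds."""
    timings_ms = []
    for sentence in samples:
        start = time.perf_counter()
        translate_fn(sentence)
        timings_ms.append((time.perf_counter() - start) * 1000)

    median = statistics.median(timings_ms)
    # statistics.quantiles with n=10 returns the nine decile cut points;
    # the last one is the 90th percentile.
    p90 = statistics.quantiles(timings_ms, n=10)[-1]
    return median, p90
```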
Coherent, but not as idiomatic
Across all languages, Lingvanex has middling performance, but the picture gets more interesting language by language. For some, like French, it beats everything except DeepL. Yet in every language there is a consistent pattern of high coherence and low idiomaticity. I think this is a result of Lingvanex (and DeepL, which shows the same trend to a lesser degree) using small models fine-tuned specifically for translation. Larger models may be slower, costlier, and unnecessarily generalised, but they can translate idiomatically rather than literally.
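For concreteness, the summary columns in the table can be derived from per-sentence judge scores roughly as below. Both the aggregation and my reading of the score columns' P90 as the boundary of the worst 10% (i.e. the 10th-percentile score) are assumptions on my part, not Nuenki's published pipeline.

```python
import statistics

def summarise(scores: list[float]) -> dict[str, float]:
    """Aggregate per-sentence scores (0-10) into mean, IQR, and a worst-10% cutoff.

    'p90' here follows the article's gloss of P90 as the worst 10%: for quality
    scores that is the 10th-percentile score (for latency it would be the
    90th-percentile time). This reading is an assumption.
    """
    q1, _, q3 = statistics.quantiles(scores, n=4)   # quartile cut points
    deciles = statistics.quantiles(scores, n=10)    # D1 .. D9
    return {
        "mean": statistics.mean(scores),
        "iqr": q3 - q1,
        "p90": deciles[0],                          # 10th-percentile score
    }

# Example: consistent coherence vs. more spread-out idiomaticity
coherence    = [9, 9, 8, 9, 8, 9, 8, 9, 9, 8]
idiomaticity = [9, 6, 8, 5, 9, 7, 6, 8, 5, 9]
print(summarise(coherence), summarise(idiomaticity))
```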
Conclusion
Lingvanex is OK, and its low latency is impressive, but its idiomaticity is too low for me to replace DeepL with it yet.
You might also be interested in my testing of Quasar Alpha, a mysterious new model on OpenRouter. It performs impressively well.