Whatever Quasar Alpha is, it's excellent at translation

Results below are averaged across all benchmarked languages: Chinese, Esperanto, French, German, Hungarian, Italian, Japanese, Korean, Spanish, Swedish, Ukrainian, and Vietnamese.
| Model | Overall Score | Coherence Mean | Coherence IQR | Coherence P90 | Idiomaticity Mean | Idiomaticity IQR | Idiomaticity P90 | Accuracy Mean | Accuracy IQR | Accuracy P90 | Latency Median (ms) | Latency P90 (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| quasar-alpha | 9.12 | 8.66 | 1 | 7 | 8.37 | 1 | 7 | 8.98 | 1 | 8 | 832 | 1280 |
| gpt-4o-2024-08-06 | 8.91 | 8.75 | 1 | 7 | 8.05 | 2 | 6 | 8.90 | 1 | 8 | 803 | 1288 |
| deepl | 8.81 | 8.59 | 1 | 7 | 8.07 | 2 | 6 | 8.67 | 0 | 7 | 226 | 322 |
| claude-3-5-sonnet-20241022 | 8.77 | 8.53 | 1 | 7 | 8.38 | 1 | 7 | 8.91 | 0 | 8 | 1387 | 3112 |
| gemma-3-27b-it | 8.59 | 8.48 | 1 | 7 | 8.31 | 1 | 7 | 8.92 | 1 | 8 | 1115 | 1605 |
| llama-3.3-70b-versatile | 8.54 | 8.36 | 1 | 7 | 7.72 | 2 | 6 | 8.64 | 1 | 7 | 334 | 1687 |
| lingvanex | 8.46 | 8.69 | 2 | 7 | 7.39 | 3 | 4 | 8.52 | 1 | 6 | 206 | 270 |
| gemini-2.0-flash-exp | 8.26 | 8.66 | 1 | 7 | 8.24 | 1 | 6 | 9.00 | 1 | 8 | 514 | 702 |
| gemma2-9b-it | 7.91 | 7.87 | 2 | 6 | 7.71 | 2 | 5 | 8.44 | 2 | 7 | 407 | 489 |
| llama-3.1-8b-instant | 6.12 | 6.03 | 6 | 2 | 6.84 | 3 | 3 | 7.59 | 2 | 4 | 266 | 818 |
| mistral-small-latest | 5.25 | 6.54 | 5 | 2 | 7.17 | 3 | 4 | 7.46 | 2 | 3 | 603 | 1659 |

Quasar Alpha

A few days ago a mysterious "Quasar Alpha" model appeared on OpenRouter, described as a "cloaked model provided to the community to gather feedback". There's a lot of speculation about what it is - the million-token context is typical of Gemini models, but some suspect it could be from OpenAI or Qwen. It'd be interesting to see whether the difference in latency from different places could be used to work out where the datacentre is located. Either way, it's quite impressive.

An unusually large improvement

Nuenki needs to do a lot of language translation, quickly (to avoid noticeable latency when browsing) and at high quality - learning from mistakes can do more harm than good. In previous blog posts I've compared various models as they came out, generally finding small, incremental improvements. Quasar Alpha is completely different: it's at the top of the leaderboard for practically every metric, particularly idiomaticity, something that small models tend to struggle with. That implies it's quite a large model, and yet it also has an impressive token speed.

Methodology

I went over it in more detail in the first post, but in short:

  • Coherence is calculated by translating English -> target language -> English three times, then having an LLM rate how close the final text is to the original
  • Idiomaticity is calculated by translating from English to the target language, then having 3 other (blinded) LLMs rate how idiomatic the translation is.
  • Accuracy is calculated by translating from English to the target language, then having 3 other (blinded) LLMs rate how accurate the translation is.
  • This is repeated across a sample of ~40 sentences, and the overall score is weighted towards coherence and interquartile range as the most reliable metrics.
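The round-trip coherence step can be sketched as follows. The `translate` and `rate_similarity` callables are hypothetical stand-ins for the real translation and judge-LLM calls, which aren't shown in the post:

```python
def coherence_score(sentence, target_lang, translate, rate_similarity, rounds=3):
    """Round-trip a sentence English -> target language -> English several
    times, then ask a judge to rate how close the result is to the original.

    translate(text, src, dst) and rate_similarity(a, b) are hypothetical
    stand-ins for the actual translation and LLM-judging calls.
    """
    text = sentence
    for _ in range(rounds):
        text = translate(text, "en", target_lang)   # English -> target
        text = translate(text, target_lang, "en")   # target -> English
    # Judge scores similarity of the round-tripped text to the original
    return rate_similarity(sentence, text)
```

A sketch under the stated assumptions, not the actual Nuenki pipeline; in practice the judge would be a blinded LLM returning a 0-10 score.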

Conclusion

Quasar Alpha's performance is incredibly impressive. If its pricing is comparable to Sonnet's, and it's either open-source or served by a reputable Western company, it'll probably replace Sonnet as Nuenki's tertiary translation method.

The latency and latency distribution are, interestingly, quite close to those of gpt-4o. The benchmark calls OpenAI's servers directly, while Quasar goes through OpenRouter, so it's probably a coincidence. Either way, I'm looking forward to its public release.