Benchmarking LLMs is Rather Difficult, Actually - Designing Better Experiments

Some Context

I want to know which LLMs are best at language translation, so that I can use the best models for each language that Nuenki supports. When people are learning from translations, it's really important to make sure they're accurate and idiomatic.

I initially built a fairly naive benchmark that translated a set of test sentences, then had four judge LLMs rate each translation from 1 to 10 on three different metrics. It's a bit more complicated than that, because I designed the coherence metric to be more resistant to LLM bias, but that was the crux of it.

The benchmark was "good enough" to inform Nuenki, but as more people took an interest in it and it ran into oversaturation issues, I wanted to build a better benchmark. Someone I'd spoken to who was doing similar work had mentioned Bradley-Terry models, which take in a set of pairwise comparisons ("Is A or B better in this instance?") and turn them into scores. I hoped that the switch would make the benchmark more rigorous, and better at discerning between models that were already very good.
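
For reference, the core of a Bradley-Terry fit is only a few lines. Here's a minimal sketch of the idea - not the benchmark's actual code, and the smoothing prior is an illustrative choice of mine:

```python
import math
from collections import defaultdict
from itertools import combinations

def fit_bradley_terry(comparisons, iters=200, prior=0.5):
    """Turn (winner, loser) pairs into relative strengths.

    `prior` adds a few virtual wins in each direction for every pair of
    models, which keeps zero-win models finite and lightly regularises
    the fit. Strengths are normalised to a geometric mean of 1.
    """
    models = sorted({m for pair in comparisons for m in pair})
    wins = defaultdict(float)   # total wins per model
    games = defaultdict(float)  # comparisons per unordered pair

    for a, b in combinations(models, 2):
        wins[a] += prior
        wins[b] += prior
        games[frozenset((a, b))] += 2 * prior

    for winner, loser in comparisons:
        wins[winner] += 1.0
        games[frozenset((winner, loser))] += 1.0

    strengths = {m: 1.0 for m in models}
    for _ in range(iters):
        # Minorise-maximise update: s_i = W_i / sum_j n_ij / (s_i + s_j)
        strengths = {
            i: wins[i] / sum(
                games[frozenset((i, j))] / (strengths[i] + strengths[j])
                for j in models
                if j != i
            )
            for i in models
        }
        # Rescale so the geometric mean stays at 1; only ratios matter.
        g = math.exp(sum(math.log(s) for s in strengths.values()) / len(strengths))
        strengths = {m: s / g for m, s in strengths.items()}
    return strengths

# e.g. fit_bradley_terry([("GPT 4.1", "GPT 4o"), ("GPT 4.1", "DeepL"), ("GPT 4o", "DeepL")])
```

Roughly speaking, under Bradley-Terry a model with twice another's strength is given 2:1 odds of winning a head-to-head comparison.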

I also wanted to take the opportunity to look into LLM bias when judging, and to screen it out via consensus mechanisms.

I implemented that, and it's open source. Here are the initial results for German:

Rank | Model name | Temp | Score (95% CI)
---- | ---------- | ---- | --------------
1 | GPT 4.1 | 0.0 | 1.93 [0.92 - 4.06]
2 | Claude Sonnet 3.5 2024-10-22 | 0.0 | 1.79 [0.90 - 3.58]
3 | GPT 4.1 Mini | 0.0 | 1.79 [1.03 - 3.12]
4 | Gemma 3 27b | 0.0 | 1.47 [0.82 - 2.63]
5 | Nuenki Hybrid | N/A | 1.44 [0.64 - 3.23]
6 | GPT 4o | 0.0 | 1.25 [0.62 - 2.51]
7 | Gemini 2.5 Flash Preview 04-17 | 0.0 | 1.24 [0.69 - 2.24]
8 | GPT 4.1 Nano | 0.0 | 1.24 [0.64 - 2.40]
9 | Llama 4 Maverick | 0.0 | 1.17 [0.54 - 2.54]
10 | Grok 3 Beta | 0.0 | 1.13 [0.61 - 2.10]
11 | Claude Sonnet 3.7 2025-02-19 | 0.0 | 1.12 [0.62 - 2.02]
12 | Qwen 3 32B | 0.0 | 0.93 [0.50 - 1.74]
13 | Mistral Small Latest | 0.0 | 0.88 [0.42 - 1.82]
14 | Qwen 3 235B A22B | 0.0 | 0.85 [0.47 - 1.54]
15 | Qwen 3 14B | 0.0 | 0.81 [0.39 - 1.68]
16 | Llama 4 Scout | 0.0 | 0.81 [0.45 - 1.44]
17 | Llama 3.3 70b | 0.0 | 0.67 [0.31 - 1.44]
18 | Qwen 3 30B A3B | 0.0 | 0.58 [0.39 - 0.87]
19 | DeepL | N/A | 0.48 [0.27 - 0.87]
20 | Lingvanex | N/A | 0.31 [0.15 - 0.64]

Weird results: DeepL, Lingvanex, and Nuenki Hybrid

This dataset is a bit weird in general, and a lot of that can be put down to the poor p-values and the large amount of noise. Nevertheless, one clear trend stands out even once you account for the error boundaries: Nuenki Hybrid, DeepL, and Lingvanex all score much lower than they did in the original research, and lower than my general sense of their real-world performance.

I think I know why. These three are the only entries that were translated via their own APIs, without the evaluation benchmark setting the prompt. And that prompt happened to introduce a bias.

While working on the evaluator I noticed that the judge models were disagreeing about what the "best choice" was, and it was leading to a lot of noise and uncertainty in the data. For example, models would disagree over whether it was best to keep an English acronym or translate it into German. My first mistake was to attempt to *standardise* away these disagreements by specifying a style in the translation and evaluation prompts. Specifically, I included:

```
Translate {target_spec} into the idiomatic, native, absolutely correct {to_lang} you'd find in a high-quality language-learning textbook. Avoid English loanwords where direct equivalents exist.
```

This is something that I should have foreseen earlier: DeepL, Lingvanex, and the Nuenki Hybrid never saw that instruction, but the judges still scored every translation against it, so those three were penalised for stylistic choices rather than genuine mistakes.

In the future, I intend to tell the evaluation prompts to ignore these kinds of disagreements, and I'll make the prompt less opinionated. Evaluation should focus on *mistakes* over *stylistic differences*, and if neither translation makes a mistake, the judge should reply with "Identical". The whole process of seeing weird results and tweaking the experiment to "fix" them is something that makes me quite uncomfortable, but I think it's appropriate in this case.
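
As a rough sketch of what that judge prompt might look like - the wording and placeholder names are illustrative, not the benchmark's real template:

```python
# Illustrative only: a less opinionated judge prompt along the lines
# described above. The placeholders ({source_text}, {translation_a},
# {translation_b}, {to_lang}) are mine, not the benchmark's actual variables.
NEUTRAL_JUDGE_PROMPT = """You are comparing two {to_lang} translations of the same English text.

English: {source_text}
Translation A: {translation_a}
Translation B: {translation_b}

Focus only on genuine errors: mistranslations, grammatical mistakes, and omissions.
Ignore stylistic differences - e.g. whether an English acronym is kept or translated,
word-order preferences, or register - as long as both read as natural {to_lang}.

Reply with exactly one of "A", "B", or "Identical" (if neither contains a mistake)."""
```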

Dealing with cost and uncertainty

It turns out that using a corpus of four quite expensive models (Sonnet 3.7; Gemini Flash Thinking; GPT-4.1; Grok 3) to perform hundreds of evaluations, many of which have no clear signal, between models that are largely quite similar tends to produce a pretty awful cost-to-information ratio. While the old benchmark cost 20 USD to produce some pretty useful data over a dozen languages and ~20 models, this one produces noisy, practically useless data for one language and fewer models while costing a little more.
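
To make the scaling concrete, here's a back-of-envelope calculation with assumed round numbers rather than the benchmark's actual settings:

```python
# Back-of-envelope: pairwise judging scales quadratically with the number
# of models. The sentences-per-pair figure is an assumption for illustration.
from math import comb

models = 20
sentences_per_pair = 5   # assumed
judges = 4               # Sonnet 3.7, Gemini Flash Thinking, GPT-4.1, Grok 3

pairs = comb(models, 2)                              # 190 pairs
judge_calls = pairs * sentences_per_pair * judges    # 3,800 judge calls
print(f"{pairs} pairs -> {judge_calls} judge calls per language")
```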

Many Metrics?

The old benchmark used three metrics rather than one, and I think it might be worth going back to that model. A clear delineation between accuracy and idiomaticity is useful, but coherence is the really valuable metric.

The old coherence metric worked by translating from English to the target language and back three times, then asking blinded models to rate how close it was in meaning to the original text. In doing so, it reduced the impact of model lineage bias and produced a less subjective measure of quality. Comparing two English texts is a much simpler test than trying to have models - themselves flawed - critique a translation.
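
A sketch of that round-trip, where `translate` stands in for whichever translation call is being tested (passed in as a function, since the real API wrappers aren't shown here), and the judge prompt wording is illustrative rather than the original:

```python
# Sketch of the old coherence metric: round-trip the text, then have a
# blinded judge compare two English texts rather than critique a translation.
def round_trip(text, target_lang, translate, rounds=3):
    """Translate English -> target language -> English, `rounds` times over.

    `translate(text, source, target)` is whatever translation call is
    being benchmarked - a stand-in, not a real library function.
    """
    current = text
    for _ in range(rounds):
        foreign = translate(current, source="en", target=target_lang)
        current = translate(foreign, source=target_lang, target="en")
    return current

def coherence_prompt(original, round_tripped):
    """The blinded judges only ever see two English texts."""
    return (
        "Rate from 1 to 10 how closely these two English texts match in meaning.\n"
        f"Text 1: {original}\n"
        f"Text 2: {round_tripped}"
    )
```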

Model Lineage Bias

One of the features of the new program is that it produces various different leaderboards based on whether or not different judge models were included.

What does this tell us? Well, we could choose to read into the data. For example, Sonnet 3.5 does a lot worse when Sonnet 3.7 isn't scoring, and 2.5 Flash Preview seems to actually slightly dislike its own outputs. Honestly, though, this data is so noisy that it's not worth drawing any conclusions from.
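
The mechanism itself is simple enough to sketch: drop one judge's comparisons and refit. This reuses the `fit_bradley_terry` sketch from earlier, and assumes each record is a (judge, winner, loser) tuple - that layout is mine, not the program's actual schema:

```python
# Illustrative sketch of the judge-exclusion leaderboards.
def leaderboard_without(records, excluded_judge):
    filtered = [(winner, loser) for judge, winner, loser in records
                if judge != excluded_judge]
    return fit_bradley_terry(filtered)

# e.g. does Sonnet 3.5 drop when Sonnet 3.7 stops judging?
# leaderboard_without(all_records, "claude-3.7-sonnet")
```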

Iterative Matchmaking

A friend of mine is a game developer, and he suggested borrowing the skill-based matchmaking algorithms used in gamedev. The idea would be to model each translation method the way a game developer models a player, through an algorithm like Elo or TrueSkill, then selectively compare the two most uncertain players until the algorithm reached a minimum confidence in its ratings. The current system actually produces p-values already, so we needn't switch to an entirely new algorithm - just pick the pair with the lowest p-value, compare them, and keep doing that until everything is p>0.95.

It raises concerns about how scientific it is to run a dynamic algorithm, rather than simply calculating a fixed batch of comparisons then fitting a model to them. Nevertheless, it may well produce much better results, and I can't think of any concrete statistical flaws with it provided we're careful about ensuring a minimum amount of linkage.
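
A sketch of that loop, again reusing the Bradley-Terry fit from earlier; `run_comparison` and `pairwise_confidence` are hypothetical stand-ins for whatever actually asks the judges and reads the fit's p-values:

```python
from itertools import combinations

def iterative_benchmark(models, comparisons, run_comparison, pairwise_confidence,
                        threshold=0.95, budget=2000):
    """Keep comparing the least-resolved pair until every pair is confident.

    `run_comparison(a, b)` asks the judges and returns (winner, loser);
    `pairwise_confidence(scores, a, b)` is the fit's confidence that one of
    the pair is genuinely better. Both are stand-ins, not real functions.
    `budget` caps the number of extra comparisons so the loop can't run away.
    """
    for _ in range(budget):
        scores = fit_bradley_terry(comparisons)
        # The pair the current fit is least sure about.
        a, b = min(combinations(models, 2),
                   key=lambda pair: pairwise_confidence(scores, *pair))
        if pairwise_confidence(scores, a, b) >= threshold:
            break  # even the least certain pair is resolved well enough
        comparisons.append(run_comparison(a, b))
    return fit_bradley_terry(comparisons)
```

Seeding the loop with a fixed batch of comparisons first, as in the plan below, is one way to guarantee that minimum amount of linkage, since every model then enters the comparison graph before the dynamic part starts.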

So, what's next?

I'm going to implement the lessons in an updated benchmarking system. The new system will:

  • Conduct a fixed batch of comparisons, then iteratively fit the model and compare the pair with the worst p-value until confidence is achieved.
  • Split the score into three axes - coherence, accuracy, and idiomaticity. They will be compared simultaneously, and the decision to compare two models will be based on the worst of the three.
  • Make the translation and comparison prompts more neutral, telling the judges to ignore subjective stylistic choices and focus solely on actual errors.
  • Disable thinking for Gemini Flash 2.5 - the prompt makes them think aloud anyway, and it's a good model on its own.