Experimentation Matters: Why Nuenki isn't using pairwise evaluations

Experiments with pairwise evaluations

Nuenki's old language translation quality benchmark used a simple system where a suite of LLMs would score the outputs of other LLMs between 1 and 10. There was a little additional complexity - namely, the coherence metric - but that was the gist of it. If you're interested in more details, take a look at the previous blog posts.

I've spent the last week or so working on a new benchmark using pairwise evaluations and a Bradley-Terry model. I documented some of the issues I was having with cost, and implemented the fixes.

I then spent 100 USD attempting to run enough comparisons to bring the p-values down and let the model find some actual signal in the noise. It didn't work - not even close. Further improvement hits sharply diminishing returns: the changes I discussed in the last blog post helped, but are nowhere near enough. A single experiment on a single language would cost many hundreds to thousands of dollars.
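For context, the Bradley-Terry model turns head-to-head win counts into a latent "strength" per model, with probabilities and p-values derived from those strengths. Here's a minimal sketch of the standard iterative fit (Zermelo's algorithm) - this is not Nuenki's actual code, and the win counts are made up:

```python
def bradley_terry(wins, n_models, iters=200):
    """Fit Bradley-Terry strengths with Zermelo's iterative MLE.
    wins[i][j] = number of times model i beat model j in pairwise judging."""
    p = [1.0 / n_models] * n_models
    for _ in range(iters):
        new_p = []
        for i in range(n_models):
            w_i = sum(wins[i])  # total wins for model i
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n_models)
                if j != i
            )
            new_p.append(w_i / denom if denom > 0 else p[i])
        total = sum(new_p)
        p = [x / total for x in new_p]  # normalise so strengths sum to 1
    return p

# Synthetic head-to-head counts for three models, where A > B > C in quality
wins = [
    [0, 8, 9],  # A's wins over (A, B, C)
    [2, 0, 7],  # B's wins
    [1, 3, 0],  # C's wins
]
print(bradley_terry(wins, 3))  # strengths should come out ordered A > B > C
```

The cost problem is visible in the structure: the input is a full pairwise win matrix, so narrowing the confidence intervals means paying for comparisons between every pair of models.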

This leaves me with a dilemma: yes, I narrowly prefer pairwise comparisons from a scientific perspective - but they yield no useful data if I can't afford to run the experiments.

A compromise system

I built a new system that combines the two. Roughly speaking (you can look in the repo if you're curious), it works by:

  1. Iterating over the ~160 sentences
  2. Translating the sentence with all of the tested models
  3. Iterating over the 6 translation evaluation systems
  4. Telling them to judge each separate translation by various factors, rank them, and score them from 0-100
  5. Combining those scores and doing statistics on them
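The loop above can be sketched as follows - the function names and signatures are hypothetical stand-ins, not Nuenki's actual code:

```python
from statistics import mean

def run_benchmark(sentences, models, judges, translate, judge_score):
    """Hypothetical sketch of the evaluation loop. `translate(model, sentence)`
    returns that model's translation; `judge_score(judge, sentence, translations)`
    returns a {model: score 0-100} dict from one judge."""
    scores = {m: [] for m in models}
    for sentence in sentences:
        # Step 2: translate the sentence with every tested model
        translations = {m: translate(m, sentence) for m in models}
        # Steps 3-4: every judge scores every translation
        for judge in judges:
            for model, score in judge_score(judge, sentence, translations).items():
                scores[model].append(score)
    # Step 5: combine the scores (here, a simple per-model mean)
    return {m: mean(v) for m, v in scores.items()}
```

The key cost saving over pairwise evaluation: each judge scores all translations of a sentence in one call, rather than one call per model pair.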

There are of course various controls (e.g. consolidating multiple translations into one when there are duplicates, randomising the order, and making the test blinded).
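Those controls might look something like this - again a hypothetical sketch, assuming translations arrive as a model-to-text mapping:

```python
import random

def prepare_blind_batch(translations, seed=0):
    """Hypothetical sketch of the controls: consolidate duplicate translations,
    randomise the order, and hide which model wrote what.
    `translations` maps model name -> translated sentence."""
    by_text = {}
    for model, text in translations.items():
        by_text.setdefault(text, []).append(model)  # duplicates share one entry
    texts = list(by_text)
    random.Random(seed).shuffle(texts)  # randomise presentation order
    items = {f"T{i + 1}": text for i, text in enumerate(texts)}  # blinded labels
    key = {label: by_text[text] for label, text in items.items()}
    return items, key  # show `items` to the judge; keep `key` for unblinding
```

Consolidating duplicates means an identical output is judged once and the score shared, which avoids both wasted judging and the judge treating repeats as independent options.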

This system is a lot cheaper while still producing results with fairly good p-values! It differs from the old system in that it relies on a single metric rather than three, and isn't completely blinded: the comparison models see all of the different translations when scoring. Here are the results of an early test run on German that cost ~6 USD. You can click on a row to view p-values!
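For the p-values, one simple option - a sketch of the general technique, not necessarily the exact test the benchmark runs - is a paired permutation test on the per-sentence score differences between two models:

```python
import random

def paired_permutation_pvalue(a, b, n_perm=10_000, seed=0):
    """Two-sided paired permutation test on per-sentence score differences
    between two models: a small p means the gap is unlikely to be noise."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_perm):
        # Under the null hypothesis, each paired difference's sign is arbitrary
        s = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(s) >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)  # add-one smoothing avoids p = 0
```

Pairing by sentence matters: it factors out sentence difficulty, so far fewer judgements are needed to separate two models than with unpaired comparisons.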

P.S. If you'd like to learn a language while you browse the web, give Nuenki a try!
