The more LLMs think, the worse they translate

Why test this?

In my testing of LLM translation, I've long observed that LLMs tend to perform worse when they think before translating than when they simply generate the answer directly. I wanted to know whether critiquing after translating, rather than thinking beforehand, would behave differently, and to compare both against a plain baseline and against the technique used in my open-source hybrid translator.

My hypothesis

I previously had some success with my hybrid translator, which performs better than any single model by producing translations with the top 3-5 models and then having GPT-4.1 critique and combine them into a single, very-high-quality translation. I presumed that the critique stage was a major contributor to that success, and thought that applying the same technique to a single LLM's output - perhaps with a second, different LLM as the critic - might produce similar improvements. Perhaps that diversity - one model critiquing another - would cover blind spots inherent in each model and improve performance?
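
Concretely, the hypothesis amounts to something like the sketch below. The `complete(model, prompt)` helper is a hypothetical stand-in for whatever LLM API is used, and the prompt wording is illustrative rather than the exact wording from the benchmark.

```ts
// Hypothetical stand-in for an LLM API call - not a real client library.
type Complete = (model: string, prompt: string) => Promise<string>;

// The hypothesis: one model drafts a translation, then a second (possibly
// different) model critiques it and produces a revised final version.
async function critiquedTranslation(
  complete: Complete,
  translator: string,   // model that produces the draft
  critic: string,       // same model, or a different one for "diversity"
  source: string,
  targetLang: string,
): Promise<string> {
  // Plain, no-thinking draft from the translator model.
  const draft = await complete(
    translator,
    `Translate the following text into ${targetLang}. ` +
      `Reply with only the translation.\n\n${source}`,
  );

  // The critic reviews the draft and outputs an improved final version.
  return complete(
    critic,
    `Here is a source text and a candidate ${targetLang} translation. ` +
      `Critique the candidate, then output an improved final translation ` +
      `and nothing else.\n\nSource:\n${source}\n\nCandidate:\n${draft}`,
  );
}
```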

Unexpected results

[Interactive chart]
This is a partial dataset - there is more data below!

The results show quite the opposite. A baseline translation - that is, asking the LLM to translate without thinking first - performed better than all forms of thinking, which were:

  • "Pre-think" - thinking before answering, as is fairly standard nowadays.
  • "Post-think" - taking a baseline translation then having every model critique it and synthesise a new one.
  • "Pre-and-Post-think" - combining the two.

Post-thinking performed worse than pre-thinking, with the combination performing worse still.
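
Roughly, the four strategies differ only in prompt flow - something like this sketch, again with the hypothetical `complete` helper and paraphrased prompts. (In practice, "thinking" could just as well be a model's native reasoning mode rather than an explicit prompt.)

```ts
// Hypothetical stand-in for an LLM API call - not a real client library.
type Complete = (model: string, prompt: string) => Promise<string>;

type Strategy = "baseline" | "pre-think" | "post-think" | "pre-and-post-think";

async function translateWith(
  complete: Complete,
  model: string,
  source: string,
  targetLang: string,
  strategy: Strategy,
): Promise<string> {
  const ask = `Translate the following text into ${targetLang}:\n\n${source}`;

  // Baseline: answer directly, with no reasoning step.
  if (strategy === "baseline") {
    return complete(model, `${ask}\n\nReply with only the translation.`);
  }

  // Pre-think: reason about the text first, then answer.
  if (strategy === "pre-think") {
    const full = await complete(
      model,
      `${ask}\n\nThink step by step about meaning, tone and idiom first, ` +
        `then give the final translation alone on the last line.`,
    );
    const lines = full.trim().split("\n");
    return lines[lines.length - 1]; // keep only the final answer
  }

  // Post-think (and Pre-and-Post-think): start from a baseline (or pre-think)
  // translation, then have the model critique it and produce a revised one.
  // Here the same model critiques itself; the cross-model case is the earlier
  // sketch with a separate critic.
  const draft = await translateWith(
    complete,
    model,
    source,
    targetLang,
    strategy === "pre-and-post-think" ? "pre-think" : "baseline",
  );
  return complete(
    model,
    `Here is a source text and a candidate ${targetLang} translation. ` +
      `Critique the candidate, then output an improved final translation ` +
      `and nothing else.\n\nSource:\n${source}\n\nCandidate:\n${draft}`,
  );
}
```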

Diversity doesn't help

As you can see in the full, unaggregated results at the bottom, the "Post-think" mode was run with each model's baseline critiqued by each of the other models. Looking at that in more detail - though note that the p-values are quite poor here - critiquing with a superior model produces a better output than the original translation, while critiquing with a worse model makes it worse. There's seemingly no advantage to using a different model for the critique, even when the models are close in quality (e.g. DeepSeek V3 and 4.1 Mini). The diversity hypothesis has clearly been disproven - fascinating!

Validating the hybrid translator

Finally, I included an Ensemble strategy, where the baseline outputs of all the other models were provided and a synthesised, combined translation was produced. This uses the same technique as the hybrid translator, and the results are consistent with broader evaluations: it beats its peers. I also tried disabling Gemma - by far the weakest model - as a contributor to the ensemble. That showed a slight, but statistically insignificant, improvement. The results also showed that the quality of the synthesising model matters significantly, with substantial performance differences between models. Anyway - one quick sketch below, then to the data!
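
A minimal sketch of the Ensemble strategy, with the same hypothetical `complete` helper and illustrative prompt wording:

```ts
// Hypothetical stand-in for an LLM API call - not a real client library.
type Complete = (model: string, prompt: string) => Promise<string>;

// Ensemble: baseline translations from several contributor models, then one
// synthesising model critiques and combines them into a single final output.
async function ensembleTranslation(
  complete: Complete,
  contributors: string[],   // e.g. every benchmark model, optionally minus Gemma
  synthesiser: string,      // its quality matters a lot, per the results
  source: string,
  targetLang: string,
): Promise<string> {
  // Collect a plain baseline translation from every contributor in parallel.
  const baselines = await Promise.all(
    contributors.map((model) =>
      complete(
        model,
        `Translate the following text into ${targetLang}. ` +
          `Reply with only the translation.\n\n${source}`,
      ),
    ),
  );

  // The synthesiser compares the candidates and produces one combined version.
  const candidates = baselines
    .map((t, i) => `Candidate ${i + 1}:\n${t}`)
    .join("\n\n");
  return complete(
    synthesiser,
    `Here are several candidate ${targetLang} translations of the same source. ` +
      `Compare them, then produce one combined, high-quality final translation ` +
      `and nothing else.\n\nSource:\n${source}\n\n${candidates}`,
  );
}
```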

If you find this interesting, you might like to try the tool that all of this research is for - a browser extension that lets you learn a language while you browse the web. There is also a Discord server, if you'd like to discuss this stuff!

Comparing Strategies

[Interactive chart: strategy comparison]

Comparing Models

This data is consistent with my other benchmarks, which is a nice bit of semi-independent validation!

[Interactive chart: model comparison]

Full Data

Beware the p-values - you can click to see them.

[Interactive chart: full dataset]