Why test this?
In my testing of LLM translation, I've long observed that LLMs tend to perform worse when they think before translating rather than simply generating the answer directly. I wanted to know whether critiquing after translating, rather than thinking beforehand, would behave differently, and to compare both approaches with a plain baseline and with the technique used in my open-source hybrid translator.
My hypothesis
I previously had some success with my hybrid translator, which performs better than any single model by producing translations with the top 3-5 models, then having GPT-4.1 critique and combine them into a single, very-high-quality translation. I presumed that the critique stage was a major contributor to that success. I thought that applying the same technique to a single LLM output - perhaps with the added variety of two different LLMs - might produce similar improvements. Perhaps diversity of LLMs - having one model critique another - would cover blind spots inherent in the models, and improve performance?
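For context, here's a minimal sketch of that critique-and-combine pipeline. This is not the hybrid translator's actual code: `call_model` is a placeholder for whichever LLM provider you use, and the prompt wording is illustrative.

```python
def call_model(model: str, prompt: str) -> str:
    """Placeholder: send `prompt` to `model` and return its text response."""
    raise NotImplementedError("wire this up to your LLM provider of choice")


def hybrid_translate(text: str, target_lang: str,
                     candidate_models: list[str],
                     judge_model: str = "gpt-4.1") -> str:
    # Step 1: get independent candidate translations from several models.
    candidates = [
        call_model(m, f"Translate the following text into {target_lang}. "
                      f"Reply with only the translation.\n\n{text}")
        for m in candidate_models
    ]
    # Step 2: have the judge model critique the candidates and synthesise
    # a single translation that combines their strengths.
    numbered = "\n\n".join(f"Candidate {i + 1}:\n{c}" for i, c in enumerate(candidates))
    judge_prompt = (
        f"Here are {len(candidates)} candidate translations of the same source text "
        f"into {target_lang}. Critique each one, then output a single final "
        f"translation that combines their strengths.\n\nSource text:\n{text}\n\n{numbered}"
    )
    return call_model(judge_model, judge_prompt)
```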
Unexpected results
The results show quite the opposite. A baseline translation - that is, asking the LLM to translate without thinking first - performed better than all forms of thinking, which were:
- "Pre-think" - thinking before answering, as is fairly standard nowadays.
- "Post-think" - taking a baseline translation then having every model critique it and synthesise a new one.
- "Pre-and-Post-think" - combining the two.
Post-thinking performed worse than pre-thinking, with the combination performing worse still.
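To make the modes concrete, here's roughly how they differ, written against the same `call_model` placeholder as the earlier sketch. The prompts are illustrative, not the exact ones used in the benchmark.

```python
def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError  # same placeholder as in the earlier sketch


def baseline(model: str, text: str, target_lang: str) -> str:
    # Translate directly, with no explicit reasoning step.
    return call_model(model, f"Translate into {target_lang}. "
                             f"Reply with only the translation.\n\n{text}")


def pre_think(model: str, text: str, target_lang: str) -> str:
    # Reason about the translation first, then give the final answer.
    # (In practice you'd extract just the final translation from the response.)
    return call_model(model, (
        f"Think step by step about how best to translate this text into "
        f"{target_lang}, then give your final translation.\n\n{text}"
    ))


def post_think(translator: str, critic: str, text: str, target_lang: str) -> str:
    # Start from a baseline translation, then have a (possibly different)
    # model critique it and synthesise a revised translation.
    draft = baseline(translator, text, target_lang)
    return call_model(critic, (
        f"Critique this {target_lang} translation of the source text, then output "
        f"an improved final translation.\n\nSource:\n{text}\n\nDraft:\n{draft}"
    ))


def pre_and_post_think(translator: str, critic: str, text: str, target_lang: str) -> str:
    # Combine both: think before translating, then critique and revise.
    draft = pre_think(translator, text, target_lang)
    return call_model(critic, (
        f"Critique this {target_lang} translation of the source text, then output "
        f"an improved final translation.\n\nSource:\n{text}\n\nDraft:\n{draft}"
    ))
```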
Diversity doesn't help
As you can see in the full, unaggregated results at the bottom, the "Post-think" mode was run with each of the other models acting as the critic. When you look at that in more detail - though note that the p-values are quite poor here - you find that critiquing with a superior model produces a better output than the original translation, while critiquing with a worse model makes it worse. There's seemingly no advantage to using a different model for the critique, even when the models are close in quality (e.g. Deepseek V3 and 4.1 Mini). The diversity hypothesis has clearly been disproven - fascinating!
Validating the hybrid translator
Finally, I included an Ensemble strategy, where the baseline outputs of all the other models were provided and a synthesised, combined translation was produced. This uses the same technique as the hybrid translator, and the results are consistent with broader evaluations: it beats its peers. I also tried disabling Gemma - by far the weakest model - as a contributor to the ensemble, which showed a slight but statistically insignificant improvement. The results also showed that the quality of the synthesising model matters, with substantial performance differences between models. Anyway - to the data!
If you find this interesting, you might like to try the tool that all of this research is for - a browser extension that lets you learn a language while you browse the web. There is also a Discord server, if you'd like to discuss this stuff!
Comparing Strategies
Comparing Models
This data is consistent with my other benchmarks, which is a nice bit of semi-independent validation!
Full Data
Beware the p-values - you can click to see them.