A Break-Down of Research in Computation & Language

Lost in Machine Translation

A translation-informed approach to multilingual LLM evaluation shows how better benchmarking across languages can reduce AI risk and improve global model performance.

In the race to build large language models (LLMs) that work across dozens of languages, the tech world is charging ahead (but not always with a clear understanding of how well these models actually perform across cultures and tongues). Multilingual AI is no longer a futuristic concept; it’s already in our daily lives—responding to customer questions, summarizing contracts, translating content, and even tutoring students. But as this technology becomes more central to global business operations, one question becomes critical: How do we know these models are doing a good job in every language they claim to support?

This is where the research paper “Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation” steps in. The researchers identify a blind spot in how we currently evaluate these powerful models: most existing tests were originally designed for English. When models are tested in other languages, the process is often messy, inconsistent, and hard to reproduce. And that’s a serious issue (because if we can’t measure multilingual performance accurately, we can’t trust the results we’re getting). Worse, we risk deploying biased or underperforming models in critical real-world settings, from financial services to education to healthcare.

To illustrate the point, think of it like this: If a car company tested its vehicles only on smooth, sunny highways in one country, then rolled them out globally without checking how they perform on snow, dirt roads, or in heavy traffic, we’d call that irresponsible. That’s essentially what’s happening today with multilingual LLMs. Companies are building global products with evaluation methods that often don’t account for the full range of languages or contexts where these models are used.

The researchers propose a solution: borrow the tools and lessons from the field of machine translation (MT). MT is an older discipline that’s spent decades figuring out how to measure the quality of automated language outputs. It has mature, well-vetted benchmarks, metrics, and testing protocols designed specifically to assess language accuracy, fluency, and faithfulness to source meaning (across many languages).

Instead of reinventing the wheel, the paper argues that we should treat multilingual LLM evaluation more like machine translation evaluation. That means applying techniques such as:

  • Reference-based evaluations, which compare model output against high-quality human translations (a minimal code sketch follows this list).
  • Human evaluation frameworks, where bilingual speakers assess how well models preserve meaning and tone.
  • Granular analysis, to pinpoint which languages or tasks a model handles well (and where it falls short).
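
To make the first item concrete, here is a minimal sketch of a reference-based evaluation using the open-source sacrebleu library. The sentences are placeholders rather than data from the study, and the study is not tied to these particular metrics; the point is simply that model output gets scored against trusted human translations.

```python
# A minimal sketch of reference-based evaluation with sacrebleu
# (not the paper's exact pipeline; the sentences below are placeholders).
import sacrebleu

# Model outputs for a target language, one string per segment.
hypotheses = [
    "Der Vertrag tritt am 1. Januar in Kraft.",
    "Bitte senden Sie uns die unterschriebene Kopie zurück.",
]

# High-quality human reference translations, aligned segment by segment.
references = [
    "Der Vertrag tritt am 1. Januar in Kraft.",
    "Bitte schicken Sie uns die unterzeichnete Kopie zurück.",
]

# sacrebleu expects a list of reference streams (to allow multiple references).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])

print(f"BLEU: {bleu.score:.1f}")
print(f"chrF: {chrf.score:.1f}")
```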

By borrowing these practices, the researchers aim to inject greater rigor and transparency into how we measure multilingual AI. This isn’t just about being fair to the technology; it’s about being fair to the people using it, whether that’s a Spanish-speaking client receiving financial advice, a Tagalog-speaking customer navigating tech support, or a French student learning through an AI-powered tutor.

In short, the research reframes multilingual LLM evaluation not as an AI performance issue, but as a business risk, a user trust problem, and a barrier to global product success. And their method—drawing from the time-tested discipline of MT—offers a practical path forward.

To move from theory to practice, the researchers behind “Déjà Vu” conducted a series of controlled experiments (each designed to expose where current evaluation methods fall short and how more structured, translation-inspired approaches can improve the picture). Their goal wasn’t just to critique the status quo, but also to stress-test their proposed fixes in a way that decision-makers can understand and act on.

One of the first experiments zeroed in on the practice of using machine-translated prompts when testing multilingual models. In many current evaluations, instead of crafting native prompts in each language, researchers rely on English prompts that are machine-translated into other languages. On paper, this seems efficient. In practice, it’s problematic. The study showed that these machine-generated prompts introduce subtle biases. A model might perform well not because it understands the target language better, but because the phrasing—coming from an English source—plays to its strengths in ways that wouldn’t naturally occur in real-world usage. This creates the illusion of strong multilingual performance, when the reality may be quite different.
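
A rough sketch of what that comparison looks like in practice: score the same model on natively authored prompts and on prompts machine-translated from English, then look at the gap. The helper names below (accuracy, prompt_source_gap, model_answers_correctly) are hypothetical placeholders for whatever an evaluation harness would actually use, not code from the paper.

```python
# Hypothetical sketch of the prompt-source comparison described above.
# `model_answers_correctly` stands in for a task-specific correctness check.

def accuracy(prompt_set, model_answers_correctly):
    """Fraction of (prompt, gold_answer) pairs the model gets right."""
    hits = sum(1 for prompt, gold in prompt_set if model_answers_correctly(prompt, gold))
    return hits / len(prompt_set)

def prompt_source_gap(native_prompts, translated_prompts, model_answers_correctly):
    """Compare accuracy on natively authored vs. machine-translated prompts.

    A large gap in either direction suggests the benchmark is measuring
    sensitivity to prompt phrasing rather than genuine language ability.
    """
    native_acc = accuracy(native_prompts, model_answers_correctly)
    translated_acc = accuracy(translated_prompts, model_answers_correctly)
    return native_acc, translated_acc, translated_acc - native_acc
```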

Another experiment explored the way researchers report performance across multiple languages. Most commonly, results are averaged (across all languages and tasks) to produce a single score. But this kind of aggregation masks variation. A model might excel in French and Chinese, for instance, but flounder in Swahili or Vietnamese. Averaging those scores blurs the distinction and overstates the model’s consistency. The implication: decision-makers might deploy a model in markets where it’s simply not ready, unaware of the hidden weaknesses.
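
A toy example makes the problem obvious. The per-language scores below are invented for illustration, but the arithmetic is the point: a respectable-looking average can hide languages that are nowhere near deployment-ready.

```python
# Toy illustration of how averaging hides per-language weakness
# (scores are made up for demonstration).
scores = {
    "French":     0.88,
    "Chinese":    0.85,
    "Swahili":    0.52,
    "Vietnamese": 0.58,
}

average = sum(scores.values()) / len(scores)
print(f"Average score: {average:.2f}")  # ~0.71 looks respectable

# Per-language reporting surfaces the languages that are not ready.
for language, score in sorted(scores.items(), key=lambda kv: kv[1]):
    flag = "  <-- below threshold" if score < 0.70 else ""
    print(f"{language:<11} {score:.2f}{flag}")
```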

Then there’s the issue of statistical rigor. Surprisingly, many evaluations don’t apply even basic statistical significance testing to their results. This means that minor, even random fluctuations in model performance can be interpreted as meaningful differences between competing systems. The researchers demonstrated that without proper significance testing, companies might favor one model over another based on noise, not signal.
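
One standard remedy from the MT world is paired bootstrap resampling: resample the test set many times and count how often one system actually beats the other. The sketch below shows the general idea, with per-segment quality scores as placeholders; it is not the paper's specific test.

```python
# Minimal sketch of paired bootstrap resampling, a significance test
# widely used in MT evaluation (per-segment scores are placeholders).
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """Estimate how often system A beats system B under resampling.

    scores_a / scores_b: per-segment quality scores on the same test set.
    Returns the fraction of bootstrap samples where A's mean exceeds B's.
    """
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins += 1
    return wins / n_samples

# A win rate near 0.5 means the observed difference may just be noise.
```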

To evaluate whether their proposed translation-informed approach offered a more reliable alternative, the researchers applied MT benchmarks (like reference-based scoring and structured human assessments) to the same LLM outputs. These MT methods revealed patterns that were otherwise invisible under traditional LLM evaluation. For example, they could better distinguish when a model was producing fluent but inaccurate content (i.e., “hallucinations”) versus when it truly grasped the source material and translated it faithfully into the target language.

Perhaps more importantly, these translation-based evaluations made it easier to compare models side by side in a way that was fair, repeatable, and less prone to accidental bias. When human judges were brought in to evaluate output quality, they could do so using clearly defined rubrics borrowed from the MT world (something that’s often missing in the more subjective world of LLM evaluation).
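
One example of such a rubric from the MT world is MQM-style error annotation, where judges tag each error with a category and a severity, and a weighted penalty is tallied per segment. The categories and weights below are illustrative, not the specific scheme used in the study.

```python
# Illustrative MQM-style scoring: judges mark errors by category and severity,
# and a severity-weighted penalty is summed per segment. Categories and weights
# here are examples, not the paper's exact rubric.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

def mqm_penalty(annotations):
    """Sum severity-weighted penalties for one translated segment.

    annotations: list of (category, severity) tuples, e.g.
        [("accuracy/mistranslation", "major"), ("fluency/grammar", "minor")]
    """
    return sum(SEVERITY_WEIGHTS[severity] for _category, severity in annotations)

segment_errors = [("accuracy/mistranslation", "major"), ("fluency/grammar", "minor")]
print(mqm_penalty(segment_errors))  # 6: fluent-looking output can still score badly
```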

In essence, the research measured the success of its proposed solution by how much clarity, consistency, and transparency it brought to multilingual model assessments. By grounding evaluations in established translation frameworks, they didn’t just create a new way of grading AI; they also created a more dependable way of trusting it.

These experiments and evaluation processes underscore a fundamental business insight: If we want to scale AI globally, we need to measure it globally—using tools that have stood the test of time across languages and cultures.

So how do you know if a better evaluation framework is actually better? That question sat at the heart of the research effort. The team behind this study didn’t just argue for using MT-style evaluations; they also put that approach through its paces and scrutinized how much value it added.

The key indicator of success wasn’t simply whether the new method could produce more metrics, but whether it could surface insights that traditional LLM evaluations missed. For example, the translation-inspired evaluations could detect when models generated content that looked fluent but was semantically off-base (something standard metrics often fail to catch). This mattered because many multilingual evaluations today reward fluency over faithfulness. If a model says something confidently in German or Swahili, that doesn’t mean it’s saying the right thing.

Success, then, was redefined: it wasn’t about higher scores but about higher confidence in what those scores meant. The translation-based benchmarks provided more precise and explainable results, especially when identifying underperformance in specific languages or content categories. From a product or deployment perspective, this means decision-makers can stop relying on averaged scores and start understanding model capabilities in sharper detail (language by language, task by task).

That said, the researchers were clear-eyed about the limitations of their approach. MT evaluation frameworks, while robust, weren’t designed for the full spectrum of LLM use cases. An LLM used for reasoning, summarizing, or question answering goes far beyond simple translation tasks. So while these methods offer a strong starting point, they don’t cover everything.

There’s also the challenge of resource intensity. Human evaluations, especially ones involving bilingual experts, are expensive and time-consuming. Not every organization has the budget or infrastructure to run these types of assessments at scale. And the tools for integrating machine translation-style evaluations into LLM development pipelines are still maturing—making automation a hurdle for some.

The researchers point to future directions that could close these gaps. One idea is building community-wide, open multilingual benchmarks that combine translation-grounded metrics with more LLM-specific testing scenarios. These could include tasks like summarization or sentiment analysis, evaluated in a culturally and linguistically nuanced way. They also emphasize the need for reproducible evaluation setups—making it easier for any organization to test models consistently across different locales and use cases.

What’s at stake here isn’t just better benchmarking for AI developers; it’s also a ripple effect across industries that rely on these technologies. For global businesses, education platforms, customer support operations, or wealth management firms expanding into non-English-speaking markets, a more reliable multilingual evaluation system translates directly into less risk, more trust, and greater market readiness. Inaccurate or biased language performance isn’t just a technical flaw; it’s a strategic liability.

By anchoring their solution in decades of translation research, the authors of “Déjà Vu” don’t just improve how we measure LLMs; they remind us that innovation doesn’t always require inventing from scratch. Sometimes, it just means being smart about which existing tools we adapt and where we point them.

Their work lays the foundation for a new standard in multilingual AI assessment: one where success isn’t defined by lofty averages, but by how precisely and fairly we measure what matters across languages, markets, and users.

