On the Evaluation of Multilingual Multimodal Models.

“The Pitch Files” is a curated series of proposal excerpts, thoughts, and value offerings we’ve crafted for potential clients across various industries. These insights offer a glimpse into how we customise our strategies to address the unique challenges of each sector, along with quick drafts of recommendations for the most effective expansion approaches tailored to their needs—just a taste of the full strategic depth we offer.

Some claims are safe because they are vague. Others are safe because nobody checks. When AI labs say their model is “aligned across modalities and languages,” they often benefit from both kinds of safety.

The field of multilingual multimodal AI is, at present, full of claims that sound good and mean very little. A model “performs well” on a dataset that was never designed to measure what its builders claim it does. It is “aligned” in the sense that it gives non-horrifying answers on a few prompts in English and looks fine on a few others in Chinese or Arabic. And even when it is horrifying, the problem is assumed to be “fine-tunable.”

There’s a subtle problem here. But first, a concrete one.

The Reference Problem (but with Models)

When I say “the model answered safely,” what is “safely” referring to?

  • The factual correctness of the content?
  • The lack of hate speech?
  • The preservation of nuance between visual and textual modes?
  • The cultural appropriateness of the response in a given language?

Often, eval reports imply all of these. But the benchmark being cited only captures one. Or none.

This is a reference problem — like “she” in a sentence with two women. But at the level of an entire research claim. The ambiguity is not just annoying, it’s dangerous. Because if a word can mean many things, someone will choose the meaning most convenient to their belief (or product roadmap).

Alignment Is a Specific Word

If alignment is to mean anything in this space, it must mean something like:

The model’s outputs across all relevant modalities and languages do not cause harm, misunderstanding, or misalignment with intended user goals, as judged within the cultural, contextual, and ethical frame in which the task occurs.

This is not easily benchmarked. That doesn’t make it irrelevant; it makes it all the more important.

A model giving a correct medical answer in English, while suggesting turmeric cures cancer in Hindi, is not “aligned” in an averaged sense. It is misaligned in a distributed sense: misaligned with reality somewhere, in a way that matters.

Benchmark Inflation

There is a pattern we have started noticing. AI companies treat evaluation the way corporations treat ESG: as an obligation to be fulfilled, not a signal to be updated on.

You run a model on a benchmark. You get a number. If the number is high enough, you include it in a footnote. If the number is low, you either:

  • a) Don’t report it
  • b) Argue the benchmark is flawed
  • c) Claim future versions will fix it

This is not reasoning. This is PR.

And yet: we all tend to do it. Especially when we’re tired, or under pressure, or excited about what we’ve built.

Chesterton’s Benchmarks

Benchmarks are not sacred. But neither are they optional. They are like Chesterton’s fences: before you tear one down (or dismiss one as “not really relevant for multilingual multimodal tasks”), you should be able to say why it was there.

Some benchmarks are genuinely flawed. Others are only called flawed because they’re honest: they reveal performance gaps we’d rather not see. These are the ones we should keep.

Ideas That Might Help

Some desiderata, if we are to take alignment evaluation seriously in this space:

  • Intent-sensitive evaluation: Test not just whether a model gives a safe answer, but whether it correctly understood the goal in context.
  • Cross-cultural adversarial prompting: Use localized prompts that carry specific cultural risks. “What should I wear to a funeral?” is not a culturally neutral question.
  • Multimodal-translation stress tests: See what happens when text and image carry conflicting signals, especially across languages with different idioms, norms, or taboos.
  • Error taxonomies: Not just whether the model is wrong, but how: is it subtly biased? Hallucinating a cultural belief? Misinterpreting tone? (A minimal sketch follows this list.)
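
The last of these is the easiest to make concrete in code. Below is a minimal sketch of how labeled failures might be tallied by type rather than collapsed into a single accuracy number; the category names and the EvalRecord fields are illustrative choices of ours, not an established taxonomy.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Illustrative error categories -- these names are placeholders, not a standard taxonomy.
ERROR_TYPES = {
    "subtle_bias",
    "hallucinated_cultural_belief",
    "tone_misread",
    "modality_conflict",
    "unsafe_advice",
}

@dataclass
class EvalRecord:
    prompt_id: str
    language: str
    modality: str                     # e.g. "text" or "image+text"
    correct: bool
    error_type: Optional[str] = None  # one of ERROR_TYPES when correct is False

def error_profile(records):
    """Tally how the model fails, not just how often."""
    return Counter(r.error_type for r in records if not r.correct and r.error_type)

# Toy usage with made-up records.
records = [
    EvalRecord("q1", "hi", "text", correct=False, error_type="unsafe_advice"),
    EvalRecord("q2", "ja", "image+text", correct=False, error_type="tone_misread"),
    EvalRecord("q3", "en", "text", correct=True),
]
print(error_profile(records))  # Counter({'unsafe_advice': 1, 'tone_misread': 1})
```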

How We Benchmark Multilingual Multimodal Models at ORB

We have designed and maintain a rigorous benchmark framework to evaluate the alignment, safety, and real-world applicability of multilingual, multimodal AI systems. Our goal is not just to measure what the model can do, but to uncover where it fails, why, and how to fix it.

1. Alignment Objectives First

We begin by defining what it means for a model to be aligned across multilingual and multimodal contexts:

  • Ethical consistency across languages and cultures
  • Norm-sensitive behavior in culturally specific settings
  • Accurate cross-modal reasoning (e.g., matching text to images)
  • Preservation of intent and tone across translation

We treat alignment as a behavioral constraint, not just a capability metric.
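
One way to make that distinction concrete: under a capability reading, scores get averaged; under a constraint reading, every objective must hold in every language. A minimal sketch, with invented per-language scores and an invented threshold:

```python
# Hypothetical scores (0.0-1.0) on the four objectives above; all numbers are illustrative.
scores = {
    "en": {"ethical_consistency": 0.97, "norm_sensitivity": 0.95,
           "cross_modal_reasoning": 0.94, "intent_preservation": 0.96},
    "hi": {"ethical_consistency": 0.71, "norm_sensitivity": 0.90,
           "cross_modal_reasoning": 0.92, "intent_preservation": 0.88},
}
THRESHOLD = 0.85  # illustrative bar; in practice set per objective and risk level

def aligned_as_average(scores, threshold):
    """Capability view: a high average can hide a serious failure somewhere."""
    values = [v for per_lang in scores.values() for v in per_lang.values()]
    return sum(values) / len(values) >= threshold

def aligned_as_constraint(scores, threshold):
    """Constraint view: every objective must clear the bar in every language."""
    return all(v >= threshold for per_lang in scores.values() for v in per_lang.values())

print(aligned_as_average(scores, THRESHOLD))     # True  -- the Hindi gap is averaged away
print(aligned_as_constraint(scores, THRESHOLD))  # False -- 0.71 on ethical_consistency fails
```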

2. A Diverse and Targeted Task Suite

We build tasks that simulate the real-world environments our models will face. These include:

  • Translation and Transcreation: Evaluates fidelity, tone, register, and the preservation of cultural nuance.
  • Image and Text Understanding: Tests visual grounding, cultural symbolism, and ambiguity handling.
  • Norm Sensitivity: Probes culture-specific etiquette, legal norms, and taboo recognition.
  • Ambiguity and Reference Resolution: Checks, for example, pronoun clarity in Turkish and sarcasm interpretation in Japanese.
  • Religious and Cultural Safety: Evaluates risk detection in sensitive contexts, such as religious symbols or historic trauma.

Each task probes high-risk areas where misalignment tends to emerge.
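
To make “targeted” concrete, here is a rough sketch of how such a suite might be indexed by the risk areas each task is meant to surface. Every task name, language code, and risk label below is a placeholder of ours, not the actual suite.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskSpec:
    """Illustrative task descriptor; the fields are our own, not a fixed schema."""
    name: str
    modalities: tuple
    languages: tuple
    risk_areas: tuple  # the misalignment modes the task is designed to surface

TASK_SUITE = [
    TaskSpec("transcreation_fidelity", ("text",), ("en", "hi", "ar"),
             ("tone_loss", "register_shift", "cultural_nuance")),
    TaskSpec("visual_symbolism", ("image", "text"), ("en", "ja"),
             ("cultural_symbol_misread", "visual_grounding")),
    TaskSpec("pronoun_resolution_tr", ("text",), ("tr",),
             ("referential_ambiguity",)),
    TaskSpec("religious_context_safety", ("image", "text"), ("ar", "hi", "en"),
             ("sacred_symbol_misuse", "historic_trauma")),
]

def tasks_probing(risk_area):
    """Return the slice of the suite aimed at a given high-risk area."""
    return [t.name for t in TASK_SUITE if risk_area in t.risk_areas]

print(tasks_probing("cultural_symbol_misread"))  # ['visual_symbolism']
```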

3. Evaluation Dimensions

Each task is evaluated across multiple axes:

  • Accuracy: Is the answer or generation correct for the context?
  • Safety: Is the response norm-compliant and non-harmful?
  • Fidelity: Does output reflect the original intent (especially in translation)?
  • Grounding: Is the output coherent with, and properly grounded in, the multimodal input?
  • Calibration: Is the model’s confidence appropriate to its performance? (A rough check is sketched after this list.)
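
Most of these axes are judged item by item, but calibration is a property of the whole score distribution. One common way to check it is a binned expected calibration error; the sketch below uses made-up confidences and outcomes and a deliberately small bin count.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Binned ECE sketch: |accuracy - mean confidence| per bin, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into the top bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - avg_conf)
    return ece

# Toy data: the model is most confident exactly where it is most often wrong.
confidences = [0.95, 0.92, 0.90, 0.60, 0.55]
correct     = [False, True, False, True, True]
print(round(expected_calibration_error(confidences, correct), 3))  # large gap -> overconfident
```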

4. Hybrid Evaluation: Human and Automated

We use a combination of:

  • Automated metrics (BLEU, CIDEr, Exact Match, etc.)
  • Human ratings from native speakers and cultural reviewers
  • Error labeling for traceable failure types (e.g., tone loss, visual misalignment)

All human evaluations include inter-annotator agreement checks and anonymized prompts.
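
As one small example of how the automated and human halves might sit side by side in a harness (assuming the sacrebleu and scikit-learn packages are available, with toy sentences and safety labels standing in for real outputs and annotations):

```python
import sacrebleu
from sklearn.metrics import cohen_kappa_score

# Automated side: corpus-level BLEU against a single reference stream (toy data).
hypotheses = ["The ceremony begins at dawn.", "He removed his shoes at the door."]
references = [["The ceremony starts at dawn.", "He took off his shoes at the door."]]
print(f"BLEU: {sacrebleu.corpus_bleu(hypotheses, references).score:.1f}")

# Human side: agreement between two native-speaker raters on a binary safety label.
rater_a = ["safe", "unsafe", "safe", "safe",   "unsafe"]
rater_b = ["safe", "unsafe", "safe", "unsafe", "unsafe"]
print(f"Cohen's kappa: {cohen_kappa_score(rater_a, rater_b):.2f}")
```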

5. Iterative and Transparent Benchmarking

We version and document our benchmarks like software. Every release includes:

  • Changelog of task updates or removals
  • Breakdown of model performance by language, modality, and task
  • Representative examples of both success and failure
  • Public release of methodology and annotation instructions

We aim to make benchmark failures actionable, diagnosable, and auditable.
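
For concreteness, here is a sketch of what one versioned release might serialize to. The field names, version string, and numbers are all placeholders rather than our actual release format.

```python
import json

# Illustrative release manifest -- every field name and value is a placeholder.
release = {
    "benchmark_version": "X.Y.Z",
    "changelog": [
        "added: Turkish pronoun-resolution tasks",
        "removed: one transcreation item flagged as culturally outdated",
    ],
    "results_by_slice": {  # per language x task, rather than one headline number
        "hi": {"transcreation_fidelity": 0.81, "religious_context_safety": 0.74},
        "ja": {"visual_symbolism": 0.88},
    },
    "examples": {"successes": "examples/success/", "failures": "examples/failure/"},
    "public_docs": ["methodology.md", "annotation_instructions.md"],
}

print(json.dumps(release, indent=2))
```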

Why This Matters

Most evaluations in the industry focus on what models get right. Ours focus on what they get wrong, and why. That is how we build safer, more culturally aware, and more trustworthy AI.

Models don’t align themselves. Benchmarks don’t speak for themselves. And we don’t get alignment for free just because we didn’t notice the problem in another language.

Quentin Lucantis @orb