Why AI Still Doesn't Get Idioms.

“The Pitch Files” is a curated series of proposal excerpts, thoughts, and value offerings we’ve crafted for potential clients across various industries. These insights offer a glimpse into how we customise our strategies to address the unique challenges of each sector, along with quick drafts of recommendations for the most effective expansion approaches tailored to their needs—just a taste of the full strategic depth we offer.

A few weeks ago, Appen released a study showing that large language models struggle to translate idioms and culturally loaded marketing content.

The setup was simple: they gave three LLMs a set of light, playful marketing emails in English (things like “Will you brie mine?”) and asked them to translate the emails into twenty languages.

None of the results were good enough to publish.
Not one.

Grammatically correct? Sure.
But stripped of tone, humour, and charm: all the little cultural signals that make marketing work.
The AI understood the words, but not the world around them.

And that got us thinking: why does this keep happening?

The illusion of understanding

If you look at it mechanically, LLMs are brilliant statistical mirrors.
They can finish your sentences, mimic your style, and even produce passable jokes, as long as those jokes have been told before.

But idioms are different.
Idioms are tiny cultural programs. They only make sense inside a shared mental environment.
When I say, “kick the bucket,” I’m not talking about buckets or feet; I’m relying on your prior knowledge of an entire linguistic culture where that phrase means “to die.”

A model trained on internet-scale data may have seen the phrase a million times, but it doesn’t participate in the culture that gave rise to it.
It knows the pattern, but not the point.

In other words, it’s missing the background radiation of meaning that humans accumulate just by existing together.

Culture as a hidden variable

We talk a lot about “multilingual AI,” but language isn’t the hard part.
Culture is.

Language is what you say.
Culture is what you assume while saying it.

If two people share enough assumptions, communication feels smooth.
If they don’t, it breaks, with a smile that means something different to each side.

That’s what happens in machine translation. The model maps words across languages while ignoring the invisible priors underneath.

The Appen results and their implications

Appen’s findings had an interesting twist: high-resource languages didn’t necessarily perform better than low-resource ones.
Japanese did well; Mandarin didn’t.
Linguistic proximity to English wasn’t predictive.

That’s weird if you think this is a data problem, but obvious if you think it’s a culture problem.

The missing variable isn’t corpus size or tokenisation efficiency.
It’s the degree to which a model’s learned world overlaps with the lived world of its target speakers.

Models can replicate syntax. But they can’t yet simulate a society.

Our work at Orb

This realisation changed how we think about L10N at Orb.
We stopped treating translation errors as linguistic failures and started treating them as cultural inference failures.

When an idiom collapses in translation, it’s not because the model forgot a dictionary entry but rather because it doesn’t know what that idiom is for.

So instead of trying to “fix translation,” we’re trying to teach models to reason about culture.

Here’s how we’re approaching it:

A database of functions

We’re building a dataset of idioms and expressions tagged by function: what they do socially, not what they say literally.
For example:

“Break the ice” → “Start a conversation in a tense or formal situation.”

Then, for each culture, we find its local analogue: “détendre l’atmosphère” (“relax the atmosphere”) in French, or “打破僵局” (“break the deadlock”) in Chinese.
This lets the model reason in terms of purpose, not form.
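As a rough illustration of the idea (the schema and field names below are hypothetical, not our production format), a function-tagged record might look something like this:

```python
# A minimal sketch of a function-tagged idiom record (hypothetical schema).
# Each entry stores what the expression *does* socially, plus culture-specific
# analogues that serve the same purpose.

IDIOM_FUNCTIONS = [
    {
        "expression": "break the ice",
        "language": "en",
        "function": "start a conversation in a tense or formal situation",
        "register": "informal, friendly",
        "analogues": {
            "fr": "détendre l'atmosphère",  # "relax the atmosphere"
            "zh": "打破僵局",                # "break the deadlock"
        },
    },
]

def find_analogue(expression: str, target_lang: str) -> str | None:
    """Look up a culture-local analogue by social function rather than literal wording."""
    for record in IDIOM_FUNCTIONS:
        if record["expression"] == expression.lower():
            return record["analogues"].get(target_lang)
    return None
```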

Prompting for cultural reasoning

Instead of saying, “Translate this,” we ask:

“Explain what this expression means in context. Then rewrite it so it feels natural to a native reader.”

This forces the model into a small act of introspection.
It has to unpack intention before rewriting it.
The act of thinking about meaning changes the quality of the output more than you’d expect.
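For the curious, here is a minimal sketch of that two-step prompt, assuming a generic chat-completion setup; `call_llm` is a placeholder, not a specific vendor API:

```python
# A minimal sketch of the "explain, then rewrite" prompt.
# `call_llm` stands in for whatever chat-completion client you use.

def build_cultural_prompt(source_text: str, target_language: str) -> str:
    return (
        f"Here is a marketing sentence in English:\n\n{source_text}\n\n"
        "Step 1: Explain what this expression means in context, including its "
        "tone and the social effect it is meant to have on the reader.\n"
        f"Step 2: Rewrite it in {target_language} so it feels natural to a "
        "native reader, preserving the tone and intent rather than the literal wording."
    )

# Example usage:
prompt = build_cultural_prompt("Will you brie mine?", "French")
# response = call_llm(prompt)  # placeholder: substitute your own LLM client
```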

Human editors as teachers, not janitors

Most post-editing workflows treat humans as cleanup crews for AI mistakes.
We do the opposite: we treat editors as instructors.
When they correct tone or replace a clumsy idiom, they annotate why.

Those notes feed back into the model as training data.
Over time, it starts to internalise patterns of cultural substitution.
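A hypothetical shape for one of those editor notes (the French lines here are invented purely for illustration) might be:

```python
# An illustrative annotation record: the correction carries a *reason*,
# so it can later become training or evaluation data rather than being discarded.

from dataclasses import dataclass

@dataclass
class EditorAnnotation:
    source_text: str       # original English idiom or phrase
    model_output: str      # what the model produced
    editor_rewrite: str    # what the editor replaced it with
    reason: str            # why: tone, register, cultural mismatch, etc.
    target_language: str

note = EditorAnnotation(
    source_text="Will you brie mine?",
    model_output="Veux-tu être mon brie ?",              # literal, loses the pun
    editor_rewrite="Tu me fais fondre comme un brie.",    # keeps the cheese joke
    reason="Literal translation dropped the Valentine's pun; replaced with a melting-cheese play on words.",
    target_language="fr",
)
```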

Measuring cultural resonance

We’re experimenting with a metric that tries to quantify how “native” a text feels: a mix of human judgment, stylistic signals, and emotional tone matching.
Not perfect, but it gives visibility to what was previously invisible.
You can’t optimise what you can’t see.
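To make that concrete, here is a minimal sketch of such a blended score; the components and weights are assumptions for illustration, not the metric we actually run:

```python
# A sketch of a "cultural resonance" score: a weighted blend of human judgment
# and automatic signals into one visible number (weights are illustrative).

def cultural_resonance(
    human_nativeness: float,  # editor rating, 0-1: "does this read as written by a native?"
    style_match: float,       # stylistic similarity to in-market reference copy, 0-1
    tone_match: float,        # emotional-tone agreement with the source, 0-1
    weights: tuple[float, float, float] = (0.5, 0.3, 0.2),
) -> float:
    """Higher means the text 'feels more native' under this toy weighting."""
    w_h, w_s, w_t = weights
    return w_h * human_nativeness + w_s * style_match + w_t * tone_match

# Example: strong human rating, decent style match, weaker tone match
score = cultural_resonance(0.9, 0.7, 0.5)  # -> 0.76
```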

What this might mean

When people talk about “AI alignment,” they usually mean moral alignment: ensuring AI follows human values.
But there’s also cultural alignment: ensuring AI speaks within human frames of reference.

And maybe those two problems aren’t so different.
In both cases, the model fails when it can’t infer the implicit.

You can give an AI all the words in the world, but if it doesn’t share your priors, it’ll miss your meaning.
That’s not a bug. It’s a reflection of what language is: a social interface for shared assumptions.

Until AI can simulate that (or at least reason about it), culture will remain the boundary of its understanding.

And maybe that’s a good thing.
It reminds us that what makes language alive isn’t structure or data, but mutual context: the invisible web we live inside every time we speak.

Quentin Lucantis @orb