
“From the Orb vault” is a series of past market research, presentations, blurbs, and other conceptual writings that we will be publishing regularly, in the hope that it helps shape views on the often-neglected topic of global expansion and localisation (L10N). Through these insights, we aim to shed light on the complexities and inefficiencies that many overlook in the rush to scale internationally.
The architecture of modern AI systems exhibits a fundamental asymmetry: English is the default, and every other language is relegated to a secondary consideration. This systemic bias is not merely a technical oversight; it has profound real-world implications.
Language transcends mere communication; it embodies culture, cognition, and social identity. The marginalisation of linguistic diversity in AI systems amounts to a form of epistemic injustice, where the knowledge and cultural nuances of non-English-speaking communities are undervalued or ignored. Research indicates that AI language technology currently caters to only about 3% of the world’s languages, namely the most widely spoken ones, leaving the vast majority underrepresented.
English as the Default: The AI Monoculture
Most modern AI models are trained primarily on English data. When other languages are included, they’re often treated as an afterthought: mere translations of English datasets, stripped of cultural nuance, processed with one-size-fits-all assumptions. It’s like designing a keyboard for English and then getting confused when it doesn’t work well for Chinese or Hebrew.
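To make this asymmetry concrete, here is a minimal sketch in Python using the Hugging Face transformers library. It counts how many tokens an English-first tokenizer spends on the same greeting in different languages; GPT-2’s tokenizer and the sample sentences are our own illustrative choices, not a claim about any particular production system.

```python
# A minimal sketch of "tokenizer fertility": how many tokens an
# English-centric tokenizer spends on the same sentence in different
# languages. GPT-2's tokenizer is used purely as an illustrative
# English-first example; requires the `transformers` package.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

samples = {
    "English": "Good morning, how are you?",
    "Hebrew": "בוקר טוב, מה שלומך?",
    "Chinese": "早上好，你好吗？",
}

for language, text in samples.items():
    n_tokens = len(tokenizer.encode(text))
    print(f"{language:8s} {len(text):2d} chars -> {n_tokens:2d} tokens")
```

On English-centric vocabularies, non-Latin scripts typically shatter into byte-level fragments, so speakers of those languages pay more tokens, and therefore more context window and cost, for exactly the same message.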
The consequences of this design choice are everywhere:
- Content moderation failures: AI often mislabels speech in non-English languages, leading to absurd bans or censorship. (Imagine getting flagged for hate speech because your language uses a word that just happens to sound like something offensive in English.)
- Misinformation blind spots: Since most AI models learn to detect misinformation in English first, they often fail to recognise harmful narratives in underrepresented languages, allowing disinformation to spread unchecked.
- Lost in translation: AI translation tools trained on high-resource languages produce hilariously (or dangerously) distorted results when handling low-resource languages. (You might remember the Arabic speaker who got arrested because his “good morning” post on Facebook was machine-translated as “attack them”.)
Collectively, these failures don’t just inconvenience people but reinforce digital inequalities, marginalising entire linguistic communities. If your language isn’t well-represented in AI, it’s not just your memes that suffer; your ability to participate in the digital world can be fundamentally constrained.
The Wrong Mental Model: Language as Math
At the heart of the problem is a flawed assumption: that language is just a sequence of interchangeable tokens that can be reshuffled and remapped without loss. This is the linguistic equivalent of thinking you can copy-paste cultural meaning the way you copy-paste code. Spoiler: You can’t.
Language isn’t just words: it’s history, context, and social meaning. AI models optimise for syntactic and statistical regularities, but meaning isn’t just about structure; it’s about how words are used, by whom, and in what context. When AI systems ignore this, they don’t just make minor translation errors; they systematically fail to understand non-English languages on their own terms.
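If that assumption held, a word-for-word remapping would preserve meaning. The toy sketch below, built on a deliberately tiny, made-up English-to-Spanish dictionary, shows why it doesn’t: every token maps “correctly”, and the idiom still dies in transit.

```python
# Toy illustration of the "language as interchangeable tokens" fallacy.
# The dictionary is deliberately tiny and made up; the point is that even
# a perfect word-level mapping cannot carry idiomatic meaning.
WORD_MAP = {
    "it's": "está",
    "raining": "lloviendo",
    "cats": "gatos",
    "and": "y",
    "dogs": "perros",
}

def remap(sentence: str) -> str:
    """Substitute each token independently, as if meaning lived per-word."""
    return " ".join(WORD_MAP.get(word, word) for word in sentence.lower().split())

print(remap("It's raining cats and dogs"))
# -> "está lloviendo gatos y perros": each word is individually "right",
#    but the idiomatic meaning (heavy rain) is gone; a Spanish speaker
#    would say "llueve a cántaros".
```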
This problem isn’t new. The dream of a universal, purely logical approach to language goes back to early AI’s rationalist ambitions. We can think of it as the Cartesian fantasy of intelligence as disembodied computation. But just as intelligence isn’t just about logic, language isn’t just about grammar. Pretending otherwise leads to AI systems that are technically multilingual but practically flawed.
Rethinking Multilingual AI: Beyond Cultural Monism
Fixing this problem isn’t about throwing more data at the model; it’s about rethinking how AI interacts with language. Instead of treating non-English languages as poorly supported plugins for an English-first operating system, we should be building models that recognise linguistic diversity as a fundamental feature, not a bug.
That’s why Orb is developing new ways to evaluate AI systems that go beyond the standard English-first benchmarks. The goal isn’t just to make AI “work” in other languages in the most superficial sense but to rethink what linguistic intelligence in AI should look like.
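We won’t reproduce those benchmarks here, but the core idea can be sketched in a few lines: report per-language scores and the gap between the best- and worst-served language, rather than a pooled average that an English-heavy test set will inevitably dominate. The numbers below are placeholders, not real results.

```python
# Sketch of an evaluation that surfaces cross-language disparity instead
# of hiding it in a pooled average. All scores are placeholder numbers.
from statistics import mean

per_language_accuracy = {  # hypothetical per-language results
    "en": 0.91,
    "es": 0.84,
    "ar": 0.66,
    "sw": 0.52,
}

pooled = mean(per_language_accuracy.values())
best_lang, best = max(per_language_accuracy.items(), key=lambda kv: kv[1])
worst_lang, worst = min(per_language_accuracy.items(), key=lambda kv: kv[1])

print(f"pooled accuracy: {pooled:.2f}")        # looks respectable
print(f"worst language:  {worst_lang} = {worst:.2f}")
print(f"disparity gap:   {best - worst:.2f}")  # the number that matters
```

Here the pooled number reads as a respectable 0.73 while the worst-served language sits at 0.52, which is precisely the kind of gap that English-first benchmarks never surface.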
Because ultimately, this isn’t just a technical problem; it’s an ideological choice. We can keep designing AI that perpetuates linguistic hierarchies, or we can build AI that actually respects language as the complex, culture-bound, gloriously messy thing that it is.