“The Pitch Files” is a curated series of proposal excerpts, thoughts, and value offerings we’ve crafted for potential clients across various industries. These insights offer a glimpse into how we customise our strategies to address the unique challenges of each sector, along with quick drafts of recommendations for the most effective expansion approaches tailored to their needs—just a taste of the full strategic depth we offer.
We like to think of large language models as blank slates that just “learn everything.”
In truth, if you want a multilingual model that truly understands culture instead of just translating words, you can’t bolt that on later; you have to build for it from the start. In some scenarios, companies prefer to build their own custom LLMs, for instance to control cultural fit, tone, and voice. This becomes especially compelling when the company has access to a large amount of specialized data, since state-of-the-art LLMs are trained mostly on general internet content.
Vision
Every great model starts with a question:
Whose voice do we want this model to speak in?
That question forces us to treat multilinguality as a core design principle.
Decide early: a universal jack-of-all-trades, or a polyglot specialist tuned for global business communication? That choice shapes languages, data, costs, and product fit.
# Vision (conceptual)
model_scope = {
    "intent": "multilingual understanding",
    "focus_languages": ["en", "fr", "es", "zh", "ar", "hi"],
    "audience": ["marketing", "localization", "support"],
}

def set_strategy(scope):
    if scope["intent"] == "multilingual understanding":
        return "Train diversity-first"
    return "Train dominance-first"

strategy = set_strategy(model_scope)
print(strategy)  # "Train diversity-first"
Foundations
Data is policy: choosing sources is both a moral and a strategic decision.
Mix public web text, curated human samples, and synthetic parallel data. Balance is the secret: enough breadth for coverage, enough curation to avoid amplifying harm.
Aim for a language mix that reflects your product’s markets, not the noisy majority of the web.
# Conceptual sketch: sample() and normalize() stand in for your own
# corpus-sampling and weight-normalization utilities.
data_sources = ["open_web", "licensed_corpora", "human_pairs", "synthetic_parallel"]

def build_mix(sources, priorities):
    mix = {}
    for s in sources:
        mix[s] = sample(s, weight=priorities.get(s, 1.0))
    return normalize(mix)

priorities = {"open_web": 1.0, "human_pairs": 2.0, "synthetic_parallel": 1.5}
training_mix = build_mix(data_sources, priorities)
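The mix above balances sources; languages need the same treatment. Here is a minimal, runnable sketch of the “match your markets, not the web” idea. The share numbers are purely illustrative assumptions, not measurements.

web_share = {"en": 0.60, "fr": 0.04, "es": 0.05, "zh": 0.03, "ar": 0.01, "hi": 0.01}
target_share = {"en": 0.35, "fr": 0.15, "es": 0.15, "zh": 0.15, "ar": 0.10, "hi": 0.10}

def language_weights(web, target):
    # Sampling multiplier per language so the training mix lands on the target shares.
    return {lang: target[lang] / web[lang] for lang in target}

for lang, w in language_weights(web_share, target_share).items():
    print(f"{lang}: sample at {w:.1f}x its natural web rate")

Languages the web under-represents (here, Arabic and Hindi) get sampled well above their natural rate.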
Feeding the Model
Tokenization is the chewing process. A tokenizer optimized for English will punish non-Latin scripts with longer token sequences, and those extra tokens show up directly as higher latency, higher compute cost, and a worse UX.
Design or tune tokenizers to minimize imbalance across scripts and to respect morpheme boundaries where it matters.
# Conceptual check: sample_text(), threshold() and adjust_tokenizer() stand in
# for your own corpus and tokenizer-tuning utilities.
for lang in ["en", "hi", "zh", "ar"]:
    text = sample_text(lang)
    tokens = tokenizer.encode(text)
    # Tokens per whitespace-separated word is a crude "fertility" proxy
    # (it breaks down for unsegmented scripts like Chinese).
    token_ratio = len(tokens) / len(text.split())
    if token_ratio > threshold(lang):
        adjust_tokenizer(lang, strategy="subword_mix")
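To see the imbalance on real text, the snippet below runs the same measurement with the Hugging Face transformers library, using the English-centric GPT-2 tokenizer purely as an illustration; the sample sentences and the tokens-per-character ratio are our own stand-ins for a proper fertility benchmark.

from transformers import AutoTokenizer

samples = {
    "en": "Hello, how are you today?",
    "hi": "नमस्ते, आज आप कैसे हैं?",
    "zh": "你好，你今天好吗？",
    "ar": "مرحبا، كيف حالك اليوم؟",
}

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # English-centric BPE, for contrast

for lang, text in samples.items():
    n_tokens = len(tokenizer.encode(text))
    # Tokens per character is script-neutral; higher means more compute per unit of text.
    print(lang, n_tokens, round(n_tokens / len(text), 2))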
Tuning for Culture
Language confusion (for instance, answering in the wrong language or peppering responses with unwanted code-switches) is a real UX failure. To fix it, we need to double down on context, intent detection, and human-shaped reinforcement.
Post-train with human feedback focused on language consistency, code-switching etiquette, and register. Teach the model when to switch rather than simply how.
# Conceptual loop: detect_language_mismatch() and fine_tune() stand in for
# your evaluation and post-training tooling.
collected_examples = []
for example in human_feedback_set:
    response = model.generate(example.prompt)
    if detect_language_mismatch(response, example.expected_lang):
        collected_examples.append(example)

fine_tune(model, examples=collected_examples)
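The detect_language_mismatch() helper is left abstract above. One lightweight way to implement it is with an off-the-shelf language-identification library; the sketch below assumes the langdetect package, but any LID model would slot in the same way.

from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # make detection deterministic across runs

def detect_language_mismatch(response, expected_lang):
    # True when the model answered in a different language than the prompt expected.
    try:
        return detect(response) != expected_lang
    except Exception:
        # Very short or heavily code-switched responses can defeat detection;
        # flag them for human review rather than silently passing.
        return True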
Scaling Across Hardware
Deployments vary: cloud, private infra, edge devices. Quantization (storing numbers with fewer bits) lets you run models cheaper and faster, but it’s a trade-off: quality can drop, and the drop is often worse for non-Latin scripts and complex tasks.
Measure impact with human evaluations in target languages, not only with automated metrics.
# Conceptual sweep: quantize() and evaluate_multilingual() stand in for your
# quantization toolchain and evaluation harness.
precisions = [16, 8, 4]
results = {}
for p in precisions:
    q_model = quantize(model, bits=p)
    results[p] = evaluate_multilingual(q_model, metrics=["human_score", "latency"])

report(results)
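For context, this is roughly what quantize(model, bits=4) can look like in practice with the Hugging Face transformers + bitsandbytes stack; a sketch under those assumptions (it needs a CUDA GPU, and the model name is a placeholder for your own checkpoint), not a prescription for your serving setup.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "your-org/your-multilingual-model"  # placeholder checkpoint name

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4 bits
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 to limit quality loss
)

q_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers across available devices
)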
The Human Finish
Post-training is where the model learns to be useful: instruction-following, safety, domain skills. A productive path is to create small “expert” models (code, translation, safety), iterate those with domain teams, then merge and polish.
Human ranking, dogfooding real tasks, and repeated merges produce a model that’s usable across cultures.
# Conceptual pipeline: train_expert(), merge_models(), rank() and reinforce()
# stand in for your post-training stack.
experts = {
    "code": train_expert(data="code_examples"),
    "translation": train_expert(data="bilingual_pairs"),
    "safety": train_expert(data="safety_judgments"),
}

merged = merge_models(list(experts.values()))

for feedback in user_tests:
    reward = rank(merged, feedback)
    reinforce(merged, reward)
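merge_models() hides the interesting part. One common approach is uniform weight averaging over experts that share an architecture (a “model soup”); the sketch below assumes PyTorch modules and shows that one merging strategy, not the only option.

import torch

def merge_models(models):
    # Average the parameters of same-architecture models into the first one.
    state_dicts = [m.state_dict() for m in models]
    merged_state = {}
    for name, ref in state_dicts[0].items():
        stacked = torch.stack([sd[name].float() for sd in state_dicts])
        # Cast back so integer buffers (e.g. counters) keep their original dtype.
        merged_state[name] = stacked.mean(dim=0).to(ref.dtype)
    base = models[0]
    base.load_state_dict(merged_state)
    return base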
The Loop Never Ends
Deploying a multilingual LLM is a continuous loop: collect real-world signals, fix regressions, refine edge languages, and repeat. Models must evolve with culture.
# Conceptual operations loop: collect_feedback(), curate_updates(), retrain()
# and redeploy() stand in for your MLOps pipeline.
while True:
    user_signals = collect_feedback()
    updates = curate_updates(user_signals)
    retrain(merged, updates)
    redeploy(merged)
    sleep(interval="weekly")  # pseudocode for running the cycle on a fixed cadence
Takeaway
A multilingual model is never “done.”
It’s a negotiation between culture, computation, and clarity. Build with language in mind from day one: design vision, curate balanced foundations, optimize tokenization, teach cultural etiquette, benchmark quantization with human raters, and polish with human-in-the-loop cycles.
That’s how you make models that work for global brands and for real people.