Our Engineer’s Note on Contrastive Alignment.

“Through the Binary Lens” brings together articles where code meets culture. By weaving technical and human perspectives, this series uncovers how software, language, and global expansion influence one another in ways both seen and unseen.

Multilingual models are not equally safe in all languages. A prompt that yields a safe answer in English can produce toxic output in a low-resource language. This is usually a problem of alignment rather than translation resources. Our approach is contrastive alignment: we embed prompts, responses, and safety labels into the same space, pulling semantically equivalent items together and pushing unrelated or adversarial items apart.

emb_src = embed_texts(["Quelle est la meilleure façon de voyager ?"])  # same question in another language
emb_pos = embed_texts(["What is the best way to travel?"])             # its English equivalent (the positive)
loss = info_nce_loss(emb_src, emb_pos)                                 # pull the matched pair together
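
For readers who want the loss spelled out, here is a minimal sketch of the InfoNCE objective assumed by info_nce_loss above; the temperature value and the use of in-batch negatives are illustrative choices, not settings from our training runs.

import torch
import torch.nn.functional as F

def info_nce_loss(emb_src, emb_pos, temperature=0.07):
    # Each source embedding should match its own positive; every other
    # positive in the batch acts as an in-batch negative.
    emb_src = F.normalize(emb_src, dim=-1)
    emb_pos = F.normalize(emb_pos, dim=-1)
    logits = emb_src @ emb_pos.T / temperature                      # (B, B) similarity matrix
    targets = torch.arange(emb_src.size(0), device=emb_src.device)  # positives sit on the diagonal
    return F.cross_entropy(logits, targets)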

Hard negatives are important: they prevent the model from learning shallow associations. Good hard negatives are paraphrases of the prompt that are unsafe or adversarial. On top of the shared embedding, we add a safety head that predicts a scalar safety score for each response and train it jointly with the contrastive loss.

safety_pred = safety_head(joint)[:, 0]           # scalar safety score per response
s_loss = F.mse_loss(safety_pred, safety_labels)  # regress toward the safety labels
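
Putting the pieces together, a joint training step could look like the sketch below. SafetyHead, the hard-negative handling, and the loss weight lambda_safety are illustrative assumptions, not a description of our production setup.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SafetyHead(nn.Module):
    # Small MLP mapping a joint prompt+response embedding to one scalar score.
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim // 2), nn.ReLU(), nn.Linear(dim // 2, 1))

    def forward(self, joint):
        return self.net(joint)

def joint_loss(emb_src, emb_pos, emb_hard_neg, joint, safety_labels, safety_head,
               temperature=0.07, lambda_safety=0.5):
    # Contrastive part: positives on the diagonal, hard negatives as extra columns.
    src = F.normalize(emb_src, dim=-1)
    cand = F.normalize(torch.cat([emb_pos, emb_hard_neg]), dim=-1)
    logits = src @ cand.T / temperature
    targets = torch.arange(src.size(0), device=src.device)
    c_loss = F.cross_entropy(logits, targets)
    # Safety part: regress a scalar score on the joint embedding.
    safety_pred = safety_head(joint)[:, 0]
    s_loss = F.mse_loss(safety_pred, safety_labels)
    return c_loss + lambda_safety * s_loss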

After embeddings are aligned, we scale to generation. We generate candidate responses in multiple languages, use an AI critic to rank them, and keep the preferred responses as chosen/rejected preference pairs. We then train a reward model on those pairs and align the policy to the reward model while keeping it close to the base model.

# pred holds (N, 2) reward scores for [chosen, rejected] response pairs
labels = torch.zeros(N, dtype=torch.long)  # index 0: the chosen response should win
loss = F.cross_entropy(pred, labels)       # pairwise ranking loss over the pairs
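
As a sketch of where pred comes from, a pairwise reward-model step could look like this; reward_model is an assumed callable returning one scalar per response, not an API defined in this note.

import torch
import torch.nn.functional as F

def reward_pair_loss(reward_model, chosen_texts, rejected_texts):
    # The AI-preferred (chosen) response should receive the higher reward.
    r_chosen = reward_model(chosen_texts)              # (N,) scalar rewards
    r_rejected = reward_model(rejected_texts)          # (N,) scalar rewards
    pred = torch.stack([r_chosen, r_rejected], dim=1)  # (N, 2) scores per preference pair
    labels = torch.zeros(pred.size(0), dtype=torch.long, device=pred.device)
    return F.cross_entropy(pred, labels)               # same as -log sigmoid(r_chosen - r_rejected)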

Evaluation is per-language: we measure added toxicity and refusal consistency, then check the calibration of the safety scores (a sketch of such an evaluation follows at the end of this note). Even with AI-generated preference data, we validate with human reviewers in low-resource languages.

The approach works without huge human datasets. Contrastive alignment makes representations consistent across languages, and reward modeling with AI feedback scales safe behavior. Low-resource languages improve, unsafe outputs drop, and refusal consistency rises. The result is safer multilingual generation with minimal human labeling.
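
To make the per-language evaluation concrete, here is an illustrative sketch; the record layout and the is_toxic and is_refusal classifiers are assumptions for the example, not our evaluation harness.

def evaluate_language(records, is_toxic, is_refusal):
    # records: dicts with 'response' (target language), 'english_response', and 'unsafe_prompt'.
    added_toxicity = sum(
        is_toxic(r["response"]) and not is_toxic(r["english_response"]) for r in records
    ) / len(records)
    unsafe = [r for r in records if r["unsafe_prompt"]]
    refusal_consistency = sum(
        is_refusal(r["response"]) == is_refusal(r["english_response"]) for r in unsafe
    ) / max(len(unsafe), 1)
    return {"added_toxicity": added_toxicity, "refusal_consistency": refusal_consistency}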

Quentin Lucantis @orb