“Through the Binary Lens” brings together articles where code meets culture. By weaving technical and human perspectives, this series uncovers how software, language, and global expansion influence one another in ways both seen and unseen.
Let’s begin with a simple falsehood: “Software is just logic.”
Software is actually logic for someone. And that someone may read right-to-left, might input text from a soft keyboard using phonemes you’ve never heard of, or might prefer their commas as decimal separators. To ship software globally is to court ambiguity (unless you’ve made peace with Unicode and its priestly companion, internationalisation). The stakes are real, as your “world-ready” product might collapse under a single Unicode character.
The Unicode Cataclysm, or Why ‘Strings’ Are Not Strings
In ASCII’s Eden, strings were simple. Each character was one byte.
Then the flood came: umlauts, kana, emojis, Zalgo. We were expelled from Eden and thrown into the wilderness of variable-width encoding.
Here’s the core lesson: Unicode is not UTF-8.
Unicode is a standard that assigns a number (a code point) to every character. UTF-8 is one way to serialise those code points as bytes. UTF-16 is another. If you treat a Unicode string like it’s a list of fixed-width characters, you’ll get hurt.
// This string contains a single character: 𐍈 (Gothic letter hwair)
const gothic = '𐍈';
console.log(gothic.length); // Output: 2, because UTF-16 stores it as a surrogate pair
There is a cruel irony in this. The user sees one glyph.
The engine sees two 16-bit code units, and your .length call lies to you.
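The three views of that one letter (code point, UTF-16 code units, UTF-8 bytes) can be put side by side. This is only an illustrative sketch: the variable names are invented for the example, and it relies on the standard TextEncoder, which always produces UTF-8.

```javascript
// One code point, three different "lengths" depending on the view.
const hwair = '\u{10348}'; // 𐍈 GOTHIC LETTER HWAIR

const codePoints = [...hwair].length;                     // the string iterator walks code points
const utf16Units = hwair.length;                          // .length counts UTF-16 code units
const utf8Bytes = new TextEncoder().encode(hwair).length; // TextEncoder always emits UTF-8

console.log({ codePoints, utf16Units, utf8Bytes });
// { codePoints: 1, utf16Units: 2, utf8Bytes: 4 }
```

None of these three numbers is "the" length. Which one matters depends on whether you are indexing, storing, or transmitting the text.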
To count user-perceived characters (grapheme clusters), you need a proper segmenter:
// Using Intl.Segmenter (modern JavaScript)
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const segments = [...segmenter.segment('𐍈')];
console.log(segments.length); // Output: 1, which is what the user expects
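A single supplementary code point is the easy case, because counting code points would also give 1. Graphemes built from several code points are where that shortcut fails too. A minimal sketch, assuming a runtime that ships Intl.Segmenter; countGraphemes is a hypothetical helper, not part of any API:

```javascript
const graphemeSegmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

// Hypothetical helper: number of user-perceived characters in a string.
const countGraphemes = (text) => [...graphemeSegmenter.segment(text)].length;

const decomposed = 'e\u0301'; // "é" written as e + COMBINING ACUTE ACCENT
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}'; // man, ZWJ, woman, ZWJ, girl

// UTF-16 code units, code points, graphemes
console.log(decomposed.length, [...decomposed].length, countGraphemes(decomposed)); // 2 2 1
console.log(family.length, [...family].length, countGraphemes(family));             // 8 5 1
```

This is the number to use for things like "characters remaining" counters or truncating a display name.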
So begins the long story of globalisation: realising that your data structures lie to you.
The I18N Directive: Separate Code from Culture
Internationalisation (i18n, in the cryptic shorthand) is the process of enabling software to adapt to different languages and regions without engineering changes.
You’re not translating yet. You’re preparing the architecture so that translation won’t break it.
The most well-worn but essential advice: never hard-code strings.
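// Don't do this:
System.out.println("Hello, world!");
// Do this (needs java.util.Locale and java.util.ResourceBundle, plus messages*.properties
// files on the classpath that each define a greeting.hello key):
Locale locale = Locale.forLanguageTag("fr-FR"); // example value; normally taken from the user's settings
ResourceBundle bundle = ResourceBundle.getBundle("messages", locale);
System.out.println(bundle.getString("greeting.hello"));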
This is a fence built around the pit you will fall into otherwise. Ask anyone who has had to retrofit i18n into a monolith written with English assumptions.
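The same idea, reduced to its moving parts, is a lookup by key with a fallback chain and placeholders instead of string concatenation. The sketch below is hypothetical throughout: the catalogs object, the keys, and the translate and fallbackChain helpers are invented for illustration, and a real project would more likely use an established i18n library.

```javascript
// Hypothetical in-memory catalogs; real projects usually load these from files.
const catalogs = {
  en: { 'greeting.hello': 'Hello, {name}!', 'farewell.bye': 'Goodbye!' },
  fr: { 'greeting.hello': 'Bonjour, {name} !' },
};

// 'fr-CA' -> ['fr-CA', 'fr', 'en']
function fallbackChain(locale, defaultLocale = 'en') {
  const parts = locale.split('-');
  const chain = [];
  while (parts.length > 0) {
    chain.push(parts.join('-'));
    parts.pop();
  }
  if (!chain.includes(defaultLocale)) chain.push(defaultLocale);
  return chain;
}

function translate(locale, key, params = {}) {
  for (const candidate of fallbackChain(locale)) {
    const template = catalogs[candidate]?.[key];
    if (template !== undefined) {
      // Fill {placeholders}; leave unknown ones visible rather than dropping them.
      return template.replace(/\{(\w+)\}/g, (match, name) =>
        Object.prototype.hasOwnProperty.call(params, name) ? String(params[name]) : match);
    }
  }
  return key; // a visible marker for a missing translation
}

console.log(translate('fr-CA', 'greeting.hello', { name: 'Zoé' })); // Bonjour, Zoé !
console.log(translate('fr-CA', 'farewell.bye'));                     // Goodbye!
console.log(translate('de', 'missing.key'));                         // missing.key
```

Placeholders matter as much as the lookup: word order differs between languages, so translators need to be able to move {name} around inside the sentence.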
Time, Place, and Locale: The Dignity of Format
Users do not want to be “internationalised.”
They want to feel native. Dates, currencies, numbers: these are political. And personal.
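# Python: localised number formatting
import locale

# Assumes the fr_FR.UTF-8 locale is installed; setlocale raises locale.Error otherwise.
locale.setlocale(locale.LC_ALL, 'fr_FR.UTF-8')
print(locale.format_string('%.2f', 12345.678, grouping=True))
# Output: 12 345,68 (note the space and the comma; the space is usually a no-break space)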
This is not just a matter of aesthetics: formats that look familiar build user trust.
If your fintech dashboard shows a balance of 12,345.67 to a French user, they may read the comma as the decimal separator and think they’re bankrupt.
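The same balance can be rendered per locale without touching the number itself. A minimal sketch using the standard Intl.NumberFormat; formatBalance is a hypothetical wrapper, and the locale tags are just examples:

```javascript
// Hypothetical wrapper: always two decimals, separators chosen by the locale.
function formatBalance(value, localeTag) {
  return new Intl.NumberFormat(localeTag, {
    minimumFractionDigits: 2,
    maximumFractionDigits: 2,
  }).format(value);
}

const balance = 12345.678;
for (const tag of ['en-US', 'de-DE', 'fr-FR']) {
  console.log(tag, formatBalance(balance, tag));
}
// en-US 12,345.68
// de-DE 12.345,68
// fr-FR 12 345,68   (the group separator is a no-break space variant, not U+0020)
```

Because the French group separator is not an ordinary space, comparing formatted output against hand-typed strings tends to fail. Store and compare numbers, and format only at the edge.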
Charsets in the Database: Silent Saboteurs
Text encoding bugs are particularly vicious because they often fail silently.
Your API works fine until someone’s name includes a character outside Latin-1. Then:
-- MySQL's default charset used to be latin1, so declare utf8mb4 explicitly
CREATE TABLE users (
  id INT PRIMARY KEY,
  name VARCHAR(255) -- under latin1, the UTF-8 bytes of 'José' can silently come back as 'JosÃ©'
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
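The classic corruption of a name like José can be reproduced without a database: encode the text as UTF-8, then read each byte back as if it were one Latin-1 character. The sketch below only illustrates that mechanism and makes no claim about where in a particular stack the mix-up happens.

```javascript
const original = 'Jos\u00E9'; // José

// Write as UTF-8: the single character é becomes the two bytes 0xC3 0xA9.
const bytes = new TextEncoder().encode(original);

// Read as Latin-1: every byte is taken to be one character.
const garbled = String.fromCharCode(...bytes);
console.log(garbled); // JosÃ©

// If nothing was lost on the way, the damage can be undone by reversing the steps.
const recoveredBytes = Uint8Array.from(garbled, (ch) => ch.charCodeAt(0));
const repaired = new TextDecoder('utf-8').decode(recoveredBytes);
console.log(repaired === original); // true
```

Nothing in this round trip throws, which is exactly why the bug is silent: every step is valid under the encoding it assumes.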
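The correct default charset in 2025 is not utf8. In MySQL, utf8 is an alias for utf8mb3, which stores at most three bytes per character and therefore cannot hold anything outside the Basic Multilingual Plane.
Use utf8mb4 to support the full Unicode range, including emoji, ancient scripts, and the composite sequences built from them.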
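To see which characters a three-byte charset would reject, it is enough to look for code points above U+FFFF, since those are the ones that need four bytes in UTF-8. A small sketch; supplementaryCharacters is a hypothetical helper name:

```javascript
// Hypothetical helper: characters that need four bytes in UTF-8.
function supplementaryCharacters(text) {
  return [...text].filter((ch) => ch.codePointAt(0) > 0xFFFF);
}

const sample = 'Jos\u00E9 \u{10348} \u{1F600}'; // José, Gothic hwair, grinning face

console.log(supplementaryCharacters(sample));            // [ '𐍈', '😀' ]
console.log(supplementaryCharacters('Jos\u00E9'));       // []
console.log(new TextEncoder().encode('\u{1F600}').length); // 4
```

A check like this is useful when auditing existing data before a migration, not as a substitute for fixing the column charset.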
Rendering and the Text That Breaks Boxes
Your string doesn’t end where you think it does. It may contain zero-width joiners, combining marks, or right-to-left overrides.
<!-- Hebrew forced right-to-left with a bdo override inside an LTR paragraph -->
<p dir="ltr">English <bdo dir="rtl">עברית</bdo></p>
Without proper handling, bidirectional text renders like a cipher. Worse, it breaks layouts. It breaks assumptions. It breaks your mind a little. If your UI can be localised into Arabic and remain visually intact, it’s ready for the world.
If it breaks, good. Now you know what you missed.
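One defensive habit for user-supplied text is to strip explicit direction controls and then wrap the value in a directional isolate, so it cannot reorder the interface around it. The sketch below is an assumption-laden illustration: the helper names are invented, the script check covers only Hebrew and Arabic, and real bidi resolution follows the Unicode Bidirectional Algorithm. In HTML, the bdi element plays the role of the isolate characters.

```javascript
// Explicit embedding, override, and isolate controls: U+202A..U+202E and U+2066..U+2069.
const BIDI_CONTROLS = /[\u202A-\u202E\u2066-\u2069]/g;
const RTL_SCRIPTS = /[\p{Script=Hebrew}\p{Script=Arabic}]/u;

// Remove direction controls from untrusted input.
function stripBidiControls(text) {
  return text.replace(BIDI_CONTROLS, '');
}

// Rough check only: true if the text contains any Hebrew or Arabic script character.
function containsRtl(text) {
  return RTL_SCRIPTS.test(text);
}

// Wrap a value in FIRST STRONG ISOLATE (U+2068) ... POP DIRECTIONAL ISOLATE (U+2069).
function isolate(text) {
  return `\u2068${stripBidiControls(text)}\u2069`;
}

const userName = '\u05E2\u05D1\u05E8\u05D9\u05EA'; // עברית
const spoofed = 'invoice\u202Efdp.exe'; // the override makes the tail display reversed

console.log(containsRtl(userName));      // true
console.log(containsRtl('English'));     // false
console.log(stripBidiControls(spoofed)); // invoicefdp.exe
console.log(isolate(userName).length);   // 7
```

Stripping controls is a blunt choice that suits identifiers such as file names and usernames. For free-form prose, isolating without stripping may be the better trade-off.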
The Cost of Not Knowing
Here’s where this ties back to business strategy: global expansion is a revenue multiplier only if the product is genuinely usable outside its country of origin. Failing to internalise i18n and Unicode is like designing a bridge without knowing about tensile strength. It might work at first, until the first truck comes along. Internationalisation suppresses friction, and if you know anything about us by now, you know we believe it should be your number one priority for reducing costs.
Closing Thoughts
Unicode and i18n make your software humble. They are what let a system retreat just enough for the user to feel like the protagonist. And that is precisely what global users deserve.
Build like you’re being watched by a multilingual, right-to-left, low-bandwidth future.
Because you are.