
“Through the Binary Lens” brings together articles where code meets culture. By weaving technical and human perspectives, this series uncovers how software, language, and global expansion influence one another in ways both seen and unseen.
Let’s begin with a simple falsehood: “Software is just logic.”
Software is actually logic for someone. And that someone may read right-to-left, might input text from a soft keyboard using phonemes you’ve never heard of, or might prefer their commas as decimal separators. To ship software globally is to court ambiguity (unless you’ve made peace with Unicode and its priestly companion, internationalisation). The stakes are real: your “world-ready” product might collapse under a single Unicode character.
The Unicode Cataclysm, or Why ‘Strings’ Are Not Strings
In ASCII’s Eden, strings were simple. Each character was one byte.
Then the flood came: umlauts, kana, emojis, Zalgo. We were expelled from Eden and thrown into the wilderness of variable-width encoding.
Here’s the core lesson: Unicode is not UTF-8.
Unicode is an abstract standard: it assigns a number (a code point) to every character. UTF-8 is one way to serialise those code points into bytes. UTF-16 is another. If you treat a Unicode string like it’s a list of fixed-width characters, you’ll get hurt.
// This string contains a single character: 𐍈 (Gothic letter hwair)
const gothic = '𐍈';
console.log(gothic.length); // Output: 2 --- UTF-16 stores it as a surrogate pair
There is a cruel irony in this. The user sees one glyph. The engine sees two 16-bit code units, and your .length call lies to you.
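To make the gap between code points and their serialisations concrete, here is a minimal Python sketch. The helper name encoded_sizes is made up for illustration. It reports how many bytes the same text occupies under three encoding forms, next to the code point count that Python's len() returns.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the code point count and the byte length under three encodings."""
    return {
        "code_points": len(text),                 # Python's len() counts code points
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # the -le variant adds no byte order mark
        "utf-32": len(text.encode("utf-32-le")),
    }


# Escapes are used so the samples are unambiguous:
# U+0041 (A), U+00E9 (precomposed e-acute), U+10348 (Gothic letter hwair).
for sample in ("\u0041", "\u00e9", "\U00010348"):
    print(repr(sample), encoded_sizes(sample))
```

Note that the Gothic letter is one code point everywhere, yet four bytes in UTF-8 and two 16-bit units in UTF-16. The "length" of a string depends entirely on which layer you ask.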
To count user-perceived characters (grapheme clusters), you need a proper segmenter:
// Using Intl.Segmenter (modern JavaScript)
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const segments = [...segmenter.segment('𐍈')];
console.log(segments.length); // Output: 1 --- this is what the user expects
So begins the long story of globalisation: realising that your data structures lie to you.
The I18N Directive: Separate Code from Culture
Internationalisation (i18n, in the cryptic shorthand) is the process of enabling software to adapt to different languages and regions without engineering changes.
You’re not translating yet. You’re preparing the architecture so that translation won’t break it.
The most well-worn but essential advice: never hard-code strings.
// Don't do this:
System.out.println("Hello, world!");
// Do this:
ResourceBundle bundle = ResourceBundle.getBundle("messages", locale);
System.out.println(bundle.getString("greeting.hello"));
This is not a best practice. It’s a fence built around the pit you will fall into otherwise. Ask anyone who has had to retrofit i18n into a monolith written with English assumptions.
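The same idea can be sketched in Python. This is an illustration, not a recommendation of any particular library. The in-memory CATALOGUES table, the key names, and the helpers fallback_chain and translate are all hypothetical; a real project would load its catalogues from files. The point is the shape: code asks for a key, and the locale decides the text, falling back from region to language to a default.

```python
# Hypothetical in-memory catalogues; real projects load these from resource files.
CATALOGUES = {
    "en": {"greeting.hello": "Hello, world!"},
    "fr": {"greeting.hello": "Bonjour, le monde !"},
    "de": {"greeting.hello": "Hallo, Welt!"},
}
DEFAULT_LOCALE = "en"


def fallback_chain(locale: str) -> list[str]:
    """Build the lookup order, e.g. 'fr_CA' -> ['fr_CA', 'fr', 'en']."""
    chain = [locale]
    if "_" in locale:
        chain.append(locale.split("_", 1)[0])  # strip the region part
    if DEFAULT_LOCALE not in chain:
        chain.append(DEFAULT_LOCALE)
    return chain


def translate(key: str, locale: str) -> str:
    """Return the first message found along the fallback chain."""
    for candidate in fallback_chain(locale):
        message = CATALOGUES.get(candidate, {}).get(key)
        if message is not None:
            return message
    return key  # surface the missing key instead of crashing


print(translate("greeting.hello", "fr_CA"))  # falls back from fr_CA to fr
print(translate("greeting.hello", "ja_JP"))  # falls back to the default locale
```

Returning the bare key for a missing message is a design choice in this sketch: it keeps the UI alive and makes the gap easy to spot in testing.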
What about plurals? Gender? Contextual tone? These are not edge cases; they are the baseline for global UX.
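Plurals are a good place to see why. English has two forms; Polish has three for whole numbers. The sketch below hand-codes simplified rules for non-negative integers in just those two languages, under made-up names (plural_category, MESSAGES, files_message). In practice the rules would come from CLDR data through a library rather than from a hand-written function.

```python
def plural_category(locale: str, n: int) -> str:
    """Pick a plural category for a non-negative integer (simplified rules)."""
    if locale == "en":
        return "one" if n == 1 else "other"
    if locale == "pl":
        if n == 1:
            return "one"
        # 2-4, 22-24, 32-34, ... but not 12-14
        if n % 10 in (2, 3, 4) and n % 100 not in (12, 13, 14):
            return "few"
        return "many"
    raise ValueError(f"no plural rules for locale {locale!r}")


# Hypothetical message table keyed by locale, then by plural category.
MESSAGES = {
    "en": {"one": "{n} file", "other": "{n} files"},
    "pl": {"one": "{n} plik", "few": "{n} pliki", "many": "{n} plików"},
}


def files_message(locale: str, n: int) -> str:
    """Format the 'n files' message using the locale's plural form."""
    return MESSAGES[locale][plural_category(locale, n)].format(n=n)


for count in (1, 2, 5, 12, 22):
    print(files_message("en", count), "|", files_message("pl", count))
```

An `if n == 1` check scattered through application code cannot express the Polish column. A category lookup can.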
Time, Place, and Locale: The Dignity of Format
Users do not want to be “internationalised.”
They want to feel native. Dates, currencies, numbers: these are political. And personal.
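# Python: Localised number formatting
import locale
# Requires the fr_FR.UTF-8 locale to be installed; setlocale raises locale.Error otherwise
locale.setlocale(locale.LC_ALL, 'fr_FR.UTF-8')
print(locale.format_string('%.2f', 12345.678, grouping=True))
# Output: 12 345,68 --- note the space and the comma
# (the grouping space may be a no-break variant, depending on the platform's locale data)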
This is not just a matter of aesthetics. It’s about user trust.
If your fintech dashboard shows a balance of 12,345.67 to a French user, who reads the comma as the decimal separator, they may think they’re bankrupt.
Global expansion isn’t about turning your software into Esperanto.
It’s about code that steps aside so local conventions can speak.
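One way to picture "code that steps aside" is a formatter that knows nothing about any particular country and takes its separators from data. The sketch below is deliberately small and makes several assumptions. The SEPARATORS table and the format_amount function are hypothetical. The French grouping separator is assumed to be U+202F (narrow no-break space). A production system would pull these values from the platform's locale data or CLDR instead of hard-coding them.

```python
# Hypothetical separator table: locale code -> (grouping separator, decimal separator).
SEPARATORS = {
    "en_US": (",", "."),
    "de_DE": (".", ","),
    "fr_FR": ("\u202f", ","),  # assumption: narrow no-break space for grouping
}


def format_amount(value: float, locale_code: str) -> str:
    """Format a number with two decimals using the locale's separators."""
    group, decimal = SEPARATORS[locale_code]
    neutral = f"{value:,.2f}"  # locale-neutral form, e.g. '12,345.68'
    # Swap both separators in one pass so they cannot overwrite each other.
    return neutral.translate(str.maketrans({",": group, ".": decimal}))


for code in SEPARATORS:
    print(code, format_amount(12345.678, code))
```

Swapping both characters in a single translate call matters: two chained replace calls would turn the freshly written commas back into something else for locales like de_DE.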
Charsets in the Database: Silent Saboteurs
Text encoding bugs are particularly vicious because they often fail silently.
Your API works fine until someone’s name includes a character outside Latin-1. Then:
-- MySQL's default charset used to be latin1. With a mismatched charset,
-- the UTF-8 bytes of 'José' can silently come back as 'JosÃ©'.
-- Declaring utf8mb4 explicitly avoids that:
CREATE TABLE users (
    id INT PRIMARY KEY,
    name VARCHAR(255)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
The correct default charset in 2025 is not utf8. In MySQL, utf8 is an alias for utf8mb3, which stores at most three bytes per character. Use utf8mb4 to support the full Unicode range, including emoji, ancient scripts, and composite characters.
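Both failure modes can be reproduced without a database. The following Python sketch uses two made-up helpers: as_latin1_mojibake simulates reading UTF-8 bytes as Latin-1, and fits_in_three_bytes checks whether text would fit a three-byte-per-character column. It is a simulation of the byte-level behaviour, not of MySQL itself.

```python
def as_latin1_mojibake(text: str) -> str:
    """Decode UTF-8 bytes as if they were Latin-1, the classic misconfiguration."""
    return text.encode("utf-8").decode("latin-1")


def fits_in_three_bytes(text: str) -> bool:
    """True if every code point needs at most three bytes in UTF-8."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


name = "Jos\u00e9"      # 'José' with a precomposed e-acute
emoji = "\U0001F642"    # slightly smiling face, outside the Basic Multilingual Plane

print(as_latin1_mojibake(name))    # the two UTF-8 bytes of the accent become two characters
print(fits_in_three_bytes(name))   # True: fine in a three-byte charset
print(fits_in_three_bytes(emoji))  # False: needs four bytes
```

Nothing raises an exception in the first case. The data is simply wrong from that point on.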
Rendering and the Text That Breaks Boxes
Your string doesn’t end where you think it does. It may contain zero-width joiners, combining marks, or right-to-left overrides.
<!-- Hebrew forced right-to-left inside a left-to-right paragraph -->
<p dir="ltr">English <bdo dir="rtl">עברית</bdo></p>
Without proper handling, bidirectional text renders like a cipher. Worse, it breaks layouts. It breaks assumptions. It breaks your mind a little. If your UI can be localised into Arabic and remain visually intact, it’s ready for the world.
If it breaks, good. Now you know what you missed.
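The invisible characters mentioned above (zero-width joiners, combining marks, directional overrides) are easy to surface before they reach a layout. Here is a minimal Python sketch using the standard unicodedata module. The function name invisible_characters and the choice of categories to flag are assumptions made for illustration. Flagged characters are often perfectly legitimate, so this is a tool for inspection, not a filter.

```python
import unicodedata

# General categories worth a second look: format controls (Cf),
# non-spacing marks (Mn) and enclosing marks (Me).
SUSPECT_CATEGORIES = {"Cf", "Mn", "Me"}


def invisible_characters(text: str) -> list[tuple[int, str]]:
    """Return (index, Unicode name) for characters that alter rendering invisibly."""
    return [
        (index, unicodedata.name(ch, f"U+{ord(ch):04X}"))
        for index, ch in enumerate(text)
        if unicodedata.category(ch) in SUSPECT_CATEGORIES
    ]


# U+202E is the right-to-left override; U+0301 is a combining acute accent.
print(invisible_characters("abc\u202edef"))
print(invisible_characters("e\u0301"))
```

Running a check like this over user-supplied names or filenames in a test suite shows how often "plain" text contains more than meets the eye.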
The Cost of Not Knowing
Here’s where this ties back to business strategy: Global expansion is a revenue multiplier only if the product is genuinely usable outside its country of origin. Failing to internalise i18n and Unicode is like designing a bridge without knowing about tensile strength. It might work, at first. Until the first truck comes along. Internationalisation isn’t “a feature.” It’s the absence of friction, the silent courtesy your software pays to the rest of the world.
Closing Thoughts
Unicode and i18n don’t make your software smart. They make it humble. They are what let a system retreat just enough for the user to feel like the protagonist. And that is precisely what global users deserve.
Build like you’re being watched by a multilingual, right-to-left, low-bandwidth future.
Because you are.