Automating Terminology Checks with Python

“Through the Binary Lens” brings together articles where code meets culture. By weaving technical and human perspectives, this series uncovers how software, language, and global expansion influence one another in ways both seen and unseen.

Project Overview

When localising large codebases or documentation sets, consistency in terminology is king, but manually auditing hundreds of files for term conformity can be a weeks‑long slog. To solve this, we built a lightweight Python script that:

  • Scans all source files for approved terms.
  • Flags any deviations (typos, casing, synonyms).
  • Generates a summary report with line‑by‑line context (see the example below).
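To make that concrete: the glossary is a plain-text file with one approved term per line, and the report is a simple CSV. The excerpt below is hypothetical (made-up terms, paths, and line numbers), but it shows the shape of the input and output:

glossary_en.txt (one approved term per line)

Single Sign-On
Knowledge Base
Dashboard

terminology_report.csv (file, line, flagged term)

file,line,term_found
docs/setup.md,42,single sign-on
docs/faq.md,7,knowledge base

Here "single sign-on" is flagged because it deviates from the approved casing "Single Sign-On".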

In a recent customer trial across 12 repositories (≈75,000 lines of text), this automation cut their QA cycle by 85%, freeing up project managers for higher‑level review.

When Built‑in Tools Fall Short

Most CAT tools and localisation platforms include terminology‑checking modules, but they often assume:

  1. Single File Format (XLIFF or TMX).
  2. Static Termbases that require manual updates in the UI.
  3. GUI‑Only Workflows with no headless integration into CI/CD.

Our script shines in scenarios where:

  • Polyglot Repositories contain Markdown, HTML, JSON, and source‑code files mixed together.
  • Dynamic Glossaries live in version control (Git) and evolve with each branch (see the sketch after this list).
  • Automated Pipelines need headless checks on every pull request before merging.
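Because those glossaries are plain text files tracked in Git, each branch carries its own version of them, and the script simply reads whatever the current checkout contains. Here is a minimal sketch, assuming a hypothetical glossary_<lang>.txt naming convention (the load_glossary helper is illustrative, not part of the final script):

from pathlib import Path

def load_glossary(lang):
    """Read the approved terms for one language, e.g. glossary_en.txt or glossary_de.txt."""
    path = Path(f'glossary_{lang}.txt')
    return [line.strip()
            for line in path.read_text(encoding='utf-8').splitlines()
            if line.strip()]

# Whatever terms the checked-out branch defines are what gets enforced on it
approved_terms = load_glossary('en')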

The Challenge

  • Scale: 12 repos, 8 languages → ~900 resource files.
  • Diversity: Each language had its own glossary (200–350 approved terms).
  • Turnaround: Manual checks took 5–7 days per release.

Goal: Deliver consistent terminology assurance in under 24 hours while fitting seamlessly into their existing dev pipeline.

The Solution

Our dev team wrote a Python script leveraging:

  • re (regular expressions) for pattern matching and term detection.
  • concurrent.futures to parallelise file processing.
  • pandas to assemble and sort our findings into a clean CSV report.

The structure looks something like the following (though we won’t give away the exact secret sauce just yet):

import re
import os
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

# Load the approved glossary (one term per line)
with open('glossary_en.txt', encoding='utf-8') as f:
    approved_terms = [line.strip() for line in f if line.strip()]

approved_set = set(approved_terms)

# Case-insensitive pattern that matches any occurrence of an approved term
pattern = re.compile(
    r'\b(' + '|'.join(map(re.escape, approved_terms)) + r')\b',
    flags=re.IGNORECASE,
)

def scan_file(path):
    """Flag glossary terms whose casing deviates from the approved form."""
    issues = []
    with open(path, encoding='utf-8') as src:
        for num, line in enumerate(src, 1):
            for match in pattern.finditer(line):
                token = match.group(0)
                # Matches an approved term case-insensitively, but not exactly
                if token not in approved_set:
                    issues.append((path, num, token))
    return issues

# Collect the resource files to scan
files = [
    os.path.join(dp, f)
    for dp, dn, fn in os.walk('resources/')
    for f in fn if f.endswith(('.md', '.json', '.txt'))
]

# Scan files in parallel
with ThreadPoolExecutor(max_workers=8) as executor:
    results = executor.map(scan_file, files)

# Flatten the per-file results and write the CSV report
flat = [item for sublist in results for item in sublist]
df = pd.DataFrame(flat, columns=['file', 'line', 'term_found'])
df.to_csv('terminology_report.csv', index=False)
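For the headless pull-request checks mentioned earlier, one simple (and hypothetical) extension is to fail the build whenever the scan finds anything, so the CI job blocks the merge:

import sys

# A non-zero exit code fails the pipeline step if any issues were flagged
if flat:
    print(f'{len(flat)} terminology issues found - see terminology_report.csv')
    sys.exit(1)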

Performance Gains

Metric | Before Automation | After Automation
Total QA time per release | 6 days | 8 hours
Average issues flagged manually per reviewer | 1,250 | 0 (all flagged by the script)
Weekly reviewer hours on manual QA | 48 hrs | 0 hrs (reallocated to UX)
  • 6 days → 8 hours: Reduced manual checks from almost a full week to a single workday.
  • 48 hours reclaimed weekly for process improvements.

What’s Next?

  • Fuzzy-match enhancement with Levenshtein distance to catch near-miss typos (a rough sketch follows this list).
  • Interactive dashboard (using Dash) for real-time monitoring of terminology health.
  • CI/CD integration so each pull request automatically generates a mini report.
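As a rough illustration of the first item, the sketch below uses difflib from the standard library as a stand-in for a dedicated Levenshtein package; the 0.85 cutoff and the near_miss helper are illustrative choices, not the final design:

import difflib

# In practice these come from glossary_en.txt; hard-coded here for illustration
approved_terms = ['localisation', 'terminology', 'glossary']

def near_miss(token, cutoff=0.85):
    """Return the closest approved term if token looks like a typo of one, else None."""
    candidates = [t.lower() for t in approved_terms]
    matches = difflib.get_close_matches(token.lower(), candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(near_miss('terminolgy'))  # -> 'terminology'
print(near_miss('dashboard'))   # -> None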

By pulling the heavy lifting into a quick, repeatable script that works across file types and integrates into our dev pipeline, we’ve ensured bulletproof terminology consistency and given our teams back their weekends.

Quentin Lucantis @orb