Transliterating arbitrary text into Latin script

This post explores one of the capabilities of the PyICU library, namely its text transformation module. Specifically, we’ll look at the simplest use case: transliterating text into Latin script.

Say you are given a list of phrases, names, titles, whatever, in a writing system that you’re not familiar with. You want to be able to differentiate the items, but this is hard when you can’t read what they say. Well, let’s turn them into Latin characters (which, assuming you’re reading this in English, you are able to read)!

>>> import icu
>>> tr = icu.Transliterator.createInstance("Any-Latin; Title").transliterate
>>> tr("Αριστοτέλης, Πλάτων, Σωκράτης")
'Aristotélēs, Plátōn, Sōkrátēs'

There we go. Even though you probably still can’t pronounce these names correctly, at least they’re hopefully easier to recognise because they are now in a script that you are more used to reading (unless you’re Greek, of course).

"Any-Latin; Title" means we want to transliterate from any script to Latin, then convert it to title case. If that’s too simple, the ICU documentation has the gory details of all the supported transforms.

Caveats

As previously alluded to, do not rely on the output as a pronunciation guide unless you know what you’re doing. For example, the Korean character 꽃 is transliterated by ICU as kkoch to keep it reversible; that’s not how the word is normally romanised, and if you try to pronounce it like that nobody will understand you.
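The odd-looking kkoch makes more sense once the syllable is split into its letters (jamo). The sketch below uses only the standard unicodedata module, not ICU, and the helper name jamo_names is invented for illustration. It shows that 꽃 is built from ㄲ, ㅗ, and ㅊ, which ICU appears to romanise letter by letter as kk, o, and ch.

```python
import unicodedata


def jamo_names(syllable):
    """Hypothetical helper: Unicode names of the jamo in a Hangul syllable."""
    # NFD splits a precomposed Hangul syllable into its conjoining jamo.
    return [unicodedata.name(ch) for ch in unicodedata.normalize("NFD", syllable)]


for name in jamo_names("꽃"):
    print(name)
# HANGUL CHOSEONG SSANGKIYEOK
# HANGUL JUNGSEONG O
# HANGUL JONGSEONG CHIEUCH
```

A pronunciation-based scheme would instead reflect how a syllable-final ㅊ actually sounds; as far as I know, the Revised Romanisation gives kkot.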

Another issue is that Han characters (shared between Chinese, Japanese, and Korean) are transliterated using Chinese Pinyin, so the result may not resemble the Japanese or Korean romanisation at all. Since Japanese writing makes extensive use of these characters, using ICU to transliterate Japanese text may be a bad idea, depending on your use case.

>>> tr("日本国")  # "Nippon-koku" in Japanese, meaning "Japan"
'Rì Běn Guó'

Oops, that could start an Internet war. Use a different library if you need to properly deal with Japanese text.
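If you still want ICU for everything else, one possible guard is to check for Han ideographs first and route those strings to another tool. This is only a sketch using the standard unicodedata module; the helper name contains_han is hypothetical, and it relies on the naming convention of the Unicode character database.

```python
import unicodedata

# Han ideographs are named "CJK UNIFIED IDEOGRAPH-XXXX" or
# "CJK COMPATIBILITY IDEOGRAPH-XXXX" in the Unicode character database.
_HAN_PREFIXES = ("CJK UNIFIED IDEOGRAPH", "CJK COMPATIBILITY IDEOGRAPH")


def contains_han(text):
    """Hypothetical helper: True if text has at least one Han ideograph."""
    return any(unicodedata.name(ch, "").startswith(_HAN_PREFIXES) for ch in text)


print(contains_han("日本国"))       # True
print(contains_han("Αριστοτέλης"))  # False
```

Note that this cannot tell Chinese from Japanese on its own; it only flags text where the Pinyin output might be the wrong choice.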

The JavaScript kuroshiro.js module seems to do kanji romanisation really well; I’m not sure if there is a Python equivalent.

> const kuroshiro = require("kuroshiro");
> kuroshiro.init(() => {});
> kuroshiro.convert("日本国", { to: "romaji", mode: "furigana" });
'<ruby>日本<rp>(</rp><rt>nippon</rt><rp>)</rp>国<rp>(</rp><rt>koku</rt><rp>)</rp></ruby>'

It looks amazing: 日本(nippon)国(koku). Remove the mode if you just want the plain romaji: nippon koku.

Another unfortunate thing with ICU is that there are still scripts that it doesn’t support at all. For example, it can’t transliterate to/from Javanese.

>>> tr("ꦫꦩꦏꦮꦸꦭꦲꦶꦁꦱ꧀ꦮꦂꦒ")
'ꦫꦩꦏꦮꦸꦭꦲꦶꦁꦱ꧀ꦮꦂꦒ'

Maybe one day.
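Until then, it may be useful to detect when ICU has passed text through untouched, since it does not raise an error for unsupported scripts. The sketch below is a heuristic using the standard unicodedata module: it lists any letters in the output that are not Latin. The helper name non_latin_letters is made up for this example.

```python
import unicodedata


def non_latin_letters(text):
    """Hypothetical helper: letters in text that are not Latin letters."""
    return [
        ch
        for ch in text
        # Categories starting with "L" are letters; combining marks,
        # punctuation, and spaces are skipped.
        if unicodedata.category(ch).startswith("L")
        and not unicodedata.name(ch, "").startswith("LATIN")
    ]


print(non_latin_letters("Aristotélēs, Plátōn, Sōkrátēs"))     # []
print(len(non_latin_letters("ꦫꦩꦏꦮꦸꦭꦲꦶꦁꦱ꧀ꦮꦂꦒ")) > 0)  # True
```

An empty list suggests the transliteration went through; anything else means at least part of the string was left in its original script.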