Transliterating arbitrary text into Latin script

This post explores one of the capabilities of the PyICU library, namely its text transformation module. Specifically, we’ll look at the simplest use case: transliterating text into Latin script.

Say you are given a list of phrases, names, titles, whatever, in a writing system that you’re not familiar with. You want to be able to differentiate the items, but this is hard when you can’t read what they say. Well, let’s turn them into Latin characters (which, assuming you’re reading this in English, you are able to read)!

>>> import icu
>>> tr = icu.Transliterator.createInstance("Any-Latin; Title").transliterate
>>> tr("Αριστοτέλης, Πλάτων, Σωκράτης")
'Aristotélēs, Plátōn, Sōkrátēs'

There we go. Even though you probably still can’t pronounce these names correctly, at least they’re hopefully easier to recognise because they are now in a script that you are more used to reading (unless you’re Greek, of course).

"Any-Latin; Title" means we want to transliterate from any script to Latin, then convert it to title case. If that’s too simple, the ICU documentation has the gory details of all the supported transforms.

Caveats

As previously alluded to, do not rely on the output as a pronunciation guide unless you know what you’re doing. For example, the Korean character 꽃 is transliterated by ICU as kkoch to keep it reversible; that’s not how the word is normally romanised, and if you try to pronounce it like that nobody will understand you.

Another issue is that the transliteration of Han characters (shared between Chinese, Japanese, and Korean) uses Chinese Pinyin, and thus may not resemble the Japanese and Korean romanisations at all. Considering that Japanese writing makes extensive use of these characters, using ICU to transliterate Japanese texts may be a bad idea depending on your use case.

>>> tr("日本国")  # "Nippon-koku" in Japanese, meaning "Japan"
'Rì Běn Guó'

Oops, that could start an Internet war. Use a different library if you need to properly deal with Japanese text.

The JavaScript kuroshiro.js module seems to do Kanji romanisation really well; not sure if there is a Python equivalent.

> const kuroshiro = require("kuroshiro");
> kuroshiro.init(() => {});
> kuroshiro.convert("日本国", { to: "romaji", mode: "furigana" });
'<ruby>日本<rp>(</rp><rt>nippon</rt><rp>)</rp>国<rp>(</rp><rt>koku</rt><rp>)</rp></ruby>'

It looks amazing: 日本(nippon)国(koku). Remove the mode if you just want the plain romaji: nippon koku.

Another unfortunate thing with ICU is that there are still scripts that it doesn’t support at all. For example, it can’t transliterate to/from Javanese.

>>> tr("ꦫꦩꦏꦮꦸꦭꦲꦶꦁꦱ꧀ꦮꦂꦒ")
'ꦫꦩꦏꦮꦸꦭꦲꦶꦁꦱ꧀ꦮꦂꦒ'

Maybe one day.
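
Because unsupported scripts pass through unchanged rather than raising an error, it can be useful to check the result afterwards. The following is only a rough sketch using Python's standard unicodedata module; the helper name non_latin_letters is made up for this example, and checking whether a character's Unicode name starts with "LATIN" is a heuristic, not something ICU provides.

```python
import unicodedata


def non_latin_letters(text):
    """Return the letters in `text` whose Unicode name does not start with LATIN.

    Hypothetical helper: a non-empty result suggests that the
    transliterator left (part of) the input untouched.
    """
    leftovers = []
    for ch in text:
        # Skip digits, punctuation, spaces, and combining marks.
        if not ch.isalpha():
            continue
        if not unicodedata.name(ch, "").startswith("LATIN"):
            leftovers.append(ch)
    return leftovers


print(non_latin_letters("Sōkrátēs"))      # letters are all Latin
print(non_latin_letters("\ua9ab\ua9a9"))  # two Javanese letters left over
```

If the returned list is non-empty for a transliterated string, that string probably needs a different tool.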

Using GitLab’s CI server

GitLab provides a continuous integration service, which is pretty nice for building, testing, and packaging your software and having all the UI integrated in GitLab. If you’re just using the free GitLab.com hosting, you get to utilise their Docker-based runner. (If your build process requires a non-Linux OS you’ll have to provide your own runner.)

Getting a basic build up and running is pretty simple. For example, here’s one job named test that only runs make check:

# .gitlab-ci.yml
test:
  script:
    - make check

If your test suite can measure code coverage, GitLab can also show it in the UI. At the moment this feature is rather rudimentary and requires you to go to the project settings and enter a regular expression to find the coverage amount in the build output.

The following is an example that works with coverage.py when you only have a single Python file. I haven’t tried it with multiple files; it may require a wrapper script that calculates the total coverage amount.

# .gitlab-ci.yml
test:
  image: python:3-alpine
  script:
    - pip install coverage
    - coverage run foo.py
    - coverage report -m
# Regex that matches the coverage amount:
# ^\S+\.py\s+\d+\s+\d+\s+(\d+\%)
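
To see what that regex captures, it can be tried locally before pasting it into the project settings. This is a minimal sketch using Python's re module; the sample report text is invented for illustration (it only imitates the column layout of coverage report -m), and GitLab evaluates the pattern with its own regex engine, so treat this as a sanity check rather than a guarantee.

```python
import re

# Same pattern as above; MULTILINE lets ^ match at the start of each line.
COVERAGE_RE = re.compile(r"^\S+\.py\s+\d+\s+\d+\s+(\d+\%)", re.MULTILINE)

# Made-up sample output in the style of `coverage report -m`.
SAMPLE_REPORT = """\
Name     Stmts   Miss  Cover   Missing
--------------------------------------
foo.py      20      4    80%   33-35, 39
"""


def find_coverage(report):
    """Return the first captured coverage amount, or None if nothing matches."""
    match = COVERAGE_RE.search(report)
    return match.group(1) if match else None


print(find_coverage(SAMPLE_REPORT))  # prints 80%
```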

A few lessons learnt from setting up test hooks for a small Python app:

~~There is no way to test changes to your build script without pushing a commit. And then the build results will stay in your project page forever with no way to clean up the irrelevant ones.~~ You can run builds locally with gitlab-runner, e.g. gitlab-runner exec shell test to run the test job on the local shell (replace shell with docker to use Docker). (Thanks Evan Felix for the info about gitlab-runner.)
GitLab.com’s default Docker image (ruby:2.1 at the time of writing) is really fast to spin up, possibly because it’s cached. However, you should still explicitly name a Docker image in case the default changes.
Installing packages is slower than downloading a Docker image. It’s not worth going out of your way to use the default image if you then have to call apt-get. See if your compiler has an official Docker image that has all the packages you need (please don’t run Docker images by random people). That said, we’re talking about differences of about ten seconds, so just choose the method that is most convenient.
The ruby:2.1 image has Python 2 but not Python 3.
The official Python repository on Docker Hub lists a number of Alpine-based images. As you would expect, these are smaller and slightly faster to download than the other (Debian-based) images.
The coverage regex requires the percent sign to be escaped (\%).

CMake’s ugly programming language

I discovered Rosetta Code not long ago and found it quite fun to browse. It shows you the code for various programming tasks in different programming languages. While looking at the Quicksort page, I noticed that it didn’t have a CMake version, so I decided to try writing one.

function (quicksort array_var)
    set (array ${${array_var}})
    if ("${array}" STREQUAL "")
        return ()
    endif ()

    set (less)
    set (equal)
    set (greater)
    list (GET array 0 pivot)

    foreach (x ${array})
        if (x LESS pivot)
            list (APPEND less "${x}")
        elseif (x EQUAL pivot)
            list (APPEND equal "${x}")
        else ()
            list (APPEND greater "${x}")
        endif ()
    endforeach ()

    set (array)
    if (NOT less STREQUAL "")
        quicksort (less)
        list (APPEND array ${less})
    endif ()
    list (APPEND array ${equal})
    if (NOT greater STREQUAL "")
        quicksort (greater)
        list (APPEND array ${greater})
    endif ()
    set ("${array_var}" ${array} PARENT_SCOPE)
endfunction ()

set (a 4 65 2 -31 0 99 83 782 1)
quicksort (a)
message ("${a}")
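
For comparison, here is roughly the same three-way partition written in Python. It is only a sketch meant to mirror the structure of the CMake function above (pivot on the first element, split into less/equal/greater, recurse on the outer two); unlike the CMake version it returns a new list instead of writing back to the caller's variable.

```python
def quicksort(items):
    """Sort a list using the same three-way partition as the CMake version."""
    if not items:
        return []

    pivot = items[0]
    less = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    greater = [x for x in items if x > pivot]

    # Only the outer partitions need further sorting.
    return quicksort(less) + equal + quicksort(greater)


print(quicksort([4, 65, 2, -31, 0, 99, 83, 782, 1]))
# prints [-31, 0, 1, 2, 4, 65, 83, 99, 782]
```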

I’ve worked with CMake for years, and I think it’s a good build system, but I really wish it had switched to a saner language. The CMake language is actually pretty simple and consistent at the syntax level: everything is in the form command(string), where the string syntax is slightly confusing but still rather understandable once you’ve figured out the quoting and variable expansion mechanisms. It’s how that string argument is used that can be messy, inconsistent, and ambiguous. Effectively, it’s as if every command had its own syntax.

Around 2008, there was an experiment to allow writing CMake scripts in Lua. The project never caught on and was abandoned. I think part of the reason was that the thread discussing it on Lua’s mailing list was single-handedly derailed into pointless bickering (which reminds me of the poisonous people talk).

CMake is stuck with a mediocre programming language for the foreseeable future. It’s not as bad as it sounds, though. The simplicity of the syntax has its advantages, and writing CMake buildfiles rarely gets frustrating. It would make a terrible general programming language, but as a build system scripting language it’s workable.

Exaile switched to Git, and my two cents about the DVCS race

Exaile recently switched from Bazaar to Git. Dustin Spicuzza, who has been leading the Exaile project recently, initiated this plan and the rest of the team were in full agreement. But why did we switch? Put simply, it was because all of us prefer Git to Bazaar. However, that’s an oversimplification of my stance on the whole DVCS thing.

After the BitKeeper debacle, distributed version control system became the hottest buzzword around, and in a short while the three major contenders became obvious: Bazaar, Git, and Mercurial. At the time I already had experience with SVK (a semi-DVCS that extends Subversion), which I had used on Ruby.NET and Exaile, but when Adam moved Exaile to Bazaar, I went all-in on Bazaar.

I loved nearly everything about Bazaar, and I instantly “got” it. This was in contrast to Mercurial, which I gave up trying to understand, and to Git, which I wouldn’t have started using if I didn’t have to use it at work. I really wanted Bazaar to be popular. I still agree with the general sentiment that it has excellent usability, especially compared to Git.

But its modest popularity crashed under Git’s dominance; Git was getting all the attention and all the improvements while Bazaar stagnated. Bazaar’s biggest promise from the start, that it would eventually be optimised for speed, either came too late or never materialised—I still don’t know which. Its code documentation languished, and I had to go digging around just to write a simple commit bot.

During a casual chat among the active developers at #exaile, we found that our collective knowledge of Bazaar had deteriorated so much that we couldn’t remember how to perform some relatively simple operations. We decided to move on. Right now most of our services have moved to GitHub. We’re still keeping the Launchpad site to refer to old bugs and for the web translation service (the latter is not available on GitHub).

Note: I wrote this article in full in 2014 but only decided to publish it in 2019, by which time it was obviously not timely anymore. It still expresses my thoughts quite well, so I’m publishing it belatedly, but backdated in order to not confuse people.

Authorisation UIs: design issues and going towards fixing them in Wayland

I was browsing the xfce4-dev mailing list when I stumbled on a request for comment by Steve Dodier-Lazaro (who I knew in the past through the #exaile IRC room) on his article regarding the design of authorisation UIs.

The article covers a lot of ground, from current issues plaguing these UIs (in the Windows UAC dialog, gksu, etc.), to the types of operations in Wayland that are planned to require authorisation, to brainstorming ideas for moving forward. A lot of issues are still unsolved, and I suspect they are years away from being solved, but I’m glad that people are talking about them. The article also calls for further academic research on fixing these issues.

Johannes Sasongko’s blog
