Johannes Sasongko’s blog

Posts tagged python

Notes on Python’s dbm.sqlite3 module

Python provides a simple key-value database API in its dbm namespace, and since version 3.13 this includes an SQLite backend. As SQLite is neither simple nor merely a key-value database, there are a few quirks to this dbm.sqlite3 library that I’m documenting in this post.

Schema and data types

You can see the SQL schema of a dbm.sqlite3 database at the top of CPython’s Lib/dbm/sqlite3.py. In summary:

The Dict table is created with default settings. In SQLite that means the table has a rowid and does not use STRICT typing.

The key and value columns are declared with BLOB affinity, and because the table is non-STRICT, in theory they can contain any SQLite data type (NULL excluded due to the NOT NULL declaration). However, dbm.sqlite3 queries always coerce keys and values to BLOBs first, so as long as you only edit the database through dbm.sqlite3, the keys and values will all be BLOBs. This can lead to surprises, for example due to differences in floating-point precision:

>>> import dbm.sqlite3
>>> db = dbm.sqlite3.open("test.sqlite", "c")
>>> num = 2.000000000000002
>>> num
2.000000000000002
>>> db[num] = num
>>> dict(db)
{b'2.0': b'2.0'}

In practice, you should probably serialise keys and values yourself into Python bytes objects (which round-trip to SQLite BLOBs), because otherwise the database will do it for you in a way you may not expect. Handling the serialisation yourself also makes you less likely to forget to deserialise the values you get back. If you use Python’s shelve module as a higher-level abstraction over the dbm database, value (de)serialisation is done automatically for you using pickle.
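Here is a minimal sketch of doing it yourself, assuming you are happy to encode keys as UTF-8 text and to pickle the values (the file name is just an example):

import dbm.sqlite3
import pickle

db = dbm.sqlite3.open("example.sqlite", "c")
try:
    # Serialise the key and value yourself so nothing is silently
    # coerced on the way into SQLite.
    key = "pi".encode("utf-8")
    db[key] = pickle.dumps(3.141592653589793)
    print(pickle.loads(db[key]))  # 3.141592653589793
finally:
    db.close()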

Using the database from multiple threads

Python’s sqlite3 module (which dbm.sqlite3 is built on top of) has a connection-level check_same_thread option that, by default, prevents the connection from being used from multiple threads. This option is not exposed by dbm.sqlite3. In other words, a dbm.sqlite3 connection can only be used from the thread it was created on.

If your use case needs check_same_thread disabled, the only solution available at the moment is to copy the dbm.sqlite3 source code to your project and modify it there.

Depending on the value of the sqlite3.threadsafety constant, and especially when check_same_thread is disabled, you may need to regulate access to the database using a mutex. Refer to the sqlite3.threadsafety documentation to see the level of locking you need to perform. In practice, if you’re on a mainstream Linux distribution, SQLite is likely compiled with Serialized threading mode enabled, making it safe to use from multiple threads (with the exception of transactions).

The simplest way to handle the threading issue is to check that sqlite3.threadsafety is 3 (Serialized)—raising an exception otherwise—and hope that none of your users encounter this limitation. But that’s up to you; feel free to handle it according to your needs.
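A minimal sketch of that check, assuming you just want to fail fast at startup:

import sqlite3

# 3 means "Serialized": the underlying SQLite library can be used
# from multiple threads with no extra locking on your part.
if sqlite3.threadsafety != 3:
    raise RuntimeError("SQLite must be compiled in Serialized threading mode")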

Transliterating arbitrary text into Latin script

This post explores one of the capabilities of the PyICU library, namely its text transformation module. Specifically, we’ll look at the simplest use case: transliterating text into Latin script.

Say you are given a list of phrases, names, titles, whatever, in a writing system that you’re not familiar with. You want to be able to differentiate the items, but this is hard when you can’t read what they say. Well, let’s turn them into Latin characters (which, assuming you’re reading this in English, you are able to read)!

>>> import icu
>>> tr = icu.Transliterator.createInstance("Any-Latin; Title").transliterate
>>> tr("Αριστοτέλης, Πλάτων, Σωκράτης")
'Aristotélēs, Plátōn, Sōkrátēs'

There we go. Even though you probably still can’t pronounce these names correctly, at least they’re hopefully easier to recognise because they are now in a script that you are more used to reading (unless you’re Greek, of course).

"Any-Latin; Title" means we want to transliterate from any script to Latin, then convert it to title case. If that’s too simple, the ICU documentation has the gory details of all the supported transforms.

Caveats

As previously alluded to, do not rely on the output as a pronunciation guide unless you know what you’re doing. For example, the Korean character 꽃 is transliterated by ICU as kkoch to keep the transliteration reversible; that’s not how the word is normally romanised, and if you try to pronounce it like that nobody will understand you.

Another issue is that the transliteration of Han characters (shared between Chinese, Japanese, and Korean) uses Chinese Pinyin, and thus may not resemble the Japanese and Korean romanisations at all. Considering that Japanese writing makes extensive use of these characters, using ICU to transliterate Japanese texts may be a bad idea depending on your use case.

>>> tr("日本国")  # "Nippon-koku" in Japanese, meaning "Japan"
'Rì Běn Guó'

Oops, that could start an Internet war. Use a different library if you need to properly deal with Japanese text.

Another unfortunate thing with ICU is that there are still scripts that it doesn’t support at all. For example, it can’t transliterate to/from Javanese.

>>> tr("ꦫꦩꦏꦮꦸꦭꦲꦶꦁꦱ꧀ꦮꦂꦒ")
'ꦫꦩꦏꦮꦸꦭꦲꦶꦁꦱ꧀ꦮꦂꦒ'

Maybe one day.

Pango: Determine if a font is monospaced

If you have a GtkFontButton, finding out whether the chosen font is monospaced is quite a complicated process. Here is a complete walk-through.

(By the way, I will be using PyGTK’s Pango documentation because the C version is a mess.)

FontButton.get_font_name returns the font family (a.k.a. “font name”), style, and size; for example, “Liberation Serif Italic 14”. The first thing we need to do is pick just the family name. We do this by going through a PangoFontDescription.

import pango

desc_str = font_button.get_font_name()  # Liberation Serif Italic 14
desc = pango.FontDescription(desc_str)
family_name = desc.get_family()  # Liberation Serif

Next, check whether the font family describes a monospaced font. Here is where it gets dodgy. We need an arbitrary PangoContext, which can be obtained from a GtkWidget using Widget.get_pango_context. We then list all available font families and find the one with the appropriate name. Call FontFamily.is_monospace to finish the job.

(By the way, this is also a good place to show off Python’s for-else construct.)

context = widget.get_pango_context()  # widget can be any GtkWidget.
for family in context.list_families():
	if family.get_name() == family_name:
		break
else:  # Should not happen.
	assert False
family.is_monospace()  # False -- Liberation Serif is proportional.
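Putting it all together, here is a rough sketch of the whole check as one helper (the function name is made up, and `widget` can again be any GtkWidget, such as the font button itself):

import pango

def font_button_is_monospace(font_button, widget):
    desc = pango.FontDescription(font_button.get_font_name())
    family_name = desc.get_family()
    for family in widget.get_pango_context().list_families():
        if family.get_name() == family_name:
            return family.is_monospace()
    return False  # Family not found; should not happen.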

EBML/Matroska parser for Python

This post explains a Python EBML parser that I wrote. (EBML is Matroska’s binary “markup language”.) It is implemented as a single-file library and is available under a free software licence.

Background

I’ve been working to implement Matroska (mka, mkv, webm) tag-reading support in Exaile. Mutagen—the tag library that we use—currently doesn’t have this feature, so I looked elsewhere.

Choices

Previously I had a working solution using hachoir-metadata, but it doesn’t really make sense to depend on another large tagging library when we’re already using Mutagen. To make matters worse, I accidentally deleted the branch during our recent Bazaar upgrade problem.

I started shopping around for other possible solutions and found videoparser, which seemed quite nice and compact. It’s still a different library, though, and it doesn’t seem to be packaged on Debian.

I was considering just using it anyway for yet another temporary hack when I chanced on MatroskaParser.pm (dead link), a Perl library written by “Omion (on HA)”. It’s only 816 lines of Perl; discounting the README and the Matroska elements table, we’re looking at less than 450.

Solution

I decided to translate MatroskaParser.pm into Python. Despite the horror stories out there about Perl, this particular code is written in a style that is extremely readable if you’re somewhat familiar with the language.

Well, I’ve finished the porting: 250 lines of EBML parser written in Python. Parts of MatroskaParser.pm that are not relevant—mainly the validity checker and the Block parser—have been removed, and the output data structure has been simplified. The next job is to actually extract tags out of the structure.

Matroska tags

Matroska tags are quite different from MP3 and Vorbis tags, in that they’re not just a flat list of key-value pairs. Consider the following snippet.

[{'SimpleTag': [{'TagName': ['TITLE'],
                 'TagString': ['Light + Shade']},
                {'TagName': ['ARTIST'],
                 'TagString': ['Mike Oldfield']}],
  'Targets': [{'TargetTypevalue': [50]}]},
 {'SimpleTag': [{'TagName': ['TITLE'],
                 'TagString': ['Surfing']}],
  'Targets': [{'TargetTypevalue': [30]}]}]

There are two types of tags in this example. The first (target type 50) describes the album (title: Light + Shade, artist: Mike Oldfield), while the second (target type 30) describes the track (title: Surfing). Translating this structure into tags that Exaile can understand is not hard; it just needs a bit of planning.
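As a rough sketch (not the actual Exaile code), assuming `tags` is the parsed Tag list shown above and using a made-up prefix scheme for the target types:

TARGET_PREFIXES = {50: "album_", 30: ""}  # hypothetical naming scheme

def flatten_tags(tags):
    result = {}
    for tag in tags:
        # TargetTypevalue defaults to 50 (album level) when omitted.
        target = tag.get("Targets", [{}])[0].get("TargetTypevalue", [50])[0]
        prefix = TARGET_PREFIXES.get(target, "")
        for simple in tag.get("SimpleTag", []):
            result[prefix + simple["TagName"][0].lower()] = simple["TagString"][0]
    return result

# flatten_tags(tags) == {'album_title': 'Light + Shade',
#                        'album_artist': 'Mike Oldfield', 'title': 'Surfing'}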

By the way, notice that Matroska makes implementing album artists / compilation albums very intuitive: you can have an artist tag at album level, and another at track level. There are even other levels specified. As a further example, because Light + Shade consists of two CDs labelled Light and Shade respectively, you could use them as the titles at level 40 (between album and track); however, this is not common practice.

Another tricky part is getting the track length out of the structure. Under /Segment/Info, you’ll find something like

[{'Duration': [14821615.0],
  'TimecodeScale': [22674]}]

At first I randomly assumed that the duration was specified in seconds, and got around 171 days as output, which was obviously wrong. Apparently you need to apply this formula to get the length in seconds:

Length = Duration * TimecodeScale / 10^9

Note that TimecodeScale may be omitted; it is one of the few important elements that have default values (1,000,000 in its case).
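Plugging in the numbers from the snippet above:

duration = 14821615.0
timecode_scale = 22674  # would default to 1_000_000 if omitted
length = duration * timecode_scale / 10**9
print(length)  # ~336.07 seconds, i.e. about 5 minutes 36 seconds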

Code

The code is now available in Exaile’s repository. It’s licensed under GPL 2+ with the standard Exaile exception, although I will consider relicensing it if there is interest.

Notice that the last 100-or-so lines make up the Matroska tagging part. Depending on your needs, you may need to expand the list of elements based on the Matroska specification. There are also 40 lines of code that subclass the parser to use GIO to read the files; you may want to remove that chunk if it’s not relevant to you.

Future

Matroska read-only tag support will be in Exaile 0.3.2. Maybe one day I’ll add write support and integrate the whole thing into Mutagen, but don’t count on it. If anyone wants to do it, I’m more than happy to help.

What about WebM?

Funny how I made this post shortly before WebM was announced. Coincidence? Yes, unfortunately; I’m not as cool as the Mozilla and Opera people, who were let in on Google’s secret.

At this point, the WebM container is mostly just a subset of Matroska (the only incompatibility I’ve noticed is the change in doctype, from matroska to webm). As far as I know, they use the exact same EBML structure for tags, so there’s no reason Exaile or this code shouldn’t be able to read tags from a WebM file.