Posts tagged programming

Notes on Python’s `dbm.sqlite3` module

Python provides a simple key-value database API in its dbm namespace, and since version 3.13 this includes an SQLite backend. As SQLite is neither simple nor merely a key-value database, there are a few quirks to this dbm.sqlite3 library that I’m documenting in this post.

Schema and data types

You can see the SQL schema of a dbm.sqlite3 database at the top of CPython’s Lib/dbm/sqlite3.py. In summary:

Dict table
- key column (unique, non-null)
- value column (non-null)

The Dict table is created with default settings. In SQLite that means the table has rowid and does not have STRICT typing.

The key and value columns are declared with BLOB affinity, and because the table is non-STRICT, that means in theory they can contain any SQLite data type (null excluded due to the NON NULL declaration). However, dbm.sqlite3 queries always coerce keys and values to BLOBs first; as long as you always edit the database using dbm.sqlite3, all keys and values will only contain BLOBs. This can lead to surprises, for example due to differences in floating-point precision:

>>> db = dbm.sqlite3.open("test.sqlite", "c")
>>> num = 2.000000000000002
>>> num
2.000000000000002
>>> db[num] = num
>>> dict(db)
{b'2.0': b'2.0'}

In practice, you should probably serialise keys and values yourself into Python bytes objects (which round-trip to SQLite BLOBs), because otherwise the database will do it for you in a way that you may not expect. Handling the serialisation yourself also makes you less likely to forget to deserialise the values you get back. If you use Python’s shelve module as a higher-level abstraction to the dbm database, value (de)serialisation is done automatically for you using pickle.

Using the database from multiple threads

Python’s sqlite3 module (which dbm.sqlite3 is built on top of) has a connection-level check_same_thread option that, by default, prevents the connection from being used from multiple threads. This option is not exposed by dbm.sqlite3. In other words, dbm.sqlite3 connections can only be used from the thread it is created on.

If your use case needs check_same_thread disabled, the only solution available at the moment is to copy the dbm.sqlite3 source code to your project and modify it there.

Depending on the value of the sqlite3.threadsafety constant, and especially when check_same_thread is disabled, you may need to regulate access to the database using a mutex. Refer to the sqlite3.threadsafety documentation to see the level of locking you need to perform. In practice, if you’re on a mainstream Linux distribution, SQLite is likely compiled with Serialized threading mode enabled, making it safe to use from multiple threads (with the exception of transactions).

The simplest way to handle the threading issue is to check that sqlite3.threadsafety is 3 (Serialized)—raising an exception otherwise—and hope that none of your users encounter this limitation. But that’s up to you; feel free to handle it according to your needs.

Saving 1 KiB on Rust executables targetting windows-gnu

While comparing a Rust executable for Windows targetting x86_64-pc-windows-gnu and one targetting x86_64-pc-windows-msvc, I noticed that the -gnu one included an embedded application manifest resource. This particular manifest does two things: setting requestedExecutionLevel to asInvoker, and setting supportedOSes from Windows Vista to Windows 10.

As far as I can tell, the first part attempts to disable the Windows installer detection heuristics. However, the documentation appears to indicate that these heuristics are only used on 32-bit binaries, and the fact that the -msvc executable doesn’t have the manifest reinforces the idea that it’s not needed.

The second part of the manifest is only useful if you want to indicate that you don’t support Windows versions prior to Vista. I think for most people that would be the default assumption these days.

These things considered, it looks to me that removing the manifest shouldn’t cause any issues. The problem is that there doesn’t seem to be any built-in way to do this provided by either the OS or the compiler toolchain. You may have to rely on a third-party tool to do this.

If you don’t mind deleting all embedded resources in the executable—by default there will just be the application manifest—you can use this simple C code (replace file.exe with your executable path):

#include <windows.h>
int main(void) {
	return 1 - EndUpdateResourceW(BeginUpdateResourceW(L"file.exe", 1), 0);
}

Less-barebones Rust alternative:

use std::{
	ffi::{c_int, c_void},
	os::windows::ffi::OsStrExt as _,
};

unsafe extern "system" {
	fn BeginUpdateResourceW(pFileName: *const u16, bDeleteExistingResources: c_int) -> *mut c_void;
	fn EndUpdateResourceW(hUpdate: *const c_void, fDiscard: c_int) -> c_int;
}

fn main() {
	for path in std::env::args_os().skip(1) {
		let mut encoded = Vec::from_iter(path.encode_wide());
		encoded.push(0);
		let handle = unsafe { BeginUpdateResourceW(encoded.as_ptr(), 1) };
		if handle.is_null() || unsafe { EndUpdateResourceW(handle, 0) } == 0 {
			eprintln!(
				"Failed removing resources from \"{}\"",
				path.to_string_lossy(),
			);
			std::process::exit(1);
		}
	}
}

Or in Python:

import ctypes
from ctypes import wintypes
import sys

kernel32 = ctypes.windll.kernel32

def check_trueish(result, *_):
	if result:
		return result
	raise ctypes.WinError()

BeginUpdateResource = kernel32.BeginUpdateResourceW
BeginUpdateResource.restype = wintypes.HANDLE
BeginUpdateResource.argtypes = (ctypes.c_wchar_p, wintypes.BOOL)
BeginUpdateResource.errcheck = check_trueish

EndUpdateResource = kernel32.EndUpdateResourceW
EndUpdateResource.restype = wintypes.BOOL
EndUpdateResource.argtypes = (wintypes.HANDLE, wintypes.BOOL)
EndUpdateResource.errcheck = check_trueish

for arg in sys.argv[1:]:
	file_name = ctypes.create_unicode_buffer(arg)
	handle = BeginUpdateResource(file_name, True)
	EndUpdateResource(handle, False)

The slightly bad news here is that, in my testing, removing this manifest only reduces the executable size by exactly 1024 bytes. Considering the x86_64-pc-windows-gnu target generally produces executables in the hundreds of kilobytes at least, this is a fairly inconsequential saving which I probably won’t bother with.

On the search for a Make replacement

Make is great, but there are some issues with it that are probably impossible to fix now. So I’ve been looking for a replacement that I can use for simple task automation; surely in the 40+ years of Make’s lifetime someone has written something better, right?

These are the notes I made while evaluating the different options that I explored. I’m interested to see if anyone has comments, corrections, or other suggestions.

Make

Implemented in: Various programming languages (probably mainly C).
Script language: Make (various dialects).
Metaprogramming: Yes (major implementations).
Biggest issue: Stringly typed.

Good old Make. Very nice for all sorts of tasks, until you need to deal with files containing space and/or quote characters, where things start to go downhill.

Ninja

Implemented in: C++.
Script language: Ninja.
Metaprogramming: No.
Biggest issue: Not programmable.

This is only mentioned for completion because Ninja is in a completely different ballpark. It’s meant to be a target language for buildfile generators like CMake and Meson, so by itself it has zero programmability, and you wouldn’t really want to write it by hand.

Just

Implemented in: Rust.
Script language: Just (Make-like).
Metaprogramming: No.
Biggest issue: Stringly typed.

This feels like a Make variant with fewer features than most Make implementations. I don’t see this as a practical choice for any project, at least right now.

Task

Implemented in: Go.
Script language: YAML + Go template.
Metaprogramming: No.
Biggest issue: Stringly typed.

I actually really like the idea of Task. For very simple use cases it’s very elegant, because its whole syntax is just 100% valid YAML with a bit of string templating. The templating sometimes gets in the way, though, because the use of {{.VAR}} for variables conflicts with YAML’s {a: b} map syntax, forcing you to waste one level of string quoting on it.

A bigger flaw is that there’s no easy way to override variables from the command line. I think you can work around this by jumping through scripting hoops, but then you lose a ton of elegance points.

And the biggest flaw: it’s still stringly typed just like Make, so you’ll have trouble separating strings from lists.

Gulp

Implemented in: JavaScript.
Script language: JavaScript.
Metaprogramming: Yes.
Biggest issue: Doesn’t track outputs?

This looked promising until I noticed that outputs didn’t seem to be tracked anywhere, which means everything gets rebuilt all the time. Is this really the case or am I missing something?

Grunt

Implemented in: JavaScript.
Script language: JavaScript.
Metaprogramming: Yes.
Biggest issue: Convoluted just to get started.

Grunt’s documentation is rather bad, and the examples they have all throw you in the deep, ugly end. Skimming these introductory materials, I couldn’t figure out how to write the simplest build file, which seems a bad sign.

Rake

Implemented in: Ruby.
Script language: Ruby.
Metaprogramming: Yes.
Biggest issue: Ruby’s shell module is broken on Windows.

This one looked very promising. I’m not a fan of Ruby, but was willing to put up with it because Rake seemed to do all I wanted. But then I discovered that Ruby’s most reasonable subprocess handler, the shell module, breaks on Windows. Without it, you’re back to various ugly half-baked APIs, each with their own limitations.

SCons and Waf

Implemented in: Python.
Script language: Python.
Metaprogramming: Yes.
Biggest issue: Not designed for simple tasks.

Two competing Python-based build systems. These seem too complicated for my use cases. I think making them suit simple tasks would be a significant undertaking. Or perhaps I’m just missing a documentation that is not mainly targeted at people trying to create a build pipeline for their C projects.

doit

Implemented in: Python.
Script language: Python.
Metaprogramming: Yes.
Biggest issue: Verbose.

With this one you end up with lots of boilerplate because rather than writing tasks, you’re writing task creators. It makes sense, but it feels like doing things at a too-low level when you want it to be a simple Make alternative.

The author of doit suggests several high-level interfaces that can be implemented on top of doit. They do limit what you can do, but you can always write normal doit task creators in addition to the simplified versions. I think this is a reasonable compromise and I particularly like the decorator version.

The only remaining problem, then, is that Python’s subprocess handling is very cumbersome. There are two libraries I know of that can rectify this: sh and Plumbum. sh, in my opinion, is not suitable for use in a Make replacement use case. The way it does piping by default is not in line with what we expect, coming from Make. Plumbum is not perfect but better (you still have to end everything with .run_fg() or the magical & FG).

A quirk of doit is that it creates a cache file (or files) alongside your build file. Depending on the exact database backend used, it can create up to three files, which I’d say is not ideal.

Conclusion

I have for now settled on doit + Plumbum with around 100 lines of support code. I’m not fully happy with this, and I’m not sure it can cover all my use cases, but I think it’s time for me to put my ideas and investigations out there and seek comments.

Rake is almost what I need, if not for what I believe is a bug in Ruby’s standard library. But even if it’s fixed, I’d prefer to stick with a Python-based solution if possible.

Transliterating arbitrary text into Latin script

This post explores one of the capabilities of the PyICU library, namely its text transformation module. Specifically, we’ll look at the simplest use case: transliterating text into Latin script.

Say you are given a list of phrases, names, titles, whatever, in a writing system that you’re not familiar with. You want to be able to differentiate the items, but this is hard when you can’t read what they say. Well, let’s turn them into Latin characters (which, assuming you’re reading this in English, you are able to read)!

>>> import icu
>>> tr = icu.Transliterator.createInstance("Any-Latin; Title").transliterate
>>> tr("Αριστοτέλης, Πλάτων, Σωκράτης")
'Aristotélēs, Plátōn, Sōkrátēs'

There we go. Even though you probably still can’t pronounce these names correctly, at least they’re hopefully easier to recognise because they are now in a script that you are more used to reading (unless you’re Greek, of course).

"Any-Latin; Title" means we want to transliterate from any script to Latin, then convert it to title case. If that’s too simple, the ICU documentation has the gory details of all the supported transforms.

Caveats

As previously aluded to, do not rely on the output as pronunciation guide unless you know what you’re doing. For example, the Korean character 꽃 is transliterated by ICU as kkoch to keep it reversible; that’s not how the word is normally romanised, and if you try to pronounce it like that nobody will understand you.

Another issue is that the transliteration of Han characters (shared between Chinese, Japanese, and Korean) uses Chinese Pinyin, and thus may not resemble the Japanese and Korean romanisations at all. Considering that Japanese writing makes extensive use of these characters, using ICU to transliterate Japanese texts may be a bad idea depending on your use case.

>>> tr("日本国")  # "Nippon-koku" in Japanese, meaning "Japan"
'Rì Běn Guó'

Oops, that could start an Internet war. Use a different library if you need to properly deal with Japanese text.

The JavaScript kuroshiro.js module seems to do Kanji romanisation really well; not sure if there is a Python equivalent.

> const kuroshiro = require("kuroshiro");
> kuroshiro.init(() => {});
> kuroshiro.convert("日本国", { to: "romaji", mode: "furigana" });
'<ruby>日本<rp>(</rp><rt>nippon</rt><rp>)</rp>国<rp>(</rp><rt>koku</rt><rp>)</rp></ruby>'

It looks amazing: 日本(nippon)国(koku). Remove the mode if you just want the plain romaji: nippon koku.

Another unfortunate thing with ICU is that there are still scripts that it doesn’t support at all. For example, it can’t transliterate to/from Javanese.

>>> tr("ꦫꦩꦏꦮꦸꦭꦲꦶꦁꦱ꧀ꦮꦂꦒ")
'ꦫꦩꦏꦮꦸꦭꦲꦶꦁꦱ꧀ꦮꦂꦒ'

Maybe one day.

Using GitLab’s CI server

GitLab provides a continuous integration service, which is pretty nice for building, testing, and packaging your software and having all the UI integrated in GitLab. If you’re just using the free GitLab.com hosting, you get to utilise their Docker-based runner. (If your build process requires a non-Linux OS you’ll have to provide your own runner.)

Getting a basic build up and running is pretty simple. For example, here’s one job named test that only runs make check:

# .gitlab-ci.yml
test:
  script:
    - make check

If your test suite can measure code coverage, GitLab can also show it in the UI. At the moment this feature is rather rudimentary and requires you to go to the project settings and enter a regular expression to find the coverage amount in the build output.

The following is an example that works with coverage.py when you only have a single Python file. I haven’t tried it with multiple files; it may require a wrapper script that calculates the total coverage amount.

# .gitlab-ci.yml
test:
  image: python:3-alpine
  script:
    - pip install coverage
    - coverage run foo.py
    - coverage report -m
# Regex that matches the coverage amount:
# ^\S+\.py\s+\d+\s+\d+\s+(\d+\%)

A few lessons learnt from setting up test hooks for a small Python app:

~~There is no way to test changes to your build script without pushing a commit. And then the build results will stay in your project page forever with no way to clean up the irrelevant ones.~~ You can run builds locally with gitlab-runner, e.g. gitlab-runner exec shell test to run the test job on the local shell (replace shell with docker to use Docker). (Thanks Evan Felix for the info about gitlab-runner.)
GitLab.com’s default Docker image (ruby:2.1 at the time of writing) is really fast to spin up, possibly because it’s cached. However, you should still explicitly name a Docker image in case the default changes.
Installing packages is slower than downloading a Docker image. It’s not worth going out of your way to use the default image if you then have to call apt-get. See if your compiler has an official Docker image that has all the packages you need (please don’t run Docker images by random people). That said, we’re talking about differences of about ten seconds, so just choose the method that is most convenient.
The ruby:2.1 image has Python 2 but not Python 3.
The official Python repository on Docker Hub lists a number of Alpine-based images. As you would expect, these are smaller and slightly faster to download than the other (Debian-based) images.
The coverage regex requires the percent sign to be escaped (\%).