My experience writing an mdoc manpage

This is an account of my experience using the mdoc format to write a manpage, also touching upon the ecosystem around writing manpages in general.

man or mdoc

When writing a manpage, the first thing you must do is choose between two formats: man or mdoc. The code for the two formats look somewhat similar because they are both extensions to the roff macro language, but they use completely different sets of macros.

In the Linux world, the man format is more commonly seen, and the manpage renderer of choice is usually GNU roff (groff). The newer mdoc format is more often used by BSD operating systems, usually rendered with mandoc. Both groff and mandoc support man and mdoc; there are minor differences—more on those later—but for the most part both formats work on both renderers, as well as on others such as Heirloom doctools.

The way the two formats are usually described is that man is more presentation-oriented while mdoc is more semantic-oriented.

Due to its presentational nature, there are countless ways people have written man pages for even very simple programs. Here is just one example:

.TH MYPROG 1 1970-01-01 "MyProg 2.0"
.SH NAME
myprog \- does something
.SH SYNOPSIS
.B myprog
.RB [ \-o
.IR output ]

And here is the equivalent in mdoc:

.Dd January 1, 1970
.Dt MYPROG 1
.Os MyProg 2.0
.Sh NAME
.Nm myprog
.Nd does something
.Sh SYNOPSIS
.Nm myprog
.Op Fl o Ar output

Both may be rendered into:

Volume 1: User commands — MYPROG

NAME

myprog — does something

SYNOPSIS

myprog [-o output]

MyProg 2.0 — 1970‒01‒01

On being semantic-oriented

mdoc makes you think less about things that don’t matter and lets you just write the stupid manpage and get it over with.

It may not be as flexible as man, but as a result you don’t ever have to think about whether a command-line flag should be written in bold or italics. The renderer will pick one formatting and you just have to trust that it will look acceptable.

mdoc does not always succeed in being semantic-oriented, though; the biggest offender being…

The `SYNOPSIS` problem

The traditional “synopsis” manpage section is treated in a special way in mdoc. For this special treatment to work, the title has to be written exactly SYNOPSIS, in English, in all-caps.

These two sections will be rendered very differently despite the bodies being identical in the source:

.Sh SYNOPSIS
.Nm pkg Cm install Ar package
.Nm pkg Cm uninstall Oo Fl \-purge Oc Ar package
.
.Sh Not synopsis
.Nm pkg Cm install Ar package
.Nm pkg Cm uninstall Oo Fl \-purge Oc Ar package

Result:

SYNOPSIS

pkg install package
pkg uninstall [--purge] package

Not synopsis

pkg install package pkg uninstall [--purge] package

Luckily, groff at least accepts Synopsis, and the mandoc author has said that they want to support that as well in the future. I’m not sure if either project supports or plans to support non-English section titles. Frankly I find it extremely questionable for macros to behave differently based on the title of the section they are in.

roff-isms

Despite being not horrible in terms of readability, mdoc is still just a roff extension and you still sometimes need to deal with oddities stemming from the roff syntax.

The \& dummy escape sequence is one such oddity that you may find in several instances:

To prevent a period (.) at the end of a word from being considered the end of a sentence. For example, to write “fruits, e.g. apples” you have to write fruits, e.g.\& apples; otherwise the renderer will put two space characters after “e.g.” when rendering to the terminal.
To prevent a period (.) or a single quote (') at the start of a line from being considered the start of a roff command. For example, to write “...” at the start of a paragraph you have to write \&....
Not actually a roff-ism, as macro nesting is an mdoc feature, but \& is also used to prevent a string in a macro line from being considered a nested macro invocation. For example, .Em At midnight prints out “AT&T UNIX midnight”, not “At midnight” with emphasis. To avoid the “At” being interpreted as a macro call, you have to write .Em \&At midnight.

Other escape sequences often encountered:

\e, \(rs, or \[rs] for \, which is the (default) escape character.
\(aq or \[aq] for '. The plain ' character itself may be rendered as ’ to cater for English contractions such as “can’t”.
\(dq or \[dq] for ". The plain " character is used on roff macro lines to quote string arguments.
\- for ASCII -, as opposed to some other dash or minus character that the renderer may decide to turn plain - into. In practice I don’t think you need this on groff and mandoc.

Editor support

Emacs, Vim, and Neovim support roff syntax highlighting but do not understand mdoc macros. And that’s before even talking about code completion, error checking, or any form of in-editor documentation; none exist as far as I can tell. There are a lot of mdoc macros, all with very terse 2- or 3-character names, some with very particular syntaxes, and without proper editor support I found myself constantly looking things up in the documentation.

To help make things more convenient for myself, and perhaps eventually others, I started writing a (Web-based) mdoc editor with some of these features. But because that ended up being a huge digression from what I was originally trying to do—which was simply to write a manpage for my program—I’ve shelved this editor project for now. It’s in a usable state, though, and it was quite helpful for me when writing my manpage, so maybe I’ll clean it up and publish it one day.

This all looks so annoying; can’t I just use Markdown?

That was my thought at various points in this journey, and I kept going back and forth between “I should write this in mdoc”, “I should write this in man”, and “I should write this in 〈some intermediate format〉”.

There are several projects that let you write manpages in Markdown or similar ‘human-readable’ markup languages and compile them into man or mdoc format. The Git project, for example, uses AsciiDoc, which is not Markdown but has a similar goal.

There are also a few XML-based solutions that you can try if that’s more your style. DocBook would be the most well-known option, although my experience using it to produce a manpage has been very poor.

I’ve looked into most of these manpage transpilers, but for the particular project I was working on I decided to stick with plain mdoc for the time being. Even so, I encourage you to try them and see if there is one that fits your project. The simpler you make the documentation process, the more you and your peers will want to write documentation.

Creating a zip implementation for Node.js

It has always bothered me that the Node.js standard library doesn’t provide an easy way to create ZIP files. What the standard library does have is an implementation of the Deflate algorithm, which is the compression scheme normally used in ZIP files. This means that the most complicated and performance-sensitive part of job has already been done for us, and all we need to do is add the wrapper. So I decided to write it for fun: a ZIP implementation for Node.js that uses the standard library’s Deflate implementation.

This post contains some (very unorganised) notes that document the trickier parts of this process. Even though my implementation only covers the compression side (zip, not unzip), some of these notes are applicable to the decompression side as well.

Note: In this post, Explorer refers to Windows Explorer on Windows 10, 7-Zip refers to the official 7-Zip program (not p7zip), and Info-ZIP is the project supplying the zip and unzip programs included on most Linux distributions and on MSYS2/Cygwin.

What we need

First we need the ZIP specification. That’s the easiest part because the specification is open and freely implementable by anyone. PKWARE, the creators of the ZIP format, have continually published it in a single text file named APPNOTE.TXT. What follows are a number of decisions you need to make depending on how complete you want your implementation to be.

There are two types of ZIP files: (normal) ZIP and ZIP64. The biggest difference is that a normal ZIP file uses 32-bit values for file sizes and offsets, while ZIP64 uses 64-bit values. The details of ZIP64 can be a bit weird due to it being shoehorned into an existing format that was not designed to accommodate such a change. You can opt to ignore ZIP64, but that will mean that you won’t be able to handle any individual file greater than 4 GiB—this includes any uncompressed file size, any compressed file size, or (roughly) the size of the ZIP file itself.

We don’t actually need compression! ZIP files can store uncompressed files, though obviously that would greatly limit the applicable use cases. Technically the ZIP specification supports many different compression formats, including modern ones like xz and Zstandard, but in practice you should probably stick to Deflate and maybe Deflate64 because those are the ones with widespread support (and by that I mean they are the only ones supported by Explorer). As mentioned previously, Node.js includes a Deflate implementation; this is available in the zlib module.

ZIP files use CRC-32 for error detection. Node.js again has our back here, as the zlib module also contains a CRC-32 implementation. If you want to write your own implementation, the code listing in RFC 1952 is very easy to follow.

File sizes (compressed and uncompressed) can be indicated either before or after each file payload. Putting them before the payload results in slightly simpler ZIP file structure. However, if your use case involves live streaming of the ZIP file, putting file sizes after each payload may be your only option.

A proper implementation might include concurrency/multithreading capability. The simplest way probably is to run the compression procedure on multiple input files at once, outputting to temporary files, and then combining them into the actual ZIP file.

High-level view

The ZIP specification is fairly good in terms of giving a general overview of the format. Refer to APPNOTE.TXT section 4.3 “General Format of a .ZIP file”.

Ignoring some of the fancier features like encryption, a ZIP file consists of a bunch of file sections (each including the contents of a file and some metadata), followed by a bunch of “central directory records”, and finally followed by an “end of central directory record” section. Each central directory record partially duplicates some of the metadata specified in a file section. The purpose of the central directory is to locate the file sections within the archive (which, back when people regularly had to use split/multi-volume archives, could be in a different disk).

File metadata times

By default, each file in a ZIP archive has limited metadata: name, size, modification time, and OS-specific attribute bits. The modification time is in the horrific MS-DOS date/time format, which can only represent years from 1980 to 2107 (inclusive), has a granularity of two seconds, and doesn’t specify a timezone (most implementations will use the local timezone, meaning that the output is not portable). Use extended metadata if you want to be more accurate than that, or ignore the problem and live like it’s 1989, when ZIP was first released.

The choice of file attribute format

The external file attributes section of a file entry header allows specifying file attributes based on the host operating system. For example, Windows file attributes include the Archive flag while Unix file attributes include permission bits. If you want to only ever write in one format, though, the Unix format may be the best choice because that’s the platform where users often have to deal with file permissions.

Info-ZIP, being normally a Unix-only program, uses the Unix format even on Windows. The way it determines the mode of a file comes directly from the Cygwin emulation layer, which sets the executable (x) permission bits on a file if any of these applies:

The file name ends with .exe.
The file name ends with .lnk (Windows shortcut file).
The file content starts with MZ (Portable Executable (PE) file, generally what .exe and .dll files are).
The file content starts with #! (Unix shebang).
The file content starts with : followed by a carriage return (byte 13). I have no idea what this is.

However, you probably shouldn’t copy this behaviour, because the executable bit is irrelevant when extracting on Windows. What you should do instead is think about what could be considered executable on Unix. Obviously you want to keep the shebang detection. You may want to add ELF file detection (0x7f followed by ELF) because, for example, the user could be cross-compiling from Windows to Linux and wants to zip up the output. The .exe and/or MZ heuristics are still valuable due to things like Wine, Mono, and .NET Core combined with binfmt_misc.

Of course, you can also ignore all this and store every file with no executable permissions. Or you can use the Windows file attributes format when writing on Windows; this is not something I explored so I can’t comment much on it.

How various zip and unzip programs deal with symbolic links on Windows

Windows distinguishes between a file symlink and a directory symlink—you can make a file symlink pointing to a directory or vice versa if you wanted to for some reason, and when creating a symlink to a nonexistent location you must decide which type you expect the target to be. This distinction does not exist on Unix and cannot be made when the file metadata is recorded in Unix format.

zip:

Explorer and 7-Zip (on Windows at least) do not support recording symlinks in ZIP files. Both will only store file copies.
Info-ZIP will normally also store file copies unless given the --symlinks flag, in which case it will store symlinks as actual symlinks. Since Info-ZIP always uses the Unix metadata format, the distinction between file symlink and directory symlink on Windows is lost.

unzip:

Explorer will straight up ignore the Unix symlink flag and extract the symlink as a text file containing the target path.
On Windows, 7-Zip will always extract a Unix symlink as file symlink.
On Windows, Info-ZIP will extract a Unix symlink according to whether the target path points to a file or a directory; if the target is nonexistent, it reports an error and skips the symlink.

Preventing security issues from very large extraction sizes

This is absolutely irrelevant to creating ZIP files (only to extracting ZIP files), but I’ve included it here because it’s a fascinating topic.

A ZIP file that is designed to cause resource (CPU/disk/memory/bandwidth/stack/filedesc) exhaustion is called a ZIP bomb. They employ one or several forms of nasty tricks; this is a summary of what you need to look out for, but there may be others I’m not aware of.

It’s actually very easy to achieve compression ratio of millions (yes, 1000000-to-1 or higher) naturally without any special tricks if you have highly-compressible data, such as a long string of identical characters, leading to surprisingly large decompressed files. When handling untrusted data, enforce upper limits on the amount of memory and disk space you can use for extraction. If you are developing a ZIP extractor, give users the ability to set these limits.
The uncompressed size declared in a file header may not match the actual decompressed size of its payload. While decompressing a file, check that it hasn’t exceeded the stated uncompressed size. Also note that the uncompressed size for a file is listed in two places (local file header and central directory), neither of which can be trusted.
If you are developing software that wants to analyse ZIP files recursively (inside one another)—for example an antivirus program—be aware that a tiny ZIP file at the top level can lead to absolutely gargantuan files a few levels deeper, and that a ZIP file can be crafted to nest infinitely (!). Do not analyse or extract ZIP files recursively in an unbounded manner.
A file entry can be referred to multiple times from the central directory; similarly, it’s also possible to nest or overlap file entries. This results in the same payload potentially being extracted many times. A ZIP extractor needs to scan or keep track of file entry start and end points and make sure that none of them duplicate or overlap.

The result

This is really nothing to write home about, but here is the code of my Node.js-based ZIP archiver if you are interested. As the project name suggests, it’s a basic implementation that takes shortcuts to produce the simplest (but still compressed) ZIP files. The code is licensed under AGPL-3.0-or-later.

Saving 1 KiB on Rust executables targetting windows-gnu

While comparing a Rust executable for Windows targetting x86_64-pc-windows-gnu and one targetting x86_64-pc-windows-msvc, I noticed that the -gnu one included an embedded application manifest resource. This particular manifest does two things: setting requestedExecutionLevel to asInvoker, and setting supportedOSes from Windows Vista to Windows 10.

As far as I can tell, the first part attempts to disable the Windows installer detection heuristics. However, the documentation appears to indicate that these heuristics are only used on 32-bit binaries, and the fact that the -msvc executable doesn’t have the manifest reinforces the idea that it’s not needed.

The second part of the manifest is only useful if you want to indicate that you don’t support Windows versions prior to Vista. I think for most people that would be the default assumption these days.

These things considered, it looks to me that removing the manifest shouldn’t cause any issues. The problem is that there doesn’t seem to be any built-in way to do this provided by either the OS or the compiler toolchain. You may have to rely on a third-party tool to do this.

If you don’t mind deleting all embedded resources in the executable—by default there will just be the application manifest—you can use this simple C code (replace file.exe with your executable path):

#include <windows.h>
int main(void) {
	return 1 - EndUpdateResourceW(BeginUpdateResourceW(L"file.exe", 1), 0);
}

Here’s a slightly more robust but longer Python alternative:

import ctypes
from ctypes import wintypes
import sys

kernel32 = ctypes.windll.kernel32

def check_trueish(result, *_):
	if result:
		return result
	raise ctypes.WinError()

BeginUpdateResource = kernel32.BeginUpdateResourceW
BeginUpdateResource.restype = wintypes.HANDLE
BeginUpdateResource.argtypes = (ctypes.c_wchar_p, wintypes.BOOL)
BeginUpdateResource.errcheck = check_trueish

EndUpdateResource = kernel32.EndUpdateResourceW
EndUpdateResource.restype = wintypes.BOOL
EndUpdateResource.argtypes = (wintypes.HANDLE, wintypes.BOOL)
EndUpdateResource.errcheck = check_trueish

for arg in sys.argv[1:]:
	file_name = ctypes.create_unicode_buffer(arg)
	handle = BeginUpdateResource(file_name, True)
	EndUpdateResource(handle, False)

The slightly bad news here is that, in my testing, removing this manifest only reduces the executable size by exactly 1024 bytes. Considering the x86_64-pc-windows-gnu target generally produces executables in the hundreds of kilobytes at least, this is a fairly inconsequential saving which I probably won’t bother with.

Hand-programmed blog

Here’s a new blog, replacing my old one at wordpress.com. I promise this has nothing to do with the recent WordPress drama; I have no horse in that race (neither side looks good), and I abandoned the old blog years ago due to my own dislike of WordPress and wordpress.com.

Back then I spent a lot of time trying a few SSGs that seemed to suit my preferences, but didn’t find any that I really liked. At the time I already considered writing my own generator but the project didn’t get far. Now that I’m finally making the move happen, I had another look at some of the options and tried a couple of others, and still none of them made me happy (Astro was close). There’s probably the right SSG for me out there, but I’m not going to spend months trying all of them.

So I wrote my own static site generator. I have no ambitions for this to become a generic tool, so that saves me the pressure of making the code elegant. The code is written in TypeScript, with JSX for templating and Markdown for writing content. Rather than going the usual route of reusing the JSX runtime from Preact or other similar projects, I wrote my own JSX runtime from scratch just to learn how.

The biggest problem with this whole setup is the uncomfortably slow speed. I did a bit of poking around and the problem comes down to:

Fairly slow startup. I suspect this is because of the compilation from TypeScript+JSX to JavaScript. I might try a different compiler, or a runtime like Deno or Bun that has direct TypeScript support.
Very slow syntax highlighting. For this one, I guess I can cache the HTML output or change to a less accurate highlighter.

I’m quite happy with the result so I’m going live with this. There are nice-to-have features that I want like syndication feed (RSS/Atom) and search, but they will have to come in the future.

I’ll slowly migrate my old WordPress posts here while updating the contents if necessary. I don’t plan to add comment support to this blog, so I’m thinking of linking to Mastodon, for which I’ve just created an account as well (see the About page).

Update: After more testing, the highlighter that I thought was better wasn’t actually that good; I’ve switched to a faster one and now highlighting speed is at a reasonable level, and quality seems fine so far.

Update 2: I’ve added a rather bare-bones Atom feed that doesn’t include the article contents inline. Some feed readers can automatically download articles as you view them, but unfortunately for many readers you’ll have to click through to the site.

On the search for a Make replacement

Make is great, but there are some issues with it that are probably impossible to fix now. So I’ve been looking for a replacement that I can use for simple task automation; surely in the 40+ years of Make’s lifetime someone has written something better, right?

These are the notes I made while evaluating the different options that I explored. I’m interested to see if anyone has comments, corrections, or other suggestions.

Make

Implemented in: Various programming languages (probably mainly C).
Script language: Make (various dialects).
Metaprogramming: Yes (major implementations).
Biggest issue: Stringly typed.

Good old Make. Very nice for all sorts of tasks, until you need to deal with files containing space and/or quote characters, where things start to go downhill.

Ninja

Implemented in: C++.
Script language: Ninja.
Metaprogramming: No.
Biggest issue: Not programmable.

This is only mentioned for completion because Ninja is in a completely different ballpark. It’s meant to be a target language for buildfile generators like CMake and Meson, so by itself it has zero programmability, and you wouldn’t really want to write it by hand.

Just

Implemented in: Rust.
Script language: Just (Make-like).
Metaprogramming: No.
Biggest issue: Stringly typed.

This feels like a Make variant with fewer features than most Make implementations. I don’t see this as a practical choice for any project, at least right now.

Task

Implemented in: Go.
Script language: YAML + Go template.
Metaprogramming: No.
Biggest issue: Stringly typed.

I actually really like the idea of Task. For very simple use cases it’s very elegant, because its whole syntax is just 100% valid YAML with a bit of string templating. The templating sometimes gets in the way, though, because the use of {{.VAR}} for variables conflicts with YAML’s {a: b} map syntax, forcing you to waste one level of string quoting on it.

A bigger flaw is that there’s no easy way to override variables from the command line. I think you can work around this by jumping through scripting hoops, but then you lose a ton of elegance points.

And the biggest flaw: it’s still stringly typed just like Make, so you’ll have trouble separating strings from lists.

Gulp

Implemented in: JavaScript.
Script language: JavaScript.
Metaprogramming: Yes.
Biggest issue: Doesn’t track outputs?

This looked promising until I noticed that outputs didn’t seem to be tracked anywhere, which means everything gets rebuilt all the time. Is this really the case or am I missing something?

Grunt

Implemented in: JavaScript.
Script language: JavaScript.
Metaprogramming: Yes.
Biggest issue: Convoluted just to get started.

Grunt’s documentation is rather bad, and the examples they have all throw you in the deep, ugly end. Skimming these introductory materials, I couldn’t figure out how to write the simplest build file, which seems a bad sign.

Rake

Implemented in: Ruby.
Script language: Ruby.
Metaprogramming: Yes.
Biggest issue: Ruby’s shell module is broken on Windows.

This one looked very promising. I’m not a fan of Ruby, but was willing to put up with it because Rake seemed to do all I wanted. But then I discovered that Ruby’s most reasonable subprocess handler, the shell module, breaks on Windows. Without it, you’re back to various ugly half-baked APIs, each with their own limitations.

SCons and Waf

Implemented in: Python.
Script language: Python.
Metaprogramming: Yes.
Biggest issue: Not designed for simple tasks.

Two competing Python-based build systems. These seem too complicated for my use cases. I think making them suit simple tasks would be a significant undertaking. Or perhaps I’m just missing a documentation that is not mainly targeted at people trying to create a build pipeline for their C projects.

doit

Implemented in: Python.
Script language: Python.
Metaprogramming: Yes.
Biggest issue: Verbose.

With this one you end up with lots of boilerplate because rather than writing tasks, you’re writing task creators. It makes sense, but it feels like doing things at a too-low level when you want it to be a simple Make alternative.

The author of doit suggests several high-level interfaces that can be implemented on top of doit. They do limit what you can do, but you can always write normal doit task creators in addition to the simplified versions. I think this is a reasonable compromise and I particularly like the decorator version.

The only remaining problem, then, is that Python’s subprocess handling is very cumbersome. There are two libraries I know of that can rectify this: sh and Plumbum. sh, in my opinion, is not suitable for use in a Make replacement use case. The way it does piping by default is not in line with what we expect, coming from Make. Plumbum is not perfect but better (you still have to end everything with .run_fg() or the magical & FG).

A quirk of doit is that it creates a cache file (or files) alongside your build file. Depending on the exact database backend used, it can create up to three files, which I’d say is not ideal.

Conclusion

I have for now settled on doit + Plumbum with around 100 lines of support code. I’m not fully happy with this, and I’m not sure it can cover all my use cases, but I think it’s time for me to put my ideas and investigations out there and seek comments.

Rake is almost what I need, if not for what I believe is a bug in Ruby’s standard library. But even if it’s fixed, I’d prefer to stick with a Python-based solution if possible.