This is an account of my experience using the mdoc format to write a manpage, also touching upon the
ecosystem around writing manpages in general.
man or mdoc
When writing a manpage, the first thing you must do is choose between two formats: man or mdoc.
The code for the two formats looks somewhat similar because they are both extensions to the roff macro
language, but they use completely different sets of macros.
In the Linux world, the man format is more commonly seen, and the manpage renderer of choice is usually
GNU roff (groff). The newer mdoc format is more often
used by BSD operating systems, usually rendered with mandoc. Both groff and
mandoc support man and mdoc; there are minor differences—more on those later—but for the most
part both formats work on both renderers, as well as on others such as
Heirloom doctools.
The way the two formats are usually described is that man is more presentation-oriented while
mdoc is more semantic-oriented.
Due to man's presentational nature, people have written man pages for even very simple programs in
countless different ways. Here is just one example:
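.TH MYPROG 1 "January 1, 1970" "MyOS 2.0" "User commands"
.SH NAME
myprog \- does something
.SH SYNOPSIS
.B myprog
[\fB\-o\fR \fIoutput\fR]
And here is the mdoc equivalent, for which there is essentially one canonical way to write it: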
.Dd January 1, 1970
.Dt MYPROG 1
.Os MyOS 2.0
.Sh NAME
.Nm myprog
.Nd does something
.Sh SYNOPSIS
.Nm myprog
.Op Fl o Ar output
Both may be rendered into:
Volume 1: User commands — MYPROG
NAME
myprog — does something
SYNOPSIS
myprog [-o output]
On being semantic-oriented
mdoc makes you think less about things that don’t matter and lets you just write the stupid manpage and
get it over with.
It may not be as flexible as man, but as a result you don’t ever have to think about whether a
command-line flag should be written in bold or italics. The renderer will pick one formatting and you just have to
trust that it will look acceptable.
mdoc does not always succeed in being semantic-oriented, though; the biggest offender being…
The SYNOPSIS problem
The traditional “synopsis” manpage section is treated in a special way in mdoc. For this special treatment
to work, the title has to be written exactly SYNOPSIS, in English, in all-caps.
These two sections will be rendered very differently, despite the bodies being identical in the source:
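.Sh SYNOPSIS
.Nm myprog
.Op Fl o Ar output
.Sh EXAMPLES
.Nm myprog
.Op Fl o Ar output
Under SYNOPSIS, the .Nm macro begins a new output line with hanging indentation and the .Op lines wrap in a
special way; under any other title the same macros simply render inline.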
Luckily, groff at least accepts Synopsis, and the mandoc author has said that they want to support
that as well in the future. I’m not sure if either project supports or plans to support non-English section titles.
roff-isms
Although mdoc is not horrible in terms of readability, it is still just a roff extension, and you
sometimes need to deal with oddities stemming from roff syntax.
The \& dummy escape sequence is one such oddity; you will need it in two situations:
To prevent a period (.) at the end of a word from being considered the end of a sentence. For
example, to write “fruits, e.g. apples” you have to write fruits, e.g.\& apples; otherwise the
renderer will put two space characters after “e.g.” when rendering to the terminal.
To prevent a period (.) or a single quote (') at the start of a line from being
considered the start of a roff command. For example, to write “...” at the start of a paragraph you have
to write \&....
Other escape sequences often encountered:
\e, \(rs, or \[rs] for \, which is the (default) escape
character.
\(aq or \[aq] for '. The plain ' character itself may be
rendered as ’ to cater for English contractions such as “can’t”.
\(dq or \[dq] for ". The plain " character is used on
roff macro lines to quote string arguments.
\- for ASCII -, as opposed to some other dash or minus character that the renderer may
decide to turn plain - into. In practice I don’t think you need this on groff and mandoc.
Editor support
Emacs, Vim, and Neovim support roff syntax highlighting but do not understand mdoc macros. And
that’s before even talking about code completion, error checking, or any form of in-editor documentation; none
exist as far as I can tell. There are a lot of mdoc macros, all with very terse 2- or 3-character
names, some with very particular syntaxes, and without proper editor support I found myself constantly looking
things up in the documentation.
To help make things more convenient for myself, and perhaps eventually others, I started writing a (Web-based)
mdoc editor with some of these features. But because that ended up being a huge digression from what I was
originally trying to do—which was simply to write a manpage for my program—I’ve shelved this editor project for
now. It’s in a usable state, though, and it was quite helpful for me when writing my manpage, so maybe I’ll clean
it up and publish it one day.
This all seems so annoying; can’t I just use Markdown?
That was my thought at various points in this journey, and I kept going back and forth between “I should write this
in mdoc”, “I should write this in man”, and “I should write this in 〈some intermediate
format〉”.
There are several projects that let you write manpages in Markdown or similar ‘human-readable’ markup languages and
compile them into man or mdoc format. The Git project, for example, uses AsciiDoc, which is not
Markdown but has a similar goal.
There are also a few XML-based solutions that you can try if that’s more your style. DocBook would be the most
well-known option, although my experience using it to produce a manpage has been very poor.
I’ve looked into most of these manpage transpilers, but for the particular project I was working on I decided to
stick with plain mdoc for the time being. Even so, I encourage you to try them and see if there is one
that fits your project. The simpler you make the documentation process, the more you and your peers will want to
write documentation.
It has always bothered me that the Node.js standard library doesn’t provide an easy way to create ZIP files. What
the standard library does have is an implementation of the Deflate algorithm, which is the compression scheme
normally used in ZIP files. This means that the most complicated and performance-sensitive part of the job has already
been done for us, and all we need to do is add the wrapper. So I decided to write it for fun: a ZIP implementation
for Node.js that uses the standard library’s Deflate implementation.
This post contains some (very unorganised) notes that document the trickier parts of this process. Even though my
implementation only covers the compression side (zip, not unzip), some of these notes are applicable to the
decompression side as well.
Note: In this post, Explorer refers to Windows Explorer on Windows 10, 7-Zip refers to the
official 7-Zip program (not p7zip), and Info-ZIP is the project supplying the zip and
unzip programs included on most Linux distributions and on MSYS2/Cygwin.
What we need
First we need the ZIP specification. That’s the easiest part because the specification is open and freely
implementable by anyone. PKWARE, the creators of the ZIP format, have continually published it in a single text
file named APPNOTE.TXT. What follows are a number of decisions you need to make depending on how complete you want your implementation
to be.
There are two types of ZIP files: (normal) ZIP and ZIP64. The biggest difference is that a normal ZIP file uses
32-bit values for file sizes and offsets, while ZIP64 uses 64-bit values. The details of ZIP64 can be a bit weird
due to it being shoehorned into an existing format that was not designed to accommodate such a change. You can opt
to ignore ZIP64, but that will mean that you won’t be able to handle any individual file greater than 4
GiB—this includes any uncompressed file size, any compressed file size, or (roughly) the size of the
ZIP file itself.
We don’t actually need compression! ZIP files can store uncompressed files, though obviously that would greatly
limit the applicable use cases. Technically the ZIP specification supports many different compression formats,
including modern ones like xz and Zstandard, but in practice you should probably stick to Deflate and maybe
Deflate64 because those are the ones with widespread support (and by that I mean they are the only ones supported
by Explorer). As mentioned previously, Node.js includes a Deflate implementation; this is available in the
zlib module.
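For example, a minimal sketch using the synchronous API (a real archiver would want the streaming variants):
const { deflateRawSync } = require('node:zlib');
// ZIP's "deflate" method (8) stores a raw Deflate stream, i.e. without
// the zlib header and Adler-32 trailer, which is exactly what the
// deflateRaw family of functions produces.
const compressed = deflateRawSync(Buffer.from('hello, world'));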
ZIP files use CRC-32 for error detection. Node.js again has our back here, as the zlib module also
contains a CRC-32 implementation. If you want to write your own implementation, the code listing in
RFC 1952 is very easy to follow.
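For illustration (this requires a fairly recent Node.js version, which is an assumption on my part):
const { crc32 } = require('node:zlib');
// Returns an unsigned 32-bit integer; the optional second argument
// lets you continue a running checksum across multiple chunks.
const checksum = crc32(Buffer.from('hello, world'));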
File sizes (compressed and uncompressed) can be indicated either before or after each file payload. Putting them
before the payload results in a slightly simpler ZIP file structure. However, if your use case involves live
streaming of the ZIP file, putting file sizes after each payload may be your only option.
A proper implementation might include concurrency/multithreading capability. The simplest way is probably to run
the compression procedure on multiple input files at once, outputting to temporary files, and then combining them
into the actual ZIP file.
High-level view
The ZIP specification is fairly good in terms of giving a general overview of the format. Refer to
APPNOTE.TXT section 4.3
“General Format of a .ZIP file”.
Ignoring some of the fancier features like encryption, a ZIP file consists of a bunch of file sections (each
including the contents of a file and some metadata), followed by a bunch of “central directory records”, and
finally followed by an “end of central directory record” section. Each central directory record partially
duplicates some of the metadata specified in a file section. The purpose of the central directory is to locate the
file sections within the archive (which, back when people regularly had to use split/multi-volume archives, could
be on a different disk).
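As a concrete illustration, here is a rough sketch of serialising a local file header for one file section, with most fields hard-coded for brevity:
// A local file header as described in APPNOTE.TXT section 4.3.
// Assumes an ASCII file name and already-known CRC and sizes.
function localFileHeader(name, crc, compressedSize, uncompressedSize) {
  const nameBytes = Buffer.from(name, 'ascii');
  const buf = Buffer.alloc(30 + nameBytes.length);
  buf.writeUInt32LE(0x04034b50, 0); // local file header signature
  buf.writeUInt16LE(20, 4);         // version needed to extract (2.0)
  buf.writeUInt16LE(0, 6);          // general purpose bit flag
  buf.writeUInt16LE(8, 8);          // compression method: 8 = Deflate
  buf.writeUInt16LE(0, 10);         // last mod time (MS-DOS format)
  buf.writeUInt16LE(0x21, 12);      // last mod date (here: 1980-01-01)
  buf.writeUInt32LE(crc, 14);       // CRC-32 of the uncompressed data
  buf.writeUInt32LE(compressedSize, 18);
  buf.writeUInt32LE(uncompressedSize, 22);
  buf.writeUInt16LE(nameBytes.length, 26); // file name length
  buf.writeUInt16LE(0, 28);         // extra field length
  nameBytes.copy(buf, 30);
  return buf;
}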
File metadata: times
By default, each file in a ZIP archive has limited metadata: name, size, modification time, and OS-specific
attribute bits. The modification time is in the horrific
MS-DOS date/time format, which can only represent years from 1980 to 2107 (inclusive), has a granularity of two seconds, and doesn’t
specify a timezone (most implementations will use the local timezone, meaning that the output is not portable). Use
extended metadata if you want to be more accurate than that, or ignore the problem and live like it’s 1989, when
ZIP was first released.
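A sketch of the conversion from a JavaScript Date, using the local timezone like most implementations do (range clamping omitted):
function toDosDateTime(d) {
  // Date bits: 15-9 = years since 1980, 8-5 = month (1-12), 4-0 = day.
  const date = ((d.getFullYear() - 1980) << 9) | ((d.getMonth() + 1) << 5) | d.getDate();
  // Time bits: 15-11 = hours, 10-5 = minutes, 4-0 = seconds halved.
  const time = (d.getHours() << 11) | (d.getMinutes() << 5) | (d.getSeconds() >> 1);
  return { date, time };
}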
The choice of file attribute format
The external file attributes section of a file entry header allows specifying file attributes based on the
host operating system. For example, Windows file attributes include the Archive flag while Unix file attributes
include permission bits. If you want to only ever write in one format, though, the Unix format may be the best
choice because that’s the platform where users often have to deal with file permissions.
Info-ZIP, being first and foremost a Unix program, uses the Unix format even on Windows. The way it determines the mode
of a file comes directly from the Cygwin emulation layer, which sets the executable (x) permission bits on a file
if any of these applies:
The file name ends with .exe.
The file name ends with .lnk (Windows shortcut file).
The file content starts with MZ (Portable Executable (PE) file, generally what .exe and
.dll files are).
The file content starts with #! (Unix shebang).
The file content starts with : followed by a carriage return (byte 13). I have no idea what this is.
However, you probably shouldn’t copy this behaviour, because the executable bit is irrelevant when extracting on
Windows. What you should do instead is think about what could be considered executable on Unix. Obviously you want
to keep the shebang detection. You may want to add ELF file detection (0x7f followed by ELF) because,
for example, the user could be cross-compiling from Windows to Linux and wants to zip up the output. The
.exe and/or MZ heuristics are still valuable due to things like Wine, Mono, and .NET Core
combined with binfmt_misc.
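If you go down that path, the detection could be a sketch along these lines (the function name is mine; the magic numbers are the ones discussed above):
function looksExecutableOnUnix(name, content) {
  if (name.toLowerCase().endsWith('.exe')) return true;   // Wine, Mono, .NET Core
  if (content.length >= 2 && content[0] === 0x23 && content[1] === 0x21)
    return true;                                          // '#!' shebang
  if (content.length >= 2 && content[0] === 0x4d && content[1] === 0x5a)
    return true;                                          // 'MZ' (PE)
  if (content.length >= 4 && content.readUInt32BE(0) === 0x7f454c46)
    return true;                                          // 0x7f 'ELF'
  return false;
}
The resulting Unix mode (say, 0o100755 instead of 0o100644) then goes into the high 16 bits of the external file attributes field.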
Of course, you can also ignore all this and store every file with no executable permissions. Or you can use the
Windows file attributes format when writing on Windows; this is not something I explored so I can’t comment much on
it.
How various zip and unzip programs deal with symbolic links on Windows
Windows distinguishes between a file symlink and a directory symlink—you can make a file symlink pointing to a
directory or vice versa if you wanted to for some reason, and when creating a symlink to a nonexistent location you
must decide which type you expect the target to be. This distinction does not exist on Unix and cannot be made when
the file metadata is recorded in Unix format.
zip:
Explorer and 7-Zip (on Windows at least) do not support recording symlinks in ZIP files. Both will only store file
copies.
Info-ZIP will normally also store file copies unless given the --symlinks flag, in which case it will
store symlinks as actual symlinks. Since Info-ZIP always uses the Unix metadata format, the distinction between
file symlink and directory symlink on Windows is lost.
unzip:
Explorer will straight up ignore the Unix symlink flag and extract the symlink as a text file containing the
target path.
On Windows, 7-Zip will always extract a Unix symlink as a file symlink.
On Windows, Info-ZIP will extract a Unix symlink according to whether the target path points to a file or a
directory; if the target is nonexistent, it reports an error and skips the symlink.
Preventing security issues from very large extraction sizes
This is absolutely irrelevant to creating ZIP files (it matters only when extracting them), but I've included it
here because it's a fascinating topic.
A ZIP file that is designed to cause resource (CPU/disk/memory/bandwidth/stack/filedesc) exhaustion is called a
ZIP bomb. They employ one or several nasty tricks; what follows is a summary of the ones to look out
for, though there may be others I'm not aware of.
It’s actually very easy to achieve a compression ratio in the millions (yes, 1000000-to-1 or higher) naturally without
any special tricks if you have highly-compressible data, such as a long string of identical characters, leading
to surprisingly large decompressed files. When handling untrusted data, enforce upper limits on the amount of
memory and disk space you can use for extraction. If you are developing a ZIP extractor, give users the ability
to set these limits.
The uncompressed size declared in a file header may not match the actual decompressed size of its payload. While
decompressing a file, check that it hasn’t exceeded the stated uncompressed size (see the sketch after this
list). Also note that the uncompressed size for a file is listed in two places (local file header and central
directory), neither of which can be trusted.
If you are developing software that wants to analyse ZIP files recursively (inside one another)—for example an
antivirus program—be aware that a tiny ZIP file at the top level can lead to absolutely gargantuan files a few
levels deeper, and that a ZIP file can be crafted to nest infinitely (!). Do not analyse or extract ZIP files
recursively in an unbounded manner.
A file entry can be referred to multiple times from the central directory; similarly, it’s also possible to nest
or overlap file entries. This results in the same payload potentially being extracted many times. A ZIP extractor
needs to scan or keep track of file entry start and end points and make sure that none of them duplicate or
overlap.
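The size check from the second point above could look something like this sketch, using Node's streaming API (the function name and error handling are mine):
const { createInflateRaw } = require('node:zlib');
// Decompress `input`, aborting as soon as the output exceeds
// the size that the ZIP file headers claimed it would be.
function boundedInflate(input, declaredSize, callback) {
  const inflater = createInflateRaw();
  let total = 0;
  inflater.on('data', (chunk) => {
    total += chunk.length;
    if (total > declaredSize) {
      inflater.destroy(new Error('output exceeds declared uncompressed size'));
    }
  });
  inflater.on('error', (err) => callback(err));
  inflater.on('end', () => callback(null, total));
  inflater.end(input);
}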
The result
This is really nothing to write home about, but here is
the code of my Node.js-based ZIP archiver if you are
interested. As the project name suggests, it’s a basic implementation that takes shortcuts to produce the simplest
(but still compressed) ZIP files. The code is licensed under AGPL-3.0-or-later.
While comparing a Rust executable for Windows targeting x86_64-pc-windows-gnu and one targeting
x86_64-pc-windows-msvc, I noticed that the -gnu one included an embedded
application manifest
resource. This particular manifest does two things: it sets
requestedExecutionLevel
to asInvoker, and it declares
supported OS versions from Windows Vista to Windows 10.
As far as I can tell, the first part attempts to disable the Windows
installer detection heuristics. However, the documentation appears to indicate that these heuristics are only used on 32-bit binaries, and the
fact that the -msvc executable doesn’t have the manifest reinforces the idea that it’s not needed.
The second part of the manifest is only useful if you want to indicate that you don’t support Windows versions
prior to Vista. I think for most people that would be the default assumption these days.
These things considered, it seems to me that removing the manifest shouldn’t cause any issues. The problem is that
there doesn’t seem to be any built-in way to do this provided by either the OS or the compiler toolchain. You may
have to rely on a third-party tool to do this.
If you don’t mind deleting all embedded resources in the executable—by default there will just be the application
manifest—you can use this simple C code (replace file.exe with your executable path):
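#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* BeginUpdateResource/EndUpdateResource rewrite the executable's
       resource section in place. TRUE = delete all existing resources. */
    HANDLE update = BeginUpdateResourceA("file.exe", TRUE);
    if (update == NULL) {
        fprintf(stderr, "BeginUpdateResource failed: %lu\n", GetLastError());
        return 1;
    }
    /* FALSE = commit the changes rather than discard them. */
    if (!EndUpdateResourceA(update, FALSE)) {
        fprintf(stderr, "EndUpdateResource failed: %lu\n", GetLastError());
        return 1;
    }
    return 0;
}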
The slightly bad news here is that, in my testing, removing this manifest only reduces the executable size by
exactly 1024 bytes. Considering the x86_64-pc-windows-gnu target generally produces executables in the
hundreds of kilobytes at least, this is a fairly inconsequential saving which I probably won’t bother with.
Here’s a new blog, replacing my old one at wordpress.com. I promise this has nothing to do with the recent
WordPress drama; I have no horse in that race (neither side looks good), and I abandoned the old blog years ago due
to my own dislike of WordPress and wordpress.com.
Back then I spent a lot of time trying a few SSGs that seemed to suit
my preferences, but didn’t find any that I really liked. At the time I already considered writing my own generator
but the project didn’t get far. Now that I’m finally making the move happen, I had another look at some of the
options and tried a couple of others, and still none of them made me happy (Astro was close). There’s probably the
right SSG for me out there, but I’m not going to spend months trying
all of them.
So I wrote my own static site generator. I have no ambitions for this to become a generic tool, so that saves me
the pressure of making the code elegant. The code is written in TypeScript, with JSX for templating and Markdown
for writing content. Rather than going the usual route of reusing the JSX runtime from Preact or other similar
projects, I wrote my own JSX runtime from scratch just to learn how.
The biggest problem with this whole setup is the uncomfortably slow speed. I did a bit of poking around and the
problem comes down to:
Fairly slow startup. I suspect this is because of the compilation from TypeScript+JSX to JavaScript. I
might try a different compiler, or a runtime like Deno or Bun that has direct TypeScript support.
Very slow syntax highlighting. For this one, I guess I can cache the HTML output or change to a less
accurate highlighter.
I’m quite happy with the result so I’m going live with this. There are nice-to-have features that I want like
syndication feed (RSS/Atom) and search, but they will have to come in the future.
I’ll slowly migrate my old WordPress posts here while updating the contents if necessary. I don’t plan to add
comment support to this blog, so I’m thinking of linking to Mastodon, for which I’ve just created an account as
well (see the About page).
Update: After more testing, the highlighter that I thought was better wasn’t actually that good;
I’ve switched to a faster one and now highlighting speed is at a reasonable level, and quality seems fine so far.
Update 2: I’ve added a rather bare-bones Atom feed that doesn’t include the article contents
inline. Some feed readers can automatically download articles as you view them, but unfortunately for many readers
you’ll have to click through to the site.
Make is great, but there are some issues with it that
are probably impossible to fix now. So I’ve been looking for a replacement that I can use for simple task
automation; surely in the 40+ years of Make’s lifetime someone has written something better, right?
These are the notes I made while evaluating the different options that I explored. I’m interested to see if anyone
has comments, corrections, or other suggestions.
Make
Implemented in: Various programming languages (probably mainly C).
Script language: Make (various dialects).
Metaprogramming: Yes (major implementations).
Biggest issue: Stringly typed.
Good old Make. Very nice for all sorts of tasks, until you need to deal with files containing spaces and/or quote
characters, at which point things start to go downhill.
Ninja
This is mentioned only for completeness, because Ninja is in a completely different ballpark. It’s meant to be a
target language for buildfile generators like CMake and Meson, so by itself it has zero programmability, and you
wouldn’t really want to write it by hand.
This feels like a Make variant with fewer features than most Make implementations. I don’t see this as a practical
choice for any project, at least right now.
Task
I actually really like the idea of Task. For very simple use cases it’s very elegant, because its whole syntax is
just 100% valid YAML with a bit of string templating. The templating sometimes gets in the way, though, because the
use of {{.VAR}} for variables
conflicts with YAML’s
{a: b} map syntax, forcing you to waste one level of string quoting on it.
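For example, in a hypothetical task file, a command consisting entirely of a template expression has to be quoted, because YAML would otherwise try to read the leading {{ as flow-mapping syntax rather than a string:
version: '3'
tasks:
  greet:
    vars:
      CMD: echo hello
    cmds:
      # Writing `- {{.CMD}}` here would not survive YAML parsing
      # as a string; the braces force you to quote the whole thing:
      - '{{.CMD}}'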
A bigger flaw is that there’s no easy way to override variables from the command line. I think you can work around
this by jumping through scripting hoops, but then you lose a ton of elegance points.
And the biggest flaw: it’s still stringly typed just like Make, so you’ll have trouble separating strings from
lists.
This looked promising until I noticed that outputs didn’t seem to be tracked anywhere, which means everything gets
rebuilt all the time. Is this really the case or am I missing something?
Grunt
Grunt’s documentation is rather bad, and the examples they have all throw you in at the deep, ugly end. Skimming
these introductory materials, I couldn’t figure out how to write even the simplest build file, which seems like a bad sign.
Rake
Biggest issue: Ruby’s shell module is broken on Windows.
This one looked very promising. I’m not a fan of Ruby, but was willing to put up with it because Rake
seemed to do all I wanted. But then I discovered that Ruby’s most reasonable subprocess handler, the
shell module, breaks on Windows. Without it, you’re back to various
ugly half-baked APIs, each
with their own limitations.
Two competing Python-based build systems. These seem too complicated for my use cases. I think making them suit
simple tasks would be a significant undertaking. Or perhaps I’m just missing documentation that is not mainly
targeted at people trying to create a build pipeline for their C projects.
doit
With this one you end up with lots of boilerplate, because rather than writing tasks, you’re writing
task creators. It makes sense, but it feels like working at too low a level when all you want is a
simple Make alternative.
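For reference, a minimal dodo.py with a single task creator might look like this (task and file names made up):
# doit picks up functions whose names start with `task_` from dodo.py.
def task_compress():
    """Compress data.txt into data.txt.gz."""
    return {
        'actions': ['gzip -kf data.txt'],
        'file_dep': ['data.txt'],
        'targets': ['data.txt.gz'],
    }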
The author of doit
suggests several high-level interfaces
that can be implemented on top of doit. They do limit what you can do, but you can always write normal doit task
creators in addition to the simplified versions. I think this is a reasonable compromise and I particularly like
the decorator version.
The only remaining problem, then, is that Python’s subprocess handling is very cumbersome. There are two libraries
I know of that can rectify this: sh and Plumbum. sh, in my opinion, is
not suitable for a Make-replacement use case. The way it does piping by default is not in line with what we
expect, coming from Make. Plumbum is not perfect but better (you still
have to end everything with .run_fg() or the magical & FG).
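As a sketch of what that looks like in practice (the command itself is made up):
from plumbum import local, FG
# Bind arguments to a command, then run it in the foreground,
# inheriting the terminal's stdout/stderr.
local['gzip']['-kf', 'data.txt'] & FG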
A quirk of doit is that it creates a cache file (or files) alongside your build file. Depending on the exact
database backend used, it can create up to three files, which I’d say is not ideal.
Conclusion
I have for now settled on doit + Plumbum with around 100 lines of support code. I’m not fully happy with this, and
I’m not sure it can cover all my use cases, but I think it’s time for me to put my ideas and investigations out
there and seek comments.
Rake is almost what I need, if not for what I believe is a bug in Ruby’s standard library. But even if it’s fixed,
I’d prefer to stick with a Python-based solution if possible.