Creating a zip implementation for Node.js

It has always bothered me that the Node.js standard library doesn’t provide an easy way to create ZIP files. What the standard library does have is an implementation of the Deflate algorithm, which is the compression scheme normally used in ZIP files. This means that the most complicated and performance-sensitive part of the job has already been done for us, and all we need to do is add the wrapper. So I decided to write it for fun: a ZIP implementation for Node.js that uses the standard library’s Deflate implementation.

This post contains some (very unorganised) notes that document the trickier parts of this process. Even though my implementation only covers the compression side (zip, not unzip), some of these notes are applicable to the decompression side as well.

Note: In this post, Explorer refers to Windows Explorer on Windows 10, 7-Zip refers to the official 7-Zip program (not p7zip), and Info-ZIP is the project supplying the zip and unzip programs included on most Linux distributions and on MSYS2/Cygwin.

What we need

First we need the ZIP specification. That’s the easiest part because the specification is open and freely implementable by anyone. PKWARE, the creators of the ZIP format, have continually published it in a single text file named APPNOTE.TXT. What follows are a number of decisions you need to make depending on how complete you want your implementation to be.

There are two types of ZIP files: (normal) ZIP and ZIP64. The biggest difference is that a normal ZIP file uses 32-bit values for file sizes and offsets, while ZIP64 uses 64-bit values. The details of ZIP64 can be a bit weird due to it being shoehorned into an existing format that was not designed to accommodate such a change. You can opt to ignore ZIP64, but that will mean that you won’t be able to handle any individual file greater than 4 GiB—this includes any uncompressed file size, any compressed file size, or (roughly) the size of the ZIP file itself.
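As a rough illustration of when ZIP64 becomes unavoidable, the sketch below checks an archive's numbers against the limits of the classic fields. The function name and the entry shape ({ size, compressedSize, offset }) are hypothetical. The one detail taken from the format itself is that the all-ones values (0xFFFFFFFF and 0xFFFF) are reserved to mean "look in the ZIP64 record instead", so they cannot be used as real values.

```javascript
// Largest values that fit in classic (non-ZIP64) fields. The all-ones
// values are reserved as "see the ZIP64 record" markers.
const MAX_UINT32_FIELD = 0xffffffff - 1;
const MAX_UINT16_FIELD = 0xffff - 1;

// `entries` is a hypothetical array of { size, compressedSize, offset },
// where `offset` is the position of the entry's local header in the archive.
function needsZip64(entries, centralDirectoryOffset, centralDirectorySize) {
  if (entries.length > MAX_UINT16_FIELD) return true;
  if (centralDirectoryOffset > MAX_UINT32_FIELD) return true;
  if (centralDirectorySize > MAX_UINT32_FIELD) return true;
  return entries.some(
    (e) =>
      e.size > MAX_UINT32_FIELD ||
      e.compressedSize > MAX_UINT32_FIELD ||
      e.offset > MAX_UINT32_FIELD
  );
}
```

An implementation that ignores ZIP64 could use a check like this to fail early with a clear error rather than write a corrupt archive.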

We don’t actually need compression! ZIP files can store uncompressed files, though obviously that would greatly limit the applicable use cases. Technically the ZIP specification supports many different compression formats, including modern ones like xz and Zstandard, but in practice you should probably stick to Deflate and maybe Deflate64 because those are the ones with widespread support (and by that I mean they are the only ones supported by Explorer). As mentioned previously, Node.js includes a Deflate implementation; this is available in the zlib module.
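A minimal sketch of the compression step, assuming the whole file fits in memory. ZIP stores a bare Deflate stream with no zlib or gzip wrapper, which corresponds to the "raw" functions in the zlib module (deflateRawSync and inflateRawSync). The helper name compressForZip and the fallback to "stored" when Deflate does not help are illustrative choices of mine, not a requirement of the format.

```javascript
const zlib = require('node:zlib');

// ZIP compression method IDs from APPNOTE.TXT.
const METHOD_STORED = 0;
const METHOD_DEFLATE = 8;

// Compress a Buffer for a ZIP entry, falling back to "stored" when
// Deflate does not make the data smaller.
function compressForZip(data) {
  const deflated = zlib.deflateRawSync(data, { level: 9 });
  return deflated.length < data.length
    ? { method: METHOD_DEFLATE, payload: deflated }
    : { method: METHOD_STORED, payload: data };
}

const text = compressForZip(Buffer.from('a'.repeat(1000)));
const tiny = compressForZip(Buffer.from('x'));
```

Here text ends up using the Deflate method, while the one-byte tiny input is kept as stored because the Deflate framing alone is larger than the input.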

ZIP files use CRC-32 for error detection. Node.js again has our back here, as the zlib module also contains a CRC-32 implementation (zlib.crc32, added in Node.js 22.2.0 and backported to 20.15.0). If you want to write your own implementation, or need to support older Node.js versions, the code listing in RFC 1952 is very easy to follow.
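For reference, here is a sketch of a table-driven CRC-32 in the spirit of the RFC 1952 listing, used only when zlib.crc32 is missing. The names CRC_TABLE and crc32Fallback are my own. The second parameter lets you feed data in chunks, mirroring the optional second argument of zlib.crc32.

```javascript
const zlib = require('node:zlib');

// Table-driven CRC-32 (reflected polynomial 0xEDB88320), following the
// approach of the sample code in RFC 1952.
const CRC_TABLE = new Uint32Array(256);
for (let n = 0; n < 256; n++) {
  let c = n;
  for (let k = 0; k < 8; k++) {
    c = c & 1 ? 0xedb88320 ^ (c >>> 1) : c >>> 1;
  }
  CRC_TABLE[n] = c >>> 0;
}

// `previous` is the CRC of the data seen so far (0 for the first chunk).
function crc32Fallback(buf, previous = 0) {
  let c = ~previous >>> 0;
  for (const byte of buf) {
    c = CRC_TABLE[(c ^ byte) & 0xff] ^ (c >>> 8);
  }
  return ~c >>> 0;
}

// Prefer the built-in implementation when this Node.js version has it.
const crc32 = typeof zlib.crc32 === 'function' ? zlib.crc32 : crc32Fallback;
```

The standard check value for CRC-32 is 0xCBF43926 for the ASCII string 123456789, which is a convenient sanity test for any implementation.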

File sizes (compressed and uncompressed) can be indicated either before or after each file payload. Putting them before the payload results in a slightly simpler ZIP file structure. However, if your use case involves live streaming of the ZIP file, putting file sizes after each payload may be your only option, because the compressed size and the CRC-32 are not known until the whole payload has been produced.
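The two options can be sketched as follows. With sizes up front, the local file header carries the CRC-32 and both sizes. With sizes after the payload, bit 3 of the general purpose flags is set, those three header fields are zeroed, and a data descriptor follows the payload. The function names and the streaming option are hypothetical, the field layout follows APPNOTE.TXT, and this sketch ignores ZIP64 and assumes ASCII file names.

```javascript
const LOCAL_HEADER_SIG = 0x04034b50;
const DATA_DESCRIPTOR_SIG = 0x08074b50;
const FLAG_DATA_DESCRIPTOR = 1 << 3; // general purpose bit 3

// Build a 30-byte local file header followed by the file name.
// When `streaming` is true, CRC and sizes are left as zero here and are
// expected to be written in a data descriptor after the payload.
function localFileHeader(entry) {
  const { name, method, dosTime, dosDate, crc, compressedSize, size, streaming } = entry;
  const nameBytes = Buffer.from(name, 'latin1');
  const header = Buffer.alloc(30);
  header.writeUInt32LE(LOCAL_HEADER_SIG, 0);
  header.writeUInt16LE(20, 4); // version needed to extract: 2.0
  header.writeUInt16LE(streaming ? FLAG_DATA_DESCRIPTOR : 0, 6);
  header.writeUInt16LE(method, 8);
  header.writeUInt16LE(dosTime, 10);
  header.writeUInt16LE(dosDate, 12);
  header.writeUInt32LE(streaming ? 0 : crc, 14);
  header.writeUInt32LE(streaming ? 0 : compressedSize, 18);
  header.writeUInt32LE(streaming ? 0 : size, 22);
  header.writeUInt16LE(nameBytes.length, 26);
  header.writeUInt16LE(0, 28); // extra field length
  return Buffer.concat([header, nameBytes]);
}

// Build the data descriptor that follows the payload in streaming mode.
// The leading signature is optional per the spec but commonly written.
function dataDescriptor({ crc, compressedSize, size }) {
  const d = Buffer.alloc(16);
  d.writeUInt32LE(DATA_DESCRIPTOR_SIG, 0);
  d.writeUInt32LE(crc, 4);
  d.writeUInt32LE(compressedSize, 8);
  d.writeUInt32LE(size, 12);
  return d;
}
```

Either way, the central directory record written at the end always carries the real CRC and sizes.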

A proper implementation might include concurrency/multithreading capability. Probably the simplest way is to run the compression procedure on multiple input files at once, outputting to temporary files, and then combine them into the actual ZIP file.

High-level view

The ZIP specification is fairly good in terms of giving a general overview of the format. Refer to APPNOTE.TXT section 4.3 “General Format of a .ZIP file”.

Ignoring some of the fancier features like encryption, a ZIP file consists of a bunch of file sections (each including the contents of a file and some metadata), followed by a bunch of “central directory records”, and finally followed by an “end of central directory record” section. Each central directory record partially duplicates some of the metadata specified in a file section. The purpose of the central directory is to locate the file sections within the archive (which, back when people regularly had to use split/multi-volume archives, could be on a different disk).
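To make the layout concrete, here is a minimal in-memory writer that emits exactly those three parts: file sections, central directory records, and the end of central directory record. It is a sketch under several assumptions (no ZIP64, no streaming, a fixed timestamp of 1980-01-01, MS-DOS style attributes set to zero, UTF-8 names) and is not the code of my actual archiver. The names createZip and crc32 are made up for the example.

```javascript
const zlib = require('node:zlib');

// Use zlib.crc32 when available, else a slow bitwise CRC-32.
function crc32(buf) {
  if (typeof zlib.crc32 === 'function') return zlib.crc32(buf);
  let c = 0xffffffff;
  for (const b of buf) {
    c ^= b;
    for (let k = 0; k < 8; k++) c = c & 1 ? (c >>> 1) ^ 0xedb88320 : c >>> 1;
  }
  return ~c >>> 0;
}

// `files` is an array of { name: string, data: Buffer }.
function createZip(files) {
  const parts = [];
  const central = [];
  let offset = 0;
  for (const { name, data } of files) {
    const nameBytes = Buffer.from(name, 'utf8');
    const payload = zlib.deflateRawSync(data);

    // The 26 bytes shared by the local header and the central directory record.
    const common = Buffer.alloc(26);
    common.writeUInt16LE(20, 0);              // version needed to extract: 2.0
    common.writeUInt16LE(0x0800, 2);          // flags: bit 11 = UTF-8 file name
    common.writeUInt16LE(8, 4);               // compression method: Deflate
    common.writeUInt16LE(0, 6);               // DOS time 00:00:00
    common.writeUInt16LE(0x0021, 8);          // DOS date 1980-01-01
    common.writeUInt32LE(crc32(data), 10);
    common.writeUInt32LE(payload.length, 14); // compressed size
    common.writeUInt32LE(data.length, 18);    // uncompressed size
    common.writeUInt16LE(nameBytes.length, 22);
    common.writeUInt16LE(0, 24);              // extra field length

    // File section: signature + common fields + name + payload.
    const localSig = Buffer.alloc(4);
    localSig.writeUInt32LE(0x04034b50, 0);
    parts.push(localSig, common, nameBytes, payload);

    // Central directory record: signature, version made by, common fields,
    // then comment length, disk number, attributes and local header offset.
    const head = Buffer.alloc(6);
    head.writeUInt32LE(0x02014b50, 0);
    head.writeUInt16LE(20, 4);                // made by: spec 2.0, host 0 (MS-DOS)
    const tail = Buffer.alloc(14);            // zero comment/disk/attributes
    tail.writeUInt32LE(offset, 10);           // offset of the local header
    central.push(head, common, tail, nameBytes);

    offset += 30 + nameBytes.length + payload.length;
  }

  const centralDir = Buffer.concat(central);
  const eocd = Buffer.alloc(22);
  eocd.writeUInt32LE(0x06054b50, 0);
  eocd.writeUInt16LE(files.length, 8);        // entries on this disk
  eocd.writeUInt16LE(files.length, 10);       // entries in total
  eocd.writeUInt32LE(centralDir.length, 12);
  eocd.writeUInt32LE(offset, 16);             // where the central directory starts
  return Buffer.concat([...parts, centralDir, eocd]);
}
```

Writing the returned Buffer to disk, for example with fs.writeFileSync('out.zip', createZip([{ name: 'hello.txt', data: Buffer.from('hello') }])), should give a file that ordinary unzip tools accept.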

File metadata times

By default, each file in a ZIP archive has limited metadata: name, size, modification time, and OS-specific attribute bits. The modification time is in the horrific MS-DOS date/time format, which can only represent years from 1980 to 2107 (inclusive), has a granularity of two seconds, and doesn’t specify a timezone (most implementations will use the local timezone, meaning that the output is not portable). Use extended metadata if you want to be more accurate than that, or ignore the problem and live like it’s 1989, when ZIP was first released.
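The packing itself is simple bit arithmetic, sketched below. The function name toDosDateTime is my own. It uses local-time getters, matching the common (non-portable) behaviour described above, and it rounds odd seconds down, which is an arbitrary choice since the format cannot represent them.

```javascript
// Encode a JavaScript Date as the two 16-bit MS-DOS words used by ZIP.
//   time: hours (5 bits) | minutes (6 bits) | seconds / 2 (5 bits)
//   date: year - 1980 (7 bits) | month 1-12 (4 bits) | day (5 bits)
function toDosDateTime(d) {
  const year = d.getFullYear();
  if (year < 1980 || year > 2107) {
    throw new RangeError(`year ${year} is not representable in MS-DOS format`);
  }
  const time = (d.getHours() << 11) | (d.getMinutes() << 5) | (d.getSeconds() >> 1);
  const date = ((year - 1980) << 9) | ((d.getMonth() + 1) << 5) | d.getDate();
  return { time, date };
}

// 2020-01-02 03:04:07 local time; the odd second is lost in the encoding.
const stamp = toDosDateTime(new Date(2020, 0, 2, 3, 4, 7));
```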

The choice of file attribute format

The external file attributes section of a file entry header allows specifying file attributes based on the host operating system. For example, Windows file attributes include the Archive flag while Unix file attributes include permission bits. If you want to only ever write in one format, though, the Unix format may be the best choice because that’s the platform where users often have to deal with file permissions.
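In the Unix format, the file mode (as in st_mode) goes into the high 16 bits of the 32-bit external attributes field, and the high byte of the “version made by” field is set to 3 to announce the Unix host system. A sketch with helper names of my own choosing:

```javascript
const HOST_UNIX = 3;       // "version made by" host system: Unix
const S_IFREG = 0o100000;  // regular file
const S_IFDIR = 0o040000;  // directory

// "version made by": host system in the high byte, spec version in the low byte.
function versionMadeBy(host, specVersion = 20) {
  return (host << 8) | specVersion;
}

// Unix-format external attributes: st_mode in the high 16 bits.
// `>>> 0` keeps the result an unsigned 32-bit number, since `<<` in
// JavaScript produces a signed value.
function unixExternalAttributes(mode) {
  return ((mode & 0xffff) << 16) >>> 0;
}

const fileAttrs = unixExternalAttributes(S_IFREG | 0o644); // rw-r--r--
const dirAttrs = unixExternalAttributes(S_IFDIR | 0o755);  // rwxr-xr-x
```

Both values are what you would pass to writeUInt32LE when filling in the central directory record.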

Info-ZIP, being normally a Unix-only program, uses the Unix format even on Windows. The way it determines the mode of a file comes directly from the Cygwin emulation layer, which sets the executable (x) permission bits on a file if any of these applies:

The file name ends with .exe.
The file name ends with .lnk (Windows shortcut file).
The file content starts with MZ (Portable Executable (PE) file, generally what .exe and .dll files are).
The file content starts with #! (Unix shebang).
The file content starts with : followed by a carriage return (byte 13). I have no idea what this is.

However, you probably shouldn’t copy this behaviour, because the executable bit is irrelevant when extracting on Windows. What you should do instead is think about what could be considered executable on Unix. Obviously you want to keep the shebang detection. You may want to add ELF file detection (0x7f followed by ELF) because, for example, the user could be cross-compiling from Windows to Linux and want to zip up the output. The .exe and/or MZ heuristics are still valuable due to things like Wine, Mono, and .NET Core combined with binfmt_misc.
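Those heuristics could be sketched like this. The function names are hypothetical, head is assumed to hold the first few bytes of the file, and the 0o755 and 0o644 defaults are just one reasonable choice of permission bits.

```javascript
const path = require('node:path');

// Guess whether a file should get executable bits on Unix, based on its
// name and its first few bytes (`head` is a Buffer, possibly empty).
function looksExecutable(fileName, head) {
  const startsWith = (bytes) =>
    head.length >= bytes.length && bytes.every((b, i) => head[i] === b);

  if (startsWith([0x23, 0x21])) return true;             // "#!" shebang
  if (startsWith([0x7f, 0x45, 0x4c, 0x46])) return true; // 0x7f "ELF"
  if (startsWith([0x4d, 0x5a])) return true;             // "MZ" (PE)
  return path.extname(fileName).toLowerCase() === '.exe';
}

// Pick a Unix mode for a regular file entry.
function modeFor(fileName, head) {
  return 0o100000 | (looksExecutable(fileName, head) ? 0o755 : 0o644);
}
```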

Of course, you can also ignore all this and store every file with no executable permissions. Or you can use the Windows file attributes format when writing on Windows; this is not something I explored so I can’t comment much on it.

How various zip and unzip programs deal with symbolic links on Windows

Windows distinguishes between a file symlink and a directory symlink—you can make a file symlink pointing to a directory or vice versa if you wanted to for some reason, and when creating a symlink to a nonexistent location you must decide which type you expect the target to be. This distinction does not exist on Unix and cannot be made when the file metadata is recorded in Unix format.

zip:

Explorer and 7-Zip (on Windows at least) do not support recording symlinks in ZIP files. Both will only store file copies.
Info-ZIP will normally also store file copies unless given the --symlinks flag, in which case it will store symlinks as actual symlinks. Since Info-ZIP always uses the Unix metadata format, the distinction between file symlink and directory symlink on Windows is lost.

unzip:

Explorer will straight up ignore the Unix symlink flag and extract the symlink as a text file containing the target path.
On Windows, 7-Zip will always extract a Unix symlink as a file symlink.
On Windows, Info-ZIP will extract a Unix symlink according to whether the target path points to a file or a directory; if the target is nonexistent, it reports an error and skips the symlink.
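For context, this is roughly what “recording a symlink in Unix format” means at the byte level, as far as I understand the convention Info-ZIP follows: the entry’s mode has the symlink file type (S_IFLNK), and the entry’s payload is the target path. The helper names and the returned object shape below are hypothetical.

```javascript
const S_IFMT = 0o170000;  // mask for the file-type bits of st_mode
const S_IFLNK = 0o120000; // symbolic link

// Writing: describe a symlink entry. The payload is the target path itself.
function symlinkEntry(name, target) {
  return {
    name,
    data: Buffer.from(target, 'utf8'),
    externalAttributes: ((S_IFLNK | 0o777) << 16) >>> 0,
  };
}

// Reading: decide whether a central directory record describes a Unix
// symlink, given its "version made by" and external attributes fields.
function isUnixSymlink(versionMadeBy, externalAttributes) {
  const host = versionMadeBy >>> 8;       // 3 means Unix
  const mode = externalAttributes >>> 16; // st_mode lives in the high 16 bits
  return host === 3 && (mode & S_IFMT) === S_IFLNK;
}
```

Nothing in this encoding says whether the target is a file or a directory, which is why extractors on Windows have to guess or inspect the target.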

Preventing security issues from very large extraction sizes

This is absolutely irrelevant to creating ZIP files (only to extracting ZIP files), but I’ve included it here because it’s a fascinating topic.

A ZIP file that is designed to cause resource (CPU/disk/memory/bandwidth/stack/filedesc) exhaustion is called a ZIP bomb. Such files employ one or more nasty tricks; this is a summary of what you need to look out for, but there may be others I’m not aware of.

It’s actually very easy to reach a compression ratio of around 1000-to-1 with Deflate (the format caps the ratio at roughly 1032-to-1), and some of the other methods ZIP allows, such as bzip2, can go far higher. This happens naturally without any special tricks if you have highly compressible data, such as a long string of identical characters, leading to surprisingly large decompressed files. When handling untrusted data, enforce upper limits on the amount of memory and disk space you can use for extraction. If you are developing a ZIP extractor, give users the ability to set these limits.
The uncompressed size declared in a file header may not match the actual decompressed size of its payload. While decompressing a file, check that it hasn’t exceeded the stated uncompressed size. Also note that the uncompressed size for a file is listed in two places (local file header and central directory), neither of which can be trusted.
If you are developing software that wants to analyse ZIP files recursively (inside one another)—for example an antivirus program—be aware that a tiny ZIP file at the top level can lead to absolutely gargantuan files a few levels deeper, and that a ZIP file can be crafted to nest infinitely (!). Do not analyse or extract ZIP files recursively in an unbounded manner.
A file entry can be referred to multiple times from the central directory; similarly, it’s also possible to nest or overlap file entries. This results in the same payload potentially being extracted many times. A ZIP extractor needs to scan or keep track of file entry start and end points and make sure that none of them duplicate or overlap.
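Two of the checks above can be sketched in a few lines: capping the output of decompression at the declared size, and rejecting archives whose entries overlap. The function names, the maxAllowed parameter, and the { start, end } span shape are hypothetical. The maxOutputLength option is a real zlib option that makes the convenience methods throw rather than keep allocating.

```javascript
const zlib = require('node:zlib');

// Inflate a raw Deflate payload, refusing to produce more than the declared
// size, which itself must not exceed a caller-chosen limit.
function inflateEntry(payload, declaredSize, maxAllowed) {
  if (declaredSize > maxAllowed) {
    throw new RangeError(`declared size ${declaredSize} exceeds limit ${maxAllowed}`);
  }
  // zlib throws once the output grows past maxOutputLength (minimum value 1).
  const out = zlib.inflateRawSync(payload, {
    maxOutputLength: Math.max(declaredSize, 1),
  });
  if (out.length !== declaredSize) {
    throw new RangeError('decompressed size does not match declared size');
  }
  return out;
}

// `spans` is a hypothetical list of { start, end } byte ranges (end
// exclusive), one per central directory record, each covering an entry's
// local header and payload. Returns the first overlapping pair, or null.
function findOverlap(spans) {
  const sorted = [...spans].sort((a, b) => a.start - b.start);
  for (let i = 1; i < sorted.length; i++) {
    if (sorted[i].start < sorted[i - 1].end) return [sorted[i - 1], sorted[i]];
  }
  return null;
}
```

Two central directory records that point at the same file entry produce identical spans, so the overlap check also catches the duplicate-reference trick. A streaming extractor would enforce the same limit by counting bytes as they come out of the inflate stream instead of buffering everything.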

The result

This is really nothing to write home about, but here is the code of my Node.js-based ZIP archiver if you are interested. As the project name suggests, it’s a basic implementation that takes shortcuts to produce the simplest (but still compressed) ZIP files. The code is licensed under AGPL-3.0-or-later.