Johannes Sasongko’s blog


Creating a zip implementation for Node.js

It has always bothered me that the Node.js standard library doesn’t provide an easy way to create ZIP files. What the standard library does have is an implementation of the Deflate algorithm, which is the compression scheme normally used in ZIP files. This means that the most complicated and performance-sensitive part of the job has already been done for us, and all we need to do is add the wrapper. So I decided to write it for fun: a ZIP implementation for Node.js that uses the standard library’s Deflate implementation.

This post contains some (very unorganised) notes that document the trickier parts of this process. Even though my implementation only covers the compression side (zip, not unzip), some of these notes are applicable to the decompression side as well.

Note: In this post, Explorer refers to Windows Explorer on Windows 10, 7-Zip refers to the official 7-Zip program (not p7zip), and Info-ZIP is the project supplying the zip and unzip programs included on most Linux distributions and on MSYS2/Cygwin.

What we need

First we need the ZIP specification. That’s the easiest part because the specification is open and freely implementable by anyone. PKWARE, the creators of the ZIP format, have continually published it in a single text file named APPNOTE.TXT. What follows are a number of decisions you need to make depending on how complete you want your implementation to be.

There are two types of ZIP files: (normal) ZIP and ZIP64. The biggest difference is that a normal ZIP file uses 32-bit values for file sizes and offsets, while ZIP64 uses 64-bit values. The details of ZIP64 can be a bit weird due to it being shoehorned into an existing format that was not designed to accommodate such a change. You can opt to ignore ZIP64, but that will mean that you won’t be able to handle any individual file greater than 4 GiB—this includes any uncompressed file size, any compressed file size, or (roughly) the size of the ZIP file itself.
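To make the limit concrete, here is a rough sketch of how a writer might decide whether ZIP64 is needed. The function and field names are my own for illustration. The check relies on the fact that the all-ones values 0xFFFFFFFF and 0xFFFF are reserved in the classic fields to mean "the real value is in the ZIP64 data".

```typescript
// Reserved marker values in the classic (non-ZIP64) fields.
const MAX_UINT32 = 0xffffffff;
const MAX_UINT16 = 0xffff;

// Hypothetical shape describing one entry; not taken from any library.
interface EntrySizes {
  compressedSize: number;
  uncompressedSize: number;
  localHeaderOffset: number;
}

// An entry needs ZIP64 if any of its 32-bit fields would overflow
// or collide with the reserved marker value.
function entryNeedsZip64(e: EntrySizes): boolean {
  return (
    e.compressedSize >= MAX_UINT32 ||
    e.uncompressedSize >= MAX_UINT32 ||
    e.localHeaderOffset >= MAX_UINT32
  );
}

// The archive as a whole needs ZIP64 if the entry count or the
// central directory position/size does not fit the classic fields.
function archiveNeedsZip64(
  entryCount: number,
  centralDirOffset: number,
  centralDirSize: number,
): boolean {
  return (
    entryCount >= MAX_UINT16 ||
    centralDirOffset >= MAX_UINT32 ||
    centralDirSize >= MAX_UINT32
  );
}
```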

We don’t actually need compression! ZIP files can store uncompressed files, though obviously that would greatly limit the applicable use cases. Technically the ZIP specification supports many different compression formats, including modern ones like xz and Zstandard, but in practice you should probably stick to Deflate and maybe Deflate64 because those are the ones with widespread support (and by that I mean they are the only ones supported by Explorer). As mentioned previously, Node.js includes a Deflate implementation; this is available in the zlib module.
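As a small illustration, this is roughly what compressing a payload looks like. ZIP stores a raw Deflate stream (no zlib or gzip header), so the "raw" variants of the zlib functions are the ones to use. The sample input and variable names are made up for this sketch.

```typescript
import { deflateRawSync, inflateRawSync } from "node:zlib";

// Made-up, highly repetitive input so that compression is visible.
const input = Buffer.from("hello hello hello hello hello\n".repeat(100));

// Raw Deflate: this is what goes into a ZIP entry with method 8.
const compressed = deflateRawSync(input, { level: 9 });

// Round trip, to confirm the payload can be recovered.
const roundTripped = inflateRawSync(compressed);

console.log(`${input.length} bytes -> ${compressed.length} bytes`);
```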

ZIP files use CRC-32 for error detection. Node.js again has our back here, as the zlib module also contains a CRC-32 implementation. If you want to write your own implementation, the code listing in RFC 1952 is very easy to follow.
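For reference, a table-driven CRC-32 in the style of the RFC 1952 listing might look like the sketch below. This is my own transcription of the well-known algorithm, not code from the zlib module; the optional `previous` parameter is an addition that allows feeding the data in chunks.

```typescript
// Precomputed table for the reflected polynomial 0xEDB88320.
const CRC_TABLE: Uint32Array = (() => {
  const table = new Uint32Array(256);
  for (let n = 0; n < 256; n++) {
    let c = n;
    for (let k = 0; k < 8; k++) {
      c = c & 1 ? 0xedb88320 ^ (c >>> 1) : c >>> 1;
    }
    table[n] = c >>> 0;
  }
  return table;
})();

// Returns the CRC-32 of `data` as an unsigned 32-bit number.
// Pass the result of an earlier call as `previous` to continue
// a running checksum over several chunks.
function crc32(data: Uint8Array, previous = 0): number {
  let c = (previous ^ 0xffffffff) >>> 0;
  for (let i = 0; i < data.length; i++) {
    c = CRC_TABLE[(c ^ data[i]) & 0xff] ^ (c >>> 8);
  }
  return (c ^ 0xffffffff) >>> 0;
}
```

The standard check value for this CRC is 0xCBF43926 for the ASCII string "123456789", which makes for a handy unit test.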

File sizes (compressed and uncompressed) can be indicated either before or after each file payload. Putting them before the payload results in slightly simpler ZIP file structure. However, if your use case involves live streaming of the ZIP file, putting file sizes after each payload may be your only option.
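When the sizes go after the payload, the file header sets bit 3 of the general purpose flags and leaves the CRC and size fields as zero, and a small "data descriptor" follows the payload. A sketch of building that descriptor for the non-ZIP64 case is below; the function name is my own, and the leading signature is technically optional according to the specification, though it is commonly written.

```typescript
// General purpose flag bit 3: sizes and CRC follow the payload.
const FLAG_DATA_DESCRIPTOR = 1 << 3;
const DATA_DESCRIPTOR_SIGNATURE = 0x08074b50;

// Builds the 16-byte (non-ZIP64) data descriptor that is written
// right after the compressed payload. All values are little-endian.
function buildDataDescriptor(
  crc: number,
  compressedSize: number,
  uncompressedSize: number,
): Uint8Array {
  const bytes = new Uint8Array(16);
  const view = new DataView(bytes.buffer);
  view.setUint32(0, DATA_DESCRIPTOR_SIGNATURE, true);
  view.setUint32(4, crc, true);
  view.setUint32(8, compressedSize, true);
  view.setUint32(12, uncompressedSize, true);
  return bytes;
}
```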

A proper implementation might include concurrency/multithreading capability. The simplest way is probably to run the compression procedure on multiple input files at once, outputting to temporary files, and then combine them into the actual ZIP file.

High-level view

The ZIP specification is fairly good in terms of giving a general overview of the format. Refer to APPNOTE.TXT section 4.3 “General Format of a .ZIP file”.

Ignoring some of the fancier features like encryption, a ZIP file consists of a bunch of file sections (each including the contents of a file and some metadata), followed by a bunch of “central directory records”, and finally an “end of central directory record” section. Each central directory record partially duplicates some of the metadata specified in a file section. The purpose of the central directory is to locate the file sections within the archive (which, back when people regularly had to use split/multi-volume archives, could be on a different disk).
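To show how the three parts fit together, here is a minimal sketch that builds an archive containing a single uncompressed ("stored") file, entirely in memory. It is not my actual implementation: `buildStoredZip` is a hypothetical helper, the timestamp is hard-coded to 1980-01-01, and there is no ZIP64 or Deflate. The field offsets follow the layouts in APPNOTE.TXT.

```typescript
// Bit-by-bit CRC-32; short but slow, which is fine for a sketch.
function crc32(data: Uint8Array): number {
  let c = 0xffffffff;
  for (let i = 0; i < data.length; i++) {
    c ^= data[i];
    for (let k = 0; k < 8; k++) c = c & 1 ? (c >>> 1) ^ 0xedb88320 : c >>> 1;
  }
  return (c ^ 0xffffffff) >>> 0;
}

// Hypothetical helper: builds a ZIP archive holding one stored file.
function buildStoredZip(name: string, data: Uint8Array): Uint8Array {
  const nameBytes = new TextEncoder().encode(name);
  const crc = crc32(data);
  const localSize = 30 + nameBytes.length + data.length;
  const centralSize = 46 + nameBytes.length;
  const out = new Uint8Array(localSize + centralSize + 22); // zero-filled
  const view = new DataView(out.buffer);

  // Fields duplicated between the file section and its central record.
  const writeCommon = (at: number): void => {
    view.setUint16(at, 20, true); // version needed to extract: 2.0
    view.setUint16(at + 2, 0x0800, true); // flags: bit 11, UTF-8 name
    view.setUint16(at + 4, 0, true); // method 0: stored
    view.setUint16(at + 6, 0, true); // DOS time 00:00:00
    view.setUint16(at + 8, 0x0021, true); // DOS date 1980-01-01
    view.setUint32(at + 10, crc, true);
    view.setUint32(at + 14, data.length, true); // compressed size
    view.setUint32(at + 18, data.length, true); // uncompressed size
    view.setUint16(at + 22, nameBytes.length, true);
  };

  // 1. File section: header, name, payload.
  view.setUint32(0, 0x04034b50, true);
  writeCommon(4);
  out.set(nameBytes, 30);
  out.set(data, 30 + nameBytes.length);

  // 2. Central directory record pointing back at the file section.
  const cd = localSize;
  view.setUint32(cd, 0x02014b50, true);
  view.setUint16(cd + 4, 20, true); // version made by: 2.0, host 0
  writeCommon(cd + 6);
  view.setUint32(cd + 42, 0, true); // offset of the file section
  out.set(nameBytes, cd + 46);

  // 3. End of central directory record.
  const eocd = cd + centralSize;
  view.setUint32(eocd, 0x06054b50, true);
  view.setUint16(eocd + 8, 1, true); // entries on this disk
  view.setUint16(eocd + 10, 1, true); // entries in total
  view.setUint32(eocd + 12, centralSize, true);
  view.setUint32(eocd + 16, cd, true); // where the central directory starts
  return out;
}
```

Writing the returned bytes to a file with a .zip extension should give an archive that ordinary unzip tools can list and extract.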

File metadata times

By default, each file in a ZIP archive has limited metadata: name, size, modification time, and OS-specific attribute bits. The modification time is in the horrific MS-DOS date/time format, which can only represent years from 1980 to 2107 (inclusive), has a granularity of two seconds, and doesn’t specify a timezone (most implementations will use the local timezone, meaning that the output is not portable). Use extended metadata if you want to be more accurate than that, or ignore the problem and live like it’s 1989, when ZIP was first released.
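The MS-DOS format packs the date and the time into two 16-bit values. A sketch of the conversion, using the local timezone like most implementations, could look like this (`toDosDateTime` is a name I made up):

```typescript
// Converts a Date to the MS-DOS date and time fields used by ZIP,
// interpreting the Date in the local timezone.
function toDosDateTime(d: Date): { date: number; time: number } {
  const year = d.getFullYear();
  if (year < 1980 || year > 2107) {
    throw new RangeError(`Year ${year} cannot be represented in MS-DOS format`);
  }
  // Date: 7 bits year since 1980, 4 bits month (1-12), 5 bits day.
  const date = ((year - 1980) << 9) | ((d.getMonth() + 1) << 5) | d.getDate();
  // Time: 5 bits hour, 6 bits minute, 5 bits seconds divided by two.
  const time =
    (d.getHours() << 11) | (d.getMinutes() << 5) | (d.getSeconds() >> 1);
  return { date, time };
}
```

Because the seconds are halved, 13:30:58 and 13:30:59 end up with the same encoded time.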

The choice of file attribute format

The external file attributes section of a file entry header allows specifying file attributes based on the host operating system. For example, Windows file attributes include the Archive flag while Unix file attributes include permission bits. If you want to only ever write in one format, though, the Unix format may be the best choice because that’s the platform where users often have to deal with file permissions.
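For the Unix format, the convention used by Info-ZIP is to mark the host system as Unix (value 3) in the upper byte of the “version made by” field and to put the Unix mode in the upper 16 bits of the external file attributes. A sketch, with function names of my own choosing:

```typescript
// Host system code for Unix in the "version made by" field.
const HOST_UNIX = 3;

// Upper byte: host system; lower byte: spec version times ten (20 = 2.0).
function versionMadeByUnix(specVersion = 20): number {
  return (HOST_UNIX << 8) | specVersion;
}

// Places a Unix st_mode value (file type plus permission bits) in the
// upper 16 bits of the 32-bit external file attributes field.
function unixExternalAttributes(mode: number): number {
  return ((mode & 0xffff) << 16) >>> 0;
}

// Example: a regular file with rw-r--r-- permissions.
const regularFile644 = unixExternalAttributes(0o100644);
```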

Info-ZIP, being normally a Unix-only program, uses the Unix format even on Windows. The way it determines the mode of a file comes directly from the Cygwin emulation layer, which sets the executable (x) permission bits on a file if any of these applies:

However, you probably shouldn’t copy this behaviour, because the executable bit is irrelevant when extracting on Windows. What you should do instead is think about what could be considered executable on Unix. Obviously you want to keep the shebang detection. You may want to add ELF file detection (0x7f followed by ELF) because, for example, the user could be cross-compiling from Windows to Linux and wants to zip up the output. The .exe and/or MZ heuristics are still valuable due to things like Wine, Mono, and .NET Core combined with binfmt_misc.
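A rough sketch of such a heuristic, covering only the checks mentioned above (shebang, ELF, MZ, and the .exe extension), might look like this. The function names and the choice of 0o755/0o644 modes are my own assumptions.

```typescript
// True if `head` begins with the given magic bytes.
function startsWithBytes(head: Uint8Array, magic: number[]): boolean {
  return magic.every((byte, i) => head[i] === byte);
}

// Guesses whether a file should get the executable bits, given its
// name and its first few bytes.
function looksExecutable(fileName: string, head: Uint8Array): boolean {
  if (startsWithBytes(head, [0x23, 0x21])) return true; // "#!" shebang
  if (startsWithBytes(head, [0x7f, 0x45, 0x4c, 0x46])) return true; // ELF
  if (startsWithBytes(head, [0x4d, 0x5a])) return true; // "MZ" (DOS/PE)
  return fileName.toLowerCase().endsWith(".exe");
}

// Picks a Unix mode for a regular file based on the heuristic.
function unixModeFor(fileName: string, head: Uint8Array): number {
  return looksExecutable(fileName, head) ? 0o100755 : 0o100644;
}
```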

Of course, you can also ignore all this and store every file with no executable permissions. Or you can use the Windows file attributes format when writing on Windows; this is not something I explored so I can’t comment much on it.

How various zip and unzip programs deal with symbolic links on Windows

Windows distinguishes between a file symlink and a directory symlink—you can make a file symlink pointing to a directory or vice versa if you wanted to for some reason, and when creating a symlink to a nonexistent location you must decide which type you expect the target to be. This distinction does not exist on Unix and cannot be made when the file metadata is recorded in Unix format.

zip:

unzip:

Preventing security issues from very large extraction sizes

This is absolutely irrelevant to creating ZIP files (only to extracting ZIP files), but I’ve included it here because it’s a fascinating topic.

A ZIP file that is designed to cause resource (CPU/disk/memory/bandwidth/stack/filedesc) exhaustion is called a ZIP bomb. They employ one or several forms of nasty tricks; this is a summary of what you need to look out for, but there may be others I’m not aware of.
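As one example of a defence, the most basic trick (a tiny payload that inflates to an enormous size) can be contained by capping the output size during decompression instead of trusting the size recorded in the archive. The sketch below uses the `maxOutputLength` option of the zlib module; `inflateWithLimit` is a hypothetical helper, and for simplicity it treats every failure, including corrupt data, as a rejection.

```typescript
import { deflateRawSync, inflateRawSync } from "node:zlib";

// Inflates a raw Deflate payload, refusing to produce more than
// `maxBytes` of output. Returns null if the limit is exceeded or
// the data cannot be decompressed.
function inflateWithLimit(compressed: Uint8Array, maxBytes: number): Buffer | null {
  try {
    return inflateRawSync(compressed, { maxOutputLength: maxBytes });
  } catch {
    return null;
  }
}

// Demonstration: 1 MiB of zeros compresses to a very small payload.
const bomb = deflateRawSync(Buffer.alloc(1024 * 1024));
const rejected = inflateWithLimit(bomb, 64 * 1024);
const accepted = inflateWithLimit(bomb, 2 * 1024 * 1024);
```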

The result

This is really nothing to write home about, but here is the code of my Node.js-based ZIP archiver if you are interested. As the project name suggests, it’s a basic implementation that takes shortcuts to produce the simplest (but still compressed) ZIP files. The code is licensed under AGPL-3.0-or-later.