Walkthrough: ZIP file format

A very popular format for compressing and archiving files is the ZIP format. Its history is well documented on the PKZIP page on Wikipedia. While on Unix systems (such as Linux) other tools (tar for archiving, combined with gzip for compression) have historically been more popular, ZIP is the dominant archiving and compression format in the rest of the world. Other systems, like Java and Android, also use ZIP extensively (in JAR and APK files respectively), and other formats, such as the Open Document Format (ODF), are also based on ZIP.

The specifications for ZIP have been open for years and can be freely (re)implemented, and every decent programming language has good support for working with ZIP files, although not every implementation covers all functionality. For example, Python 3 has a fairly complete implementation with support for LZMA and bzip2 compression and large files (ZIP64), but not for multi-file ZIP archives. The references in the rest of this blog post are to these specifications.

ZIP file format

The ZIP file format is described in the ZIP specifications in section 4.3.6. Basically it consists of a series of headers for files plus the actual file data, followed by a lookup table (called "central directory") for quick access to the locations in the files where the data can be found.

The metadata for files includes things like a CRC32 checksum, modification date, and so on, but possibly also operating system specific extensions such as UNIX uid and gid.
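
To illustrate, here is a minimal sketch using Python's zipfile module (with a hypothetical example.zip) that prints some of this metadata. The values come from the central directory; the raw extra field, where extensions such as the UNIX uid and gid live, is only available as the raw bytes in info.extra.

import zipfile

# print some of the per-file metadata stored in the central directory
with zipfile.ZipFile('example.zip') as z:
    for info in z.infolist():
        print(info.filename,
              hex(info.CRC),                  # CRC32 of the uncompressed data
              info.date_time,                 # modification date and time
              oct(info.external_attr >> 16))  # UNIX permission bits, if present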

Carving ZIP files from a larger file

The central directory is very convenient for quick access to files in the ZIP file: it is what programs and libraries use, and it makes unpacking ZIP files very easy. Or at least, when you know that there is just one ZIP file, and the central directory is the very last part of the file. If this is not the case, then processing the ZIP file gets quite a bit more complex.

Let's take the situation where two ZIP files have been concatenated, possibly with data in between. The examples below are done with real data, so it should be trivial to recreate them on any Linux system. I used two well known binaries (ls and vim) and created a ZIP file for each of them:

$ zip -r vim.zip vim
  adding: vim (deflated 49%)

$ zip -r ls.zip ls
  adding: ls (deflated 53%)


I then concatenated the files:

$ cat vim.zip ls.zip > test.zip


The central directory of the file ls.zip is now at the end of the new ZIP file. When running unzip the following happens:

$ unzip -l test.zip 
Archive:  test.zip
warning [test.zip]:  1539848 extra bytes at beginning or within zipfile
  (attempting to process anyway)
  Length      Date    Time    Name
---------  ---------- -----   ----
   133096  07-12-2018 14:49   ls
---------                     -------
   133096                     1 file


As you can see, the data from vim.zip is nowhere to be found. That is because the unzip program looks at the end of the file for a central directory and works from there.

This can also be seen when appending data to a ZIP file:

$ cat vim.zip /bin/vim > test.zip


and then running unzip again:

$ unzip -l test.zip
Archive:  test.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of test.zip or
        test.zip.zip, and cannot find test.zip.ZIP, period.


This is because the file no longer ends with the central directory of the ZIP file. Python's zipfile module behaves in a similar way. This means that to properly carve and unpack ZIP files from files that contain multiple ZIP files (possibly with other data in between), or where the ZIP file is followed by other data, you need to make sure that you know where the ZIP file starts and where it ends.
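
As a quick sketch of that similar behaviour, assuming the two test files from above have been saved as concatenated.zip (vim.zip plus ls.zip) and appended.zip (vim.zip plus /bin/vim); the comments show the output I would expect from CPython 3:

import zipfile

# concatenated.zip: zipfile also works backwards from the end of central
# directory record, so only the entry from ls.zip is visible
with zipfile.ZipFile('concatenated.zip') as z:
    print(z.namelist())        # ['ls']

# appended.zip: the end of central directory record is no longer near the
# end of the file, so the file is rejected outright
try:
    zipfile.ZipFile('appended.zip')
except zipfile.BadZipFile as e:
    print(e)                   # File is not a zip file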

For regular ZIP files this is not much of a problem, and you can do the following (simplified):

  1. search for a local file header (section 4.3.7) and parse it. Especially the compressed file size is important, as it allows you to skip the compressed data.
  2. do step 1 for all files, until you hit the first central directory record (section 4.3.12). There will be one for every file and/or directory stored in the ZIP file. The data in the central directory is largely a copy of the local file header, so it is ideal for verifying data consistency.
  3. do step 2 for all files, until you encounter the end of central directory record (section 4.3.16). Parse this to make sure you also catch things like the ZIP comment.
and that's it for a simple ZIP file; a minimal sketch of this walk is shown below. But then there are of course all the exceptions.
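
The sketch assumes the whole file is already in memory as a bytes object and ignores the exceptions (data descriptors, ZIP64, encryption) discussed next:

import struct

LOCAL_FILE_HEADER = b'PK\x03\x04'
CENTRAL_DIR_HEADER = b'PK\x01\x02'
END_OF_CENTRAL_DIR = b'PK\x05\x06'

def carve_zip(data, start=0):
    '''Return the offset just past the end of central directory record of the
    ZIP file starting at `start` in `data`, or None if parsing fails.'''
    pos = start

    # steps 1 and 2: walk the local file headers, skipping the compressed data
    while data[pos:pos + 4] == LOCAL_FILE_HEADER:
        (version, flags, method, mod_time, mod_date, crc, compressed_size,
         uncompressed_size, name_len, extra_len) = struct.unpack_from(
            '<HHHHHIIIHH', data, pos + 4)
        pos += 30 + name_len + extra_len + compressed_size

    # step 3: walk the central directory records
    while data[pos:pos + 4] == CENTRAL_DIR_HEADER:
        name_len, extra_len, comment_len = struct.unpack_from('<HHH', data, pos + 28)
        pos += 46 + name_len + extra_len + comment_len

    # end of central directory record, including a possible ZIP comment
    if data[pos:pos + 4] != END_OF_CENTRAL_DIR:
        return None
    (comment_len,) = struct.unpack_from('<H', data, pos + 20)
    return pos + 22 + comment_len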

ZIP64

When packing large files the 4 byte size fields in the local file header and central directory might not be enough to hold the size of the compressed or uncompressed file. There is an extension called ZIP64, which allows storing larger files. For this the sizes are stored somewhere else in the file. In the central directory special purpose ZIP64 records (sections 4.3.14 and 4.3.15) are included, but this only helps when working backwards from the central directory, not from the beginning of the file. Instead, the correct sizes will be stored in a so called "extra field" (sections 4.4.8, 4.4.9 and 4.5.3).
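
A sketch of how the ZIP64 sizes can be dug out of a local file header's extra field (only handling the local file header case, where both 8 byte sizes have to be present):

import struct

ZIP64_EXTRA_ID = 0x0001

def zip64_sizes(extra_field):
    '''Scan an extra field for the ZIP64 extended information field
    (section 4.5.3) and return (uncompressed_size, compressed_size),
    or None if it is not present.'''
    pos = 0
    while pos + 4 <= len(extra_field):
        header_id, data_size = struct.unpack_from('<HH', extra_field, pos)
        if header_id == ZIP64_EXTRA_ID:
            # in the local file header both 8 byte sizes must be present
            return struct.unpack_from('<QQ', extra_field, pos + 4)
        pos += 4 + data_size
    return None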

Data descriptor

There are cases where the local file header does not contain the file size. This is when a certain bit (bit 3) in the general purpose bit flag is set to 1 (sections 4.4.4, 4.4.8, 4.4.9). Instead, the sizes follow the file data in a data descriptor. This is not a problem if you already have the central directory, as there you can simply find the correct size, but if you don't know where the central directory is (which is our assumption), then you need to do some more work: from the local file header you work forward until you see known ZIP headers and then perform some checks. This can actually be quite tricky.

The best example is when a ZIP file contains another ZIP file. The zip program will not compress already compressed data, because there would be no gain (and a file might actually become larger as a result). That's why sometimes you can see data that is stored uncompressed. If a ZIP file contains another ZIP file there will be lots of false positives for ZIP headers. The workaround is to look at the 12 bytes just before the header: these should contain the CRC32, compressed size and uncompressed size of the data descriptor (section 4.3.9), which can be used to verify if it is indeed the end of the compressed data.
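
A sketch of such a check: given the start of the file data and the offset of a suspected next header, see if the bytes just before it look like a data descriptor (with or without the optional PK\x07\x08 signature) whose compressed size matches the amount of data seen so far. Verifying the CRC32 after decompressing would make it more robust, but that is left out here.

import struct

DATA_DESCRIPTOR_SIG = b'PK\x07\x08'    # optional signature (section 4.3.9)

def has_data_descriptor(flags):
    # bit 3 of the general purpose bit flag signals a data descriptor
    return bool(flags & 0x0008)

def plausible_data_descriptor(data, data_start, candidate):
    '''Check whether the bytes just before `candidate` (the offset of a
    suspected next ZIP header) look like a data descriptor for file data
    starting at `data_start`.'''
    for desc_len in (16, 12):          # with or without the optional signature
        desc_start = candidate - desc_len
        if desc_start < data_start:
            continue
        if desc_len == 16 and data[desc_start:desc_start + 4] != DATA_DESCRIPTOR_SIG:
            continue
        crc32, compressed_size, uncompressed_size = struct.unpack_from(
            '<III', data, candidate - 12)
        # the compressed size should match the length of the file data
        if compressed_size == desc_start - data_start:
            return True
    return False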

Android signing block

According to the ZIP specifications there should not be any data between the file data and the central directory (at least, that is how I am reading it), but in practice this isn't always followed. Also, it doesn't matter, as programs unpacking ZIP files can simply rely on the central directory. Unless, of course, you are carving ZIP files, like I do.

Since Android 7 Google has been squeezing the APK Signing Block in between the ZIP file data and the central directory of the ZIP file, as explained on the Android APK Signature Scheme page.

The Android APK Signing Block has a clear structure, so it is easy to verify whether or not there is an APK Signing Block and skip it.
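
A sketch of that check, based on my reading of the APK Signature Scheme documentation: the block starts with a uint64 size (which excludes the size field itself), followed by ID-value pairs, the same size repeated and the 16 byte magic "APK Sig Block 42". So when the carving walk expects a central directory but finds something else, it can test for the block and skip it:

import struct

APK_SIG_BLOCK_MAGIC = b'APK Sig Block 42'

def skip_apk_signing_block(data, pos):
    '''If an APK Signing Block starts at `pos`, return the offset just past
    it; otherwise return `pos` unchanged.'''
    if pos + 8 > len(data):
        return pos
    (size,) = struct.unpack_from('<Q', data, pos)
    end = pos + 8 + size               # size excludes the first size field
    if size < 24 or end > len(data):
        return pos
    (trailing_size,) = struct.unpack_from('<Q', data, end - 24)
    if trailing_size != size or data[end - 16:end] != APK_SIG_BLOCK_MAGIC:
        return pos
    return end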

And more

I skipped quite a few bits, like encryption, digital signatures, and multi-file archives. There might also be other variants of ZIP files that I simply haven't found out about yet.

Need an unpacker for ZIP files?

In most cases regular unzipping tools like unzip or 7z will work. However, if you are encountering a weird ZIP file with concatenated data that you need to unpack, then please take a look at BANG.
