A very popular format for compressing and archiving files is the ZIP format. Its history is well documented on the PKZIP page on Wikipedia. While on Unix systems (such as Linux) other tools (tar for archiving, combined with gzip for compression) have historically been more popular, ZIP is the dominant archiving and compression format in the rest of the world. Other systems, like Java and Android, also use ZIP extensively (in JAR and APK files respectively), and other formats, such as Open Document Format (ODF), are also based on ZIP.
The specifications for ZIP have been open for years and can be freely (re)implemented. Every decent programming language has good support for working with ZIP files, although not every implementation supports all functionality. For example, Python 3 has a fairly complete implementation with support for LZMA and bzip2 compression and large files (ZIP64), but not for multi-file (multi-disk) ZIP archives. The section references in the rest of this blog post are to these specifications.
ZIP file format
The ZIP file format is described in section 4.3.6 of the ZIP specifications. Basically it consists of a series of headers for files plus the actual file data, followed by a lookup table (called the "central directory") for quick access to the locations in the file where the data can be found.
The metadata for files includes things like a CRC32 checksum, modification date, and so on, but possibly also operating system specific extensions such as UNIX uid and gid.
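To get a feel for this metadata, here is a minimal sketch that lists what the central directory records for each entry, using Python's standard zipfile module. The helper name describe_entries is my own and only there for illustration; the attributes are those of zipfile.ZipInfo.

```python
import io
import zipfile


def describe_entries(zip_bytes):
    """Return (name, local header offset, compressed size, size, CRC32, mtime) per entry."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return [
            (info.filename, info.header_offset, info.compress_size,
             info.file_size, info.CRC, info.date_time)
            for info in zf.infolist()
        ]
```

The header offset is the "lookup table" part: it says where in the file the local file header of that entry starts.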
Carving ZIP files from a larger file
The central directory is very convenient for quick access to files in the ZIP file. It is what programs and libraries use, and it makes unpacking ZIP files very easy. Or at least, it does when you know that there is just one ZIP file and the central directory is the very last part of the file. If this is not the case, then handling the ZIP file gets quite a bit more complex.
Let's take the situation where two ZIP files have been concatenated, possibly with data in between. The examples below are done with real data, so they should be trivial to recreate on any Linux system. I used two well known binaries (ls and vim) and created a ZIP file for each of them:
$ zip -r vim.zip vim
  adding: vim (deflated 49%)
$ zip -r ls.zip ls
  adding: ls (deflated 53%)
I then concatenated the files:
$ cat vim.zip ls.zip > test.zip
The central directory of the file ls.zip is now at the end of the new ZIP file. When running unzip the following happens:
$ unzip -l test.zip
Archive:  test.zip
warning [test.zip]:  1539848 extra bytes at beginning or within zipfile
  (attempting to process anyway)
  Length      Date    Time    Name
---------  ---------- -----   ----
   133096  07-12-2018 14:49   ls
---------                     -------
   133096                     1 file
As you can see the data from vim.zip is nowhere to be found. That is because the unzip program looks at the end of the file to see if it finds a central directory and works from there.
This can also be seen when appending data to a ZIP file:
$ cat vim.zip /bin/vim > test.zip
and then running unzip again:
$ unzip -l test.zip
Archive:  test.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of test.zip or
        test.zip.zip, and cannot find test.zip.ZIP, period.
This is because the file no longer ends with the central directory of the ZIP file. Python's zipfile module behaves in a similar way. This means that to properly carve and unpack ZIP files from a file that contains multiple ZIP files (possibly with other data in between), or where the ZIP file is followed by other data, you need to know where each ZIP file starts and where it ends.
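The zipfile behaviour can be checked with a small in-memory experiment. This is only a sketch: the helper names make_zip and names_seen are mine, the payloads are dummy bytes rather than the real ls and vim binaries, and the amount of appended data (100,000 bytes) is an assumption chosen to be larger than any ZIP comment can be.

```python
import io
import zipfile


def make_zip(name, payload):
    """Build a small single-entry ZIP archive in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, payload)
    return buf.getvalue()


def names_seen(blob):
    """Return the entry names zipfile finds in blob, or None if it gives up."""
    try:
        with zipfile.ZipFile(io.BytesIO(blob)) as zf:
            return zf.namelist()
    except zipfile.BadZipFile:
        return None


vim_zip = make_zip("vim", b"v" * 1000)
ls_zip = make_zip("ls", b"l" * 500)

# Two concatenated archives: only the last central directory is used.
concatenated = names_seen(vim_zip + ls_zip)

# A large blob of trailing data: the end of central directory record is
# no longer near the end of the file, so zipfile cannot find it.
appended = names_seen(vim_zip + b"\x00" * 100_000)
```

With the Python versions I have looked at, concatenated only contains "ls" and appended is None, which mirrors what unzip does above.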
For regular ZIP files this is not much of a problem, and you can do the following (simplified):
1. search for a local file header (section 4.3.7) and parse it. The compressed file size is especially important, as this allows you to skip the compressed data.
2. repeat step 1 for all files, until you hit the first central directory record (section 4.3.12). There will be one for every file and/or directory stored in the ZIP file. The data in the central directory is largely a copy of the local file header, so it is ideal for verifying data consistency.
3. parse all central directory records, until you encounter the end of central directory record (section 4.3.16). Parse this as well to make sure you also catch things like the ZIP comment.
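The steps above can be sketched in Python roughly as follows. This is a simplified illustration, not a complete carver: the function names carve_zip and carve_all are my own, the header sizes (30, 46 and 22 bytes) are the fixed parts of the records in the sections mentioned above, and entries that use ZIP64 or a data descriptor are simply rejected.

```python
import struct

LOCAL_SIG = b"PK\x03\x04"    # local file header (section 4.3.7)
CENTRAL_SIG = b"PK\x01\x02"  # central directory header (section 4.3.12)
EOCD_SIG = b"PK\x05\x06"     # end of central directory record (section 4.3.16)


def carve_zip(data, start):
    """Return the offset just past the ZIP archive starting at start, or None."""
    pos = start
    entries = 0
    # Step 1: walk the local file headers, skipping the compressed data.
    while data[pos:pos + 4] == LOCAL_SIG:
        if pos + 30 > len(data):
            return None
        flags, = struct.unpack_from("<H", data, pos + 6)
        csize, _usize, name_len, extra_len = struct.unpack_from("<IIHH", data, pos + 18)
        if flags & 0x08 or csize == 0xFFFFFFFF:
            return None  # data descriptor or ZIP64: not handled in this sketch
        pos += 30 + name_len + extra_len + csize
        entries += 1
    if entries == 0:
        return None
    # Step 2: walk the central directory headers, one per stored entry.
    records = 0
    while data[pos:pos + 4] == CENTRAL_SIG:
        if pos + 46 > len(data):
            return None
        name_len, extra_len, comment_len = struct.unpack_from("<HHH", data, pos + 28)
        pos += 46 + name_len + extra_len + comment_len
        records += 1
    # Step 3: the end of central directory record, including the ZIP comment.
    if records != entries or data[pos:pos + 4] != EOCD_SIG or pos + 22 > len(data):
        return None
    comment_len, = struct.unpack_from("<H", data, pos + 20)
    end = pos + 22 + comment_len
    return end if end <= len(data) else None


def carve_all(data):
    """Yield (start, end) for every ZIP archive that can be carved from data."""
    pos = data.find(LOCAL_SIG)
    while pos != -1:
        end = carve_zip(data, pos)
        if end is None:
            pos = data.find(LOCAL_SIG, pos + 1)
        else:
            yield pos, end
            pos = data.find(LOCAL_SIG, end)
```

Each (start, end) pair can then be sliced out of the larger file and handed to a regular ZIP library. A more careful version would also compare the fields in each central directory record with the matching local file header.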
ZIP64
When packing large files the 4 byte size fields in the local file header and central directory might not be enough to hold the size of the compressed or uncompressed file. There is an extension called ZIP64, which allows storing larger files. For this the sizes are stored somewhere else in the file. In the central directory special purpose ZIP64 records (sections 4.3.14 and 4.3.15) are included, but this only helps when working backwards from the central directory, not from the beginning of the file. Instead, the correct size will be stored in a so called "extra field" (sections 4.4.8, 4.4.9 and 4.5.3).
Data descriptor
There are cases where the local file header does not contain the file size. This is when bit 3 of the general purpose bit flag is set to 1 (sections 4.4.4, 4.4.8, 4.4.9). Instead the sizes follow the file data. This is not a problem if you already have the central directory, as there you can simply find the correct size, but if you don't know where the central directory is (which is our assumption), then you need to do some more work: from the local file header work forward until you see known ZIP headers and then perform some checks. This can actually be quite tricky.
The best example is when a ZIP file contains another ZIP file. The ZIP program will not compress already compressed data, because there would not be a gain (and a file might actually become larger as a result). That's why sometimes you can see data that is not compressed. If a ZIP file contains another ZIP file there will be lots of false positives for ZIP headers. The workaround is to look at the 12 bytes just before the header: these should contain the CRC-32, compressed size and uncompressed size of the data descriptor (section 4.3.9), which can be used to verify if it is indeed the end of the compressed data.
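Both special cases, the ZIP64 extra field and the data descriptor, can be sketched as follows. The function names zip64_sizes and find_data_descriptor are hypothetical. The sketch assumes that the ZIP64 extra field has header ID 0x0001 with the uncompressed size followed by the compressed size (section 4.5.3), that a data descriptor may be preceded by the optional signature PK\x07\x08, and that the descriptor uses 4 byte sizes. ZIP64 data descriptors with 8 byte sizes are not handled, and only the size is checked, not the CRC.

```python
import struct

LOCAL_SIG = b"PK\x03\x04"
CENTRAL_SIG = b"PK\x01\x02"
DESCRIPTOR_SIG = b"PK\x07\x08"  # optional data descriptor signature


def zip64_sizes(extra):
    """Return (uncompressed, compressed) sizes from a local header's extra field, or None."""
    pos = 0
    while pos + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, pos)
        if header_id == 0x0001 and size >= 16 and pos + 20 <= len(extra):
            return struct.unpack_from("<QQ", extra, pos + 4)
        pos += 4 + size
    return None


def find_data_descriptor(data, data_start):
    """Return (crc, csize, usize, next_header_offset) for data at data_start, or None."""
    pos = data_start
    while True:
        # Candidate end of data: the next thing that looks like a ZIP header.
        hits = [p for p in (data.find(sig, pos) for sig in (LOCAL_SIG, CENTRAL_SIG))
                if p != -1]
        if not hits:
            return None
        nxt = min(hits)
        if nxt - 12 >= data_start:
            crc, csize, usize = struct.unpack_from("<III", data, nxt - 12)
            has_sig = (nxt - 16 >= data_start
                       and data[nxt - 16:nxt - 12] == DESCRIPTOR_SIG)
            # The recorded compressed size must match the distance walked.
            if csize == nxt - 12 - data_start:
                return crc, csize, usize, nxt
            if has_sig and csize == nxt - 16 - data_start:
                return crc, csize, usize, nxt
        pos = nxt + 1  # false positive, for example a header inside a nested ZIP
```

The size comparison is what filters out the false positives from nested ZIP files: a header signature inside the stored data is very unlikely to be preceded by 12 bytes that happen to encode the exact distance from the start of the data.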
Android signing block
According to the ZIP specifications there should not be any data between the file data and the central directory (at least, that is how I am reading it), but in practice this isn't always followed. Also, it doesn't matter, as programs unpacking ZIP files can simply rely on the central directory. Unless, of course, you are carving ZIP files, like I do.
Since Android 7 Google has been squeezing the APK Signing Block in between the ZIP file data and the central directory of the ZIP file, as explained on the Android Signature Scheme page.
The Android APK Signing Block has a clear structure, so it is easy to verify whether there is an APK Signing Block and to skip it.
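A minimal sketch of such a check when carving forward is shown below. It assumes the layout as I understand it from the Android documentation: an 8 byte little endian block size (not counting the size field itself), the ID-value pairs, the same size repeated, and finally the 16 byte magic "APK Sig Block 42". The function name skip_apk_signing_block is my own.

```python
import struct

APK_SIG_MAGIC = b"APK Sig Block 42"


def skip_apk_signing_block(data, pos):
    """Return the offset just past an APK Signing Block at pos, or pos if there is none."""
    if pos + 8 > len(data):
        return pos
    size, = struct.unpack_from("<Q", data, pos)
    end = pos + 8 + size
    # The block must at least hold the repeated size (8) and the magic (16).
    if size < 24 or end > len(data):
        return pos
    size_again, = struct.unpack_from("<Q", data, end - 24)
    if size_again == size and data[end - 16:end] == APK_SIG_MAGIC:
        return end
    return pos
```

When the last local file entry has been skipped and the next bytes are not a central directory header, calling this function at that position either moves past the signing block to the central directory or leaves the position unchanged.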