Walkthrough: WebP file format

A graphics file format that I am encountering a bit more often during my work is Google's WebP file format. Even though it is fairly recent (for the history it is best to read the Wikipedia page about WebP), it builds on some quite old foundations.

One reason for Google to come up with a new graphics file format was file size: Google indexes, stores and sends many graphics files. By reducing the size of files they could save significantly on bandwidth and storage space. Shaving off some bytes here and there really starts to add up when you are doing it by the billions.
Everything counts in large amounts - Depeche Mode

WebP file format

The WebP format uses the Resource Interchange File Format (RIFF) as its container. This container is also used by other formats such as WAV and is very easy to process automatically.

A WebP file consists of a header, followed by a number of chunks. The data in the header applies to the entire file, while the data in a chunk applies only to that individual chunk.

WebP header

The header of a WebP file is 12 bytes:
  1. string RIFF (4 bytes)
  2. size of the rest of the file (4 bytes) in little endian format. This excludes the string RIFF and the size field itself
  3. string WEBP (4 bytes)
Because a valid WebP file always has a header, a valid WebP file can never be smaller than 12 bytes. Because the size is recorded in the header it is very easy to verify if the file you are looking at is a WebP file, or contains a WebP file. Compare the size of the file with the declared size plus 8 bytes (for the string RIFF and the size field itself): if the file is smaller, then it cannot be a valid WebP file. If it is the same, then the entire file might be a WebP file, and if the file is larger then part of the file might be a WebP file.
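The size comparison described above can be sketched in a few lines of Python. This is only an illustration: the function name classify_riff_header and the three result strings are made up for this example and do not come from any WebP library.

```python
import struct

def classify_riff_header(header: bytes, file_size: int) -> str:
    """Classify a candidate WebP file based on its first 12 bytes.

    Returns "invalid", "exact" (the whole file might be a WebP file) or
    "embedded" (the file is larger than declared, so only part of it
    might be a WebP file). These labels are hypothetical.
    """
    if len(header) < 12 or file_size < 12:
        return "invalid"
    # "<4sI4s": 4 byte string, little endian unsigned 32 bit int, 4 byte string
    magic, declared, form = struct.unpack("<4sI4s", header[:12])
    if magic != b"RIFF" or form != b"WEBP":
        return "invalid"
    # The declared size excludes the string RIFF and the size field itself.
    total = declared + 8
    if file_size < total:
        return "invalid"
    if file_size == total:
        return "exact"
    return "embedded"
```

In practice the 12 header bytes would come from reading the start of the file and file_size from something like os.path.getsize().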

WebP chunks

The structure of the chunks is:
  1. FourCC (4 bytes) - a string indicating the type of chunk
  2. size of the chunk data (4 bytes) in little endian format. This excludes the FourCC, the size field itself and any padding.
  3. data ('chunk size' bytes if chunk size is even, else chunk size + 1). If the chunk size is odd, then the last byte of the data will be a padding byte with the value 0.
Values for FourCC are defined in the WebP specifications and include strings like:
  • ANIM
  • EXIF
  • VP8X
  • VP8L
but there are a few more. During the history of WebP several tags have been renamed, and one tag (FRGM) has been dropped in later versions of WebP. FourCC values can include spaces: the tag for lossy image data, for example, is VP8 followed by a space.
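To make the chunk layout and the padding rule concrete, here is a small Python sketch that builds a chunk from a FourCC and a payload. The helper names padded_size and build_chunk are invented for this illustration.

```python
import struct

def padded_size(chunk_size: int) -> int:
    """Number of data bytes on disk: odd sizes get one padding byte."""
    return chunk_size + (chunk_size & 1)

def build_chunk(fourcc: bytes, payload: bytes) -> bytes:
    """Serialize one chunk: FourCC, little endian size, data, optional padding."""
    if len(fourcc) != 4:
        raise ValueError("FourCC must be exactly 4 bytes")
    # The size field counts the payload only, not the FourCC, the size
    # field itself or the padding byte.
    header = fourcc + struct.pack("<I", len(payload))
    padding = b"\x00" * (len(payload) & 1)
    return header + payload + padding
```

A 3 byte payload therefore takes up 4 + 4 + 3 + 1 = 12 bytes in the file, while the size field still says 3.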

Writing a WebP parser

A simple parser for WebP is fairly easy to make and would require only a single pass over the file:
  1. check if the file size is 12 bytes or more. If not, then exit.
  2. open the file, read the first 4 bytes and see if it matches the string RIFF. If not, close the file and exit.
  3. read the next 4 bytes and see if the integer + 8 matches the size of the file. If not, close the file and exit (unless you want to carve the file from a bigger file).
  4. read the next 4 bytes and see if it matches the string WEBP. If not, close the file and exit.
Then there are several chunks. For each chunk the following steps should be done until the end of the file is reached:
  1. read the next 4 bytes. Check if four bytes were read. If not, close the file and exit. Optionally check to see if the bytes match a known FourCC from the WebP specification. If not, close the file and exit.
  2. read the next 4 bytes for the length of the chunk. Check if four bytes could be read. If not, close the file and exit. Check if the length + 8 is less than or equal to the number of bytes remaining in the file, counted from the start of the chunk. If not, close the file and exit (as a chunk cannot extend beyond the end of the file).
  3. skip over the number of bytes found in the previous step. If that number is odd, then read the next byte and check if it is 0x00 (padding). If not, close the file and exit.
After all data has been read the amount of data read in all the steps should match the size declared in the header plus 8 header bytes.
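The steps above could look roughly like this in Python. It is a minimal sketch and not a complete validator: the names WebPError and parse_webp are made up here, the optional check against known FourCC values is left out, and the caller is assumed to pass in the real file size.

```python
import io
import os
import struct

class WebPError(Exception):
    """Raised when the data cannot be a valid WebP file (hypothetical name)."""

def parse_webp(f, file_size):
    """Single pass over a binary file object; returns [(fourcc, size), ...]."""
    if file_size < 12:
        raise WebPError("smaller than the 12 byte header")
    header = f.read(12)
    if len(header) != 12:
        raise WebPError("could not read the header")
    magic, declared, form = struct.unpack("<4sI4s", header)
    if magic != b"RIFF":
        raise WebPError("missing RIFF signature")
    if declared + 8 != file_size:
        raise WebPError("declared size does not match the file size")
    if form != b"WEBP":
        raise WebPError("missing WEBP signature")

    chunks = []
    pos = 12  # number of bytes processed so far
    while pos < file_size:
        fourcc = f.read(4)
        if len(fourcc) != 4:
            raise WebPError("truncated FourCC")
        raw_size = f.read(4)
        if len(raw_size) != 4:
            raise WebPError("truncated chunk size")
        (size,) = struct.unpack("<I", raw_size)
        pos += 8
        if size > file_size - pos:
            raise WebPError("chunk extends beyond the end of the file")
        f.seek(size, os.SEEK_CUR)  # skip the chunk data
        pos += size
        if size % 2 == 1:
            if f.read(1) != b"\x00":
                raise WebPError("bad or missing padding byte")
            pos += 1
        chunks.append((fourcc, size))
    return chunks

# In-memory demonstration with a single, made-up 2 byte VP8L payload.
sample = (b"RIFF" + struct.pack("<I", 14) + b"WEBP"
          + b"VP8L" + struct.pack("<I", 2) + b"xy")
print(parse_webp(io.BytesIO(sample), len(sample)))
```

For a file on disk the same function can be called with a file opened in "rb" mode and os.path.getsize() of that file. Because the loop only ends when exactly file_size bytes have been consumed, the final check on the amount of data read is implied by the loop condition.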

A more extensive parser could do more checks on the data in the individual chunks. In case a file has to be carved from a larger file, step 1 of the chunk processing should be changed slightly: instead of exiting, simply stop processing and write all data read so far to a separate file.
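A carving variant might be sketched as follows, working on data that is already in memory. The function name carve_webp is invented for this example, and the KNOWN_FOURCC set is an illustrative selection of tags from the WebP container specification that is not guaranteed to be complete. Note that this sketch does not rewrite the size field in the header if the carved data ends up shorter than declared.

```python
import struct

# Illustrative set of FourCC values; an assumption, possibly incomplete.
KNOWN_FOURCC = {b"VP8 ", b"VP8L", b"VP8X", b"ALPH", b"ANIM",
                b"ANMF", b"ICCP", b"EXIF", b"XMP "}

def carve_webp(blob: bytes, offset: int = 0):
    """Return the WebP candidate starting at offset in blob, or None."""
    if blob[offset:offset + 4] != b"RIFF":
        return None
    if blob[offset + 8:offset + 12] != b"WEBP":
        return None
    (declared,) = struct.unpack_from("<I", blob, offset + 4)
    # Never look beyond the declared size or the end of the data.
    end = min(offset + 8 + declared, len(blob))
    pos = offset + 12
    while pos + 8 <= end:
        if blob[pos:pos + 4] not in KNOWN_FOURCC:
            break  # unknown tag: stop and keep what was read so far
        (size,) = struct.unpack_from("<I", blob, pos + 4)
        next_pos = pos + 8 + size + (size & 1)  # include the padding byte
        if next_pos > end:
            break  # chunk does not fit: stop here
        pos = next_pos
    if pos == offset + 12:
        return None  # header only, no complete chunk found
    return blob[offset:pos]
```

Candidate offsets could be found with something like blob.find(b"RIFF"), and the returned bytes written to a separate file.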
