
Walkthrough: PNG file format

A relatively straightforward file format that I see a lot in firmware files is the Portable Network Graphics file format, or simply PNG. To give an example of how widespread it is: in a regular Android firmware with a few applications installed you can easily find over 50,000 PNG files, with quite a few duplicates as well.

What baffles me is that quite a few of the license scanning tools out there (including some open source tools) also try to do a license scan of a PNG file. This makes no sense to me at all. While possibly interesting from a copyright perspective (which is about what is in the picture or possibly in the metadata) the files themselves are not interesting when scanning software:
  1. valid PNG files do not contain executable code (maliciously crafted PNG files that exploit errors in PNG parsers are of course a different story).
  2. PNG files cannot be combined with other files to create "derivative" software: software cannot be linked with a PNG file as a PNG is not software. Of course the contents of PNG files could have been copied from somewhere else and a derivative work could be created, but that is not software linking.
  3. PNG files have a fairly fixed structure that makes them look similar to each other, possibly leading to false positives when doing for example "proximity scans" or "fuzzy matching" with algorithms such as TLSH.
This is why you want to skip PNG files when scanning source code or binary files. Parsing PNG files is very easy and quick and if you implement it right you can verify thousands of PNG files in a very short timespan. Verifying if a PNG file is actually a valid PNG allows you to ignore them if you wish to do so (for example by flagging them as a PNG file, or graphics, or something).

The Wikipedia page about PNG has a good explanation about why PNG was created, but in short: patents covering other formats, as well as technical limitations of the other formats.

PNG structure

The PNG specifications are public. To create a parser for PNG the important sections of the specifications are 5 (datastream structure) and 11 (chunk specifications).

Basically a PNG file consists of a PNG signature, followed by several chunks. The chunks all have the same structure containing a length value, a chunk type, a payload and a checksum value. This fixed structure makes it very easy to verify the chunks (without verifying the actual syntax of the chunk) and very quickly step through the file in a single pass.

PNG signature

The signature is always the same 8 bytes for every PNG file and is described in section 5.2 of the specification. Without this signature a file cannot be a valid PNG file.
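
As a minimal sketch, the signature check could look like this in Python. The function name `has_png_signature` is made up for this example; the eight byte values are the ones listed in section 5.2 of the specification.

```python
# The 8 byte PNG signature: 137 80 78 71 13 10 26 10 (decimal)
PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'


def has_png_signature(data: bytes) -> bool:
    """Return True if data starts with the PNG signature."""
    return data[:8] == PNG_SIGNATURE
```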

Chunks

The signature is followed by a set of chunks. Each chunk has 3 or 4 fields (section 5.3 of the specification).
  1. length (4 bytes) - this value is in network byte order (big endian)
  2. chunk type (4 bytes)
  3. chunk data (optional if length = 0)
  4. CRC32 computed from chunk type and chunk data (4 bytes)
A minimal chunk (without data) is 12 bytes and a minimal PNG has three chunks: IHDR (header), IDAT (data) and IEND (terminator).
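
To illustrate the layout, here is a rough sketch that parses a single chunk from a bytes object. The helper `read_chunk` is a hypothetical name used only for this example, and it assumes the data is long enough (otherwise `struct.unpack_from` raises `struct.error`).

```python
import struct
import zlib


def read_chunk(data: bytes, offset: int):
    """Parse one chunk at offset; return (chunk_type, payload, next_offset)."""
    # 4 bytes length (big endian), followed by 4 bytes chunk type
    length, chunk_type = struct.unpack_from('>I4s', data, offset)
    payload = data[offset + 8:offset + 8 + length]
    # the CRC32 is stored right after the payload, also big endian
    (stored_crc,) = struct.unpack_from('>I', data, offset + 8 + length)
    # the checksum covers chunk type and chunk data, but not the length field
    if zlib.crc32(chunk_type + payload) != stored_crc:
        raise ValueError('CRC32 mismatch')
    return chunk_type, payload, offset + 12 + length
```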

The terminator IEND always has length 0 (meaning there is no data), the chunk type is always IEND, so the CRC32 value is also the same. This means that the IEND chunk is always the same 12 bytes (section 11.2.5).
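
The fixed 12 bytes can be reproduced with a few lines of Python (a small sketch using only the standard library; the name `IEND_CHUNK` is just for illustration):

```python
import struct
import zlib

# length 0, chunk type IEND, CRC32 computed over the chunk type only
IEND_CHUNK = struct.pack('>I', 0) + b'IEND' + struct.pack('>I', zlib.crc32(b'IEND'))

print(IEND_CHUNK.hex())  # 0000000049454e44ae426082
```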

The header IHDR can contain different data, but is always 25 bytes (4 bytes length, 4 bytes chunk type, 13 bytes data and 4 bytes CRC32). An IDAT chunk is at least 12 bytes. A minimal PNG file (signature plus three mandatory chunks) is therefore 8 + 25 + 12 + 12 = 57 bytes. A file shorter than 57 bytes cannot be a valid PNG file.

Writing a simple PNG parser

A simple parser to see if a file contains a single PNG (and the whole file is a PNG) in a single pass could look like this:
  1. check if the file size is 57 bytes or more. If not, exit.
  2. open the file at byte 0, read the first 8 bytes and see if it matches the PNG signature. If not, close the file and exit.
  3. read the next 25 bytes
  4. see if the first 4 bytes read in step 3 match 0x00 0x00 0x00 0x0d (= 13), which is the size of the IHDR chunk. If not, close the file and exit.
  5. check the next 4 bytes and see if they match the string IHDR. If not, close the file and exit.
  6. take the next 13 bytes (the chunk data) and compute the CRC32 checksum over the chunk type IHDR followed by these 13 bytes.
  7. the next 4 bytes should match the checksum computed in the previous step. If not, close the file and exit.
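
The steps above could be sketched in Python as follows. This is an illustration working on data already read into memory, not a definitive implementation; the function name `check_signature_and_ihdr` is an assumption made for this example.

```python
import struct
import zlib

PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'


def check_signature_and_ihdr(data: bytes) -> bool:
    """Check the minimum size, the signature and the structure of IHDR."""
    # step 1: anything shorter than 57 bytes cannot be a PNG
    if len(data) < 57:
        return False
    # step 2: the first 8 bytes have to be the PNG signature
    if data[:8] != PNG_SIGNATURE:
        return False
    # step 3: the next 25 bytes are the IHDR chunk
    ihdr = data[8:33]
    # steps 4 and 5: length has to be 13, chunk type has to be IHDR
    length, chunk_type = struct.unpack('>I4s', ihdr[:8])
    if length != 13 or chunk_type != b'IHDR':
        return False
    # step 6: CRC32 over the chunk type plus the 13 bytes of chunk data
    computed = zlib.crc32(ihdr[4:21])
    # step 7: compare with the stored checksum (big endian)
    (stored,) = struct.unpack('>I', ihdr[21:25])
    return computed == stored
```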

Then for each chunk that follows do the following:
  1. read four bytes to determine the chunk length. Verify if four bytes could be read. Check if the chunk fits in the remaining bytes in the file: the length value plus 8 bytes (for the chunk type and the CRC32) should be less than or equal to the number of remaining bytes. If not, close the file and exit (this is because a chunk cannot be outside of the file).
  2. read four bytes to determine the chunk type. Verify if four bytes could be read. If not, close the file and exit. If the chunk type is IHDR, close the file and exit (only one IHDR per file is allowed).
  3. if the chunk type is IEND check if the length of the remaining bytes in the file is exactly four. If not, close and exit. Check if the length of the chunk equals 0. If not, close and exit. Read four bytes (CRC32 checksum) and verify if they equal 0xae 0x42 0x60 0x82. If not, close and exit.
  4. if the chunk type is not IEND, then read the amount of bytes as specified in the chunk length. Verify if the amount of bytes could actually be read. If not, close the file and exit. Append these bytes to the chunk type from step 2.
  5. compute the CRC32 for the data from the previous step (chunk type followed by chunk data).
  6. read four bytes from the file. Verify that four bytes could be read. If not, close the file and exit. Verify that the bytes match the result from the previous step. If not, close the file and exit.
If the end of the file has been reached, but no IDAT or IEND chunks were found, then the file is invalid as well.
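
Put together, the whole verifier could look roughly like the sketch below. It is one possible way to write it, not the only one: the names `is_png` and `is_png_bytes` are made up for this example, the IHDR chunk is handled inside the same loop instead of separately, and the IEND checksum is verified by computing it rather than by comparing it to the fixed bytes.

```python
import io
import struct
import zlib

PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'


def is_png(f, size: int) -> bool:
    """Structurally verify that binary file object f (size bytes) is one PNG."""
    if size < 57 or f.read(8) != PNG_SIGNATURE:
        return False
    first = True
    seen_idat = False
    while True:
        header = f.read(8)
        if len(header) != 8:
            return False  # end of file reached without seeing IEND
        length, chunk_type = struct.unpack('>I4s', header)
        # chunk data plus CRC32 have to fit in the rest of the file
        if length + 4 > size - f.tell():
            return False
        # IHDR has to be the first chunk, is 13 bytes and is seen only once
        is_ihdr = chunk_type == b'IHDR'
        if first != is_ihdr or (is_ihdr and length != 13):
            return False
        first = False
        data = f.read(length)
        (stored,) = struct.unpack('>I', f.read(4))
        if zlib.crc32(chunk_type + data) != stored:
            return False
        if chunk_type == b'IDAT':
            seen_idat = True
        elif chunk_type == b'IEND':
            # IEND is empty, needs a preceding IDAT and has to end the file
            return length == 0 and seen_idat and f.tell() == size


def is_png_bytes(data: bytes) -> bool:
    """Convenience wrapper for data that is already in memory."""
    return is_png(io.BytesIO(data), len(data))
```

For a file on disk this could be called as `is_png(f, os.path.getsize(path))` with `f` opened in binary mode.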

And basically that is all there is to it. It is really simple.

Note that this verifier would not look at the actual payload of the chunks to see if it is correct. It is purely to see if the structure of the file is valid. Extra checks could include the order in which the chunks appear (section 5.6) and checks for data inside chunks.

Carving PNG files from a larger file

Carving PNG files from a larger file is just a little bit more work but is also very easy to do. The only changes are that the PNG signature might not be at byte 0, and after seeing IEND the rest of the data should simply be ignored.
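
A rough sketch of such a carver, working on a blob that has been read into memory, could look like this. The function names `carve_pngs` and `_png_end` are assumptions for this example, and a real implementation would probably want to avoid reading very large files into memory in one go.

```python
import struct
import zlib

PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'


def carve_pngs(blob: bytes):
    """Yield (offset, png_bytes) for each structurally valid PNG in blob."""
    start = blob.find(PNG_SIGNATURE)
    while start != -1:
        end = _png_end(blob, start)
        if end is not None:
            yield start, blob[start:end]
            # continue searching after the PNG that was just found
            start = blob.find(PNG_SIGNATURE, end)
        else:
            start = blob.find(PNG_SIGNATURE, start + 1)


def _png_end(blob: bytes, start: int):
    """Return the offset just past IEND, or None if there is no valid PNG."""
    pos = start + 8
    first, seen_idat = True, False
    while pos + 12 <= len(blob):
        length, chunk_type = struct.unpack_from('>I4s', blob, pos)
        data_end = pos + 8 + length
        if data_end + 4 > len(blob):
            return None  # chunk would be outside of the blob
        if first != (chunk_type == b'IHDR'):
            return None  # IHDR has to come first and only once
        first = False
        (stored,) = struct.unpack_from('>I', blob, data_end)
        if zlib.crc32(blob[pos + 4:data_end]) != stored:
            return None
        pos = data_end + 4
        if chunk_type == b'IDAT':
            seen_idat = True
        elif chunk_type == b'IEND':
            # anything after IEND is ignored
            return pos if length == 0 and seen_idat else None
    return None
```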

This could be useful in case you encounter a file with an unknown structure that cannot be unpacked using regular tools, but from which data can still be extracted. Examples of this could be custom update images from vendors, or images of unknown file systems.
