
PDF woes

In the past few days I have been looking at the PDF file format to implement some basic PDF carving support for BANG. Originally PDF was a proprietary file format from Adobe, but recent versions have been released as an ISO standard. The specification for PDF 1.7 is publicly available (as are the errata), and the specification for PDF 2.0 is available after paying ISO (sigh), but example files for PDF 2.0 are freely available.

At the moment PDF 2.0 is not widely used (although some documents can be displayed by current PDF readers) and most of the documents I have found in the wild are PDF 1.x files.

Many people mistakenly believe that PDF files are files for printers, or that they are images on a page. They are not. Instead PDF is a container format: a basic PDF file consists of a header, a body with various objects and a cross reference table for those objects. Objects could be streams (think: pictures), text, fonts, comments, numbers, dictionaries, references, and so on. The cross reference table records where in the file each object is located and which of the objects are "in use" or "free". The objects can also be modified (rotated, encrypted, compressed, and so on).

PDF files can be incrementally updated: extra data can be appended to the PDF file that adds to the PDF or overrides existing objects, enables/deletes objects, and so on. This makes PDF somewhat of a layered container format (this is all explained in the PDF 1.7 specification in section 7.5).

Reading a PDF works as follows: a PDF reader opens a file, seeks to the end of the file, reads the latest cross reference table (which contains a reference to the previous cross reference table, which contains a reference to an earlier cross reference table, and so on, creating a chain of cross reference tables) and works its way through the file to reconstruct the data.
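The chain of cross reference tables described above can be sketched in a few lines of Python. This is a minimal sketch and not how BANG does it: the names `last_startxref` and `xref_chain` are made up for illustration, the 1024 byte tail size is an arbitrary assumption, and the search for `/Prev` is a plain heuristic that does not actually parse the trailer dictionary.

```python
import re
from typing import List, Optional

# White-space characters as listed in section 7.2.2 of the PDF 1.7 specification.
WS = rb"[\x00\t\n\x0c\r ]"
STARTXREF_RE = re.compile(rb"startxref" + WS + rb"+(\d+)" + WS + rb"+%%EOF")
PREV_RE = re.compile(rb"/Prev" + WS + rb"+(\d+)")


def last_startxref(data: bytes, tail_size: int = 1024) -> Optional[int]:
    """Return the offset named by the last startxref in the tail, or None."""
    matches = STARTXREF_RE.findall(data[-tail_size:])
    return int(matches[-1]) if matches else None


def xref_chain(data: bytes) -> List[int]:
    """Follow /Prev entries and return the visited table offsets, newest first."""
    offsets: List[int] = []
    offset = last_startxref(data)
    # Stop on a missing /Prev, an out-of-range offset, or a loop in the chain.
    while offset is not None and 0 <= offset < len(data) and offset not in offsets:
        offsets.append(offset)
        # Heuristic: the trailer for this table sits before the next "startxref".
        end = data.find(b"startxref", offset)
        section = data[offset:end if end != -1 else len(data)]
        prev = PREV_RE.search(section)
        offset = int(prev.group(1)) if prev else None
    return offsets
```

Calling `xref_chain(data)` on the contents of a file gives the offsets in the order a reader would visit them. A real implementation would also have to deal with cross reference streams and would need to parse the dictionaries properly.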

At least, that's how it should be. What I have found is that in practice this isn't quite the case. Some of it is due to the fact that the PDF format is ambiguous and not everything is explained. In other cases programs are simply generating files that are not compliant with the PDF standards. Let's walk through a few examples.

Line endings

The PDF standard allows different types of line endings so that files work on Unix, Windows and older Apple Macs: the line endings for all three platforms (LF for Unix, CR for Apple and CRLF for Windows) are all valid "End Of Line" (EOL) markers. But there are a few things in the PDF standard that are simply not well explained. Let's look at the end-of-file marker (section 7.5.5):

"The last line of the file shall contain only the end-of-file marker, %%EOF."

This could be read in two ways:

  1. The last part of the file can only be %%EOF and the file has to end immediately after it
  2. The last line can only be %%EOF, where a line ends with an EOL marker

In section 7.5.1 the following definition of "line" is given:

"Each line shall be terminated by an end-of-line (EOL) marker, which may be a CARRIAGE RETURN (0Dh), a LINE FEED (0Ah), or both."

so I am inclined to go with the second interpretation, as do most tool vendors. However, I have seen files where the file ended immediately after the end-of-file marker, without an EOL marker. Most likely the authors of those tools got confused by section 7.2.3:

"The comment consists of all characters after the PERCENT SIGN and up to but not including the end of the line, including regular, delimiter, SPACE (20h), and HORZONTAL TAB characters (09h)."

but this actually does not contradict the convention that there are EOL markers after each line: they can be there, they are just not part of the comment.
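Since both variants occur in the wild, a parser has to accept an end-of-file marker with or without a trailing EOL. A minimal sketch in Python, where the function name `eof_line_ending` is hypothetical and only the last marker in the data is inspected:

```python
from typing import Optional


def eof_line_ending(data: bytes) -> Optional[bytes]:
    """Return the EOL bytes after the last %%EOF, b"" if none, None if no marker."""
    pos = data.rfind(b"%%EOF")
    if pos == -1:
        return None
    # Look at the (at most) two bytes that follow the five byte marker.
    rest = data[pos + 5:pos + 7]
    # Test CRLF first so that it is not mistaken for a lone CR.
    for eol in (b"\r\n", b"\r", b"\n"):
        if rest.startswith(eol):
            return eol
    return b""
```

An empty result corresponds to the first interpretation (the file stops right after the marker), any other result to the second one.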

White space

The specification is not clear about where white space is allowed, and I have encountered several files with white space in places that confused me. One example is the header of the PDF file. In section 7.5.2 of the specification it says:

"The first line of a PDF file shall be a header consisting of the 5 characters %PDF- followed by a version number of the form 1.N, where N is a digit between 0 and 7."

(of course, this has been updated by the PDF 2.0 specification)

The question is: is it OK to have trailing white space after the version number? Reading the specification it seems that this is not OK, but, again, this is not totally clear. In section 7.2.2 the following is said about white space:

"All white-space characters are equivalent, except in comments, strings, and streams. In all other contexts, PDF treats any sequence of consecutive white-space characters as one character."

and the same section contains a list of characters that are considered white space, including all of the EOL markers. This is confusing, as elsewhere in the specification the EOL markers are given a specific meaning.

I have encountered PDF 1.3 files (made by an older version of "Simple Scan", with the Producer tag in the PDF set to "ImageMagick 6.5.8-10 2010-12-17 Q16") where extra white space was added after the header. This was most likely just a bug as it was fixed in other versions.

There are other files with a lot of extra white space, for example in the trailer between the end of the cross reference table and the "startxref" element, where in one case I found dozens of white space characters.

So when parsing PDF files this needs to be taken into account.
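For the header this could look like the following sketch, which accepts trailing white space after the version number but reports how much of it there was. The name `parse_header` and the returned tuple are hypothetical choices for illustration. The set of non-EOL white-space characters (NUL, TAB, FORM FEED and SPACE) is taken from section 7.2.2.

```python
import re
from typing import Optional, Tuple

# Header, optional non-EOL white space, then one EOL marker (CRLF, CR or LF).
HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)([\x00\t\x0c ]*)(\r\n|\r|\n)")


def parse_header(data: bytes, offset: int = 0) -> Optional[Tuple[int, int, int]]:
    """Return (major, minor, number of trailing white-space bytes), or None."""
    m = HEADER_RE.match(data, offset)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2)), len(m.group(3))
```

Returning the count instead of silently ignoring the white space makes it possible to flag such files as "valid, but odd" later on.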

Comments

Similar to white space: it is unclear where comments are allowed and I have seen instances (files created with iText for example) where there are comments in the trailer between the end of the cross reference table and the "startxref" element.

Section 7.2.3 says that comments should be treated as white space:

"A conforming reader shall ignore comments, and treat them as single white-space characters."

which would make this the same as the white space situation.
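In practice this means that white space and comments can be skipped by the same piece of code, for example when looking for "startxref" after the trailer. A minimal sketch, with the hypothetical name `skip_whitespace_and_comments`; it treats everything from a PERCENT SIGN up to the next CR or LF as a comment, so it should not be used at places where %%EOF or the header are expected:

```python
# White-space characters as listed in section 7.2.2 of the PDF 1.7 specification.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def skip_whitespace_and_comments(data: bytes, pos: int) -> int:
    """Return the first offset at or after pos that is not white space or comment."""
    while pos < len(data):
        if data[pos] in PDF_WHITESPACE:
            pos += 1
        elif data[pos:pos + 1] == b"%":
            # A comment runs up to, but does not include, the next EOL marker.
            while pos < len(data) and data[pos] not in b"\r\n":
                pos += 1
        else:
            break
    return pos
```

For data such as `b">>\n% a comment\r\n   startxref\n"` and a starting position of 2 the function returns the offset of "startxref".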

Cross reference tables and updates

The PDF specification is very clear in section 7.5.6 that updates should be appended to a file:

"When updating a PDF file incrementally, changes shall be appended to the end of the file, leaving its original contents intact."

but in reality I am seeing many files where the updates are prepended to the file instead, with the value from startxref pointing to the update at the beginning of the file and the value for the previous cross reference table pointing forward in the file, instead of backward.

Now, for normal use this probably doesn't really matter: the cross reference table specifications themselves don't specify that the location of the previous table should be earlier in the file. As long as the references are correct it doesn't matter where in the file the data is. This makes it more of a "random access" type of container.
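To get an idea of how common this is in a test set, the offsets of the cross reference tables can be checked for hops that point forward. A small sketch, assuming the offsets have already been collected into a list ordered from the table named by the last startxref to the oldest one; the name `forward_hops` is hypothetical:

```python
from typing import List, Tuple


def forward_hops(chain: List[int]) -> List[Tuple[int, int]]:
    """Return (table, previous) offset pairs where "previous" lies later in the file."""
    return [(cur, prev) for cur, prev in zip(chain, chain[1:]) if prev > cur]
```

For a file with appended updates, for example `[900, 400, 100]`, the result is empty. For a chain such as `[120, 5000]` the result is `[(120, 5000)]`.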

I am just left wondering: why?

Why do I care?

A good question is why I even care about these things: as long as PDF writers have taken care of the right references in their files none of these things should be a problem.

The reason I care is quite simple: I don't know what data I get in advance. It could be that I get a blob (a firmware file, a proprietary file system, etc.), which includes a PDF file and I want to be able to (partially) carve the PDF from the larger file. For this it is important to find out where the file starts (usually not the biggest problem) and where it ends (and this is where it gets tricky).
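A naive first pass at carving could look like the sketch below: find every PDF header in the blob and pair it with every end-of-file marker (plus one optional EOL marker) that follows it. This is only an illustration with a hypothetical function name, not the approach used in BANG, and every candidate range would still have to be verified, for example by checking that the cross reference tables point to sensible offsets.

```python
import re
from typing import Iterator, Tuple

HEADER_RE = re.compile(rb"%PDF-\d\.\d")
# The end-of-file marker, optionally followed by one EOL marker.
EOF_RE = re.compile(rb"%%EOF(?:\r\n|\r|\n)?")


def carve_candidates(blob: bytes) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) ranges that start at a header and end after a %%EOF."""
    for header in HEADER_RE.finditer(blob):
        # Every later %%EOF is a candidate: incremental updates add more of them.
        for eof in EOF_RE.finditer(blob, header.end()):
            yield header.start(), eof.end()
```

Because incrementally updated files contain several end-of-file markers, the sketch yields all of them instead of stopping at the first one. The last candidate that still verifies is the most complete version of the file.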

Wrapping up

There are definitely a few places where the PDF 1.7 specification is unclear and could benefit from improvements, with more examples and more clarifications.

The exceptions and ambiguities I found came from a test set of just over 600 PDF files. I am sure that as soon as I encounter more files there will be more exceptions and edge cases.
