In the past few days I have been looking at the PDF file format to implement some basic PDF carving support for BANG. Originally PDF was a proprietary file format from Adobe, but recent versions have been released as an ISO standard. The specification for PDF 1.7 is publicly available (as are the errata), and the specification for PDF 2.0 is available after paying ISO (sigh), but example files for PDF 2.0 are freely available.
At the moment PDF 2.0 is not widely used (although some documents can be displayed by current PDF readers) and most of the documents I have found in the wild are PDF 1.x files.
Many people mistakenly believe that PDF files are files for printers, or that they are images on a page. They are not. Instead PDF is a container format: a basic PDF file consists of a header, a body with various objects and a cross reference table for those objects. Objects could be streams (think: pictures), text, fonts, comments, numbers, dictionaries, references, and so on. The cross reference table records where in the file each object is located and which of the objects are "in use" or "free". The objects can also be modified (rotated, encrypted, compressed, and so on).
PDF files can be incrementally updated: extra data can be appended to the PDF file that adds to the PDF or overrides existing objects, enables/deletes objects, and so on. This makes PDF somewhat of a layered container format (this is all explained in the PDF 1.7 specification in section 7.5).
Reading a PDF works as follows: a PDF reader opens a file, seeks to the end of the file, reads the latest cross reference table (which contains a reference to the previous cross reference table, which contains a reference to an earlier cross reference table, and so on, creating a chain of cross reference tables) and works its way through the file to reconstruct the data.
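The chain described above can be sketched in a few lines of Python. This is only a rough illustration, not how BANG or any PDF reader actually does it: it assumes classic cross reference tables (no cross reference streams), offsets relative to the start of the data, and the function names are made up for this example.

```python
import re

# 'startxref', the offset of a cross reference section, then '%%EOF'.
STARTXREF_RE = re.compile(rb"startxref\s+(\d+)\s+%%EOF")
# The /Prev entry in a trailer dictionary points at the previous section.
PREV_RE = re.compile(rb"/Prev\s+(\d+)")


def last_startxref(data: bytes) -> int:
    """Return the offset stored after the last 'startxref' keyword."""
    matches = list(STARTXREF_RE.finditer(data))
    if not matches:
        raise ValueError("no startxref found")
    return int(matches[-1].group(1))


def xref_chain(data: bytes) -> list:
    """Follow /Prev entries, starting at the newest cross reference section."""
    offsets = []
    offset = last_startxref(data)
    while offset not in offsets:  # guard against reference loops
        offsets.append(offset)
        # Assumption: the trailer dictionary sits between the cross
        # reference section and the next 'startxref' keyword.
        end = data.find(b"startxref", offset)
        if end == -1:
            break
        prev = PREV_RE.search(data, offset, end)
        if prev is None:
            break
        offset = int(prev.group(1))
    return offsets
```

The returned list starts with the newest cross reference section and ends with the oldest one that could be reached.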
At least, that's how it should be. What I have found is that in practice this isn't quite the case. Some of it is due to the fact that the PDF format is ambiguous and not everything is explained. In other cases programs are simply generating files that are not compliant with the PDF standards. Let's walk through a few examples.
Line endings
In the PDF standard different types of line endings are accepted so that files work on Unix, Windows and older Apple Macs: the line endings of all three platforms (LF for Unix, CR for Apple and CRLF for Windows) are valid end-of-line (EOL) markers. But there are a few things in the PDF standard that are simply not well explained. Let's look at the end-of-file marker (section 7.5.5):
"The last line of the file shall contain only the end-of-file marker, %%EOF."
This could be read in two ways:
- The last part of the file can only be %%EOF and the file has to end immediately
- The last line can only be %%EOF, whereas a line ends with an EOL marker
In section 7.5.1 the following definition of "line" is given:
"Each line shall be terminated by an end-of-line (EOL) marker, which may be a CARRIAGE RETURN (0Dh), a LINE FEED (0Ah), or both."
so I am inclined to go with the second interpretation, as do most tool vendors. However, I have seen files where the file ended immediately after the end-of-file marker, without an EOL marker. Most likely the authors of those tools got confused by section 7.2.3:
"The comment consists of all characters after the PERCENT SIGN and up to but not including the end of the line, including regular, delimiter, SPACE (20h), and HORZONTAL TAB characters (09h)."
but this actually does not contradict the convention that there are EOL markers after each line: they can be there, they are just not part of the comment.
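For carving, the practical consequence is that both readings have to be accepted. A minimal sketch, assuming the position of the %%EOF marker is already known (the function name is hypothetical):

```python
EOF_MARKER = b"%%EOF"


def end_after_eof_marker(data: bytes, marker_pos: int) -> int:
    """Return the offset just past '%%EOF' and at most one EOL marker."""
    end = marker_pos + len(EOF_MARKER)
    # CRLF counts as a single EOL marker, so check for it first.
    if data[end:end + 2] == b"\r\n":
        return end + 2
    if data[end:end + 1] in (b"\r", b"\n"):
        return end + 1
    # No EOL marker: the file ends (or other data starts) right here.
    return end
```

Only one EOL marker is consumed, so bytes that follow it are left for whatever comes next in the larger file.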
White space
The specification is not clear about where white space is allowed and I have encountered several files with white space where I got confused. One example is the header of the PDF file. In section 7.5.2 of the specification it says:
"The first line of a PDF file shall be a header consisting of the 5 characters %PDF – followed by a version number of the form 1.N, where N is a digit between 0 and 7."
(of course, this has been updated by the PDF 2.0 specification)
The question is: is it OK to have trailing white space after the version number? Reading the specification it seems that this is not OK, but, again, this is not totally clear. In section 7.2.2 the following is said about white space:
"All white-space characters are equivalent, except in comments, strings, and streams. In all other contexts, PDF treats any sequence of consecutive white-space characters as one character."
and the section contains a list of characters that are considered white space, including all of the EOL markers. This is confusing, as elsewhere in the specification the EOL markers are given a specific meaning.
I have encountered PDF 1.3 files (made by an older version of "Simple Scan", with the Producer tag in the PDF set to "ImageMagick 6.5.8-10 2010-12-17 Q16") where extra white space was added after the header. This was most likely just a bug as it was fixed in other versions.
There are other files with a lot of extra white space, for example in the trailer between the end of the cross reference table and the "startxref" element, where in one case I found dozens of white space characters.
So when parsing PDF files this needs to be taken into account.
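As an illustration of what taking this into account could look like for the header, here is a small sketch of a tolerant header check. The pattern and the function name are my own for this example; the accepted characters are the non-EOL white-space characters from section 7.2.2 (NUL, tab, form feed and space).

```python
import re

# '%PDF-' and a version number, then optional trailing white space
# (NUL, HT, FF, SP) before the EOL marker that ends the header line.
HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)[ \t\x00\x0c]*(?:\r\n|\r|\n)")


def parse_header(data: bytes, offset: int = 0):
    """Return (major, minor) if a PDF header starts at offset, else None."""
    m = HEADER_RE.match(data, offset)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))
```

A strict reading of section 7.5.2 would reject the trailing white space; this sketch deliberately accepts it.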
Comments
Similar to white space: it is unclear where comments are allowed and I have seen instances (files created with iText for example) where there are comments in the trailer between the end of the cross reference table and the "startxref" element.
Section 7.2.3 says that comments should be treated as white space:
"A conforming reader shall ignore comments, and treat them as single white-space characters."
which would make this the same as the white space situation.
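Treating both the same way could look like the sketch below, which skips any mix of white space and comments from a given offset. The names are hypothetical, and note that it would also skip %%EOF (which is syntactically a comment), so it should only be used in places where that marker is not expected, such as between the trailer dictionary and "startxref".

```python
import re

# Any sequence of white-space runs and '%' comments; a comment runs up
# to, but does not include, the end of the line.
FILLER_RE = re.compile(rb"(?:[\x00\t\n\x0c\r ]+|%[^\r\n]*)*")


def skip_filler(data: bytes, offset: int) -> int:
    """Return the offset of the first byte that is not white space or comment."""
    # The pattern can match the empty string, so match() never returns None.
    return FILLER_RE.match(data, offset).end()
```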
Cross reference tables and updates
The PDF specification is very clear in section 7.5.6 that updates should be appended to a file:
"When updating a PDF file incrementally, changes shall be appended to the end of the file, leaving its original contents intact."
but in reality I am seeing many files where the updates are prepended to the file instead, with the value from startxref pointing to the update at the beginning of the file and the value for the previous cross reference table pointing forward in the file, instead of backward.
Now, for normal use this probably doesn't really matter: the cross reference table specifications themselves don't specify that the location of the previous table should be earlier in the file. As long as the references are correct it doesn't matter where in the file the data is. This makes it more of a "random access" type of container.
I am just left wondering: why?
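Spotting such files is straightforward once the chain of cross reference offsets is known. A small sketch, assuming the offsets are given newest first (the list a reader would build by following the previous-table references); the function name is made up for this example:

```python
def forward_links(chain):
    """Return (offset, prev) pairs where the previous table lies later in the file."""
    return [(cur, prev) for cur, prev in zip(chain, chain[1:]) if prev > cur]
```

For a file with appended updates the result is empty; for a file with prepended updates every link in the chain shows up.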
Why do I care?
A good question is why I even care about these things: as long as PDF writers have taken care of the right references in their files none of these things should be a problem.
The reason I care is quite simple: I don't know what data I get in advance. It could be that I get a blob (a firmware file, a proprietary file system, etc.), which includes a PDF file and I want to be able to (partially) carve the PDF from the larger file. For this it is important to find out where the file starts (usually not the biggest problem) and where it ends (and this is where it gets tricky).
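To show why the end is the tricky part, here is a deliberately naive sketch (not the actual BANG code, and the function name is hypothetical): it finds the first header in a blob and returns every %%EOF after it as a candidate end.

```python
def carve_candidates(blob: bytes) -> list:
    """Return (start, end) candidates for the first PDF found in blob."""
    start = blob.find(b"%PDF-")
    if start == -1:
        return []
    candidates = []
    pos = blob.find(b"%%EOF", start)
    while pos != -1:
        end = pos + len(b"%%EOF")
        candidates.append((start, end))
        pos = blob.find(b"%%EOF", end)
    return candidates
```

Every incremental update adds another %%EOF, so a single PDF can yield several candidates, and picking the right one means validating the cross reference chain for each of them. Offsets inside the PDF are relative to its own start, so they have to be shifted by the start offset when checking them against the blob.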
Wrapping up
There are definitely a few places where the PDF 1.7 specification is unclear and could benefit from improvements, with more examples and more clarifications.
The exceptions and ambiguities I found were from a test set of just over 600 PDF files. I am sure that as soon as I encounter more files there will be more exceptions and edge cases.