
Weird files everywhere

I have been working on analysing binary files (such as firmware files) for well over a decade now. In the first few years I did this mostly by hand using standard Linux tools, but since late 2009 I have been working on (and with) dedicated tools.

While working on these tools I have heard from some people that the problems I try to solve border on the trivial and that I could just use the standard tools and libraries and glue them together with some custom code. That has not been my experience. Although for most of the files out there it would indeed be as simple as using standard tools to read and verify the files, it gets a lot more complicated as soon as you start working with blobs where you don't know where files begin or end.

As an example: I often encounter firmware update files for embedded Linux devices, where the format really depends on the vendor. Sometimes the firmware is the same size as the flash chip and I don't know where the partitions are or which file systems have been used. Another time I get an archive with file systems that are flashed by an installer script booted from a temporary Linux system. Or I get a custom firmware update file with all kinds of optimizations (example: a binary diff) and without any information about how it was created (and when I ask the vendor I often hear "we cannot share that information with you").

When data is concatenated things can very quickly get complex. Take a ZIP file as an example: when data is added at the end of a ZIP file the standard tools will simply fail, as they search from the end of the file to find the central directory of the ZIP file. Even a single byte appended to the file will completely throw them off. Checking where a ZIP file ends turns out to be a non-trivial exercise.
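To give an idea of what carving involves, here is a minimal sketch (not my actual tooling) of how the end of a ZIP file could be located inside a larger blob by scanning for the end of central directory (EOCD) record, assuming a plain, non-ZIP64 archive:

import struct

# signature of the 'end of central directory' (EOCD) record
EOCD_SIGNATURE = b'PK\x05\x06'

def find_zip_end(data, start=0):
    '''Return the offset just past the ZIP data starting at 'start',
    or None if no plausible EOCD record can be found.'''
    offset = data.find(EOCD_SIGNATURE, start)
    while offset != -1:
        # the fixed part of the EOCD record is 22 bytes; the last two
        # bytes hold the length of an optional archive comment
        if offset + 22 <= len(data):
            comment_length = struct.unpack('<H', data[offset + 20:offset + 22])[0]
            end = offset + 22 + comment_length
            if end <= len(data):
                return end
        offset = data.find(EOCD_SIGNATURE, offset + 4)
    return None

Even this sketch glosses over the hard parts: the EOCD signature can also appear inside compressed data or inside a nested ZIP file, ZIP64 archives store the real central directory location elsewhere, and the only way to be sure is to verify the central directory and the local file headers it points to.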

Concatenated files that need to be carved are just one of many challenges that I have encountered. There are also quite a few instances where the specifications don't match reality, for example Google's Dalvik format: Google's own Dalvik files don't follow the official Dalvik specifications (for example the "data_size" field in the Dalvik header). There are also many PDF files where updates are not appended to the file, but prepended instead.
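As a rough illustration, a mismatch like that can be spotted with a few lines of code. The sketch below assumes the standard DEX header layout (with data_size as the second to last 32-bit field, at offset 104) and checks the specification's claim that data_size must be a multiple of sizeof(uint); files that violate this turn up in practice:

import struct

def data_size_follows_spec(path):
    '''Check whether the data_size field in a DEX header is a multiple of
    sizeof(uint), as the Dalvik specification says it should be.'''
    with open(path, 'rb') as dexfile:
        header = dexfile.read(112)            # the DEX header is 0x70 bytes
    if len(header) < 112 or not header.startswith(b'dex\n'):
        raise ValueError('not a DEX file')
    # data_size and data_off are the last two uint32 fields in the header
    data_size, data_off = struct.unpack('<II', header[104:112])
    return data_size % 4 == 0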

And then there are vendors deliberately changing formats for reasons unknown (obfuscation, shaving a few bytes off of the space needed, etc.) as well as implementation bugs in tools that output files that do not comply with the specifications: so far I have encountered ZIP files that should not exist, PDF files that prepend instead of append updates and GIF files where the XMP data is wrong.

Combine that with the vast amounts of data that I process and have to wade through and it should be clear that it is far from easy (which is why I am automating the hell out of it). But, nope, people still keep insisting it is trivial...
