I have been working on analysing binary files (such as firmware files) for well over a decade now. In the first few years I did this mostly by hand, using standard Linux tools, but since late 2009 I have been working on (and with) tools.
While working on these tools I have been hearing from some people that the problems I try to solve border on the trivial, and that I could just take the standard tools and libraries and glue them together with some custom code. That has not been my experience. For most files out there it would indeed be as simple as using standard tools to read and verify them, but it gets a lot more complicated as soon as you start working with blobs where you don't know where files begin or end.
As an example: I often encounter firmware update files for embedded Linux devices, where what the format looks like really depends on the vendor. Sometimes the firmware is the same size as the flash chip and I don't know where the partitions are or which file systems have been used. Another time I get an archive with file systems that are flashed by an installer script booted from a temporary Linux system. Or I get a custom firmware update file with all kinds of optimizations (for example a binary diff) and without any information about how it was created (and when I ask the vendor I often hear "we cannot share that information with you").
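Even the very first step with such a raw flash dump, figuring out where file systems might start, already needs custom code. Here is a minimal sketch in Python (simplified, and not taken from any real tool) that scans a dump for a few well known file system magic values; the magic values come from the respective format documentation, everything else is just illustration:

#!/usr/bin/env python3
# A minimal sketch (not real tooling): scan a raw flash dump for a few
# well known file system magic values to get candidate offsets where
# partitions might start.

import sys

MAGICS = {
    b'hsqs': 'squashfs (little endian)',
    b'sqsh': 'squashfs (big endian)',
    b'UBI#': 'UBI erase counter header',
    b'UBI!': 'UBI volume identifier header',
}

def scan(path):
    # reading the whole dump at once is fine for typical flash chip sizes
    with open(path, 'rb') as dump:
        data = dump.read()
    for magic, name in MAGICS.items():
        offset = data.find(magic)
        while offset != -1:
            print(f'possible {name} at offset {offset:#x}')
            offset = data.find(magic, offset + 1)

if __name__ == '__main__':
    scan(sys.argv[1])

Every hit is no more than a hint: four magic bytes can and do occur by accident in a large blob, so each candidate offset still has to be verified by actually parsing the data found there.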
When data is concatenated, things can get complex very quickly. Take a ZIP file as an example: when extra data is appended to a ZIP file, standard tools can easily be thrown off, as they search from the end of the file for the central directory of the ZIP file. Even a single appended byte can be enough to confuse some of them. Finding out where a ZIP file ends turns out to be a non-trivial exercise.
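To make this concrete: the end of ZIP data is marked by an "end of central directory" (EOCD) record, and a carver has to hunt for that record inside the blob. A rough sketch (deliberately simplified, not a complete carver) could look like this; the record layout follows the ZIP specification, the search strategy is just an illustration:

#!/usr/bin/env python3
# A rough sketch of finding where ZIP data inside a larger blob ends, by
# locating the "end of central directory" (EOCD) record. Things it
# deliberately ignores: ZIP64 archives use a different EOCD record, and
# the signature can also occur inside stored data (a ZIP inside a ZIP)
# or inside the archive comment.

import struct
import sys

EOCD_SIGNATURE = b'PK\x05\x06'
EOCD_MIN_SIZE = 22  # fixed part of the EOCD record, excluding the comment

def find_zip_end(data, start=0):
    '''Return the offset just past the ZIP data that starts at `start`,
    or None if no plausible EOCD record is found.'''
    pos = data.rfind(EOCD_SIGNATURE, start)
    while pos != -1:
        if pos + EOCD_MIN_SIZE <= len(data):
            # the last two bytes of the fixed part store the comment length
            comment_length = struct.unpack_from('<H', data, pos + 20)[0]
            end = pos + EOCD_MIN_SIZE + comment_length
            if end <= len(data):
                return end
        # implausible candidate, keep searching towards the front
        pos = data.rfind(EOCD_SIGNATURE, start, pos)
    return None

if __name__ == '__main__':
    with open(sys.argv[1], 'rb') as blob:
        print(find_zip_end(blob.read()))

A real tool still has to parse the central directory and check that every local file header it points to actually exists before the end offset can be trusted.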
Concatenated files that need to be carved are just one of the many challenges that I have encountered. There are also quite a few instances where the specifications don't match reality, for example Google's Dalvik format: Google's own Dalvik files don't follow the official Dalvik specifications (for example "data_size" in the Dalvik header). There are also many PDF files where updates are not appended to the file, but prepended instead.
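This is the kind of mismatch that only shows up when you compare what a header claims with the actual file. A small sketch of such a sanity check could look like the following; the field offsets follow the public dex format documentation, but the checks themselves are only examples, not an exhaustive validator:

#!/usr/bin/env python3
# A small sketch of checking a few fields from a Dalvik (dex) header
# against the actual file.

import struct
import sys

def check_dex(path):
    with open(path, 'rb') as dexfile:
        header = dexfile.read(0x70)        # the dex header is 0x70 bytes
        actual_size = dexfile.seek(0, 2)   # seek to the end for the real size

    if len(header) < 0x70 or not header.startswith(b'dex\n'):
        print('not a dex file')
        return

    file_size, header_size = struct.unpack_from('<II', header, 32)
    data_size, data_off = struct.unpack_from('<II', header, 104)

    if file_size != actual_size:
        print(f'header says {file_size} bytes, file is {actual_size} bytes')
    if header_size != 0x70:
        print(f'unexpected header_size: {header_size:#x}')
    if data_size % 4 != 0:
        print(f'data_size ({data_size}) is not a multiple of 4')
    if data_off + data_size > actual_size:
        print(f'data section ({data_off}+{data_size}) runs past the end of the file')

if __name__ == '__main__':
    check_dex(sys.argv[1])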
And then there are vendors deliberately changing formats for reasons unknown (obfuscation, shaving a few bytes off the space needed, and so on), as well as implementation bugs in tools that output files that do not comply with the specifications: so far I have encountered ZIP files that should not exist, PDF files that prepend updates instead of appending them and GIF files where the XMP data is simply wrong.
Combine that with the vast amounts of data that I process and have to wade through, and it should be clear that it is far from easy (which is why I am automating the hell out of it). But, nope, people still keep insisting it is trivial...