Binary analysis, code scanning & more...

Posts

Showing posts from July 2018

Walkthrough: Intel HEX format

One format that you will not encounter very often unless you work with certain microcontrollers is the Intel HEX format. It is a text format for transferring binary data in a text representation. The Wikipedia article about the format is very informative and lists almost everything that needs to be known about it (but not everything, as I will show later). Most scanners would say that these files are text files, but they are actually binary files in disguise! This is why I try to recognize them and process them. Unless you work a lot with microcontrollers, the most likely place where you will find these files is the Linux kernel, where many firmware files (for chips) are included in Intel HEX format. Creating an unpacker for this file format is quite easy, but you could also use the SRecord package, which can also extract and convert files in different but similar file formats, such as SREC and others. For example to convert th...
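To illustrate why an unpacker for this format is quite easy to write, here is a minimal sketch in Python. It is not the unpacker described in the post: the function name `decode_intel_hex` and the choice to return a dictionary are my own assumptions. It relies only on the record layout documented for the format (start code `:`, byte count, 16-bit address, record type, data, checksum) and ignores the start address records (types 03 and 05).

```python
def decode_intel_hex(text):
    """Decode Intel HEX text into a dict mapping absolute address -> byte value.

    Hypothetical helper for illustration; not taken from the original post.
    """
    memory = {}
    base = 0  # upper address bits, set by record types 02 and 04
    for lineno, line in enumerate(text.splitlines(), 1):
        line = line.strip()
        if not line:
            continue
        if not line.startswith(":"):
            raise ValueError(f"line {lineno}: missing ':' start code")
        record = bytes.fromhex(line[1:])
        # All bytes of a record, checksum included, must sum to 0 modulo 256.
        if len(record) < 5 or sum(record) & 0xFF != 0:
            raise ValueError(f"line {lineno}: bad length or checksum")
        count, rtype = record[0], record[3]
        offset = int.from_bytes(record[1:3], "big")
        data = record[4:-1]
        if len(data) != count:
            raise ValueError(f"line {lineno}: byte count mismatch")
        if rtype == 0x00:    # data record
            for i, value in enumerate(data):
                memory[base + offset + i] = value
        elif rtype == 0x01:  # end of file record
            break
        elif rtype == 0x02:  # extended segment address
            base = int.from_bytes(data, "big") << 4
        elif rtype == 0x04:  # extended linear address
            base = int.from_bytes(data, "big") << 16
        # Record types 03 and 05 (start addresses) are skipped in this sketch.
    return memory
```

Calling `decode_intel_hex(":0300300002337A1E\n:00000001FF")` yields the three data bytes `02 33 7A` at addresses `0x30` to `0x32`. A real unpacker would also have to decide how to handle gaps between address ranges, which this sketch leaves to the caller.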


Fuzzy hash matching

Fuzzy hash matching, or proximity hashing, is a powerful method to find files that are close to the scanned file. But it is not a silver bullet. In this blog post I want to look a bit into proximity matching: when it works and especially when it does not work.

Cryptographic hashes

Most programmers are familiar with cryptographic hashes such as MD5, SHA256, and so on. These hashes are very useful for uniquely identifying files (except in the case of hash collisions, but those are extremely rare). These algorithms work by taking an input (the contents of a file) and computing a very long number from it. A slight change in the input leads to a drastically different number. This is why cryptographic hashes are great for uniquely identifying files, as the same input always leads to the same hash, but useless for comparing files: even nearly identical inputs produce very different hashes, so comparing the hashes says nothing about how similar the files are.

Locality sensitive hashes

A different ...
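The point about cryptographic hashes can be made concrete with a small Python sketch. The helper name `sha256_bit_distance` is my own and is not from the post; it counts how many of the 256 output bits differ between the SHA256 digests of two inputs, using only `hashlib` from the standard library.

```python
import hashlib


def sha256_bit_distance(a: bytes, b: bytes) -> int:
    """Return the number of differing bits between the SHA256 digests of a and b."""
    digest_a = int.from_bytes(hashlib.sha256(a).digest(), "big")
    digest_b = int.from_bytes(hashlib.sha256(b).digest(), "big")
    # XOR leaves a 1 bit wherever the two digests disagree.
    return bin(digest_a ^ digest_b).count("1")


if __name__ == "__main__":
    original = b"The quick brown fox jumps over the lazy dog"
    changed = b"The quick brown fox jumps over the lazy cog"
    print("same input:   ", sha256_bit_distance(original, original))
    print("one byte off: ", sha256_bit_distance(original, changed))
```

For identical inputs the distance is 0. For inputs that differ in a single byte you should expect a distance of roughly half of the 256 bits, which is about what two unrelated files would give as well. That is why the distance between cryptographic hashes carries no information about how similar the inputs are.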


Walkthrough: PNG file format

A relatively straightforward file format that I see used a lot in firmware files is the Portable Network Graphics file format, or simply PNG. To give an example of how widespread it is: in a regular Android firmware with a few applications installed you can easily find over 50,000 PNG files, with quite a few duplicates as well. What baffles me is that quite a few of the license scanning tools out there (including some open source tools) also try to do a license scan of a PNG file. This makes no sense to me at all. While possibly interesting from a copyright perspective (which is about what is in the picture or possibly in the metadata), the files themselves are not interesting when scanning software: valid PNG files do not contain executable code (maliciously crafted PNG files that exploit errors in PNG parsers are of course a different story). PNG files cannot be combined with other files to create "derivative" software: software cannot be linked with a PNG fil...
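As a rough illustration of how a scanner could recognize PNG files and set them aside, here is a minimal Python sketch. The function name `png_chunk_types` is hypothetical and this is not the code from the post. It relies on the documented PNG layout: an 8 byte signature followed by chunks, each consisting of a 4 byte big-endian length, a 4 byte type, the data, and a CRC32 over type and data. It only checks structure; it does not decompress image data or require an IDAT chunk.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_chunk_types(data: bytes):
    """Return the list of chunk types if data is structurally a PNG, else None."""
    if not data.startswith(PNG_SIGNATURE):
        return None
    pos = len(PNG_SIGNATURE)
    types = []
    while pos + 12 <= len(data):
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        chunk_type = data[pos + 4:pos + 8]
        end = pos + 8 + length
        if end + 4 > len(data):
            return None  # chunk claims more data than is available
        (stored_crc,) = struct.unpack(">I", data[end:end + 4])
        # The CRC covers the chunk type and the chunk data, not the length.
        if zlib.crc32(data[pos + 4:end]) != stored_crc:
            return None
        types.append(chunk_type.decode("latin-1"))
        pos = end + 4
        if chunk_type == b"IEND":
            break
    if not types or types[0] != "IHDR" or types[-1] != "IEND":
        return None
    return types
```

A scanner could call this on the start of each file and skip license scanning whenever the result is not `None`. Any data after the IEND chunk is ignored here, although appended data is exactly the kind of thing a firmware unpacker would want to look at separately.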


Walkthrough: ZIP file format

A very popular format for compressing and archiving files is the ZIP format. Its history is well documented at the PKZIP page on Wikipedia. While on Unix systems (such as Linux) other tools (tar for archiving, combined with gzip for compression) have historically been more popular, ZIP is the dominant archiving and compression format in the rest of the world. Other systems, like Java and Android, also use ZIP extensively (in JAR and APK files respectively), and other formats, such as Open Document Format (ODF), are also based on ZIP. The specifications for ZIP have been open for years and can be freely (re)implemented. Every decent programming language has good support for working with ZIP files, although not every implementation supports all functionality. For example, Python 3 has a fairly complete implementation with support for LZMA and bzip2 compression and large files (ZIP64), but not for multi-file ZIP. The references in the rest of this blog post...
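The Python 3 support mentioned above can be tried out with a short sketch using the standard `zipfile` module. The function name `roundtrip` and the member names are made up for this example. It writes the same payload with three compression methods into an in-memory archive and reads it back. It assumes a Python build that includes the `bz2` and `lzma` modules, which is the usual case.

```python
import io
import zipfile


def roundtrip(payload: bytes) -> dict:
    """Write payload with three compression methods, then read it back.

    Returns a dict mapping member name -> compression method constant.
    """
    buffer = io.BytesIO()
    # allowZip64=True (the default) permits ZIP64 extensions for large archives.
    with zipfile.ZipFile(buffer, "w", allowZip64=True) as archive:
        archive.writestr("deflate.bin", payload, compress_type=zipfile.ZIP_DEFLATED)
        archive.writestr("bzip2.bin", payload, compress_type=zipfile.ZIP_BZIP2)
        archive.writestr("lzma.bin", payload, compress_type=zipfile.ZIP_LZMA)

    buffer.seek(0)
    methods = {}
    with zipfile.ZipFile(buffer) as archive:
        for info in archive.infolist():
            if archive.read(info) != payload:
                raise ValueError(f"{info.filename}: data changed in round trip")
            methods[info.filename] = info.compress_type
    return methods
```

Each `ZipInfo` entry records the compression method per member, so a single archive can mix methods. An unpacker has to be prepared for that, and for methods its own ZIP library does not support.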


Binary analysis, code scanning & more...

The problem with the current market of scanning tools

The current market of scanning tools is dominated by a few companies. These companies charge quite a bit of money for solutions with big databases. But when I see how their customers use these tools, I think that in many (but not all) cases they are simply wasting their money: they have the wrong tool for the job at hand, and a different tool would have been more appropriate. What I see is that scanners are used for three use cases: (1) license scanning, (2) finding out which files are from an open source project and used unchanged, and (3) code clone detection ("snippets").

License scanning

For license scanning there is absolutely no need to use the commercial license scanners. There are superior alternatives such as ScanCode and FOSSology, which are open source licensed, scale really well, and have a higher accuracy than most, if not all, commercial license scanners.

Finding out what open source files have...
