Binary analysis, code scanning & more...

The problem with the current market of scanning tools

The current market for scanning tools is dominated by a few companies that charge quite a bit of money for solutions backed by big databases. But when I see how their customers actually use these tools, I think that in many (but not all) cases they are simply wasting their money: they have the wrong tool for the job at hand, and a different tool would have been more appropriate.

What I see is that scanners are used for three use cases:
  1. license scanning
  2. finding out which files are from an open source project and used unchanged
  3. code clone detection ("snippets")

License scanning

For license scanning there is absolutely no need to use the commercial license scanners. There are superior alternatives such as ScanCode and FOSSology, which are open source, scale really well, and have a higher accuracy than most, if not all, commercial license scanners.

Finding out what open source files have been used

I absolutely do not understand why companies keep scanning code they have already seen countless times before. The Yaminabe project run by the Linux Foundation already showed that almost all open source code that is used is used unchanged, so there is no need to keep scanning it. In my own experience (with, for example, the OSADL license compliance audit) the open source components in a device are 99% to 100% unmodified.

Scanning code unnecessarily wastes time, money and resources. Instead, companies should scan once, aggressively cache the results in a database, and then do simple lookups (for example using SHA256 checksums) to see if a file is already known. Very often the "scan" would then be reduced to a single database lookup, which is much faster than running a full scan, and all you need to worry about is the actual differences and modifications. Unfortunately, I have seen EULAs that prevent caching results in this way.

If companies shared a database of pre-scanned and vetted components with each other, prices for scanning would quickly fall to zero.

Code clone detection/Snippets

The difficult part of code scanning is code clone detection, or "detecting snippets" as most vendors tend to call it. This is where there is (currently) no good open source solution that scales.

However, by adding a little bit of extra information to the scanning process you can often make an educated guess about whether open source code is present. For example, if (nearly) all the files come directly from an open source package, but there are a few for which you cannot find an exact match while their names are identical to files in that package, then you can be pretty sure that there is some open source in there, and you can easily work around that situation. As said earlier: in my experience open source code is minimally modified, if at all. For detecting similarities you could consider using TLSH or another proximity matching algorithm to find the closest match.
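TLSH itself is a third-party library, but the idea of proximity matching can be sketched with Python's standard difflib, a much cruder similarity measure than TLSH and used here purely as a stand-in: given a file with no exact checksum match, find the closest known file and then only inspect the differences.

```python
from difflib import SequenceMatcher

def closest_match(unknown, candidates):
    """Return the name of the known file whose contents are closest to
    'unknown', plus a similarity ratio between 0.0 (nothing shared)
    and 1.0 (identical). 'candidates' maps file names to contents."""
    best_name, best_ratio = None, 0.0
    for name, contents in candidates.items():
        ratio = SequenceMatcher(None, unknown, contents).ratio()
        if ratio > best_ratio:
            best_name, best_ratio = name, ratio
    return best_name, best_ratio
```

With upstream files cached, a lightly patched file will typically score very close to 1.0 against its original, which is exactly the "minimally modified" situation described above.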

It gets more difficult when you have a large chunk of proprietary code and you need to find the open source code in there, or when various open source files have been combined into a new one. For that scenario you are, right now, limited to the commercial scanners. You might be able to get somewhere by, for example, searching for function names, string literals or variable names and seeing if they are identical, but that assumes the code has not been obfuscated. If this is your primary use case, then indeed the commercial scanners are your best option right now.
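A rough sketch of that idea, assuming C-like source: pull out string literals and identifiers with a regular expression and compare the resulting token sets using Jaccard similarity. The regexes and any threshold you would apply are illustrative only; a real implementation would use a proper tokenizer:

```python
import re

def tokens(source):
    """Extract string literals and identifier-like tokens from
    C-like source code, as a set."""
    strings = set(re.findall(r'"(?:[^"\\]|\\.)*"', source))
    identifiers = set(re.findall(r'\b[A-Za-z_][A-Za-z0-9_]*\b', source))
    return strings | identifiers

def jaccard(a, b):
    """Similarity between two token sets: |intersection| / |union|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

A high overlap between the tokens of a proprietary blob and those of a known open source package is a hint worth investigating, but renamed identifiers or obfuscation will defeat this, which is exactly the limitation noted above.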

What about security?

The commercial code scanners offer more functionality than just code clone detection and can also connect to databases of security information. At the moment this is where open source and open data solutions are still lacking, but my prediction is that open databases with vulnerability information will soon pop up as well, as most of the proprietary databases are primarily built from open sources such as NVD, cross-referenced with open source code.

Wrapping up...

In my opinion, for most open source scanning use cases you can easily deploy open source tools and use a pure open source stack for the bulk of your scanning needs. There are still some use cases where a pure open source stack is not yet possible, but it is just a matter of time before this gap is closed as well.
