Binary analysis, code scanning & more...

The problem with the current market of scanning tools

The current market for scanning tools is dominated by a few companies that charge quite a bit of money for solutions backed by big databases. But when I see how their customers actually use these tools, I think that in many (but not all) cases they are simply wasting their money: they have the wrong tool for the job at hand, and a different tool would have been more appropriate.

What I see is that scanners are used for three use cases:
  1. license scanning
  2. finding out which files are from an open source project and used unchanged
  3. code clone detection ("snippets")

License scanning

For license scanning there is absolutely no need to use the commercial license scanners. There are superior alternatives such as ScanCode and FOSSology, which are open source, scale really well, and have a higher accuracy than most, if not all, commercial license scanners.

Finding out what open source files have been used

I absolutely do not understand why companies keep scanning code they have already seen countless times before. The Yaminabe project run by the Linux Foundation already showed that almost all open source code that is used is used unchanged, so there is no need to keep scanning it. In my own experience (with, for example, the OSADL license compliance audit) the open source components in a device are 99% to 100% unmodified.

Scanning code unnecessarily wastes time, money and resources. Instead, companies should scan once, aggressively cache the results in a database, and then do simple lookups (for example using SHA256 checksums) to see if a file is already known. Very often the "scan" would then be reduced to a single database lookup, which is much faster than running a full scan, and all you need to worry about is the actual differences and modifications. Unfortunately, I have seen EULAs that prevent caching results in this way.

If companies shared a database of pre-scanned and vetted components with each other, prices for scanning would quickly fall to zero.

Code clone detection/Snippets

The difficult part of code scanning is code clone detection, or "detecting snippets" as most vendors tend to call it. This is where there is (currently) no good open source solution that scales.

However, by adding a little bit of extra information to the scanning process you can often make an educated guess about whether open source code is present. For example, if (nearly) all the files come directly from an open source package, but there are a few for which you cannot find an exact match while their names are identical to files in that package, then you can be pretty sure that there is some open source in there, and you can easily work around that situation. As said earlier: in my experience open source code is minimally modified, if at all. For detecting similarities you could consider using TLSH or another proximity matching algorithm to find the closest match.
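TLSH itself is a third-party library, but the idea of proximity matching can be sketched with Python's standard difflib, a much cruder similarity measure than TLSH and used here purely as a stand-in: given a file with no exact checksum match, find the closest known file and then only inspect the differences.

```python
from difflib import SequenceMatcher

def closest_match(unknown, candidates):
    """Return the name of the known file whose contents are closest to
    'unknown', plus a similarity ratio between 0.0 (nothing shared)
    and 1.0 (identical). 'candidates' maps file names to contents."""
    best_name, best_ratio = None, 0.0
    for name, contents in candidates.items():
        ratio = SequenceMatcher(None, unknown, contents).ratio()
        if ratio > best_ratio:
            best_name, best_ratio = name, ratio
    return best_name, best_ratio
```

With upstream files cached, a lightly patched file will typically score very close to 1.0 against its original, which is exactly the "minimally modified" situation described above.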

It gets more difficult when you have a large chunk of proprietary code and you need to find the open source code in there, or when various open source files have been combined into a new one. For that scenario you are, right now, limited to the commercial scanners. You might be able to get somewhere by, for example, searching for function names, string literals or variable names and seeing if they are identical, but that assumes the code has not been obfuscated. If this is your primary use case, then indeed the commercial scanners are your best option right now.
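A rough sketch of that idea, assuming C-like source: pull out string literals and identifiers with a regular expression and compare the resulting token sets using Jaccard similarity. The regexes and any threshold you would apply are illustrative only; a real implementation would use a proper tokenizer:

```python
import re

def tokens(source):
    """Extract string literals and identifier-like tokens from
    C-like source code, as a set."""
    strings = set(re.findall(r'"(?:[^"\\]|\\.)*"', source))
    identifiers = set(re.findall(r'\b[A-Za-z_][A-Za-z0-9_]*\b', source))
    return strings | identifiers

def jaccard(a, b):
    """Similarity between two token sets: |intersection| / |union|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

A high overlap between the tokens of a proprietary blob and those of a known open source package is a hint worth investigating, but renamed identifiers or obfuscation will defeat this, which is exactly the limitation noted above.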

What about security?

The commercial code scanners offer more functionality than just code clone detection and can also connect to databases of security information. At the moment this is where open source and open data solutions are still lacking, but my prediction is that open databases with vulnerability information will soon pop up as well, as most of the proprietary databases are primarily built from open sources such as NVD, cross-referenced with open source code.

Wrapping up...

In my opinion, for most open source scanning use cases you can easily deploy open source tools and use a pure open source stack for the bulk of your scanning needs. There are still some use cases where a pure open source stack is not yet possible, but it is just a matter of time before this gap is closed as well.
