The problem with the current market of scanning tools
The current market of scanning tools is dominated by a few companies. These companies charge quite a bit of money for solutions with big databases. But when I see how their customers use these tools, I think that in many (but not all) cases they are simply wasting their money: they have the wrong tool for the job at hand, and a different tool would have been more appropriate.
What I see is that scanners are used for three use cases:
- license scanning
- finding out which files are from an open source project and used unchanged
- code clone detection ("snippets")
License scanning
For license scanning there is absolutely no need to be using the commercial license scanners. There are superior alternatives such as ScanCode and FOSSology, which are open source licensed, scale really well, and have a higher accuracy than most, if not all, commercial license scanners.
Finding out what open source files have been used
I absolutely do not understand why companies keep scanning code they have already seen countless times before. The Yaminabe project run by the Linux Foundation already showed that almost all open source code that is used is used unchanged, so there is no need to keep scanning it. In my own experience (for example with the OSADL license compliance audit) the open source components in a device are 99% to 100% unmodified.
Scanning code unnecessarily wastes time, money and resources. Instead, companies should scan once, aggressively cache the results in a database, and from then on simply do lookups (for example using SHA256 checksums) to see if a file is already known. Only files that are not yet known need to be scanned, after which their results are stored in the database as well. Be aware that I have seen EULAs that prevent caching the results data. Very often the "scan" would be reduced to a simple lookup in a database, which is much faster than running a full scan, and all you need to worry about is looking at the differences and modifications.
If companies shared a database of pre-scanned and vetted components with each other, prices for scanning would quickly fall to zero.
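To make the "scan once, then look up" idea concrete, here is a minimal sketch in Python. It assumes a local SQLite database as the cache; the table layout and the names open_cache, expensive_scan and scan_with_cache are made up for illustration, and expensive_scan is only a placeholder for a real scanner run.

```python
import hashlib
import sqlite3


def sha256_of(data: bytes) -> str:
    """Return the hex SHA256 checksum used as the cache key."""
    return hashlib.sha256(data).hexdigest()


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the results cache. The schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scan_results "
        "(sha256 TEXT PRIMARY KEY, license TEXT NOT NULL)"
    )
    return conn


def expensive_scan(data: bytes) -> str:
    """Hypothetical stand-in for a full license scanner run."""
    if b"GNU General Public License" in data:
        return "GPL-2.0-only"
    return "unknown"


def scan_with_cache(conn: sqlite3.Connection, data: bytes) -> tuple[str, bool]:
    """Return (license, was_cached); only scan files not seen before."""
    digest = sha256_of(data)
    row = conn.execute(
        "SELECT license FROM scan_results WHERE sha256 = ?", (digest,)
    ).fetchone()
    if row is not None:
        return row[0], True  # known file: the "scan" is just a lookup
    result = expensive_scan(data)
    conn.execute(
        "INSERT INTO scan_results (sha256, license) VALUES (?, ?)",
        (digest, result),
    )
    conn.commit()
    return result, False
```

Calling scan_with_cache twice on the same file contents runs the placeholder scanner only the first time; the second call is answered from the database.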
Code clone detection/Snippets
The difficult part of code scanning is code clone detection, or "detecting snippets" as most vendors tend to call it. This is where there is (currently) no good open source solution that scales.
However, by adding a little bit of information to the scanning process you can often make an educated guess about whether a file contains open source code. For example, if (nearly) all the files come directly from an open source package, and there are a few for which you cannot find an exact match but whose names are identical to files in that package, then you can be pretty sure that there is some open source in there, and you can easily work around that situation. As said earlier: in my experience open source code is minimally modified, if at all. For detecting similarities you could consider using TLSH or another proximity matching algorithm to find the closest match.
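The educated guess described above could be sketched roughly as follows. This is an illustration only: it uses difflib.SequenceMatcher from the Python standard library as a stand-in for TLSH or another proximity matching algorithm, and the function name classify, the labels and the 0.8 threshold are arbitrary assumptions.

```python
import difflib
import hashlib


def sha256_of(text: str) -> str:
    """Checksum used for exact matching against the known package."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def classify(files: dict[str, str], package: dict[str, str],
             threshold: float = 0.8) -> dict[str, str]:
    """Label each file as 'unmodified', 'modified' or 'unknown'.

    files and package both map a file name to its contents.
    The threshold is an arbitrary assumption, not a tuned value.
    """
    known_hashes = {sha256_of(content) for content in package.values()}
    labels = {}
    for name, content in files.items():
        if sha256_of(content) in known_hashes:
            # Exact match: the file is used unchanged.
            labels[name] = "unmodified"
        elif name in package:
            # Same name, different checksum: compare against the original.
            # SequenceMatcher is a stand-in for a proximity hash like TLSH.
            ratio = difflib.SequenceMatcher(None, package[name], content).ratio()
            labels[name] = "modified" if ratio >= threshold else "unknown"
        else:
            labels[name] = "unknown"
    return labels
```

Files labelled "modified" are the ones worth a closer look; for those a diff against the original file in the package shows exactly what changed.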
It gets more difficult when you have a large chunk of proprietary code and you need to find the open source code in there, or when various open source files have been combined into a new one. For that scenario you are, right now, limited to the commercial scanners. You might be able to get somewhere by, for example, searching for function names, string literals or variable names and seeing if they are identical, but that assumes that the code has not been obfuscated. If this is your primary use case, then indeed the commercial scanners are your best option right now.
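As a rough illustration of searching for identical function names, variable names and string literals, the sketch below extracts those from two pieces of C-like source with regular expressions and reports what they share. It is deliberately naive: the regular expressions, the keyword list and the names extract_tokens, shared_tokens and overlap are assumptions for this example, and it will not survive obfuscation or renaming.

```python
import re

# Identifiers of at least four characters; shorter ones are mostly noise.
IDENTIFIER = re.compile(r"\b[A-Za-z_][A-Za-z0-9_]{3,}\b")
# Double-quoted string literals, allowing backslash escapes.
STRING_LITERAL = re.compile(r'"((?:[^"\\]|\\.)*)"')
# A small, incomplete set of C keywords to ignore (an assumption).
KEYWORDS = {
    "void", "char", "return", "static", "const", "struct", "unsigned",
    "while", "else", "break", "case", "switch", "double", "float",
    "long", "short", "sizeof", "typedef", "continue", "default",
}


def extract_tokens(source: str) -> set[tuple[str, str]]:
    """Return ('str', literal) and ('id', identifier) pairs found in source."""
    literals = {("str", s) for s in STRING_LITERAL.findall(source) if s}
    # Blank out the literals so their words are not counted as identifiers.
    code_only = STRING_LITERAL.sub(" ", source)
    identifiers = {
        ("id", name)
        for name in IDENTIFIER.findall(code_only)
        if name not in KEYWORDS
    }
    return literals | identifiers


def shared_tokens(candidate: str, reference: str) -> set[tuple[str, str]]:
    """Tokens that occur in both the candidate and the reference."""
    return extract_tokens(candidate) & extract_tokens(reference)


def overlap(candidate: str, reference: str) -> float:
    """Fraction of the reference's tokens that also occur in the candidate."""
    ref = extract_tokens(reference)
    if not ref:
        return 0.0
    return len(extract_tokens(candidate) & ref) / len(ref)
```

A high overlap with a known open source file is only a hint that something was copied, not proof, so anything this flags still needs a manual look.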