Fuzzy hash matching, or proximity hashing, is a powerful method to find files that are close to the scanned file. But it is not a silver bullet. In this blog post I want to take a closer look at proximity matching: when it works and, especially, when it does not.
Cryptographic hashes
Most programmers are familiar with cryptographic hashes such as MD5, SHA256, and so on. These hashes are very useful for uniquely identifying files (except in the case of hash collisions, but those are extremely rare). These algorithms work by taking an input (the contents of a file) and computing a very long number. A slight change in the input leads to a drastically different number. This is why cryptographic hashes are great for uniquely identifying files, as the same input always leads to the same hash, but useless for comparing files: even slightly different inputs lead to completely unrelated hashes, so comparing the hashes of two different files tells you nothing about how similar the files are.
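This so-called avalanche effect is easy to see with Python's standard hashlib (a minimal sketch; the input strings are arbitrary):

```python
import hashlib

# Two inputs that differ in only the final character.
digest_a = hashlib.sha256(b"hello world").hexdigest()
digest_b = hashlib.sha256(b"hello worle").hexdigest()

print(digest_a)
print(digest_b)

# The digests look completely unrelated: comparing them reveals nothing
# about how similar the inputs were.
```

Roughly half of the output bits flip for a one-character change, which is exactly the property that makes these hashes useless for similarity comparisons.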
Locality sensitive hashes
A different class of hashing algorithms are the locality sensitive hashing algorithms. These algorithms have different properties, which allow hashes to be compared to each other to determine how close the inputs are. One algorithm that has become quite popular is Trend Micro's TLSH, which has several open source implementations. The most widely used one is the implementation by Trend Micro itself, which is available for C/C++ and also has Python bindings.

Comparing two files for their distance works as follows:
- compute the TLSH hash for file 1
- compute the TLSH hash for file 2
- compute the difference between the two hashes
The SHA256 shows that they are different:
$ sha256sum GPLv2-*
8177f97513213526df2cf6184d8ff986c675afb514d4e68a404010521b880643  GPLv2-1
ab15fd526bd8dd18a9e77ebc139656bf4d33e97fc7238cd11bf60e2b9b8666c6  GPLv2-2
The TLSH hashes can be computed and compared using a bit of Python (needs Python bindings for TLSH to be installed):
>>> import tlsh
>>> gpl2 = open('GPLv2-1', 'rb').read()
>>> gpl2alt = open('GPLv2-2', 'rb').read()
>>> gpl2tlsh = tlsh.hash(gpl2)
>>> gpl2alttlsh = tlsh.hash(gpl2alt)
>>> tlsh.diff(gpl2tlsh, gpl2alttlsh)
6

The computed difference is 6, which is low, meaning the files are very close. What TLSH does not tell you is where in the file the differences are; it only says how close the files are to each other, which can be very helpful when triaging files.
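TLSH's exact algorithm is out of scope here, but the general idea behind locality sensitive hashing can be illustrated with a deliberately simplified toy digest (this is not TLSH, just a sketch of the principle): summarise the input as quantized counts of local byte patterns, and define the distance between two digests as the difference between those counts. Similar inputs then get similar digests.

```python
from collections import Counter

def toy_digest(data, buckets=32):
    """Quantized histogram of overlapping byte pairs (a toy, not TLSH)."""
    counts = Counter((a * 131 + b) % buckets for a, b in zip(data, data[1:]))
    top = max(counts.values(), default=1)
    # Quantize each bucket count to 0..3 relative to the fullest bucket.
    return [min(3, 4 * counts.get(i, 0) // (top + 1)) for i in range(buckets)]

def toy_diff(d1, d2):
    """Distance between two digests; a small value means similar inputs."""
    return sum(abs(x - y) for x, y in zip(d1, d2))

original = b"abcdefgh" * 50
tweaked = b"abcdefgh" * 49 + b"abcdefgX"   # one byte changed
unrelated = b"zyxwvuts" * 50

print(toy_diff(toy_digest(original), toy_digest(tweaked)))    # small
print(toy_diff(toy_digest(original), toy_digest(unrelated)))  # much larger
```

A small local change only perturbs a few buckets, so the distance stays small; real algorithms like TLSH use the same idea with more carefully chosen features and scoring.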
When to use Locality Sensitive Hashing
Locality sensitive hashing can be quite useful in certain use cases. The original use case for Trend Micro was finding variations of already known viruses, to keep up with the flood of new computer viruses that is unleashed every day. Personally I have found it very useful when scanning source code to see how close files are to each other, as explored in the Yaminabe 2 project.
It is also useful when comparing executable files to each other, but only as long as they have been built in the exact same environment.
When not to use Locality Sensitive Hashing
There are also situations where it makes no sense at all to use locality sensitive hashing. I want to highlight three.

Executables
When comparing two executable ELF files that were built on different systems it makes no sense at all: in ELF executables the generated code takes up the majority of the file. Platform agnostic data, such as function names, variable names and string literals, is often only a tiny fraction of the executable file (less than 1%), while the generated code (the bulk of the file) differs vastly per hardware architecture. Even using different build flags will lead to significant differences in the generated code. So unless you know that the binaries are for the same architecture and (likely) generated in the same or a similar build environment, there is no benefit to using fuzzy matching for binaries.
Compressed files
Using locality sensitive hashing for compressed files makes no sense either: small differences in the uncompressed data lead to large differences in the compressed data, so the distance between two compressed files says very little about the distance between their uncompressed contents. To illustrate this I compressed the text files with the GPLv2 license text that I used earlier using gzip and then computed the TLSH difference between the two gzip compressed files (again, with some Python code):

>>> import tlsh
>>> gpl2 = open('GPLv2-1.gz', 'rb').read()
>>> gpl2alt = open('GPLv2-2.gz', 'rb').read()
>>> gpl2tlsh = tlsh.hash(gpl2)
>>> gpl2alttlsh = tlsh.hash(gpl2alt)
>>> tlsh.diff(gpl2tlsh, gpl2alttlsh)
203
The compressed files are quite far apart from each other (a TLSH distance of 200+ is considered a big difference), while the distance between the uncompressed files was very low. This means that compression destroys the locality that the hash comparison relies on.
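The effect is easy to reproduce without the license files. Below is a small stand-in experiment using Python's zlib (the same DEFLATE compression that gzip uses) on synthetic text; the input strings are arbitrary:

```python
import zlib

# Two equally long inputs that differ in a single byte near the start.
text_a = b"The quick brown fox jumps over the lazy dog. " * 50
text_b = b"Tha" + text_a[3:]
assert sum(x != y for x, y in zip(text_a, text_b)) == 1

comp_a = zlib.compress(text_a)
comp_b = zlib.compress(text_b)

# One changed input byte perturbs the back-references and Huffman codes
# that follow it, so the compressed streams typically diverge widely.
matching = sum(x == y for x, y in zip(comp_a, comp_b))
print(f"{matching} of {min(len(comp_a), len(comp_b))} compressed bytes identical")
```

The inputs differ in a single byte, yet the compressed outputs are no longer "nearby" in any byte-level sense, which is exactly why a locality sensitive hash computed over compressed data is meaningless.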