
Fuzzy hash matching

Fuzzy hash matching, or proximity hashing, is a powerful method to find files that are close to a scanned file. But it is not a silver bullet. In this blog post I want to take a closer look at proximity matching: when it works and, especially, when it does not.

Cryptographic hashes

Most programmers are familiar with cryptographic hashes such as MD5, SHA256, and so on. These hashes are very useful for uniquely identifying files (except in the case of hash collisions, but those are extremely rare). These algorithms work by taking an input (the contents of a file) and computing a very long number from it.

A slight change in the input leads to a drastically different number. This is why cryptographic hashes are great for uniquely identifying files (the same input always leads to the same hash), but useless for comparing files: different inputs lead to very different hashes, so comparing the hashes says nothing about how similar the inputs are.
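
To see this in action, here is a minimal sketch using Python's standard hashlib module (the example inputs are made up):

import hashlib

# Two inputs that differ in only the last byte
data1 = b'GNU GENERAL PUBLIC LICENSE'
data2 = b'GNU GENERAL PUBLIC LICENSe'

# The two digests will look completely unrelated,
# even though the inputs are almost identical
print(hashlib.sha256(data1).hexdigest())
print(hashlib.sha256(data2).hexdigest())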

Locality sensitive hashes

A different class of hashing algorithms are the locality sensitive hashing algorithms. These have different properties, which allow hashes to be compared to each other to determine how close the inputs are. One algorithm that has become quite popular is Trend Micro's TLSH, which has several open source implementations. The most widely used one is the implementation by Trend Micro itself, which is available for C/C++ and also has Python bindings.

Comparing two files for their distance works as follows:
  1. compute the TLSH hash for file 1
  2. compute the TLSH hash for file 2
  3. compute the difference between the two hashes
As a test I took two copies of the GPL 2 license, with minimal differences (spacing).

The SHA256 shows that they are different:

$ sha256sum GPLv2-*
8177f97513213526df2cf6184d8ff986c675afb514d4e68a404010521b880643  GPLv2-1
ab15fd526bd8dd18a9e77ebc139656bf4d33e97fc7238cd11bf60e2b9b8666c6  GPLv2-2

The TLSH hashes can be computed and compared using a bit of Python (needs Python bindings for TLSH to be installed):

>>> import tlsh
>>> gpl2 = open('GPLv2-1', 'rb').read()
>>> gpl2alt = open('GPLv2-2', 'rb').read()
>>> gpl2tlsh = tlsh.hash(gpl2)
>>> gpl2alttlsh = tlsh.hash(gpl2alt)
>>> tlsh.diff(gpl2tlsh, gpl2alttlsh)
6

The computed difference is 6, which is low, meaning the files are very close. What TLSH does not tell you is where in the file the differences are; it only says how close the files are to each other, which can be very helpful when triaging files.
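
As a sketch of such a triage step (the file names below are made up for illustration), one file can be compared against a set of already known files and the candidates ranked by distance:

import tlsh

# Hash the file that needs to be triaged
unknown = tlsh.hash(open('unknown-file', 'rb').read())

# Hash a few already known files
known = {name: tlsh.hash(open(name, 'rb').read())
         for name in ['known-file-1', 'known-file-2', 'known-file-3']}

# Rank the known files by distance to the unknown file: lower is closer
for name, known_hash in sorted(known.items(), key=lambda item: tlsh.diff(unknown, item[1])):
    print(name, tlsh.diff(unknown, known_hash))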

When to use Locality Sensitive Hashing

Locality sensitive hashing can be quite useful in certain use cases. The original use case for Trend Micro was to find variations of already known viruses, to keep up with the flood of new computer viruses that is unleashed every day.

Personally I found it very useful when scanning source code to see how close files are to each other, as explored in the Yaminabe 2 project.

It is also useful when comparing executable files to each other, but only as long as they have been built in the exact same environment.

When not to use Locality Sensitive Hashing

There are also situations where it makes no sense at all to use locality sensitive hashing. I want to highlight three.

Executables

When comparing two executable ELF files that were built on different systems it makes no sense at all: in ELF executables the generated code takes up the majority of the file. Platform agnostic data, such as function names, variable names and string literals, is often a very tiny fraction of the executable file (less than 1%), while the generated code (the bulk of the file) differs vastly per hardware architecture.

Even using different build flags will lead to significant differences in generated code. So unless you know that the binaries are for the same architecture and were (likely) generated in the same or a similar build environment, there is no benefit to using fuzzy matching for binaries.

Compressed files

Using locality sensitive hashing for compressed files makes no sense either: differences in the uncompressed data lead to differences in the compressed data, but the distance between two compressed files says very little about how close the uncompressed contents are. To illustrate this I compressed the text files with the GPLv2 license text that I used earlier with gzip and then computed the TLSH difference between the two gzip compressed files (again, with some Python code):

>>> import tlsh
>>> gpl2 = open('GPLv2-1.gz', 'rb').read()
>>> gpl2alt = open('GPLv2-2.gz', 'rb').read()
>>> gpl2tlsh = tlsh.hash(gpl2)
>>> gpl2alttlsh = tlsh.hash(gpl2alt)
>>> tlsh.diff(gpl2tlsh, gpl2alttlsh)
203

The compressed files are quite far apart from each other (200+ is considered a big difference in TLSH), while the uncompressed distance was very low. This means that compression has a big impact on the score.
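
The way around this is of course to decompress the data first and compare the uncompressed contents. Reusing the same compressed files, a small sketch with Python's standard gzip module gives back the low distance measured earlier:

>>> import gzip
>>> import tlsh
>>> gpl2 = gzip.open('GPLv2-1.gz', 'rb').read()
>>> gpl2alt = gzip.open('GPLv2-2.gz', 'rb').read()
>>> tlsh.diff(tlsh.hash(gpl2), tlsh.hash(gpl2alt))
6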

Graphics files and PDF

Similarly, for graphics files and PDF files it makes little sense, for a few reasons: first of all, many formats (PNG, JPEG, and so on) use some form of compression for the data. Second, there is metadata (like EXIF or XMP) that could make it look like files are a lot closer than they actually are.

Wrapping up

Fuzzy hashes such as TLSH can be very useful, but only in a limited number of cases, where you already know enough about the domain. Randomly comparing hashes of files to other hashes will very likely be a waste of resources and computing power, and will not lead to the desired results.
