Detecting duplicate files

Posted on Nov 09, 2021, 1 minute to read.

In this quick note, imagine you have a bunch of files and want to find out which ones are identical.

First, there is a simple observation: if the files have different sizes, they are clearly not identical.
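This size check can be sketched as a first pass that groups paths by file size and keeps only the groups that could contain duplicates. The `group_by_size` helper below is a hypothetical name, not from the original post:

```python
import os
from collections import defaultdict

def group_by_size(paths):
    """Group file paths by size; only groups with more than one
    file can possibly contain duplicates."""
    groups = defaultdict(list)
    for path in paths:
        groups[os.path.getsize(path)].append(path)
    # Drop sizes that occur only once -- those files are unique.
    return {size: files for size, files in groups.items() if len(files) > 1}
```

Since `os.path.getsize` only reads metadata, this pass is cheap even for large files.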

Now let’s suppose the sizes are the same. We can compute a hash of each file and compare the digests. A classic example of computing a SHA-512 digest in Python:

import hashlib

def hashfile(path, blocksize=65536):
    with open(path, 'rb') as afile:
        hasher = hashlib.sha512()
        buf = afile.read(blocksize)
        while len(buf) > 0:
            hasher.update(buf)
            buf = afile.read(blocksize)
        return hasher.hexdigest()
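Putting the two ideas together, a duplicate finder might first group by size and then hash only the files that share a size. This is a sketch under my own naming (`find_duplicates` is hypothetical, not from the post):

```python
import hashlib
import os
from collections import defaultdict

def hashfile(path, blocksize=65536):
    """Compute the SHA-512 hex digest of a file, reading in chunks."""
    with open(path, 'rb') as afile:
        hasher = hashlib.sha512()
        buf = afile.read(blocksize)
        while buf:
            hasher.update(buf)
            buf = afile.read(blocksize)
        return hasher.hexdigest()

def find_duplicates(paths):
    """Return groups of paths whose contents hash identically.
    Files are grouped by size first, so hashing is skipped for
    files whose size is unique."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)
    duplicates = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size means a unique file
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[hashfile(path)].append(path)
        duplicates.extend(group for group in by_hash.values() if len(group) > 1)
    return duplicates
```

The size pre-filter matters in practice: hashing reads every byte of a file, while checking its size does not.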

Finally, depending on your use case, you might still want to compare the files byte by byte, or accept that the probability of a collision is low enough and treat files with matching hashes as identical. The benefit of the hash-only approach is that you don’t have to keep the files’ contents around to compare them.
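If you do want the byte-by-byte check, Python’s standard library already provides one in `filecmp`; a minimal sketch:

```python
import filecmp

def files_identical(path_a, path_b):
    """Compare two files byte by byte.
    shallow=False forces filecmp to read the contents instead of
    just comparing os.stat() signatures."""
    return filecmp.cmp(path_a, path_b, shallow=False)
```

This is a reasonable final confirmation step after the hashes match, since it only runs on the handful of candidate pairs rather than the whole file set.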
