Detecting duplicate files
Posted on Nov 09, 2021, 1 minute to read.
A quick note: imagine you have a bunch of files and you want to find out which ones are identical.
First, there is a simple observation: if the files have different sizes, they are clearly not identical.
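As a rough sketch of this first filter, the files can be grouped by size so that only groups with more than one member need any further checking. The helper name `group_by_size` is made up for illustration, and the code assumes you already have a list of file paths.

```python
import os
from collections import defaultdict


def group_by_size(paths):
    """Return {size_in_bytes: [paths]} for sizes shared by 2+ files."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)
    # A file with a unique size cannot have an identical twin, so drop it.
    return {size: group for size, group in by_size.items() if len(group) > 1}
```

Anything that does not appear in the returned dictionary can be ruled out without reading a single byte of its content.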
Now let’s suppose the sizes are the same. We can compute a hash of each file and compare the digests. SHA-512, available in Python through the standard hashlib module, is a classic choice.
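A minimal sketch of hashing a file with SHA-512 using Python's standard `hashlib` module might look like the following. The function name `sha512_of_file` and the 1 MiB chunk size are illustrative choices; reading in chunks is assumed here so that large files do not have to fit in memory.

```python
import hashlib


def sha512_of_file(path, chunk_size=1 << 20):
    """Return the SHA-512 hex digest of the file at `path`."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns b"" at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files of equal size are then candidates for being identical when `sha512_of_file(a) == sha512_of_file(b)`.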
Finally, depending on your use case, you might still want to compare the files byte by byte. Alternatively, you can accept that the likelihood of a collision is low enough and consider the files identical whenever their hashes match. The benefit of the hash-only approach is that you only have to keep the digests around, not all the data.
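If you do want the byte-by-byte check, one possible way is the standard library's `filecmp.cmp` with `shallow=False`, which compares the actual contents rather than only the `os.stat()` signatures. The wrapper name `files_identical` is hypothetical and only there to make the intent explicit.

```python
import filecmp


def files_identical(path_a, path_b):
    """Return True if both files have exactly the same content."""
    # shallow=False forces a content comparison instead of
    # trusting size and modification time alone.
    return filecmp.cmp(path_a, path_b, shallow=False)
```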