Sometimes it’s not a human, not AI, but math that can best describe a file.
Mathematics gives us hash functions: wonderful algorithms that condense data of any size into a short, fixed-size value. We can apply them to files and get a compact fingerprint of their contents. This fingerprint will always be the same size, will be identical if we give it the same file, and will (almost) always be different for non-identical files. Almost, because there is a non-zero probability of a collision occurring, but with a well-chosen hash function it's small enough to accept the risk.
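To make those three properties concrete, here is a minimal sketch in Python. It assumes SHA-256 from the standard `hashlib` module as the hash function; the text above does not name a specific algorithm, and the helper name `digest` is made up for illustration.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 hash of data as a hex string (assumed algorithm)."""
    return hashlib.sha256(data).hexdigest()


# Same size: one byte and one million bytes both give 64 hex characters.
tiny = digest(b"a")
large = digest(b"a" * 1_000_000)
print(len(tiny), len(large))  # 64 64

# Identical input gives an identical hash.
print(digest(b"hello") == digest(b"hello"))  # True

# A one-character change gives a different hash.
print(digest(b"hello") == digest(b"hellp"))  # False
```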
This means easy detection of duplicates and (almost) guaranteed uniqueness of records. Therefore, a hash is often the best way to implement a primary key.
Instead of comparing the whole file, you can calculate its hash and only compare that.
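As a rough illustration of comparing hashes instead of whole files, the sketch below groups files by their hash to find duplicates. SHA-256 is again an assumption, and `file_digest` and `find_duplicates` are hypothetical helper names, not part of any library.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never need to fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(paths):
    """Group paths by content hash and return the groups with 2+ files."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_digest(path)].append(path)
    return [group for group in groups.values() if len(group) > 1]
```

Calling `find_duplicates(Path("some_dir").glob("*"))` on a directory of regular files would return lists of files with identical contents. Each file is read once, and only the short hashes are compared.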
Examples
- Git – the main identifier of a commit is precisely the hash.
- Docker – images are identified using a hash.
- Signatures – digital signatures are computed over a hash of the data, which is short enough to process with asymmetric cryptography.
- Blockchain – hashes have many uses here, including linking each block to the previous one to ensure integrity.
- Build cache – calculating a hash is faster than recompiling, so hashes are used to look up previously built artifacts.
All of these systems use hashes in some way. Some call it a hash directly, others use a different name such as digest, but they all refer to a shortened representation of the data.
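The build cache case is simple enough to sketch. The toy class below is not how any particular build tool works; `BuildCache` and its `build` callback are hypothetical names, and SHA-256 is an assumed choice. It only shows the idea of using the hash of the input as the lookup key.

```python
import hashlib


class BuildCache:
    """Toy in-memory cache keyed by the hash of the source (illustrative only)."""

    def __init__(self, build):
        self._build = build      # function: source bytes -> artifact bytes
        self._artifacts = {}     # hash of source -> built artifact
        self.builds_run = 0      # how many times we actually had to build

    def get(self, source: bytes) -> bytes:
        key = hashlib.sha256(source).hexdigest()
        if key not in self._artifacts:
            # Unknown hash: build once and remember the result.
            self._artifacts[key] = self._build(source)
            self.builds_run += 1
        return self._artifacts[key]


# Stand-in "compiler" that just upper-cases its input.
cache = BuildCache(lambda src: src.upper())
cache.get(b"int main() {}")
cache.get(b"int main() {}")   # same content, served from the cache
print(cache.builds_run)       # 1
```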
Simple as that. I think there is something beautiful in high-entropy data, in how numbers can represent almost anything in computer science.
