Hashing Part 1: Not just for potatoes!
Hash functions are an often-overlooked staple of good software. Hidden deep within code that you rely on every day are these mathematical gems that tell us what to trust and what to investigate.
The need for hash functions come up pretty often in software, lending themselves to a variety of uses like cryptography, speed, and reliability.
In this 2-part article, we’ll talk about how hash functions can be used in all sorts of ways.
The basic definition of a hash function is straightforward: given some data, calculate a number. The data can be of any size, from a few characters (like a password) to gigabytes and gigabytes of data. Because these algorithms can swallow so much data, they’re often called “digests”.
A good hash function creates a number that is “very unique”. That is, for two similar data sets, two different numbers are generated. The idea is that if you get two pieces of data with the same hash, you are very sure (but not completely sure) that those two pieces of data are exactly the same.
Some Real Examples
As one might be able to tell from the Wikipedia article, there are tons of hashing algorithms out there. Perhaps the most commonly known hash algorithms are MD5 and SHA256. In both cases of MD5 and SHA256, the number outputted is extremely large.
For example, the MD5 hash of the phrase “Hello, World!” is 6cd3556deb0da54bca060b4c39479839. For those keeping score at home, that’s a base 16 number with 32 digits! (Or should they be called hexits?) That’s somewhere around 3 * 10^38 power (see this link for the exact number).
Is it possible to two pieces of data that compute to the same hash? Sure – it’s called a “collision”. Collisions are astronomically unlikely (well… unless you’ve cracked the function – more on that in Part 2).
Hashing for Speed
So what does all this have to do with speed? Let’s start with a common programming problem: comparing two pieces of data. Suppose we have:
- 100 very large movie files, and
- we want to know if any two of those files are exactly the same file.
No matter what, you will have make lots of comparisons to complete this problem. But, comparing two files byte-by-byte is rather slow. Sure, if the two files are different by only one byte, you could stop the comparison short. But what if lots of the files are the same? That would be quite slow.
What if we could reduce the speed of a comparison down to something negligble… like instead of comparing gigabytes upon gigabytes of data, we just compare two 32-digit (4-byte) numbers. Enter the hash function! Just run each file through your hash function, storing all of those numbers (also called “hashes”), and compare those hashes. If the hashes are different, you know the files are different. If the hashes are the same, then then you almost positive that the two files are the same (e.g. for Md5, there’s a 1 in 1,00000 chance that you’re wrong).
That reduces your comparison time way down. Sure, you have to read through the file byte-by-byte once to calculate the hash, but after that, you can just use the hash for all your comparisons.
One common example of a system that does this exact kind of speedup is the Git version control system. See? I’m not making this up!
Stay tuned!
Now hashing can be much more useful beyond speedy file comparisons. In our next article, we’ll talk about how hashing is used in computer security – stay tuned!
3 Responses to 'Hashing Part 1: Not just for potatoes!'
Leave a Reply
You must be logged in to post a comment.


I think I need the remedial hash tag course…. What does a hash tag mean in twitter?
dmeneely
31 Aug 10 at 10:09 am
Hash tags in Twitter are different from hash functions. No relation. Hash tags in tweet are for creating trending topics, so it’s a way to classify your tweet. They’re called hash tags because they begin with the “hash sign” or #
andy
31 Aug 10 at 2:00 pm
I like hash browns, too. Especially from Waffle House for some reason. I think they should call it “Hash Brown House” since most people I know go there for the Hash Browns and not the Waffles.
programsam
31 Aug 10 at 4:31 pm