In CS 15, we recently learned about a way of storing data called *hash tables*. Weāre now working on an assignment that uses themĀ āĀ a search tool for finding a word across thousands of books. I wanted to make an attempt at explaining how hash tables work, without assuming computer science knowledge.

So in programming languages, we have a concept of a list of things. Letās say Iām trying to store a bunch of users for my app. I could store them all in a list, where each item in the list is a user (maybe with their ID number, email address, birthday, etc).

Letās say I want to retrieve a user, and I know their ID number ā I want user #27. Iād go through the list one-by-one, checking whether each user in the list had ID #27.

But this isnāt very efficient. I have to go through the entire list one-by-one, checking each one. If Iām storing a large number of users, it isnāt efficient to check every single user.

So what if I stored the users in my list in sequential order according to their ID number? The user with ID #1 would be first in my list, the user with ID #2 second, and the user with ID #27 would be twenty-seventh in my list.

Then, if I knew I wanted to retrieve the information belonging to user #27, Iād be able to jump straight to the twenty-seventh spot in my list. Thatās a lot faster than going through every user one-by-one and checking for a match! I can immediately jump straight to the user whose information I want.

But what if I wanted to file each userās information according to their email address, instead of their user ID? After all, I probably have a userās email address but not their user ID.

This presents a problem: I canāt store `[email protected]`

ās information in the `[email protected]`

th spot in a list.

Enter: *hash functions*. Hash functions are a type of mathematical function that takes any phrase and converts it into a number. Theyāre very complicated to invent, but people make these hash functions and the rest of us can use them.

The most important part of a hash function is that it **spits out the same number every time we give it the same phrase**. So if I give `[email protected]`

to my hash function and that functions spits out `7673492023`

, then `[email protected]`

should *always* cause the hash function to spit out `7673492023`

. Same input, same output.

So now, if I want to store `[email protected]`

ās data, I store it in the `7673492023`

th spot in my array. And if I want to retrieve his user data, I give ā`[email protected]`

ā to the same hash function, get `7673492023`

again, and know where to instantly find his user data.

Perfect! One problem though: the hash function spits out large numbers (like `7673492023`

), and having a list that long would be wasteful since most of the spots would lie empty.

So instead of using `7673492023`

, we set a reasonable size for our list (letās say 100) and use the mathematical *modulo* operation. Modulo gives us the remainder when one number is divided by another (so any number modulo 100 is between 0 and 99, since the remainder canāt be 100).

`7673492023`

modulo 100 gives us `23`

, so we store the user data in the 23rd spot in our list instead ā and we end up using less space to store the whole list, since it isnāt ridiculously long (only 100 spots long).

Thereās bound to be *collisions* though. What if the hash function on `[email protected]`

spits out `7673492023`

, and the hash function on `[email protected]`

spits out `1300287523`

? Two different numbers, but since our list is only 100 long, both numbers modulo by 100 give us `23`

. Weāll be trying to store two things in spot `23`

.

The solution: make every spot in the list another list! So at spot `23`

we store a sub-list that contains `[email protected]`

ās data and `[email protected]`

ās data. When we want to find either oneās data, we run their email address through the hash function, modulo by 100, jump to spot `23`

, and then go through the sub-list at spot `23`

to find the correct userās data.

However, thereās a formula for calculating whether your list is too short for the data it contains and more spots need to be added (in this example, expanding above 100 spots).

If you expand your list as necessary, youāll have minimal collisions, which will keep each spot in the list containing only ~1 userās data ā therefore keeping things efficient, since you can jump straight to a spot in the list and not have a long sub-list to go through at that spot.