Bloom filter

Have you ever wondered how a very popular portal, when you try to register with a made-up username, can tell you in a split second that this username is already taken and you should try another one? I’ll be honest, I didn’t either, but coincidentally, while reading Designing Data-Intensive Applications by Martin Kleppmann, I probably learned this secret. It is the Bloom filter. Or maybe not. Actually, I don’t know whether any companies use Bloom filters to verify their users’ usernames, but I think this example is easy to understand, so I will stick with it.

As Wikipedia says, a Bloom filter is a probabilistic data structure that allows us to check whether an element belongs to a set. An interesting aspect of this structure is what it can tell us: it can verify that an element probably belongs to the set (the username is already taken) or that it definitely does not (you can use the username because it is free). Sounds interesting? It gets more interesting.

The principle of operation is simple: first, we have an array of bits, all initialized to 0. Next, we have a few hash functions. When we add an element to the set, we calculate its hash with every hash function we have. Each result is an index into our array, where we change the value from 0 to 1 (this is repeated for every function). Now, if we want to check whether an element belongs to the set (whether a username has already been taken), we calculate the hashes again and check whether the corresponding indexes in our array hold 0 or 1. If any index holds 0, we have 100% confidence that the element does not belong to the set. But if all indexes are set to 1, there is a high probability that the element belongs to the set, though we don’t have 100% confidence.
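The steps above can be sketched in a few lines of Java. This is a minimal illustration, not a production implementation: the two "hash functions" are toy stand-ins derived from `hashCode()`, and the class name is made up for this example.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: one bit array, two toy hash functions.
public class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;

    public SimpleBloomFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Toy hash #1: map the value to an index in the bit array.
    private int hash1(String value) {
        return Math.floorMod(value.hashCode(), size);
    }

    // Toy hash #2: a differently-mixed variant; a real implementation
    // would use independent hash functions (e.g. MurmurHash).
    private int hash2(String value) {
        return Math.floorMod(value.hashCode() * 31 + 17, size);
    }

    // Adding an element: set the bit at every hash index to 1.
    public void add(String value) {
        bits.set(hash1(value));
        bits.set(hash2(value));
    }

    // Membership test: if any bit is 0, the element is definitely absent;
    // if all bits are 1, it is *probably* present (false positives possible).
    public boolean mightContain(String value) {
        return bits.get(hash1(value)) && bits.get(hash2(value));
    }
}
```

Note that `mightContain` can never return a false negative: once an element's bits are set, they stay set, so anything that was added will always be reported as (probably) present.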

As you can see, the idea of the Bloom filter, as is often the case, is simple.

What are the pros of the Bloom filter?

As you can see above, the idea behind this data structure is simple.

We operate on an array, where operations like inserting, updating, and reading values are fast (O(1)). One drawback here is the need to call a few hash functions, which is why they must be fast.

Because we store only a few bits of information instead of a full object, we can store information about millions (or even billions) of elements using relatively little memory (which is cheap these days).

Ok, that sounds good, but what about the cons?

Sometimes, we might get a response that an element belongs to the set even though it does not. This happens because hash functions can produce the same results for two different objects. Sometimes this could be a problem, but popular implementations offer ways to keep this inconvenience within acceptable bounds.

One of the significant drawbacks of the Bloom filter is its inability to remove elements once they are added. This limitation arises because the bits in the array are shared among multiple elements. Resetting a bit to 0 to remove an element could inadvertently affect other elements, leading to incorrect results. This means that information about an element cannot be deleted without potentially corrupting the integrity of the filter (this problem is addressed by the Counting Bloom filter, an improved version of the original Bloom filter data structure).
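The Counting Bloom filter idea can be sketched as follows: each position holds a small counter instead of a single bit, so adding increments and removing decrements. This is again an illustrative toy (same stand-in hash functions as before), not a tuned implementation, and removal is only safe for elements that were actually added:

```java
// Counting Bloom filter sketch: counters instead of bits,
// so elements can be removed by decrementing.
public class CountingBloomFilter {
    private final int[] counters;
    private final int size;

    public CountingBloomFilter(int size) {
        this.size = size;
        this.counters = new int[size];
    }

    // Two toy hash indexes for a value (placeholders for real hashes).
    private int[] indexes(String value) {
        int h1 = Math.floorMod(value.hashCode(), size);
        int h2 = Math.floorMod(value.hashCode() * 31 + 17, size);
        return new int[] { h1, h2 };
    }

    public void add(String value) {
        for (int i : indexes(value)) counters[i]++;
    }

    // Removal becomes possible: decrement instead of clearing a shared bit.
    // Only call this for elements that were previously added.
    public void remove(String value) {
        for (int i : indexes(value)) {
            if (counters[i] > 0) counters[i]--;
        }
    }

    // Membership test: any zero counter means "definitely absent".
    public boolean mightContain(String value) {
        for (int i : indexes(value)) {
            if (counters[i] == 0) return false;
        }
        return true;
    }
}
```

The price of this flexibility is memory: each position now needs several bits for a counter instead of one.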

The quality of your Bloom filter depends on the quality of your hash functions. They should be fast, deterministic, independent of each other, and uniformly distribute the hashed values across the entire bit array, ensuring that each bit is set with equal probability. This uniformity minimizes collisions and enhances the efficiency of the Bloom filter.
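In practice, you rarely implement k truly independent hash functions. A common trick (known as double hashing, after Kirsch and Mitzenmacher) derives k indexes from just two base hashes: g_i(x) = h1(x) + i·h2(x) mod m. A sketch, again with toy base hashes standing in for real ones:

```java
// Deriving k hash indexes from two base hashes (double hashing):
//   g_i(x) = h1(x) + i * h2(x)  (mod m),  for i = 0 .. k-1.
public class DoubleHashing {
    public static int[] indexes(String value, int k, int m) {
        int h1 = value.hashCode();
        // A second, differently-mixed hash; a real implementation would
        // use an independent function such as MurmurHash. The "| 1" keeps
        // the step odd, which avoids short cycles when m is a power of two.
        int h2 = Integer.reverse(value.hashCode()) | 1;
        int[] result = new int[k];
        for (int i = 0; i < k; i++) {
            result[i] = Math.floorMod(h1 + i * h2, m);
        }
        return result;
    }
}
```

This keeps the filter fast (two hash computations instead of k) while preserving the statistical behavior well enough for most uses.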

The Bloom filter is not new (the concept dates back to the 1970s), so there are ready-to-use implementations. Because I am from the JVM world, the implementation from the Google Guava library is closest to my heart. It is simple to use, but you have to properly configure the expected insertions, which is the number of elements you want to store in your filter, and the desired false positive probability, expressed as a value between 0 and 1 (e.g. 0.01 for 1%).
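With Guava on the classpath, the usage looks roughly like this (the username scenario is carried over from the introduction; the class name is made up for the example):

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class UsernameFilter {
    public static void main(String[] args) {
        // 1,000,000 expected insertions, 1% desired false positive probability.
        BloomFilter<CharSequence> taken = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8),
                1_000_000,
                0.01);

        taken.put("alice");

        System.out.println(taken.mightContain("alice"));   // true: probably taken
        System.out.println(taken.mightContain("new-name")); // if false, definitely free
    }
}
```

Guava sizes the bit array and picks the number of hash functions for you from those two parameters, which is exactly the trade-off discussed above.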