Bloom Filter 101
What data structure would you use to determine whether an entity is in a given dataset with speed? Many would answer “Hashtable” without hesitation. Hashtable performs well when accessing an entity among a group of entities: the average time complexity of its read operations is constant. Nevertheless, using Hashtables can be rather costly when dealing with large datasets. Hashtables need to keep track of every entity in the dataset, resulting in a linear space complexity.
Is there a better alternative to Hashtable? You might want to ask. The answer is yes, on the condition that you are willing to relinquish accuracy. In this article, I will introduce Bloom Filter, an alternative to Hashtable for checking if an entity belongs to a dataset with better space complexities at the cost of lower accuracy.
What is Bloom Filter
Bloom Filter is a space-efficient probabilistic data structure for testing whether an item belongs to a set. Similar to Hashtable, Bloom Filter also makes use of hashing. However, unlike Hashtable, Bloom Filter does not store the items added to it. Instead, it calculates a hash value for the added item and marks the value as present in the set. Therefore, it does not require as much space as Hashtable, resulting in a superior space complexity.
However, we are making a tradeoff between space efficiency and accuracy here. By not storing the items in Bloom Filter, we also lose the ability to tell with certainty if an item belongs to a set, something we take for granted when using Hashtable. When we want to know if an item exists in a Bloom Filter, we calculate the hash value of the item and check if the hash value is marked as present in the Bloom Filter. We have to worry about false positives but not false negatives. In other words, when a Bloom Filter claims an item does not belong to a set, the accuracy of the claim is guaranteed to be 100%. Nevertheless, the opposite claim made by a Bloom Filter is not always correct.
When to use Bloom Filter
As mentioned above, Bloom filter is an ideal alternative to Hashtable when you value space efficiency over accuracy (specifically, the absence of false positives). A malicious URL detector is an example. Let’s say that you have collected millions of malicious URLs. If you built a malicious URL detector with a Hashtable, you would have to dump all of the malicious URLs in the Hashtable. The lookup would be slow, and the URLs would take up a lot of space. This detector does have a high level of accuracy. However, it doesn’t have to be accurate because the point of the malicious URL detector is to protect users from malicious URLs. As long as the detector does not mistake malicious URLs for safe URLs, which are the URLs users end up opening, it will have fulfilled its duties.
In comparison, building the malicious URL detector with Bloom Filter will lead to better performance in terms of both space complexities. Also, Bloom Filter can still make sure all the URLs marked as safe are always safe to open, protecting the users from malicious URLs. The only downside is that the malicious URL detector is now plagued by false positives, meaning the URLs marked as malicious might be safe. This downside is outweighed by the benefit of requiring less space and faster detection speed.
We can still use Bloom Filter even in cases where eliminating false positives is critical. However, we need to add a crucial step after Bloom Filter returns a positive result. To make sure the positive is not a false positive (a negative in disguise), we refer back to the original data source (database, etc.) and verify whether the positive result is correct. We don’t need to do the same for negative results since all negative results are guaranteed to be correct.
In this article, I have covered what Bloom Filter is, how it works, and when to use it. Bloom Filter is an alternative to Hashtable, but they both have their respective advantages and weaknesses. Please keep this in mind the next time you decide which one to use for building your feature.