Sunday, November 07, 2004

Bloom Filters

This post is about this link

The first think for me would be to implement a quick exists(txt) in a given huge collection of texts. rom there I can start searching.
Maybe the vector could be a "hash table" with indexes in the collection as values and provided that the hash functions, between themselves have very low collision ratio, the bits in the vector are now pointing to index in the collection instead of seen/not-seen and we could have a very fast search.
Also would make a nice implementation of "e-mail and beyond" black-list/white-list.
The only question is how do you come up with k different hashing mechanisms yet in a limited space like (1..m) and what is the best k,m to use for a given lenght of your collection?. I guess the answer is God bless the mathematicians.

UpdateNow that I think a little more about, this is all blabla (hash table), since m would probably need to be as big as the collection length, or somewhere near to allow for overlaping not to happend, and so cannot be fixed from the start, wich I beleve was the hole point of this, (where there are k different hashes to minimize the impact of a given relatively small m).


Post a Comment

<< Home