Groups > comp.lang.python > #6966 > unrolled thread

Bloom Filter in 22 lines of Python (updated)

Started by	Raymond Hettinger <python@rcn.com>
First post	2011-06-03 13:17 -0700
Last post	2011-06-08 13:48 -0700
Articles	4 — 2 participants

Back to article view | Back to comp.lang.python

  Bloom Filter in 22 lines of Python (updated) Raymond Hettinger <python@rcn.com> - 2011-06-03 13:17 -0700
    Re: Bloom Filter in 22 lines of Python (updated) geremy condra <debatem1@gmail.com> - 2011-06-06 10:47 -0700
      Re: Bloom Filter in 22 lines of Python (updated) Raymond Hettinger <python@rcn.com> - 2011-06-08 13:05 -0700
        Re: Bloom Filter in 22 lines of Python (updated) geremy condra <debatem1@gmail.com> - 2011-06-08 13:48 -0700

#6966 — Bloom Filter in 22 lines of Python (updated)

From	Raymond Hettinger <python@rcn.com>
Date	2011-06-03 13:17 -0700
Subject	Bloom Filter in 22 lines of Python (updated)
Message-ID	<5a03279e-993e-4a35-9220-da5014d68430@34g2000pru.googlegroups.com>

Thanks for all the feedback on the earlier post.

I've updated the recipe to use a cleaner API, simpler code,
more easily subclassable, and with optional optimizations
for better cache utilization and speed:

 http://code.activestate.com/recipes/577684-bloom-filter/


Raymond

----------------------
follow my python tips and recipes on twitter: @raymondh

[toc] | [next] | [standalone]

#7102

From	geremy condra <debatem1@gmail.com>
Date	2011-06-06 10:47 -0700
Message-ID	<mailman.2499.1307382457.9059.python-list@python.org>
In reply to	#6966

On Fri, Jun 3, 2011 at 1:17 PM, Raymond Hettinger <python@rcn.com> wrote:
> Thanks for all the feedback on the earlier post.
>
> I've updated the recipe to use a cleaner API, simpler code,
> more easily subclassable, and with optional optimizations
> for better cache utilization and speed:
>
>  http://code.activestate.com/recipes/577684-bloom-filter/

Any chance of this landing in collections?

Geremy Condra

[toc] | [prev] | [next] | [standalone]

#7252

From	Raymond Hettinger <python@rcn.com>
Date	2011-06-08 13:05 -0700
Message-ID	<c5671628-1464-47aa-ac26-9715bdbe94e7@s41g2000prb.googlegroups.com>
In reply to	#7102

On Jun 6, 10:47 am, geremy condra <debat...@gmail.com> wrote:
> On Fri, Jun 3, 2011 at 1:17 PM, Raymond Hettinger <pyt...@rcn.com> wrote:
> > Thanks for all the feedback on the earlier post.
>
> > I've updated the recipe to use a cleaner API, simpler code,
> > more easily subclassable, and with optional optimizations
> > for better cache utilization and speed:
>
> >  http://code.activestate.com/recipes/577684-bloom-filter/
>
> Any chance of this landing in collections?

I would be happy to add something like this to collections
once the API matures.

Right now, the constructor expects the user to know the desired
memory size and number of probes.  Those could be computed
from the maximum number of unique keys and the tolerable
error false positive rate.

The most convenient constructor could be:
    b = BloomFilter(myseq)
where len(myseq) is presumed indicate the maximum number of
unique entries and there is a default tolerable false positive
rate of 1 in 10,000.

The API already lets the user substitute different methods
for get_probes().  It should also allow the bytearray to be
easily replaced by other objects with indexable access
(array objects, mmap objects, etc).

We could also provide union, intersection, and difference
operations but these would be a little awkward for general
purpose use because the result is only meaningful when
the sizes, probe functions, and input domain are all the same.

Fortunately, there is a lot of prior art, so the API design
can be informed by what has worked well for others.

Raymond

[toc] | [prev] | [next] | [standalone]

#7260

From	geremy condra <debatem1@gmail.com>
Date	2011-06-08 13:48 -0700
Message-ID	<mailman.35.1307566137.11593.python-list@python.org>
In reply to	#7252

On Wed, Jun 8, 2011 at 1:05 PM, Raymond Hettinger <python@rcn.com> wrote:
> On Jun 6, 10:47 am, geremy condra <debat...@gmail.com> wrote:
>> On Fri, Jun 3, 2011 at 1:17 PM, Raymond Hettinger <pyt...@rcn.com> wrote:
>> > Thanks for all the feedback on the earlier post.
>>
>> > I've updated the recipe to use a cleaner API, simpler code,
>> > more easily subclassable, and with optional optimizations
>> > for better cache utilization and speed:
>>
>> >  http://code.activestate.com/recipes/577684-bloom-filter/
>>
>> Any chance of this landing in collections?
>
> I would be happy to add something like this to collections
> once the API matures.
>
> Right now, the constructor expects the user to know the desired
> memory size and number of probes.  Those could be computed
> from the maximum number of unique keys and the tolerable
> error false positive rate.
>
> The most convenient constructor could be:
>    b = BloomFilter(myseq)
> where len(myseq) is presumed indicate the maximum number of
> unique entries and there is a default tolerable false positive
> rate of 1 in 10,000.
>
> The API already lets the user substitute different methods
> for get_probes().  It should also allow the bytearray to be
> easily replaced by other objects with indexable access
> (array objects, mmap objects, etc).
>
> We could also provide union, intersection, and difference
> operations but these would be a little awkward for general
> purpose use because the result is only meaningful when
> the sizes, probe functions, and input domain are all the same.
>
> Fortunately, there is a lot of prior art, so the API design
> can be informed by what has worked well for others.

Seems like the code that Dan Stromberg posted addresses a lot of this.
Thoughts on his code?

Also, if you're interested I'm about halfway through writing a C
implementation, although it would have to be somewhat less flexible to
get a real speed advantage. Right now there's a hefty enough speed
penalty to the filter (the states test takes about 150 times as long
with a bloom filter as with a normal set) that it may make sense to
have a base case that uses an optimized C implementation but that can
be overridden as needed.

Geremy Condra

[toc] | [prev] | [standalone]

csiph-web

Bloom Filter in 22 lines of Python (updated)

Contents

#6966 — Bloom Filter in 22 lines of Python (updated)

#7102

#7252

#7260