Re: Bloom Filter in 22 lines of Python (updated)

References	<5a03279e-993e-4a35-9220-da5014d68430@34g2000pru.googlegroups.com> <mailman.2499.1307382457.9059.python-list@python.org> <c5671628-1464-47aa-ac26-9715bdbe94e7@s41g2000prb.googlegroups.com>
Date	2011-06-08 13:48 -0700
Subject	Re: Bloom Filter in 22 lines of Python (updated)
From	geremy condra <debatem1@gmail.com>
Newsgroups	comp.lang.python
Message-ID	<mailman.35.1307566137.11593.python-list@python.org> (permalink)

Show all headers | View raw

On Wed, Jun 8, 2011 at 1:05 PM, Raymond Hettinger <python@rcn.com> wrote:
> On Jun 6, 10:47 am, geremy condra <debat...@gmail.com> wrote:
>> On Fri, Jun 3, 2011 at 1:17 PM, Raymond Hettinger <pyt...@rcn.com> wrote:
>> > Thanks for all the feedback on the earlier post.
>>
>> > I've updated the recipe to use a cleaner API, simpler code,
>> > more easily subclassable, and with optional optimizations
>> > for better cache utilization and speed:
>>
>> >  http://code.activestate.com/recipes/577684-bloom-filter/
>>
>> Any chance of this landing in collections?
>
> I would be happy to add something like this to collections
> once the API matures.
>
> Right now, the constructor expects the user to know the desired
> memory size and number of probes.  Those could be computed
> from the maximum number of unique keys and the tolerable
> error false positive rate.
>
> The most convenient constructor could be:
>    b = BloomFilter(myseq)
> where len(myseq) is presumed indicate the maximum number of
> unique entries and there is a default tolerable false positive
> rate of 1 in 10,000.
>
> The API already lets the user substitute different methods
> for get_probes().  It should also allow the bytearray to be
> easily replaced by other objects with indexable access
> (array objects, mmap objects, etc).
>
> We could also provide union, intersection, and difference
> operations but these would be a little awkward for general
> purpose use because the result is only meaningful when
> the sizes, probe functions, and input domain are all the same.
>
> Fortunately, there is a lot of prior art, so the API design
> can be informed by what has worked well for others.

Seems like the code that Dan Stromberg posted addresses a lot of this.
Thoughts on his code?

Also, if you're interested I'm about halfway through writing a C
implementation, although it would have to be somewhat less flexible to
get a real speed advantage. Right now there's a hefty enough speed
penalty to the filter (the states test takes about 150 times as long
with a bloom filter as with a normal set) that it may make sense to
have a base case that uses an optimized C implementation but that can
be overridden as needed.

Geremy Condra

Back to comp.lang.python | Previous | Next — Previous in thread | Find similar | Unroll thread

Thread

Bloom Filter in 22 lines of Python (updated) Raymond Hettinger <python@rcn.com> - 2011-06-03 13:17 -0700
  Re: Bloom Filter in 22 lines of Python (updated) geremy condra <debatem1@gmail.com> - 2011-06-06 10:47 -0700
    Re: Bloom Filter in 22 lines of Python (updated) Raymond Hettinger <python@rcn.com> - 2011-06-08 13:05 -0700
      Re: Bloom Filter in 22 lines of Python (updated) geremy condra <debatem1@gmail.com> - 2011-06-08 13:48 -0700

csiph-web