Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #7260
| References | <5a03279e-993e-4a35-9220-da5014d68430@34g2000pru.googlegroups.com> <mailman.2499.1307382457.9059.python-list@python.org> <c5671628-1464-47aa-ac26-9715bdbe94e7@s41g2000prb.googlegroups.com> |
|---|---|
| Date | 2011-06-08 13:48 -0700 |
| Subject | Re: Bloom Filter in 22 lines of Python (updated) |
| From | geremy condra <debatem1@gmail.com> |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.35.1307566137.11593.python-list@python.org> (permalink) |
On Wed, Jun 8, 2011 at 1:05 PM, Raymond Hettinger <python@rcn.com> wrote: > On Jun 6, 10:47 am, geremy condra <debat...@gmail.com> wrote: >> On Fri, Jun 3, 2011 at 1:17 PM, Raymond Hettinger <pyt...@rcn.com> wrote: >> > Thanks for all the feedback on the earlier post. >> >> > I've updated the recipe to use a cleaner API, simpler code, >> > more easily subclassable, and with optional optimizations >> > for better cache utilization and speed: >> >> > http://code.activestate.com/recipes/577684-bloom-filter/ >> >> Any chance of this landing in collections? > > I would be happy to add something like this to collections > once the API matures. > > Right now, the constructor expects the user to know the desired > memory size and number of probes. Those could be computed > from the maximum number of unique keys and the tolerable > error false positive rate. > > The most convenient constructor could be: > b = BloomFilter(myseq) > where len(myseq) is presumed indicate the maximum number of > unique entries and there is a default tolerable false positive > rate of 1 in 10,000. > > The API already lets the user substitute different methods > for get_probes(). It should also allow the bytearray to be > easily replaced by other objects with indexable access > (array objects, mmap objects, etc). > > We could also provide union, intersection, and difference > operations but these would be a little awkward for general > purpose use because the result is only meaningful when > the sizes, probe functions, and input domain are all the same. > > Fortunately, there is a lot of prior art, so the API design > can be informed by what has worked well for others. Seems like the code that Dan Stromberg posted addresses a lot of this. Thoughts on his code? Also, if you're interested I'm about halfway through writing a C implementation, although it would have to be somewhat less flexible to get a real speed advantage. Right now there's a hefty enough speed penalty to the filter (the states test takes about 150 times as long with a bloom filter as with a normal set) that it may make sense to have a base case that uses an optimized C implementation but that can be overridden as needed. Geremy Condra
Back to comp.lang.python | Previous | Next — Previous in thread | Find similar | Unroll thread
Bloom Filter in 22 lines of Python (updated) Raymond Hettinger <python@rcn.com> - 2011-06-03 13:17 -0700
Re: Bloom Filter in 22 lines of Python (updated) geremy condra <debatem1@gmail.com> - 2011-06-06 10:47 -0700
Re: Bloom Filter in 22 lines of Python (updated) Raymond Hettinger <python@rcn.com> - 2011-06-08 13:05 -0700
Re: Bloom Filter in 22 lines of Python (updated) geremy condra <debatem1@gmail.com> - 2011-06-08 13:48 -0700
csiph-web