Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #4640 > unrolled thread
| Started by | Raymond Hettinger <python@rcn.com> |
|---|---|
| First post | 2011-05-04 11:17 -0700 |
| Last post | 2011-05-05 07:33 -0700 |
| Articles | 14 — 8 participants |
Back to article view | Back to comp.lang.python
Today's fun and educational Python recipe Raymond Hettinger <python@rcn.com> - 2011-05-04 11:17 -0700
Re: Today's fun and educational Python recipe Irmen de Jong <irmen@-NOSPAM-xs4all.nl> - 2011-05-04 21:02 +0200
Re: Today's fun and educational Python recipe Raymond Hettinger <python@rcn.com> - 2011-05-04 12:13 -0700
Re: Today's fun and educational Python recipe Irmen de Jong <irmen@-NOSPAM-xs4all.nl> - 2011-05-04 21:35 +0200
Re: Today's fun and educational Python recipe Grant Edwards <invalid@invalid.invalid> - 2011-05-04 19:17 +0000
Re: Today's fun and educational Python recipe Ben Finney <ben+python@benfinney.id.au> - 2011-05-05 09:33 +1000
Re: Today's fun and educational Python recipe Chris Angelico <rosuav@gmail.com> - 2011-05-05 12:22 +1000
Re: Today's fun and educational Python recipe Paul Rubin <no.email@nospam.invalid> - 2011-05-04 12:27 -0700
Re: Today's fun and educational Python recipe Raymond Hettinger <python@rcn.com> - 2011-05-04 14:53 -0700
Re: Today's fun and educational Python recipe Terry Reedy <tjreedy@udel.edu> - 2011-05-04 15:42 -0400
Re: Today's fun and educational Python recipe Raymond Hettinger <python@rcn.com> - 2011-05-04 14:39 -0700
Re: Today's fun and educational Python recipe Terry Reedy <tjreedy@udel.edu> - 2011-05-04 20:26 -0400
Re: Today's fun and educational Python recipe Raymond Hettinger <python@rcn.com> - 2011-05-04 18:15 -0700
Re: Today's fun and educational Python recipe nn <pruebauno@latinmail.com> - 2011-05-05 07:33 -0700
| From | Raymond Hettinger <python@rcn.com> |
|---|---|
| Date | 2011-05-04 11:17 -0700 |
| Subject | Today's fun and educational Python recipe |
| Message-ID | <dac673e7-fc1d-41fb-839e-97baa1bad360@s16g2000prf.googlegroups.com> |
Here's a 22-line beauty for a classic and amazing algorithm: http://bit.ly/bloom_filter The wiki article on the algorithm is brief and well-written: http://en.wikipedia.org/wiki/Bloom_filter It turns out that people in the 1970's were pretty smart :-) Raymond ------- follow my other python tips and recipes on twitter: @raymondh
[toc] | [next] | [standalone]
| From | Irmen de Jong <irmen@-NOSPAM-xs4all.nl> |
|---|---|
| Date | 2011-05-04 21:02 +0200 |
| Message-ID | <4dc1a2b9$0$41110$e4fe514c@news.xs4all.nl> |
| In reply to | #4640 |
On 04-05-11 20:17, Raymond Hettinger wrote: > Here's a 22-line beauty for a classic and amazing algorithm: > http://bit.ly/bloom_filter > > The wiki article on the algorithm is brief and well-written: > http://en.wikipedia.org/wiki/Bloom_filter > > It turns out that people in the 1970's were pretty smart :-) > I think that often, the cleverness of people is inversely proportional to the amount of CPU power and RAM that they have in their computer. "Meh, what would I need such a thing for, I could just as well stick everything into a list" Thankfully there are also people for whom this doesn't count. (are they stuck in the '70s?) Also: wasn't there a talk on Pycon in which a bloom filter was mentioned? Irmen
[toc] | [prev] | [next] | [standalone]
| From | Raymond Hettinger <python@rcn.com> |
|---|---|
| Date | 2011-05-04 12:13 -0700 |
| Message-ID | <00cf6d1f-518e-4638-abdf-9db2cad0735b@q12g2000prb.googlegroups.com> |
| In reply to | #4645 |
> > It turns out that people in the 1970's were pretty smart :-) > > I think that often, the cleverness of people is inversely proportional > to the amount of CPU power and RAM that they have in their computer. The Google guys have plenty of CPU power *and* plenty of cleverness :-) According to the wikipedia article, Google BigTable uses Bloom filters to reduce the disk lookups for non-existent rows or column. The Google Chrome web browser also uses Bloom filters to speed up its Safe Browsing service. > Also: wasn't there a talk on Pycon in which a bloom filter was mentioned? Yes! As a matter of fact there was: http://www.slideshare.net/c.titus.brown/pycon-2011-talk-ngram-assembly-with-bloom-filters Raymond ------- follow my other python tips and recipes on twitter: @raymondh
[toc] | [prev] | [next] | [standalone]
| From | Irmen de Jong <irmen@-NOSPAM-xs4all.nl> |
|---|---|
| Date | 2011-05-04 21:35 +0200 |
| Message-ID | <4dc1aa6f$0$41102$e4fe514c@news.xs4all.nl> |
| In reply to | #4646 |
On 04-05-11 21:13, Raymond Hettinger wrote: >>> It turns out that people in the 1970's were pretty smart :-) >> >> I think that often, the cleverness of people is inversely proportional >> to the amount of CPU power and RAM that they have in their computer. > > The Google guys have plenty of CPU power *and* plenty of > cleverness :-) Haha, true. We need to add a Googlyness factor in the equation. Or perhaps: think what they could achieve if they only had a few machines instead of thousands ;-) >> Also: wasn't there a talk on Pycon in which a bloom filter was mentioned? > > Yes! As a matter of fact there was: > http://www.slideshare.net/c.titus.brown/pycon-2011-talk-ngram-assembly-with-bloom-filters Thanks, that was the one. I didn't attend Pycon but I watched a truckload of talks on blip.tv and that one caught my attention because of its somewhat funny title 'handling ridiculous amounts of data with probabilistic data structures' I didn't understand all of it but the name Bloom Filter stuck, it seems. Adding it to my list of bookmarks of useful-stuff-I-intend-to-use-one-day-in-the-future... Irmen
[toc] | [prev] | [next] | [standalone]
| From | Grant Edwards <invalid@invalid.invalid> |
|---|---|
| Date | 2011-05-04 19:17 +0000 |
| Message-ID | <ips8nv$oec$1@reader1.panix.com> |
| In reply to | #4645 |
On 2011-05-04, Irmen de Jong <irmen@-NOSPAM-xs4all.nl> wrote:
> On 04-05-11 20:17, Raymond Hettinger wrote:
>> Here's a 22-line beauty for a classic and amazing algorithm:
>> http://bit.ly/bloom_filter
>>
>> The wiki article on the algorithm is brief and well-written:
>> http://en.wikipedia.org/wiki/Bloom_filter
>>
>> It turns out that people in the 1970's were pretty smart :-)
>>
>
> I think that often, the cleverness of people is inversely
> proportional to the amount of CPU power and RAM that they have in
> their computer.
True.
Unfortunately the difficulty in debugging and maintaining code is
often directly proportional to the cleverness exhibited by the
original programmer.
--
Grant Edwards grant.b.edwards Yow! I'm also against
at BODY-SURFING!!
gmail.com
[toc] | [prev] | [next] | [standalone]
| From | Ben Finney <ben+python@benfinney.id.au> |
|---|---|
| Date | 2011-05-05 09:33 +1000 |
| Message-ID | <87ei4ef3bx.fsf@benfinney.id.au> |
| In reply to | #4648 |
Grant Edwards <invalid@invalid.invalid> writes: > On 2011-05-04, Irmen de Jong <irmen@-NOSPAM-xs4all.nl> wrote: > > I think that often, the cleverness of people is inversely > > proportional to the amount of CPU power and RAM that they have in > > their computer. > > True. > > Unfortunately the difficulty in debugging and maintaining code is > often directly proportional to the cleverness exhibited by the > original programmer. +1 QOTW -- \ “Without cultural sanction, most or all of our religious | `\ beliefs and rituals would fall into the domain of mental | _o__) disturbance.” —John F. Schumaker | Ben Finney
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2011-05-05 12:22 +1000 |
| Message-ID | <mailman.1180.1304562159.9059.python-list@python.org> |
| In reply to | #4645 |
On Thu, May 5, 2011 at 5:02 AM, Irmen de Jong <irmen@-nospam-xs4all.nl> wrote: > I think that often, the cleverness of people is inversely proportional to > the amount of CPU power and RAM that they have in their computer. As Mark Rosewater is fond of saying, restrictions breed creativity. Lack of computational resources is a major restriction (for an extreme example of RAM shortage, look at how much code you can fit into a boot sector without loading anything more from the disk). Take away all the restrictions, and people will tend to code sloppily. Chris Angelico
[toc] | [prev] | [next] | [standalone]
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2011-05-04 12:27 -0700 |
| Message-ID | <7xy62mxo3v.fsf@ruckus.brouhaha.com> |
| In reply to | #4640 |
Raymond Hettinger <python@rcn.com> writes: > Here's a 22-line beauty for a classic and amazing algorithm: > http://bit.ly/bloom_filter The use of pickle to serialize the keys is a little bit suspicious if there might be a reason to dump the filter to disk and re-use it in another run of the program. Pickle representation might change between Python releases, for example. It's just supposed to stay interoperable between versions, not necessarily bitwise-identical. Otherwise it's quite nice. I'd suggest adding a .update() operation that adds keys from a user-supplied iterator.
[toc] | [prev] | [next] | [standalone]
| From | Raymond Hettinger <python@rcn.com> |
|---|---|
| Date | 2011-05-04 14:53 -0700 |
| Message-ID | <555ea465-0faf-428b-bfe0-c51efbc2d07f@35g2000prp.googlegroups.com> |
| In reply to | #4650 |
On May 4, 12:27 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> Raymond Hettinger <pyt...@rcn.com> writes:
> > Here's a 22-line beauty for a classic and amazing algorithm:
> >http://bit.ly/bloom_filter
>
> The use of pickle to serialize the keys is a little bit suspicious if
> there might be a reason to dump the filter to disk and re-use it in
> another run of the program. Pickle representation might change between
> Python releases, for example. It's just supposed to stay interoperable
> between versions, not necessarily bitwise-identical.
>
> Otherwise it's quite nice. I'd suggest adding a .update() operation
> that adds keys from a user-supplied iterator.
I chose pickle because it solved the problem of turning arbitrary
objects into bytes which are needed as inputs to sha512.
It seems that this particular choice of hash function is distracting
some readers away from the interesting part of how a Bloom filter
works.
Since the choice of hash functions is completely arbitrary, I'm
thinking of substituting something a little more basic. What do you
think of this alternative as a way to make it clearer that each
successive probe is uncorrelated, yet fully dependent on the key?
def get_probes(self, key):
hasher = Random(key).randrange
num_words = len(self.arr)
for _ in range(self.num_probes):
array_index = hasher(num_words)
bit_index = hasher(32)
yield array_index, 1 << bit_index
Raymond
-------
follow my other python tips and recipes on twitter: @raymondh
[toc] | [prev] | [next] | [standalone]
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2011-05-04 15:42 -0400 |
| Message-ID | <mailman.1168.1304538174.9059.python-list@python.org> |
| In reply to | #4640 |
On 5/4/2011 2:17 PM, Raymond Hettinger wrote:
> Here's a 22-line beauty for a classic and amazing algorithm:
> http://bit.ly/bloom_filter
>
> The wiki article on the algorithm is brief and well-written:
> http://en.wikipedia.org/wiki/Bloom_filter
As I understand the article, the array of num_bits should have
num_probes (more or less independent) bits set for each key. But as I
understand the code
for i in range(self.num_probes):
h, array_index = divmod(h, num_words)
h, bit_index = divmod(h, 32)
yield array_index, 1 << bit_index
the same bit is being set or tested num_probes times. The last three
lines have no dependence on i that I can see, so they appear to do the
same thing each time. This seems like a bug.
The article says "For a good hash function with a wide output, there
should be little if any correlation between different bit-fields of such
a hash, so this type of hash can be used to generate multiple
"different" hash functions by slicing its output into multiple bit
fields. Alternatively, one can pass k different initial values (such as
0, 1, ..., k − 1) to a hash function that takes an initial value; or add
(or append) these values to the key." I do not see the code doing either
of these.
--
Terry Jan Reedy
[toc] | [prev] | [next] | [standalone]
| From | Raymond Hettinger <python@rcn.com> |
|---|---|
| Date | 2011-05-04 14:39 -0700 |
| Message-ID | <c3f5a37b-d268-4c28-8ec1-8f2ca8365aae@f31g2000pri.googlegroups.com> |
| In reply to | #4653 |
On May 4, 12:42 pm, Terry Reedy <tjre...@udel.edu> wrote:
> On 5/4/2011 2:17 PM, Raymond Hettinger wrote:
>
> > Here's a 22-line beauty for a classic and amazing algorithm:
> >http://bit.ly/bloom_filter
>
> > The wiki article on the algorithm is brief and well-written:
> >http://en.wikipedia.org/wiki/Bloom_filter
>
> As I understand the article, the array of num_bits should have
> num_probes (more or less independent) bits set for each key. But as I
> understand the code
>
> for i in range(self.num_probes):
> h, array_index = divmod(h, num_words)
> h, bit_index = divmod(h, 32)
> yield array_index, 1 << bit_index
>
> the same bit is being set or tested num_probes times. The last three
> lines have no dependence on i that I can see, so they appear to do the
> same thing each time. This seems like a bug.
The 512 bits in h are progressively eaten-up between iterations. So
each pass yields a different (array index, bit_mask) pair.
It's easy to use the interactive prompt to show that different probes
are produced on each pass:
>>> bf = BloomFilter(num_bits=1000, num_probes=8)
>>> pprint(list(bf.get_probes('Alabama')))
[(19, 1073741824),
(11, 64),
(9, 134217728),
(25, 1024),
(24, 33554432),
(6, 16),
(7, 16777216),
(22, 1048576)]
The 512 bits are uncorrelated -- otherwise sha512 wouldn't be much of
a cryptographic hash ;)
The fifty state example in the recipe is a reasonable demonstration
that the recipe works as advertised. It successfully finds all fifty
states (the true positives) and it tries 100,000 negatives resulting
in only a handful of false negatives. That should be somewhat
convincing that it all works.
Raymond
-------
follow my other python tips and recipes on twitter: @raymondh
[toc] | [prev] | [next] | [standalone]
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2011-05-04 20:26 -0400 |
| Message-ID | <mailman.1176.1304555225.9059.python-list@python.org> |
| In reply to | #4662 |
On 5/4/2011 5:39 PM, Raymond Hettinger wrote:
> The 512 bits in h are progressively eaten-up between iterations. So
> each pass yields a different (array index, bit_mask) pair.
Yeh, obvious now that I see it.
> It's easy to use the interactive prompt to show that different probes
> are produced on each pass:
>
>>>> bf = BloomFilter(num_bits=1000, num_probes=8)
>>>> pprint(list(bf.get_probes('Alabama')))
> [(19, 1073741824),
> (11, 64),
> (9, 134217728),
> (25, 1024),
> (24, 33554432),
> (6, 16),
> (7, 16777216),
> (22, 1048576)]
Should have tried that.
> The 512 bits are uncorrelated -- otherwise sha512 wouldn't be much of
> a cryptographic hash ;)
> The fifty state example in the recipe is a reasonable demonstration
> that the recipe works as advertised. It successfully finds all fifty
> states (the true positives) and it tries 100,000 negatives resulting
> in only a handful of false negatives.
I presume you mean 'false positives', as in the program comment and
Wikipedia.
The test would be more convincing to many with 100000 other geographic
names (hard to come by, I know), or other english names or words or even
with longer random strings that matched the lengths of the state names.
But an average of 5/100000 false positives in 5 runs is good.
--
Terry Jan Reedy
[toc] | [prev] | [next] | [standalone]
| From | Raymond Hettinger <python@rcn.com> |
|---|---|
| Date | 2011-05-04 18:15 -0700 |
| Message-ID | <1ba35556-6dd8-4ddb-a9ea-19d3ed3768bc@34g2000pru.googlegroups.com> |
| In reply to | #4675 |
On May 4, 5:26 pm, Terry Reedy <tjre...@udel.edu> wrote: > The test would be more convincing to many with 100000 other geographic > names (hard to come by, I know), or other english names or words or even > with longer random strings that matched the lengths of the state names. > But an average of 5/100000 false positives in 5 runs is good. I've just posted an update with a spell checker using a large dictionary. That should make it easy to validate against some real world text samples. Raymond
[toc] | [prev] | [next] | [standalone]
| From | nn <pruebauno@latinmail.com> |
|---|---|
| Date | 2011-05-05 07:33 -0700 |
| Message-ID | <aa19442f-6a06-4692-aee4-549e2b9e8f9e@c41g2000yqm.googlegroups.com> |
| In reply to | #4640 |
On May 4, 2:17 pm, Raymond Hettinger <pyt...@rcn.com> wrote: > Here's a 22-line beauty for a classic and amazing algorithm:http://bit.ly/bloom_filter > > The wiki article on the algorithm is brief and well-written:http://en.wikipedia.org/wiki/Bloom_filter > > It turns out that people in the 1970's were pretty smart :-) > > Raymond > > ------- > follow my other python tips and recipes on twitter: @raymondh Very cool! Thanks.
[toc] | [prev] | [standalone]
Back to top | Article view | comp.lang.python
csiph-web