Groups > comp.lang.python > #4640 > unrolled thread

Today's fun and educational Python recipe

Started by	Raymond Hettinger <python@rcn.com>
First post	2011-05-04 11:17 -0700
Last post	2011-05-05 07:33 -0700
Articles	14 — 8 participants

Back to article view | Back to comp.lang.python

  Today's fun and educational Python recipe Raymond Hettinger <python@rcn.com> - 2011-05-04 11:17 -0700
    Re: Today's fun and educational Python recipe Irmen de Jong <irmen@-NOSPAM-xs4all.nl> - 2011-05-04 21:02 +0200
      Re: Today's fun and educational Python recipe Raymond Hettinger <python@rcn.com> - 2011-05-04 12:13 -0700
        Re: Today's fun and educational Python recipe Irmen de Jong <irmen@-NOSPAM-xs4all.nl> - 2011-05-04 21:35 +0200
      Re: Today's fun and educational Python recipe Grant Edwards <invalid@invalid.invalid> - 2011-05-04 19:17 +0000
        Re: Today's fun and educational Python recipe Ben Finney <ben+python@benfinney.id.au> - 2011-05-05 09:33 +1000
      Re: Today's fun and educational Python recipe Chris Angelico <rosuav@gmail.com> - 2011-05-05 12:22 +1000
    Re: Today's fun and educational Python recipe Paul Rubin <no.email@nospam.invalid> - 2011-05-04 12:27 -0700
      Re: Today's fun and educational Python recipe Raymond Hettinger <python@rcn.com> - 2011-05-04 14:53 -0700
    Re: Today's fun and educational Python recipe Terry Reedy <tjreedy@udel.edu> - 2011-05-04 15:42 -0400
      Re: Today's fun and educational Python recipe Raymond Hettinger <python@rcn.com> - 2011-05-04 14:39 -0700
        Re: Today's fun and educational Python recipe Terry Reedy <tjreedy@udel.edu> - 2011-05-04 20:26 -0400
          Re: Today's fun and educational Python recipe Raymond Hettinger <python@rcn.com> - 2011-05-04 18:15 -0700
    Re: Today's fun and educational Python recipe nn <pruebauno@latinmail.com> - 2011-05-05 07:33 -0700

#4640 — Today's fun and educational Python recipe

From	Raymond Hettinger <python@rcn.com>
Date	2011-05-04 11:17 -0700
Subject	Today's fun and educational Python recipe
Message-ID	<dac673e7-fc1d-41fb-839e-97baa1bad360@s16g2000prf.googlegroups.com>

Here's a 22-line beauty for a classic and amazing algorithm:
http://bit.ly/bloom_filter

The wiki article on the algorithm is brief and well-written:
http://en.wikipedia.org/wiki/Bloom_filter

It turns out that people in the 1970's were pretty smart :-)


Raymond

-------
follow my other python tips and recipes on twitter:  @raymondh

[toc] | [next] | [standalone]

#4645

From	Irmen de Jong <irmen@-NOSPAM-xs4all.nl>
Date	2011-05-04 21:02 +0200
Message-ID	<4dc1a2b9$0$41110$e4fe514c@news.xs4all.nl>
In reply to	#4640

On 04-05-11 20:17, Raymond Hettinger wrote:
> Here's a 22-line beauty for a classic and amazing algorithm:
> http://bit.ly/bloom_filter
>
> The wiki article on the algorithm is brief and well-written:
> http://en.wikipedia.org/wiki/Bloom_filter
>
> It turns out that people in the 1970's were pretty smart :-)
>

I think that often, the cleverness of people is inversely proportional 
to the amount of CPU power and RAM that they have in their computer.

"Meh, what would I need such a thing for, I could just as well stick 
everything into a list"

Thankfully there are also people for whom this doesn't count.
(are they stuck in the '70s?)

Also: wasn't there a talk on Pycon in which a bloom filter was mentioned?

Irmen

[toc] | [prev] | [next] | [standalone]

#4646

From	Raymond Hettinger <python@rcn.com>
Date	2011-05-04 12:13 -0700
Message-ID	<00cf6d1f-518e-4638-abdf-9db2cad0735b@q12g2000prb.googlegroups.com>
In reply to	#4645

> > It turns out that people in the 1970's were pretty smart :-)
>
> I think that often, the cleverness of people is inversely proportional
> to the amount of CPU power and RAM that they have in their computer.

The Google guys have plenty of CPU power *and* plenty of
cleverness :-)

According to the wikipedia article, Google BigTable uses Bloom filters
to reduce the disk lookups for non-existent rows or column.  The
Google Chrome web browser also uses Bloom filters to speed up its Safe
Browsing service.


> Also: wasn't there a talk on Pycon in which a bloom filter was mentioned?

Yes!  As a matter of fact there was:
http://www.slideshare.net/c.titus.brown/pycon-2011-talk-ngram-assembly-with-bloom-filters


Raymond

-------
follow my other python tips and recipes on twitter:  @raymondh

[toc] | [prev] | [next] | [standalone]

#4652

From	Irmen de Jong <irmen@-NOSPAM-xs4all.nl>
Date	2011-05-04 21:35 +0200
Message-ID	<4dc1aa6f$0$41102$e4fe514c@news.xs4all.nl>
In reply to	#4646

On 04-05-11 21:13, Raymond Hettinger wrote:
>>> It turns out that people in the 1970's were pretty smart :-)
>>
>> I think that often, the cleverness of people is inversely proportional
>> to the amount of CPU power and RAM that they have in their computer.
>
> The Google guys have plenty of CPU power *and* plenty of
> cleverness :-)

Haha, true. We need to add a Googlyness factor in the equation.
Or perhaps: think what they could achieve if they only had a few 
machines instead of thousands ;-)

>> Also: wasn't there a talk on Pycon in which a bloom filter was mentioned?
>
> Yes!  As a matter of fact there was:
> http://www.slideshare.net/c.titus.brown/pycon-2011-talk-ngram-assembly-with-bloom-filters

Thanks, that was the one.

I didn't attend Pycon but I watched a truckload of talks on blip.tv and 
that one caught my attention because of its somewhat funny title
'handling ridiculous amounts of data with probabilistic data structures'

I didn't understand all of it but the name Bloom Filter stuck, it seems. 
  Adding it to my list of bookmarks of 
useful-stuff-I-intend-to-use-one-day-in-the-future...

Irmen

[toc] | [prev] | [next] | [standalone]

#4648

From	Grant Edwards <invalid@invalid.invalid>
Date	2011-05-04 19:17 +0000
Message-ID	<ips8nv$oec$1@reader1.panix.com>
In reply to	#4645

On 2011-05-04, Irmen de Jong <irmen@-NOSPAM-xs4all.nl> wrote:
> On 04-05-11 20:17, Raymond Hettinger wrote:
>> Here's a 22-line beauty for a classic and amazing algorithm:
>> http://bit.ly/bloom_filter
>>
>> The wiki article on the algorithm is brief and well-written:
>> http://en.wikipedia.org/wiki/Bloom_filter
>>
>> It turns out that people in the 1970's were pretty smart :-)
>>
>
> I think that often, the cleverness of people is inversely
> proportional to the amount of CPU power and RAM that they have in
> their computer.

True.

Unfortunately the difficulty in debugging and maintaining code is
often directly proportional to the cleverness exhibited by the
original programmer.

-- 
Grant Edwards               grant.b.edwards        Yow! I'm also against
                                  at               BODY-SURFING!!
                              gmail.com

[toc] | [prev] | [next] | [standalone]

#4672

From	Ben Finney <ben+python@benfinney.id.au>
Date	2011-05-05 09:33 +1000
Message-ID	<87ei4ef3bx.fsf@benfinney.id.au>
In reply to	#4648

Grant Edwards <invalid@invalid.invalid> writes:

> On 2011-05-04, Irmen de Jong <irmen@-NOSPAM-xs4all.nl> wrote:
> > I think that often, the cleverness of people is inversely
> > proportional to the amount of CPU power and RAM that they have in
> > their computer.
>
> True.
>
> Unfortunately the difficulty in debugging and maintaining code is
> often directly proportional to the cleverness exhibited by the
> original programmer.

+1 QOTW

-- 
 \            “Without cultural sanction, most or all of our religious |
  `\          beliefs and rituals would fall into the domain of mental |
_o__)                                 disturbance.” —John F. Schumaker |
Ben Finney

[toc] | [prev] | [next] | [standalone]

#4684

From	Chris Angelico <rosuav@gmail.com>
Date	2011-05-05 12:22 +1000
Message-ID	<mailman.1180.1304562159.9059.python-list@python.org>
In reply to	#4645

On Thu, May 5, 2011 at 5:02 AM, Irmen de Jong <irmen@-nospam-xs4all.nl> wrote:
> I think that often, the cleverness of people is inversely proportional to
> the amount of CPU power and RAM that they have in their computer.

As Mark Rosewater is fond of saying, restrictions breed creativity.
Lack of computational resources is a major restriction (for an extreme
example of RAM shortage, look at how much code you can fit into a boot
sector without loading anything more from the disk). Take away all the
restrictions, and people will tend to code sloppily.

Chris Angelico

[toc] | [prev] | [next] | [standalone]

#4650

From	Paul Rubin <no.email@nospam.invalid>
Date	2011-05-04 12:27 -0700
Message-ID	<7xy62mxo3v.fsf@ruckus.brouhaha.com>
In reply to	#4640

Raymond Hettinger <python@rcn.com> writes:
> Here's a 22-line beauty for a classic and amazing algorithm:
> http://bit.ly/bloom_filter

The use of pickle to serialize the keys is a little bit suspicious if
there might be a reason to dump the filter to disk and re-use it in
another run of the program.  Pickle representation might change between
Python releases, for example.  It's just supposed to stay interoperable
between versions, not necessarily bitwise-identical.

Otherwise it's quite nice.  I'd suggest adding a .update() operation
that adds keys from a user-supplied iterator.

[toc] | [prev] | [next] | [standalone]

#4667

From	Raymond Hettinger <python@rcn.com>
Date	2011-05-04 14:53 -0700
Message-ID	<555ea465-0faf-428b-bfe0-c51efbc2d07f@35g2000prp.googlegroups.com>
In reply to	#4650

On May 4, 12:27 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> Raymond Hettinger <pyt...@rcn.com> writes:
> > Here's a 22-line beauty for a classic and amazing algorithm:
> >http://bit.ly/bloom_filter
>
> The use of pickle to serialize the keys is a little bit suspicious if
> there might be a reason to dump the filter to disk and re-use it in
> another run of the program.  Pickle representation might change between
> Python releases, for example.  It's just supposed to stay interoperable
> between versions, not necessarily bitwise-identical.
>
> Otherwise it's quite nice.  I'd suggest adding a .update() operation
> that adds keys from a user-supplied iterator.

I chose pickle because it solved the problem of turning arbitrary
objects into bytes which are needed as inputs to sha512.

It seems that this particular choice of hash function is distracting
some readers away from the interesting part of how a Bloom filter
works.

Since the choice of hash functions is completely arbitrary, I'm
thinking of substituting something a little more basic.  What do you
think of this alternative as a way to make it clearer that each
successive probe is uncorrelated, yet fully dependent on the key?

    def get_probes(self, key):
        hasher = Random(key).randrange
        num_words = len(self.arr)
        for _ in range(self.num_probes):
            array_index = hasher(num_words)
            bit_index = hasher(32)
            yield array_index, 1 << bit_index

Raymond
-------

follow my other python tips and recipes on twitter:  @raymondh

[toc] | [prev] | [next] | [standalone]

#4653

From	Terry Reedy <tjreedy@udel.edu>
Date	2011-05-04 15:42 -0400
Message-ID	<mailman.1168.1304538174.9059.python-list@python.org>
In reply to	#4640

On 5/4/2011 2:17 PM, Raymond Hettinger wrote:
> Here's a 22-line beauty for a classic and amazing algorithm:
> http://bit.ly/bloom_filter
>
> The wiki article on the algorithm is brief and well-written:
> http://en.wikipedia.org/wiki/Bloom_filter

As I understand the article, the array of num_bits should have 
num_probes (more or less independent) bits set for each key. But as I 
understand the code

         for i in range(self.num_probes):
             h, array_index = divmod(h, num_words)
             h, bit_index = divmod(h, 32)
             yield array_index, 1 << bit_index

the same bit is being set or tested num_probes times. The last three 
lines have no dependence on i that I can see, so they appear to do the 
same thing each time. This seems like a bug.

The article says "For a good hash function with a wide output, there 
should be little if any correlation between different bit-fields of such 
a hash, so this type of hash can be used to generate multiple 
"different" hash functions by slicing its output into multiple bit 
fields. Alternatively, one can pass k different initial values (such as 
0, 1, ..., k − 1) to a hash function that takes an initial value; or add 
(or append) these values to the key." I do not see the code doing either 
of these.

-- 
Terry Jan Reedy

[toc] | [prev] | [next] | [standalone]

#4662

From	Raymond Hettinger <python@rcn.com>
Date	2011-05-04 14:39 -0700
Message-ID	<c3f5a37b-d268-4c28-8ec1-8f2ca8365aae@f31g2000pri.googlegroups.com>
In reply to	#4653

On May 4, 12:42 pm, Terry Reedy <tjre...@udel.edu> wrote:
> On 5/4/2011 2:17 PM, Raymond Hettinger wrote:
>
> > Here's a 22-line beauty for a classic and amazing algorithm:
> >http://bit.ly/bloom_filter
>
> > The wiki article on the algorithm is brief and well-written:
> >http://en.wikipedia.org/wiki/Bloom_filter
>
> As I understand the article, the array of num_bits should have
> num_probes (more or less independent) bits set for each key. But as I
> understand the code
>
>          for i in range(self.num_probes):
>              h, array_index = divmod(h, num_words)
>              h, bit_index = divmod(h, 32)
>              yield array_index, 1 << bit_index
>
> the same bit is being set or tested num_probes times. The last three
> lines have no dependence on i that I can see, so they appear to do the
> same thing each time. This seems like a bug.

The 512 bits in h are progressively eaten-up between iterations.  So
each pass yields a different (array index, bit_mask) pair.

It's easy to use the interactive prompt to show that different probes
are produced on each pass:

>>> bf = BloomFilter(num_bits=1000, num_probes=8)
>>> pprint(list(bf.get_probes('Alabama')))
[(19, 1073741824),
 (11, 64),
 (9, 134217728),
 (25, 1024),
 (24, 33554432),
 (6, 16),
 (7, 16777216),
 (22, 1048576)]

The 512 bits are uncorrelated -- otherwise sha512 wouldn't be much of
a cryptographic hash ;)

The fifty state example in the recipe is a reasonable demonstration
that the recipe works as advertised.  It successfully finds all fifty
states (the true positives) and it tries 100,000 negatives resulting
in only a handful of false negatives.  That should be somewhat
convincing that it all works.


Raymond

-------
follow my other python tips and recipes on twitter:  @raymondh

[toc] | [prev] | [next] | [standalone]

#4675

From	Terry Reedy <tjreedy@udel.edu>
Date	2011-05-04 20:26 -0400
Message-ID	<mailman.1176.1304555225.9059.python-list@python.org>
In reply to	#4662

On 5/4/2011 5:39 PM, Raymond Hettinger wrote:

> The 512 bits in h are progressively eaten-up between iterations.  So
> each pass yields a different (array index, bit_mask) pair.

Yeh, obvious now that I see it.

> It's easy to use the interactive prompt to show that different probes
> are produced on each pass:
>
>>>> bf = BloomFilter(num_bits=1000, num_probes=8)
>>>> pprint(list(bf.get_probes('Alabama')))
> [(19, 1073741824),
>   (11, 64),
>   (9, 134217728),
>   (25, 1024),
>   (24, 33554432),
>   (6, 16),
>   (7, 16777216),
>   (22, 1048576)]

Should have tried that.

> The 512 bits are uncorrelated -- otherwise sha512 wouldn't be much of
> a cryptographic hash ;)

> The fifty state example in the recipe is a reasonable demonstration
> that the recipe works as advertised.  It successfully finds all fifty
> states (the true positives) and it tries 100,000 negatives resulting
> in only a handful of false negatives.

I presume you mean 'false positives', as in the program comment and 
Wikipedia.

The test would be more convincing to many with 100000 other geographic 
names (hard to come by, I know), or other english names or words or even 
with longer random strings that matched the lengths of the state names. 
But an average of 5/100000 false positives in 5 runs is good.

-- 
Terry Jan Reedy

[toc] | [prev] | [next] | [standalone]

#4682

From	Raymond Hettinger <python@rcn.com>
Date	2011-05-04 18:15 -0700
Message-ID	<1ba35556-6dd8-4ddb-a9ea-19d3ed3768bc@34g2000pru.googlegroups.com>
In reply to	#4675

On May 4, 5:26 pm, Terry Reedy <tjre...@udel.edu> wrote:
> The test would be more convincing to many with 100000 other geographic
> names (hard to come by, I know), or other english names or words or even
> with longer random strings that matched the lengths of the state names.
> But an average of 5/100000 false positives in 5 runs is good.

I've just posted an update with a spell checker using a large
dictionary.  That should make it easy to validate against some real
world text samples.


Raymond

[toc] | [prev] | [next] | [standalone]

#4735

From	nn <pruebauno@latinmail.com>
Date	2011-05-05 07:33 -0700
Message-ID	<aa19442f-6a06-4692-aee4-549e2b9e8f9e@c41g2000yqm.googlegroups.com>
In reply to	#4640

On May 4, 2:17 pm, Raymond Hettinger <pyt...@rcn.com> wrote:
> Here's a 22-line beauty for a classic and amazing algorithm:http://bit.ly/bloom_filter
>
> The wiki article on the algorithm is brief and well-written:http://en.wikipedia.org/wiki/Bloom_filter
>
> It turns out that people in the 1970's were pretty smart :-)
>
> Raymond
>
> -------
> follow my other python tips and recipes on twitter:  @raymondh

Very cool! Thanks.

[toc] | [prev] | [standalone]

csiph-web

Today's fun and educational Python recipe

Contents

#4640 — Today's fun and educational Python recipe

#4645

#4646

#4652

#4648

#4672

#4684

#4650

#4667

#4653

#4662

#4675

#4682

#4735