Path: csiph.com!usenet.pasdenom.info!dedibox.gegeweb.org!gegeweb.eu!nntpfeed.proxad.net!proxad.net!feeder1-2.proxad.net!usenet-fr.net!nerim.net!novso.com!newsfeed.xs4all.nl!newsfeed5.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail Return-Path: X-Original-To: python-list@python.org Delivered-To: python-list@mail.python.org X-Spam-Status: OK 0.074 X-Spam-Evidence: '*H*': 0.85; '*S*': 0.00; 'raises': 0.07; 'differently.': 0.09; 'namespace': 0.09; 'cc:addr:python-list': 0.10; '*any*': 0.16; 'hashes': 0.16; 'lookups.': 0.16; 'namespace.': 0.16; 'subject:Generate': 0.16; 'subject:URL': 0.16; 'two,': 0.16; 'wrote:': 0.17; 'items.': 0.17; 'saying': 0.18; 'trying': 0.21; 'cc:2**0': 0.23; '(this': 0.24; 'cc:no real name:2**0': 0.24; 'cc:addr:python.org': 0.25; 'header:In-Reply- To:1': 0.25; 'header:User-Agent:1': 0.26; '(which': 0.26; 'charset:iso-8859-15': 0.26; 'am,': 0.27; 'cpu': 0.29; 'factor': 0.29; 'overhead': 0.29; 'probability': 0.29; "we're": 0.30; 'could': 0.32; 'certain': 0.33; "he's": 0.33; 'doing': 0.35; 'but': 0.36; 'two': 0.37; 'data': 0.37; 'subject:: ': 0.38; 'mean': 0.38; 'received:192': 0.39; 'received:192.168': 0.40; "you've": 0.61; 'chance': 0.61; 'kind': 0.61; "you'll": 0.62; 'within': 0.64; 'header:Reply-To:1': 0.68; 'stated': 0.69; 'received:74.208': 0.71; 'million': 0.72; 'reply-to:no real name:2**0': 0.72; 'goal': 0.74; 'collision': 0.84; 'multiply': 0.84; 'received:74.208.4.194': 0.84; 'resolution,': 0.84; 'birthday': 0.91; 'was:': 0.91 Date: Wed, 14 Nov 2012 07:33:47 -0500 From: Dave Angel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:16.0) Gecko/20121011 Thunderbird/16.0.1 MIME-Version: 1.0 To: Johannes Bauer Subject: Re: Generate unique ID for URL References: <0692e6a2-343c-4eb0-be57-fe5c815efb99@googlegroups.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-15 Content-Transfer-Encoding: 7bit X-Provags-ID: V02:K0:U8CbAg0C7LXAtoKy4R/toXoaOAvGf0LUareMYiPXyhB C4my2ldQdg/TzO52W34svnRFi/Dp79TXOPQkAptBVA0rARLGbs cNFTwKtTnGZOrl284JzBGfIYdBi+dymOXiz/RxxAo34G+DKYuj rfjdQ52+SEDrkpPBMpZWoFJ35UE1ZyZLgxUWe7/ZRU+zaQaQV3 l2MBeyRP73MgHxzdosZWWg+DPYLm72qmwsmmNHPEcYwjW6Qgq3 H2+ymCG874vBZdDN00Ns6JS2fH6wA8IZkW7E3jqeAevNDkhFdZ A1vFR+oGk9WabizNj0KQuYKhLy+Hmfjifp5XmTfvBrDti2L2g= = Cc: python-list@python.org X-BeenThere: python-list@python.org X-Mailman-Version: 2.1.15 Precedence: list Reply-To: d@davea.name List-Id: General discussion list for the Python programming language List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Newsgroups: comp.lang.python Message-ID: Lines: 29 NNTP-Posting-Host: 2001:888:2000:d::a6 X-Trace: 1352896467 news.xs4all.nl 6936 [2001:888:2000:d::a6]:45683 X-Complaints-To: abuse@xs4all.nl Xref: csiph.com comp.lang.python:33329 On 11/14/2012 06:29 AM, Johannes Bauer wrote: > > > When doing these calculations, it's important to keep the birthday > paradox in mind (this is kind of counter-intuitive): The chance of a > collission raises tremendously when we're looking for *any* arbitrary > two hashes colliding within a certain namespace. The probability you've > calculated is the pre-image probability (which you also again need to > multiply with a factor of two, because when trying to collide one given > hash, in the mean case you'll only have to search *half* the namespace > before finding a collision). > Te birthday paradox could have been important had the OP stated his goal differently. What he said was: """Ideally I would want to avoid collisions altogether. But if that means significant extra CPU time then 1 collision in 10 million hashes would be tolerable.""" That means that he's willing to do the necessary overhead of collision resolution, once in every 10 million lookups. That's not the same as saying that he wants only one chance in 10 million of having ANY collisions among his data items. -- DaveA