Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #33271 > unrolled thread
| Started by | Richard <richardbp@gmail.com> |
|---|---|
| First post | 2012-11-13 15:20 -0800 |
| Last post | 2012-11-14 14:00 +0100 |
| Articles | 8 on this page of 28 — 12 participants |
Back to article view | Back to comp.lang.python
Generate unique ID for URL Richard <richardbp@gmail.com> - 2012-11-13 15:20 -0800
Re: Generate unique ID for URL John Gordon <gordon@panix.com> - 2012-11-13 23:34 +0000
Re: Generate unique ID for URL Richard <richardbp@gmail.com> - 2012-11-13 15:56 -0800
Re: Generate unique ID for URL Chris Kaynor <ckaynor@zindagigames.com> - 2012-11-13 16:26 -0800
Re: Generate unique ID for URL Richard Baron Penman <richardbp@gmail.com> - 2012-11-14 11:41 +1100
Re: Generate unique ID for URL Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-11-14 10:44 +0100
Re: Generate unique ID for URL Richard <richardbp@gmail.com> - 2012-11-14 03:14 -0800
Re: Generate unique ID for URL Christian Heimes <christian@python.org> - 2012-11-14 01:43 +0100
Re: Generate unique ID for URL Richard <richardbp@gmail.com> - 2012-11-13 16:50 -0800
Re: Generate unique ID for URL Christian Heimes <christian@python.org> - 2012-11-14 02:05 +0100
Re: Generate unique ID for URL Christian Heimes <christian@python.org> - 2012-11-14 01:59 +0100
Re: Generate unique ID for URL Richard <richardbp@gmail.com> - 2012-11-13 17:18 -0800
Re: Generate unique ID for URL Richard <richardbp@gmail.com> - 2012-11-13 17:18 -0800
Re: Generate unique ID for URL Miki Tebeka <miki.tebeka@gmail.com> - 2012-11-13 16:13 -0800
Re: Generate unique ID for URL Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-11-14 02:04 +0000
Re: Generate unique ID for URL Steve Howell <showell30@yahoo.com> - 2012-11-13 18:32 -0800
Re: Generate unique ID for URL Richard <richardbp@gmail.com> - 2012-11-13 19:12 -0800
Re: Generate unique ID for URL Roy Smith <roy@panix.com> - 2012-11-13 20:39 -0500
Re: Generate unique ID for URL Richard <richardbp@gmail.com> - 2012-11-13 19:25 -0800
Re: Generate unique ID for URL Roy Smith <roy@panix.com> - 2012-11-13 22:38 -0500
Re: Generate unique ID for URL Richard <richardbp@gmail.com> - 2012-11-13 19:56 -0800
Re: Generate unique ID for URL Chris Angelico <rosuav@gmail.com> - 2012-11-14 15:06 +1100
Re: Generate unique ID for URL Richard <richardbp@gmail.com> - 2012-11-13 20:14 -0800
Re: Generate unique ID for URL Richard <richardbp@gmail.com> - 2012-11-13 20:14 -0800
Re: Generate unique ID for URL Richard <richardbp@gmail.com> - 2012-11-13 19:27 -0800
Re: Generate unique ID for URL Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-11-14 12:29 +0100
Re: Generate unique ID for URL Dave Angel <d@davea.name> - 2012-11-14 07:33 -0500
Re: Generate unique ID for URL Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-11-14 14:00 +0100
Page 2 of 2 — ← Prev page 1 [2]
| From | Richard <richardbp@gmail.com> |
|---|---|
| Date | 2012-11-13 19:56 -0800 |
| Message-ID | <2d7deb3f-0517-438f-8b98-fb9dc2979b1e@googlegroups.com> |
| In reply to | #33293 |
thanks for pointer to Varnish. I found MongoDB had a lot of size overhead so that it ended up using 4x the data stored.
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-11-14 15:06 +1100 |
| Message-ID | <mailman.3661.1352865978.27098.python-list@python.org> |
| In reply to | #33290 |
On Wed, Nov 14, 2012 at 2:25 PM, Richard <richardbp@gmail.com> wrote: > So the use case - I'm storing webpages on disk and want a quick retrieval system based on URL. > I can't store the files in a single directory because of OS limitations so have been using a sub folder structure. > For example to store data at URL "abc": a/b/c/index.html > This data is also viewed locally through a web app. > > If you can suggest a better approach I would welcome it. The cost of a crypto hash on the URL will be completely dwarfed by the cost of storing/retrieving on disk. You could probably do some arithmetic and figure out exactly how many URLs (at an average length of, say, 100 bytes) you can hash in the time of one disk seek. ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Richard <richardbp@gmail.com> |
|---|---|
| Date | 2012-11-13 20:14 -0800 |
| Message-ID | <ca97b67a-9c3b-4ddd-a368-47cabb1da4a6@googlegroups.com> |
| In reply to | #33295 |
yeah good point - I have gone with md5 for now. On Wednesday, November 14, 2012 3:06:18 PM UTC+11, Chris Angelico wrote: > On Wed, Nov 14, 2012 at 2:25 PM, Richard <richardbp@gmail.com> wrote: > > > So the use case - I'm storing webpages on disk and want a quick retrieval system based on URL. > > > I can't store the files in a single directory because of OS limitations so have been using a sub folder structure. > > > For example to store data at URL "abc": a/b/c/index.html > > > This data is also viewed locally through a web app. > > > > > > If you can suggest a better approach I would welcome it. > > > > The cost of a crypto hash on the URL will be completely dwarfed by the > > cost of storing/retrieving on disk. You could probably do some > > arithmetic and figure out exactly how many URLs (at an average length > > of, say, 100 bytes) you can hash in the time of one disk seek. > > > > ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Richard <richardbp@gmail.com> |
|---|---|
| Date | 2012-11-13 20:14 -0800 |
| Message-ID | <mailman.3663.1352866451.27098.python-list@python.org> |
| In reply to | #33295 |
yeah good point - I have gone with md5 for now. On Wednesday, November 14, 2012 3:06:18 PM UTC+11, Chris Angelico wrote: > On Wed, Nov 14, 2012 at 2:25 PM, Richard <richardbp@gmail.com> wrote: > > > So the use case - I'm storing webpages on disk and want a quick retrieval system based on URL. > > > I can't store the files in a single directory because of OS limitations so have been using a sub folder structure. > > > For example to store data at URL "abc": a/b/c/index.html > > > This data is also viewed locally through a web app. > > > > > > If you can suggest a better approach I would welcome it. > > > > The cost of a crypto hash on the URL will be completely dwarfed by the > > cost of storing/retrieving on disk. You could probably do some > > arithmetic and figure out exactly how many URLs (at an average length > > of, say, 100 bytes) you can hash in the time of one disk seek. > > > > ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Richard <richardbp@gmail.com> |
|---|---|
| Date | 2012-11-13 19:27 -0800 |
| Message-ID | <61a09732-e7e4-466b-9316-96636a9241d7@googlegroups.com> |
| In reply to | #33286 |
> The next step is to reduce the number of bits you are encoding. You > > said in another post that "1 collision in 10 million hashes would be > > tolerable". So you need: > > > > >>> math.log(10*1000*1000, 2) > > 23.25349666421154 I think a difficulty would be finding a hash algorithm that maps evenly across those bits.
[toc] | [prev] | [next] | [standalone]
| From | Johannes Bauer <dfnsonfsduifb@gmx.de> |
|---|---|
| Date | 2012-11-14 12:29 +0100 |
| Message-ID | <k7vva1$srq$1@news.albasani.net> |
| In reply to | #33286 |
On 14.11.2012 02:39, Roy Smith wrote: > The next step is to reduce the number of bits you are encoding. You > said in another post that "1 collision in 10 million hashes would be > tolerable". So you need: > >>>> math.log(10*1000*1000, 2) > 23.25349666421154 > > 24 bits worth of key. Nope :-) > Base64 encoded, that's only 4 characters. > Actually, I probably just proved that I don't really understand how > probabilities work, so maybe what you really need is 32 or 48 or 64 > bits. :-)) When doing these calculations, it's important to keep the birthday paradox in mind (this is kind of counter-intuitive): The chance of a collission raises tremendously when we're looking for *any* arbitrary two hashes colliding within a certain namespace. The probability you've calculated is the pre-image probability (which you also again need to multiply with a factor of two, because when trying to collide one given hash, in the mean case you'll only have to search *half* the namespace before finding a collision). There are three things you need to know before you can give an estimate: 1. The permissible probability of a collision (1e-7 in this case) 2. The hash size 3. The worst-case number of elements in the namespace You neglected 3 completely -- but knowing this is really important. This becomes obvious when looking at the extreme cases: Let's say you have a hash of arbitrary size, but only hash one element. The chance of a collision is *always* zero. Or look at a hash of size 2^n. Then put 2^n + 1 elements in the namespace. The chance of a collision is *always* one. Doing the calculations (formulas can be found on wikipedia on the site of the birthday phaenomenon), you can come up with these following bitlenghts of the hash with a 1e-7 probability of collision in respect to the worst-case number of elements 10k elements: 49 bit 100k elements: 56 bit 1e6 elements: 63 bit 100e6 elements: 76 bit 1e9 elements: 83 bit 1e12 elements: 102 bit Best regards, Johannes -- >> Wo hattest Du das Beben nochmal GENAU vorhergesagt? > Zumindest nicht öffentlich! Ah, der neueste und bis heute genialste Streich unsere großen Kosmologen: Die Geheim-Vorhersage. - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>
[toc] | [prev] | [next] | [standalone]
| From | Dave Angel <d@davea.name> |
|---|---|
| Date | 2012-11-14 07:33 -0500 |
| Message-ID | <mailman.3681.1352896467.27098.python-list@python.org> |
| In reply to | #33324 |
On 11/14/2012 06:29 AM, Johannes Bauer wrote: > <snip> > > When doing these calculations, it's important to keep the birthday > paradox in mind (this is kind of counter-intuitive): The chance of a > collission raises tremendously when we're looking for *any* arbitrary > two hashes colliding within a certain namespace. The probability you've > calculated is the pre-image probability (which you also again need to > multiply with a factor of two, because when trying to collide one given > hash, in the mean case you'll only have to search *half* the namespace > before finding a collision). > <snip> Te birthday paradox could have been important had the OP stated his goal differently. What he said was: """Ideally I would want to avoid collisions altogether. But if that means significant extra CPU time then 1 collision in 10 million hashes would be tolerable.""" That means that he's willing to do the necessary overhead of collision resolution, once in every 10 million lookups. That's not the same as saying that he wants only one chance in 10 million of having ANY collisions among his data items. -- DaveA
[toc] | [prev] | [next] | [standalone]
| From | Johannes Bauer <dfnsonfsduifb@gmx.de> |
|---|---|
| Date | 2012-11-14 14:00 +0100 |
| Message-ID | <k804l4$8ta$1@news.albasani.net> |
| In reply to | #33329 |
On 14.11.2012 13:33, Dave Angel wrote: > Te birthday paradox could have been important had the OP stated his goal > differently. What he said was: > > """Ideally I would want to avoid collisions altogether. But if that means significant extra CPU time then 1 collision in 10 million hashes would be tolerable.""" > > That means that he's willing to do the necessary overhead of collision > resolution, once in every 10 million lookups. That's not the same as > saying that he wants only one chance in 10 million of having ANY > collisions among his data items. Since he stated in a later post that he actually went with MD5, the calculations are indeed relevant. They give the number of bits a perfect hash needs to have in order to get the desired low probablility of collision resolutions. And for that the birthday paradox probability must be considered instead of the (much lower) pre-image probability. In any case, it appeared to me as if the OP was rather looking for ideas and wasn't sure himself what approach to take -- so I find it quite appropriate to give suggestions one way or another (even if they might not fit the exact phrasing of one of his postings). Best regards, Johannes -- >> Wo hattest Du das Beben nochmal GENAU vorhergesagt? > Zumindest nicht öffentlich! Ah, der neueste und bis heute genialste Streich unsere großen Kosmologen: Die Geheim-Vorhersage. - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>
[toc] | [prev] | [standalone]
Page 2 of 2 — ← Prev page 1 [2]
Back to top | Article view | comp.lang.python
csiph-web