

Groups > comp.lang.python > #33271 > unrolled thread

Generate unique ID for URL

Started by: Richard <richardbp@gmail.com>
First post: 2012-11-13 15:20 -0800
Last post: 2012-11-14 14:00 +0100
Articles: 20 on this page of 28, 12 participants



Contents

  Generate unique ID for URL Richard <richardbp@gmail.com> - 2012-11-13 15:20 -0800
    Re: Generate unique ID for URL John Gordon <gordon@panix.com> - 2012-11-13 23:34 +0000
      Re: Generate unique ID for URL Richard <richardbp@gmail.com> - 2012-11-13 15:56 -0800
        Re: Generate unique ID for URL Chris Kaynor <ckaynor@zindagigames.com> - 2012-11-13 16:26 -0800
        Re: Generate unique ID for URL Richard Baron Penman <richardbp@gmail.com> - 2012-11-14 11:41 +1100
          Re: Generate unique ID for URL Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-11-14 10:44 +0100
            Re: Generate unique ID for URL Richard <richardbp@gmail.com> - 2012-11-14 03:14 -0800
        Re: Generate unique ID for URL Christian Heimes <christian@python.org> - 2012-11-14 01:43 +0100
          Re: Generate unique ID for URL Richard <richardbp@gmail.com> - 2012-11-13 16:50 -0800
            Re: Generate unique ID for URL Christian Heimes <christian@python.org> - 2012-11-14 02:05 +0100
        Re: Generate unique ID for URL Christian Heimes <christian@python.org> - 2012-11-14 01:59 +0100
          Re: Generate unique ID for URL Richard <richardbp@gmail.com> - 2012-11-13 17:18 -0800
          Re: Generate unique ID for URL Richard <richardbp@gmail.com> - 2012-11-13 17:18 -0800
    Re: Generate unique ID for URL Miki Tebeka <miki.tebeka@gmail.com> - 2012-11-13 16:13 -0800
      Re: Generate unique ID for URL Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-11-14 02:04 +0000
        Re: Generate unique ID for URL Steve Howell <showell30@yahoo.com> - 2012-11-13 18:32 -0800
          Re: Generate unique ID for URL Richard <richardbp@gmail.com> - 2012-11-13 19:12 -0800
    Re: Generate unique ID for URL Roy Smith <roy@panix.com> - 2012-11-13 20:39 -0500
      Re: Generate unique ID for URL Richard <richardbp@gmail.com> - 2012-11-13 19:25 -0800
        Re: Generate unique ID for URL Roy Smith <roy@panix.com> - 2012-11-13 22:38 -0500
          Re: Generate unique ID for URL Richard <richardbp@gmail.com> - 2012-11-13 19:56 -0800
        Re: Generate unique ID for URL Chris Angelico <rosuav@gmail.com> - 2012-11-14 15:06 +1100
          Re: Generate unique ID for URL Richard <richardbp@gmail.com> - 2012-11-13 20:14 -0800
          Re: Generate unique ID for URL Richard <richardbp@gmail.com> - 2012-11-13 20:14 -0800
      Re: Generate unique ID for URL Richard <richardbp@gmail.com> - 2012-11-13 19:27 -0800
      Re: Generate unique ID for URL Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-11-14 12:29 +0100
        Re: Generate unique ID for URL Dave Angel <d@davea.name> - 2012-11-14 07:33 -0500
          Re: Generate unique ID for URL Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-11-14 14:00 +0100



#33271 — Generate unique ID for URL

From: Richard <richardbp@gmail.com>
Date: 2012-11-13 15:20 -0800
Subject: Generate unique ID for URL
Message-ID: <0692e6a2-343c-4eb0-be57-fe5c815efb99@googlegroups.com>

Hello,

I want to create a URL-safe unique ID for URLs.
Currently I use:
url_id = base64.urlsafe_b64encode(url)

>>> base64.urlsafe_b64encode('docs.python.org/library/uuid.html')
'ZG9jcy5weXRob24ub3JnL2xpYnJhcnkvdXVpZC5odG1s'

I would prefer more concise IDs.
What do you recommend? Compression?

Richard



#33272

From: John Gordon <gordon@panix.com>
Date: 2012-11-13 23:34 +0000
Message-ID: <k7ulda$geh$1@reader1.panix.com>
In reply to: #33271

In <0692e6a2-343c-4eb0-be57-fe5c815efb99@googlegroups.com> Richard <richardbp@gmail.com> writes:

> I want to create a URL-safe unique ID for URL's.
> Currently I use:
> url_id = base64.urlsafe_b64encode(url)

> >>> base64.urlsafe_b64encode('docs.python.org/library/uuid.html')
> 'ZG9jcy5weXRob24ub3JnL2xpYnJhcnkvdXVpZC5odG1s'

> I would prefer more concise ID's. 
> What do you recommend? - Compression?

Does the ID need to contain all the information necessary to recreate the
original URL?

-- 
John Gordon                   A is for Amy, who fell down the stairs
gordon@panix.com              B is for Basil, assaulted by bears
                                -- Edward Gorey, "The Gashlycrumb Tinies"



#33273

From: Richard <richardbp@gmail.com>
Date: 2012-11-13 15:56 -0800
Message-ID: <133e0be5-63af-4f72-9d0a-c59b04aa4ce4@googlegroups.com>
In reply to: #33272

Good point - one-way encoding would be fine.

Also, this is performed millions of times, so ideally it should be efficient.




#33275

From: Chris Kaynor <ckaynor@zindagigames.com>
Date: 2012-11-13 16:26 -0800
Message-ID: <mailman.3653.1352852802.27098.python-list@python.org>
In reply to: #33273

One option would be using a hash. Python's built-in hash, a 32-bit
CRC, 128-bit MD5, 256-bit SHA or one of the many others that exist,
depending on the needs. Higher bit counts will reduce the odds of
accidental collisions; cryptographically secure ones if outside
attacks matter. In such a case, you'd have to roll your own means of
converting the hash back into the string if you ever need it for
debugging, and there is always the possibility of collisions. A
similar solution would be using a pseudo-random GUID using the url as
the seed.

You could use a counter if all IDs are generated by a single process
(and even in other cases with some work).

If you want to be able to go both ways, using base64 encoding is
probably your best bet, though you might get benefits by using
compression.
Chris
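
As a rough illustration of the hash options listed above, the sketch below derives a short URL-safe ID from a CRC-32, an MD5 and a SHA-256 of the same URL. It assumes Python 3; the helper name `short_id` is hypothetical, and stripping the `=` padding is a choice made here, not something the thread specifies.

```python
import base64
import hashlib
import zlib


def short_id(digest: bytes) -> str:
    """Return the URL-safe base64 form of a digest, without '=' padding."""
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


url = b"docs.python.org/library/uuid.html"

# More bits means a longer ID but a lower chance of accidental collisions.
crc_id = short_id(zlib.crc32(url).to_bytes(4, "big"))  # 32 bits  -> 6 characters
md5_id = short_id(hashlib.md5(url).digest())           # 128 bits -> 22 characters
sha_id = short_id(hashlib.sha256(url).digest())        # 256 bits -> 43 characters

print(crc_id, md5_id, sha_id)
```

All three are one-way: recovering the URL from the ID needs a separate lookup table.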




#33278

From: Richard Baron Penman <richardbp@gmail.com>
Date: 2012-11-14 11:41 +1100
Message-ID: <mailman.3655.1352853704.27098.python-list@python.org>
In reply to: #33273

I found the MD5 and SHA hashes slow to calculate.
The builtin hash is fast but I was concerned about collisions. What
rate of collisions could I expect?

Outside attacks are not an issue, and multiple processes would be used.




#33319

From: Johannes Bauer <dfnsonfsduifb@gmx.de>
Date: 2012-11-14 10:44 +0100
Message-ID: <k7vp6m$grn$1@news.albasani.net>
In reply to: #33278

On 14.11.2012 01:41, Richard Baron Penman wrote:
> I found the MD5 and SHA hashes slow to calculate.

Slow? For URLs? Are you kidding? How many URLs per second do you want to
calculate?

> The builtin hash is fast but I was concerned about collisions. What
> rate of collisions could I expect?

MD5 has 16 bytes (128 bit), SHA1 has 20 bytes (160 bit). Utilizing the
birthday paradox and some approximations, I can tell you that when using
the full MD5 you'd need around 2.609e16 hashes in the same namespace to
get a one in a million chance of a collision. That is, 26090000000000000
filenames.

For SHA1 this number rises even further and you'd need around 1.71e21 or
1710000000000000000000 hashes in one namespace for the one-in-a-million.

I really have no clue about how many URLs you want to hash, and it seems
to be LOTS since the speed of MD5 seems to be an issue for you. Let me
estimate that you'd want to calculate a million hashes per second. Then,
when you use MD5, you'd have about 827 years to fill the namespace up
enough to get a one-in-a-million.

If you need even more hashes (say a million million per second), I'd
suggest you go with SHA-1, giving you 54 years to get the one-in-a-million.

Then again, if you went for a million million hashes per second, Python
would probably not be the language of your choice.

Best regards,
Johannes

-- 
>> Where EXACTLY did you predict the quake again?
> At least not publicly!
Ah, the latest and, to this day, most brilliant stunt of our great
cosmologists: the secret prediction.
 - Karl Kaos on Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>
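
The figures in the post above can be reproduced with the usual birthday-paradox approximation, p ≈ 1 - exp(-n² / (2·2^bits)). The sketch below is a minimal version of that arithmetic, assuming Python 3 and uniformly distributed hash values; the function name is hypothetical.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def hashes_for_collision_probability(bits: int, p: float) -> float:
    """Approximate count of random hashes at which the chance of any collision reaches p."""
    # Solve p = 1 - exp(-n^2 / (2 * 2**bits)) for n.
    return math.sqrt(2 * 2.0 ** bits * -math.log1p(-p))


md5_n = hashes_for_collision_probability(128, 1e-6)
sha1_n = hashes_for_collision_probability(160, 1e-6)

# Time to reach that many hashes at the rates assumed in the post.
md5_years = md5_n / 1e6 / SECONDS_PER_YEAR
sha1_years = sha1_n / 1e12 / SECONDS_PER_YEAR

print("MD5:  %.3e hashes, about %.0f years at 1e6 hashes/s" % (md5_n, md5_years))
print("SHA1: %.3e hashes, about %.0f years at 1e12 hashes/s" % (sha1_n, sha1_years))
```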



#33322

From: Richard <richardbp@gmail.com>
Date: 2012-11-14 03:14 -0800
Message-ID: <0c74dc68-ea47-45d2-a701-a334eea4c22e@googlegroups.com>
In reply to: #33319

Thanks for the perspective!



#33279

From: Christian Heimes <christian@python.org>
Date: 2012-11-14 01:43 +0100
Message-ID: <mailman.3656.1352853781.27098.python-list@python.org>
In reply to: #33273

On 14.11.2012 01:26, Chris Kaynor wrote:
> One option would be using a hash. Python's built-in hash, a 32-bit
> CRC, 128-bit MD5, 256-bit SHA or one of the many others that exist,
> depending on the needs. Higher bit counts will reduce the odds of
> accidental collisions; cryptographically secure ones if outside
> attacks matter. In such a case, you'd have to roll your own means of
> converting the hash back into the string if you ever need it for
> debugging, and there is always the possibility of collisions. A
> similar solution would be using a pseudo-random GUID using the url as
> the seed.

A hash is the wrong answer to the issue as a hash is open to all sorts
of attack vectors like length extension attacks. If Richard needs to
ensure any kind of collision resistance then he needs a MAC, for example
an HMAC with a secret key.

If he needs some kind of persistent identifier then something like a URN
or DOI may be a better answer.

Christian
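
A minimal sketch of the HMAC suggestion, assuming Python 3 and the standard `hmac` and `hashlib` modules. The key value, the function name `keyed_id` and the 16-character truncation are all hypothetical choices for illustration.

```python
import base64
import hashlib
import hmac

# Hypothetical key for illustration only; a real key must be kept secret.
SECRET_KEY = b"replace-with-a-real-secret"


def keyed_id(url: bytes, length: int = 16) -> str:
    """Return a truncated, URL-safe HMAC-SHA256 of the URL."""
    mac = hmac.new(SECRET_KEY, url, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(mac).decode("ascii")[:length]


print(keyed_id(b"docs.python.org/library/uuid.html"))
```

Without the key, an outsider cannot compute the ID for a chosen URL, which is the property a plain hash lacks.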



#33280

From: Richard <richardbp@gmail.com>
Date: 2012-11-13 16:50 -0800
Message-ID: <9feb1237-b495-4367-8108-4d6291a5a05a@googlegroups.com>
In reply to: #33279

These URL IDs would just be used internally for quick lookups, not exposed publicly in a web application.

Ideally I would want to avoid collisions altogether. But if that means significant extra CPU time then 1 collision in 10 million hashes would be tolerable.



#33282

From: Christian Heimes <christian@python.org>
Date: 2012-11-14 02:05 +0100
Message-ID: <mailman.3658.1352855170.27098.python-list@python.org>
In reply to: #33280

On 14.11.2012 01:50, Richard wrote:
> These URL ID's would just be used internally for quick lookups, not exposed publicly in a web application.
> 
> Ideally I would want to avoid collisions altogether. But if that means significant extra CPU time then 1 collision in 10 million hashes would be tolerable.

Are you storing the URLs in any kind of database like a SQL database? A
proper index on the data column will avoid full table scans. It will
give you almost O(1) complexity on lookups and O(n) worst case
complexity for collisions.
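
A small sketch of the database route, assuming Python 3 and the standard `sqlite3` module. The table and function names are hypothetical. Note that SQLite implements the `UNIQUE` constraint with a B-tree index, so lookups here are logarithmic rather than the near-constant time a hash index would give.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The UNIQUE constraint makes SQLite build an index on the url column,
# so lookups by URL avoid a full table scan.
conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, url TEXT NOT NULL UNIQUE)")


def url_id(url: str) -> int:
    """Return the integer ID for a URL, inserting the URL on first sight."""
    conn.execute("INSERT OR IGNORE INTO pages (url) VALUES (?)", (url,))
    row = conn.execute("SELECT id FROM pages WHERE url = ?", (url,)).fetchone()
    return row[0]


print(url_id("docs.python.org/library/uuid.html"))
```

The integer ID is collision-free by construction, at the cost of keeping the table around.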



#33281

From: Christian Heimes <christian@python.org>
Date: 2012-11-14 01:59 +0100
Message-ID: <mailman.3657.1352854794.27098.python-list@python.org>
In reply to: #33273

On 14.11.2012 01:41, Richard Baron Penman wrote:
> I found the MD5 and SHA hashes slow to calculate.
> The builtin hash is fast but I was concerned about collisions. What
> rate of collisions could I expect?

Seriously? It takes about 1-5msec to sha1() one MB of data on a modern
CPU, 1.5 on my box. The openssl variants of Python's hash code release
the GIL so you use the power of all cores.
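
The speed claims in this sub-thread are easy to check locally. The sketch below is one way to do it, assuming Python 3; the absolute numbers depend on the machine, so none are shown here.

```python
import base64
import hashlib
import timeit

url = b"docs.python.org/library/uuid.html"

candidates = {
    # CPython caches the hash of a bytes object, so repeated calls on the
    # same object measure a cached lookup and flatter the builtin.
    "hash": lambda: hash(url),
    "md5": lambda: hashlib.md5(url).digest(),
    "sha1": lambda: hashlib.sha1(url).digest(),
    "base64": lambda: base64.urlsafe_b64encode(url),
}

# Total seconds for 100,000 calls of each candidate.
timings = {name: timeit.timeit(fn, number=100000) for name, fn in candidates.items()}

for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print("%-7s %.3f s per 100,000 calls" % (name, seconds))
```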



#33284

From: Richard <richardbp@gmail.com>
Date: 2012-11-13 17:18 -0800
Message-ID: <90679521-52f9-409c-b6ad-5970863c0cff@googlegroups.com>
In reply to: #33281

I found md5 / sha 4-5 times slower than hash. And base64 a lot slower.

There is no database, or else I would just use its IDs.




#33274

From: Miki Tebeka <miki.tebeka@gmail.com>
Date: 2012-11-13 16:13 -0800
Message-ID: <3ee369e5-aeea-426a-bf83-da3daeac6c4b@googlegroups.com>
In reply to: #33271

> I want to create a URL-safe unique ID for URL's.
> What do you recommend? - Compression?
You can use base62 with a running counter, but then you'll need a (semi) centralized entity to come up with the next id.

You can see one implementation at http://bit.ly/PSJkHS (AppEngine environment).
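
A minimal sketch of base62 with a running counter, assuming Python 3 and a single process that owns the counter. This is not the implementation behind the link above; the names `base62` and `id_for` and the digit order of the alphabet are hypothetical.

```python
import itertools
import string

# 62 URL-safe symbols: digits, then lowercase, then uppercase.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def base62(n: int) -> str:
    """Encode a non-negative integer in base 62, most significant digit first."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, remainder = divmod(n, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


counter = itertools.count(1)  # the single, centralized source of new IDs
ids = {}                      # url -> ID already handed out


def id_for(url: str) -> str:
    """Return the ID for a URL, assigning the next counter value on first sight."""
    if url not in ids:
        ids[url] = base62(next(counter))
    return ids[url]


print(id_for("docs.python.org/library/uuid.html"), base62(10 ** 9))
```

A billion IDs fit in six base62 characters, since 62**6 is about 5.7e10.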



#33287

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2012-11-14 02:04 +0000
Message-ID: <50a2fc2a$0$21742$c3e8da3$76491128@news.astraweb.com>
In reply to: #33274

On Tue, 13 Nov 2012 16:13:58 -0800, Miki Tebeka wrote:

>> I want to create a URL-safe unique ID for URL's. What do you recommend?
>> - Compression?
> You can use base62 with a running counter, but then you'll need a (semi)
> centralized entity to come up with the next id.
> 
> You can see one implementation at http://bit.ly/PSJkHS (AppEngine
> environment).

Perhaps this is a silly question, but if you're using a running counter, 
why bother with base64? Decimal or hex digits are URL safe. If there are 
no concerns about predictability, why not just use the counter directly?

You can encode a billion IDs in 8 hex digits compared to 16 base64 
characters:


py> base64.urlsafe_b64encode('1000000000')
'MTAwMDAwMDAwMA=='
py> "%x" % 1000000000
'3b9aca00'


Short and sweet and easy: no base64 calculation, no hash function, no 
database lookup, just a trivial int to string conversion.



-- 
Steven



#33288

From: Steve Howell <showell30@yahoo.com>
Date: 2012-11-13 18:32 -0800
Message-ID: <a564508f-34a6-413a-86b8-ed7142c2899c@jj5g2000pbc.googlegroups.com>
In reply to: #33287

On Nov 13, 6:04 pm, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
> On Tue, 13 Nov 2012 16:13:58 -0800, Miki Tebeka wrote:
> >> I want to create a URL-safe unique ID for URL's. What do you recommend?
> >> - Compression?
> > You can use base62 with a running counter, but then you'll need a (semi)
> > centralized entity to come up with the next id.
>
> > You can see one implementation at http://bit.ly/PSJkHS (AppEngine
> > environment).
>
> Perhaps this is a silly question, but if you're using a running counter,
> why bother with base64? Decimal or hex digits are URL safe. If there are
> no concerns about predictability, why not just use the counter directly?
>
> You can encode a billion IDs in 8 hex digits compared to 16 base64
> characters:
>
> py> base64.urlsafe_b64encode('1000000000')
> 'MTAwMDAwMDAwMA=='
> py> "%x" % 1000000000
> '3b9aca00'
>
> Short and sweet and easy: no base64 calculation, no hash function, no
> database lookup, just a trivial int to string conversion.
>
> --
> Steven

If you're dealing entirely with integers, then this works too:

    import base64

    def encode(n):
        s = ''
        while n > 0:
            s += chr(n % 256)
            n //= 256
        return base64.urlsafe_b64encode(s)

    def test():
        seen = set()
        for i in range(999900000, 1000000000):
            s = encode(i)
            if s in seen:
                raise Exception('non-unique encoding')
            seen.add(s)
        print encode(1000000000)

    test()

It prints this for 1000000000:

    AMqaOw==
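
The code above is Python 2 (`print` statement, `str` passed to base64). A rough Python 3 equivalent, assuming the same little-endian byte order, could use `int.to_bytes`:

```python
import base64


def encode(n: int) -> str:
    """Base64-encode the minimal little-endian byte representation of n."""
    raw = n.to_bytes((n.bit_length() + 7) // 8, "little")
    return base64.urlsafe_b64encode(raw).decode("ascii")


print(encode(1000000000))  # AMqaOw==
```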



#33289

From: Richard <richardbp@gmail.com>
Date: 2012-11-13 19:12 -0800
Message-ID: <016c6b8a-6da4-4439-92af-8e223867ec52@googlegroups.com>
In reply to: #33288

I am dealing with URLs rather than integers.



#33286

From: Roy Smith <roy@panix.com>
Date: 2012-11-13 20:39 -0500
Message-ID: <roy-862116.20390413112012@news.panix.com>
In reply to: #33286

In article <0692e6a2-343c-4eb0-be57-fe5c815efb99@googlegroups.com>,
 Richard <richardbp@gmail.com> wrote:

> Hello,
> 
> I want to create a URL-safe unique ID for URL's.
> Currently I use:
> url_id = base64.urlsafe_b64encode(url)
> 
> >>> base64.urlsafe_b64encode('docs.python.org/library/uuid.html')
> 'ZG9jcy5weXRob24ub3JnL2xpYnJhcnkvdXVpZC5odG1s'
> 
> I would prefer more concise ID's. 
> What do you recommend? - Compression?

If you're generating random id strings, there are only two ways to make 
them shorter.  Either encode fewer bits of information, or encode them 
more compactly.

Let's start with the second one.  You're already using base64, so you're 
getting 6 bits per character.  You can do a little better than that, but 
not much.  The set of URL-safe characters is the 96-ish printable ascii 
set, minus a few pieces of punctuation.  Maybe you could get it up to 
6.3 or 6.4 bits per character, but that's about it.  For the complexity 
this would add it's probably not worth it.

The next step is to reduce the number of bits you are encoding.  You 
said in another post that "1 collision in 10 million hashes would be 
tolerable".  So you need:

>>> math.log(10*1000*1000, 2)
23.25349666421154

24 bits worth of key.  Base64 encoded, that's only 4 characters.  
Actually, I probably just proved that I don't really understand how 
probabilities work, so maybe what you really need is 32 or 48 or 64 
bits.  Certainly not the 264 bits you're encoding with your example 
above.
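
The doubt about 24 bits can be checked with the birthday approximation. The sketch below assumes Python 3 and uniformly distributed keys, and reads "10 million hashes" as 10 million distinct URLs stored at once; it estimates the chance that at least two of them share a key.

```python
import math


def collision_probability(n: int, bits: int) -> float:
    """Approximate chance that n random keys of the given width contain a collision."""
    # Birthday approximation: 1 - exp(-n(n-1) / (2 * 2**bits)).
    return -math.expm1(-n * (n - 1) / 2.0 / 2.0 ** bits)


n_urls = 10 * 1000 * 1000
probabilities = {bits: collision_probability(n_urls, bits) for bits in (24, 32, 48, 64)}

for bits, p in probabilities.items():
    print("%2d bits: %.3g" % (bits, p))
```

Under these assumptions 24 and 32 bits make a collision practically certain, 48 bits gives roughly a one-in-six chance, and 64 bits brings it down to a few in a million.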

So, something like:

import base64
import hashlib

digest = hashlib.md5(b'docs.python.org/library/uuid.html').digest()
digest64 = base64.urlsafe_b64encode(digest)
url_id = digest64[:8]  # or 12, or whatever

But, I still don't really understand your use case.  You've already 
mentioned the following requirements:

"just be used internally for quick lookups, not exposed publicly"
"URL-safe"
"unique"
"1 collision in 10 million hashes would be tolerable"
"one way encoding would be fine"
"performed millions of times so ideally efficient"

but haven't really explained what it is that you're trying to do.

If they're not going to be exposed publicly, why do you care if they're 
URL-safe?

What's wrong with just using the URLs directly as dictionary keys and 
not worrying about it until you've got some hard data showing that this 
is not sufficient?



#33290

From: Richard <richardbp@gmail.com>
Date: 2012-11-13 19:25 -0800
Message-ID: <1ce88f36-bfc7-4a55-89f8-70d1645d27ad@googlegroups.com>
In reply to: #33286

So the use case - I'm storing webpages on disk and want a quick retrieval system based on URL. 
I can't store the files in a single directory because of OS limitations so have been using a sub folder structure.
For example to store data at URL "abc": a/b/c/index.html
This data is also viewed locally through a web app.

If you can suggest a better approach I would welcome it. 
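
One common variation on the sub-folder scheme described above is to derive the directories from a hash of the URL, which keeps each directory small and the path length fixed. The sketch below assumes Python 3; the function name `cache_path`, the `cache` root and the two-level, two-character split are hypothetical choices.

```python
import hashlib
import os


def cache_path(url: str, root: str = "cache") -> str:
    """Map a URL to root/ab/cd/<rest of hex digest>.html."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()  # 32 hex characters
    # Two levels of 256 directories each spread the files evenly.
    return os.path.join(root, digest[:2], digest[2:4], digest[4:] + ".html")


print(cache_path("docs.python.org/library/uuid.html"))
```

Hex digits are safe both in file names and in URLs, so the same string can serve the local web app.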



#33293

From: Roy Smith <roy@panix.com>
Date: 2012-11-13 22:38 -0500
Message-ID: <roy-DE96BF.22385013112012@news.panix.com>
In reply to: #33290

In article <1ce88f36-bfc7-4a55-89f8-70d1645d27ad@googlegroups.com>,
 Richard <richardbp@gmail.com> wrote:

> So the use case - I'm storing webpages on disk and want a quick retrieval 
> system based on URL. 
> I can't store the files in a single directory because of OS limitations so 
> have been using a sub folder structure.
> For example to store data at URL "abc": a/b/c/index.html
> This data is also viewed locally through a web app.
> 
> If you can suggest a better approach I would welcome it. 

Ah, so basically, you're reinventing Varnish?

Maybe do what Varnish (and MongoDB, and a few other things) do?  Bypass 
the file system entirely.  Just mmap() a chunk of memory large enough to 
hold everything and let the OS figure out how to page things to disk.
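
A toy sketch of the mmap() idea, assuming Python 3 and the standard `mmap` module. It uses an anonymous mapping and an in-memory index, so nothing survives a restart; a persistent version would map a real file instead. It is not how Varnish or MongoDB are actually implemented, and the names `put` and `get` and the 1 MiB size are hypothetical.

```python
import mmap

STORE_SIZE = 1024 * 1024           # 1 MiB, an arbitrary size for the sketch
store = mmap.mmap(-1, STORE_SIZE)  # anonymous mapping, not backed by a named file
index = {}                         # url -> (offset, length) within the mapping
next_free = 0                      # offset of the first unused byte


def put(url: str, body: bytes) -> None:
    """Append a page body to the mapping and record where it lives."""
    global next_free
    if next_free + len(body) > STORE_SIZE:
        raise MemoryError("store is full")
    store[next_free:next_free + len(body)] = body
    index[url] = (next_free, len(body))
    next_free += len(body)


def get(url: str) -> bytes:
    """Return the stored body for a URL."""
    offset, length = index[url]
    return store[offset:offset + length]


put("example.org/a", b"<html>a</html>")
print(get("example.org/a"))
```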
