Groups > comp.lang.python > #26087 > unrolled thread

Generating valid identifiers

Started by	Laszlo Nagy <gandalf@shopzeus.com>
First post	2012-07-26 14:26 +0200
Last post	2012-07-27 11:59 +0200
Articles	9 — 3 participants

Back to article view | Back to comp.lang.python

  Generating valid identifiers Laszlo Nagy <gandalf@shopzeus.com> - 2012-07-26 14:26 +0200
    Re: Generating valid identifiers Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-07-26 15:30 +0000
      Re: Generating valid identifiers Laszlo Nagy <gandalf@shopzeus.com> - 2012-07-26 20:08 +0200
      Re: Generating valid identifiers Ian Kelly <ian.g.kelly@gmail.com> - 2012-07-26 13:28 -0600
        Re: Generating valid identifiers Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-07-27 01:54 +0000
          Re: Generating valid identifiers Laszlo Nagy <gandalf@shopzeus.com> - 2012-07-27 09:34 +0200
      Re: Generating valid identifiers Ian Kelly <ian.g.kelly@gmail.com> - 2012-07-26 14:00 -0600
      Re: Generating valid identifiers Laszlo Nagy <gandalf@shopzeus.com> - 2012-07-27 09:28 +0200
      Re: Generating valid identifiers Laszlo Nagy <gandalf@shopzeus.com> - 2012-07-27 11:59 +0200

#26087 — Generating valid identifiers

From	Laszlo Nagy <gandalf@shopzeus.com>
Date	2012-07-26 14:26 +0200
Subject	Generating valid identifiers
Message-ID	<mailman.2604.1343305588.4697.python-list@python.org>

I have a program that creates various database objects in PostgreSQL. 
There is a DOM, and for each element in the DOM, a database object is 
created (schema, table, field, index and tablespace).

I do not want this program to generate very long identifiers. It would 
increase SQL parsing time, and don't look good. Let's just say that the 
limit should be 32 characters. But I also want to recognize the 
identifiers when I look at their modified/truncated names.

So I have come up with this solution:

- I have restricted original identifiers not to contain the dollar sign. 
They can only contain [A-Z] or [a-z] or [0-9] and the underscore. Here 
is a valid example:

"group1_group2_group3_some_field_name"

- I'm trying to use a hash function to reduce the length of the 
identifier when it is too long:

class Connection(object):
     # ... more code here
     @classmethod
     def makename(cls, basename):
         if len(basename)>32:
             h = hashlib.sha256()
             h.update(basename)
             tail = base64.b64encode(h.digest(),"_$")[:10]
             return basename[:30]+"$"+tail
         else:
             return basename

Here is the result:

print repr(Connection.makename("some_field_name"))
'some_field_name'
print repr(Connection.makename("group1_group2_group3_some_field_name"))
'group1_group2_group3_some_fiel$AyQVQUXoyf'

So, if the identifier is too long, then I use a modified version, that 
should be unique, and similar to the original name. Let's suppose that 
nobody wants to crack this modified hash on purpose.

And now, the questions:

* Would it be a problem to use CRC32 instead of SHA? (Since security is 
not a problem, and CRC32 is faster.)
* I'm truncating the digest value to 10 characters.  Is it safe enough? 
I don't want to use more than 10 characters, because then it wouldn't be 
possible to recognize the original name.
* Can somebody think of a better algorithm, that would give a bigger 
chance of recognizing the original identifier from the modified one?

Thanks,

    Laszlo

[toc] | [next] | [standalone]

#26099

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-07-26 15:30 +0000
Message-ID	<50116281$0$29978$c3e8da3$5496439d@news.astraweb.com>
In reply to	#26087

On Thu, 26 Jul 2012 14:26:16 +0200, Laszlo Nagy wrote:

> I do not want this program to generate very long identifiers. It would
> increase SQL parsing time,

Will that increase in SQL parsing time be more, or less, than the time it 
takes to generate CRC32 or SHA hashsums and append them to a truncated 
identifier?

> * Would it be a problem to use CRC32 instead of SHA? (Since security is
> not a problem, and CRC32 is faster.)

What happens if you get a collision?

That is, you have two different long identifiers:

a.b.c.d...something
a.b.c.d...anotherthing

which by bad luck both hash to the same value:

a.b.c.d.$AABB99
a.b.c.d.$AABB99

(or whatever).

> * I'm truncating the digest value to 10 characters.  Is it safe enough?
> I don't want to use more than 10 characters, because then it wouldn't be
> possible to recognize the original name. 

> * Can somebody think of a
> better algorithm, that would give a bigger chance of recognizing the
> original identifier from the modified one?

Rather than truncating the most significant part of the identifier, the 
field name, you should truncate the least important part, the middle.

a.b.c.d.e.f.g.something

goes to:

a.b...g.something

or similar.

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#26106

From	Laszlo Nagy <gandalf@shopzeus.com>
Date	2012-07-26 20:08 +0200
Message-ID	<mailman.2621.1343326154.4697.python-list@python.org>
In reply to	#26099

>> * Would it be a problem to use CRC32 instead of SHA? (Since security is
>> not a problem, and CRC32 is faster.)
> What happens if you get a collision?
>
> That is, you have two different long identifiers:
>
> a.b.c.d...something
> a.b.c.d...anotherthing
>
> which by bad luck both hash to the same value:
>
> a.b.c.d.$AABB99
> a.b.c.d.$AABB99
>
> (or whatever).
Yes, that was the question. How do I avoid that? (Of course I can avoid 
that by using a full sha256 hash value.)
>> * Can somebody think of a
>> better algorithm, that would give a bigger chance of recognizing the
>> original identifier from the modified one?
> Rather than truncating the most significant part of the identifier, the
> field name, you should truncate the least important part, the middle.
>
> a.b.c.d.e.f.g.something
>
> goes to:
>
> a.b...g.something
>
> or similar.
Yes, this is a good idea. Thank you.

[toc] | [prev] | [next] | [standalone]

#26110

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-07-26 13:28 -0600
Message-ID	<mailman.2628.1343330939.4697.python-list@python.org>
In reply to	#26099

On Thu, Jul 26, 2012 at 9:30 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> What happens if you get a collision?
>
> That is, you have two different long identifiers:
>
> a.b.c.d...something
> a.b.c.d...anotherthing
>
> which by bad luck both hash to the same value:
>
> a.b.c.d.$AABB99
> a.b.c.d.$AABB99
>
> (or whatever).

The odds of a given pair of identifiers having the same digest to 10
hex digits are 1 in 16^10, or approximately 1 in a trillion.  If you
bought one lottery ticket a day at those odds, you would win
approximately once every 3 billion years.  But it's not enough just to
have a hash collision, they also have to match exactly in the first 21
(or 30, or whatever) characters of their actual names, and they have
to both be long enough to invoke the truncating scheme in the first
place.

The Oracle backend for Django uses this same approach with an MD5 sum
to ensure that identifiers will be no more than 30 characters long (a
hard limit imposed by Oracle).  It actually truncates the hash to 4
digits, though, not 10.  This hasn't caused any problems that I'm
aware of.

[toc] | [prev] | [next] | [standalone]

#26131

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-07-27 01:54 +0000
Message-ID	<5011f4be$0$29978$c3e8da3$5496439d@news.astraweb.com>
In reply to	#26110

On Thu, 26 Jul 2012 13:28:26 -0600, Ian Kelly wrote:

> The odds of a given pair of identifiers having the same digest to 10 hex
> digits are 1 in 16^10, or approximately 1 in a trillion.

Unless an attacker can select the field names, in which case they may be 
able to improve those odds significantly. In the case of MD5, they can 
possibly improve those odds to 1 in 1, since MD5 is vulnerable to 
collision attacks. Not so for some (all?) of the SHA hashes, at least not 
yet, but they're much more expensive to calculate.

If the OP sticks with his intention to use CRC32, the odds won't be 
anywhere near that low. CRC32 is neither collision-resistant nor 
cryptographically random, and only generates eight hex digits, not ten.

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#26136

From	Laszlo Nagy <gandalf@shopzeus.com>
Date	2012-07-27 09:34 +0200
Message-ID	<mailman.2650.1343374492.4697.python-list@python.org>
In reply to	#26131

> Unless an attacker can select the field names, in which case they may be
> able to improve those odds significantly. In the case of MD5, they can
> possibly improve those odds to 1 in 1, since MD5 is vulnerable to
> collision attacks. Not so for some (all?) of the SHA hashes, at least not
> yet, but they're much more expensive to calculate.
>
> If the OP sticks with his intention to use CRC32, the odds won't be
> anywhere near that low. CRC32 is neither collision-resistant nor
> cryptographically random, and only generates eight hex digits, not ten.
>
>
I'm not affraid of attackers. As I said, nobody will want to "crack" the 
hash. This is for an in-house project. Only the dba can create new 
database objects. If the dba wants to do something wrong, he can simply 
delete whole database. He doesn't need to crack any hash value. :-)

So yes, CRC32 is not collision-resistant, and not cryptographically 
random. But in my case, sha256 is not collision resistant either 
(because I'm using the first few chars of the digest value only). And 
cryptographic randomness is not a requirement.

Given these circumstances, maybe using CRC32 would be fine too.

I wonder what kind of hash Django uses for Oracle.

[toc] | [prev] | [next] | [standalone]

#26112

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-07-26 14:00 -0600
Message-ID	<mailman.2630.1343332892.4697.python-list@python.org>
In reply to	#26099

On Thu, Jul 26, 2012 at 1:28 PM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> The odds of a given pair of identifiers having the same digest to 10
> hex digits are 1 in 16^10, or approximately 1 in a trillion.  If you
> bought one lottery ticket a day at those odds, you would win
> approximately once every 3 billion years.  But it's not enough just to
> have a hash collision, they also have to match exactly in the first 21
> (or 30, or whatever) characters of their actual names, and they have
> to both be long enough to invoke the truncating scheme in the first
> place.
>
> The Oracle backend for Django uses this same approach with an MD5 sum
> to ensure that identifiers will be no more than 30 characters long (a
> hard limit imposed by Oracle).  It actually truncates the hash to 4
> digits, though, not 10.  This hasn't caused any problems that I'm
> aware of.

As a side note, the odds of having at least one hash collision among
multiple tables are known as the birthday problem.  At 4 hex digits
there are 65536 possible digests, and it turns out that at 302 tables
there is a >50% chance that at least one pair of those names have the
same 4-digit digest.  That doesn't mean you should be concerned if you
have 302 tables in your Django Oracle database, though, because those
colliding tables also have to match completely in the first 26
characters of their generated names, which is not that common.  If a
collision ever did occur, the resolution would be simple: manually set
the name of one of the offending tables in the model definition.

With 16 ** 10 possible digests, the probability of collision hits 50%
at 1234605 tables.

[toc] | [prev] | [next] | [standalone]

#26135

From	Laszlo Nagy <gandalf@shopzeus.com>
Date	2012-07-27 09:28 +0200
Message-ID	<mailman.2649.1343374141.4697.python-list@python.org>
In reply to	#26099

> As a side note, the odds of having at least one hash collision among
> multiple tables are known as the birthday problem.  At 4 hex digits
> there are 65536 possible digests, and it turns out that at 302 tables
> there is a >50% chance that at least one pair of those names have the
> same 4-digit digest.  That doesn't mean you should be concerned if you
> have 302 tables in your Django Oracle database, though, because those
> colliding tables also have to match completely in the first 26
> characters of their generated names, which is not that common.  If a
> collision ever did occur, the resolution would be simple: manually set
> the name of one of the offending tables in the model definition.
>
> With 16 ** 10 possible digests, the probability of collision hits 50%
> at 1234605 tables.
Thank you for the precise explanation. :-) Well, if Django and Oracle 
uses this, then it can't be a very bad idea. :-)

[toc] | [prev] | [next] | [standalone]

#26138

From	Laszlo Nagy <gandalf@shopzeus.com>
Date	2012-07-27 11:59 +0200
Message-ID	<mailman.2652.1343383173.4697.python-list@python.org>
In reply to	#26099

 > With 16 ** 10 possible digests, the probability of collision hits 50% 
at 1234605 tables


Actually, I'm using base64 encoding. So it is 64**10. I guess using 6 
characters will enough.

[toc] | [prev] | [standalone]

csiph-web

Generating valid identifiers

Contents

#26087 — Generating valid identifiers

#26099

#26106

#26110

#26131

#26136

#26112

#26135

#26138