Date: Thu, 26 Jul 2012 14:26:16 +0200
From: Laszlo Nagy <gandalf@shopzeus.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:14.0) Gecko/20120714 Thunderbird/14.0
MIME-Version: 1.0
To: python-list@python.org
Subject: Generating valid identifiers
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.2604.1343305588.4697.python-list@python.org>
Lines: 57

I have a program that creates various database objects in PostgreSQL. 
There is a DOM, and for each element in the DOM, a database object is 
created (schema, table, field, index and tablespace).

I do not want this program to generate very long identifiers. Long names 
would increase SQL parsing time, and they don't look good. Let's just say 
that the limit should be 32 characters. But I also want to be able to 
recognize the identifiers when I look at their modified/truncated names.

So I have come up with this solution:

- I have restricted original identifiers not to contain the dollar sign. 
They can only contain [A-Z] or [a-z] or [0-9] and the underscore. Here 
is a valid example:

"group1_group2_group3_some_field_name"

- I'm trying to use a hash function to reduce the length of the 
identifier when it is too long:

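import base64
import hashlib

class Connection(object):
    # ... more code here
    @classmethod
    def makename(cls, basename):
        # Names within the 32-character limit are returned unchanged.
        if len(basename) > 32:
            h = hashlib.sha256()
            h.update(basename.encode("ascii"))
            # "_" and "$" replace "+" and "/" in the base64 alphabet.
            tail = base64.b64encode(h.digest(), b"_$")[:10].decode("ascii")
            # 21 + 1 + 10 = 32, so the result stays within the limit.
            return basename[:21] + "$" + tail
        else:
            return basename

Here is the result:

print(repr(Connection.makename("some_field_name")))
'some_field_name'
print(repr(Connection.makename("group1_group2_group3_some_field_name")))
'group1_group2_group3_$AyQVQUXoyf'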
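
The scheme relies on original identifiers never containing "$", so that a 
shortened name cannot be mistaken for an original one. A minimal sketch of 
that check follows; is_valid_basename is a hypothetical helper name, and 
the pattern only mirrors the character rule stated above (letters, digits 
and underscore).

```python
import re

# Assumption: a valid original identifier is a non-empty run of
# [A-Za-z0-9_]; "$" is reserved as the separator before the hash tail.
_VALID_BASENAME = re.compile(r"[A-Za-z0-9_]+\Z")

def is_valid_basename(name):
    """Return True if name uses only the allowed characters."""
    return _VALID_BASENAME.match(name) is not None
```

Such a check would presumably run before makename is called, so that an 
invalid name is rejected rather than silently hashed.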

So, if the identifier is too long, I use a modified version that should 
be unique and still similar to the original name. Let's suppose that 
nobody wants to crack this modified hash on purpose.
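
How likely "should be unique" is can be estimated with the birthday 
approximation. The sketch below assumes the truncated digest behaves like 
a uniformly random value; collision_probability and TAIL_BITS are 
hypothetical names introduced only for this illustration. Ten base64 
characters carry 6 bits each, so the tail holds 60 bits.

```python
import math

TAIL_BITS = 10 * 6  # 10 base64 characters, 6 bits per character

def collision_probability(n_names, bits):
    """Approximate chance that at least two of n_names share a tail.

    Birthday approximation: 1 - exp(-n * (n - 1) / 2**(bits + 1)),
    assuming tails are uniformly distributed over 2**bits values.
    """
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(-n_names * (n_names - 1) / 2.0 ** (bits + 1))

# Example: one million distinct long identifiers with a 60-bit tail.
p = collision_probability(10 ** 6, TAIL_BITS)
```

Under that assumption, even a million long identifiers give a collision 
probability below one in a million.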

And now, the questions:

* Would it be a problem to use CRC32 instead of SHA? (Security is not a 
concern here, and CRC32 is faster.)
* I'm truncating the digest value to 10 characters. Is that safe enough 
against accidental collisions? I don't want to use more than 10 
characters, because then it wouldn't be possible to recognize the 
original name.
* Can somebody think of a better algorithm that would give a bigger 
chance of recognizing the original identifier from the modified one?
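
For comparison, here is a minimal sketch of what a CRC32-based variant 
might look like. makename_crc32 and its limit parameter are hypothetical 
and not part of the program above; identifiers are assumed to be ASCII. 
zlib.crc32 yields a 32-bit value, written here as 8 hexadecimal digits, 
which leaves 23 characters for the readable prefix.

```python
import zlib

def makename_crc32(basename, limit=32):
    """Shorten basename to at most `limit` characters using a CRC32 tail."""
    if len(basename) <= limit:
        return basename
    # Mask to an unsigned 32-bit value so the result does not depend on
    # the Python version.
    crc = zlib.crc32(basename.encode("ascii")) & 0xFFFFFFFF
    tail = "%08x" % crc              # always exactly 8 hex digits
    keep = limit - 1 - len(tail)     # characters left for the prefix
    return basename[:keep] + "$" + tail
```

The trade-off is that 32 bits give far fewer distinct tails than the 60 
bits of a 10-character base64 tail, so accidental collisions become more 
likely as the number of long identifiers grows. On the other hand, a 
lower-case hex tail is not affected by PostgreSQL folding unquoted 
identifiers to lower case.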

Thanks,

    Laszlo