Date: Thu, 26 Jul 2012 14:26:16 +0200
From: Laszlo Nagy
To: python-list@python.org
Subject: Generating valid identifiers
Newsgroups: comp.lang.python

I have a program that creates various database objects in PostgreSQL. There is a DOM, and for each element in the DOM a database object is created (schema, table, field, index and tablespace).

I do not want this program to generate very long identifiers. They would increase SQL parsing time, and they don't look good. Let's just say that the limit should be 32 characters. But I also want to be able to recognize the identifiers when I look at their modified/truncated names.

So I have come up with this solution:

- I have restricted the original identifiers so that they cannot contain the dollar sign. They can only contain [A-Z], [a-z], [0-9] and the underscore. Here is a valid example:

    "group1_group2_group3_some_field_name"

- I'm trying to use a hash function to reduce the length of the identifier when it is too long:

    import base64
    import hashlib

    class Connection(object):
        # ... more code here

        @classmethod
        def makename(cls, basename):
            if len(basename) > 32:
                # Too long: keep the first 30 characters, then append "$"
                # and the first 10 characters of the base64-encoded
                # SHA-256 digest ("+" and "/" are replaced by "_" and "$").
                h = hashlib.sha256()
                h.update(basename)
                tail = base64.b64encode(h.digest(), "_$")[:10]
                return basename[:30] + "$" + tail
            else:
                return basename

Here is the result:

    print repr(Connection.makename("some_field_name"))
    'some_field_name'
    print repr(Connection.makename("group1_group2_group3_some_field_name"))
    'group1_group2_group3_some_fiel$AyQVQUXoyf'

So, if the identifier is too long, I use a modified version that should be unique and similar to the original name. Let's suppose that nobody wants to crack this modified hash on purpose.

And now, the questions:

* Would it be a problem to use CRC32 instead of SHA? (Security is not a concern here, and CRC32 is faster.)
* I'm truncating the digest value to 10 characters. Is that safe enough? I don't want to use more than 10 characters, because then it wouldn't be possible to recognize the original name.
* Can somebody think of a better algorithm, one that would give a better chance of recognizing the original identifier from the modified one?

Thanks,

   Laszlo
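
The snippet in the post keeps 30 characters of the name, then adds "$" and 10 hash characters, so the shortened name comes out at 41 characters, which is longer than the 32-character limit it is meant to enforce. Below is a minimal Python 3 sketch of the same scheme, assuming the whole result (prefix, "$" and tail) has to fit within the limit. The names `make_name`, `LIMIT` and `TAIL_LEN` are hypothetical and do not come from the original program.

```python
import base64
import hashlib

LIMIT = 32      # maximum identifier length (the limit chosen in the post)
TAIL_LEN = 10   # number of hash characters kept (as in the post)


def make_name(basename, limit=LIMIT, tail_len=TAIL_LEN):
    """Return basename unchanged if it fits, else prefix + "$" + hash tail."""
    if len(basename) <= limit:
        return basename
    # Hash the full original name, so two long names that share a
    # prefix still get different tails (barring a hash collision).
    digest = hashlib.sha256(basename.encode("ascii")).digest()
    # altchars=b"_$" replaces "+" and "/", so the tail contains only
    # letters, digits, "_" and "$".
    tail = base64.b64encode(digest, altchars=b"_$")[:tail_len].decode("ascii")
    prefix_len = limit - 1 - tail_len   # room left for the readable prefix
    return basename[:prefix_len] + "$" + tail


print(make_name("some_field_name"))
print(make_name("group1_group2_group3_some_field_name"))
```

With these assumed values the readable prefix shrinks to 21 characters, which is the price of staying inside the limit. Raising `limit` or lowering `tail_len` trades uniqueness against readability.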
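
For the question of whether a 10-character tail (or a CRC32) is safe enough, a rough estimate can be made with the birthday bound. This is only a sketch: it assumes hash values are uniformly distributed, which holds well for a truncated SHA-256 and only approximately for CRC32. Ten base64 characters carry 60 bits, and a CRC32 carries 32 bits. The function name `collision_probability` is hypothetical.

```python
import math


def collision_probability(n_names, bits):
    """Birthday-bound estimate of at least one collision among n_names
    hash values, each drawn uniformly from 2**bits possibilities."""
    exponent = n_names * (n_names - 1) / (2.0 * 2 ** bits)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-exponent)


for n in (1000, 100000):
    print(n, collision_probability(n, 32), collision_probability(n, 60))
```

Under these assumptions, 100,000 over-long names give a collision probability of roughly 0.69 with a 32-bit value but only about 4e-9 with a 60-bit tail. Only names that exceed the limit are hashed, so `n_names` counts those, not every identifier in the database.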