Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #26087 > unrolled thread
| Started by | Laszlo Nagy <gandalf@shopzeus.com> |
|---|---|
| First post | 2012-07-26 14:26 +0200 |
| Last post | 2012-07-27 11:59 +0200 |
| Articles | 9 — 3 participants |
Back to article view | Back to comp.lang.python
Generating valid identifiers Laszlo Nagy <gandalf@shopzeus.com> - 2012-07-26 14:26 +0200
Re: Generating valid identifiers Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-07-26 15:30 +0000
Re: Generating valid identifiers Laszlo Nagy <gandalf@shopzeus.com> - 2012-07-26 20:08 +0200
Re: Generating valid identifiers Ian Kelly <ian.g.kelly@gmail.com> - 2012-07-26 13:28 -0600
Re: Generating valid identifiers Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-07-27 01:54 +0000
Re: Generating valid identifiers Laszlo Nagy <gandalf@shopzeus.com> - 2012-07-27 09:34 +0200
Re: Generating valid identifiers Ian Kelly <ian.g.kelly@gmail.com> - 2012-07-26 14:00 -0600
Re: Generating valid identifiers Laszlo Nagy <gandalf@shopzeus.com> - 2012-07-27 09:28 +0200
Re: Generating valid identifiers Laszlo Nagy <gandalf@shopzeus.com> - 2012-07-27 11:59 +0200
| From | Laszlo Nagy <gandalf@shopzeus.com> |
|---|---|
| Date | 2012-07-26 14:26 +0200 |
| Subject | Generating valid identifiers |
| Message-ID | <mailman.2604.1343305588.4697.python-list@python.org> |
I have a program that creates various database objects in PostgreSQL.
There is a DOM, and for each element in the DOM, a database object is
created (schema, table, field, index and tablespace).
I do not want this program to generate very long identifiers. It would
increase SQL parsing time, and don't look good. Let's just say that the
limit should be 32 characters. But I also want to recognize the
identifiers when I look at their modified/truncated names.
So I have come up with this solution:
- I have restricted original identifiers not to contain the dollar sign.
They can only contain [A-Z] or [a-z] or [0-9] and the underscore. Here
is a valid example:
"group1_group2_group3_some_field_name"
- I'm trying to use a hash function to reduce the length of the
identifier when it is too long:
class Connection(object):
# ... more code here
@classmethod
def makename(cls, basename):
if len(basename)>32:
h = hashlib.sha256()
h.update(basename)
tail = base64.b64encode(h.digest(),"_$")[:10]
return basename[:30]+"$"+tail
else:
return basename
Here is the result:
print repr(Connection.makename("some_field_name"))
'some_field_name'
print repr(Connection.makename("group1_group2_group3_some_field_name"))
'group1_group2_group3_some_fiel$AyQVQUXoyf'
So, if the identifier is too long, then I use a modified version, that
should be unique, and similar to the original name. Let's suppose that
nobody wants to crack this modified hash on purpose.
And now, the questions:
* Would it be a problem to use CRC32 instead of SHA? (Since security is
not a problem, and CRC32 is faster.)
* I'm truncating the digest value to 10 characters. Is it safe enough?
I don't want to use more than 10 characters, because then it wouldn't be
possible to recognize the original name.
* Can somebody think of a better algorithm, that would give a bigger
chance of recognizing the original identifier from the modified one?
Thanks,
Laszlo
[toc] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-07-26 15:30 +0000 |
| Message-ID | <50116281$0$29978$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #26087 |
On Thu, 26 Jul 2012 14:26:16 +0200, Laszlo Nagy wrote: > I do not want this program to generate very long identifiers. It would > increase SQL parsing time, Will that increase in SQL parsing time be more, or less, than the time it takes to generate CRC32 or SHA hashsums and append them to a truncated identifier? > * Would it be a problem to use CRC32 instead of SHA? (Since security is > not a problem, and CRC32 is faster.) What happens if you get a collision? That is, you have two different long identifiers: a.b.c.d...something a.b.c.d...anotherthing which by bad luck both hash to the same value: a.b.c.d.$AABB99 a.b.c.d.$AABB99 (or whatever). > * I'm truncating the digest value to 10 characters. Is it safe enough? > I don't want to use more than 10 characters, because then it wouldn't be > possible to recognize the original name. > * Can somebody think of a > better algorithm, that would give a bigger chance of recognizing the > original identifier from the modified one? Rather than truncating the most significant part of the identifier, the field name, you should truncate the least important part, the middle. a.b.c.d.e.f.g.something goes to: a.b...g.something or similar. -- Steven
[toc] | [prev] | [next] | [standalone]
| From | Laszlo Nagy <gandalf@shopzeus.com> |
|---|---|
| Date | 2012-07-26 20:08 +0200 |
| Message-ID | <mailman.2621.1343326154.4697.python-list@python.org> |
| In reply to | #26099 |
>> * Would it be a problem to use CRC32 instead of SHA? (Since security is >> not a problem, and CRC32 is faster.) > What happens if you get a collision? > > That is, you have two different long identifiers: > > a.b.c.d...something > a.b.c.d...anotherthing > > which by bad luck both hash to the same value: > > a.b.c.d.$AABB99 > a.b.c.d.$AABB99 > > (or whatever). Yes, that was the question. How do I avoid that? (Of course I can avoid that by using a full sha256 hash value.) >> * Can somebody think of a >> better algorithm, that would give a bigger chance of recognizing the >> original identifier from the modified one? > Rather than truncating the most significant part of the identifier, the > field name, you should truncate the least important part, the middle. > > a.b.c.d.e.f.g.something > > goes to: > > a.b...g.something > > or similar. Yes, this is a good idea. Thank you.
[toc] | [prev] | [next] | [standalone]
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2012-07-26 13:28 -0600 |
| Message-ID | <mailman.2628.1343330939.4697.python-list@python.org> |
| In reply to | #26099 |
On Thu, Jul 26, 2012 at 9:30 AM, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote: > What happens if you get a collision? > > That is, you have two different long identifiers: > > a.b.c.d...something > a.b.c.d...anotherthing > > which by bad luck both hash to the same value: > > a.b.c.d.$AABB99 > a.b.c.d.$AABB99 > > (or whatever). The odds of a given pair of identifiers having the same digest to 10 hex digits are 1 in 16^10, or approximately 1 in a trillion. If you bought one lottery ticket a day at those odds, you would win approximately once every 3 billion years. But it's not enough just to have a hash collision, they also have to match exactly in the first 21 (or 30, or whatever) characters of their actual names, and they have to both be long enough to invoke the truncating scheme in the first place. The Oracle backend for Django uses this same approach with an MD5 sum to ensure that identifiers will be no more than 30 characters long (a hard limit imposed by Oracle). It actually truncates the hash to 4 digits, though, not 10. This hasn't caused any problems that I'm aware of.
[toc] | [prev] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-07-27 01:54 +0000 |
| Message-ID | <5011f4be$0$29978$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #26110 |
On Thu, 26 Jul 2012 13:28:26 -0600, Ian Kelly wrote: > The odds of a given pair of identifiers having the same digest to 10 hex > digits are 1 in 16^10, or approximately 1 in a trillion. Unless an attacker can select the field names, in which case they may be able to improve those odds significantly. In the case of MD5, they can possibly improve those odds to 1 in 1, since MD5 is vulnerable to collision attacks. Not so for some (all?) of the SHA hashes, at least not yet, but they're much more expensive to calculate. If the OP sticks with his intention to use CRC32, the odds won't be anywhere near that low. CRC32 is neither collision-resistant nor cryptographically random, and only generates eight hex digits, not ten. -- Steven
[toc] | [prev] | [next] | [standalone]
| From | Laszlo Nagy <gandalf@shopzeus.com> |
|---|---|
| Date | 2012-07-27 09:34 +0200 |
| Message-ID | <mailman.2650.1343374492.4697.python-list@python.org> |
| In reply to | #26131 |
> Unless an attacker can select the field names, in which case they may be > able to improve those odds significantly. In the case of MD5, they can > possibly improve those odds to 1 in 1, since MD5 is vulnerable to > collision attacks. Not so for some (all?) of the SHA hashes, at least not > yet, but they're much more expensive to calculate. > > If the OP sticks with his intention to use CRC32, the odds won't be > anywhere near that low. CRC32 is neither collision-resistant nor > cryptographically random, and only generates eight hex digits, not ten. > > I'm not affraid of attackers. As I said, nobody will want to "crack" the hash. This is for an in-house project. Only the dba can create new database objects. If the dba wants to do something wrong, he can simply delete whole database. He doesn't need to crack any hash value. :-) So yes, CRC32 is not collision-resistant, and not cryptographically random. But in my case, sha256 is not collision resistant either (because I'm using the first few chars of the digest value only). And cryptographic randomness is not a requirement. Given these circumstances, maybe using CRC32 would be fine too. I wonder what kind of hash Django uses for Oracle.
[toc] | [prev] | [next] | [standalone]
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2012-07-26 14:00 -0600 |
| Message-ID | <mailman.2630.1343332892.4697.python-list@python.org> |
| In reply to | #26099 |
On Thu, Jul 26, 2012 at 1:28 PM, Ian Kelly <ian.g.kelly@gmail.com> wrote: > The odds of a given pair of identifiers having the same digest to 10 > hex digits are 1 in 16^10, or approximately 1 in a trillion. If you > bought one lottery ticket a day at those odds, you would win > approximately once every 3 billion years. But it's not enough just to > have a hash collision, they also have to match exactly in the first 21 > (or 30, or whatever) characters of their actual names, and they have > to both be long enough to invoke the truncating scheme in the first > place. > > The Oracle backend for Django uses this same approach with an MD5 sum > to ensure that identifiers will be no more than 30 characters long (a > hard limit imposed by Oracle). It actually truncates the hash to 4 > digits, though, not 10. This hasn't caused any problems that I'm > aware of. As a side note, the odds of having at least one hash collision among multiple tables are known as the birthday problem. At 4 hex digits there are 65536 possible digests, and it turns out that at 302 tables there is a >50% chance that at least one pair of those names have the same 4-digit digest. That doesn't mean you should be concerned if you have 302 tables in your Django Oracle database, though, because those colliding tables also have to match completely in the first 26 characters of their generated names, which is not that common. If a collision ever did occur, the resolution would be simple: manually set the name of one of the offending tables in the model definition. With 16 ** 10 possible digests, the probability of collision hits 50% at 1234605 tables.
[toc] | [prev] | [next] | [standalone]
| From | Laszlo Nagy <gandalf@shopzeus.com> |
|---|---|
| Date | 2012-07-27 09:28 +0200 |
| Message-ID | <mailman.2649.1343374141.4697.python-list@python.org> |
| In reply to | #26099 |
> As a side note, the odds of having at least one hash collision among > multiple tables are known as the birthday problem. At 4 hex digits > there are 65536 possible digests, and it turns out that at 302 tables > there is a >50% chance that at least one pair of those names have the > same 4-digit digest. That doesn't mean you should be concerned if you > have 302 tables in your Django Oracle database, though, because those > colliding tables also have to match completely in the first 26 > characters of their generated names, which is not that common. If a > collision ever did occur, the resolution would be simple: manually set > the name of one of the offending tables in the model definition. > > With 16 ** 10 possible digests, the probability of collision hits 50% > at 1234605 tables. Thank you for the precise explanation. :-) Well, if Django and Oracle uses this, then it can't be a very bad idea. :-)
[toc] | [prev] | [next] | [standalone]
| From | Laszlo Nagy <gandalf@shopzeus.com> |
|---|---|
| Date | 2012-07-27 11:59 +0200 |
| Message-ID | <mailman.2652.1343383173.4697.python-list@python.org> |
| In reply to | #26099 |
> With 16 ** 10 possible digests, the probability of collision hits 50% at 1234605 tables Actually, I'm using base64 encoding. So it is 64**10. I guess using 6 characters will enough.
[toc] | [prev] | [standalone]
Back to top | Article view | comp.lang.python
csiph-web