Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #71882 > unrolled thread
| Started by | Adam Funk <a24061@ducksburg.com> |
|---|---|
| First post | 2014-05-22 12:47 +0100 |
| Last post | 2014-05-22 14:48 +0000 |
| Articles | 20 on this page of 26 — 8 participants |
Back to article view | Back to comp.lang.python
hashing strings to integers for sqlite3 keys Adam Funk <a24061@ducksburg.com> - 2014-05-22 12:47 +0100
Re: hashing strings to integers for sqlite3 keys Peter Otten <__peter__@web.de> - 2014-05-22 14:58 +0200
Re: hashing strings to integers for sqlite3 keys Adam Funk <a24061@ducksburg.com> - 2014-05-22 14:41 +0100
Re: hashing strings to integers for sqlite3 keys Chris Angelico <rosuav@gmail.com> - 2014-05-23 00:08 +1000
Re: hashing strings to integers for sqlite3 keys Adam Funk <a24061@ducksburg.com> - 2014-05-22 15:40 +0100
Re: hashing strings to integers for sqlite3 keys Chris Angelico <rosuav@gmail.com> - 2014-05-22 23:03 +1000
Re: hashing strings to integers for sqlite3 keys Adam Funk <a24061@ducksburg.com> - 2014-05-22 14:47 +0100
Re: hashing strings to integers for sqlite3 keys Tim Chase <python.list@tim.thechases.com> - 2014-05-22 08:09 -0500
Re: hashing strings to integers for sqlite3 keys Adam Funk <a24061@ducksburg.com> - 2014-05-22 14:54 +0100
Re: hashing strings to integers for sqlite3 keys Chris Angelico <rosuav@gmail.com> - 2014-05-23 00:14 +1000
Re: hashing strings to integers for sqlite3 keys Adam Funk <a24061@ducksburg.com> - 2014-05-22 15:47 +0100
Re: hashing strings to integers for sqlite3 keys Chris Angelico <rosuav@gmail.com> - 2014-05-23 01:09 +1000
Re: hashing strings to integers for sqlite3 keys Peter Otten <__peter__@web.de> - 2014-05-22 17:34 +0200
hashing strings to integers (was: hashing strings to integers for sqlite3 keys) Adam Funk <a24061@ducksburg.com> - 2014-05-23 11:27 +0100
Re: hashing strings to integers Adam Funk <a24061@ducksburg.com> - 2014-05-23 11:36 +0100
Re: hashing strings to integers Chris Angelico <rosuav@gmail.com> - 2014-05-23 21:01 +1000
Re: hashing strings to integers (was: hashing strings to integers for sqlite3 keys) Chris Angelico <rosuav@gmail.com> - 2014-05-23 20:59 +1000
Re: hashing strings to integers Adam Funk <a24061@ducksburg.com> - 2014-05-27 16:13 +0100
Re: hashing strings to integers Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-05-27 17:02 +0000
Re: hashing strings to integers Chris Angelico <rosuav@gmail.com> - 2014-05-28 05:16 +1000
Re: hashing strings to integers Dan Sommers <dan@tombstonezero.net> - 2014-05-28 01:55 +0000
Re: hashing strings to integers Adam Funk <a24061@ducksburg.com> - 2014-06-03 11:29 +0100
Re: hashing strings to integers Adam Funk <a24061@ducksburg.com> - 2014-06-03 11:32 +0100
Re: hashing strings to integers Terry Reedy <tjreedy@udel.edu> - 2014-05-23 15:10 -0400
Re: hashing strings to integers Adam Funk <a24061@ducksburg.com> - 2014-05-27 16:20 +0100
Re: hashing strings to integers for sqlite3 keys alister <alister.nospam.ware@ntlworld.com> - 2014-05-22 14:48 +0000
Page 1 of 2 [1] 2 Next page →
| From | Adam Funk <a24061@ducksburg.com> |
|---|---|
| Date | 2014-05-22 12:47 +0100 |
| Subject | hashing strings to integers for sqlite3 keys |
| Message-ID | <jn415bx6uf.ln2@news.ducksburg.com> |
I'm using Python 3.3 and the sqlite3 module in the standard library. I'm processing a lot of strings from input files (among other things, values of headers in e-mail & news messages) and suppressing duplicates using a table of seen strings in the database. It seems to me --- from past experience with other things, where testing integers for equality is faster than testing strings, as well as from reading the SQLite3 documentation about INTEGER PRIMARY KEY --- that the SELECT tests should be faster if I am looking up an INTEGER PRIMARY KEY value rather than TEXT PRIMARY KEY. Is that right? If so, what sort of hashing function should I use? The "maxint" for SQLite3 is a lot smaller than the size of even MD5 hashes. The only thing I've thought of so far is to use MD5 or SHA-something modulo the maxint value. (Security isn't an issue --- i.e., I'm not worried about someone trying to create a hash collision.) Thanks, Adam -- "It is the role of librarians to keep government running in difficult times," replied Dramoren. "Librarians are the last line of defence against chaos." (McMullen 2001)
[toc] | [next] | [standalone]
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2014-05-22 14:58 +0200 |
| Message-ID | <mailman.10218.1400763549.18130.python-list@python.org> |
| In reply to | #71882 |
Adam Funk wrote: > I'm using Python 3.3 and the sqlite3 module in the standard library. > I'm processing a lot of strings from input files (among other things, > values of headers in e-mail & news messages) and suppressing > duplicates using a table of seen strings in the database. > > It seems to me --- from past experience with other things, where > testing integers for equality is faster than testing strings, as well > as from reading the SQLite3 documentation about INTEGER PRIMARY KEY > --- that the SELECT tests should be faster if I am looking up an > INTEGER PRIMARY KEY value rather than TEXT PRIMARY KEY. Is that > right? My gut feeling tells me that this would matter more for join operations than lookup of a value. If you plan to do joins you could use an autoinc integer as the primary key and an additional string key for lookup. > If so, what sort of hashing function should I use? The "maxint" for > SQLite3 is a lot smaller than the size of even MD5 hashes. The only > thing I've thought of so far is to use MD5 or SHA-something modulo the > maxint value. (Security isn't an issue --- i.e., I'm not worried > about someone trying to create a hash collision.) Start with the cheapest operation you can think of, md5(s) % MAXINT or even hash(s) % MAXINT # don't forget to set PYTHONHASHSEED then compare performance with just s and only if you can demonstrate a significant speedup keep the complication in your code. If you find such a speedup I'd like to see the numbers because this cries PREMATURE OPTIMIZATION...
[toc] | [prev] | [next] | [standalone]
| From | Adam Funk <a24061@ducksburg.com> |
|---|---|
| Date | 2014-05-22 14:41 +0100 |
| Message-ID | <tcb15bx6oj.ln2@news.ducksburg.com> |
| In reply to | #71884 |
On 2014-05-22, Peter Otten wrote:
> Adam Funk wrote:
>
>> I'm using Python 3.3 and the sqlite3 module in the standard library.
>> I'm processing a lot of strings from input files (among other things,
>> values of headers in e-mail & news messages) and suppressing
>> duplicates using a table of seen strings in the database.
>>
>> It seems to me --- from past experience with other things, where
>> testing integers for equality is faster than testing strings, as well
>> as from reading the SQLite3 documentation about INTEGER PRIMARY KEY
>> --- that the SELECT tests should be faster if I am looking up an
>> INTEGER PRIMARY KEY value rather than TEXT PRIMARY KEY. Is that
>> right?
>
> My gut feeling tells me that this would matter more for join operations than
> lookup of a value. If you plan to do joins you could use an autoinc integer
> as the primary key and an additional string key for lookup.
I'm not doing any join operations. I'm using sqlite3 for storing big
piles of data & persistence between runs --- not really "proper
relational database use". In this particular case, I'm getting header
values out of messages & doing this:
for this_string in these_strings:
if not already_seen(this_string):
process(this_string)
# ignore if already seen
...
> and only if you can demonstrate a significant speedup keep the complication
> in your code.
>
> If you find such a speedup I'd like to see the numbers because this cries
> PREMATURE OPTIMIZATION...
On further reflection, I think I asked for that. In fact, the table
I'm using only has one column for the hashes --- I wasn't going to
store the strings at all in order to save disk space (maybe my mind is
stuck in the 1980s).
--
But the government always tries to coax well-known writers into the
Establishment; it makes them feel educated. [Robert Graves]
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-05-23 00:08 +1000 |
| Message-ID | <mailman.10222.1400767725.18130.python-list@python.org> |
| In reply to | #71888 |
On Thu, May 22, 2014 at 11:41 PM, Adam Funk <a24061@ducksburg.com> wrote: > On further reflection, I think I asked for that. In fact, the table > I'm using only has one column for the hashes --- I wasn't going to > store the strings at all in order to save disk space (maybe my mind is > stuck in the 1980s). That's a problem, then, because you will see hash collisions. Maybe not often, but they definitely will occur if you have enough strings (look up the birthday paradox - with a 32-bit arbitrarily selected integer (such as a good crypto hash that you then truncate to 32 bits), you have a 50% chance of a collision at just 77,000 strings). Do you have enough RAM to hold all the strings directly? Just load 'em all up into a Python set. Set operations are fast, clean, and easy. Your already_seen function becomes a simple 'in' check. These days you can get 16GB or 32GB of RAM in a PC inexpensively enough; with an average string size of 80 characters, and assuming Python 3.3+, that's about 128 bytes each - close enough, and a nice figure. 16GB divided by 128 gives 128M strings - obviously you won't get all of that, but that's your ball-park. Anything less than, say, a hundred million strings, and you can dump the lot into memory. Easy! ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Adam Funk <a24061@ducksburg.com> |
|---|---|
| Date | 2014-05-22 15:40 +0100 |
| Message-ID | <3se15bx8pl.ln2@news.ducksburg.com> |
| In reply to | #71892 |
On 2014-05-22, Chris Angelico wrote:
> On Thu, May 22, 2014 at 11:41 PM, Adam Funk <a24061@ducksburg.com> wrote:
>> On further reflection, I think I asked for that. In fact, the table
>> I'm using only has one column for the hashes --- I wasn't going to
>> store the strings at all in order to save disk space (maybe my mind is
>> stuck in the 1980s).
>
> That's a problem, then, because you will see hash collisions. Maybe
> not often, but they definitely will occur if you have enough strings
> (look up the birthday paradox - with a 32-bit arbitrarily selected
> integer (such as a good crypto hash that you then truncate to 32
> bits), you have a 50% chance of a collision at just 77,000 strings).
Ah yes, there's a handy table for that:
https://en.wikipedia.org/wiki/Birthday_attack#Mathematics
> Do you have enough RAM to hold all the strings directly? Just load 'em
> all up into a Python set. Set operations are fast, clean, and easy.
> Your already_seen function becomes a simple 'in' check. These days you
> can get 16GB or 32GB of RAM in a PC inexpensively enough; with an
> average string size of 80 characters, and assuming Python 3.3+, that's
> about 128 bytes each - close enough, and a nice figure. 16GB divided
> by 128 gives 128M strings - obviously you won't get all of that, but
> that's your ball-park. Anything less than, say, a hundred million
> strings, and you can dump the lot into memory. Easy!
Good point, & since (as I explained in my other post) the substrings
are being deduplicated in their own table anyway it's probably not
worth bothering with persistence between runs for this bit.
--
Some say the world will end in fire; some say in segfaults.
[XKCD 312]
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-05-22 23:03 +1000 |
| Message-ID | <mailman.10219.1400763822.18130.python-list@python.org> |
| In reply to | #71882 |
On Thu, May 22, 2014 at 9:47 PM, Adam Funk <a24061@ducksburg.com> wrote: > I'm using Python 3.3 and the sqlite3 module in the standard library. > I'm processing a lot of strings from input files (among other things, > values of headers in e-mail & news messages) and suppressing > duplicates using a table of seen strings in the database. > > It seems to me --- from past experience with other things, where > testing integers for equality is faster than testing strings, as well > as from reading the SQLite3 documentation about INTEGER PRIMARY KEY > --- that the SELECT tests should be faster if I am looking up an > INTEGER PRIMARY KEY value rather than TEXT PRIMARY KEY. Is that > right? It might be faster to use an integer primary key, but the possibility of even a single collision means you can't guarantee uniqueness without a separate check. I don't know sqlite3 well enough to say, but based on what I know of PostgreSQL, it's usually best to make your schema mimic your logical structure, rather than warping it for the sake of performance. With a good indexing function, the performance of a textual PK won't be all that much worse than an integral one, and everything you do will read correctly in the code - no fiddling around with hashes and collision checks. Stick with the TEXT PRIMARY KEY and let the database do the database's job. If you're processing a really large number of strings, you might want to consider moving from sqlite3 to PostgreSQL anyway (I've used psycopg2 quite happily), as you'll get better concurrency; and that might solve your performance problem as well, as Pg plays very nicely with caches. ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Adam Funk <a24061@ducksburg.com> |
|---|---|
| Date | 2014-05-22 14:47 +0100 |
| Message-ID | <mob15bx6oj.ln2@news.ducksburg.com> |
| In reply to | #71885 |
On 2014-05-22, Chris Angelico wrote:
> On Thu, May 22, 2014 at 9:47 PM, Adam Funk <a24061@ducksburg.com> wrote:
>> I'm using Python 3.3 and the sqlite3 module in the standard library.
>> I'm processing a lot of strings from input files (among other things,
>> values of headers in e-mail & news messages) and suppressing
>> duplicates using a table of seen strings in the database.
>>
>> It seems to me --- from past experience with other things, where
>> testing integers for equality is faster than testing strings, as well
>> as from reading the SQLite3 documentation about INTEGER PRIMARY KEY
>> --- that the SELECT tests should be faster if I am looking up an
>> INTEGER PRIMARY KEY value rather than TEXT PRIMARY KEY. Is that
>> right?
>
> It might be faster to use an integer primary key, but the possibility
> of even a single collision means you can't guarantee uniqueness
> without a separate check. I don't know sqlite3 well enough to say, but
> based on what I know of PostgreSQL, it's usually best to make your
> schema mimic your logical structure, rather than warping it for the
> sake of performance. With a good indexing function, the performance of
> a textual PK won't be all that much worse than an integral one, and
> everything you do will read correctly in the code - no fiddling around
> with hashes and collision checks.
>
> Stick with the TEXT PRIMARY KEY and let the database do the database's
> job. If you're processing a really large number of strings, you might
> want to consider moving from sqlite3 to PostgreSQL anyway (I've used
> psycopg2 quite happily), as you'll get better concurrency; and that
> might solve your performance problem as well, as Pg plays very nicely
> with caches.
Well, actually I'm thinking about doing away with checking for
duplicates at this stage, since the substrings that I pick out of the
deduplicated header values go into another table as the TEXT PRIMARY
KEY anyway, with deduplication there. So I think this stage reeks of
premature optimization.
--
The history of the world is the history of a privileged few.
--- Henry Miller
[toc] | [prev] | [next] | [standalone]
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2014-05-22 08:09 -0500 |
| Message-ID | <mailman.10220.1400764235.18130.python-list@python.org> |
| In reply to | #71882 |
On 2014-05-22 12:47, Adam Funk wrote: > I'm using Python 3.3 and the sqlite3 module in the standard library. > I'm processing a lot of strings from input files (among other > things, values of headers in e-mail & news messages) and suppressing > duplicates using a table of seen strings in the database. > > It seems to me --- from past experience with other things, where > testing integers for equality is faster than testing strings, as > well as from reading the SQLite3 documentation about INTEGER > PRIMARY KEY --- that the SELECT tests should be faster if I am > looking up an INTEGER PRIMARY KEY value rather than TEXT PRIMARY > KEY. Is that right? If sqlite can handle the absurd length of a Python long, you *can* do it as ints: >>> from hashlib import sha1 >>> s = "Hello world" >>> h = sha1(s) >>> h.hexdigest() '7b502c3a1f48c8609ae212cdfb639dee39673f5e' >>> int(h.hexdigest(), 16) 703993777145756967576188115661016000849227759454L That's a pretty honkin' huge int for a DB key, but you can use it. And it's pretty capped on length regardless of the underlying string's length. > If so, what sort of hashing function should I use? The "maxint" for > SQLite3 is a lot smaller than the size of even MD5 hashes. The only > thing I've thought of so far is to use MD5 or SHA-something modulo > the maxint value. (Security isn't an issue --- i.e., I'm not > worried about someone trying to create a hash collision.) You could truncate that to something like >>> int(h.hexdigest()[-8:], 16) which should give you something that would result in a 32-bit number that should fit in sqlite's int. -tkc
[toc] | [prev] | [next] | [standalone]
| From | Adam Funk <a24061@ducksburg.com> |
|---|---|
| Date | 2014-05-22 14:54 +0100 |
| Message-ID | <05c15bxrpj.ln2@news.ducksburg.com> |
| In reply to | #71886 |
On 2014-05-22, Tim Chase wrote:
> On 2014-05-22 12:47, Adam Funk wrote:
>> I'm using Python 3.3 and the sqlite3 module in the standard library.
>> I'm processing a lot of strings from input files (among other
>> things, values of headers in e-mail & news messages) and suppressing
>> duplicates using a table of seen strings in the database.
>>
>> It seems to me --- from past experience with other things, where
>> testing integers for equality is faster than testing strings, as
>> well as from reading the SQLite3 documentation about INTEGER
>> PRIMARY KEY --- that the SELECT tests should be faster if I am
>> looking up an INTEGER PRIMARY KEY value rather than TEXT PRIMARY
>> KEY. Is that right?
>
> If sqlite can handle the absurd length of a Python long, you *can* do
> it as ints:
It can't. SQLite3 INTEGER is an 8-byte signed one.
https://www.sqlite.org/datatype3.html
But after reading the other replies to my question, I've concluded
that what I was trying to do is pointless.
> >>> from hashlib import sha1
> >>> s = "Hello world"
> >>> h = sha1(s)
> >>> h.hexdigest()
> '7b502c3a1f48c8609ae212cdfb639dee39673f5e'
> >>> int(h.hexdigest(), 16)
> 703993777145756967576188115661016000849227759454L
That ties in with a related question I've been wondering about lately
(using MD5s & SHAs for other things) --- getting a hash value (which
is internally numeric, rather than string, right?) out as a hex string
& then converting that to an int looks inefficient to me --- is there
any better way to get an int? (I haven't seen any other way in the
API.)
--
A firm rule must be imposed upon our nation before it destroys
itself. The United States needs some theology and geometry, some taste
and decency. I suspect that we are teetering on the edge of the abyss.
--- Ignatius J Reilly
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-05-23 00:14 +1000 |
| Message-ID | <mailman.10223.1400768058.18130.python-list@python.org> |
| In reply to | #71891 |
On Thu, May 22, 2014 at 11:54 PM, Adam Funk <a24061@ducksburg.com> wrote: >> >>> from hashlib import sha1 >> >>> s = "Hello world" >> >>> h = sha1(s) >> >>> h.hexdigest() >> '7b502c3a1f48c8609ae212cdfb639dee39673f5e' >> >>> int(h.hexdigest(), 16) >> 703993777145756967576188115661016000849227759454L > > That ties in with a related question I've been wondering about lately > (using MD5s & SHAs for other things) --- getting a hash value (which > is internally numeric, rather than string, right?) out as a hex string > & then converting that to an int looks inefficient to me --- is there > any better way to get an int? (I haven't seen any other way in the > API.) I don't know that there is, at least not with hashlib. You might be able to use digest() followed by the struct module, but it's no less convoluted. It's the same in several other languages' hashing functions; the result is a string, not an integer. ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Adam Funk <a24061@ducksburg.com> |
|---|---|
| Date | 2014-05-22 15:47 +0100 |
| Message-ID | <k9f15bxoql.ln2@news.ducksburg.com> |
| In reply to | #71893 |
On 2014-05-22, Chris Angelico wrote:
> On Thu, May 22, 2014 at 11:54 PM, Adam Funk <a24061@ducksburg.com> wrote:
>> That ties in with a related question I've been wondering about lately
>> (using MD5s & SHAs for other things) --- getting a hash value (which
>> is internally numeric, rather than string, right?) out as a hex string
>> & then converting that to an int looks inefficient to me --- is there
>> any better way to get an int? (I haven't seen any other way in the
>> API.)
>
> I don't know that there is, at least not with hashlib. You might be
> able to use digest() followed by the struct module, but it's no less
> convoluted. It's the same in several other languages' hashing
> functions; the result is a string, not an integer.
Well, J*v* returns a byte array, so I used to do this:
digester = MessageDigest.getInstance("MD5");
...
digester.reset();
byte[] digest = digester.digest(bytes);
return new BigInteger(+1, digest);
I dunno why language designers don't make it easy to get a single big
number directly out of these things.
I just had a look at the struct module's fearsome documentation &
think it would present a good shoot(self, foot) opportunity.
--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-05-23 01:09 +1000 |
| Message-ID | <mailman.10224.1400771378.18130.python-list@python.org> |
| In reply to | #71897 |
On Fri, May 23, 2014 at 12:47 AM, Adam Funk <a24061@ducksburg.com> wrote: >> I don't know that there is, at least not with hashlib. You might be >> able to use digest() followed by the struct module, but it's no less >> convoluted. It's the same in several other languages' hashing >> functions; the result is a string, not an integer. > > Well, J*v* returns a byte array... I counted byte arrays along with strings. Whether it's notionally a string of bytes or characters makes no difference - it's not an integer. > I dunno why language designers don't make it easy to get a single big > number directly out of these things. It's probably because these sorts of hashes are usually done on large puddles of memory, to create a smaller puddle of memory. How you interpret the resulting puddle is up to you; maybe you want to think of it as a number, maybe as a string, but really it's just a sequence of bytes. ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2014-05-22 17:34 +0200 |
| Message-ID | <mailman.10225.1400772863.18130.python-list@python.org> |
| In reply to | #71897 |
Adam Funk wrote:
> On 2014-05-22, Chris Angelico wrote:
>
>> On Thu, May 22, 2014 at 11:54 PM, Adam Funk <a24061@ducksburg.com> wrote:
>
>>> That ties in with a related question I've been wondering about lately
>>> (using MD5s & SHAs for other things) --- getting a hash value (which
>>> is internally numeric, rather than string, right?) out as a hex string
>>> & then converting that to an int looks inefficient to me --- is there
>>> any better way to get an int? (I haven't seen any other way in the
>>> API.)
>>
>> I don't know that there is, at least not with hashlib. You might be
>> able to use digest() followed by the struct module, but it's no less
>> convoluted. It's the same in several other languages' hashing
>> functions; the result is a string, not an integer.
>
> Well, J*v* returns a byte array, so I used to do this:
>
> digester = MessageDigest.getInstance("MD5");
> ...
> digester.reset();
> byte[] digest = digester.digest(bytes);
> return new BigInteger(+1, digest);
In Python 3 there's int.from_bytes()
>>> h = hashlib.sha1(b"Hello world")
>>> int.from_bytes(h.digest(), "little")
538059071683667711846616050503420899184350089339
> I dunno why language designers don't make it easy to get a single big
> number directly out of these things.
You hardly ever need to manipulate the numerical value of the digest. And on
its way into the database it will be re-serialized anyway.
> I just had a look at the struct module's fearsome documentation &
> think it would present a good shoot(self, foot) opportunity.
[toc] | [prev] | [next] | [standalone]
| From | Adam Funk <a24061@ducksburg.com> |
|---|---|
| Date | 2014-05-23 11:27 +0100 |
| Subject | hashing strings to integers (was: hashing strings to integers for sqlite3 keys) |
| Message-ID | <sck35bxdc5.ln2@news.ducksburg.com> |
| In reply to | #71901 |
On 2014-05-22, Peter Otten wrote:
> Adam Funk wrote:
>> Well, J*v* returns a byte array, so I used to do this:
>>
>> digester = MessageDigest.getInstance("MD5");
>> ...
>> digester.reset();
>> byte[] digest = digester.digest(bytes);
>> return new BigInteger(+1, digest);
>
> In Python 3 there's int.from_bytes()
>
>>>> h = hashlib.sha1(b"Hello world")
>>>> int.from_bytes(h.digest(), "little")
> 538059071683667711846616050503420899184350089339
Excellent, thanks for pointing that out. I've just recently started
using Python 3 instead of 2, & appreciate pointers to new things like
that. The only thing that really bugs me in Python 3 is that execfile
has been removed (I find it useful for testing things interactively).
>> I dunno why language designers don't make it easy to get a single big
>> number directly out of these things.
>
> You hardly ever need to manipulate the numerical value of the digest. And on
> its way into the database it will be re-serialized anyway.
I now agree that my original plan to hash strings for the SQLite3
table was pointless, so I've changed the subject header. :-)
I have had good reason to use int hashes in the past, however. I was
doing some experiments with Andrei Broder's "sketches of shingles"
technique for finding partial duplication between documents, & you
need integers for that so you can do modulo arithmetic.
I've also used hashes of strings for other things involving
deduplication or fast lookups (because integer equality is faster than
string equality). I guess if it's just for deduplication, though, a
set of byte arrays is as good as a set of int?
--
Classical Greek lent itself to the promulgation of a rich culture,
indeed, to Western civilization. Computer languages bring us
doorbells that chime with thirty-two tunes, alt.sex.bestiality, and
Tetris clones. (Stoll 1995)
[toc] | [prev] | [next] | [standalone]
| From | Adam Funk <a24061@ducksburg.com> |
|---|---|
| Date | 2014-05-23 11:36 +0100 |
| Subject | Re: hashing strings to integers |
| Message-ID | <htk35bxdc5.ln2@news.ducksburg.com> |
| In reply to | #71922 |
On 2014-05-23, Adam Funk wrote:
> On 2014-05-22, Peter Otten wrote:
>> In Python 3 there's int.from_bytes()
>>
>>>>> h = hashlib.sha1(b"Hello world")
>>>>> int.from_bytes(h.digest(), "little")
>> 538059071683667711846616050503420899184350089339
>
> Excellent, thanks for pointing that out. I've just recently started
> using Python 3 instead of 2, & appreciate pointers to new things like
> that.
BTW, I just tested that & it should be "big" for consistency with the
hexdigest:
Python 3.3.2+ (default, Feb 28 2014, 00:52:16)
[GCC 4.8.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> h0 = hashlib.sha1(bytes('pants', 'UTF-8')).digest()
>>> h1 = hashlib.sha1(bytes('pants', 'UTF-8')).hexdigest()
>>> int.from_bytes(h0, 'big')
1315090007003469710610607131160586168131917298749
>>> int.from_bytes(h0, 'little')
352462323236431222976527983157432783788229548774
>>> int(h1, 16)
1315090007003469710610607131160586168131917298749
Thanks.
--
The kid's a hot prospect. He's got a good head for merchandising, an
agent who can take you downtown and one of the best urine samples I've
seen in a long time. [Dead Kennedys t-shirt]
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-05-23 21:01 +1000 |
| Subject | Re: hashing strings to integers |
| Message-ID | <mailman.10240.1400842874.18130.python-list@python.org> |
| In reply to | #71923 |
On Fri, May 23, 2014 at 8:36 PM, Adam Funk <a24061@ducksburg.com> wrote: > BTW, I just tested that & it should be "big" for consistency with the > hexdigest: Yes, it definitely should be parsed big-endianly. ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-05-23 20:59 +1000 |
| Subject | Re: hashing strings to integers (was: hashing strings to integers for sqlite3 keys) |
| Message-ID | <mailman.10239.1400842778.18130.python-list@python.org> |
| In reply to | #71922 |
On Fri, May 23, 2014 at 8:27 PM, Adam Funk <a24061@ducksburg.com> wrote: > I've also used hashes of strings for other things involving > deduplication or fast lookups (because integer equality is faster than > string equality). I guess if it's just for deduplication, though, a > set of byte arrays is as good as a set of int? Want a better way to do that? Use a set or dict and let Python do the hashing. It'll be every bit as fast as explicit hashing, plus you get the bonus of simplicity. ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Adam Funk <a24061@ducksburg.com> |
|---|---|
| Date | 2014-05-27 16:13 +0100 |
| Subject | Re: hashing strings to integers |
| Message-ID | <amme5bxc3v.ln2@news.ducksburg.com> |
| In reply to | #71924 |
On 2014-05-23, Chris Angelico wrote:
> On Fri, May 23, 2014 at 8:27 PM, Adam Funk <a24061@ducksburg.com> wrote:
>> I've also used hashes of strings for other things involving
>> deduplication or fast lookups (because integer equality is faster than
>> string equality). I guess if it's just for deduplication, though, a
>> set of byte arrays is as good as a set of int?
>
> Want a better way to do that? Use a set or dict and let Python do the
> hashing. It'll be every bit as fast as explicit hashing, plus you get
> the bonus of simplicity.
Well, here's the way it works in my mind:
I can store a set of a zillion strings (or a dict with a zillion
string keys), but every time I test "if new_string in
seen_strings", the computer hashes the new_string using some kind
of "short hash", checks the set for matching buckets (I'm assuming
this is how python tests set membership --- is that right?), then
checks any hash-matches for string equality. Testing string
equality is slower than integer equality, and strings (unless they
are really short) take up a lot more memory than long integers.
So for that kind of thing, I tend to store MD5 hashes or something
similar. Is that wrong?
--
With the breakdown of the medieval system, the gods of chaos, lunacy,
and bad taste gained ascendancy.
--- Ignatius J Reilly
[toc] | [prev] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-05-27 17:02 +0000 |
| Subject | Re: hashing strings to integers |
| Message-ID | <5384c539$0$29978$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #72118 |
On Tue, 27 May 2014 16:13:46 +0100, Adam Funk wrote:
> On 2014-05-23, Chris Angelico wrote:
>
>> On Fri, May 23, 2014 at 8:27 PM, Adam Funk <a24061@ducksburg.com>
>> wrote:
>>> I've also used hashes of strings for other things involving
>>> deduplication or fast lookups (because integer equality is faster than
>>> string equality). I guess if it's just for deduplication, though, a
>>> set of byte arrays is as good as a set of int?
>>
>> Want a better way to do that? Use a set or dict and let Python do the
>> hashing. It'll be every bit as fast as explicit hashing, plus you get
>> the bonus of simplicity.
>
> Well, here's the way it works in my mind:
>
> I can store a set of a zillion strings (or a dict with a zillion
> string keys), but every time I test "if new_string in seen_strings",
> the computer hashes the new_string using some kind of "short hash",
> checks the set for matching buckets (I'm assuming this is how python
> tests set membership --- is that right?),
So far so good. That applies to all objects, not just strings.
> then checks any
> hash-matches for string equality. Testing string equality is slower
> than integer equality, and strings (unless they are really short)
> take up a lot more memory than long integers.
But presumably you have to keep the string around anyway. It's going to
be somewhere, you can't just throw the string away and garbage collect
it. The dict doesn't store a copy of the string, it stores a reference to
it, and extra references don't cost much.
As for string equality, in pseudo-code it looks something like this:
def __eq__(self, other):
# Consider only the case where other is a string as well.
if self is other: # object identity
return True
if len(self) != len(other): # checking the length is fast
return False
# iterate over both strings in lock-step
for a in self, b in other:
if a != b: return False
return True
only written in fast C code. So the equality test bails out as quickly as
possible, and the only time you have to look at every character is if the
two strings are equal but not the same object.
MD5 hashing, on the other hand, *always* has to look at every character,
and perform quite a complicated set of computations. It's expensive.
> So for that kind of thing, I tend to store MD5 hashes or something
> similar. Is that wrong?
First rule of optimization: Don't do it.
Second rule of optimization (for experts only): Don't do it yet.
You're making the cardinal mistake of optimization, not what is slow, but
what you *assume* will be slow. (Or, replace memory consumption for speed
if you prefer.) Python is not C, and your intuitions for what's fast and
slow (low and high memory consumption) are likely to be way off. Consider
trying to store an MD5 hash instead of the string "Hello World!". If I
try this in Python 2.7, I get this:
py> import md5, sys
py> s = "Hello World!"
py> n = int(md5.md5(s).hexdigest(), 16)
py> sys.getsizeof(s)
33
py> sys.getsizeof(n)
30
I save a lousy 3 bytes. But in order to save that 3 bytes, I have to
generate an md5 object (32 bytes) and a hex string (53 bytes), both of
which take time to create and then time to garbage collect.
Actually, I don't even save those 3 bytes, since for nearly all
reasonable applications, I'll need to keep the string around somewhere.
So instead of saving three bytes, it's costing me thirty bytes for the
hash, plus who-knows-what needed to manage the book keeping to link that
hash to the actual string.
Ah, but maybe we can save time hashing them?
py> from timeit import Timer
py> setup = "from __main__ import s, n"
py> t1 = Timer("hash(s)", setup)
py> t2 = Timer("hash(n)", setup)
py> min(t1.repeat(5))
0.14668607711791992
py> min(t2.repeat(5))
0.15973114967346191
Times are in seconds for one million iterations of the code being timed.
Or alternatively, it costs about 0.14 microseconds to hash the string,
and about 0.15 microseconds to hash the MD5 hash. **Not** to calculate
the MD5 hash in the first place, that takes *much* longer:
py> t3 = Timer("md5.md5(s).hexdigest()", "import md5; s = 'Hello World!'")
py> min(t3.repeat(5))
2.3326950073242188
but merely to perform a lightweight hash on the long int so as to find
which bucket of the dict it belongs to.
So, yes, chances are **very** good that you're supposed optimizations in
this area are actually pessimizations. If you have not measured where the
bottlenecks in your code are, you have no idea which parts of the code
need to be optimized.
Now, it is conceivable that there may be some circumstances where your
strategy makes sense. I'm guessing you would need something like this
before you even come close to saving memory and/or time:
- rather than short keys like "Hello World!", you are dealing with
keys that are typically many tens of thousands of characters long;
- most of the strings are the same length and have very similar
prefixes (e.g. the first 3000 chars are the same);
- rather than "zillions" of them, there are few enough of them that
the chances of an MD5 collision is insignificant;
(Any MD5 collision is going to play havoc with your strategy of
using hashes as a proxy for the real string.)
- and you can arrange matters so that you never need to MD5 hash a
string twice.
Is my guess closer to the actuality than yours? I don't know. I haven't
measured it either. But I know that Python is a high-level language with
lots of high-level data structures like dicts which trade-off time and
memory for programmer convenience, and that I'd want to see some real
benchmarks proving that my application was too slow before giving up that
convenience with a complicated strategy like this.
--
Steven D'Aprano
http://import-that.dreamwidth.org/
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-05-28 05:16 +1000 |
| Subject | Re: hashing strings to integers |
| Message-ID | <mailman.10372.1401218186.18130.python-list@python.org> |
| In reply to | #72121 |
On Wed, May 28, 2014 at 3:02 AM, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote: > But I know that Python is a high-level language with > lots of high-level data structures like dicts which trade-off time and > memory for programmer convenience, and that I'd want to see some real > benchmarks proving that my application was too slow before giving up that > convenience with a complicated strategy like this. And they trade off that time and memory on the basis of X years of development expertise. A while ago (about ten years or so, now... wow, that's quite a while) I wrote a C++ program that needed an ever-growing array; for simplicity, I went with a very basic system of doubling the size every time, from a base of something like 1024 or 8192. (Note that it was storing and moving around only pointers, so it's comparable to Python's list.) That means it has an average 25% slack space at all times, more until its first allocation, and every now and then it has a huge job of copying a pile of pointers into a new array. (Imagine it's now at 16777216 and it needs to add the 16,777,217th string to the array. Bye-bye CPU caches.) These boundaries became *user-visible pauses*, fortunately at predictable points, but on a not-terrible computer it could cause a >1s pause just copying heaps of pointers. How do you think a Python list will perform, under the same workload (periodic appending of single strings or small groups of strings, very frequent retrieval based on index - probably about a 20:1 read:write ratio)? Not only would it be far more convenient, it's probably going to outperform my hand-rolled code - unless I'm so brilliant (or lucky) that I can stumble to something as good as can be achieved with years of dedicated development. Yeah, I don't think so. Same goes for hashing algorithms. I can at least boast that I've never written one of those... although I have several times written a binary tree of one form or another, in order to provide comparable features to a dict. And once again, now that I know how convenient and performant high level languages can be (which may or may not have been true 15 years ago, but I certainly didn't know it), I don't rewrite what can be done better by just tossing in a couple of curly braces and saying "here, map this to that". ChrisA
[toc] | [prev] | [next] | [standalone]
Page 1 of 2 [1] 2 Next page →
Back to top | Article view | comp.lang.python
csiph-web