Groups > comp.lang.python > #71882 > unrolled thread

hashing strings to integers for sqlite3 keys

Started by	Adam Funk <a24061@ducksburg.com>
First post	2014-05-22 12:47 +0100
Last post	2014-05-22 14:48 +0000
Articles	20 on this page of 26 — 8 participants

Back to article view | Back to comp.lang.python

  hashing strings to integers for sqlite3 keys Adam Funk <a24061@ducksburg.com> - 2014-05-22 12:47 +0100
    Re: hashing strings to integers for sqlite3 keys Peter Otten <__peter__@web.de> - 2014-05-22 14:58 +0200
      Re: hashing strings to integers for sqlite3 keys Adam Funk <a24061@ducksburg.com> - 2014-05-22 14:41 +0100
        Re: hashing strings to integers for sqlite3 keys Chris Angelico <rosuav@gmail.com> - 2014-05-23 00:08 +1000
          Re: hashing strings to integers for sqlite3 keys Adam Funk <a24061@ducksburg.com> - 2014-05-22 15:40 +0100
    Re: hashing strings to integers for sqlite3 keys Chris Angelico <rosuav@gmail.com> - 2014-05-22 23:03 +1000
      Re: hashing strings to integers for sqlite3 keys Adam Funk <a24061@ducksburg.com> - 2014-05-22 14:47 +0100
    Re: hashing strings to integers for sqlite3 keys Tim Chase <python.list@tim.thechases.com> - 2014-05-22 08:09 -0500
      Re: hashing strings to integers for sqlite3 keys Adam Funk <a24061@ducksburg.com> - 2014-05-22 14:54 +0100
        Re: hashing strings to integers for sqlite3 keys Chris Angelico <rosuav@gmail.com> - 2014-05-23 00:14 +1000
          Re: hashing strings to integers for sqlite3 keys Adam Funk <a24061@ducksburg.com> - 2014-05-22 15:47 +0100
            Re: hashing strings to integers for sqlite3 keys Chris Angelico <rosuav@gmail.com> - 2014-05-23 01:09 +1000
            Re: hashing strings to integers for sqlite3 keys Peter Otten <__peter__@web.de> - 2014-05-22 17:34 +0200
              hashing strings to integers (was: hashing strings to integers for sqlite3 keys) Adam Funk <a24061@ducksburg.com> - 2014-05-23 11:27 +0100
                Re: hashing strings to integers Adam Funk <a24061@ducksburg.com> - 2014-05-23 11:36 +0100
                  Re: hashing strings to integers Chris Angelico <rosuav@gmail.com> - 2014-05-23 21:01 +1000
                Re: hashing strings to integers (was: hashing strings to integers for sqlite3 keys) Chris Angelico <rosuav@gmail.com> - 2014-05-23 20:59 +1000
                  Re: hashing strings to integers Adam Funk <a24061@ducksburg.com> - 2014-05-27 16:13 +0100
                    Re: hashing strings to integers Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-05-27 17:02 +0000
                      Re: hashing strings to integers Chris Angelico <rosuav@gmail.com> - 2014-05-28 05:16 +1000
                      Re: hashing strings to integers Dan Sommers <dan@tombstonezero.net> - 2014-05-28 01:55 +0000
                        Re: hashing strings to integers Adam Funk <a24061@ducksburg.com> - 2014-06-03 11:29 +0100
                      Re: hashing strings to integers Adam Funk <a24061@ducksburg.com> - 2014-06-03 11:32 +0100
                Re: hashing strings to integers Terry Reedy <tjreedy@udel.edu> - 2014-05-23 15:10 -0400
                  Re: hashing strings to integers Adam Funk <a24061@ducksburg.com> - 2014-05-27 16:20 +0100
    Re: hashing strings to integers for sqlite3 keys alister <alister.nospam.ware@ntlworld.com> - 2014-05-22 14:48 +0000

Page 1 of 2 [1] 2 Next page →

#71882 — hashing strings to integers for sqlite3 keys

From	Adam Funk <a24061@ducksburg.com>
Date	2014-05-22 12:47 +0100
Subject	hashing strings to integers for sqlite3 keys
Message-ID	<jn415bx6uf.ln2@news.ducksburg.com>

I'm using Python 3.3 and the sqlite3 module in the standard library.
I'm processing a lot of strings from input files (among other things,
values of headers in e-mail & news messages) and suppressing
duplicates using a table of seen strings in the database.

It seems to me --- from past experience with other things, where
testing integers for equality is faster than testing strings, as well
as from reading the SQLite3 documentation about INTEGER PRIMARY KEY
--- that the SELECT tests should be faster if I am looking up an
INTEGER PRIMARY KEY value rather than TEXT PRIMARY KEY.  Is that
right?

If so, what sort of hashing function should I use?  The "maxint" for
SQLite3 is a lot smaller than the size of even MD5 hashes.  The only
thing I've thought of so far is to use MD5 or SHA-something modulo the
maxint value.  (Security isn't an issue --- i.e., I'm not worried
about someone trying to create a hash collision.)

Thanks,
Adam


-- 
"It is the role of librarians to keep government running in difficult
times," replied Dramoren.  "Librarians are the last line of defence
against chaos."                                       (McMullen 2001)

[toc] | [next] | [standalone]

#71884

From	Peter Otten <__peter__@web.de>
Date	2014-05-22 14:58 +0200
Message-ID	<mailman.10218.1400763549.18130.python-list@python.org>
In reply to	#71882

Adam Funk wrote:

> I'm using Python 3.3 and the sqlite3 module in the standard library.
> I'm processing a lot of strings from input files (among other things,
> values of headers in e-mail & news messages) and suppressing
> duplicates using a table of seen strings in the database.
> 
> It seems to me --- from past experience with other things, where
> testing integers for equality is faster than testing strings, as well
> as from reading the SQLite3 documentation about INTEGER PRIMARY KEY
> --- that the SELECT tests should be faster if I am looking up an
> INTEGER PRIMARY KEY value rather than TEXT PRIMARY KEY.  Is that
> right?

My gut feeling tells me that this would matter more for join operations than 
lookup of a value. If you plan to do joins you could use an autoinc integer 
as the primary key and an additional string key for lookup.
 
> If so, what sort of hashing function should I use?  The "maxint" for
> SQLite3 is a lot smaller than the size of even MD5 hashes.  The only
> thing I've thought of so far is to use MD5 or SHA-something modulo the
> maxint value.  (Security isn't an issue --- i.e., I'm not worried
> about someone trying to create a hash collision.)

Start with the cheapest operation you can think of, 

md5(s) % MAXINT

or even

hash(s) % MAXINT # don't forget to set PYTHONHASHSEED

then compare performance with just

s

and only if you can demonstrate a significant speedup keep the complication 
in your code.

If you find such a speedup I'd like to see the numbers because this cries 
PREMATURE OPTIMIZATION...

[toc] | [prev] | [next] | [standalone]

#71888

From	Adam Funk <a24061@ducksburg.com>
Date	2014-05-22 14:41 +0100
Message-ID	<tcb15bx6oj.ln2@news.ducksburg.com>
In reply to	#71884

On 2014-05-22, Peter Otten wrote:

> Adam Funk wrote:
>
>> I'm using Python 3.3 and the sqlite3 module in the standard library.
>> I'm processing a lot of strings from input files (among other things,
>> values of headers in e-mail & news messages) and suppressing
>> duplicates using a table of seen strings in the database.
>> 
>> It seems to me --- from past experience with other things, where
>> testing integers for equality is faster than testing strings, as well
>> as from reading the SQLite3 documentation about INTEGER PRIMARY KEY
>> --- that the SELECT tests should be faster if I am looking up an
>> INTEGER PRIMARY KEY value rather than TEXT PRIMARY KEY.  Is that
>> right?
>
> My gut feeling tells me that this would matter more for join operations than 
> lookup of a value. If you plan to do joins you could use an autoinc integer 
> as the primary key and an additional string key for lookup.

I'm not doing any join operations.  I'm using sqlite3 for storing big
piles of data & persistence between runs --- not really "proper
relational database use".  In this particular case, I'm getting header
values out of messages & doing this:

  for this_string in these_strings:
    if not already_seen(this_string):
      process(this_string)
    # ignore if already seen	 

...
> and only if you can demonstrate a significant speedup keep the complication 
> in your code.
>
> If you find such a speedup I'd like to see the numbers because this cries 
> PREMATURE OPTIMIZATION...

On further reflection, I think I asked for that.  In fact, the table
I'm using only has one column for the hashes --- I wasn't going to
store the strings at all in order to save disk space (maybe my mind is
stuck in the 1980s).


-- 
But the government always tries to coax well-known writers into the
Establishment; it makes them feel educated.         [Robert Graves]

[toc] | [prev] | [next] | [standalone]

#71892

From	Chris Angelico <rosuav@gmail.com>
Date	2014-05-23 00:08 +1000
Message-ID	<mailman.10222.1400767725.18130.python-list@python.org>
In reply to	#71888

On Thu, May 22, 2014 at 11:41 PM, Adam Funk <a24061@ducksburg.com> wrote:
> On further reflection, I think I asked for that.  In fact, the table
> I'm using only has one column for the hashes --- I wasn't going to
> store the strings at all in order to save disk space (maybe my mind is
> stuck in the 1980s).

That's a problem, then, because you will see hash collisions. Maybe
not often, but they definitely will occur if you have enough strings
(look up the birthday paradox - with a 32-bit arbitrarily selected
integer (such as a good crypto hash that you then truncate to 32
bits), you have a 50% chance of a collision at just 77,000 strings).

Do you have enough RAM to hold all the strings directly? Just load 'em
all up into a Python set. Set operations are fast, clean, and easy.
Your already_seen function becomes a simple 'in' check. These days you
can get 16GB or 32GB of RAM in a PC inexpensively enough; with an
average string size of 80 characters, and assuming Python 3.3+, that's
about 128 bytes each - close enough, and a nice figure. 16GB divided
by 128 gives 128M strings - obviously you won't get all of that, but
that's your ball-park. Anything less than, say, a hundred million
strings, and you can dump the lot into memory. Easy!

ChrisA

[toc] | [prev] | [next] | [standalone]

#71895

From	Adam Funk <a24061@ducksburg.com>
Date	2014-05-22 15:40 +0100
Message-ID	<3se15bx8pl.ln2@news.ducksburg.com>
In reply to	#71892

On 2014-05-22, Chris Angelico wrote:

> On Thu, May 22, 2014 at 11:41 PM, Adam Funk <a24061@ducksburg.com> wrote:
>> On further reflection, I think I asked for that.  In fact, the table
>> I'm using only has one column for the hashes --- I wasn't going to
>> store the strings at all in order to save disk space (maybe my mind is
>> stuck in the 1980s).
>
> That's a problem, then, because you will see hash collisions. Maybe
> not often, but they definitely will occur if you have enough strings
> (look up the birthday paradox - with a 32-bit arbitrarily selected
> integer (such as a good crypto hash that you then truncate to 32
> bits), you have a 50% chance of a collision at just 77,000 strings).

Ah yes, there's a handy table for that:

https://en.wikipedia.org/wiki/Birthday_attack#Mathematics


> Do you have enough RAM to hold all the strings directly? Just load 'em
> all up into a Python set. Set operations are fast, clean, and easy.
> Your already_seen function becomes a simple 'in' check. These days you
> can get 16GB or 32GB of RAM in a PC inexpensively enough; with an
> average string size of 80 characters, and assuming Python 3.3+, that's
> about 128 bytes each - close enough, and a nice figure. 16GB divided
> by 128 gives 128M strings - obviously you won't get all of that, but
> that's your ball-park. Anything less than, say, a hundred million
> strings, and you can dump the lot into memory. Easy!

Good point, & since (as I explained in my other post) the substrings
are being deduplicated in their own table anyway it's probably not
worth bothering with persistence between runs for this bit.


-- 
Some say the world will end in fire; some say in segfaults.
                                                 [XKCD 312]

[toc] | [prev] | [next] | [standalone]

#71885

From	Chris Angelico <rosuav@gmail.com>
Date	2014-05-22 23:03 +1000
Message-ID	<mailman.10219.1400763822.18130.python-list@python.org>
In reply to	#71882

On Thu, May 22, 2014 at 9:47 PM, Adam Funk <a24061@ducksburg.com> wrote:
> I'm using Python 3.3 and the sqlite3 module in the standard library.
> I'm processing a lot of strings from input files (among other things,
> values of headers in e-mail & news messages) and suppressing
> duplicates using a table of seen strings in the database.
>
> It seems to me --- from past experience with other things, where
> testing integers for equality is faster than testing strings, as well
> as from reading the SQLite3 documentation about INTEGER PRIMARY KEY
> --- that the SELECT tests should be faster if I am looking up an
> INTEGER PRIMARY KEY value rather than TEXT PRIMARY KEY.  Is that
> right?

It might be faster to use an integer primary key, but the possibility
of even a single collision means you can't guarantee uniqueness
without a separate check. I don't know sqlite3 well enough to say, but
based on what I know of PostgreSQL, it's usually best to make your
schema mimic your logical structure, rather than warping it for the
sake of performance. With a good indexing function, the performance of
a textual PK won't be all that much worse than an integral one, and
everything you do will read correctly in the code - no fiddling around
with hashes and collision checks.

Stick with the TEXT PRIMARY KEY and let the database do the database's
job. If you're processing a really large number of strings, you might
want to consider moving from sqlite3 to PostgreSQL anyway (I've used
psycopg2 quite happily), as you'll get better concurrency; and that
might solve your performance problem as well, as Pg plays very nicely
with caches.

ChrisA

[toc] | [prev] | [next] | [standalone]

#71890

From	Adam Funk <a24061@ducksburg.com>
Date	2014-05-22 14:47 +0100
Message-ID	<mob15bx6oj.ln2@news.ducksburg.com>
In reply to	#71885

On 2014-05-22, Chris Angelico wrote:

> On Thu, May 22, 2014 at 9:47 PM, Adam Funk <a24061@ducksburg.com> wrote:
>> I'm using Python 3.3 and the sqlite3 module in the standard library.
>> I'm processing a lot of strings from input files (among other things,
>> values of headers in e-mail & news messages) and suppressing
>> duplicates using a table of seen strings in the database.
>>
>> It seems to me --- from past experience with other things, where
>> testing integers for equality is faster than testing strings, as well
>> as from reading the SQLite3 documentation about INTEGER PRIMARY KEY
>> --- that the SELECT tests should be faster if I am looking up an
>> INTEGER PRIMARY KEY value rather than TEXT PRIMARY KEY.  Is that
>> right?
>
> It might be faster to use an integer primary key, but the possibility
> of even a single collision means you can't guarantee uniqueness
> without a separate check. I don't know sqlite3 well enough to say, but
> based on what I know of PostgreSQL, it's usually best to make your
> schema mimic your logical structure, rather than warping it for the
> sake of performance. With a good indexing function, the performance of
> a textual PK won't be all that much worse than an integral one, and
> everything you do will read correctly in the code - no fiddling around
> with hashes and collision checks.
>
> Stick with the TEXT PRIMARY KEY and let the database do the database's
> job. If you're processing a really large number of strings, you might
> want to consider moving from sqlite3 to PostgreSQL anyway (I've used
> psycopg2 quite happily), as you'll get better concurrency; and that
> might solve your performance problem as well, as Pg plays very nicely
> with caches.

Well, actually I'm thinking about doing away with checking for
duplicates at this stage, since the substrings that I pick out of the
deduplicated header values go into another table as the TEXT PRIMARY
KEY anyway, with deduplication there.  So I think this stage reeks of
premature optimization.


-- 
The history of the world is the history of a privileged few.
                                            --- Henry Miller

[toc] | [prev] | [next] | [standalone]

#71886

From	Tim Chase <python.list@tim.thechases.com>
Date	2014-05-22 08:09 -0500
Message-ID	<mailman.10220.1400764235.18130.python-list@python.org>
In reply to	#71882

On 2014-05-22 12:47, Adam Funk wrote:
> I'm using Python 3.3 and the sqlite3 module in the standard library.
> I'm processing a lot of strings from input files (among other
> things, values of headers in e-mail & news messages) and suppressing
> duplicates using a table of seen strings in the database.
> 
> It seems to me --- from past experience with other things, where
> testing integers for equality is faster than testing strings, as
> well as from reading the SQLite3 documentation about INTEGER
> PRIMARY KEY --- that the SELECT tests should be faster if I am
> looking up an INTEGER PRIMARY KEY value rather than TEXT PRIMARY
> KEY.  Is that right?

If sqlite can handle the absurd length of a Python long, you *can* do
it as ints:

  >>> from hashlib import sha1
  >>> s = "Hello world"
  >>> h = sha1(s)
  >>> h.hexdigest()
  '7b502c3a1f48c8609ae212cdfb639dee39673f5e'
  >>> int(h.hexdigest(), 16)
  703993777145756967576188115661016000849227759454L

That's a pretty honkin' huge int for a DB key, but you can use it.
And it's pretty capped on length regardless of the underlying
string's length.

> If so, what sort of hashing function should I use?  The "maxint" for
> SQLite3 is a lot smaller than the size of even MD5 hashes.  The only
> thing I've thought of so far is to use MD5 or SHA-something modulo
> the maxint value.  (Security isn't an issue --- i.e., I'm not
> worried about someone trying to create a hash collision.)

You could truncate that to something like

  >>> int(h.hexdigest()[-8:], 16)

which should give you something that would result in a 32-bit number
that should fit in sqlite's int.

-tkc

[toc] | [prev] | [next] | [standalone]

#71891

From	Adam Funk <a24061@ducksburg.com>
Date	2014-05-22 14:54 +0100
Message-ID	<05c15bxrpj.ln2@news.ducksburg.com>
In reply to	#71886

On 2014-05-22, Tim Chase wrote:

> On 2014-05-22 12:47, Adam Funk wrote:
>> I'm using Python 3.3 and the sqlite3 module in the standard library.
>> I'm processing a lot of strings from input files (among other
>> things, values of headers in e-mail & news messages) and suppressing
>> duplicates using a table of seen strings in the database.
>> 
>> It seems to me --- from past experience with other things, where
>> testing integers for equality is faster than testing strings, as
>> well as from reading the SQLite3 documentation about INTEGER
>> PRIMARY KEY --- that the SELECT tests should be faster if I am
>> looking up an INTEGER PRIMARY KEY value rather than TEXT PRIMARY
>> KEY.  Is that right?
>
> If sqlite can handle the absurd length of a Python long, you *can* do
> it as ints:

It can't.  SQLite3 INTEGER is an 8-byte signed one.

https://www.sqlite.org/datatype3.html

But after reading the other replies to my question, I've concluded
that what I was trying to do is pointless.


>  >>> from hashlib import sha1
>  >>> s = "Hello world"
>  >>> h = sha1(s)
>  >>> h.hexdigest()
>   '7b502c3a1f48c8609ae212cdfb639dee39673f5e'
>  >>> int(h.hexdigest(), 16)
>   703993777145756967576188115661016000849227759454L

That ties in with a related question I've been wondering about lately
(using MD5s & SHAs for other things) --- getting a hash value (which
is internally numeric, rather than string, right?) out as a hex string
& then converting that to an int looks inefficient to me --- is there
any better way to get an int?  (I haven't seen any other way in the
API.)


-- 
A firm rule must be imposed upon our nation before it destroys
itself. The United States needs some theology and geometry, some taste
and decency. I suspect that we are teetering on the edge of the abyss.
                                                 --- Ignatius J Reilly

[toc] | [prev] | [next] | [standalone]

#71893

From	Chris Angelico <rosuav@gmail.com>
Date	2014-05-23 00:14 +1000
Message-ID	<mailman.10223.1400768058.18130.python-list@python.org>
In reply to	#71891

On Thu, May 22, 2014 at 11:54 PM, Adam Funk <a24061@ducksburg.com> wrote:
>>  >>> from hashlib import sha1
>>  >>> s = "Hello world"
>>  >>> h = sha1(s)
>>  >>> h.hexdigest()
>>   '7b502c3a1f48c8609ae212cdfb639dee39673f5e'
>>  >>> int(h.hexdigest(), 16)
>>   703993777145756967576188115661016000849227759454L
>
> That ties in with a related question I've been wondering about lately
> (using MD5s & SHAs for other things) --- getting a hash value (which
> is internally numeric, rather than string, right?) out as a hex string
> & then converting that to an int looks inefficient to me --- is there
> any better way to get an int?  (I haven't seen any other way in the
> API.)

I don't know that there is, at least not with hashlib. You might be
able to use digest() followed by the struct module, but it's no less
convoluted. It's the same in several other languages' hashing
functions; the result is a string, not an integer.

ChrisA

[toc] | [prev] | [next] | [standalone]

#71897

From	Adam Funk <a24061@ducksburg.com>
Date	2014-05-22 15:47 +0100
Message-ID	<k9f15bxoql.ln2@news.ducksburg.com>
In reply to	#71893

On 2014-05-22, Chris Angelico wrote:

> On Thu, May 22, 2014 at 11:54 PM, Adam Funk <a24061@ducksburg.com> wrote:

>> That ties in with a related question I've been wondering about lately
>> (using MD5s & SHAs for other things) --- getting a hash value (which
>> is internally numeric, rather than string, right?) out as a hex string
>> & then converting that to an int looks inefficient to me --- is there
>> any better way to get an int?  (I haven't seen any other way in the
>> API.)
>
> I don't know that there is, at least not with hashlib. You might be
> able to use digest() followed by the struct module, but it's no less
> convoluted. It's the same in several other languages' hashing
> functions; the result is a string, not an integer.

Well, J*v* returns a byte array, so I used to do this:

    digester = MessageDigest.getInstance("MD5");
    ...
    digester.reset();
    byte[] digest = digester.digest(bytes);
    return new BigInteger(+1, digest);

I dunno why language designers don't make it easy to get a single big
number directly out of these things.


I just had a look at the struct module's fearsome documentation &
think it would present a good shoot(self, foot) opportunity.


-- 
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

[toc] | [prev] | [next] | [standalone]

#71898

From	Chris Angelico <rosuav@gmail.com>
Date	2014-05-23 01:09 +1000
Message-ID	<mailman.10224.1400771378.18130.python-list@python.org>
In reply to	#71897

On Fri, May 23, 2014 at 12:47 AM, Adam Funk <a24061@ducksburg.com> wrote:
>> I don't know that there is, at least not with hashlib. You might be
>> able to use digest() followed by the struct module, but it's no less
>> convoluted. It's the same in several other languages' hashing
>> functions; the result is a string, not an integer.
>
> Well, J*v* returns a byte array...

I counted byte arrays along with strings. Whether it's notionally a
string of bytes or characters makes no difference - it's not an
integer.

> I dunno why language designers don't make it easy to get a single big
> number directly out of these things.

It's probably because these sorts of hashes are usually done on large
puddles of memory, to create a smaller puddle of memory. How you
interpret the resulting puddle is up to you; maybe you want to think
of it as a number, maybe as a string, but really it's just a sequence
of bytes.

ChrisA

[toc] | [prev] | [next] | [standalone]

#71901

From	Peter Otten <__peter__@web.de>
Date	2014-05-22 17:34 +0200
Message-ID	<mailman.10225.1400772863.18130.python-list@python.org>
In reply to	#71897

Adam Funk wrote:

> On 2014-05-22, Chris Angelico wrote:
> 
>> On Thu, May 22, 2014 at 11:54 PM, Adam Funk <a24061@ducksburg.com> wrote:
> 
>>> That ties in with a related question I've been wondering about lately
>>> (using MD5s & SHAs for other things) --- getting a hash value (which
>>> is internally numeric, rather than string, right?) out as a hex string
>>> & then converting that to an int looks inefficient to me --- is there
>>> any better way to get an int?  (I haven't seen any other way in the
>>> API.)
>>
>> I don't know that there is, at least not with hashlib. You might be
>> able to use digest() followed by the struct module, but it's no less
>> convoluted. It's the same in several other languages' hashing
>> functions; the result is a string, not an integer.
> 
> Well, J*v* returns a byte array, so I used to do this:
> 
>     digester = MessageDigest.getInstance("MD5");
>     ...
>     digester.reset();
>     byte[] digest = digester.digest(bytes);
>     return new BigInteger(+1, digest);

In Python 3 there's int.from_bytes()

>>> h = hashlib.sha1(b"Hello world")
>>> int.from_bytes(h.digest(), "little")
538059071683667711846616050503420899184350089339

> I dunno why language designers don't make it easy to get a single big
> number directly out of these things.
 
You hardly ever need to manipulate the numerical value of the digest. And on 
its way into the database it will be re-serialized anyway.
 
> I just had a look at the struct module's fearsome documentation &
> think it would present a good shoot(self, foot) opportunity.

[toc] | [prev] | [next] | [standalone]

#71922 — hashing strings to integers (was: hashing strings to integers for sqlite3 keys)

From	Adam Funk <a24061@ducksburg.com>
Date	2014-05-23 11:27 +0100
Subject	hashing strings to integers (was: hashing strings to integers for sqlite3 keys)
Message-ID	<sck35bxdc5.ln2@news.ducksburg.com>
In reply to	#71901

On 2014-05-22, Peter Otten wrote:

> Adam Funk wrote:

>> Well, J*v* returns a byte array, so I used to do this:
>> 
>>     digester = MessageDigest.getInstance("MD5");
>>     ...
>>     digester.reset();
>>     byte[] digest = digester.digest(bytes);
>>     return new BigInteger(+1, digest);
>
> In Python 3 there's int.from_bytes()
>
>>>> h = hashlib.sha1(b"Hello world")
>>>> int.from_bytes(h.digest(), "little")
> 538059071683667711846616050503420899184350089339

Excellent, thanks for pointing that out.  I've just recently started
using Python 3 instead of 2, & appreciate pointers to new things like
that.  The only thing that really bugs me in Python 3 is that execfile
has been removed (I find it useful for testing things interactively).

>> I dunno why language designers don't make it easy to get a single big
>> number directly out of these things.
>  
> You hardly ever need to manipulate the numerical value of the digest. And on 
> its way into the database it will be re-serialized anyway.

I now agree that my original plan to hash strings for the SQLite3
table was pointless, so I've changed the subject header.  :-)

I have had good reason to use int hashes in the past, however.  I was
doing some experiments with Andrei Broder's "sketches of shingles"
technique for finding partial duplication between documents, & you
need integers for that so you can do modulo arithmetic.

I've also used hashes of strings for other things involving
deduplication or fast lookups (because integer equality is faster than
string equality).  I guess if it's just for deduplication, though, a
set of byte arrays is as good as a set of int?

-- 
Classical Greek lent itself to the promulgation of a rich culture,
indeed, to Western civilization.  Computer languages bring us
doorbells that chime with thirty-two tunes, alt.sex.bestiality, and
Tetris clones.                                         (Stoll 1995)

[toc] | [prev] | [next] | [standalone]

#71923 — Re: hashing strings to integers

From	Adam Funk <a24061@ducksburg.com>
Date	2014-05-23 11:36 +0100
Subject	Re: hashing strings to integers
Message-ID	<htk35bxdc5.ln2@news.ducksburg.com>
In reply to	#71922

On 2014-05-23, Adam Funk wrote:

> On 2014-05-22, Peter Otten wrote:

>> In Python 3 there's int.from_bytes()
>>
>>>>> h = hashlib.sha1(b"Hello world")
>>>>> int.from_bytes(h.digest(), "little")
>> 538059071683667711846616050503420899184350089339
>
> Excellent, thanks for pointing that out.  I've just recently started
> using Python 3 instead of 2, & appreciate pointers to new things like
> that.

BTW, I just tested that & it should be "big" for consistency with the
hexdigest:

Python 3.3.2+ (default, Feb 28 2014, 00:52:16) 
[GCC 4.8.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> h0 = hashlib.sha1(bytes('pants', 'UTF-8')).digest()
>>> h1 = hashlib.sha1(bytes('pants', 'UTF-8')).hexdigest()
>>> int.from_bytes(h0, 'big')
1315090007003469710610607131160586168131917298749
>>> int.from_bytes(h0, 'little')
352462323236431222976527983157432783788229548774
>>> int(h1, 16)
1315090007003469710610607131160586168131917298749

Thanks.


-- 
The kid's a hot prospect. He's got a good head for merchandising, an
agent who can take you downtown and one of the best urine samples I've
seen in a long time.                           [Dead Kennedys t-shirt]

[toc] | [prev] | [next] | [standalone]

#71925 — Re: hashing strings to integers

From	Chris Angelico <rosuav@gmail.com>
Date	2014-05-23 21:01 +1000
Subject	Re: hashing strings to integers
Message-ID	<mailman.10240.1400842874.18130.python-list@python.org>
In reply to	#71923

On Fri, May 23, 2014 at 8:36 PM, Adam Funk <a24061@ducksburg.com> wrote:
> BTW, I just tested that & it should be "big" for consistency with the
> hexdigest:

Yes, it definitely should be parsed big-endianly.

ChrisA

[toc] | [prev] | [next] | [standalone]

#71924 — Re: hashing strings to integers (was: hashing strings to integers for sqlite3 keys)

From	Chris Angelico <rosuav@gmail.com>
Date	2014-05-23 20:59 +1000
Subject	Re: hashing strings to integers (was: hashing strings to integers for sqlite3 keys)
Message-ID	<mailman.10239.1400842778.18130.python-list@python.org>
In reply to	#71922

On Fri, May 23, 2014 at 8:27 PM, Adam Funk <a24061@ducksburg.com> wrote:
> I've also used hashes of strings for other things involving
> deduplication or fast lookups (because integer equality is faster than
> string equality).  I guess if it's just for deduplication, though, a
> set of byte arrays is as good as a set of int?

Want a better way to do that? Use a set or dict and let Python do the
hashing. It'll be every bit as fast as explicit hashing, plus you get
the bonus of simplicity.

ChrisA

[toc] | [prev] | [next] | [standalone]

#72118 — Re: hashing strings to integers

From	Adam Funk <a24061@ducksburg.com>
Date	2014-05-27 16:13 +0100
Subject	Re: hashing strings to integers
Message-ID	<amme5bxc3v.ln2@news.ducksburg.com>
In reply to	#71924

On 2014-05-23, Chris Angelico wrote:

> On Fri, May 23, 2014 at 8:27 PM, Adam Funk <a24061@ducksburg.com> wrote:
>> I've also used hashes of strings for other things involving
>> deduplication or fast lookups (because integer equality is faster than
>> string equality).  I guess if it's just for deduplication, though, a
>> set of byte arrays is as good as a set of int?
>
> Want a better way to do that? Use a set or dict and let Python do the
> hashing. It'll be every bit as fast as explicit hashing, plus you get
> the bonus of simplicity.

Well, here's the way it works in my mind:

   I can store a set of a zillion strings (or a dict with a zillion
   string keys), but every time I test "if new_string in
   seen_strings", the computer hashes the new_string using some kind
   of "short hash", checks the set for matching buckets (I'm assuming
   this is how python tests set membership --- is that right?), then
   checks any hash-matches for string equality.  Testing string
   equality is slower than integer equality, and strings (unless they
   are really short) take up a lot more memory than long integers.

So for that kind of thing, I tend to store MD5 hashes or something
similar.  Is that wrong?

-- 
With the breakdown of the medieval system, the gods of chaos, lunacy,
and bad taste gained ascendancy.
                                                --- Ignatius J Reilly

[toc] | [prev] | [next] | [standalone]

#72121 — Re: hashing strings to integers

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2014-05-27 17:02 +0000
Subject	Re: hashing strings to integers
Message-ID	<5384c539$0$29978$c3e8da3$5496439d@news.astraweb.com>
In reply to	#72118

On Tue, 27 May 2014 16:13:46 +0100, Adam Funk wrote:

> On 2014-05-23, Chris Angelico wrote:
> 
>> On Fri, May 23, 2014 at 8:27 PM, Adam Funk <a24061@ducksburg.com>
>> wrote:
>>> I've also used hashes of strings for other things involving
>>> deduplication or fast lookups (because integer equality is faster than
>>> string equality).  I guess if it's just for deduplication, though, a
>>> set of byte arrays is as good as a set of int?
>>
>> Want a better way to do that? Use a set or dict and let Python do the
>> hashing. It'll be every bit as fast as explicit hashing, plus you get
>> the bonus of simplicity.
> 
> Well, here's the way it works in my mind:
> 
>    I can store a set of a zillion strings (or a dict with a zillion
>    string keys), but every time I test "if new_string in seen_strings",
>    the computer hashes the new_string using some kind of "short hash",
>    checks the set for matching buckets (I'm assuming this is how python
>    tests set membership --- is that right?), 

So far so good. That applies to all objects, not just strings.

>    then checks any
>    hash-matches for string equality.  Testing string equality is slower
>    than integer equality, and strings (unless they are really short)
>    take up a lot more memory than long integers.

But presumably you have to keep the string around anyway. It's going to 
be somewhere, you can't just throw the string away and garbage collect 
it. The dict doesn't store a copy of the string, it stores a reference to 
it, and extra references don't cost much.

As for string equality, in pseudo-code it looks something like this:

    def __eq__(self, other):
        # Consider only the case where other is a string as well.
        if self is other:  # object identity
            return True
        if len(self) != len(other):  # checking the length is fast
            return False
        # iterate over both strings in lock-step
        for a in self, b in other:
            if a != b: return False
        return True

only written in fast C code. So the equality test bails out as quickly as 
possible, and the only time you have to look at every character is if the 
two strings are equal but not the same object.

MD5 hashing, on the other hand, *always* has to look at every character, 
and perform quite a complicated set of computations. It's expensive.

> So for that kind of thing, I tend to store MD5 hashes or something
> similar.  Is that wrong?

First rule of optimization: Don't do it.
Second rule of optimization (for experts only): Don't do it yet.

You're making the cardinal mistake of optimization, not what is slow, but 
what you *assume* will be slow. (Or, replace memory consumption for speed 
if you prefer.) Python is not C, and your intuitions for what's fast and 
slow (low and high memory consumption) are likely to be way off. Consider 
trying to store an MD5 hash instead of the string "Hello World!". If I 
try this in Python 2.7, I get this:

py> import md5, sys
py> s = "Hello World!"
py> n = int(md5.md5(s).hexdigest(), 16)
py> sys.getsizeof(s)
33
py> sys.getsizeof(n)
30

I save a lousy 3 bytes. But in order to save that 3 bytes, I have to 
generate an md5 object (32 bytes) and a hex string (53 bytes), both of 
which take time to create and then time to garbage collect.

Actually, I don't even save those 3 bytes, since for nearly all 
reasonable applications, I'll need to keep the string around somewhere. 
So instead of saving three bytes, it's costing me thirty bytes for the 
hash, plus who-knows-what needed to manage the book keeping to link that 
hash to the actual string.

Ah, but maybe we can save time hashing them?

py> from timeit import Timer
py> setup = "from __main__ import s, n"
py> t1 = Timer("hash(s)", setup)
py> t2 = Timer("hash(n)", setup)
py> min(t1.repeat(5))
0.14668607711791992
py> min(t2.repeat(5))
0.15973114967346191

Times are in seconds for one million iterations of the code being timed. 
Or alternatively, it costs about 0.14 microseconds to hash the string, 
and about 0.15 microseconds to hash the MD5 hash. **Not** to calculate 
the MD5 hash in the first place, that takes *much* longer:

py> t3 = Timer("md5.md5(s).hexdigest()", "import md5; s = 'Hello World!'")
py> min(t3.repeat(5))
2.3326950073242188

but merely to perform a lightweight hash on the long int so as to find 
which bucket of the dict it belongs to.

So, yes, chances are **very** good that you're supposed optimizations in 
this area are actually pessimizations. If you have not measured where the 
bottlenecks in your code are, you have no idea which parts of the code 
need to be optimized.

Now, it is conceivable that there may be some circumstances where your 
strategy makes sense. I'm guessing you would need something like this 
before you even come close to saving memory and/or time:

- rather than short keys like "Hello World!", you are dealing with 
  keys that are typically many tens of thousands of characters long;

- most of the strings are the same length and have very similar 
  prefixes (e.g. the first 3000 chars are the same);

- rather than "zillions" of them, there are few enough of them that
  the chances of an MD5 collision is insignificant;

  (Any MD5 collision is going to play havoc with your strategy of
  using hashes as a proxy for the real string.)

- and you can arrange matters so that you never need to MD5 hash a 
  string twice.

Is my guess closer to the actuality than yours? I don't know. I haven't 
measured it either. But I know that Python is a high-level language with 
lots of high-level data structures like dicts which trade-off time and 
memory for programmer convenience, and that I'd want to see some real 
benchmarks proving that my application was too slow before giving up that 
convenience with a complicated strategy like this.

-- 
Steven D'Aprano
http://import-that.dreamwidth.org/

[toc] | [prev] | [next] | [standalone]

#72123 — Re: hashing strings to integers

From	Chris Angelico <rosuav@gmail.com>
Date	2014-05-28 05:16 +1000
Subject	Re: hashing strings to integers
Message-ID	<mailman.10372.1401218186.18130.python-list@python.org>
In reply to	#72121

On Wed, May 28, 2014 at 3:02 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> But I know that Python is a high-level language with
> lots of high-level data structures like dicts which trade-off time and
> memory for programmer convenience, and that I'd want to see some real
> benchmarks proving that my application was too slow before giving up that
> convenience with a complicated strategy like this.

And they trade off that time and memory on the basis of X years of
development expertise. A while ago (about ten years or so, now... wow,
that's quite a while) I wrote a C++ program that needed an
ever-growing array; for simplicity, I went with a very basic system of
doubling the size every time, from a base of something like 1024 or
8192. (Note that it was storing and moving around only pointers, so
it's comparable to Python's list.) That means it has an average 25%
slack space at all times, more until its first allocation, and every
now and then it has a huge job of copying a pile of pointers into a
new array. (Imagine it's now at 16777216 and it needs to add the
16,777,217th string to the array. Bye-bye CPU caches.) These
boundaries became *user-visible pauses*, fortunately at predictable
points, but on a not-terrible computer it could cause a >1s pause just
copying heaps of pointers. How do you think a Python list will
perform, under the same workload (periodic appending of single strings
or small groups of strings, very frequent retrieval based on index -
probably about a 20:1 read:write ratio)? Not only would it be far more
convenient, it's probably going to outperform my hand-rolled code -
unless I'm so brilliant (or lucky) that I can stumble to something as
good as can be achieved with years of dedicated development. Yeah, I
don't think so.

Same goes for hashing algorithms. I can at least boast that I've never
written one of those... although I have several times written a binary
tree of one form or another, in order to provide comparable features
to a dict. And once again, now that I know how convenient and
performant high level languages can be (which may or may not have been
true 15 years ago, but I certainly didn't know it), I don't rewrite
what can be done better by just tossing in a couple of curly braces
and saying "here, map this to that".

ChrisA

[toc] | [prev] | [next] | [standalone]

Page 1 of 2 [1] 2 Next page →

csiph-web

hashing strings to integers for sqlite3 keys

Contents

#71882 — hashing strings to integers for sqlite3 keys

#71884

#71888

#71892

#71895

#71885

#71890

#71886

#71891

#71893

#71897

#71898

#71901

#71922 — hashing strings to integers (was: hashing strings to integers for sqlite3 keys)

#71923 — Re: hashing strings to integers

#71925 — Re: hashing strings to integers

#71924 — Re: hashing strings to integers (was: hashing strings to integers for sqlite3 keys)

#72118 — Re: hashing strings to integers

#72121 — Re: hashing strings to integers

#72123 — Re: hashing strings to integers