
hashing strings to integers for sqlite3 keys

Started by Adam Funk <a24061@ducksburg.com>
First post 2014-05-22 12:47 +0100
Last post 2014-05-22 14:48 +0000
Articles 6 on this page of 26 — 8 participants



Contents

  hashing strings to integers for sqlite3 keys Adam Funk <a24061@ducksburg.com> - 2014-05-22 12:47 +0100
    Re: hashing strings to integers for sqlite3 keys Peter Otten <__peter__@web.de> - 2014-05-22 14:58 +0200
      Re: hashing strings to integers for sqlite3 keys Adam Funk <a24061@ducksburg.com> - 2014-05-22 14:41 +0100
        Re: hashing strings to integers for sqlite3 keys Chris Angelico <rosuav@gmail.com> - 2014-05-23 00:08 +1000
          Re: hashing strings to integers for sqlite3 keys Adam Funk <a24061@ducksburg.com> - 2014-05-22 15:40 +0100
    Re: hashing strings to integers for sqlite3 keys Chris Angelico <rosuav@gmail.com> - 2014-05-22 23:03 +1000
      Re: hashing strings to integers for sqlite3 keys Adam Funk <a24061@ducksburg.com> - 2014-05-22 14:47 +0100
    Re: hashing strings to integers for sqlite3 keys Tim Chase <python.list@tim.thechases.com> - 2014-05-22 08:09 -0500
      Re: hashing strings to integers for sqlite3 keys Adam Funk <a24061@ducksburg.com> - 2014-05-22 14:54 +0100
        Re: hashing strings to integers for sqlite3 keys Chris Angelico <rosuav@gmail.com> - 2014-05-23 00:14 +1000
          Re: hashing strings to integers for sqlite3 keys Adam Funk <a24061@ducksburg.com> - 2014-05-22 15:47 +0100
            Re: hashing strings to integers for sqlite3 keys Chris Angelico <rosuav@gmail.com> - 2014-05-23 01:09 +1000
            Re: hashing strings to integers for sqlite3 keys Peter Otten <__peter__@web.de> - 2014-05-22 17:34 +0200
              hashing strings to integers (was: hashing strings to integers for sqlite3 keys) Adam Funk <a24061@ducksburg.com> - 2014-05-23 11:27 +0100
                Re: hashing strings to integers Adam Funk <a24061@ducksburg.com> - 2014-05-23 11:36 +0100
                  Re: hashing strings to integers Chris Angelico <rosuav@gmail.com> - 2014-05-23 21:01 +1000
                Re: hashing strings to integers (was: hashing strings to integers for sqlite3 keys) Chris Angelico <rosuav@gmail.com> - 2014-05-23 20:59 +1000
                  Re: hashing strings to integers Adam Funk <a24061@ducksburg.com> - 2014-05-27 16:13 +0100
                    Re: hashing strings to integers Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-05-27 17:02 +0000
                      Re: hashing strings to integers Chris Angelico <rosuav@gmail.com> - 2014-05-28 05:16 +1000
                      Re: hashing strings to integers Dan Sommers <dan@tombstonezero.net> - 2014-05-28 01:55 +0000
                        Re: hashing strings to integers Adam Funk <a24061@ducksburg.com> - 2014-06-03 11:29 +0100
                      Re: hashing strings to integers Adam Funk <a24061@ducksburg.com> - 2014-06-03 11:32 +0100
                Re: hashing strings to integers Terry Reedy <tjreedy@udel.edu> - 2014-05-23 15:10 -0400
                  Re: hashing strings to integers Adam Funk <a24061@ducksburg.com> - 2014-05-27 16:20 +0100
    Re: hashing strings to integers for sqlite3 keys alister <alister.nospam.ware@ntlworld.com> - 2014-05-22 14:48 +0000

Page 2 of 2 — ← Prev page 1 [2]


#72138 — Re: hashing strings to integers

From Dan Sommers <dan@tombstonezero.net>
Date 2014-05-28 01:55 +0000
Subject Re: hashing strings to integers
Message-ID <lm3fm3$eb5$1@dont-email.me>
In reply to #72121
On Tue, 27 May 2014 17:02:50 +0000, Steven D'Aprano wrote:

> - rather than "zillions" of them, there are few enough of them that
>  the chances of an MD5 collision is insignificant;

>   (Any MD5 collision is going to play havoc with your strategy of
>   using hashes as a proxy for the real string.)

> - and you can arrange matters so that you never need to MD5 hash a
>   string twice.

Hmmm...  I'll use the MD5 hashes of the strings as a key, and the
strings as the value (to detect MD5 collisions) ...

(But I'm sure that Steven was just waiting for someone to take that
bait...)



#72502 — Re: hashing strings to integers

From Adam Funk <a24061@ducksburg.com>
Date 2014-06-03 11:29 +0100
Subject Re: hashing strings to integers
Message-ID <9lk06bx87j.ln2@news.ducksburg.com>
In reply to #72138
On 2014-05-28, Dan Sommers wrote:

> On Tue, 27 May 2014 17:02:50 +0000, Steven D'Aprano wrote:
>
>> - rather than "zillions" of them, there are few enough of them that
>>  the chances of an MD5 collision is insignificant;
>
>>   (Any MD5 collision is going to play havoc with your strategy of
>>   using hashes as a proxy for the real string.)
>
>> - and you can arrange matters so that you never need to MD5 hash a
>>   string twice.
>
> Hmmm...  I'll use the MD5 hashes of the strings as a key, and the
> strings as the value (to detect MD5 collisions) ...

Hey, I'm not *that* stupid.


-- 
In the 1970s, people began receiving utility bills for
-£999,999,996.32 and it became harder to sustain the 
myth of the infallible electronic brain. (Verity Stob)



#72503 — Re: hashing strings to integers

From Adam Funk <a24061@ducksburg.com>
Date 2014-06-03 11:32 +0100
Subject Re: hashing strings to integers
Message-ID <crk06bx87j.ln2@news.ducksburg.com>
In reply to #72121
On 2014-05-27, Steven D'Aprano wrote:

> On Tue, 27 May 2014 16:13:46 +0100, Adam Funk wrote:

>> Well, here's the way it works in my mind:
>> 
>>    I can store a set of a zillion strings (or a dict with a zillion
>>    string keys), but every time I test "if new_string in seen_strings",
>>    the computer hashes the new_string using some kind of "short hash",
>>    checks the set for matching buckets (I'm assuming this is how python
>>    tests set membership --- is that right?), 
>
> So far so good. That applies to all objects, not just strings.
>
>
>>    then checks any
>>    hash-matches for string equality.  Testing string equality is slower
>>    than integer equality, and strings (unless they are really short)
>>    take up a lot more memory than long integers.
>
> But presumably you have to keep the string around anyway. It's going to 
> be somewhere, you can't just throw the string away and garbage collect 
> it. The dict doesn't store a copy of the string, it stores a reference to 
> it, and extra references don't cost much.

In the case where I did something like that, I wasn't keeping copies
of the strings in memory after hashing (& otherwise processing them).
I know that putting the strings' pointers in the set is a light memory
load.



[snipping the rest because...]

You've convinced me.  Thanks.



-- 
I heard that Hans Christian Andersen lifted the title for "The Little
Mermaid" off a Red Lobster Menu.                         [Bucky Katt]
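
For reference, the approach the exchange above settles on (let a plain
set of strings do the hashing and the equality checks) can be sketched
roughly as follows. The function name unique_strings is made up for
illustration and does not come from the thread.

```python
def unique_strings(strings):
    """Yield each string the first time it is seen, skipping repeats.

    Hypothetical helper: the set hashes each string and compares
    candidates for equality internally, so no separate MD5 step is
    needed to suppress duplicates.
    """
    seen = set()
    for s in strings:
        if s not in seen:
            seen.add(s)
            yield s


headers = ["<a@example.org>", "<b@example.org>", "<a@example.org>"]
print(list(unique_strings(headers)))
```

This keeps a reference to every distinct string in memory, which is the
trade-off discussed in the quoted text.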



#71940 — Re: hashing strings to integers

From Terry Reedy <tjreedy@udel.edu>
Date 2014-05-23 15:10 -0400
Subject Re: hashing strings to integers
Message-ID <mailman.10252.1400872237.18130.python-list@python.org>
In reply to #71922
On 5/23/2014 6:27 AM, Adam Funk wrote:

> that.  The only thing that really bugs me in Python 3 is that execfile
> has been removed (I find it useful for testing things interactively).

The spelling has been changed to exec(open(...).read(), ... . If you use 
it a lot, add a customized def execfile(filename, ... to your site 
module or local utils module.

Execfile was a separate statement *only* because exec was a statement. 
Once exec was changed to a function taking arguments, that 
justification disappeared.

Terry Jan Reedy
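
A minimal sketch of the kind of helper suggested above, for Python 3.
The parameter names and the default namespace are assumptions made for
illustration. Unlike Python 2's execfile, this version does not default
to the caller's namespace, so pass globals() explicitly to get that
behaviour in an interactive session.

```python
def execfile(filename, globals_=None, locals_=None):
    """Rough Python 3 stand-in for Python 2's execfile().

    Runs the file in the given namespace and returns the globals dict
    that was used. If no namespace is given, a fresh one is created.
    """
    if globals_ is None:
        globals_ = {"__name__": "__main__", "__file__": filename}
    with open(filename, "rb") as f:
        source = f.read()
    # Compiling with the real filename keeps tracebacks readable.
    code = compile(source, filename, "exec")
    exec(code, globals_, locals_)
    return globals_
```

At the interactive prompt, execfile("script.py", globals()) would run
the file in the current session's namespace (script.py is a placeholder
name).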



#72119 — Re: hashing strings to integers

From Adam Funk <a24061@ducksburg.com>
Date 2014-05-27 16:20 +0100
Subject Re: hashing strings to integers
Message-ID <53ne5bxen.ln2@news.ducksburg.com>
In reply to #71940
On 2014-05-23, Terry Reedy wrote:

> On 5/23/2014 6:27 AM, Adam Funk wrote:
>
>> that.  The only thing that really bugs me in Python 3 is that execfile
>> has been removed (I find it useful for testing things interactively).
>
> The spelling has been changed to exec(open(...).read(), ... . If you use 
> it a lot, add a customized def execfile(filename, ... to your site 
> module or local utils module.

Are you talking about this?

https://docs.python.org/3/library/site.html

Is there a dummies/quick-start guide to using USER_SITE stuff?


-- 
No sport is less organized than Calvinball!
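
As a rough starting point for the USER_SITE question above, the site
module from the linked documentation can report where the per-user
directory lives. This is only a sketch of how to inspect the setting,
not a full guide.

```python
import site

# True when the per-user site directory is enabled for this interpreter
# (it can be False, or None if disabled by the user, e.g. inside a
# virtual environment or when Python is started with -s).
print(site.ENABLE_USER_SITE)

# Path of the per-user site-packages directory; it may not exist yet.
print(site.getusersitepackages())
```

According to the site documentation, a module named usercustomize
placed in that directory is imported at startup when the user site
directory is enabled, so a helper defined there would need to be made
available (for example via builtins) to be usable at the prompt.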



#71896

From alister <alister.nospam.ware@ntlworld.com>
Date 2014-05-22 14:48 +0000
Message-ID <T6ofv.184276$kM7.6749@fx27.am4>
In reply to #71882
On Thu, 22 May 2014 12:47:31 +0100, Adam Funk wrote:

> I'm using Python 3.3 and the sqlite3 module in the standard library. I'm
> processing a lot of strings from input files (among other things, values
> of headers in e-mail & news messages) and suppressing duplicates using a
> table of seen strings in the database.
> 
> It seems to me --- from past experience with other things, where testing
> integers for equality is faster than testing strings, as well as from
> reading the SQLite3 documentation about INTEGER PRIMARY KEY --- that the
> SELECT tests should be faster if I am looking up an INTEGER PRIMARY KEY
> value rather than TEXT PRIMARY KEY.  Is that right?
> 
> If so, what sort of hashing function should I use?  The "maxint" for
> SQLite3 is a lot smaller than the size of even MD5 hashes.  The only
> thing I've thought of so far is to use MD5 or SHA-something modulo the
> maxint value.  (Security isn't an issue --- i.e., I'm not worried about
> someone trying to create a hash collision.)
> 
> Thanks,
> Adam

why not just set the field in the DB to be unique & then catch the error 
when you try to write a duplicate?

let the DB engine handle the task


-- 
Your step will soil many countries.
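
A minimal sketch of the suggestion above using the standard sqlite3
module. The table name seen, the column name value and the helper
add_if_new are invented for illustration. The column is declared TEXT
PRIMARY KEY so that SQLite rejects duplicates itself.

```python
import sqlite3


def add_if_new(conn, text):
    """Insert text; return True if it was new, False if a duplicate."""
    try:
        # The connection as a context manager commits on success and
        # rolls back if the INSERT raises.
        with conn:
            conn.execute("INSERT INTO seen (value) VALUES (?)", (text,))
        return True
    except sqlite3.IntegrityError:
        # Raised when the uniqueness constraint on the key is violated.
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seen (value TEXT PRIMARY KEY)")

print(add_if_new(conn, "Subject: hello"))  # True, first time seen
print(add_if_new(conn, "Subject: hello"))  # False, duplicate rejected
```

With this layout the database does the lookup through its own index on
the text column, so no separate hashing step is involved.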


