From: Adam Funk
Newsgroups: comp.lang.python
Subject: Re: hashing strings to integers for sqlite3 keys
Date: Thu, 22 May 2014 14:41:17 +0100
Organization: $CABAL

On 2014-05-22, Peter Otten wrote:

> Adam Funk wrote:
>
>> I'm using Python 3.3 and the sqlite3 module in the standard library.
>> I'm processing a lot of strings from input files (among other things,
>> values of headers in e-mail & news messages) and suppressing
>> duplicates using a table of seen strings in the database.
>>
>> It seems to me --- from past experience with other things, where
>> testing integers for equality is faster than testing strings, as well
>> as from reading the SQLite3 documentation about INTEGER PRIMARY KEY
>> --- that the SELECT tests should be faster if I am looking up an
>> INTEGER PRIMARY KEY value rather than TEXT PRIMARY KEY.  Is that
>> right?
>
> My gut feeling tells me that this would matter more for join operations than
> lookup of a value. If you plan to do joins you could use an autoinc integer
> as the primary key and an additional string key for lookup.

I'm not doing any join operations.  I'm using sqlite3 for storing big
piles of data & persistence between runs --- not really "proper
relational database use".  In this particular case, I'm getting header
values out of messages & doing this:

    for this_string in these_strings:
        if not already_seen(this_string):
            process(this_string)
        # ignore if already seen

...
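
For concreteness, the loop above could be backed by a hashed INTEGER
PRIMARY KEY roughly as follows.  This is only a sketch under stated
assumptions, not the code actually in use: the table name `seen`, the
column name `h`, the helpers `string_to_key()` and `open_seen_db()`,
the choice of SHA-1 truncated to 64 bits, and the extra `conn`
argument to `already_seen()` are all made up for illustration.

```python
import hashlib
import sqlite3


def string_to_key(s):
    """Map a string to a signed 64-bit integer (hypothetical helper)."""
    digest = hashlib.sha1(s.encode("utf-8")).digest()
    # SQLite INTEGER values are signed 64-bit, so keep 8 bytes and
    # interpret them as a signed number.
    return int.from_bytes(digest[:8], "big", signed=True)


def open_seen_db(path=":memory:"):
    """Open the database and create the one-column table if needed."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS seen (h INTEGER PRIMARY KEY)")
    return conn


def already_seen(conn, s):
    """Return True if s was recorded before; else record it, return False."""
    key = string_to_key(s)
    row = conn.execute("SELECT 1 FROM seen WHERE h = ?", (key,)).fetchone()
    if row is not None:
        return True
    conn.execute("INSERT INTO seen (h) VALUES (?)", (key,))
    return False


conn = open_seen_db()
these_strings = ["<a@example.org>", "<b@example.org>", "<a@example.org>"]
processed = []
for this_string in these_strings:
    if not already_seen(conn, this_string):
        processed.append(this_string)  # stand-in for process(this_string)
    # ignore if already seen
conn.commit()
```

One caveat with storing only the hashes: two different strings that
happen to hash to the same 64-bit value would be treated as
duplicates.  That is unlikely at this key size but not impossible.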
> and only if you can demonstrate a significant speedup keep the complication
> in your code.
>
> If you find such a speedup I'd like to see the numbers because this cries
> PREMATURE OPTIMIZATION...

On further reflection, I think I asked for that.  In fact, the table
I'm using only has one column for the hashes --- I wasn't going to
store the strings at all in order to save disk space (maybe my mind is
stuck in the 1980s).

-- 
But the government always tries to coax well-known writers into the
Establishment; it makes them feel educated.       [Robert Graves]