Path: csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!feeder.erje.net!eu.feeder.erje.net!news.swapon.de!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail From: Adam Funk Newsgroups: comp.lang.python Subject: Re: hashing strings to integers for sqlite3 keys Date: Thu, 22 May 2014 14:47:34 +0100 Organization: $CABAL Lines: 42 Message-ID: References: X-Trace: individual.net cY1E60c9VCYmmmDec0uqWwBdbJbKiqxNQKRiHZOylXlAPEzoOw X-Orig-Path: news.ducksburg.com!not-for-mail Cancel-Lock: sha1:SQG4QFflSLSF0wRBNqLrsHSwE98= sha1:RRTWOe3BOtG3SR1ap9MctJbuY0k= User-Agent: slrn/pre1.0.2-9 (Linux) Xref: csiph.com comp.lang.python:71890 On 2014-05-22, Chris Angelico wrote: > On Thu, May 22, 2014 at 9:47 PM, Adam Funk wrote: >> I'm using Python 3.3 and the sqlite3 module in the standard library. >> I'm processing a lot of strings from input files (among other things, >> values of headers in e-mail & news messages) and suppressing >> duplicates using a table of seen strings in the database. >> >> It seems to me --- from past experience with other things, where >> testing integers for equality is faster than testing strings, as well >> as from reading the SQLite3 documentation about INTEGER PRIMARY KEY >> --- that the SELECT tests should be faster if I am looking up an >> INTEGER PRIMARY KEY value rather than TEXT PRIMARY KEY. Is that >> right? > > It might be faster to use an integer primary key, but the possibility > of even a single collision means you can't guarantee uniqueness > without a separate check. I don't know sqlite3 well enough to say, but > based on what I know of PostgreSQL, it's usually best to make your > schema mimic your logical structure, rather than warping it for the > sake of performance. With a good indexing function, the performance of > a textual PK won't be all that much worse than an integral one, and > everything you do will read correctly in the code - no fiddling around > with hashes and collision checks. > > Stick with the TEXT PRIMARY KEY and let the database do the database's > job. If you're processing a really large number of strings, you might > want to consider moving from sqlite3 to PostgreSQL anyway (I've used > psycopg2 quite happily), as you'll get better concurrency; and that > might solve your performance problem as well, as Pg plays very nicely > with caches. Well, actually I'm thinking about doing away with checking for duplicates at this stage, since the substrings that I pick out of the deduplicated header values go into another table as the TEXT PRIMARY KEY anyway, with deduplication there. So I think this stage reeks of premature optimization. -- The history of the world is the history of a privileged few. --- Henry Miller