Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #73784
| Date | 2014-07-01 13:13 +0100 |
|---|---|
| From | MRAB <python@mrabarnett.plus.com> |
| Subject | Re: Searching for lots of similar strings (filenames) in sqlite3 database |
| References | <tgia8bx2sf.ln2@news.ducksburg.com> |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.11363.1404216846.18130.python-list@python.org> (permalink) |
On 2014-07-01 12:26, Adam Funk wrote:
> I have some code that reads files in a leafnode2 news spool & needs to
> check for new files periodically. The full paths are all like
> '/var/spool/news/message.id/345/<123456@example.com>' with a 3-digit
> subdirectory & a Message-ID for the filename itself. I'm using Python
> 3 & sqlite3 in the standard library.
>
> I have a table of filenames created with the following command:
>
> cursor.execute('CREATE TABLE files (filename TEXT PRIMARY KEY, used INTEGER)')
>
> To check for new files in one of the subdirectories, I run A then
> either B or C below (I've tried both).
>
> A.
> listing1 = os.listdir(directory)
> listing [os.path.join(directory, x) for x in listing1]
>
> B.
> cursor = db_conn.cursor()
> for filename in listing:
> cursor.execute('SELECT filename FROM files WHERE filename IS ?', (filename,))
> row = cursor.fetchone()
> if not row:
> cursor.execute('INSERT INTO files VALUES (?, ?)', (filename, 0) )
> files_new += 1
> db_conn.commit()
>
> C.
> cursor = db_conn.cursor()
> subdir_like = directory + '/%'
> cursor.execute('SELECT filename FROM files WHERE filename LIKE ?', (subdir_like,))
> rows = cursor.fetchall()
> known_files = [row[0] for row in rows]
> for filename in listing:
> if filename not in known_files:
> cursor.execute('INSERT INTO files VALUES (?, ?)', (filename, 0) )
> files_new += 1
> db_conn.commit()
>
> A+B was the first method I came up with, because it looks like the
> "keep it simple & let the database do its job" approach, but it was
> very time-consuming, so I tested A+C out. A is quick (a second); B
> can drag on for over an hour to check 2000 filenames (for example) in
> a subdirectory; C always takes less than a minute. So C is much
> better than B, but it looks (to me) like one of those attempts to
> bypass & ignore the database's built-in optimizations.
>
> Comments?
>
In C, 'known_files' is a list, so it performs a linear search for each
of the filenames. If you make 'known_files' a set, it'll probably be
even faster!
Anyway, I'm sure there's something in SQL for "insert or update" or "on
duplicate", but that's an SQL question, not a Python question.
Back to comp.lang.python | Previous | Next — Previous in thread | Next in thread | Find similar | Unroll thread
Searching for lots of similar strings (filenames) in sqlite3 database Adam Funk <a24061@ducksburg.com> - 2014-07-01 12:26 +0100
Re: Searching for lots of similar strings (filenames) in sqlite3 database Chris Angelico <rosuav@gmail.com> - 2014-07-01 22:06 +1000
Re: Searching for lots of similar strings (filenames) in sqlite3 database Adam Funk <a24061@ducksburg.com> - 2014-07-01 16:15 +0100
Re: Searching for lots of similar strings (filenames) in sqlite3 database Chris Angelico <rosuav@gmail.com> - 2014-07-02 01:57 +1000
Re: Searching for lots of similar strings (filenames) in sqlite3 database Adam Funk <a24061@ducksburg.com> - 2014-07-02 10:30 +0100
Re: Searching for lots of similar strings (filenames) in sqlite3 database Adam Funk <a24061@ducksburg.com> - 2014-07-02 10:32 +0100
Re: Searching for lots of similar strings (filenames) in sqlite3 database Chris Angelico <rosuav@gmail.com> - 2014-07-02 20:03 +1000
Re: Searching for lots of similar strings (filenames) in sqlite3 database Adam Funk <a24061@ducksburg.com> - 2014-07-02 13:30 +0100
Re: Searching for lots of similar strings (filenames) in sqlite3 database MRAB <python@mrabarnett.plus.com> - 2014-07-01 13:13 +0100
Re: Searching for lots of similar strings (filenames) in sqlite3 database Chris Angelico <rosuav@gmail.com> - 2014-07-02 00:02 +1000
csiph-web