Path: csiph.com!usenet.pasdenom.info!news.albasani.net!rt.uk.eu.org!newsfeed.xs4all.nl!newsfeed4.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail Return-Path: X-Original-To: python-list@python.org Delivered-To: python-list@mail.python.org X-Spam-Status: OK 0.000 X-Spam-Evidence: '*H*': 1.00; '*S*': 0.00; 'python,': 0.02; 'much!': 0.05; 'method.': 0.07; 'sql.': 0.07; 'suppose': 0.07; 'table.': 0.07; 'string': 0.09; 'consistency': 0.09; 'escape': 0.09; 'filename': 0.09; 'filenames': 0.09; 'literal': 0.09; 'oh,': 0.09; 'underscore': 0.09; 'cc:addr:python-list': 0.11; 'python': 0.11; 'itself.': 0.14; "wouldn't": 0.14; '"is': 0.16; '"is"': 0.16; "?',": 0.16; 'adam': 0.16; 'ah!': 0.16; 'ah,': 0.16; 'backslash': 0.16; 'backslash,': 0.16; 'better:': 0.16; 'bonus,': 0.16; 'delimiters': 0.16; 'digits.': 0.16; 'from:addr:rosuav': 0.16; 'from:name:chris angelico': 0.16; 'operator.': 0.16; 'quoted': 0.16; 'regex,': 0.16; 'sqlite': 0.16; 'subject:sqlite3': 0.16; 'substitution': 0.16; 'url:sqlite': 0.16; 'wrote:': 0.18; 'library': 0.18; 'wed,': 0.18; 'normally': 0.19; '(the': 0.22; '>>>': 0.22; 'handles': 0.22; 'separate': 0.22; 'cc:addr:python.org': 0.22; "shouldn't": 0.24; 'string,': 0.24; 'question': 0.24; 'cc:2**0': 0.24; 'define': 0.26; 'possibly': 0.26; 'skip:" 30': 0.26; 'header:In-Reply-To:1': 0.27; 'rest': 0.29; 'chris': 0.29; 'feature': 0.29; 'am,': 0.29; 'character': 0.29; 'subject:) ': 0.29; 'see,': 0.30; 'message- id:@mail.gmail.com': 0.30; "i'm": 0.30; 'getting': 0.31; 'that.': 0.31; '(possibly': 0.31; 'equality': 0.31; 'searches': 0.31; 'though.': 0.31; 'probably': 0.32; 'critical': 0.32; 'supposed': 0.32; "i'd": 0.34; 'subject: (': 0.35; 'something': 0.35; 'case,': 0.35; 'test': 0.35; 'but': 0.35; 'received:google.com': 0.35; 'there': 0.35; 'data,': 0.36; "didn't": 0.36; 'thanks': 0.36; 'possible': 0.36; 'should': 0.36; 'too': 0.37; 'two': 0.37; 'files': 0.38; 'pm,': 0.38; 'rather': 0.38; 'little': 0.38; 'expect': 0.39; 'sure': 0.39; 'either': 0.39; 'how': 0.40; 'even': 0.60; 'remove': 0.60; "you're": 0.61; 'guarantee': 0.63; 'name': 0.63; 'valuable': 0.63; 'different': 0.65; 'levels': 0.65; 'worth': 0.66; 'here': 0.66; 'percent': 0.68; 'safe': 0.72; 'url:htm': 0.73; 'carefully': 0.74; 'jul': 0.74; 'avoids': 0.84; 'safe.': 0.84; 'significance': 0.84; 'absolutely': 0.87; 'engines.': 0.91; 'to:none': 0.92 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:cc :content-type; bh=0TjMge8D610cmUEKY7i4gKG4EQOnRfpvNaX/Px2qyfo=; b=q6qEJO6fyjD6CRw/GfO5zEloJDk5a5qQnb5R4Oy45avzPCLz8GOka+jjdPPynDg1nd MTQcOhZrPEm8O1HSSxxs2TUqM01VznjasiZ92251usKKekf6KU3jA+4ITvxT+htr2Qn9 8x5d/SKSgy5fRJQvNWSkMcdk031RLMcasbJpYNTpLV965k/NcCVrCHkeHHi5sI4x+ywv qgU4OZX5D1f7aCmBxzHQBXYtENqVW3sX2fBE/vRZB2vzeU6ZrJh8E6nFocrrhjJYCt0k NIqYnQt5/En3NIzjdO4mxJ3HahpmXRYxunEnLaMktmWsmuWrvIRqrq4CZ4o5GWmPjFmr DKaQ== MIME-Version: 1.0 X-Received: by 10.221.68.135 with SMTP id xy7mr923919vcb.65.1404230241211; Tue, 01 Jul 2014 08:57:21 -0700 (PDT) In-Reply-To: <3uva8bx3dn.ln2@news.ducksburg.com> References: <3uva8bx3dn.ln2@news.ducksburg.com> Date: Wed, 2 Jul 2014 01:57:21 +1000 Subject: Re: Searching for lots of similar strings (filenames) in sqlite3 database From: Chris Angelico Cc: "python-list@python.org" Content-Type: text/plain; charset=UTF-8 X-BeenThere: python-list@python.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: General discussion list for the Python programming language List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Newsgroups: comp.lang.python Message-ID: Lines: 70 NNTP-Posting-Host: 2001:888:2000:d::a6 X-Trace: 1404230249 news.xs4all.nl 2925 [2001:888:2000:d::a6]:51213 X-Complaints-To: abuse@xs4all.nl Xref: csiph.com comp.lang.python:73795 On Wed, Jul 2, 2014 at 1:15 AM, Adam Funk wrote: > On 2014-07-01, Chris Angelico wrote: > >> On Tue, Jul 1, 2014 at 9:26 PM, Adam Funk wrote: >>> cursor.execute('SELECT filename FROM files WHERE filename IS ?', (filename,)) >> >> Shouldn't this be an equality check rather than IS, which normally I'd >> expect to be "IS NULL" or "IS NOT NULL"? > > Oh, it probably should be in "heavy" SQL. In SQLite, '==', '=', & > 'IS' are interchangeable. > > http://www.tutorialspoint.com/sqlite/sqlite_operators.htm Ah, okay. In that case, I'd advise going with either == for consistency with the rest of Python, or (preferably) = for consistency with other SQL engines. You wouldn't use "is" to test if two Python strings are equal, so there's no particular reason to use it here :) > Oh, even better: > > add_files = listing - known_files > delete_files = known_files - listing > > and then I can remove files that have disappeared off the spool from > the table. Thanks very much! Ah! Didn't know that was a valuable feature for you, but getting that "for free" is an extra little bonus, so that's awesome! >> There is one critical consideration, though. What happens if the >> directory name contains an underscore or percent sign? Or can you >> absolutely guarantee that they won't? You may need to escape them, and >> I'm not sure how SQLite handles that. (Possibly \_ will match literal >> _, and \\ will match literal \, or something like that.) > > I can guarantee that the directory names are all > '/var/spool/news/message.id/' then 3 digits. (The filenames are > pretty wild, since they are MIDs.) AIUI, using the '?' substitution > in the sqlite3 library is supposed to be safe. This is nothing to do with question-mark substitution. There are two separate levels of character significance here - it's like a quoted string with a regex. Suppose you want to make a regex that searches for an apostrophe. If you try to define that in a single-quoted string, you need to escape it: regex = '^\'$' However, if you ask the user to enter a regex, that wouldn't be necessary: regex = input("Enter a pattern: ") # raw_input in Python 2 Enter a pattern: ^'$ This is what the question mark substitution is like - it avoids the need to carefully manage string delimiters and so on. However, if you want to make a regex that searches for a backslash, then you need to escape it, because the backslash is important to the regex itself. In the same way, the underscore and percent sign are significant to the LIKE operator. If it were possible to have a directory name with a percent sign in it, it would match far too much - because you'd construct a LIKE pattern something like (ahem) "/var/spool/news/message%20id/142/%" - and as you can see, the percent sign at the end is no different from the percent sign in the middle. But you're safe because you know your data, unrelated to your substitution method. Possibly merits a comment... but possibly not worth it. ChrisA