From: Tim Chase
Newsgroups: comp.lang.python
Subject: Re: Regular expressions
Date: Thu, 5 Nov 2015 08:00:00 -0600

On 2015-11-05 23:05, Steven D'Aprano wrote:
> Oh the shame, I knew that. Somehow I tangled myself in a knot,
> thinking that it had to be 1 *followed by* zero or more characters.
> But of course it's not a glob, it's a regex.

But that's a good reminder of the fnmatch/glob modules too.  Sometimes
all you need is to express a simple glob, in which case using a
regexp can cloud the clarity.

The overarching principle is to go for clarity & simplicity, rather
than favoring any one of built-ins/glob/regex/parser modules all the
time.

Want to test for presence in a string?  Just use the builtin
"a in b" test.  At the beginning/end?  Use .startswith()/.endswith()
for clarity.  Need to check if a string is purely
digits/alpha/alphanumerics/etc?  Use the
.is{alnum,alpha,decimal,digit,identifier,lower,numeric,printable,space,title,upper}
methods on the string.

For simple wild-carding, use the fnmatch module to do simple
globbing.

For more complex pattern matching, you've got regexps.

Finally, for occasions when you're searching for repeated/nested
structures, using an add-on module like pyparsing will give you
clearer code.

Oh, and with regexps, people should be less afraid of verbose
multi-line strings with commenting:

  r = re.compile(r"""
    ^                    # start of the string
    (?P<year>\d{4})      # capture 4 digits
    -                    # a literal dash
    (?P<month>\d{1,2})   # capture 1-2 digits
    -                    # another literal dash
    (?P<day>\d{1,2})     # capture 1-2 digits
    _                    # a literal underscore
    (?P<account>         # capture the account-number
      [A-Z]{1,3}         # 1-3 letters
      \d+                # followed by 1+ digits
    )
    \.txt                # the extension of the file (ignored)
    $                    # the end of the string
    """, re.VERBOSE)

They are a LOT easier to come back to if you haven't touched the code
for a year.

-tkc
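
To make the "simplest tool that does the job" ladder concrete, here is a minimal sketch that walks one filename through each step, ending with a verbose regex. The filename, the glob pattern, the group names (year, month, day, account), and the helper parse_filename() are all assumptions made up for illustration; only the stdlib calls themselves are real.

```python
import fnmatch
import re

# Hypothetical filename, invented for this sketch.
name = "2015-11-05_ABC1234.txt"

# 1. Presence in a string: the "in" operator.
has_underscore = "_" in name

# 2. Beginning/end: str.startswith() / str.endswith().
is_2015 = name.startswith("2015-")
is_txt = name.endswith(".txt")

# 3. Purely digits/alpha/etc.: the str.is*() methods.
year_is_digits = name[:4].isdigit()

# 4. Simple wild-carding: fnmatchcase() is case-sensitive on every
#    platform, unlike fnmatch(), which normalizes case per OS.
looks_right = fnmatch.fnmatchcase(name, "????-*-*_*.txt")

# 5. More complex matching: a commented regex using re.VERBOSE.
#    The group names are assumed, not taken from any real format.
FILENAME_RE = re.compile(r"""
    ^                    # start of the string
    (?P<year>\d{4})      # 4 digits
    -                    # a literal dash
    (?P<month>\d{1,2})   # 1-2 digits
    -                    # another literal dash
    (?P<day>\d{1,2})     # 1-2 digits
    _                    # a literal underscore
    (?P<account>         # the account number:
      [A-Z]{1,3}         #   1-3 uppercase letters
      \d+                #   followed by 1+ digits
    )
    \.txt                # the file extension (not captured)
    $                    # the end of the string
    """, re.VERBOSE)


def parse_filename(filename):
    """Return the captured fields as a dict, or None if there is no match."""
    m = FILENAME_RE.match(filename)
    return m.groupdict() if m else None


fields = parse_filename(name)
print(fields)
```

Steps 1-4 only answer yes/no questions; the regex earns its keep at step 5 because it also pulls the pieces out by name. All captured values are strings, so convert with int() if you need numbers.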