Path: csiph.com!x330-a1.tempe.blueboxinc.net!usenet.pasdenom.info!gegeweb.org!de-l.enfer-du-nord.net!feeder2.enfer-du-nord.net!rt.uk.eu.org!news-transit.tcx.org.uk!newsfeed.xs4all.nl!newsfeed6.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail Return-Path: X-Original-To: python-list@python.org Delivered-To: python-list@mail.python.org X-Spam-Status: OK 0.022 X-Spam-Evidence: '*H*': 0.96; '*S*': 0.00; 'infinite': 0.07; 'python': 0.08; '>>>>': 0.09; 'am,': 0.12; '"should': 0.16; 'abort': 0.16; 'assertion': 0.16; 'matched': 0.16; 'repetition': 0.16; 'russ': 0.16; 'subject:expression': 0.16; 'subject:regular': 0.16; 'wording': 0.16; 'meant': 0.17; 'wrote:': 0.18; 'instance': 0.18; 'repeated': 0.18; 'jan': 0.19; "haven't": 0.20; '(most': 0.21; "aren't": 0.21; 'input': 0.22; "doesn't": 0.22; 'header:In-Reply- To:1': 0.22; 'explains': 0.24; "python's": 0.24; 'traceback': 0.24; 'obviously': 0.25; "i'm": 0.26; 'import': 0.27; 'tried': 0.27; 'url:mailman': 0.28; 'invalid': 0.28; 'message- id:@mail.gmail.com': 0.28; 'compile': 0.29; 'looks': 0.29; 'print': 0.29; 'fairly': 0.30; 'times.': 0.30; 'yes.': 0.30; '(as': 0.31; 'equivalent': 0.31; 'anyone': 0.31; 'does': 0.32; 'implementing': 0.32; 'error.': 0.32; 'tue,': 0.32; 'url:listinfo': 0.32; "can't": 0.32; 'skip:- 30': 0.33; 'instead': 0.33; 'there': 0.33; 'received:209.85.160': 0.33; 'match': 0.34; 'to:addr:python-list': 0.34; 'certain': 0.34; 'things': 0.34; 'last):': 0.34; 'location,': 0.34; 'nested': 0.34; 'rest': 0.35; 'received:209.85.160.46': 0.35; 'received:mail- pw0-f46.google.com': 0.35; 'skip:" 20': 0.35; 'regular': 0.35; 'url:python': 0.36; 'but': 0.37; 'run': 0.37; 'received:google.com': 0.37; 'another': 0.37; 'problems': 0.37; 'skip:_ 10': 0.37; 'comments': 0.38; 'received:209.85': 0.38; 'characters': 0.39; 'e.g.': 0.39; 'url:docs': 0.39; "i'd": 0.39; 'url:org': 0.39; 'should': 0.39; 'why': 0.39; "it's": 0.40; 'received:209': 0.40; 'to:addr:python.org': 0.40; 'move': 0.40; 'once': 0.60; 'more': 0.61; 'your': 0.61; 'happen': 0.61; "we've": 0.62; 'believe': 0.65; '2012': 0.67; 'engine': 0.68; '1.33': 0.84; 'appreciative': 0.84; 'candide': 0.84; 'notion': 0.84; 'stars': 0.84; '0.00': 0.91 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :content-type:content-transfer-encoding; bh=Gxl04pGIfftoWSJCS3PTQWvnQNuRQ+HftqdgWnDsRls=; b=OFJIU8uY2xLjrcRoYM6M+By0RmcFys1HU/L0VZCopEG1z292SVKcJr5yGCKjSVFegA S7yZF67HgijFCYE2PbidFGeUyKtjbgzHG1vRWaqLrnBYlmU+n/2BxdsfQ7kVEuqOM0fu ayUcpbMte9Ng7uTqEBXo0UhUlHQ2YGZMQ2zpo= MIME-Version: 1.0 In-Reply-To: <4f02c069$0$690$426a74cc@news.free.fr> References: <4f02c069$0$690$426a74cc@news.free.fr> From: Devin Jeanpierre Date: Tue, 3 Jan 2012 04:45:40 -0500 Subject: Re: Repeating assertions in regular expression To: "comp.lang.python" Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable X-BeenThere: python-list@python.org X-Mailman-Version: 2.1.12 Precedence: list List-Id: General discussion list for the Python programming language List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Newsgroups: comp.lang.python Message-ID: Lines: 117 NNTP-Posting-Host: 2001:888:2000:d::a6 X-Trace: 1325583985 news.xs4all.nl 6891 [2001:888:2000:d::a6]:36966 X-Complaints-To: abuse@xs4all.nl Xref: x330-a1.tempe.blueboxinc.net comp.lang.python:18390 > \\b\\b and \\b{2} aren't equivalent ? This sounds suspiciously like a bug! > Why the wording is "should never" ? Repeating a zero-width assertion is n= ot > forbidden, for instance : > >>>> import re >>>> re.compile("\\b\\b\w+\\b\\b") > <_sre.SRE_Pattern object at 0xb7831140> >>>> I believe this is meant to refer to arbitrary-length repetitions, such as r'\b*', not simple concatenations like that. r'\b*' will abort the whole match if is run on a boundary, because Python detects a repetition of a zero-width match and decides this is an error. A little OT: I'm not really sure why Python does it this way. It's possible to match r'\b*' fairly painlessly: instead of aborting the whole match on a repeat of nothing, it can backtrack. So what should happen is that the RE engine looks at \b* and does the following: Is there one \b? Yes. Is there another \b? Backtrack: You haven't read any more characters since the last match in the * operation # The result now is that we've matched '\b*' with exactly one '\b' and can move on to the rest of the input This is possible (I haven't tried implementing it, but it's e.g. mentioned in Russ Cox's regular expression papers), you need to augment the backtracking search with a notion of the regular expression having "progress" so that you can require progress at certain points. Otherwise you obviously get problems with nested stars and things like that. (As it happens, Python's regex engine can't handle nested stars either). I'd be appreciative if anyone who works on re or regex can discuss this a little. -- Devin On Tue, Jan 3, 2012 at 3:46 AM, candide wrote: > The regular expression HOWTO > (http://docs.python.org/howto/regex.html#more-metacharacters) explains th= e > following > > # ------------------------------ > zero-width assertions should never be repeated, because if they match onc= e > at a given location, they can obviously be matched an infinite number of > times. > # ------------------------------ > > > Why the wording is "should never" ? Repeating a zero-width assertion is n= ot > forbidden, for instance : > >>>> import re >>>> re.compile("\\b\\b\w+\\b\\b") > <_sre.SRE_Pattern object at 0xb7831140> >>>> > > Nevertheless, the following doesn't execute : > >>>> re.compile("\\b{2}\w+\\b\\b") > Traceback (most recent call last): > =C2=A0File "", line 1, in > =C2=A0File "/usr/lib/python2.7/re.py", line 190, in compile > =C2=A0 =C2=A0return _compile(pattern, flags) > =C2=A0File "/usr/lib/python2.7/re.py", line 245, in _compile > =C2=A0 =C2=A0raise error, v # invalid expression > sre_constants.error: nothing to repeat >>>> > > > \\b\\b and \\b{2} aren't equivalent ? > > > Surprisingly, the engine doesn't optimize repeated boundary assertions, f= or > instance > > # ------------------------------ > import re > import time > > a=3Dtime.clock() > len("\\b\\b\\b"*100000+"\w+") > b=3Dtime.clock() > print "CPU time : %.2f s" %(b - a) > > a=3Dtime.clock() > re.compile("\\b\\b\\b"*100000+"\w+") > b=3Dtime.clock() > print "CPU time : %.2f s" %(b - a) > # ------------------------------ > > outputs: > > # ------------------------------ > CPU time : 0.00 s > CPU time : 1.33 s > # ------------------------------ > > > Your comments are welcome! > -- > http://mail.python.org/mailman/listinfo/python-list