From: Terry Reedy
Newsgroups: comp.lang.python
Subject: Re: Help for a complex RE
Date: Sun, 8 May 2016 13:17:50 -0400

On 5/8/2016 12:32 PM, Sergio Spina wrote:
> On Sunday, 8 May 2016 at 18:16:56 UTC+2, Peter Otten wrote:
>> Sergio Spina wrote:
>>
>>> In the following ipython session:
>>>
>>>> Python 3.5.1+ (default, Feb 24 2016, 11:28:57)
>>>> Type "copyright", "credits" or "license" for more information.
>>>>
>>>> IPython 2.3.0 -- An enhanced Interactive Python.
>>>>
>>>> In [1]: import re
>>>>
>>>> In [2]: patt = r"""             # the match pattern is:
>>>>    ...:     .+                  # one or more characters
>>>>    ...:     [ ]                 # followed by a space
>>>>    ...:     (?=[@#D]:)          # that is followed by one of the
>>>>    ...:                         # chars "@#D" and a colon ":"
>>>>    ...: """
>>>>
>>>> In [3]: pattern = re.compile(patt, re.VERBOSE)
>>>>
>>>> In [4]: m = pattern.match("Jun@i Bun#i @:Janji")
>>>>
>>>> In [5]: m.group()
>>>> Out[5]: 'Jun@i Bun#i '
>>>>
>>>> In [6]: m = pattern.match("Jun@i Bun#i @:Janji D:Banji")
>>>>
>>>> In [7]: m.group()
>>>> Out[7]: 'Jun@i Bun#i @:Janji '
>>>>
>>>> In [8]: m = pattern.match("Jun@i Bun#i @:Janji D:Banji #:Junji")
>>>>
>>>> In [9]: m.group()
>>>> Out[9]: 'Jun@i Bun#i @:Janji D:Banji '
>>>
>>> Why the regex engine stops the search at last piece of string?
>>> Why not at the first match of the group "@:"?
>>> What can it be a regex pattern with the following result?
>>>
>>>> In [1]: m = pattern.match("Jun@i Bun#i @:Janji D:Banji #:Junji")
>>>>
>>>> In [2]: m.group()
>>>> Out[2]: 'Jun@i Bun#i '
>>
>> Compare:
>>
>> >>> re.compile("a+").match("aaaa").group()
>> 'aaaa'
>> >>> re.compile("a+?").match("aaaa").group()
>> 'a'
>>
>> By default pattern matching is "greedy" -- the ".+" part of your regex
>> matches as many characters as possible. Adding a ? like in ".+?" triggers
>> non-greedy matching.
>
>> In [2]: patt = r"""             # the match pattern is:
>>    ...:     .+                  # one or more characters

Peter meant that you should replace '.+' with '.+?' to get the
non-greedy match.

>>    ...:     [ ]                 # followed by a space
>>    ...:     (?=[@#D]:)          # ONLY IF is followed by one of the <<< please note
>>    ...:                         # chars "@#D" and a colon ":"
>>    ...: """
>
> From the python documentation
>
>> (?=...)
>> Matches if ... matches next, but doesn't consume any of the string.
>> This is called a lookahead assertion. For example,
>> Isaac (?=Asimov) will match 'Isaac ' only if it's followed by 'Asimov'.
>
> I know about greedy and not-greedy, but the problem remains.

Greedy '.+' matches the whole string. The matcher then backs up to find
a space -- initially the last space. It then, and only then, checks the
lookahead assertion. If that failed, it would back up again. In your
examples, it succeeds at that last space, so the matcher stops there.

-- 
Terry Jan Reedy
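
A minimal sketch of the difference described above, run against the longest test string from the thread. The names `greedy` and `lazy` are only illustrative, and the pattern is written without re.VERBOSE for brevity; it is meant to be the same pattern otherwise.

```python
import re

text = "Jun@i Bun#i @:Janji D:Banji #:Junji"

# Greedy: '.+' first grabs the whole string, then backs up to the LAST
# space whose lookahead (one of "@#D" plus a colon) succeeds.
greedy = re.compile(r".+[ ](?=[@#D]:)")

# Non-greedy: '.+?' grows one character at a time, so the match ends at
# the FIRST space whose lookahead succeeds.
lazy = re.compile(r".+?[ ](?=[@#D]:)")

greedy_match = greedy.match(text).group()
lazy_match = lazy.match(text).group()

print(repr(greedy_match))  # 'Jun@i Bun#i @:Janji D:Banji '
print(repr(lazy_match))    # 'Jun@i Bun#i '
```

The second result is the output Sergio asked for; the only change is '.+' to '.+?'.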