Re: Help for a complex RE

From	Terry Reedy <tjreedy@udel.edu>
Newsgroups	comp.lang.python
Subject	Re: Help for a complex RE
Date	2016-05-08 13:17 -0400
Message-ID	<mailman.524.1462727911.32212.python-list@python.org>
References	<2aa55bd8-2ea4-41f7-b188-d45dff7d3bb7@googlegroups.com> <ngnomu$n3i$1@ger.gmane.org> <mailman.520.1462724202.32212.python-list@python.org> <6a3fe5cd-0ba9-4017-a763-76c896b8c843@googlegroups.com> <ngnsc0$c22$1@ger.gmane.org>

On 5/8/2016 12:32 PM, Sergio Spina wrote:
> On Sunday, 8 May 2016 18:16:56 UTC+2, Peter Otten wrote:
>> Sergio Spina wrote:
>>
>>> In the following ipython session:
>>>
>>>> Python 3.5.1+ (default, Feb 24 2016, 11:28:57)
>>>> Type "copyright", "credits" or "license" for more information.
>>>>
>>>> IPython 2.3.0 -- An enhanced Interactive Python.
>>>>
>>>> In [1]: import re
>>>>
>>>> In [2]: patt = r"""  # the match pattern is:
>>>> ...:     .+          # one or more characters
>>>> ...:     [ ]         # followed by a space
>>>> ...:     (?=[@#D]:)  # that is followed by one of the
>>>> ...:                 # chars "@#D" and a colon ":"
>>>> ...:    """
>>>>
>>>> In [3]: pattern = re.compile(patt, re.VERBOSE)
>>>>
>>>> In [4]: m = pattern.match("Jun@i Bun#i @:Janji")
>>>>
>>>> In [5]: m.group()
>>>> Out[5]: 'Jun@i Bun#i '
>>>>
>>>> In [6]: m = pattern.match("Jun@i Bun#i @:Janji D:Banji")
>>>>
>>>> In [7]: m.group()
>>>> Out[7]: 'Jun@i Bun#i @:Janji '
>>>>
>>>> In [8]: m = pattern.match("Jun@i Bun#i @:Janji D:Banji #:Junji")
>>>>
>>>> In [9]: m.group()
>>>> Out[9]: 'Jun@i Bun#i @:Janji D:Banji '
>>>
>>> Why does the regex engine stop the search at the last piece of the string?
>>> Why not at the first match of the group "@:"?
>>> What regex pattern would give the following result?
>>>
>>>> In [1]: m = pattern.match("Jun@i Bun#i @:Janji D:Banji #:Junji")
>>>>
>>>> In [2]: m.group()
>>>> Out[2]: 'Jun@i Bun#i '
>>
>> Compare:
>>
>>>>> re.compile("a+").match("aaaa").group()
>> 'aaaa'
>>>>> re.compile("a+?").match("aaaa").group()
>> 'a'
>>
>> By default pattern matching is "greedy" -- the ".+" part of your regex
>> matches as many characters as possible. Adding a ? like in ".+?" triggers
>> non-greedy matching.
>
>>  In [2]: patt = r"""  # the match pattern is:
>>  ...:     .+          # one or more characters

Peter meant that you should replace '.+' with '.+?' to get the 
non-greedy match.

>>  ...:     [ ]         # followed by a space
>>  ...:     (?=[@#D]:)  # ONLY IF it is followed by one of the <<< please note
>>  ...:                 # chars "@#D" and a colon ":"
>>  ...:    """
>
> From the python documentation
>
>>  (?=...)
>>      Matches if ... matches next, but doesn't consume any of the string.
>>      This is called a lookahead assertion. For example,
>>      Isaac (?=Asimov) will match 'Isaac ' only if it's followed by 'Asimov'.
>
> I know about greedy and not-greedy, but the problem remains.

Greedy '.+' first matches the whole string.  The matcher then backs up to 
find a space -- initially the last space.  It then, and only then, checks 
the lookahead assertion.  If that failed, it would back up again to the 
previous space.  In your examples, it succeeds at the last space, and the 
matcher stops.
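
To make that concrete, here is a minimal sketch using your longest test 
string.  The helper name space_candidates is made up for illustration; its 
loop only imitates the order in which a greedy '.+' gives characters back 
(rightmost space first) and is not how the re module is implemented.  It 
also assumes the text has no newlines, since '.' does not match them by 
default.

```python
import re

text = "Jun@i Bun#i @:Janji D:Banji #:Junji"

# Same pattern as in the original session, written without re.VERBOSE.
greedy = re.compile(r".+[ ](?=[@#D]:)")
# Peter's suggestion: make the quantifier non-greedy.
lazy = re.compile(r".+?[ ](?=[@#D]:)")

# Illustration only: the lookahead on its own, applied after each space.
lookahead = re.compile(r"[@#D]:")


def space_candidates(s):
    """Yield (prefix, lookahead_ok) for each space, rightmost first.

    This mimics the order in which greedy '.+' backs up. Index 0 is
    skipped because '.+' needs at least one character before the space.
    """
    for i in range(len(s) - 1, 0, -1):
        if s[i] == " ":
            yield s[: i + 1], bool(lookahead.match(s, i + 1))


print(repr(greedy.match(text).group()))  # 'Jun@i Bun#i @:Janji D:Banji '
print(repr(lazy.match(text).group()))    # 'Jun@i Bun#i '

# The first candidate whose lookahead succeeds is what the greedy
# pattern returns.
for prefix, ok in space_candidates(text):
    print(repr(prefix), ok)
    if ok:
        break
```

With '.+?' the matcher works in the opposite direction: it tries the 
shortest prefix first and extends it one character at a time, so it stops 
at the first space that is followed by '@:', '#:' or 'D:'.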

-- 
Terry Jan Reedy

Thread

Help for a complex RE Sergio Spina <sergio.am.spina@gmail.com> - 2016-05-08 08:18 -0700
  Re: Help for a complex RE Peter Otten <__peter__@web.de> - 2016-05-08 18:15 +0200
    Re: Help for a complex RE Sergio Spina <sergio.am.spina@gmail.com> - 2016-05-08 09:32 -0700
      Re: Help for a complex RE Terry Reedy <tjreedy@udel.edu> - 2016-05-08 13:17 -0400
      Re: Help for a complex RE Peter Otten <__peter__@web.de> - 2016-05-08 20:19 +0200
