Groups > comp.lang.python > #108363 > unrolled thread

Help for a complex RE

Started by	Sergio Spina <sergio.am.spina@gmail.com>
First post	2016-05-08 08:18 -0700
Last post	2016-05-08 20:19 +0200
Articles	5 — 3 participants

  Help for a complex RE Sergio Spina <sergio.am.spina@gmail.com> - 2016-05-08 08:18 -0700
    Re: Help for a complex RE Peter Otten <__peter__@web.de> - 2016-05-08 18:15 +0200
      Re: Help for a complex RE Sergio Spina <sergio.am.spina@gmail.com> - 2016-05-08 09:32 -0700
        Re: Help for a complex RE Terry Reedy <tjreedy@udel.edu> - 2016-05-08 13:17 -0400
        Re: Help for a complex RE Peter Otten <__peter__@web.de> - 2016-05-08 20:19 +0200

#108363 — Help for a complex RE

From	Sergio Spina <sergio.am.spina@gmail.com>
Date	2016-05-08 08:18 -0700
Subject	Help for a complex RE
Message-ID	<2aa55bd8-2ea4-41f7-b188-d45dff7d3bb7@googlegroups.com>

In the following ipython session:

> Python 3.5.1+ (default, Feb 24 2016, 11:28:57) 
> Type "copyright", "credits" or "license" for more information.
>
> IPython 2.3.0 -- An enhanced Interactive Python.
>
> In [1]: import re
>
> In [2]: patt = r"""  # the match pattern is:
> ...:     .+          # one or more characters
> ...:     [ ]         # followed by a space
> ...:     (?=[@#D]:)  # that is followed by one of the
> ...:                 # chars "@#D" and a colon ":"
> ...:    """
> 
> In [3]: pattern = re.compile(patt, re.VERBOSE)
> 
> In [4]: m = pattern.match("Jun@i Bun#i @:Janji")
> 
> In [5]: m.group()
> Out[5]: 'Jun@i Bun#i '
> 
> In [6]: m = pattern.match("Jun@i Bun#i @:Janji D:Banji")
> 
> In [7]: m.group()
> Out[7]: 'Jun@i Bun#i @:Janji '
> 
> In [8]: m = pattern.match("Jun@i Bun#i @:Janji D:Banji #:Junji")
> 
> In [9]: m.group()
> Out[9]: 'Jun@i Bun#i @:Janji D:Banji '

Why does the regex engine stop the search at the last piece of the string?
Why not at the first match of the group "@:"?
What regex pattern would give the following result?

> In [1]: m = pattern.match("Jun@i Bun#i @:Janji D:Banji #:Junji")
> 
> In [2]: m.group()
> Out[2]: 'Jun@i Bun#i '

#108367

From	Peter Otten <__peter__@web.de>
Date	2016-05-08 18:15 +0200
Message-ID	<mailman.520.1462724202.32212.python-list@python.org>
In reply to	#108363

Sergio Spina wrote:

> Why does the regex engine stop the search at the last piece of the string?
> Why not at the first match of the group "@:"?
> What regex pattern would give the following result?
> 
>> In [1]: m = pattern.match("Jun@i Bun#i @:Janji D:Banji #:Junji")
>> 
>> In [2]: m.group()
>> Out[2]: 'Jun@i Bun#i '

Compare:

>>> re.compile("a+").match("aaaa").group()
'aaaa'
>>> re.compile("a+?").match("aaaa").group()
'a'

By default pattern matching is "greedy" -- the ".+" part of your regex 
matches as many characters as possible. Adding a ? like in ".+?" triggers 
non-greedy matching.
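
The same comparison can be run on the string from the question. This is a minimal sketch: the pattern is the one from the original post written on one line, and the names `greedy` and `lazy` are illustrative only.

```python
import re

text = "Jun@i Bun#i @:Janji D:Banji #:Junji"

# Same pattern as in the question, written compactly; only the quantifier differs.
greedy = re.compile(r".+ (?=[@#D]:)")   # '.+'  takes as much as it can
lazy = re.compile(r".+? (?=[@#D]:)")    # '.+?' takes as little as it can

greedy_match = greedy.match(text).group()
lazy_match = lazy.match(text).group()

print(repr(greedy_match))  # 'Jun@i Bun#i @:Janji D:Banji '
print(repr(lazy_match))    # 'Jun@i Bun#i '
```

Both patterns satisfy the lookahead; they differ only in which qualifying space they settle on.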

#108370

From	Sergio Spina <sergio.am.spina@gmail.com>
Date	2016-05-08 09:32 -0700
Message-ID	<6a3fe5cd-0ba9-4017-a763-76c896b8c843@googlegroups.com>
In reply to	#108367

On Sunday, 8 May 2016 at 18:16:56 UTC+2, Peter Otten wrote:
> Compare:
> 
> >>> re.compile("a+").match("aaaa").group()
> 'aaaa'
> >>> re.compile("a+?").match("aaaa").group()
> 'a'
> 
> By default pattern matching is "greedy" -- the ".+" part of your regex 
> matches as many characters as possible. Adding a ? like in ".+?" triggers 
> non-greedy matching.

>  In [2]: patt = r"""  # the match pattern is:
>  ...:     .+          # one or more characters
>  ...:     [ ]         # followed by a space
>  ...:     (?=[@#D]:)  # ONLY IF it is followed by one of the <<< please note
>  ...:                 # chars "@#D" and a colon ":"
>  ...:    """ 

From the Python documentation:

>  (?=...)
>      Matches if ... matches next, but doesn't consume any of the string.
>      This is called a lookahead assertion. For example,
>      Isaac (?=Asimov) will match 'Isaac ' only if it's followed by 'Asimov'.
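
The documentation example can be checked directly. A small sketch, with variable names chosen only for illustration: the lookahead decides whether the match succeeds, but the text it inspects is not part of the match.

```python
import re

pattern = re.compile(r"Isaac (?=Asimov)")

m = pattern.match("Isaac Asimov")
matched = m.group()                  # only 'Isaac ' is consumed
rest = "Isaac Asimov"[m.end():]      # 'Asimov' is still unconsumed

# Without 'Asimov' right after the space, the assertion fails and there is no match.
no_match = pattern.match("Isaac Newton")

print(repr(matched), repr(rest), no_match)
```

The assertion constrains where a match may end; it says nothing about which of several valid end points is chosen.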

I know about greedy and non-greedy matching, but the problem remains.

#108374

From	Terry Reedy <tjreedy@udel.edu>
Date	2016-05-08 13:17 -0400
Message-ID	<mailman.524.1462727911.32212.python-list@python.org>
In reply to	#108370

On 5/8/2016 12:32 PM, Sergio Spina wrote:
> On Sunday, 8 May 2016 at 18:16:56 UTC+2, Peter Otten wrote:
>> Compare:
>>
>> >>> re.compile("a+").match("aaaa").group()
>> 'aaaa'
>> >>> re.compile("a+?").match("aaaa").group()
>> 'a'
>>
>> By default pattern matching is "greedy" -- the ".+" part of your regex
>> matches as many characters as possible. Adding a ? like in ".+?" triggers
>> non-greedy matching.
>
>>  In [2]: patt = r"""  # the match pattern is:
>>  ...:     .+          # one or more characters

Peter meant that you should replace '.+' with '.+?' to get the 
non-greedy match.

>>  ...:     [ ]         # followed by a space
>>  ...:     (?=[@#D]:)  # ONLY IF it is followed by one of the <<< please note
>>  ...:                 # chars "@#D" and a colon ":"
>>  ...:    """
>
> From the Python documentation:
>
>>  (?=...)
>>      Matches if ... matches next, but doesn't consume any of the string.
>>      This is called a lookahead assertion. For example,
>>      Isaac (?=Asimov) will match 'Isaac ' only if it's followed by 'Asimov'.
>
> I know about greedy and non-greedy matching, but the problem remains.

Greedy '.+' first matches the whole string.  The matcher then backs up to 
find a space, initially the last space.  It then, and only then, checks the 
lookahead assertion.  If that failed, it would back up again to the previous 
space.  In your examples, the check succeeds at the last space, so the 
matcher stops there.
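
The order of those steps can be imitated by hand. The following is only a sketch of the observable behaviour, not of how the `re` engine is implemented; the helper `prefix_end` is hypothetical, and it assumes the text contains no newline, so that '.' can match every character.

```python
import re

LOOKAHEAD = re.compile(r"[@#D]:")

def prefix_end(text, greedy=True):
    """Return the end index of the match for '.+[ ](?=[@#D]:)'
    ('.+?' when greedy is False), or None if there is no match."""
    # '.+' needs at least one character, so the space sits at index 1 or later.
    candidates = [i for i, ch in enumerate(text) if ch == " " and i >= 1]
    if greedy:
        candidates.reverse()          # greedy: try the last space first
    for i in candidates:
        # Only now is the lookahead checked, at the position after the space.
        if LOOKAHEAD.match(text, i + 1):
            return i + 1
    return None

text = "Jun@i Bun#i @:Janji D:Banji #:Junji"
greedy_prefix = text[:prefix_end(text)]
lazy_prefix = text[:prefix_end(text, greedy=False)]
print(repr(greedy_prefix))  # 'Jun@i Bun#i @:Janji D:Banji '
print(repr(lazy_prefix))    # 'Jun@i Bun#i '
```

The only difference between the two modes is the direction in which the candidate spaces are tried.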

-- 
Terry Jan Reedy

#108379

From	Peter Otten <__peter__@web.de>
Date	2016-05-08 20:19 +0200
Message-ID	<mailman.527.1462731580.32212.python-list@python.org>
In reply to	#108370

Sergio Spina wrote:

> I know about greedy and non-greedy matching, but the problem remains.

This makes me wonder why you had to ask

>>> Why does the regex engine stop the search at the last piece of the string?
>>> Why not at the first match of the group "@:"?

To make it crystal clear this time:

>>> import re
>>> 
>>> patt = r"""  # the match pattern is:
... .+?         # one or more characters
... [ ]         # followed by a space
... (?=[@#D]:)  # that is followed by one of the
...             # chars "@#D" and a colon ":"
... """
>>> pattern = re.compile(patt, re.VERBOSE)
>>> m = pattern.match("Jun@i Bun#i @:Janji D:Banji #:Junji")
>>> m.group()
'Jun@i Bun#i '

That's exactly what you asked for in

>>> What regex pattern would give the following result?
>>> 
>>>> In [1]: m = pattern.match("Jun@i Bun#i @:Janji D:Banji #:Junji")
>>>> 
>>>> In [2]: m.group()
>>>> Out[2]: 'Jun@i Bun#i '
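
An alternative that does not appear in the thread is to search for the delimiting space itself and slice the string, which sidesteps the greedy versus non-greedy question. This is a sketch under assumptions: `DELIM` and `leading_text` are hypothetical names, and unlike '.+?' this version does not require a character before the space.

```python
import re

# A space that is immediately followed by one of '@', '#', 'D' and a colon.
DELIM = re.compile(r"[ ](?=[@#D]:)")

def leading_text(s):
    """Return everything up to and including the first delimiting space, or None."""
    m = DELIM.search(s)   # search() scans left to right and stops at the first hit
    return s[:m.end()] if m else None

samples = [
    "Jun@i Bun#i @:Janji",
    "Jun@i Bun#i @:Janji D:Banji",
    "Jun@i Bun#i @:Janji D:Banji #:Junji",
]
results = [leading_text(s) for s in samples]
print(results)  # ['Jun@i Bun#i ', 'Jun@i Bun#i ', 'Jun@i Bun#i ']
```

For the three inputs from the original post this gives the same prefix as the '.+?' pattern.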
