| Started by | Sergio Spina <sergio.am.spina@gmail.com> |
|---|---|
| First post | 2016-05-08 08:18 -0700 |
| Last post | 2016-05-08 20:19 +0200 |
| Articles | 5 — 3 participants |
Help for a complex RE Sergio Spina <sergio.am.spina@gmail.com> - 2016-05-08 08:18 -0700
Re: Help for a complex RE Peter Otten <__peter__@web.de> - 2016-05-08 18:15 +0200
Re: Help for a complex RE Sergio Spina <sergio.am.spina@gmail.com> - 2016-05-08 09:32 -0700
Re: Help for a complex RE Terry Reedy <tjreedy@udel.edu> - 2016-05-08 13:17 -0400
Re: Help for a complex RE Peter Otten <__peter__@web.de> - 2016-05-08 20:19 +0200
| From | Sergio Spina <sergio.am.spina@gmail.com> |
|---|---|
| Date | 2016-05-08 08:18 -0700 |
| Subject | Help for a complex RE |
| Message-ID | <2aa55bd8-2ea4-41f7-b188-d45dff7d3bb7@googlegroups.com> |
In the following ipython session:
> Python 3.5.1+ (default, Feb 24 2016, 11:28:57)
> Type "copyright", "credits" or "license" for more information.
>
> IPython 2.3.0 -- An enhanced Interactive Python.
>
> In [1]: import re
>
> In [2]: patt = r""" # the match pattern is:
> ...: .+ # one or more characters
> ...: [ ] # followed by a space
> ...: (?=[@#D]:) # that is followed by one of the
> ...: # chars "@#D" and a colon ":"
> ...: """
>
> In [3]: pattern = re.compile(patt, re.VERBOSE)
>
> In [4]: m = pattern.match("Jun@i Bun#i @:Janji")
>
> In [5]: m.group()
> Out[5]: 'Jun@i Bun#i '
>
> In [6]: m = pattern.match("Jun@i Bun#i @:Janji D:Banji")
>
> In [7]: m.group()
> Out[7]: 'Jun@i Bun#i @:Janji '
>
> In [8]: m = pattern.match("Jun@i Bun#i @:Janji D:Banji #:Junji")
>
> In [9]: m.group()
> Out[9]: 'Jun@i Bun#i @:Janji D:Banji '
Why does the regex engine stop the search at the last piece of the string?
Why not at the first match of the group "@:"?
What regex pattern would give the following result?
> In [1]: m = pattern.match("Jun@i Bun#i @:Janji D:Banji #:Junji")
>
> In [2]: m.group()
> Out[2]: 'Jun@i Bun#i '
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2016-05-08 18:15 +0200 |
| Message-ID | <mailman.520.1462724202.32212.python-list@python.org> |
| In reply to | #108363 |
Sergio Spina wrote:
> Why does the regex engine stop the search at the last piece of the string?
> Why not at the first match of the group "@:"?
> What regex pattern would give the following result?
>
>> In [1]: m = pattern.match("Jun@i Bun#i @:Janji D:Banji #:Junji")
>>
>> In [2]: m.group()
>> Out[2]: 'Jun@i Bun#i '
Compare:
>>> re.compile("a+").match("aaaa").group()
'aaaa'
>>> re.compile("a+?").match("aaaa").group()
'a'
By default pattern matching is "greedy" -- the ".+" part of your regex
matches as many characters as possible. Adding a "?", as in ".+?", makes
the quantifier non-greedy, so it matches as few characters as possible.
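
The same comparison can be run on the string from the original question. This is a minimal sketch; the names `greedy`, `lazy`, `greedy_result` and `lazy_result` are illustrative and do not appear in the thread, and the pattern is written on one line instead of using re.VERBOSE.

```python
import re

text = "Jun@i Bun#i @:Janji D:Banji #:Junji"

# Greedy: ".+" takes as much as it can, so the match ends at the LAST
# space that is followed by one of "@#D" and a colon.
greedy = re.compile(r".+[ ](?=[@#D]:)")

# Non-greedy: ".+?" takes as little as it can, so the match ends at the
# FIRST space that is followed by one of "@#D" and a colon.
lazy = re.compile(r".+?[ ](?=[@#D]:)")

greedy_result = greedy.match(text).group()
lazy_result = lazy.match(text).group()

print(repr(greedy_result))  # 'Jun@i Bun#i @:Janji D:Banji '
print(repr(lazy_result))    # 'Jun@i Bun#i '
```

Both patterns use the same lookahead; only the quantifier differs, and that alone decides which candidate space ends the match.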
| From | Sergio Spina <sergio.am.spina@gmail.com> |
|---|---|
| Date | 2016-05-08 09:32 -0700 |
| Message-ID | <6a3fe5cd-0ba9-4017-a763-76c896b8c843@googlegroups.com> |
| In reply to | #108367 |
On Sunday, 8 May 2016 at 18:16:56 UTC+2, Peter Otten wrote:
> Compare:
>
> >>> re.compile("a+").match("aaaa").group()
> 'aaaa'
> >>> re.compile("a+?").match("aaaa").group()
> 'a'
>
> By default pattern matching is "greedy" -- the ".+" part of your regex
> matches as many characters as possible. Adding a ? like in ".+?" triggers
> non-greedy matching.
> In [2]: patt = r""" # the match pattern is:
> ...: .+ # one or more characters
> ...: [ ] # followed by a space
> ...: (?=[@#D]:) # ONLY IF it is followed by one of the <<< please note
> ...: # chars "@#D" and a colon ":"
> ...: """
From the Python documentation:
> (?=...)
> Matches if ... matches next, but doesn't consume any of the string.
> This is called a lookahead assertion. For example,
> Isaac (?=Asimov) will match 'Isaac ' only if it's followed by 'Asimov'.
I know about greedy and non-greedy, but the problem remains.
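
The documentation excerpt quoted above can be checked with a small sketch. The variable names `matched`, `rest` and `no_match` are illustrative only. The point it illustrates is that the lookahead only tests whether a candidate position is acceptable; it does not decide which of several acceptable positions the quantifier before it ends up at.

```python
import re

source = "Isaac Asimov"
m = re.match(r"Isaac (?=Asimov)", source)

# The lookahead is checked but not consumed: the match stops after the space.
matched = m.group()
rest = source[m.end():]

# Without the required follow-up text the whole pattern fails.
no_match = re.match(r"Isaac (?=Asimov)", "Isaac Newton")

print(repr(matched))   # 'Isaac '
print(repr(rest))      # 'Asimov'
print(no_match)        # None
```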
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2016-05-08 13:17 -0400 |
| Message-ID | <mailman.524.1462727911.32212.python-list@python.org> |
| In reply to | #108370 |
On 5/8/2016 12:32 PM, Sergio Spina wrote:
> On Sunday, 8 May 2016 at 18:16:56 UTC+2, Peter Otten wrote:
>> Compare:
>>
>> >>> re.compile("a+").match("aaaa").group()
>> 'aaaa'
>> >>> re.compile("a+?").match("aaaa").group()
>> 'a'
>>
>> By default pattern matching is "greedy" -- the ".+" part of your regex
>> matches as many characters as possible. Adding a ? like in ".+?" triggers
>> non-greedy matching.
>
>> In [2]: patt = r""" # the match pattern is:
>> ...: .+ # one or more characters
Peter meant that you should replace '.+' with '.+?' to get the
non-greedy match.
>> ...: [ ] # followed by a space
>> ...: (?=[@#D]:) # ONLY IF it is followed by one of the <<< please note
>> ...: # chars "@#D" and a colon ":"
>> ...: """
>
> From the Python documentation:
>
>> (?=...)
>> Matches if ... matches next, but doesn't consume any of the string.
>> This is called a lookahead assertion. For example,
>> Isaac (?=Asimov) will match 'Isaac ' only if it's followed by 'Asimov'.
>
> I know about greedy and non-greedy, but the problem remains.
Greedy '.+' first matches the whole string. The matcher then backs up to
find a space -- initially the last space. It then, and only then, checks
the lookahead assertion. If that failed, it would back up again. In your
examples it succeeds at the last candidate space, and the matcher stops.
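
The backing-up described above can be imitated by hand. What follows is a rough sketch of the idea, not the actual algorithm inside the re module; `greedy_end` is a hypothetical helper written only for this illustration.

```python
import re

text = "Jun@i Bun#i @:Janji D:Banji #:Junji"
lookahead = re.compile(r"[@#D]:")

def greedy_end(s):
    """Roughly mimic where greedy '.+[ ](?=[@#D]:)' ends its match in s."""
    # ".+" starts by taking everything, then gives back one character at a
    # time. i is the index where "[ ]" is tried; ".+" needs at least one
    # character before it, so i never goes below 1.
    for i in range(len(s) - 1, 0, -1):
        # Only once a space is found is the lookahead checked.
        if s[i] == " " and lookahead.match(s, i + 1):
            return i + 1  # first success while backing up: stop here
    return None

end = greedy_end(text)
by_hand = text[:end]
by_regex = re.match(r".+[ ](?=[@#D]:)", text).group()

print(repr(by_hand))        # 'Jun@i Bun#i @:Janji D:Banji '
print(by_hand == by_regex)  # True
```

Scanning from the right is what makes the last acceptable space win; a non-greedy quantifier corresponds to scanning from the left instead.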
--
Terry Jan Reedy
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2016-05-08 20:19 +0200 |
| Message-ID | <mailman.527.1462731580.32212.python-list@python.org> |
| In reply to | #108370 |
Sergio Spina wrote:
> I know about greedy and non-greedy, but the problem remains.
This makes me wonder why you had to ask
>>> Why does the regex engine stop the search at the last piece of the string?
>>> Why not at the first match of the group "@:"?
To make it crystal clear this time:
>>> import re
>>>
>>> patt = r""" # the match pattern is:
... .+? # one or more characters
... [ ] # followed by a space
... (?=[@#D]:) # that is followed by one of the
... # chars "@#D" and a colon ":"
... """
>>> pattern = re.compile(patt, re.VERBOSE)
>>> m = pattern.match("Jun@i Bun#i @:Janji D:Banji #:Junji")
>>> m.group()
'Jun@i Bun#i '
That's exactly what you asked for in
>>> What regex pattern would give the following result?
>>>
>>>> In [1]: m = pattern.match("Jun@i Bun#i @:Janji D:Banji #:Junji")
>>>>
>>>> In [2]: m.group()
>>>> Out[2]: 'Jun@i Bun#i '
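
As an alternative that nobody in the thread proposed, the first delimiter space can be located with search() and the string sliced up to it. This is a sketch under assumptions: `delimiter` and `head` are hypothetical names, and the behaviour differs from the ".+?" pattern in edge cases (for example, a string that begins with a space followed by "@:" yields ' ' here, while the ".+?" pattern needs at least one character before the space).

```python
import re

# A space that is followed by one of "@#D" and a colon.
delimiter = re.compile(r"[ ](?=[@#D]:)")

def head(s):
    """Return s up to and including the first delimiter space, or None."""
    m = delimiter.search(s)
    return s[:m.end()] if m else None

samples = [
    "Jun@i Bun#i @:Janji",
    "Jun@i Bun#i @:Janji D:Banji",
    "Jun@i Bun#i @:Janji D:Banji #:Junji",
]
results = [head(s) for s in samples]
print(results)  # each entry is 'Jun@i Bun#i '
```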