
Groups in regular expressions don't repeat as expected

Started by	John Nagle <nagle@animats.com>
First post	2011-04-20 12:20 -0700
Last post	2011-04-21 20:36 +0200
Articles	8 — 4 participants


  Groups in regular expressions don't repeat as expected John Nagle <nagle@animats.com> - 2011-04-20 12:20 -0700
    Re: Groups in regular expressions don't repeat as expected Neil Cerutti <neilc@norwich.edu> - 2011-04-20 19:23 +0000
      Re: Groups in regular expressions don't repeat as expected John Nagle <nagle@animats.com> - 2011-04-20 13:34 -0700
        Re: Groups in regular expressions don't repeat as expected Neil Cerutti <neilc@norwich.edu> - 2011-04-21 13:16 +0000
          Re: Groups in regular expressions don't repeat as expected John Nagle <nagle@animats.com> - 2011-04-24 12:43 -0700
    Re: Groups in regular expressions don't repeat as expected MRAB <python@mrabarnett.plus.com> - 2011-04-20 21:03 +0100
    Re: Groups in regular expressions don't repeat as expected Vlastimil Brom <vlastimil.brom@gmail.com> - 2011-04-21 15:57 +0200
    Re: Groups in regular expressions don't repeat as expected Vlastimil Brom <vlastimil.brom@gmail.com> - 2011-04-21 20:36 +0200

#3741 — Groups in regular expressions don't repeat as expected

From	John Nagle <nagle@animats.com>
Date	2011-04-20 12:20 -0700
Subject	Groups in regular expressions don't repeat as expected
Message-ID	<4daf31e3$0$10596$742ec2ed@news.sonic.net>

Here's something that surprised me about Python regular expressions.

 >>> krex = re.compile(r"^([a-z])+$")
 >>> s = "abcdef"
 >>> ms = krex.match(s)
 >>> ms.groups()
('f',)

The parentheses indicate a capturing group within the
regular expression, and the "+" indicates that the
group can appear one or more times.  The regular
expression matches that way.  But instead of returning
a captured group for each character, it returns only the
last one.

The documentation in fact says that, at

http://docs.python.org/library/re.html

"If a group is contained in a part of the pattern that matched multiple 
times, the last match is returned."

That's kind of lame, though. I'd expect that there would be some way
to retrieve all matches.
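
A minimal sketch of the behaviour described above, using only the standard `re` module. The variable name `m` is just for illustration; the point is that group 1 keeps only the text and span of its final repetition.

```python
import re

m = re.match(r"^([a-z])+$", "abcdef")

# Group 1 took part in six repetitions, but only the last one is retained.
print(m.group(1))   # f
print(m.span(1))    # (5, 6)

# The whole match (group 0) still covers the entire string.
print(m.group(0))   # abcdef
```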

					John Nagle


#3742

From	Neil Cerutti <neilc@norwich.edu>
Date	2011-04-20 19:23 +0000
Message-ID	<918q69FjfgU2@mid.individual.net>
In reply to	#3741

On 2011-04-20, John Nagle <nagle@animats.com> wrote:
> Here's something that surprised me about Python regular expressions.
>
> >>> krex = re.compile(r"^([a-z])+$")
> >>> s = "abcdef"
> >>> ms = krex.match(s)
> >>> ms.groups()
> ('f',)
>
> The parentheses indicate a capturing group within the
> regular expression, and the "+" indicates that the
> group can appear one or more times.  The regular
> expression matches that way.  But instead of returning
> a captured group for each character, it returns only the
> last one.
>
> The documentation in fact says that, at
>
> http://docs.python.org/library/re.html
>
> "If a group is contained in a part of the pattern that matched multiple 
> times, the last match is returned."
>
> That's kind of lame, though. I'd expect that there would be some way
> to retrieve all matches.

.findall

-- 
Neil Cerutti


#3749

From	John Nagle <nagle@animats.com>
Date	2011-04-20 13:34 -0700
Message-ID	<4daf4344$0$10519$742ec2ed@news.sonic.net>
In reply to	#3742

On 4/20/2011 12:23 PM, Neil Cerutti wrote:
> On 2011-04-20, John Nagle<nagle@animats.com>  wrote:
>> Here's something that surprised me about Python regular expressions.
>>
>> >>> krex = re.compile(r"^([a-z])+$")
>> >>> s = "abcdef"
>> >>> ms = krex.match(s)
>> >>> ms.groups()
>> ('f',)
>>
>> The parentheses indicate a capturing group within the
>> regular expression, and the "+" indicates that the
>> group can appear one or more times.  The regular
>> expression matches that way.  But instead of returning
>> a captured group for each character, it returns only the
>> last one.
>>
>> The documentation in fact says that, at
>>
>> http://docs.python.org/library/re.html
>>
>> "If a group is contained in a part of the pattern that matched multiple
>> times, the last match is returned."
>>
>> That's kind of lame, though. I'd expect that there would be some way
>> to retrieve all matches.
>
> .findall
>

     Findall does something a bit different. It returns a list of
matches of the entire pattern, not repeats of groups within
the pattern.
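
To make that difference concrete, here is a small sketch with the standard `re` module (the names `each_char` and `whole` are made up for this example). An unanchored one-character pattern matches six separate times, while the anchored pattern with a repeated group matches once and reports only the group's last repetition.

```python
import re

s = "abcdef"

# Unanchored: the whole pattern matches six times, so findall yields six items.
each_char = re.findall(r"([a-z])", s)

# Anchored with a repeated group: the whole pattern matches exactly once,
# and the group reports only its last repetition.
whole = re.findall(r"^([a-z])+$", s)

print(each_char)  # ['a', 'b', 'c', 'd', 'e', 'f']
print(whole)      # ['f']
```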

     Consider a regular expression for matching domain names:

 >>> kre = re.compile(r'^([a-zA-Z0-9\-]+)(?:\.([a-zA-Z0-9\-]+))+$')
 >>> s = 'www.example.com'
 >>> ms = kre.match(s)
 >>> ms.groups()
('www', 'com')
 >>> msall = kre.findall(s)
 >>> msall
[('www', 'com')]

This is just a simple example.  But it illustrates an unnecessary
limitation.  The matcher can do the repeated matching; you just can't
get the results out.

				John Nagle


#3794

From	Neil Cerutti <neilc@norwich.edu>
Date	2011-04-21 13:16 +0000
Message-ID	<91ap1kF1pjU2@mid.individual.net>
In reply to	#3749

On 2011-04-20, John Nagle <nagle@animats.com> wrote:
>      Findall does something a bit different. It returns a list of
> matches of the entire pattern, not repeats of groups within
> the pattern.
>
>      Consider a regular expression for matching domain names:
>
> >>> kre = re.compile(r'^([a-zA-Z0-9\-]+)(?:\.([a-zA-Z0-9\-]+))+$')
> >>> s = 'www.example.com'
> >>> ms = kre.match(s)
> >>> ms.groups()
> ('www', 'com')
> >>> msall = kre.findall(s)
> >>> msall
> [('www', 'com')]
>
> This is just a simple example.  But it illustrates an unnecessary
> limitation.  The matcher can do the repeated matching; you just can't
> get the results out.

Thanks for the further explanation.

Assuming a fake API that returned multiple group matches as a
tuple:

>>? print(re.match(r"^([a-z])+$", "abcdef").groups())
(('a', 'b', 'c', 'd', 'e', 'f'),)

I was thinking of applying findall something like this, but you
have to make multiple calls:

>>> m = re.match(r"^[a-z]+$", s)
>>> if m:
...   print(re.findall(r"[a-z]", m.group()))
...
['a', 'b', 'c', 'd', 'e', 'f']

I can see that getting really annoying. Is there a better way to
make multiple group matches accessible without adding a third
element type as a group element?
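
One way to hide the two calls is a small wrapper. This is only a sketch under the same assumptions as the snippet above: `repeated_matches` is a hypothetical helper name, not part of `re`, and the caller has to supply the inner pattern separately.

```python
import re

def repeated_matches(outer, inner, text):
    """Return every match of `inner` within `text`, but only if `outer`
    matches `text`; return None when `outer` does not match."""
    m = re.match(outer, text)
    if m is None:
        return None
    # Re-scan the matched text with the pattern for the repeated piece.
    return re.findall(inner, m.group())

print(repeated_matches(r"^[a-z]+$", r"[a-z]", "abcdef"))
print(repeated_matches(r"^[a-z]+$", r"[a-z]", "abc123"))  # None
```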

-- 
Neil Cerutti


#3959

From	John Nagle <nagle@animats.com>
Date	2011-04-24 12:43 -0700
Message-ID	<4db47d66$0$10524$742ec2ed@news.sonic.net>
In reply to	#3794

On 4/21/2011 6:16 AM, Neil Cerutti wrote:
> On 2011-04-20, John Nagle<nagle@animats.com>  wrote:
>>       Findall does something a bit different. It returns a list of
>> matches of the entire pattern, not repeats of groups within
>> the pattern.
>>
>>       Consider a regular expression for matching domain names:
>>
>> >>> kre = re.compile(r'^([a-zA-Z0-9\-]+)(?:\.([a-zA-Z0-9\-]+))+$')
>> >>> s = 'www.example.com'
>> >>> ms = kre.match(s)
>> >>> ms.groups()
>> ('www', 'com')
>> >>> msall = kre.findall(s)
>> >>> msall
>> [('www', 'com')]
>>
>> This is just a simple example.  But it illustrates an unnecessary
>> limitation.  The matcher can do the repeated matching; you just can't
>> get the results out.
>
> Thanks for the further explanation.
>
> Assuming a fake API that returned multiple group matches as a
> tuple:
>
> >>? print(re.match(r"^([a-z])+$", "abcdef").groups())
> (('a', 'b', 'c', 'd', 'e', 'f'),)
>
> I was thinking of applying findall something like this, but you
> have to make multiple calls:
>
> >>> m = re.match(r"^[a-z]+$", s)
> >>> if m:
> ...   print(re.findall(r"[a-z]", m.group()))
> ...
> ['a', 'b', 'c', 'd', 'e', 'f']
>
> I can see that getting really annoying. Is there a better way to
> make multiple group matches accessible without adding a third
> element type as a group element?

     The most elegant solution would be to have a regular expression
function that returned a tree of tuples or lists.  Then you could
express an entire language syntax as a regular expression and
get out a parse tree.

     Since the regular expression system is actually doing that work,
then discarding the results, it seems a reasonable extension.
I'm not suggesting extending regular expression matching itself,
just the way the results are stored.
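
For comparison, here is roughly what has to be done by hand today to get a nested result. This is only a sketch and not the proposed extension: the `key=v1,v2;key=v3` mini-syntax and the names `ITEM`, `WHOLE`, `PAIR` and `parse` are invented for the example, and the inner structure is recovered by re-scanning and splitting rather than by the matcher itself.

```python
import re

# Hypothetical mini-syntax: items separated by ';', each "name=n1,n2,...".
ITEM = r"[a-z]+=[0-9]+(?:,[0-9]+)*"
WHOLE = re.compile(r"^%s(?:;%s)*$" % (ITEM, ITEM))
PAIR = re.compile(r"([a-z]+)=([0-9,]+)")

def parse(text):
    """Return a list of (name, [values]) pairs, or None if text is invalid."""
    if WHOLE.match(text) is None:
        return None
    # Second pass: pull out each item, then split its value list.
    return [(name, values.split(",")) for name, values in PAIR.findall(text)]

print(parse("a=1,2;b=3"))  # [('a', ['1', '2']), ('b', ['3'])]
print(parse("a=;b"))       # None
```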

				John Nagle


#3746

From	MRAB <python@mrabarnett.plus.com>
Date	2011-04-20 21:03 +0100
Message-ID	<mailman.669.1303329895.9059.python-list@python.org>
In reply to	#3741

On 20/04/2011 20:20, John Nagle wrote:
> Here's something that surprised me about Python regular expressions.
>
>  >>> krex = re.compile(r"^([a-z])+$")
>  >>> s = "abcdef"
>  >>> ms = krex.match(s)
>  >>> ms.groups()
> ('f',)
>
> The parentheses indicate a capturing group within the
> regular expression, and the "+" indicates that the
> group can appear one or more times. The regular
> expression matches that way. But instead of returning
> a captured group for each character, it returns only the
> last one.
>
> The documentation in fact says that, at
>
> http://docs.python.org/library/re.html
>
> "If a group is contained in a part of the pattern that matched multiple
> times, the last match is returned."
>
> That's kind of lame, though. I'd expect that there would be some way
> to retrieve all matches.
>
You should take a look at the regex module on PyPI. :-)


#3796

From	Vlastimil Brom <vlastimil.brom@gmail.com>
Date	2011-04-21 15:57 +0200
Message-ID	<mailman.701.1303394244.9059.python-list@python.org>
In reply to	#3741

2011/4/20 John Nagle <nagle@animats.com>:
> Here's something that surprised me about Python regular expressions.
>
> >>> krex = re.compile(r"^([a-z])+$")
> >>> s = "abcdef"
> >>> ms = krex.match(s)
> >>> ms.groups()
> ('f',)
>
>...

> "If a group is contained in a part of the pattern that matched multiple
> times, the last match is returned."
>
> That's kind of lame, though. I'd expect that there would be some way
> to retrieve all matches.
>
>                                        John Nagle
> --
> http://mail.python.org/mailman/listinfo/python-list

Hi,
do you mean something like:

>>> import regex
>>> ms = regex.match(r"^([a-z])+$", "abcdef")
>>> ms.captures(1)
['a', 'b', 'c', 'd', 'e', 'f']
>>>

>>> help(ms.captures)
Help on built-in function captures:

captures(...)
    captures([group1, ...]) --> list of strings or tuple of list of strings.
    Return the captures of one or more subgroups of the match.  If there is a
    single argument, the result is a list of strings; if there are multiple
    arguments, the result is a tuple of lists with one item per argument; if
    there are no arguments, the captures of the whole match is returned.  Group
    0 is the whole match.

>>>

cf.
http://pypi.python.org/pypi/regex
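
Applied to the domain-name pattern from earlier in the thread, `captures` could be used along these lines. This is a sketch, not a recommendation: `domain_labels` is a hypothetical helper, `regex` is the third-party module and may not be installed, so the code falls back to validating with `re` and splitting on dots.

```python
import re

try:
    import regex  # third-party module from PyPI; optional here
except ImportError:
    regex = None

PATTERN = r'^([a-zA-Z0-9\-]+)(?:\.([a-zA-Z0-9\-]+))+$'

def domain_labels(name):
    """Return all dot-separated labels of name, or None if it does not match."""
    if regex is not None:
        m = regex.match(PATTERN, name)
        if m is None:
            return None
        # captures(n) lists every repetition of group n, not just the last.
        return m.captures(1) + m.captures(2)
    # Fallback (assumption: plain splitting is acceptable once validated).
    if re.match(PATTERN, name) is None:
        return None
    return name.split('.')

print(domain_labels('www.example.com'))  # ['www', 'example', 'com']
```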

hth,
  vbr


#3817

From	Vlastimil Brom <vlastimil.brom@gmail.com>
Date	2011-04-21 20:36 +0200
Message-ID	<mailman.717.1303411028.9059.python-list@python.org>
In reply to	#3741

2011/4/20 MRAB <python@mrabarnett.plus.com>:
> On 20/04/2011 20:20, John Nagle wrote:
>>
>> Here's something that surprised me about Python regular expressions.
>>
>> ...
> You should take a look at the regex module on PyPI. :-)
> --
>

Ah well...
sorry for possibly destroying the point and the aha! effect ...
    vbr

