Groups > comp.lang.python > #3741 > unrolled thread
| Started by | John Nagle <nagle@animats.com> |
|---|---|
| First post | 2011-04-20 12:20 -0700 |
| Last post | 2011-04-21 20:36 +0200 |
| Articles | 8 — 4 participants |
Groups in regular expressions don't repeat as expected John Nagle <nagle@animats.com> - 2011-04-20 12:20 -0700
Re: Groups in regular expressions don't repeat as expected Neil Cerutti <neilc@norwich.edu> - 2011-04-20 19:23 +0000
Re: Groups in regular expressions don't repeat as expected John Nagle <nagle@animats.com> - 2011-04-20 13:34 -0700
Re: Groups in regular expressions don't repeat as expected Neil Cerutti <neilc@norwich.edu> - 2011-04-21 13:16 +0000
Re: Groups in regular expressions don't repeat as expected John Nagle <nagle@animats.com> - 2011-04-24 12:43 -0700
Re: Groups in regular expressions don't repeat as expected MRAB <python@mrabarnett.plus.com> - 2011-04-20 21:03 +0100
Re: Groups in regular expressions don't repeat as expected Vlastimil Brom <vlastimil.brom@gmail.com> - 2011-04-21 15:57 +0200
Re: Groups in regular expressions don't repeat as expected Vlastimil Brom <vlastimil.brom@gmail.com> - 2011-04-21 20:36 +0200
| From | John Nagle <nagle@animats.com> |
|---|---|
| Date | 2011-04-20 12:20 -0700 |
| Subject | Groups in regular expressions don't repeat as expected |
| Message-ID | <4daf31e3$0$10596$742ec2ed@news.sonic.net> |
Here's something that surprised me about Python regular expressions.
>>> krex = re.compile(r"^([a-z])+$")
>>> s = "abcdef"
>>> ms = krex.match(s)
>>> ms.groups()
('f',)
The parentheses indicate a capturing group within the
regular expression, and the "+" indicates that the
group can appear one or more times. The regular
expression matches that way. But instead of returning
a captured group for each character, it returns only the
last one.
The documentation in fact says that, at
http://docs.python.org/library/re.html
"If a group is contained in a part of the pattern that matched multiple
times, the last match is returned."
That's kind of lame, though. I'd expect that there would be some way
to retrieve all matches.
John Nagle
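The behaviour described in this post can be reproduced with only the standard `re` module. This is a small sketch rather than anything from the original message; `span(1)` is used just to show that group 1 ends up pointing at the final repetition.

```python
import re

krex = re.compile(r"^([a-z])+$")
ms = krex.match("abcdef")

# The pattern as a whole matched all six characters...
whole = ms.group(0)      # 'abcdef'

# ...but group 1 only remembers its last repetition.
last = ms.groups()       # ('f',)
last_span = ms.span(1)   # (5, 6): the position of the final 'f'

print(whole, last, last_span)
```

Each repetition of the group overwrites the previous capture, which is why only `'f'` survives.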
| From | Neil Cerutti <neilc@norwich.edu> |
|---|---|
| Date | 2011-04-20 19:23 +0000 |
| Message-ID | <918q69FjfgU2@mid.individual.net> |
| In reply to | #3741 |
On 2011-04-20, John Nagle <nagle@animats.com> wrote:
> Here's something that surprised me about Python regular expressions.
>
> >>> krex = re.compile(r"^([a-z])+$")
> >>> s = "abcdef"
> >>> ms = krex.match(s)
> >>> ms.groups()
> ('f',)
>
> The parentheses indicate a capturing group within the
> regular expression, and the "+" indicates that the
> group can appear one or more times. The regular
> expression matches that way. But instead of returning
> a captured group for each character, it returns only the
> last one.
>
> The documentation in fact says that, at
>
> http://docs.python.org/library/re.html
>
> "If a group is contained in a part of the pattern that matched multiple
> times, the last match is returned."
>
> That's kind of lame, though. I'd expect that there would be some way
> to retrieve all matches.
.findall
--
Neil Cerutti
| From | John Nagle <nagle@animats.com> |
|---|---|
| Date | 2011-04-20 13:34 -0700 |
| Message-ID | <4daf4344$0$10519$742ec2ed@news.sonic.net> |
| In reply to | #3742 |
On 4/20/2011 12:23 PM, Neil Cerutti wrote:
> On 2011-04-20, John Nagle <nagle@animats.com> wrote:
>> Here's something that surprised me about Python regular expressions.
>>
>> >>> krex = re.compile(r"^([a-z])+$")
>> >>> s = "abcdef"
>> >>> ms = krex.match(s)
>> >>> ms.groups()
>> ('f',)
>>
>> The parentheses indicate a capturing group within the
>> regular expression, and the "+" indicates that the
>> group can appear one or more times. The regular
>> expression matches that way. But instead of returning
>> a captured group for each character, it returns only the
>> last one.
>>
>> The documentation in fact says that, at
>>
>> http://docs.python.org/library/re.html
>>
>> "If a group is contained in a part of the pattern that matched multiple
>> times, the last match is returned."
>>
>> That's kind of lame, though. I'd expect that there would be some way
>> to retrieve all matches.
>
> .findall
>
Findall does something a bit different. It returns a list of
matches of the entire pattern, not repeats of groups within
the pattern.
Consider a regular expression for matching domain names:
>>> kre = re.compile(r'^([a-zA-Z0-9\-]+)(?:\.([a-zA-Z0-9\-]+))+$')
>>> s = 'www.example.com'
>>> ms = kre.match(s)
>>> ms.groups()
('www', 'com')
>>> msall = kre.findall(s)
>>> msall
[('www', 'com')]
This is just a simple example. But it illustrates an unnecessary
limitation. The matcher can do the repeated matching; you just can't
get the results out.
John Nagle
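One workaround with the standard library, sketched here under the assumption that what is wanted is every dot-separated label of the name: use the pattern only to validate, then split the string. The helper name `domain_labels` is made up for this illustration.

```python
import re

# Same pattern as in the post above; it is used here purely as a validator.
DOMAIN = re.compile(r'^([a-zA-Z0-9\-]+)(?:\.([a-zA-Z0-9\-]+))+$')

def domain_labels(name):
    """Return every label of a valid name, or None if it does not match."""
    # fullmatch requires the whole string to match the pattern
    if DOMAIN.fullmatch(name) is None:
        return None
    # The separator is a literal dot, so a plain split recovers each label.
    return name.split('.')

print(domain_labels('www.example.com'))   # ['www', 'example', 'com']
print(domain_labels('localhost'))         # None (the pattern needs a dot)
```

This only works because the repeated part has a fixed, simple separator; it does not address the general complaint that the intermediate captures are discarded.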
| From | Neil Cerutti <neilc@norwich.edu> |
|---|---|
| Date | 2011-04-21 13:16 +0000 |
| Message-ID | <91ap1kF1pjU2@mid.individual.net> |
| In reply to | #3749 |
On 2011-04-20, John Nagle <nagle@animats.com> wrote:
> Findall does something a bit different. It returns a list of
> matches of the entire pattern, not repeats of groups within
> the pattern.
>
> Consider a regular expression for matching domain names:
>
> >>> kre = re.compile(r'^([a-zA-Z0-9\-]+)(?:\.([a-zA-Z0-9\-]+))+$')
> >>> s = 'www.example.com'
> >>> ms = kre.match(s)
> >>> ms.groups()
> ('www', 'com')
> >>> msall = kre.findall(s)
> >>> msall
> [('www', 'com')]
>
> This is just a simple example. But it illustrates an unnecessary
> limitation. The matcher can do the repeated matching; you just can't
> get the results out.
Thanks for the further explanation.
Assuming a fake API that returned multiple group matches as a
tuple:
>>? print(re.match(r"^([a-z])+$", "abcdef").groups())
(('a', 'b', 'c', 'd', 'e', 'f'),)
I was thinking of applying findall something like this, but you
have to make multiple calls:
>>> m = re.match(r"^[a-z]+$", s)
>>> if m:
... print(re.findall(r"[a-z]", m.group()))
...
['a', 'b', 'c', 'd', 'e', 'f']
I can see that getting really annoying. Is there a better way to
make multiple group matches accessible without adding a third
element type as a group element?
--
Neil Cerutti
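The two-step approach in this post can be wrapped in a small helper so that the second call is hidden from the caller. This is only a sketch: the name `repeated_matches` and its parameters are invented here, and it assumes the item pattern, applied to the matched text, finds the same pieces that the repetition consumed.

```python
import re

def repeated_matches(whole_pattern, item_pattern, text):
    """Validate text against whole_pattern, then list each item in it.

    Returns None when the text as a whole does not match.
    """
    m = re.fullmatch(whole_pattern, text)
    if m is None:
        return None
    # Second pass: pull out every occurrence of the repeated piece.
    return re.findall(item_pattern, m.group())

print(repeated_matches(r"[a-z]+", r"[a-z]", "abcdef"))
print(repeated_matches(r"[a-z]+", r"[a-z]", "abc1"))
```

The cost is that the repeated element has to be written twice, once inside the whole pattern and once on its own.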
| From | John Nagle <nagle@animats.com> |
|---|---|
| Date | 2011-04-24 12:43 -0700 |
| Message-ID | <4db47d66$0$10524$742ec2ed@news.sonic.net> |
| In reply to | #3794 |
On 4/21/2011 6:16 AM, Neil Cerutti wrote:
> On 2011-04-20, John Nagle <nagle@animats.com> wrote:
>> Findall does something a bit different. It returns a list of
>> matches of the entire pattern, not repeats of groups within
>> the pattern.
>>
>> Consider a regular expression for matching domain names:
>>
>> >>> kre = re.compile(r'^([a-zA-Z0-9\-]+)(?:\.([a-zA-Z0-9\-]+))+$')
>> >>> s = 'www.example.com'
>> >>> ms = kre.match(s)
>> >>> ms.groups()
>> ('www', 'com')
>> >>> msall = kre.findall(s)
>> >>> msall
>> [('www', 'com')]
>>
>> This is just a simple example. But it illustrates an unnecessary
>> limitation. The matcher can do the repeated matching; you just can't
>> get the results out.
>
> Thanks for the further explanation.
>
> Assuming a fake API that returned multiple group matches as a
> tuple:
>
> >>? print(re.match(r"^([a-z])+$", "abcdef").groups())
> (('a', 'b', 'c', 'd', 'e', 'f'),)
>
> I was thinking of applying findall something like this, but you
> have to make multiple calls:
>
> >>> m = re.match(r"^[a-z]+$", s)
> >>> if m:
> ... print(re.findall(r"[a-z]", m.group()))
> ...
> ['a', 'b', 'c', 'd', 'e', 'f']
>
> I can see that getting really annoying. Is there a better way to
> make multiple group matches accessible without adding a third
> element type as a group element?
The most elegant solution would be to have a regular expression
function that returned a tree of tuples or lists. Then you could
express an entire language syntax as a regular expression and
get out a parse tree.
Since the regular expression system is actually doing that work,
then discarding the results, it seems a reasonable extension.
I'm not suggesting extending regular expression matching itself,
just the way the results are stored.
John Nagle
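Something in that direction can be approximated by hand with the standard `re` module, at the cost of applying the grammar in two passes. The sketch below is not an existing API: the input format (`key=v1,v2;key=v3`), the patterns, and the name `parse_assignments` are all invented for illustration.

```python
import re

# One "key=1,2,3" item; the full text is a ';'-separated list of such items.
ITEM = r"[a-z]+=\d+(?:,\d+)*"
WHOLE = re.compile(rf"{ITEM}(?:;{ITEM})*")
ITEM_PARTS = re.compile(r"([a-z]+)=(\d+(?:,\d+)*)")

def parse_assignments(text):
    """Return a nested [(key, [values...]), ...] list, or None if invalid."""
    # First pass: check that the text as a whole fits the grammar.
    if WHOLE.fullmatch(text) is None:
        return None
    # Second pass: rebuild by hand the nesting that the matcher threw away.
    return [(key, values.split(","))
            for key, values in ITEM_PARTS.findall(text)]

print(parse_assignments("a=1,2;b=3"))   # [('a', ['1', '2']), ('b', ['3'])]
print(parse_assignments("a=;b=3"))      # None
```

Every level of nesting needs its own extra pass here, which is the bookkeeping a tree-returning match function would make unnecessary.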
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2011-04-20 21:03 +0100 |
| Message-ID | <mailman.669.1303329895.9059.python-list@python.org> |
| In reply to | #3741 |
On 20/04/2011 20:20, John Nagle wrote:
> Here's something that surprised me about Python regular expressions.
>
> >>> krex = re.compile(r"^([a-z])+$")
> >>> s = "abcdef"
> >>> ms = krex.match(s)
> >>> ms.groups()
> ('f',)
>
> The parentheses indicate a capturing group within the
> regular expression, and the "+" indicates that the
> group can appear one or more times. The regular
> expression matches that way. But instead of returning
> a captured group for each character, it returns only the
> last one.
>
> The documentation in fact says that, at
>
> http://docs.python.org/library/re.html
>
> "If a group is contained in a part of the pattern that matched multiple
> times, the last match is returned."
>
> That's kind of lame, though. I'd expect that there would be some way
> to retrieve all matches.
>
You should take a look at the regex module on PyPI. :-)
| From | Vlastimil Brom <vlastimil.brom@gmail.com> |
|---|---|
| Date | 2011-04-21 15:57 +0200 |
| Message-ID | <mailman.701.1303394244.9059.python-list@python.org> |
| In reply to | #3741 |
2011/4/20 John Nagle <nagle@animats.com>:
> Here's something that surprised me about Python regular expressions.
>
> >>> krex = re.compile(r"^([a-z])+$")
> >>> s = "abcdef"
> >>> ms = krex.match(s)
> >>> ms.groups()
> ('f',)
>
>...
> "If a group is contained in a part of the pattern that matched multiple
> times, the last match is returned."
>
> That's kind of lame, though. I'd expect that there would be some way
> to retrieve all matches.
>
> John Nagle
Hi,
do you mean something like:
>>> import regex
>>> ms = regex.match(r"^([a-z])+$", "abcdef")
>>> ms.captures(1)
['a', 'b', 'c', 'd', 'e', 'f']
>>>
>>> help(ms.captures)
Help on built-in function captures:
captures(...)
    captures([group1, ...]) --> list of strings or tuple of list of strings.
    Return the captures of one or more subgroups of the match. If there is a
    single argument, the result is a list of strings; if there are multiple
    arguments, the result is a tuple of lists with one item per argument; if
    there are no arguments, the captures of the whole match is returned. Group
    0 is the whole match.
>>>
cf.
http://pypi.python.org/pypi/regex
hth,
vbr
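Applied to the domain-name pattern from earlier in the thread, the same `captures` method returns every repetition of the second group. A sketch, assuming the third-party `regex` package is installed (`pip install regex`); the import guard and the helper name `domain_labels` are additions for illustration only.

```python
try:
    import regex  # third-party package from PyPI, not the stdlib re
except ImportError:
    regex = None

PATTERN = r'^([a-zA-Z0-9\-]+)(?:\.([a-zA-Z0-9\-]+))+$'

def domain_labels(name):
    """Return all labels of name, or None if it does not match."""
    if regex is None:
        raise RuntimeError("the third-party 'regex' package is required")
    m = regex.match(PATTERN, name)
    if m is None:
        return None
    # captures(n) lists every repetition of group n, not just the last one.
    return m.captures(1) + m.captures(2)

if regex is not None:
    print(domain_labels('www.example.com'))   # ['www', 'example', 'com']
```

With the standard `re` module the same match yields only `('www', 'com')` from `groups()`, as shown earlier in the thread.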
| From | Vlastimil Brom <vlastimil.brom@gmail.com> |
|---|---|
| Date | 2011-04-21 20:36 +0200 |
| Message-ID | <mailman.717.1303411028.9059.python-list@python.org> |
| In reply to | #3741 |
2011/4/20 MRAB <python@mrabarnett.plus.com>:
> On 20/04/2011 20:20, John Nagle wrote:
>>
>> Here's something that surprised me about Python regular expressions.
>>
>> ...
> You should take a look at the regex module on PyPI. :-)
Ah well...
sorry for possibly destroying the point and the aha! effect ...
vbr