Groups > comp.lang.python > #17195 > unrolled thread
| Started by | candide <candide@free.invalid> |
|---|---|
| First post | 2011-12-14 12:12 +0100 |
| Last post | 2011-12-14 14:38 +0100 |
| Articles | 4 — 2 participants |
Regexp : repeated group identification candide <candide@free.invalid> - 2011-12-14 12:12 +0100
Re: Regexp : repeated group identification Vlastimil Brom <vlastimil.brom@gmail.com> - 2011-12-14 12:34 +0100
Re: Regexp : repeated group identification candide <candide@free.invalid> - 2011-12-14 13:57 +0100
Re: Regexp : repeated group identification Vlastimil Brom <vlastimil.brom@gmail.com> - 2011-12-14 14:38 +0100
| From | candide <candide@free.invalid> |
|---|---|
| Date | 2011-12-14 12:12 +0100 |
| Subject | Regexp : repeated group identification |
| Message-ID | <4ee88488$0$27871$426a74cc@news.free.fr> |
Consider the following code
# ----------------------------
import re
z=re.match('(Spam\d)+', 'Spam4Spam2Spam7Spam8')
print z.group(0)
print z.group(1)
# ----------------------------
outputting :
----------------------------
Spam4Spam2Spam7Spam8
Spam8
----------------------------
The '(Spam\d)+' regexp is tested against 'Spam4Spam2Spam7Spam8' and the
regexp matches the string.
Group numbered one within the regex '(Spam\d)+' refers to Spam\d
The four substrings
Spam4 Spam2 Spam7 and Spam8
match the group numbered 1.
So I don't understand why z.group(1) gives the last substring (i.e. Spam8
as the output shows), why not another one, Spam4 for example?
| From | Vlastimil Brom <vlastimil.brom@gmail.com> |
|---|---|
| Date | 2011-12-14 12:34 +0100 |
| Message-ID | <mailman.3635.1323862458.27778.python-list@python.org> |
| In reply to | #17195 |
2011/12/14 candide <candide@free.invalid>:
> Consider the following code
>
> # ----------------------------
> import re
>
> z=re.match('(Spam\d)+', 'Spam4Spam2Spam7Spam8')
> print z.group(0)
> print z.group(1)
> # ----------------------------
>
> outputting :
>
> ----------------------------
> Spam4Spam2Spam7Spam8
> Spam8
> ----------------------------
>
> The '(Spam\d)+' regexp is tested against 'Spam4Spam2Spam7Spam8' and the
> regexp matches the string.
>
> Group numbered one within the regex '(Spam\d)+' refers to Spam\d
>
> The four substrings
>
> Spam4 Spam2 Spam7 and Spam8
>
> match the group numbered 1.
>
> So I don't understand why z.group(1) gives the last substring (i.e. Spam8 as
> the output shows), why not another one, Spam4 for example?
> --
> http://mail.python.org/mailman/listinfo/python-list
Hi,
you may find a tiny notice in the re docs on this:
http://docs.python.org/library/re.html#re.MatchObject.group
"If a group is contained in a part of the pattern that matched
multiple times, the last match is returned."
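The documented behaviour can be checked with the standard `re` module alone. The following is only a minimal sketch: it reuses the string from the question, and the variable names `last` and `all_parts` are made up for illustration. It also shows one possible stdlib workaround, which is to drop the repeated group and let `re.findall` collect each `Spam\d` unit.

```python
import re

text = 'Spam4Spam2Spam7Spam8'

# A group inside a repeated part of the pattern keeps only
# the text from its last iteration.
m = re.match(r'(Spam\d)+', text)
last = m.group(1)

# Workaround sketch: match the unit on its own and collect
# every occurrence instead of relying on the repeated group.
all_parts = re.findall(r'Spam\d', text)

print(last)
print(all_parts)
```

Note that `re.findall` scans the whole string and is not anchored, so it is not a drop-in replacement when the units must be contiguous from the start of the string.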
If you need to work with the content captured in the repeated group,
you may check the new regex implementation:
http://pypi.python.org/pypi/regex
Which has a special "captures" method of the match object for this
(beyond many other improvements):
>>> import regex
>>> m=regex.match('(Spam\d)+', 'Spam4Spam2Spam7Spam8')
>>> m.captures(1)
['Spam4', 'Spam2', 'Spam7', 'Spam8']
>>>
hth,
vbr
| From | candide <candide@free.invalid> |
|---|---|
| Date | 2011-12-14 13:57 +0100 |
| Message-ID | <4ee89d3d$0$7725$426a74cc@news.free.fr> |
| In reply to | #17196 |
On 14/12/2011 12:34, Vlastimil Brom wrote:
> "If a group is contained in a part of the pattern that matched
> multiple times, the last match is returned."
>
I missed this point, your answer matches my question ;) thanks.
> If you need to work with the content captured in the repeated group,
> you may check the new regex implementation:
> http://pypi.python.org/pypi/regex
>
> Which has a special "captures" method of the match object for this
> (beyond many other improvements):
>
> >>> import regex
> >>> m=regex.match('(Spam\d)+', 'Spam4Spam2Spam7Spam8')
> >>> m.captures(1)
> ['Spam4', 'Spam2', 'Spam7', 'Spam8']
> >>>
>
Thanks for the reference and the example. I didn't know of this
reimplementation, hoping it offers the Aho-Corasick algorithm allowing
multiple keys search.
| From | Vlastimil Brom <vlastimil.brom@gmail.com> |
|---|---|
| Date | 2011-12-14 14:38 +0100 |
| Message-ID | <mailman.3643.1323869918.27778.python-list@python.org> |
| In reply to | #17209 |
2011/12/14 candide <candide@free.invalid>:
...
>
> Thanks for the reference and the example. I didn't know of this
> reimplementation, hoping it offers the Aho-Corasick algorithm allowing
> multiple keys search.
> --
> http://mail.python.org/mailman/listinfo/python-list
Hi,
I am not sure about the underlying algorithm (it could as well be an
internal expansion of the alternatives like ...|...|...), but you can
use a list (set, actually) of alternatives to search for.
check the "named lists" feature, \L<...>
hth,
vbr
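The "expansion of the alternatives" idea mentioned above can be sketched with the standard `re` module. This is only an illustration of joining keys into one alternation; it is not how the `regex` module implements named lists, and it is not Aho-Corasick. The helper name `search_any` and the sample keys are hypothetical.

```python
import re

def search_any(keys, text):
    """Return every occurrence of any key in text (hypothetical helper)."""
    # Put longer keys first so a key is not shadowed by its own prefix.
    ordered = sorted(set(keys), key=len, reverse=True)
    # Escape each key so it is matched literally, then join with '|'.
    pattern = re.compile('|'.join(re.escape(k) for k in ordered))
    return pattern.findall(text)

found = search_any(['Spam4', 'Spam7', 'Eggs'], 'Spam4Spam2Spam7Spam8')
print(found)
```

The sketch assumes a non-empty collection of keys; an empty one would produce an empty pattern that matches at every position.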