one more question on regex

Started by	mg <noOne@nowhere.com>
First post	2016-01-22 15:32 +0000
Last post	2016-01-23 11:39 +0100
Articles	6 — 3 participants

  one more question on regex mg <noOne@nowhere.com> - 2016-01-22 15:32 +0000
    Re: one more question on regex Peter Otten <__peter__@web.de> - 2016-01-22 16:47 +0100
    Re: one more question on regex mg <noOne@nowhere.com> - 2016-01-22 15:50 +0000
      Re: one more question on regex Vlastimil Brom <vlastimil.brom@gmail.com> - 2016-01-22 21:10 +0100
        Re: one more question on regex mg <noOne@nowhere.com> - 2016-01-22 22:47 +0000
          Re: one more question on regex Vlastimil Brom <vlastimil.brom@gmail.com> - 2016-01-23 11:39 +0100

#102020 — one more question on regex

From	mg <noOne@nowhere.com>
Date	2016-01-22 15:32 +0000
Subject	one more question on regex
Message-ID	<n7ti39$7rt$1@gioia.aioe.org>

python 3.4.3 

>>> import re
>>> re.search('(ab){2}','abzzabab')
<_sre.SRE_Match object; span=(4, 8), match='abab'>

>>> re.findall('(ab){2}','abzzabab')
['ab']

Why is the match 'abab' for search() but 'ab' for findall()?

#102021

From	Peter Otten <__peter__@web.de>
Date	2016-01-22 16:47 +0100
Message-ID	<mailman.171.1453477686.15297.python-list@python.org>
In reply to	#102020

mg wrote:

> python 3.4.3
> 
> import re
> re.search('(ab){2}','abzzabab')
> <_sre.SRE_Match object; span=(4, 8), match='abab'>
> 
> >>> re.findall('(ab){2}','abzzabab')
> ['ab']
> 
> Why for search() the match is 'abab' and for findall the match is 'ab'?

I suppose someone thought it was convenient for findall to return the 
explicit groups if there are any. If you want the whole match, aka group(0), 
you can get it by making the group non-capturing:

>>> re.findall('(?:ab){2}','abzzabab')
['abab']
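
A minimal sketch of the difference described above, using only the pattern and string from the question. The variable names are illustrative and not part of the original post.

```python
import re

text = 'abzzabab'

# Capturing group: findall returns the content of group 1 for each match.
with_capture = re.findall(r'(ab){2}', text)

# Non-capturing group: the pattern has no groups, so findall returns
# the whole matched text instead.
without_capture = re.findall(r'(?:ab){2}', text)

print(with_capture)     # ['ab']
print(without_capture)  # ['abab']
```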

#102022

From	mg <noOne@nowhere.com>
Date	2016-01-22 15:50 +0000
Message-ID	<n7tj3j$9ra$1@gioia.aioe.org>
In reply to	#102020

On Fri, 22 Jan 2016 15:32:57 +0000, mg wrote:

> python 3.4.3
> 
> import re
> re.search('(ab){2}','abzzabab')
> <_sre.SRE_Match object; span=(4, 8), match='abab'>
> 
> >>> re.findall('(ab){2}','abzzabab')
> ['ab']
> 
> Why for search() the match is 'abab' and for findall the match is 'ab'?

finditer seems to be consistent with search:

>>> regex = re.compile('(ab){2}')
>>> for match in regex.finditer('abzzababab'):
...     print("%s: %s" % (match.start(), match.span()))
...
4: (4, 8)

#102025

From	Vlastimil Brom <vlastimil.brom@gmail.com>
Date	2016-01-22 21:10 +0100
Message-ID	<mailman.173.1453493453.15297.python-list@python.org>
In reply to	#102022

2016-01-22 16:50 GMT+01:00 mg <noOne@nowhere.com>:
> [...]

Hi,
as was already pointed out, findall "collects" the content of the
capturing groups (if present), rather than the whole matching text.

For a repeated capturing group, only the content of its last repetition
is kept and the previous ones are discarded; cf.:

>>> re.findall('(?i)(a)x(b)+','axbB')
[('a', 'B')]
>>>
(for multiple capturing groups in the pattern, a tuple of the captured
parts is collected for each match)

or with your example, with the parts of the string differentiated by
upper/lower case:
>>> re.findall('(?i)(ab){2}','aBzzAbAB')
['AB']
>>>

hth,
   vbr
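
To see the "last repetition wins" behaviour outside of findall, the following sketch inspects a match object directly. Rescanning the matched text with the inner pattern is just one possible way to recover every repetition; it is an illustration, not something proposed in the thread.

```python
import re

m = re.search(r'(?i)(ab){2}', 'aBzzAbAB')

whole = m.group(0)  # the entire matched text
last = m.group(1)   # only the final repetition of the group is kept

# Assumption for illustration: recover each repetition by scanning
# the matched text again with the inner pattern alone.
parts = re.findall(r'(?i)ab', whole)

print(whole)  # AbAB
print(last)   # AB
print(parts)  # ['Ab', 'AB']
```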

#102026

From	mg <noOne@nowhere.com>
Date	2016-01-22 22:47 +0000
Message-ID	<n7ubhk$k9f$1@gioia.aioe.org>
In reply to	#102025

On Fri, 22 Jan 2016 21:10:44 +0100, Vlastimil Brom wrote:

> [...]

Your explanation of the re.findall() results is correct. My point is that the 
documentation states:

re.findall(pattern, string, flags=0)
    Return all non-overlapping matches of pattern in string, as a list of
    strings

and this is not what re.findall does. IMHO it would be more reasonable 
to get back the whole matches, since that seems to me the most useful 
information for the user. In any case I'll go with finditer, which returns 
match objects carrying all the information anyone could look for.
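
A possible finditer-based replacement for findall that always returns whole matches, regardless of any groups in the pattern. The helper name find_whole_matches is hypothetical and not part of the re module.

```python
import re

def find_whole_matches(pattern, string, flags=0):
    """Return the full text of every non-overlapping match.

    Hypothetical helper: unlike re.findall, groups in the pattern
    do not change what is returned.
    """
    return [m.group(0) for m in re.finditer(pattern, string, flags)]

print(find_whole_matches(r'(ab){2}', 'abzzababab'))  # ['abab']
```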

#102031

From	Vlastimil Brom <vlastimil.brom@gmail.com>
Date	2016-01-23 11:39 +0100
Message-ID	<mailman.174.1453545581.15297.python-list@python.org>
In reply to	#102026

2016-01-22 23:47 GMT+01:00 mg <noOne@nowhere.com>:
> On Fri, 22 Jan 2016 21:10:44 +0100, Vlastimil Brom wrote:
>
>> [...]
>
> You explanation of re.findall() results is correct. My point is that the
> documentation states:
>
> re.findall(pattern, string, flags=0)
>     Return all non-overlapping matches of pattern in string, as a list of
> strings
>
> and this is not what re.findall does. IMHO it should be more reasonable
> to get back the whole matches, since this seems to me the most useful
> information for the user. In any case I'll go with finditer, that returns
> in match object all the infos that anyone can look for.

Hi,
I don't know the reasoning behind this special behaviour of findall, but
it is documented explicitly:
https://docs.python.org/3/library/re.html#re.findall
"... If one or more groups are present in the pattern, return a list
of groups; this will be a list of tuples if the pattern has more than
one group. ..."
finditer is clearly much more robust for general usage.
I only use findall for quick one-line tests, and there one has to
account for these specifics, either by using non-capturing groups
or by enclosing the whole pattern in a "main" group and using the first
items of the resulting tuples.
vbr
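
The "main" group workaround mentioned above could look roughly like this, reusing the pattern and string from the original question. The variable names are illustrative.

```python
import re

# The outer "main" group becomes group 1; the repeated inner group is group 2.
# With two groups in the pattern, findall returns a list of tuples.
pairs = re.findall(r'((ab){2})', 'abzzabab')

# The first item of each tuple is the whole match.
whole_matches = [pair[0] for pair in pairs]

print(pairs)          # [('abab', 'ab')]
print(whole_matches)  # ['abab']
```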
