
How to write this repeat matching?

Started by	rxjwg98@gmail.com
First post	2014-07-06 11:57 -0700
Last post	2014-07-07 10:18 -0600
Articles	6 — 4 participants


  How to write this repeat matching? rxjwg98@gmail.com - 2014-07-06 11:57 -0700
    Re: How to write this repeat matching? MRAB <python@mrabarnett.plus.com> - 2014-07-06 20:19 +0100
    Re: How to write this repeat matching? Ian Kelly <ian.g.kelly@gmail.com> - 2014-07-06 13:26 -0600
      Re: How to write this repeat matching? rxjwg98@gmail.com - 2014-07-07 06:30 -0700
        Re: How to write this repeat matching? Anssi Saari <as@sci.fi> - 2014-07-07 18:48 +0300
        Re: How to write this repeat matching? Ian Kelly <ian.g.kelly@gmail.com> - 2014-07-07 10:18 -0600

#74045 — How to write this repeat matching?

From	rxjwg98@gmail.com
Date	2014-07-06 11:57 -0700
Subject	How to write this repeat matching?
Message-ID	<93a40570-00ed-4507-aa16-221d7e500468@googlegroups.com>

Hi,
The Python website says that the following match reaches 'abcb' in 6 steps:

.............
A step-by-step example will make this more obvious. Let's consider the expression
a[bcd]*b. This matches the letter 'a', zero or more letters from the class [bcd], 
and finally ends with a 'b'. Now imagine matching this RE against the string 
abcbd.

The end of the RE has now been reached, and it has matched abcb.  This 
demonstrates how the matching engine goes as far as it can at first, and if no
match is found it will then progressively back up and retry the rest of the RE
again and again. It will back up until it has tried zero matches for [bcd]*, and
if that subsequently fails, the engine will conclude that the string doesn't
match the RE at all.
.............

I write the following code:

.......
import re

line = "abcdb"

matchObj = re.match( 'a[bcd]*b', line) 

if matchObj:
   print "matchObj.group() : ", matchObj.group()
   print "matchObj.group(0) : ", matchObj.group()
   print "matchObj.group(1) : ", matchObj.group(1)
   print "matchObj.group(2) : ", matchObj.group(2)
else:
   print "No match!!"
.........

I used the pattern from the documentation, but the result is not 'abcb'.

Only matchObj.group(0): abcdb

is displayed. None of the other groups has any content.

How should I write this greedy search?

Thanks,


#74050

From	MRAB <python@mrabarnett.plus.com>
Date	2014-07-06 20:19 +0100
Message-ID	<mailman.11555.1404674388.18130.python-list@python.org>
In reply to	#74045

On 2014-07-06 19:57, rxjwg98@gmail.com wrote:
> Hi,
> On Python website, it says that the following match can reach 'abcb' in 6 steps:
>
> .............
> A step-by-step example will make this more obvious. Let's consider the expression
> a[bcd]*b. This matches the letter 'a', zero or more letters from the class [bcd],
> and finally ends with a 'b'. Now imagine matching this RE against the string
> abcbd.
>
> The end of the RE has now been reached, and it has matched abcb.  This
> demonstrates how the matching engine goes as far as it can at first, and if no
> match is found it will then progressively back up and retry the rest of the RE
> again and again. It will back up until it has tried zero matches for [bcd]*, and
> if that subsequently fails, the engine will conclude that the string doesn't
> match the RE at all.
> .............
>
> I write the following code:
>
> .......
> import re
>
> line = "abcdb"
>
> matchObj = re.match( 'a[bcd]*b', line)
>
> if matchObj:
>     print "matchObj.group() : ", matchObj.group()
>     print "matchObj.group(0) : ", matchObj.group()
>     print "matchObj.group(1) : ", matchObj.group(1)
>     print "matchObj.group(2) : ", matchObj.group(2)
> else:
>     print "No match!!"
> .........
>
> In which I have used its match pattern, but the result is not 'abcb'
>
That's because the example has 'abcb', but you have:

     line = "abcdb"

(You've put a 'd' in it.)
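
A quick side-by-side check of the two strings may make the difference easier to see. This is only a sketch in Python 3 syntax; the variable names are made up for illustration and do not come from the original post.

```python
import re

pattern = 'a[bcd]*b'

doc_string = 'abcbd'    # the string used in the HOWTO example
your_string = 'abcdb'   # the string in the posted code (note the 'd')

doc_match = re.match(pattern, doc_string).group()
your_match = re.match(pattern, your_string).group()

print(doc_match)    # abcb
print(your_match)   # abcdb
```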

> Only matchObj.group(0): abcdb
>
> displays. All other group(s) have no content.
>
There are no capture groups in your regex, only group 0 (the entire
matched part).
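
As a rough illustration (Python 3 syntax, names chosen here for the example), the compiled pattern reports how many capture groups it defines. Asking for a group that does not exist raises IndexError rather than returning an empty value.

```python
import re

compiled = re.compile('a[bcd]*b')
print(compiled.groups)    # 0: the pattern defines no capture groups

m = compiled.match('abcdb')
print(m.group(0))         # abcdb (the entire match)
print(m.groups())         # () because there are no capture groups

error_message = None
try:
    m.group(1)            # there is no group 1 in this pattern
except IndexError as exc:
    error_message = str(exc)
    print('IndexError:', error_message)
```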

> How to write this greedy search?
>


#74054

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2014-07-06 13:26 -0600
Message-ID	<mailman.11559.1404675307.18130.python-list@python.org>
In reply to	#74045

On Sun, Jul 6, 2014 at 12:57 PM,  <rxjwg98@gmail.com> wrote:
> I write the following code:
>
> .......
> import re
>
> line = "abcdb"
>
> matchObj = re.match( 'a[bcd]*b', line)
>
> if matchObj:
>    print "matchObj.group() : ", matchObj.group()
>    print "matchObj.group(0) : ", matchObj.group()
>    print "matchObj.group(1) : ", matchObj.group(1)
>    print "matchObj.group(2) : ", matchObj.group(2)
> else:
>    print "No match!!"
> .........
>
> In which I have used its match pattern, but the result is not 'abcb'

You're never going to get a match of 'abcb' on that string, because
'abcb' is not found anywhere in that string.

There are two possible matches for the given pattern over that string:
'abcdb' and 'ab'.  The first one matches the [bcd]* three times, and
the second one matches it zero times.  Because the matching is greedy,
you get the result that matches three times.  It cannot match one, two
or four times because then there would be no 'b' following the [bcd]*
portion as required by the pattern.
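
One way to check this, sketched in Python 3 syntax, is to force the repeat count with {n} and see which counts still allow a match. The non-greedy form *? is included only for contrast; the original question did not use it.

```python
import re

line = 'abcdb'

# Force [bcd] to repeat exactly n times and record which counts match.
matching_counts = []
for n in range(5):
    m = re.match('a[bcd]{%d}b' % n, line)
    if m:
        matching_counts.append((n, m.group()))

print(matching_counts)    # [(0, 'ab'), (3, 'abcdb')]

greedy = re.match('a[bcd]*b', line).group()    # prefers the most repetitions
lazy = re.match('a[bcd]*?b', line).group()     # prefers the fewest repetitions
print(greedy, lazy)       # abcdb ab
```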

>
> Only matchObj.group(0): abcdb
>
> displays. All other group(s) have no content.

Calling match.group(0) is equivalent to calling match.group() with no
arguments. In both cases it returns the string matched by the entire
regular expression.  match.group(1) and match.group(2) return the
values of the first and second capturing groups respectively, but the
pattern does not have any capturing groups.  If you want a capturing
group, then enclose the part that you want it to capture in parentheses.
For example, if you change the pattern to:

    matchObj = re.match('a([bcd]*)b', line)

then the value of matchObj.group(1) will be 'bcd'.
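
A minimal runnable version of that change, written in Python 3 syntax as a sketch:

```python
import re

line = 'abcdb'
matchObj = re.match('a([bcd]*)b', line)

print(matchObj.group(0))    # abcdb (the whole match)
print(matchObj.group(1))    # bcd   (the part inside the parentheses)
print(matchObj.groups())    # ('bcd',)
```

With only one pair of parentheses in the pattern there is still no group 2, so matchObj.group(2) would raise IndexError.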


#74107

From	rxjwg98@gmail.com
Date	2014-07-07 06:30 -0700
Message-ID	<3840e655-b202-4a8d-b432-77c2d3cd58a4@googlegroups.com>
In reply to	#74054

On Sunday, July 6, 2014 3:26:44 PM UTC-4, Ian wrote:
> On Sun, Jul 6, 2014 at 12:57 PM,  <rxjwg98@gmail.com> wrote:
> > I write the following code:
> >
> > .......
> > import re
> >
> > line = "abcdb"
> >
> > matchObj = re.match( 'a[bcd]*b', line)
> >
> > if matchObj:
> >    print "matchObj.group() : ", matchObj.group()
> >    print "matchObj.group(0) : ", matchObj.group()
> >    print "matchObj.group(1) : ", matchObj.group(1)
> >    print "matchObj.group(2) : ", matchObj.group(2)
> > else:
> >    print "No match!!"
> > .........
> >
> > In which I have used its match pattern, but the result is not 'abcb'
>
> You're never going to get a match of 'abcb' on that string, because
> 'abcb' is not found anywhere in that string.
>
> There are two possible matches for the given pattern over that string:
> 'abcdb' and 'ab'.  The first one matches the [bcd]* three times, and
> the second one matches it zero times.  Because the matching is greedy,
> you get the result that matches three times.  It cannot match one, two
> or four times because then there would be no 'b' following the [bcd]*
> portion as required by the pattern.
>
> >
> > Only matchObj.group(0): abcdb
> >
> > displays. All other group(s) have no content.
>
> Calling match.group(0) is equivalent to calling match.group without
> arguments. In that case it returns the matched string of the entire
> regular expression.  match.group(1) and match.group(2) will return the
> value of the first and second matching group respectively, but the
> pattern does not have any matching groups.  If you want a matching
> group, then enclose the part that you want it to match in parentheses.
> For example, if you change the pattern to:
>
>     matchObj = re.match('a([bcd]*)b', line)
>
> then the value of matchObj.group(1) will be 'bcd'

Because I am new to Python, I may not describe the question clearly. Could you
read the original problem on web:

https://docs.python.org/2/howto/regex.html

It says that it gets 'abcb'. Could you explain it to me? Thanks again

A step-by-step example will make this more obvious. Let's consider the
expression a[bcd]*b. This matches the letter 'a', zero or more letters from
the class [bcd], and finally ends with a 'b'. Now imagine matching this RE
against the string abcbd.

Step  Matched  Explanation
1     a        The a in the RE matches.
2     abcbd    The engine matches [bcd]*, going as far as it can, which is
               to the end of the string.
3     Failure  The engine tries to match b, but the current position is at
               the end of the string, so it fails.
4     abcb     Back up, so that [bcd]* matches one less character.
5     Failure  Try b again, but the current position is at the last
               character, which is a 'd'.
6     abc      Back up again, so that [bcd]* is only matching bc.
6     abcb     Try b again. This time the character at the current position
               is 'b', so it succeeds.


#74118

From	Anssi Saari <as@sci.fi>
Date	2014-07-07 18:48 +0300
Message-ID	<vg3mwclmaol.fsf@coffee.modeemi.fi>
In reply to	#74107

rxjwg98@gmail.com writes:

> Because I am new to Python, I may not describe the question clearly. Could you
> read the original problem on web:
>
> https://docs.python.org/2/howto/regex.html
>
> It says that it gets 'abcb'. Could you explain it to me? Thanks again

Actually, it tries to explain how * works in the regular expression
engine. Do you feel that's a crucial thing for a beginner to understand
about Python? Hopefully your answer is no and you can move on.


#74123

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2014-07-07 10:18 -0600
Message-ID	<mailman.11599.1404749958.18130.python-list@python.org>
In reply to	#74107

On Mon, Jul 7, 2014 at 7:30 AM,  <rxjwg98@gmail.com> wrote:
> Because I am new to Python, I may not describe the question clearly. Could you
> read the original problem on web:
>
> https://docs.python.org/2/howto/regex.html
>
> It says that it gets 'abcb'. Could you explain it to me? Thanks again

The string being matched in the explanation at that link is 'abcbd',
not 'abcdb'. The 'a' in the pattern matches the 'a' in the string, the
'[bcd]*' in the pattern matches the 'bc' in the string (with a repeat
count of 2), and finally the 'b' in the pattern matches the 'b'
following that in the string.
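
The backing-up behaviour described in the HOWTO can be imitated with a small hand-written trace. This is only a sketch in Python 3 syntax: trace_match is a hypothetical helper hard-coded for the pattern a[bcd]*b, and it is not how the re module is actually implemented.

```python
import re

def trace_match(text):
    """Hand-rolled trace of a[bcd]*b against text (illustration only)."""
    steps = []
    if not text.startswith('a'):
        return None, steps
    steps.append('a')                       # the literal 'a' matches
    end = 1
    while end < len(text) and text[end] in 'bcd':
        end += 1                            # [bcd]* grabs as much as it can
    steps.append(text[:end])
    while end >= 1:
        if end < len(text) and text[end] == 'b':
            steps.append(text[:end + 1])    # the final 'b' matches
            return text[:end + 1], steps
        steps.append('Failure')
        end -= 1                            # back up by one character
        if end >= 1:
            steps.append(text[:end])
    return None, steps

result, steps = trace_match('abcbd')
for number, step in enumerate(steps, 1):
    print(number, step)

# The hand-rolled result agrees with the real engine for this input.
print(result == re.match('a[bcd]*b', 'abcbd').group())    # True
```

Counting every row separately, the trace has seven entries: a, abcbd, Failure, abcb, Failure, abc, abcb. Running the same helper on 'abcdb' stops after one back-up step and returns 'abcdb', which is the result the posted code printed.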

