

Groups > comp.lang.python > #30781 > unrolled thread

Re: + in regular expression

Started by: Cameron Simpson <cs@zip.com.au>
First post: 2012-10-05 15:22 +1000
Last post: 2012-10-05 17:07 +0100
Articles: 5 (4 participants)


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.


Contents

  Re: + in regular expression Cameron Simpson <cs@zip.com.au> - 2012-10-05 15:22 +1000
    Re: + in regular expression Duncan Booth <duncan.booth@invalid.invalid> - 2012-10-05 09:23 +0000
      Re: Re: + in regular expression Evan Driscoll <driscoll@cs.wisc.edu> - 2012-10-05 10:27 -0500
      Re: + in regular expression Evan Driscoll <driscoll@cs.wisc.edu> - 2012-10-05 10:31 -0500
      Re: + in regular expression MRAB <python@mrabarnett.plus.com> - 2012-10-05 17:07 +0100

#30781 — Re: + in regular expression

From: Cameron Simpson <cs@zip.com.au>
Date: 2012-10-05 15:22 +1000
Subject: Re: + in regular expression
Message-ID: <mailman.1838.1349414969.27098.python-list@python.org>
On 03Oct2012 21:17, Ian Kelly <ian.g.kelly@gmail.com> wrote:
| On Wed, Oct 3, 2012 at 9:01 PM, contro opinion <contropinion@gmail.com> wrote:
| > why the  "\s{6}+"  is not a regular pattern?
| 
| Use a group: "(?:\s{6})+"

Yeah, it is probably a precedence issue in the grammar.
"(\s{6})+" is also accepted.
-- 
Cameron Simpson <cs@zip.com.au>

Disclaimer: ERIM wanted to share my opinions, but I wouldn't let them.
        - David Wiseman <dwiseman@erim.org>



#30796

From: Duncan Booth <duncan.booth@invalid.invalid>
Date: 2012-10-05 09:23 +0000
Message-ID: <XnsA0E3689B3693duncanbooth@127.0.0.1>
In reply to: #30781
Cameron Simpson <cs@zip.com.au> wrote:

> On 03Oct2012 21:17, Ian Kelly <ian.g.kelly@gmail.com> wrote:
>| On Wed, Oct 3, 2012 at 9:01 PM, contro opinion
>| <contropinion@gmail.com> wrote: 
>| > why the  "\s{6}+"  is not a regular pattern?
>| 
>| Use a group: "(?:\s{6})+"
> 
> Yeah, it is probably a precedence issue in the grammar.
> "(\s{6})+" is also accepted.

It's about syntax, not precedence, but the documentation doesn't really 
spell it out in full. Like most regex documentation it talks in woolly 
terms about special characters rather than giving a formal syntax.

A regular expression element may be followed by a quantifier. 
Quantifiers are '*', '+', '?', '{n}', '{n,m}' (and lazy quantifiers 
'*?', '+?', '{n,m}?'). There's nothing in the regex language which says 
you can follow an element with two quantifiers. Parentheses (grouping or 
non-grouping) around a regex turn that regex into a single element which 
is why you can then use another quantifier.
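
To make the grouping point concrete, here is a minimal sketch, assuming CPython's standard re module. The helper name try_compile is made up for this illustration. Interpreters of this thread's era reject the bare double quantifier with "multiple repeat"; Python 3.11 and later instead read the trailing "+" as a possessive quantifier, so the sketch reports the outcome for the bare form rather than assuming one.

```python
import re


def try_compile(pattern):
    """Hypothetical helper: return the compiled pattern, or the re.error raised."""
    try:
        return re.compile(pattern)
    except re.error as exc:
        return exc


text = " " * 12 + "x"

# Wrapping \s{6} in a group turns it into a single element, so "+" may follow.
grouped = re.compile(r"(?:\s{6})+")
capturing = re.compile(r"(\s{6})+")
print(grouped.match(text).end())    # 12: two runs of six spaces
print(capturing.match(text).end())  # 12

# The bare form: an error before Python 3.11, a possessive quantifier from 3.11 on.
bare = try_compile(r"\s{6}+")
print(type(bare).__name__, bare)
```

Both grouped forms match the same text; the capturing version additionally records the last run of six spaces as group 1.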

In BNF, I think Python's regexes would be something like:

re ::= union | simple-re
union ::= re "|" simple-re
simple-re ::= concatenation | basic-re
concatenation ::= simple-re basic-re
basic-re ::= element | element quantifier
element ::= group | nc-group | "." | "^" | "$" | char | charset
quantifier ::= "*" | "+" | "?" | "{" NUMBER "}" | "{" NUMBER "," NUMBER "}"
             | "*?" | "+?" | "{" NUMBER "," NUMBER "}?"
group ::= "(" re ")"
nc-group ::= "(?:" re ")"
char ::= <any non-special character> | "\" <any character>

... and so on. I didn't include charsets or all the (?...) extensions or 
special sequences.
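
As a rough check of the grammar above, the following sketch is a small recursive-descent recognizer for it. It is an illustration only, not how the re module actually parses patterns. All names here (accepts, GrammarError, the _parse_* helpers) are invented for the example, the left-recursive rules are rewritten as loops, and charsets are not handled, in line with the simplifications already stated.

```python
import re as stdlib_re

# quantifier ::= "*" | "+" | "?" | "{N}" | "{N,M}" | "*?" | "+?" | "{N,M}?"
QUANTIFIER = stdlib_re.compile(r"[*+]\??|\?|\{\d+\}|\{\d+,\d+\}\??")


class GrammarError(ValueError):
    """Raised when the input does not fit the simplified grammar."""


def accepts(pattern):
    """Return True if the whole pattern is derivable from the grammar."""
    try:
        return _parse_re(pattern, 0) == len(pattern)
    except GrammarError:
        return False


def _parse_re(s, i):
    # re ::= simple-re ("|" simple-re)*
    i = _parse_simple(s, i)
    while i < len(s) and s[i] == "|":
        i = _parse_simple(s, i + 1)
    return i


def _parse_simple(s, i):
    # simple-re ::= basic-re basic-re*
    i = _parse_basic(s, i)
    while i < len(s) and s[i] not in "|)":
        i = _parse_basic(s, i)
    return i


def _parse_basic(s, i):
    # basic-re ::= element | element quantifier  (at most ONE quantifier)
    i = _parse_element(s, i)
    m = QUANTIFIER.match(s, i)
    return m.end() if m else i


def _parse_element(s, i):
    if i >= len(s):
        raise GrammarError("expected an element")
    if s.startswith("(?:", i):
        return _close_group(s, _parse_re(s, i + 3))
    if s[i] == "(":
        return _close_group(s, _parse_re(s, i + 1))
    if s[i] == "\\":
        if i + 1 >= len(s):
            raise GrammarError("dangling backslash")
        return i + 2
    if s[i] in "*+?{)|":
        # A quantifier character where an element is required.
        raise GrammarError("nothing to repeat at position %d" % i)
    return i + 1  # ".", "^", "$" or an ordinary character


def _close_group(s, i):
    if i >= len(s) or s[i] != ")":
        raise GrammarError("missing closing parenthesis")
    return i + 1


for candidate in (r"\s{6}+", r"(?:\s{6})+", r"(\s{6})+", r"a**", r"a|b*"):
    print(candidate, accepts(candidate))
```

Under this grammar the second quantifier has no element to attach to, which is why the bare form is rejected while both grouped forms are accepted.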

-- 
Duncan Booth http://kupuguy.blogspot.com



#30821

From: Evan Driscoll <driscoll@cs.wisc.edu>
Date: 2012-10-05 10:27 -0500
Message-ID: <mailman.1855.1349450806.27098.python-list@python.org>
In reply to: #30796
On 10/05/2012 04:23 AM, Duncan Booth wrote:
> A regular expression element may be followed by a quantifier.
> Quantifiers are '*', '+', '?', '{n}', '{n,m}' (and lazy quantifiers
> '*?', '+?', '{n,m}?'). There's nothing in the regex language which says
> you can follow an element with two quantifiers.
In fact, *you* did -- the first sentence of that paragraph! :-)

\s is a regex, so you can follow it with a quantifier and get \s{6}. 
That's also a regex, so you should be able to follow it with a quantifier.

I can understand that you can create a grammar that excludes it. I'm 
actually really interested to know if anyone knows whether this was a 
deliberate decision and, if so, what the reason is. (And if not -- 
should it be considered a (low priority) bug?)

Was it because such patterns often reveal a mistake? Because "\s{6}+" 
has other meanings in different regex syntaxes and the designers didn't 
want confusion? Because it was simpler to parse that way? Because the 
"hey you recognize regular expressions by converting it to a finite 
automaton" story is a lie in most real-world regex implementations (in 
part because they're not actually regular expressions) and repeated 
quantifiers cause problems with the parsing techniques that actually get 
used?

Evan



#30822

From: Evan Driscoll <driscoll@cs.wisc.edu>
Date: 2012-10-05 10:31 -0500
Message-ID: <mailman.1857.1349451071.27098.python-list@python.org>
In reply to: #30796
On 10/05/2012 10:27 AM, Evan Driscoll wrote:
> On 10/05/2012 04:23 AM, Duncan Booth wrote:
>> A regular expression element may be followed by a quantifier.
>> Quantifiers are '*', '+', '?', '{n}', '{n,m}' (and lazy quantifiers
>> '*?', '+?', '{n,m}?'). There's nothing in the regex language which says
>> you can follow an element with two quantifiers.
> In fact, *you* did -- the first sentence of that paragraph! :-)
>
> \s is a regex, so you can follow it with a quantifier and get \s{6}. 
> That's also a regex, so you should be able to follow it with a 
> quantifier.
OK, I guess this isn't true... you said a "regular expression *element*" 
can be followed by a quantifier. I just took what I usually see as part 
of a regular expression and read into your post something it didn't 
quite say. Still, the rest of mine applies.

Evan



#30825

From: MRAB <python@mrabarnett.plus.com>
Date: 2012-10-05 17:07 +0100
Message-ID: <mailman.1860.1349453267.27098.python-list@python.org>
In reply to: #30796
On 2012-10-05 16:27, Evan Driscoll wrote:
> On 10/05/2012 04:23 AM, Duncan Booth wrote:
>> A regular expression element may be followed by a quantifier.
>> Quantifiers are '*', '+', '?', '{n}', '{n,m}' (and lazy quantifiers
>> '*?', '+?', '{n,m}?'). There's nothing in the regex language which says
>> you can follow an element with two quantifiers.
> In fact, *you* did -- the first sentence of that paragraph! :-)
>
> \s is a regex, so you can follow it with a quantifier and get \s{6}.
> That's also a regex, so you should be able to follow it with a quantifier.
>
> I can understand that you can create a grammar that excludes it. I'm
> actually really interested to know if anyone knows whether this was a
> deliberate decision and, if so, what the reason is. (And if not --
> should it be considered a (low priority) bug?)
>
> Was it because such patterns often reveal a mistake? Because "\s{6}+"
> has other meanings in different regex syntaxes and the designers didn't
> want confusion? Because it was simpler to parse that way? Because the
> "hey you recognize regular expressions by converting it to a finite
> automaton" story is a lie in most real-world regex implementations (in
> part because they're not actually regular expressions) and repeated
> quantifiers cause problems with the parsing techniques that actually get
> used?
>
You rarely want to repeat a repeated element. It can also result in
catastrophic backtracking unless you're _very_ careful.
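
A small timing sketch of that risk, assuming CPython's backtracking re engine. The function name seconds_to_fail and the chosen lengths are arbitrary. Both patterns describe the same strings, but when the subject has no "b" the nested form has to try many ways of splitting the run of "a" characters before giving up. Exact timings depend on the machine and interpreter, so none are quoted here.

```python
import re
import time

nested = re.compile(r"(?:a+)+b")  # a repeat of a repeat
flat = re.compile(r"a+b")         # matches the same strings with one repeat


def seconds_to_fail(pattern, n):
    """Time a failing match of `pattern` against n copies of 'a' (no 'b')."""
    subject = "a" * n
    start = time.perf_counter()
    result = pattern.match(subject)
    elapsed = time.perf_counter() - start
    assert result is None  # neither pattern can match without a 'b'
    return elapsed


# Lengths kept small on purpose: the nested pattern's cost can grow very
# quickly as n increases.
for n in (8, 12, 16):
    print(n, seconds_to_fail(flat, n), seconds_to_fail(nested, n))
```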

In many other regex implementations (including mine), "*+", "++" and
"?+" are possessive quantifiers, much as "*?", "+?" and "??" are lazy
quantifiers.
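
For readers on a newer interpreter: the standard re module gained possessive quantifiers in Python 3.11, well after this thread. The sketch below assumes that version and guards the possessive pattern so it still runs on older interpreters; the flag name HAS_POSSESSIVE is made up for the example.

```python
import re
import sys

# Assumption: possessive quantifiers are available in stdlib re from 3.11 on.
HAS_POSSESSIVE = sys.version_info >= (3, 11)

# Greedy: a* first takes all three characters, then gives one back so the
# final "a" can match.
greedy = re.compile(r"a*a")
print(greedy.fullmatch("aaa") is not None)  # True

if HAS_POSSESSIVE:
    # Possessive: a*+ takes all three characters and never gives any back,
    # so the final "a" has nothing left to match.
    possessive = re.compile(r"a*+a")
    print(possessive.fullmatch("aaa"))  # None
```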

You could, of course, ask why adding "?" after a quantifier doesn't
make it optional, e.g. why r"\s{6}?" doesn't mean the same as
r"(?:\s{6})?", or why r"\s{0,6}?" doesn't mean the same as
r"(?:\s{0,6})?".
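
The difference between those pairs can be seen directly. This is a minimal sketch with the standard re module; the variable names six and three are arbitrary.

```python
import re

six = " " * 6
three = " " * 3 + "x"

# "?" after a quantifier makes it lazy; it does not make the item optional.
print(repr(re.match(r"\s{0,6}?", six).group()))      # '' (lazy: as few as possible)
print(repr(re.match(r"(?:\s{0,6})?", six).group()))  # six spaces (greedy, optional)

print(re.match(r"\s{6}?", three))                    # None: still needs exactly six
print(repr(re.match(r"(?:\s{6})?", three).group()))  # '' (the whole group is optional)
```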


