Path: csiph.com!usenet.pasdenom.info!gegeweb.org!de-l.enfer-du-nord.net!feeder1.enfer-du-nord.net!newsfeed.eweka.nl!eweka.nl!feeder3.eweka.nl!newsfeed.xs4all.nl!newsfeed5.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
Date: Fri, 05 Oct 2012 17:07:47 +0100
From: MRAB <python@mrabarnett.plus.com>
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:15.0) Gecko/20120907 Thunderbird/15.0.1
MIME-Version: 1.0
To: python-list@python.org
Subject: Re: + in regular expression
References: <CALwzidnH2T5vsYT=nMvBmO4V6fmK+aMfHpxQDWrwArJ6aKtVew@mail.gmail.com> <mailman.1838.1349414969.27098.python-list@python.org> <XnsA0E3689B3693duncanbooth@127.0.0.1> <506EFC44.40508@cs.wisc.edu>
In-Reply-To: <506EFC44.40508@cs.wisc.edu>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Precedence: list
Reply-To: python-list@python.org
Newsgroups: comp.lang.python
Message-ID: <mailman.1860.1349453267.27098.python-list@python.org>
Lines: 37
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:30825

On 2012-10-05 16:27, Evan Driscoll wrote:
> On 10/05/2012 04:23 AM, Duncan Booth wrote:
>> A regular expression element may be followed by a quantifier.
>> Quantifiers are '*', '+', '?', '{n}', '{n,m}' (and lazy quantifiers
>> '*?', '+?', '{n,m}?'). There's nothing in the regex language which says
>> you can follow an element with two quantifiers.
> In fact, *you* did -- the first sentence of that paragraph! :-)
>
> \s is a regex, so you can follow it with a quantifier and get \s{6}.
> That's also a regex, so you should be able to follow it with a quantifier.
>
> I can understand that you can create a grammar that excludes it. I'm
> actually really interested to know if anyone knows whether this was a
> deliberate decision and, if so, what the reason is. (And if not --
> should it be considered a (low priority) bug?)
>
> Was it because such patterns often reveal a mistake? Because "\s{6}+"
> has other meanings in different regex syntaxes and the designers didn't
> want confusion? Because it was simpler to parse that way? Because the
> "hey you recognize regular expressions by converting it to a finite
> automaton" story is a lie in most real-world regex implementations (in
> part because they're not actually regular expressions) and repeated
> quantifiers cause problems with the parsing techniques that actually get
> used?
>
You rarely want to repeat a repeated element. It can also result in
catastrophic backtracking unless you're _very_ careful.
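
For the cases where you really do want it, a minimal sketch using the
standard re module: wrap the quantified element in a non-capturing group
and quantify the group. The helper name leading_indent_width is made up
for this illustration.

```python
import re

# The outer "+" applies to the whole group, so this matches one or more
# runs of exactly six whitespace characters.
INDENT = re.compile(r"(?:\s{6})+")

def leading_indent_width(line):
    """Return the length of the leading run matched by INDENT, or 0."""
    m = INDENT.match(line)
    return len(m.group()) if m else 0

print(leading_indent_width(" " * 12 + "x"))
print(leading_indent_width(" " * 13 + "x"))
print(leading_indent_width(" " * 5 + "x"))
```

The match always covers a whole number of six-character runs; fewer
than six leading whitespace characters gives no match at all.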

In many other regex implementations (including mine), "*+", "++" and
"?+" are possessive quantifiers, much as "*?", "+?" and "??" are lazy
quantifiers.
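
A rough sketch of the difference between greedy, lazy and possessive
matching. The greedy and lazy parts use only the standard re module. The
possessive part is guarded by a version check, on the assumption that
the stdlib re module accepts possessive quantifiers from Python 3.11
onwards; on older versions it is simply skipped.

```python
import re
import sys

text = "aaa"

greedy = re.match(r"a+", text).group()    # takes as much as it can
lazy = re.match(r"a+?", text).group()     # takes as little as it can

# Greedy "a+" gives one character back so that the final "a" can match.
backtracking = re.match(r"a+a", text).group()

# Assumption: possessive quantifiers are supported by re in Python 3.11+.
if sys.version_info >= (3, 11):
    # Possessive "a++" keeps all three characters and never gives any
    # back, so the final "a" has nothing left to match.
    possessive = re.match(r"a++a", text)
else:
    possessive = None

print(greedy, lazy, backtracking, possessive)
```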

You could, of course, ask why adding "?" after a quantifier doesn't
make it optional, e.g. why r"\s{6}?" doesn't mean the same as
r"(?:\s{6})?", or why r"\s{0,6}?" doesn't mean the same as
r"(?:\s{0,6})?".
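
To make the difference concrete, here is a small sketch with the
standard re module. The sample strings are made up for illustration.

```python
import re

three = "   x"        # three spaces, then a non-space
six = " " * 6         # six spaces

# "{6}?" is a lazy "exactly six": it still needs six whitespace
# characters, so it fails on only three.
lazy_six = re.match(r"\s{6}?", three)

# The grouped form is a genuinely optional run of six: it succeeds and
# matches the empty string.
optional_six = re.match(r"(?:\s{6})?", three).group()

# "{0,6}?" is lazy, so with nothing after it, it settles for the minimum.
lazy_range = re.match(r"\s{0,6}?", six).group()

# The grouped form is greedy inside and takes all six.
optional_range = re.match(r"(?:\s{0,6})?", six).group()

print(lazy_six, repr(optional_six), repr(lazy_range), repr(optional_range))
```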