
Re: + in regular expression

Started by	Cameron Simpson <cs@zip.com.au>
First post	2012-10-06 09:37 +1000
Last post	2012-10-09 11:29 +0000
Articles	2 — 2 participants


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: + in regular expression Cameron Simpson <cs@zip.com.au> - 2012-10-06 09:37 +1000
    Re: + in regular expression Duncan Booth <duncan.booth@invalid.invalid> - 2012-10-09 11:29 +0000

#30854 — Re: + in regular expression

From	Cameron Simpson <cs@zip.com.au>
Date	2012-10-06 09:37 +1000
Subject	Re: + in regular expression
Message-ID	<mailman.1884.1349480266.27098.python-list@python.org>

On 05Oct2012 10:27, Evan Driscoll <driscoll@cs.wisc.edu> wrote:
| I can understand that you can create a grammar that excludes it. [...]
| Was it because such patterns often reveal a mistake?

For myself, I would consider that sufficient reason.

I've seen plenty of languages (C and shell, for example, though they
are not alone or egregious) where a compiler can emit a syntax complaint
many lines from the actual coding mistake (in shell, an unclosed quote
or control construct is a common example; Python has the same issue
but mitigated by the indentation requirements, which cut the occurrence
down a lot).

Forbidding a common error by requiring a wordier workaround isn't
unreasonable.
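
For the pattern under discussion, "\s{6}+", the wordier workaround is an explicit group. A minimal sketch, assuming the intent is "one or more runs of exactly six whitespace characters"; the helper name compiles is made up for illustration. Whether the stacked form is accepted depends on the Python version: older releases reject it with a "multiple repeat" error, while 3.11 and later read the trailing + as a possessive quantifier, so the sketch reports the outcome rather than assuming it.

```python
import re

# Hypothetical helper, for illustration only.
def compiles(pattern):
    """Return True if the re module accepts the pattern, else False."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# The stacked form. Older Pythons raise re.error ("multiple repeat");
# Python 3.11+ treats {6}+ as a possessive quantifier instead.
stacked_ok = compiles(r"\s{6}+")
print("stacked form accepted:", stacked_ok)

# The wordier form: the group is repeated, not the quantifier.
grouped = re.compile(r"(?:\s{6})+")
m = grouped.match(" " * 14)
matched_len = len(m.group())  # two full runs of six; the last two spaces are left over
print("grouped form matched", matched_len, "characters")
```

The non-capturing group makes the intent explicit, which is arguably the point of forcing the longer spelling.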

| Because "\s{6}+" 
| has other meanings in different regex syntaxes and the designers didn't 
| want confusion?

I think Python REs are supposed to be Perl compatible; ISTR an opening
sentence to that effect...

| Because it was simpler to parse that way? Because the 
| "hey you recognize regular expressions by converting it to a finite 
| automaton" story is a lie in most real-world regex implementations (in 
| part because they're not actually regular expressions) and repeated 
| quantifiers cause problems with the parsing techniques that actually get 
| used?

There are certainly constructs that can cause an exponential amount
of backtracking if misused. One could make a case for discouragement
(though not a case for forbidding them).
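
A rough sketch of the kind of construct meant here, using the textbook nested-quantifier pattern (a+)+$ rather than anything from this thread. The function name time_failed_match and the input sizes are arbitrary choices for illustration; timings will vary by machine and Python version, so none are shown.

```python
import re
import time

# Nested quantifiers: the inner and outer + can divide a run of "a"s
# between them in many different ways, and a backtracking matcher
# tries them all before giving up.
pattern = re.compile(r"(a+)+$")

def time_failed_match(n):
    """Time a match attempt that must fail: n 'a's followed by a 'b'."""
    text = "a" * n + "b"
    start = time.perf_counter()
    result = pattern.match(text)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Inputs kept small on purpose; the time grows quickly with n.
for n in (10, 14, 18):
    result, elapsed = time_failed_match(n)
    print(n, result, f"{elapsed:.4f}s")
```

The pattern compiles without complaint, which fits the point above: discouraged perhaps, but not forbidden.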

Just my 2c,
-- 
Cameron Simpson <cs@zip.com.au>

The most annoying thing about being without my files after our disc crash was
discovering once again how widespread BLINK was on the web.


#31004

From	Duncan Booth <duncan.booth@invalid.invalid>
Date	2012-10-09 11:29 +0000
Message-ID	<XnsA0E698D1EA28duncanbooth@127.0.0.1>
In reply to	#30854

Cameron Simpson <cs@zip.com.au> wrote:

>| Because "\s{6}+" 
>| has other meanings in different regex syntaxes and the designers didn't 
>| want confusion?
> 
> I think Python REs are supposed to be Perl compatible; ISTR an opening
> sentence to that effect...
> 
I don't know the full history of how regex engines evolved, but I suspect 
at least part of the answer is that the decisions the Perl developers made 
influenced the other implementations.

Perl's quantifiers allow both '?' and '+' as modifiers on the standard 
quantifiers so clearly you cannot stack those particular quantifiers in 
Perl, therefore quantifiers in general are unstackable.
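
Python's re module follows the same convention for '?': after a quantifier it is a modifier (non-greedy), not a second quantifier. A small sketch, with a made-up input string:

```python
import re

text = "<a><b>"

# + on its own is greedy: it takes as much as it can.
greedy = re.match(r"<.+>", text).group()

# +? is one lazy quantifier; the ? modifies the +, it does not stack on it.
lazy = re.match(r"<.+?>", text).group()

print(greedy)  # <a><b>
print(lazy)    # <a>
```

The possessive '+' modifier is left out here because, as far as I know, the re module only accepts it from Python 3.11 onward.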

The only grammars I can find online for regular expressions split out the 
elements and quantifiers the way I did in my previous post. Python's regex 
parser (and I would guess also most of the others in existence) tends more 
to spaghetti code than to following a grammar (_parse is a 238 line 
function). So I think it really is just trying to match existing regular 
expression parsers and any possible grammar is an excuse for why it should 
be the way it is rather than an explanation.

-- 
Duncan Booth http://kupuguy.blogspot.com
