Groups > comp.lang.python > #18386 > unrolled thread

Repeating assertions in regular expression

Started by	candide <candide@free.invalid>
First post	2012-01-03 09:46 +0100
Last post	2012-01-03 14:36 -0500
Articles	4 — 3 participants

Back to article view | Back to comp.lang.python

  Repeating assertions in regular expression candide <candide@free.invalid> - 2012-01-03 09:46 +0100
    Re: Repeating assertions in regular expression Devin Jeanpierre <jeanpierreda@gmail.com> - 2012-01-03 04:45 -0500
    Re: Repeating assertions in regular expression MRAB <python@mrabarnett.plus.com> - 2012-01-03 18:57 +0000
    Re: Repeating assertions in regular expression Devin Jeanpierre <jeanpierreda@gmail.com> - 2012-01-03 14:36 -0500

#18386 — Repeating assertions in regular expression

From	candide <candide@free.invalid>
Date	2012-01-03 09:46 +0100
Subject	Repeating assertions in regular expression
Message-ID	<4f02c069$0$690$426a74cc@news.free.fr>

The regular expression HOWTO 
(http://docs.python.org/howto/regex.html#more-metacharacters) explains 
the following

# ------------------------------
zero-width assertions should never be repeated, because if they match 
once at a given location, they can obviously be matched an infinite 
number of times.
# ------------------------------


Why the wording is "should never" ? Repeating a zero-width assertion is 
not forbidden, for instance :

 >>> import re
 >>> re.compile("\\b\\b\w+\\b\\b")
<_sre.SRE_Pattern object at 0xb7831140>
 >>>

Nevertheless, the following doesn't execute :

 >>> re.compile("\\b{2}\w+\\b\\b")
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/usr/lib/python2.7/re.py", line 190, in compile
     return _compile(pattern, flags)
   File "/usr/lib/python2.7/re.py", line 245, in _compile
     raise error, v # invalid expression
sre_constants.error: nothing to repeat
 >>>


\\b\\b and \\b{2} aren't equivalent ?


Surprisingly, the engine doesn't optimize repeated boundary assertions, 
for instance

# ------------------------------
import re
import time

a=time.clock()
len("\\b\\b\\b"*100000+"\w+")
b=time.clock()
print "CPU time : %.2f s" %(b - a)

a=time.clock()
re.compile("\\b\\b\\b"*100000+"\w+")
b=time.clock()
print "CPU time : %.2f s" %(b - a)
# ------------------------------

outputs:

# ------------------------------
CPU time : 0.00 s
CPU time : 1.33 s
# ------------------------------


Your comments are welcome!

[toc] | [next] | [standalone]

#18390

From	Devin Jeanpierre <jeanpierreda@gmail.com>
Date	2012-01-03 04:45 -0500
Message-ID	<mailman.4343.1325583985.27778.python-list@python.org>
In reply to	#18386

> \\b\\b and \\b{2} aren't equivalent ?

This sounds suspiciously like a bug!

> Why the wording is "should never" ? Repeating a zero-width assertion is not
> forbidden, for instance :
>
>>>> import re
>>>> re.compile("\\b\\b\w+\\b\\b")
> <_sre.SRE_Pattern object at 0xb7831140>
>>>>

I believe this is meant to refer to arbitrary-length repetitions, such
as r'\b*', not simple concatenations like that. r'\b*' will abort the
whole match if is run on a boundary, because Python detects a
repetition of a zero-width match and decides this is an error.

A little OT:

I'm not really sure why Python does it this way. It's possible to
match r'\b*' fairly painlessly: instead of aborting the whole match on
a repeat of nothing, it can backtrack. So what should happen is that
the RE engine looks at \b* and does the following:

Is there one \b?
Yes.
Is there another \b?
Backtrack: You haven't read any more characters since the last match
in the * operation
# The result now is that we've matched '\b*' with exactly one '\b' and
can move on to the rest of the input

This is possible (I haven't tried implementing it, but it's e.g.
mentioned in Russ Cox's regular expression papers), you need to
augment the backtracking search with a notion of the regular
expression having "progress" so that you can require progress at
certain points. Otherwise you obviously get problems with nested stars
and things like that. (As it happens, Python's regex engine can't
handle nested stars either).

I'd be appreciative if anyone who works on re or regex can discuss
this a little.

-- Devin

On Tue, Jan 3, 2012 at 3:46 AM, candide <candide@free.invalid> wrote:
> The regular expression HOWTO
> (http://docs.python.org/howto/regex.html#more-metacharacters) explains the
> following
>
> # ------------------------------
> zero-width assertions should never be repeated, because if they match once
> at a given location, they can obviously be matched an infinite number of
> times.
> # ------------------------------
>
>
> Why the wording is "should never" ? Repeating a zero-width assertion is not
> forbidden, for instance :
>
>>>> import re
>>>> re.compile("\\b\\b\w+\\b\\b")
> <_sre.SRE_Pattern object at 0xb7831140>
>>>>
>
> Nevertheless, the following doesn't execute :
>
>>>> re.compile("\\b{2}\w+\\b\\b")
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
>  File "/usr/lib/python2.7/re.py", line 190, in compile
>    return _compile(pattern, flags)
>  File "/usr/lib/python2.7/re.py", line 245, in _compile
>    raise error, v # invalid expression
> sre_constants.error: nothing to repeat
>>>>
>
>
> \\b\\b and \\b{2} aren't equivalent ?
>
>
> Surprisingly, the engine doesn't optimize repeated boundary assertions, for
> instance
>
> # ------------------------------
> import re
> import time
>
> a=time.clock()
> len("\\b\\b\\b"*100000+"\w+")
> b=time.clock()
> print "CPU time : %.2f s" %(b - a)
>
> a=time.clock()
> re.compile("\\b\\b\\b"*100000+"\w+")
> b=time.clock()
> print "CPU time : %.2f s" %(b - a)
> # ------------------------------
>
> outputs:
>
> # ------------------------------
> CPU time : 0.00 s
> CPU time : 1.33 s
> # ------------------------------
>
>
> Your comments are welcome!
> --
> http://mail.python.org/mailman/listinfo/python-list

[toc] | [prev] | [next] | [standalone]

#18432

From	MRAB <python@mrabarnett.plus.com>
Date	2012-01-03 18:57 +0000
Message-ID	<mailman.4368.1325617059.27778.python-list@python.org>
In reply to	#18386

On 03/01/2012 09:45, Devin Jeanpierre wrote:
>>  \\b\\b and \\b{2} aren't equivalent ?
>
> This sounds suspiciously like a bug!
>
>>  Why the wording is "should never" ? Repeating a zero-width assertion is not
>>  forbidden, for instance :
>>
>>>>>  import re
>>>>>  re.compile("\\b\\b\w+\\b\\b")
>>  <_sre.SRE_Pattern object at 0xb7831140>
>>>>>
>
> I believe this is meant to refer to arbitrary-length repetitions, such
> as r'\b*', not simple concatenations like that. r'\b*' will abort the
> whole match if is run on a boundary, because Python detects a
> repetition of a zero-width match and decides this is an error.
>
r"\b+" can be optimised to r"\b", but r"\b*" can be optimised to r"". 
r"\b\b", r"\b\b\b", etc, can be optimised to r"\b".

So why doesn't it optimised?

Because every potential optimisation has a cost, which is the time it
would take to look for it.

That cost needs to be balanced against the potential benefit.

How often do you see repeated r"\b"?

Put simply, it doesn't occur often enough to be worth it. The cost
outweighs the potential benefit.

[toc] | [prev] | [next] | [standalone]

#18437

From	Devin Jeanpierre <jeanpierreda@gmail.com>
Date	2012-01-03 14:36 -0500
Message-ID	<mailman.4373.1325619405.27778.python-list@python.org>
In reply to	#18386

> Put simply, it doesn't occur often enough to be worth it. The cost
> outweighs the potential benefit.

I don't buy it. You could backtrack instead of failing for \b+ and
\b*, and it would be almost as fast as this optimization.

-- Devin

On Tue, Jan 3, 2012 at 1:57 PM, MRAB <python@mrabarnett.plus.com> wrote:
> On 03/01/2012 09:45, Devin Jeanpierre wrote:
>>>
>>>  \\b\\b and \\b{2} aren't equivalent ?
>>
>>
>> This sounds suspiciously like a bug!
>>
>>>  Why the wording is "should never" ? Repeating a zero-width assertion is
>>> not
>>>  forbidden, for instance :
>>>
>>>>>>  import re
>>>>>>  re.compile("\\b\\b\w+\\b\\b")
>>>
>>>  <_sre.SRE_Pattern object at 0xb7831140>
>>>>>>
>>>>>>
>>
>> I believe this is meant to refer to arbitrary-length repetitions, such
>> as r'\b*', not simple concatenations like that. r'\b*' will abort the
>> whole match if is run on a boundary, because Python detects a
>> repetition of a zero-width match and decides this is an error.
>>
> r"\b+" can be optimised to r"\b", but r"\b*" can be optimised to r"".
> r"\b\b", r"\b\b\b", etc, can be optimised to r"\b".
>
> So why doesn't it optimised?
>
> Because every potential optimisation has a cost, which is the time it
> would take to look for it.
>
> That cost needs to be balanced against the potential benefit.
>
> How often do you see repeated r"\b"?
>
> Put simply, it doesn't occur often enough to be worth it. The cost
> outweighs the potential benefit.
> --
> http://mail.python.org/mailman/listinfo/python-list

[toc] | [prev] | [standalone]

csiph-web

Repeating assertions in regular expression

Contents

#18386 — Repeating assertions in regular expression

#18390

#18432

#18437