Groups > comp.lang.python > #48641 > unrolled thread

Why is regex so slow?

Started by	roy@panix.com (Roy Smith)
First post	2013-06-18 12:45 -0400
Last post	2013-06-19 03:16 +0000
Articles	3 on this page of 23 — 15 participants

Back to article view | Back to comp.lang.python

  Why is regex so slow? roy@panix.com (Roy Smith) - 2013-06-18 12:45 -0400
    Re: Why is regex so slow? Skip Montanaro <skip@pobox.com> - 2013-06-18 12:01 -0500
    Re: Why is regex so slow? Roy Smith <roy@panix.com> - 2013-06-18 13:08 -0400
    Re: Why is regex so slow? Chris Angelico <rosuav@gmail.com> - 2013-06-19 03:20 +1000
      Re: Why is regex so slow? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2013-06-18 20:10 +0200
        Re: Why is regex so slow? Roy Smith <roy@panix.com> - 2013-06-18 12:40 -0700
        Re: Why is regex so slow? André Malo <ndparker@gmail.com> - 2013-06-18 21:59 +0200
          Re: Why is regex so slow? André Malo <ndparker@gmail.com> - 2013-06-18 22:13 +0200
    Re: Why is regex so slow? MRAB <python@mrabarnett.plus.com> - 2013-06-18 18:31 +0100
    Re: Why is regex so slow? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-06-18 18:34 +0100
      Re: Why is regex so slow? roy@panix.com (Roy Smith) - 2013-06-18 15:21 -0400
        Re: Why is regex so slow? MRAB <python@mrabarnett.plus.com> - 2013-06-18 20:49 +0100
    Re: Why is regex so slow? Rick Johnson <rantingrickjohnson@gmail.com> - 2013-06-18 12:21 -0700
    Re: Why is regex so slow? Antoine Pitrou <solipsis@pitrou.net> - 2013-06-18 20:05 +0000
      Re: Why is regex so slow? Roy Smith <roy@panix.com> - 2013-06-18 13:23 -0700
        Re: Why is regex so slow? Duncan Booth <duncan.booth@invalid.invalid> - 2013-06-19 13:21 +0000
          Re: Why is regex so slow? Roy Smith <roy@panix.com> - 2013-06-19 12:55 -0700
      Re: Why is regex so slow? Grant Edwards <invalid@invalid.invalid> - 2013-06-18 20:30 +0000
        Re: Why is regex so slow? Terry Reedy <tjreedy@udel.edu> - 2013-06-18 17:29 -0400
        Re: Why is regex so slow? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2013-06-19 10:29 +0200
    Re: Why is regex so slow? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-19 01:51 +0000
      Re: Why is regex so slow? Dave Angel <davea@davea.name> - 2013-06-18 22:11 -0400
        Re: Why is regex so slow? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-19 03:16 +0000

Page 2 of 2 — ← Prev page 1 [2]

#48675

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-06-19 01:51 +0000
Message-ID	<51c10e9e$0$29973$c3e8da3$5496439d@news.astraweb.com>
In reply to	#48641

On Tue, 18 Jun 2013 12:45:29 -0400, Roy Smith wrote:

> I've got a 170 MB file I want to search for lines that look like:
> 
> [2010-10-20 16:47:50.339229 -04:00] INFO (6): songza.amie.history -
> ENQUEUEING: /listen/the-station-one
> 
> This code runs in 1.3 seconds:
> 
> ------------------------------
> import re
> 
> pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
> count = 0 
> for line in open('error.log'):
>     m = pattern.search(line)
>     if m:
>         count += 1
> 
> print count
> ------------------------------
> 
> If I add a pre-filter before the regex, it runs in 0.78 seconds (about
> twice the speed!)

You say that as if it were surprising. It's not. Regex searches have 
higher overhead than bog standard dumb `substring in string` tests, so 
reducing the number of high-overhead regex searches calls with a low-
overhead `in` test is often a nett win.

Not always, it depends on how many hits/misses you have, and the relative 
overhead of each, but as a general rule, I would always try pre-filtering 
as the obvious way to optimize a regex.

Even if the regex engine is just as efficient at doing simple character 
matching as `in`, and it probably isn't, your regex tries to match all 
eleven characters of "ENQUEUEING" while the `in` test only has to match 
three, "ENQ".

[...]
> Once the regex is compiled, you should have a state machine pattern
> matcher.  It should be O(n) in the length of the input to figure out
> that it doesn't match as far as "ENQ".

Yes, but O(N) doesn't tell you how long it takes to run, only how it 
*scales*.

class MyDumbString(str):
    def __contains__(self, substr):
        time.sleep(1000)
        return super(MyDumbString, self).__contains__(substring)

MyDumbString `in` is also O(N), but it is a wee less efficient than the 
built-in version...

Regexes do a lot of work, because they are far more complex than dumb 
string __contains__. It should not surprise you that the overhead is 
greater, even when matching a plain-ol' substring with no metacharacters. 
Especially since Python does not traditionally use regexes for 
everything, like Perl does, and so has not spent the effort to complicate 
the implementation in order to squeeze every last microsecond out of it.

> And that's exactly how long it
> should take for "if 'ENQ' not in line" to run as well.  Why is doing
> twice the work also twice the speed?

It's not doing twice the work. It's doing less work, overall, by spending 
a moment to trivially filter out the lines which cannot possibly match, 
before spending a few moments to do a more careful match. If the number 
of non-matching lines is high, as it is in your data, then the cheaper 
pre-filter pays for itself and then some.

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#48676

From	Dave Angel <davea@davea.name>
Date	2013-06-18 22:11 -0400
Message-ID	<mailman.3566.1371607883.3114.python-list@python.org>
In reply to	#48675

On 06/18/2013 09:51 PM, Steven D'Aprano wrote:

    <SNIP>
>
> Even if the regex engine is just as efficient at doing simple character
> matching as `in`, and it probably isn't, your regex tries to match all
> eleven characters of "ENQUEUEING" while the `in` test only has to match
> three, "ENQ".
>

The rest of your post was valid, and useful, but there's a misconception 
in this paragraph;  I hope you don't mind me pointing it out.

In general, for simple substring searches, you can search for a large 
string faster than you can search for a smaller one.  I'd expect

if "ENQUEUING" in bigbuffer

to be faster than

if "ENQ"  in bigbuffer

assuming that all occurrences of ENQ will actually match the whole 
thing.  If CPython's implementation doesn't show the speed difference, 
maybe there's some room for optimization.

See Boyer-Moore if you want a peek at the algorithm.

When I was writiing a simple search program, I could typically search 
for a 4-character string faster than REP SCASB could match a one 
character string.  And that's a single instruction (with prefix).
-- 
DaveA

[toc] | [prev] | [next] | [standalone]

#48678

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-06-19 03:16 +0000
Message-ID	<51c12297$0$29973$c3e8da3$5496439d@news.astraweb.com>
In reply to	#48676

On Tue, 18 Jun 2013 22:11:01 -0400, Dave Angel wrote:

> On 06/18/2013 09:51 PM, Steven D'Aprano wrote:
> 
>     <SNIP>
>>
>> Even if the regex engine is just as efficient at doing simple character
>> matching as `in`, and it probably isn't, your regex tries to match all
>> eleven characters of "ENQUEUEING" while the `in` test only has to match
>> three, "ENQ".
>>
>>
> The rest of your post was valid, and useful, but there's a misconception
> in this paragraph;  I hope you don't mind me pointing it out.

Of course not, I'm always happy to learn if I'm mistaken.

 
> In general, for simple substring searches, you can search for a large
> string faster than you can search for a smaller one.  I'd expect
> 
> if "ENQUEUING" in bigbuffer
> 
> to be faster than
> 
> if "ENQ"  in bigbuffer

And so it is:


steve@runes:~$ python2.7 -m timeit -s "sub = 'ENQ'" \
> -s "s = 'blah '*10000 + 'ENQUIRING blah blah blah'" \
> "sub in s"
10000 loops, best of 3: 38.3 usec per loop
steve@runes:~$ python2.7 -m timeit -s "sub = 'ENQUIRING'" \
> -s "s = 'blah '*10000 + 'ENQUIRING blah blah blah'" \
> "sub in s"
100000 loops, best of 3: 15.4 usec per loop



Thank you.



-- 
Steven

[toc] | [prev] | [standalone]

Page 2 of 2 — ← Prev page 1 [2]

csiph-web

Why is regex so slow?

Contents

#48675

#48676

#48678