Groups > comp.lang.python > #48641 > unrolled thread

Why is regex so slow?

Started by	roy@panix.com (Roy Smith)
First post	2013-06-18 12:45 -0400
Last post	2013-06-19 03:16 +0000
Articles	20 on this page of 23 — 15 participants

Back to article view | Back to comp.lang.python

  Why is regex so slow? roy@panix.com (Roy Smith) - 2013-06-18 12:45 -0400
    Re: Why is regex so slow? Skip Montanaro <skip@pobox.com> - 2013-06-18 12:01 -0500
    Re: Why is regex so slow? Roy Smith <roy@panix.com> - 2013-06-18 13:08 -0400
    Re: Why is regex so slow? Chris Angelico <rosuav@gmail.com> - 2013-06-19 03:20 +1000
      Re: Why is regex so slow? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2013-06-18 20:10 +0200
        Re: Why is regex so slow? Roy Smith <roy@panix.com> - 2013-06-18 12:40 -0700
        Re: Why is regex so slow? André Malo <ndparker@gmail.com> - 2013-06-18 21:59 +0200
          Re: Why is regex so slow? André Malo <ndparker@gmail.com> - 2013-06-18 22:13 +0200
    Re: Why is regex so slow? MRAB <python@mrabarnett.plus.com> - 2013-06-18 18:31 +0100
    Re: Why is regex so slow? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-06-18 18:34 +0100
      Re: Why is regex so slow? roy@panix.com (Roy Smith) - 2013-06-18 15:21 -0400
        Re: Why is regex so slow? MRAB <python@mrabarnett.plus.com> - 2013-06-18 20:49 +0100
    Re: Why is regex so slow? Rick Johnson <rantingrickjohnson@gmail.com> - 2013-06-18 12:21 -0700
    Re: Why is regex so slow? Antoine Pitrou <solipsis@pitrou.net> - 2013-06-18 20:05 +0000
      Re: Why is regex so slow? Roy Smith <roy@panix.com> - 2013-06-18 13:23 -0700
        Re: Why is regex so slow? Duncan Booth <duncan.booth@invalid.invalid> - 2013-06-19 13:21 +0000
          Re: Why is regex so slow? Roy Smith <roy@panix.com> - 2013-06-19 12:55 -0700
      Re: Why is regex so slow? Grant Edwards <invalid@invalid.invalid> - 2013-06-18 20:30 +0000
        Re: Why is regex so slow? Terry Reedy <tjreedy@udel.edu> - 2013-06-18 17:29 -0400
        Re: Why is regex so slow? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2013-06-19 10:29 +0200
    Re: Why is regex so slow? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-19 01:51 +0000
      Re: Why is regex so slow? Dave Angel <davea@davea.name> - 2013-06-18 22:11 -0400
        Re: Why is regex so slow? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-19 03:16 +0000

Page 1 of 2 [1] 2 Next page →

#48641 — Why is regex so slow?

From	roy@panix.com (Roy Smith)
Date	2013-06-18 12:45 -0400
Subject	Why is regex so slow?
Message-ID	<kpq2r9$gg6$1@panix2.panix.com>

I've got a 170 MB file I want to search for lines that look like:

[2010-10-20 16:47:50.339229 -04:00] INFO (6): songza.amie.history - ENQUEUEING: /listen/the-station-one

This code runs in 1.3 seconds:

------------------------------
import re

pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0

for line in open('error.log'):
    m = pattern.search(line)
    if m:
        count += 1

print count
------------------------------

If I add a pre-filter before the regex, it runs in 0.78 seconds (about
twice the speed!)

------------------------------
import re

pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0

for line in open('error.log'):
    if 'ENQ' not in line:
        continue
    m = pattern.search(line)
    if m:
        count += 1

print count
------------------------------

Every line which contains 'ENQ' also matches the full regex (61425
lines match, out of 2.1 million total).  I don't understand why the
first way is so much slower.

Once the regex is compiled, you should have a state machine pattern
matcher.  It should be O(n) in the length of the input to figure out
that it doesn't match as far as "ENQ".  And that's exactly how long it
should take for "if 'ENQ' not in line" to run as well.  Why is doing
twice the work also twice the speed?

I'm running Python 2.7.3 on Ubuntu Precise, x86_64.

[toc] | [next] | [standalone]

#48643

From	Skip Montanaro <skip@pobox.com>
Date	2013-06-18 12:01 -0500
Message-ID	<mailman.3543.1371574888.3114.python-list@python.org>
In reply to	#48641

> I don't understand why the first way is so much slower.

I have no obvious answers, but a couple suggestions:

1. Can you anchor the pattern at the beginning of the line?  (use
match() instead of search())
2. Does it get faster it you eliminate the "(.*)" part of the pattern?
 It seems that if you find a line matching the first part of the
pattern, you could just as easily split the line yourself instead of
creating a group.

Skip

[toc] | [prev] | [next] | [standalone]

#48644

From	Roy Smith <roy@panix.com>
Date	2013-06-18 13:08 -0400
Message-ID	<mailman.3544.1371575342.3114.python-list@python.org>
In reply to	#48641

On Jun 18, 2013, at 1:01 PM, Skip Montanaro wrote:

>> I don't understand why the first way is so much slower.
> 
> I have no obvious answers, but a couple suggestions:
> 
> 1. Can you anchor the pattern at the beginning of the line?  (use
> match() instead of search())

That's one of the things we tried.  Didn't make any difference.

> 2. Does it get faster it you eliminate the "(.*)" part of the pattern?

Just tried that, it also didn't make any difference.

> It seems that if you find a line matching the first part of the
> pattern, you could just as easily split the line yourself instead of
> creating a group.

At this point, I'm not so much interested in making this faster as understanding why it's so slow.  I'm tempted to open this up as a performance bug against the regex module (which I assume will be rejected, at least for the 2.x series).

---
Roy Smith
roy@panix.com

[toc] | [prev] | [next] | [standalone]

#48645

From	Chris Angelico <rosuav@gmail.com>
Date	2013-06-19 03:20 +1000
Message-ID	<mailman.3545.1371576060.3114.python-list@python.org>
In reply to	#48641

On Wed, Jun 19, 2013 at 3:08 AM, Roy Smith <roy@panix.com> wrote:
>  I'm tempted to open this up as a performance bug against the regex module (which I assume will be rejected, at least for the 2.x series).

Yeah, I'd try that against 3.3 before opening a performance bug. Also,
it's entirely possible that performance is majorly different in 3.x
anyway, on account of strings being Unicode. Definitely merits another
look imho.

ChrisA

[toc] | [prev] | [next] | [standalone]

#48653

From	Johannes Bauer <dfnsonfsduifb@gmx.de>
Date	2013-06-18 20:10 +0200
Message-ID	<kpq7q8$n9o$1@news.albasani.net>
In reply to	#48645

On 18.06.2013 19:20, Chris Angelico wrote:

> Yeah, I'd try that against 3.3 before opening a performance bug. Also,
> it's entirely possible that performance is majorly different in 3.x
> anyway, on account of strings being Unicode. Definitely merits another
> look imho.

Hmmm, at least Python 3.2 seems to have the same issue. I generated test
data with:

#!/usr/bin/python3
import random
random.seed(0)
f = open("error.log", "w")
for i in range(1500000):
	q = random.randint(0, 99)
	if q == 0:
		print("ENQUEUEING: /listen/ fhsduifhsd uifhuisd hfuisd hfuihds
iufhsd", file = f)
	else:
		print("fiosdjfoi sdmfio sdmfio msdiof msdoif msdoimf oisd mfoisdm f",
file = f)

Resulting file has a size of 91530018 and md5 of
2d20c3447a0b51a37d28126b8348f6c5 (just to make sure we're on the same
page because I'm not sure the PRNG is stable across Python versions).

Testing with:

#!/usr/bin/python3
import re
pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0
for line in open('error.log'):
#	if 'ENQ' not in line:
#		continue
	m = pattern.search(line)
	if m:
		count += 1
print(count)

The pre-check version is about 42% faster in my case (0.75 sec vs. 1.3
sec). Curious. This is Python 3.2.3 on Linux x86_64.

Regards,
Johannes

-- 
>> Wo hattest Du das Beben nochmal GENAU vorhergesagt?
> Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
 - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>

[toc] | [prev] | [next] | [standalone]

#48660

From	Roy Smith <roy@panix.com>
Date	2013-06-18 12:40 -0700
Message-ID	<32f93486-61f9-46c6-9efc-ca7507f3054e@googlegroups.com>
In reply to	#48653

On Tuesday, June 18, 2013 2:10:16 PM UTC-4, Johannes Bauer wrote:

> Resulting file has a size of 91530018 and md5 of
> > 2d20c3447a0b51a37d28126b8348f6c5 (just to make sure we're on the same
> > page because I'm not sure the PRNG is stable across Python versions).

If people want to test against my original data, grab

https://s3.amazonaws.com/songza/recruiting/logs.tar.gz

and do "cat app[12].error.log > error.log".

[toc] | [prev] | [next] | [standalone]

#48664

From	André Malo <ndparker@gmail.com>
Date	2013-06-18 21:59 +0200
Message-ID	<2281418.s73s1ze9gg@news.perlig.de>
In reply to	#48653

* Johannes Bauer wrote:

> The pre-check version is about 42% faster in my case (0.75 sec vs. 1.3
> sec). Curious. This is Python 3.2.3 on Linux x86_64.

A lot of time is spent with dict lookups (timings at my box, Python 3.2.3)
in your inner loop (1500000 times...)

#!/usr/bin/python3
import re
pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0
for line in open('error.log'):
    #if 'ENQ' not in line:
    #    continue
    m = pattern.search(line)
    if m:
        count += 1
print(count)

runs ~ 1.39 s

replacing some dict lookups with index lookups:

#!/usr/bin/python3
def main():
    import re
    pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
    count = 0
    for line in open('error.log'):
        #if 'ENQ' not in line:
        #    continue
        m = pattern.search(line)
        if m:
            count += 1
    print(count)
main()

runs ~ 1.15s

and further:

#!/usr/bin/python3
def main():
    import re
    pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
    search = pattern.search
    count = 0
    for line in open('error.log'):
        #if 'ENQ' not in line:
        #    continue
        m = search(line)
        if m:
            count += 1
    print(count)
main()

runs ~ 1.08 s

and for reference:

#!/usr/bin/python3
def main():
    import re
    pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
    search = pattern.search
    count = 0
    for line in open('error.log'):
        if 'ENQ' not in line:
            continue
        m = search(line)
        if m:
            count += 1
    print(count)
main()

runs ~ 0.71 s

The final difference is probably just the difference between a hardcoded
string search and a generic NFA.

nd

[toc] | [prev] | [next] | [standalone]

#48666

From	André Malo <ndparker@gmail.com>
Date	2013-06-18 22:13 +0200
Message-ID	<1532119.OxbuKqgHX2@news.perlig.de>
In reply to	#48664

* André Malo wrote:

> * Johannes Bauer wrote:
> 
>> The pre-check version is about 42% faster in my case (0.75 sec vs. 1.3
>> sec). Curious. This is Python 3.2.3 on Linux x86_64.
> 
> A lot of time is spent with dict lookups (timings at my box, Python 3.2.3)
> in your inner loop (1500000 times...)

[...]

Missed one basic timing BTW:

#!/usr/bin/python3
def main():
    for line in open('error.log'):
        pass
main()

runs ~ 0.53 s

nd
-- 
Already I've seen people (really!) write web URLs in the form:
http:\\some.site.somewhere
[...] How soon until greengrocers start writing "apples $1\pound"
or something?                           -- Joona I Palaste in clc

[toc] | [prev] | [next] | [standalone]

#48647

From	MRAB <python@mrabarnett.plus.com>
Date	2013-06-18 18:31 +0100
Message-ID	<mailman.3548.1371576696.3114.python-list@python.org>
In reply to	#48641

On 18/06/2013 17:45, Roy Smith wrote:
> I've got a 170 MB file I want to search for lines that look like:
>
> [2010-10-20 16:47:50.339229 -04:00] INFO (6): songza.amie.history - ENQUEUEING: /listen/the-station-one
>
> This code runs in 1.3 seconds:
>
> ------------------------------
> import re
>
> pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
> count = 0
>
> for line in open('error.log'):
>      m = pattern.search(line)
>      if m:
>          count += 1
>
> print count
> ------------------------------
>
> If I add a pre-filter before the regex, it runs in 0.78 seconds (about
> twice the speed!)
>
> ------------------------------
> import re
>
> pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
> count = 0
>
> for line in open('error.log'):
>      if 'ENQ' not in line:
>          continue
>      m = pattern.search(line)
>      if m:
>          count += 1
>
> print count
> ------------------------------
>
> Every line which contains 'ENQ' also matches the full regex (61425
> lines match, out of 2.1 million total).  I don't understand why the
> first way is so much slower.
>
> Once the regex is compiled, you should have a state machine pattern
> matcher.  It should be O(n) in the length of the input to figure out
> that it doesn't match as far as "ENQ".  And that's exactly how long it
> should take for "if 'ENQ' not in line" to run as well.  Why is doing
> twice the work also twice the speed?
>
> I'm running Python 2.7.3 on Ubuntu Precise, x86_64.
>
I'd be interested in how the 'regex' module 
(http://pypi.python.org/pypi/regex) compares. :-)

[toc] | [prev] | [next] | [standalone]

#48648

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2013-06-18 18:34 +0100
Message-ID	<mailman.3549.1371576854.3114.python-list@python.org>
In reply to	#48641

On 18/06/2013 18:08, Roy Smith wrote:
>
> On Jun 18, 2013, at 1:01 PM, Skip Montanaro wrote:
>
>>> I don't understand why the first way is so much slower.
>>
>> I have no obvious answers, but a couple suggestions:
>>
>> 1. Can you anchor the pattern at the beginning of the line?  (use
>> match() instead of search())
>
> That's one of the things we tried.  Didn't make any difference.
>
>> 2. Does it get faster it you eliminate the "(.*)" part of the pattern?
>
> Just tried that, it also didn't make any difference.
>
>> It seems that if you find a line matching the first part of the
>> pattern, you could just as easily split the line yourself instead of
>> creating a group.
>
>
> At this point, I'm not so much interested in making this faster as understanding why it's so slow.  I'm tempted to open this up as a performance bug against the regex module (which I assume will be rejected, at least for the 2.x series).
>
> ---
> Roy Smith
> roy@panix.com
>

Out of curiousity have the tried the new regex module from pypi rather 
than the stdlib version?  A heck of a lot of work has gone into it see 
http://bugs.python.org/issue2636

-- 
"Steve is going for the pink ball - and for those of you who are 
watching in black and white, the pink is next to the green." Snooker 
commentator 'Whispering' Ted Lowe.

Mark Lawrence

[toc] | [prev] | [next] | [standalone]

#48659

From	roy@panix.com (Roy Smith)
Date	2013-06-18 15:21 -0400
Message-ID	<kpqc0a$8am$1@panix2.panix.com>
In reply to	#48648

In article <mailman.3549.1371576854.3114.python-list@python.org>,
Mark Lawrence  <breamoreboy@yahoo.co.uk> wrote:

> Out of curiousity have the tried the new regex module from pypi rather 
> than the stdlib version?  A heck of a lot of work has gone into it see 
> http://bugs.python.org/issue2636

I just installed that and gave it a shot.  It's *slower* (and, much
higher variation from run to run).  I'm too exhausted fighting with
OpenOffice to get this into some sane spreadsheet format, so here's
the raw timings:

Built-in re module:
0:01.32
0:01.33
0:01.32
0:01.33
0:01.35
0:01.32
0:01.35
0:01.36
0:01.33
0:01.32

regex with flags=V0:
0:01.66
0:01.53
0:01.51
0:01.47
0:01.81
0:01.58
0:01.78
0:01.57
0:01.64
0:01.60

regex with flags=V1:
0:01.53
0:01.57
0:01.65
0:01.61
0:01.83
0:01.82
0:01.59
0:01.60
0:01.55
0:01.82

[toc] | [prev] | [next] | [standalone]

#48663

From	MRAB <python@mrabarnett.plus.com>
Date	2013-06-18 20:49 +0100
Message-ID	<mailman.3556.1371584976.3114.python-list@python.org>
In reply to	#48659

On 18/06/2013 20:21, Roy Smith wrote:
> In article <mailman.3549.1371576854.3114.python-list@python.org>,
> Mark Lawrence  <breamoreboy@yahoo.co.uk> wrote:
>
>> Out of curiousity have the tried the new regex module from pypi rather
>> than the stdlib version?  A heck of a lot of work has gone into it see
>> http://bugs.python.org/issue2636
>
> I just installed that and gave it a shot.  It's *slower* (and, much
> higher variation from run to run).  I'm too exhausted fighting with
> OpenOffice to get this into some sane spreadsheet format, so here's
> the raw timings:
>
> Built-in re module:
> 0:01.32
> 0:01.33
> 0:01.32
> 0:01.33
> 0:01.35
> 0:01.32
> 0:01.35
> 0:01.36
> 0:01.33
> 0:01.32
>
> regex with flags=V0:
> 0:01.66
> 0:01.53
> 0:01.51
> 0:01.47
> 0:01.81
> 0:01.58
> 0:01.78
> 0:01.57
> 0:01.64
> 0:01.60
>
> regex with flags=V1:
> 0:01.53
> 0:01.57
> 0:01.65
> 0:01.61
> 0:01.83
> 0:01.82
> 0:01.59
> 0:01.60
> 0:01.55
> 0:01.82
>
I reckon that about 1/3 of that time is spent in 
PyArg_ParseTupleAndKeywords, just getting the arguments!

There's a higher initial overhead in using regex than string methods,
so working just a line at time will take longer.

[toc] | [prev] | [next] | [standalone]

#48658

From	Rick Johnson <rantingrickjohnson@gmail.com>
Date	2013-06-18 12:21 -0700
Message-ID	<4617e637-43d4-4460-9f6a-df96b6eba562@googlegroups.com>
In reply to	#48641

On Tuesday, June 18, 2013 11:45:29 AM UTC-5, Roy Smith wrote:
> I've got a 170 MB file I want to search for lines that look like:
> [2010-10-20 16:47:50.339229 -04:00] INFO (6): songza.amie.history - ENQUEUEING: /listen/the-station-one
> This code runs in 1.3 seconds:
> ------------------------------
> import re
> pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
> count = 0
> for line in open('error.log'):
>     m = pattern.search(line)
>     if m:
>         count += 1
> print count

Is the power of regexps required to solve such a simplistic problem? I believe string methods should suffice.

py> line = "[2010-10-20 16:47:50.339229 -04:00] INFO (6): songza.amie.history - ENQUEUEING: /listen/the-station-one"
py> idx = line.find('ENQ')
py> if idx > 0:
	match = line[idx:]
py> match
'ENQUEUEING: /listen/the-station-one'

[toc] | [prev] | [next] | [standalone]

#48665

From	Antoine Pitrou <solipsis@pitrou.net>
Date	2013-06-18 20:05 +0000
Message-ID	<mailman.3557.1371585950.3114.python-list@python.org>
In reply to	#48641

Roy Smith <roy <at> panix.com> writes:
> 
> Every line which contains 'ENQ' also matches the full regex (61425
> lines match, out of 2.1 million total).  I don't understand why the
> first way is so much slower.

One invokes a fast special-purpose substring searching routine (the
str.__contains__ operator), the other a generic matching engine able to
process complex patterns. It's hardly a surprise for the specialized routine
to be faster. That's like saying "I don't understand why my CPU is slower
than my GPU at calculating 3D structures".

That said, there may be ways to improve the regex engine to deal with such
special cases in a speedier way. But there will still be some overhead related
to the fact that you are invoking a powerful generic engine rather than a
lean and mean specialized routine.

(to be fair, on CPython there's also the fact that operators are faster
than method calls, so some overhead is added by that too)

> Once the regex is compiled, you should have a state machine pattern
> matcher.  It should be O(n) in the length of the input to figure out
> that it doesn't match as far as "ENQ".  And that's exactly how long it
> should take for "if 'ENQ' not in line" to run as well.

You should read again on the O(...) notation. It's an asymptotic complexity,
it tells you nothing about the exact function values at different data points.
So you can have two O(n) routines, one of which always twice faster than the
other.

Regards

Antoine.

[toc] | [prev] | [next] | [standalone]

#48667

From	Roy Smith <roy@panix.com>
Date	2013-06-18 13:23 -0700
Message-ID	<67bebb63-5f5c-4824-bf5b-e61547425449@googlegroups.com>
In reply to	#48665

On Tuesday, June 18, 2013 4:05:25 PM UTC-4, Antoine Pitrou wrote:

> One invokes a fast special-purpose substring searching routine (the
> str.__contains__ operator), the other a generic matching engine able to
> process complex patterns. It's hardly a surprise for the specialized routine
> to be faster.

Except that the complexity in regexes is compiling the pattern down to a FSM.  Once you've got the FSM built, the inner loop should be pretty quick. In C, the inner loop for executing a FSM should be something like:

for(char* p = input; p; ++p) {
    next_state = current_state[*p];
    if (next_state == MATCH) {
        break;
   }
}

which should compile down to a couple of machine instructions which run entirely in the instruction pipeline cache.  But I'm probably simplifying it more than I should :-)

> (to be fair, on CPython there's also the fact that operators are faster
> than method calls, so some overhead is added by that too)

I've been doing some experimenting, and I'm inclined to believe this is indeed a significant part of it.  I also took some ideas from André Malo and factored out some name lookups from the inner loop.  That bummed me another 10% in speed.

[toc] | [prev] | [next] | [standalone]

#48712

From	Duncan Booth <duncan.booth@invalid.invalid>
Date	2013-06-19 13:21 +0000
Message-ID	<XnsA1E492148973Aduncanbooth@127.0.0.1>
In reply to	#48667

Roy Smith <roy@panix.com> wrote:

> Except that the complexity in regexes is compiling the pattern down to
> a FSM.  Once you've got the FSM built, the inner loop should be pretty
> quick. In C, the inner loop for executing a FSM should be something
> like: 
> 
> for(char* p = input; p; ++p) {
>     next_state = current_state[*p];
>     if (next_state == MATCH) {
>         break;
>    }
> }
> 
> which should compile down to a couple of machine instructions which
> run entirely in the instruction pipeline cache.  But I'm probably
> simplifying it more than I should :-) 

I'd just like to point out that your simple loop is looking at every 
character of the input string. The simple "'ENQ' not in line" test can look 
at the third character of the string and if it's none of 'E', 'N' or 'Q' 
skip to checking the 6th and then the 9th. It doesn't have to touch the 
intervening characters at all.

Or as the source puts it: it's a mix between Boyer-Moore and Horspool, with 
a few more bells and whistles on the top.

Also the regex library has to do a whole lot more than just figuring out if 
it got a match, so you have massively over-simplified it.

-- 
Duncan Booth http://kupuguy.blogspot.com

[toc] | [prev] | [next] | [standalone]

#48746

From	Roy Smith <roy@panix.com>
Date	2013-06-19 12:55 -0700
Message-ID	<272ab565-823a-4974-9473-0d2185066108@googlegroups.com>
In reply to	#48712

On Wednesday, June 19, 2013 9:21:43 AM UTC-4, Duncan Booth wrote:

> I'd just like to point out that your simple loop is looking at every 
> character of the input string. The simple "'ENQ' not in line" test can look 
> at the third character of the string and if it's none of 'E', 'N' or 'Q' 
> skip to checking the 6th and then the 9th. It doesn't have to touch the 
> intervening characters at all.

It's been a while since I looked at boyer-moore in detail.  Looking at Objects/stringlib/fastsearch.h from the 2.7.4 source, it occurs to me that:

        /* create compressed boyer-moore delta 1 table */

        /* process pattern[:-1] */
	for (i = 0; i < mlast; i++) {
            STRINGLIB_BLOOM_ADD(mask, p[i]);
            if (p[i] == p[mlast])
                skip = mlast - i - 1;
        }
        /* process pattern[-1] outside the loop */
        STRINGLIB_BLOOM_ADD(mask, p[mlast]);

is essentially (well, sort-if) the same as the compile() step of a regex.  For the (presumably) common use case of searching many strings for the same substring (which is what we're doing here), it seems like it would be a win to cache the mask and reuse it if the search string id is the same as the last search string id.  The overhead on cache misses would be a single pointer comparison.  Has anybody looked at doing that?

[toc] | [prev] | [next] | [standalone]

#48668

From	Grant Edwards <invalid@invalid.invalid>
Date	2013-06-18 20:30 +0000
Message-ID	<kpqg0s$4q0$1@reader1.panix.com>
In reply to	#48665

On 2013-06-18, Antoine Pitrou <solipsis@pitrou.net> wrote:
> Roy Smith <roy <at> panix.com> writes:
>
> You should read again on the O(...) notation. It's an asymptotic complexity,
> it tells you nothing about the exact function values at different data points.
> So you can have two O(n) routines, one of which always twice faster than the
> other.

And you can have two O(n) routines, one of which is twice as fast for
one value of n and the other is twice as fast for a different value of
n (and that's true for any value of 'twice': 2X 10X 100X).

All the O() tells you is the general shape of the line.  It doesn't
tell you where the line is or how steep the slope is (except in the
case of O(1), where you do know the slope is 0.  It's perfectly
feasible that for the range of values of n that you care about in a
particular application, there's an O(n^2) algorithm that's way faster
than another O(log(n)) algorithm.  [Though that becomes a lot less
likely as n gets large.]

-- 
Grant Edwards               grant.b.edwards        Yow! Where's SANDY DUNCAN?
                                  at               
                              gmail.com

[toc] | [prev] | [next] | [standalone]

#48670

From	Terry Reedy <tjreedy@udel.edu>
Date	2013-06-18 17:29 -0400
Message-ID	<mailman.3559.1371590994.3114.python-list@python.org>
In reply to	#48668

On 6/18/2013 4:30 PM, Grant Edwards wrote:
> On 2013-06-18, Antoine Pitrou <solipsis@pitrou.net> wrote:
>> Roy Smith <roy <at> panix.com> writes:
>>
>> You should read again on the O(...) notation. It's an asymptotic complexity,
>> it tells you nothing about the exact function values at different data points.
>> So you can have two O(n) routines, one of which always twice faster than the
>> other.

Or one that is a million times as fast.

> And you can have two O(n) routines, one of which is twice as fast for
> one value of n and the other is twice as fast for a different value of
> n (and that's true for any value of 'twice': 2X 10X 100X).
>
> All the O() tells you is the general shape of the line.  It doesn't
> tell you where the line is or how steep the slope is (except in the
> case of O(1), where you do know the slope is 0.  It's perfectly
> feasible that for the range of values of n that you care about in a
> particular application, there's an O(n^2) algorithm that's way faster
> than another O(log(n)) algorithm.

In fact, Tim Peters put together two facts to create the current list.sort.
1. O(n*n) binary insert sort is faster than O(n*logn) merge sort, with 
both competently coded in C, for n up to about 64. Part of the reason is 
that binary insert sort is actually O(n*logn) (n binary searches) + 
O(n*n) (n insertions with a shift averaging n/2 items). The multiplier 
for the O(n*n) part is much smaller because on modern CPUs, the shift 
needed for the insertion is a single machine instruction.
2. O(n*logn) sorts have a lower assymtotic complexity because they 
divide the sequence roughly in half about logn times. In other words, 
they are 'fast' because they split a list into lots of little pieces. So 
Tim's aha moment was to think 'Lets stop splitting when pieces are less 
than or equal to 64, rather than splitting all the way down to 1 or 2".

-- 
Terry Jan Reedy

[toc] | [prev] | [next] | [standalone]

#48699

From	Johannes Bauer <dfnsonfsduifb@gmx.de>
Date	2013-06-19 10:29 +0200
Message-ID	<kprq65$bq8$1@news.albasani.net>
In reply to	#48668

On 18.06.2013 22:30, Grant Edwards wrote:

> All the O() tells you is the general shape of the line.

Nitpick: it only gives an *upper bound* for the complexity. Any function
that is within O(n) is also within O(n^2). Usually when people say O()
they actually mean capital Thetha (which is the correct term).

> It's perfectly
> feasible that for the range of values of n that you care about in a
> particular application, there's an O(n^2) algorithm that's way faster
> than another O(log(n)) algorithm.  [Though that becomes a lot less
> likely as n gets large.]

Since O() only gives upper bounds it's also possible for an algorithm
within O(n^2) to always be faster than another algorithm within O(logn).
The O(n^2) algorithm could be Thetha(1).

Regards,
Johannes

-- 
>> Wo hattest Du das Beben nochmal GENAU vorhergesagt?
> Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
 - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>

[toc] | [prev] | [next] | [standalone]

Page 1 of 2 [1] 2 Next page →

csiph-web

Why is regex so slow?

Contents

#48641 — Why is regex so slow?

#48643

#48644

#48645

#48653

#48660

#48664

#48666

#48647

#48648

#48659

#48663

#48658

#48665

#48667

#48712

#48746

#48668

#48670

#48699