Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #48641 > unrolled thread
| Started by | roy@panix.com (Roy Smith) |
|---|---|
| First post | 2013-06-18 12:45 -0400 |
| Last post | 2013-06-19 03:16 +0000 |
| Articles | 20 on this page of 23 — 15 participants |
Back to article view | Back to comp.lang.python
Why is regex so slow? roy@panix.com (Roy Smith) - 2013-06-18 12:45 -0400
Re: Why is regex so slow? Skip Montanaro <skip@pobox.com> - 2013-06-18 12:01 -0500
Re: Why is regex so slow? Roy Smith <roy@panix.com> - 2013-06-18 13:08 -0400
Re: Why is regex so slow? Chris Angelico <rosuav@gmail.com> - 2013-06-19 03:20 +1000
Re: Why is regex so slow? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2013-06-18 20:10 +0200
Re: Why is regex so slow? Roy Smith <roy@panix.com> - 2013-06-18 12:40 -0700
Re: Why is regex so slow? André Malo <ndparker@gmail.com> - 2013-06-18 21:59 +0200
Re: Why is regex so slow? André Malo <ndparker@gmail.com> - 2013-06-18 22:13 +0200
Re: Why is regex so slow? MRAB <python@mrabarnett.plus.com> - 2013-06-18 18:31 +0100
Re: Why is regex so slow? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-06-18 18:34 +0100
Re: Why is regex so slow? roy@panix.com (Roy Smith) - 2013-06-18 15:21 -0400
Re: Why is regex so slow? MRAB <python@mrabarnett.plus.com> - 2013-06-18 20:49 +0100
Re: Why is regex so slow? Rick Johnson <rantingrickjohnson@gmail.com> - 2013-06-18 12:21 -0700
Re: Why is regex so slow? Antoine Pitrou <solipsis@pitrou.net> - 2013-06-18 20:05 +0000
Re: Why is regex so slow? Roy Smith <roy@panix.com> - 2013-06-18 13:23 -0700
Re: Why is regex so slow? Duncan Booth <duncan.booth@invalid.invalid> - 2013-06-19 13:21 +0000
Re: Why is regex so slow? Roy Smith <roy@panix.com> - 2013-06-19 12:55 -0700
Re: Why is regex so slow? Grant Edwards <invalid@invalid.invalid> - 2013-06-18 20:30 +0000
Re: Why is regex so slow? Terry Reedy <tjreedy@udel.edu> - 2013-06-18 17:29 -0400
Re: Why is regex so slow? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2013-06-19 10:29 +0200
Re: Why is regex so slow? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-19 01:51 +0000
Re: Why is regex so slow? Dave Angel <davea@davea.name> - 2013-06-18 22:11 -0400
Re: Why is regex so slow? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-19 03:16 +0000
Page 1 of 2 [1] 2 Next page →
| From | roy@panix.com (Roy Smith) |
|---|---|
| Date | 2013-06-18 12:45 -0400 |
| Subject | Why is regex so slow? |
| Message-ID | <kpq2r9$gg6$1@panix2.panix.com> |
I've got a 170 MB file I want to search for lines that look like:
[2010-10-20 16:47:50.339229 -04:00] INFO (6): songza.amie.history - ENQUEUEING: /listen/the-station-one
This code runs in 1.3 seconds:
------------------------------
import re
pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0
for line in open('error.log'):
m = pattern.search(line)
if m:
count += 1
print count
------------------------------
If I add a pre-filter before the regex, it runs in 0.78 seconds (about
twice the speed!)
------------------------------
import re
pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0
for line in open('error.log'):
if 'ENQ' not in line:
continue
m = pattern.search(line)
if m:
count += 1
print count
------------------------------
Every line which contains 'ENQ' also matches the full regex (61425
lines match, out of 2.1 million total). I don't understand why the
first way is so much slower.
Once the regex is compiled, you should have a state machine pattern
matcher. It should be O(n) in the length of the input to figure out
that it doesn't match as far as "ENQ". And that's exactly how long it
should take for "if 'ENQ' not in line" to run as well. Why is doing
twice the work also twice the speed?
I'm running Python 2.7.3 on Ubuntu Precise, x86_64.
[toc] | [next] | [standalone]
| From | Skip Montanaro <skip@pobox.com> |
|---|---|
| Date | 2013-06-18 12:01 -0500 |
| Message-ID | <mailman.3543.1371574888.3114.python-list@python.org> |
| In reply to | #48641 |
> I don't understand why the first way is so much slower. I have no obvious answers, but a couple suggestions: 1. Can you anchor the pattern at the beginning of the line? (use match() instead of search()) 2. Does it get faster it you eliminate the "(.*)" part of the pattern? It seems that if you find a line matching the first part of the pattern, you could just as easily split the line yourself instead of creating a group. Skip
[toc] | [prev] | [next] | [standalone]
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-06-18 13:08 -0400 |
| Message-ID | <mailman.3544.1371575342.3114.python-list@python.org> |
| In reply to | #48641 |
On Jun 18, 2013, at 1:01 PM, Skip Montanaro wrote: >> I don't understand why the first way is so much slower. > > I have no obvious answers, but a couple suggestions: > > 1. Can you anchor the pattern at the beginning of the line? (use > match() instead of search()) That's one of the things we tried. Didn't make any difference. > 2. Does it get faster it you eliminate the "(.*)" part of the pattern? Just tried that, it also didn't make any difference. > It seems that if you find a line matching the first part of the > pattern, you could just as easily split the line yourself instead of > creating a group. At this point, I'm not so much interested in making this faster as understanding why it's so slow. I'm tempted to open this up as a performance bug against the regex module (which I assume will be rejected, at least for the 2.x series). --- Roy Smith roy@panix.com
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-06-19 03:20 +1000 |
| Message-ID | <mailman.3545.1371576060.3114.python-list@python.org> |
| In reply to | #48641 |
On Wed, Jun 19, 2013 at 3:08 AM, Roy Smith <roy@panix.com> wrote: > I'm tempted to open this up as a performance bug against the regex module (which I assume will be rejected, at least for the 2.x series). Yeah, I'd try that against 3.3 before opening a performance bug. Also, it's entirely possible that performance is majorly different in 3.x anyway, on account of strings being Unicode. Definitely merits another look imho. ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Johannes Bauer <dfnsonfsduifb@gmx.de> |
|---|---|
| Date | 2013-06-18 20:10 +0200 |
| Message-ID | <kpq7q8$n9o$1@news.albasani.net> |
| In reply to | #48645 |
On 18.06.2013 19:20, Chris Angelico wrote:
> Yeah, I'd try that against 3.3 before opening a performance bug. Also,
> it's entirely possible that performance is majorly different in 3.x
> anyway, on account of strings being Unicode. Definitely merits another
> look imho.
Hmmm, at least Python 3.2 seems to have the same issue. I generated test
data with:
#!/usr/bin/python3
import random
random.seed(0)
f = open("error.log", "w")
for i in range(1500000):
q = random.randint(0, 99)
if q == 0:
print("ENQUEUEING: /listen/ fhsduifhsd uifhuisd hfuisd hfuihds
iufhsd", file = f)
else:
print("fiosdjfoi sdmfio sdmfio msdiof msdoif msdoimf oisd mfoisdm f",
file = f)
Resulting file has a size of 91530018 and md5 of
2d20c3447a0b51a37d28126b8348f6c5 (just to make sure we're on the same
page because I'm not sure the PRNG is stable across Python versions).
Testing with:
#!/usr/bin/python3
import re
pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0
for line in open('error.log'):
# if 'ENQ' not in line:
# continue
m = pattern.search(line)
if m:
count += 1
print(count)
The pre-check version is about 42% faster in my case (0.75 sec vs. 1.3
sec). Curious. This is Python 3.2.3 on Linux x86_64.
Regards,
Johannes
--
>> Wo hattest Du das Beben nochmal GENAU vorhergesagt?
> Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
- Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>
[toc] | [prev] | [next] | [standalone]
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-06-18 12:40 -0700 |
| Message-ID | <32f93486-61f9-46c6-9efc-ca7507f3054e@googlegroups.com> |
| In reply to | #48653 |
On Tuesday, June 18, 2013 2:10:16 PM UTC-4, Johannes Bauer wrote: > Resulting file has a size of 91530018 and md5 of > > 2d20c3447a0b51a37d28126b8348f6c5 (just to make sure we're on the same > > page because I'm not sure the PRNG is stable across Python versions). If people want to test against my original data, grab https://s3.amazonaws.com/songza/recruiting/logs.tar.gz and do "cat app[12].error.log > error.log".
[toc] | [prev] | [next] | [standalone]
| From | André Malo <ndparker@gmail.com> |
|---|---|
| Date | 2013-06-18 21:59 +0200 |
| Message-ID | <2281418.s73s1ze9gg@news.perlig.de> |
| In reply to | #48653 |
* Johannes Bauer wrote:
> The pre-check version is about 42% faster in my case (0.75 sec vs. 1.3
> sec). Curious. This is Python 3.2.3 on Linux x86_64.
A lot of time is spent with dict lookups (timings at my box, Python 3.2.3)
in your inner loop (1500000 times...)
#!/usr/bin/python3
import re
pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0
for line in open('error.log'):
#if 'ENQ' not in line:
# continue
m = pattern.search(line)
if m:
count += 1
print(count)
runs ~ 1.39 s
replacing some dict lookups with index lookups:
#!/usr/bin/python3
def main():
import re
pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0
for line in open('error.log'):
#if 'ENQ' not in line:
# continue
m = pattern.search(line)
if m:
count += 1
print(count)
main()
runs ~ 1.15s
and further:
#!/usr/bin/python3
def main():
import re
pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
search = pattern.search
count = 0
for line in open('error.log'):
#if 'ENQ' not in line:
# continue
m = search(line)
if m:
count += 1
print(count)
main()
runs ~ 1.08 s
and for reference:
#!/usr/bin/python3
def main():
import re
pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
search = pattern.search
count = 0
for line in open('error.log'):
if 'ENQ' not in line:
continue
m = search(line)
if m:
count += 1
print(count)
main()
runs ~ 0.71 s
The final difference is probably just the difference between a hardcoded
string search and a generic NFA.
nd
[toc] | [prev] | [next] | [standalone]
| From | André Malo <ndparker@gmail.com> |
|---|---|
| Date | 2013-06-18 22:13 +0200 |
| Message-ID | <1532119.OxbuKqgHX2@news.perlig.de> |
| In reply to | #48664 |
* André Malo wrote:
> * Johannes Bauer wrote:
>
>> The pre-check version is about 42% faster in my case (0.75 sec vs. 1.3
>> sec). Curious. This is Python 3.2.3 on Linux x86_64.
>
> A lot of time is spent with dict lookups (timings at my box, Python 3.2.3)
> in your inner loop (1500000 times...)
[...]
Missed one basic timing BTW:
#!/usr/bin/python3
def main():
for line in open('error.log'):
pass
main()
runs ~ 0.53 s
nd
--
Already I've seen people (really!) write web URLs in the form:
http:\\some.site.somewhere
[...] How soon until greengrocers start writing "apples $1\pound"
or something? -- Joona I Palaste in clc
[toc] | [prev] | [next] | [standalone]
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2013-06-18 18:31 +0100 |
| Message-ID | <mailman.3548.1371576696.3114.python-list@python.org> |
| In reply to | #48641 |
On 18/06/2013 17:45, Roy Smith wrote:
> I've got a 170 MB file I want to search for lines that look like:
>
> [2010-10-20 16:47:50.339229 -04:00] INFO (6): songza.amie.history - ENQUEUEING: /listen/the-station-one
>
> This code runs in 1.3 seconds:
>
> ------------------------------
> import re
>
> pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
> count = 0
>
> for line in open('error.log'):
> m = pattern.search(line)
> if m:
> count += 1
>
> print count
> ------------------------------
>
> If I add a pre-filter before the regex, it runs in 0.78 seconds (about
> twice the speed!)
>
> ------------------------------
> import re
>
> pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
> count = 0
>
> for line in open('error.log'):
> if 'ENQ' not in line:
> continue
> m = pattern.search(line)
> if m:
> count += 1
>
> print count
> ------------------------------
>
> Every line which contains 'ENQ' also matches the full regex (61425
> lines match, out of 2.1 million total). I don't understand why the
> first way is so much slower.
>
> Once the regex is compiled, you should have a state machine pattern
> matcher. It should be O(n) in the length of the input to figure out
> that it doesn't match as far as "ENQ". And that's exactly how long it
> should take for "if 'ENQ' not in line" to run as well. Why is doing
> twice the work also twice the speed?
>
> I'm running Python 2.7.3 on Ubuntu Precise, x86_64.
>
I'd be interested in how the 'regex' module
(http://pypi.python.org/pypi/regex) compares. :-)
[toc] | [prev] | [next] | [standalone]
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2013-06-18 18:34 +0100 |
| Message-ID | <mailman.3549.1371576854.3114.python-list@python.org> |
| In reply to | #48641 |
On 18/06/2013 18:08, Roy Smith wrote: > > On Jun 18, 2013, at 1:01 PM, Skip Montanaro wrote: > >>> I don't understand why the first way is so much slower. >> >> I have no obvious answers, but a couple suggestions: >> >> 1. Can you anchor the pattern at the beginning of the line? (use >> match() instead of search()) > > That's one of the things we tried. Didn't make any difference. > >> 2. Does it get faster it you eliminate the "(.*)" part of the pattern? > > Just tried that, it also didn't make any difference. > >> It seems that if you find a line matching the first part of the >> pattern, you could just as easily split the line yourself instead of >> creating a group. > > > At this point, I'm not so much interested in making this faster as understanding why it's so slow. I'm tempted to open this up as a performance bug against the regex module (which I assume will be rejected, at least for the 2.x series). > > --- > Roy Smith > roy@panix.com > Out of curiousity have the tried the new regex module from pypi rather than the stdlib version? A heck of a lot of work has gone into it see http://bugs.python.org/issue2636 -- "Steve is going for the pink ball - and for those of you who are watching in black and white, the pink is next to the green." Snooker commentator 'Whispering' Ted Lowe. Mark Lawrence
[toc] | [prev] | [next] | [standalone]
| From | roy@panix.com (Roy Smith) |
|---|---|
| Date | 2013-06-18 15:21 -0400 |
| Message-ID | <kpqc0a$8am$1@panix2.panix.com> |
| In reply to | #48648 |
In article <mailman.3549.1371576854.3114.python-list@python.org>, Mark Lawrence <breamoreboy@yahoo.co.uk> wrote: > Out of curiousity have the tried the new regex module from pypi rather > than the stdlib version? A heck of a lot of work has gone into it see > http://bugs.python.org/issue2636 I just installed that and gave it a shot. It's *slower* (and, much higher variation from run to run). I'm too exhausted fighting with OpenOffice to get this into some sane spreadsheet format, so here's the raw timings: Built-in re module: 0:01.32 0:01.33 0:01.32 0:01.33 0:01.35 0:01.32 0:01.35 0:01.36 0:01.33 0:01.32 regex with flags=V0: 0:01.66 0:01.53 0:01.51 0:01.47 0:01.81 0:01.58 0:01.78 0:01.57 0:01.64 0:01.60 regex with flags=V1: 0:01.53 0:01.57 0:01.65 0:01.61 0:01.83 0:01.82 0:01.59 0:01.60 0:01.55 0:01.82
[toc] | [prev] | [next] | [standalone]
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2013-06-18 20:49 +0100 |
| Message-ID | <mailman.3556.1371584976.3114.python-list@python.org> |
| In reply to | #48659 |
On 18/06/2013 20:21, Roy Smith wrote: > In article <mailman.3549.1371576854.3114.python-list@python.org>, > Mark Lawrence <breamoreboy@yahoo.co.uk> wrote: > >> Out of curiousity have the tried the new regex module from pypi rather >> than the stdlib version? A heck of a lot of work has gone into it see >> http://bugs.python.org/issue2636 > > I just installed that and gave it a shot. It's *slower* (and, much > higher variation from run to run). I'm too exhausted fighting with > OpenOffice to get this into some sane spreadsheet format, so here's > the raw timings: > > Built-in re module: > 0:01.32 > 0:01.33 > 0:01.32 > 0:01.33 > 0:01.35 > 0:01.32 > 0:01.35 > 0:01.36 > 0:01.33 > 0:01.32 > > regex with flags=V0: > 0:01.66 > 0:01.53 > 0:01.51 > 0:01.47 > 0:01.81 > 0:01.58 > 0:01.78 > 0:01.57 > 0:01.64 > 0:01.60 > > regex with flags=V1: > 0:01.53 > 0:01.57 > 0:01.65 > 0:01.61 > 0:01.83 > 0:01.82 > 0:01.59 > 0:01.60 > 0:01.55 > 0:01.82 > I reckon that about 1/3 of that time is spent in PyArg_ParseTupleAndKeywords, just getting the arguments! There's a higher initial overhead in using regex than string methods, so working just a line at time will take longer.
[toc] | [prev] | [next] | [standalone]
| From | Rick Johnson <rantingrickjohnson@gmail.com> |
|---|---|
| Date | 2013-06-18 12:21 -0700 |
| Message-ID | <4617e637-43d4-4460-9f6a-df96b6eba562@googlegroups.com> |
| In reply to | #48641 |
On Tuesday, June 18, 2013 11:45:29 AM UTC-5, Roy Smith wrote:
> I've got a 170 MB file I want to search for lines that look like:
> [2010-10-20 16:47:50.339229 -04:00] INFO (6): songza.amie.history - ENQUEUEING: /listen/the-station-one
> This code runs in 1.3 seconds:
> ------------------------------
> import re
> pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
> count = 0
> for line in open('error.log'):
> m = pattern.search(line)
> if m:
> count += 1
> print count
Is the power of regexps required to solve such a simplistic problem? I believe string methods should suffice.
py> line = "[2010-10-20 16:47:50.339229 -04:00] INFO (6): songza.amie.history - ENQUEUEING: /listen/the-station-one"
py> idx = line.find('ENQ')
py> if idx > 0:
match = line[idx:]
py> match
'ENQUEUEING: /listen/the-station-one'
[toc] | [prev] | [next] | [standalone]
| From | Antoine Pitrou <solipsis@pitrou.net> |
|---|---|
| Date | 2013-06-18 20:05 +0000 |
| Message-ID | <mailman.3557.1371585950.3114.python-list@python.org> |
| In reply to | #48641 |
Roy Smith <roy <at> panix.com> writes: > > Every line which contains 'ENQ' also matches the full regex (61425 > lines match, out of 2.1 million total). I don't understand why the > first way is so much slower. One invokes a fast special-purpose substring searching routine (the str.__contains__ operator), the other a generic matching engine able to process complex patterns. It's hardly a surprise for the specialized routine to be faster. That's like saying "I don't understand why my CPU is slower than my GPU at calculating 3D structures". That said, there may be ways to improve the regex engine to deal with such special cases in a speedier way. But there will still be some overhead related to the fact that you are invoking a powerful generic engine rather than a lean and mean specialized routine. (to be fair, on CPython there's also the fact that operators are faster than method calls, so some overhead is added by that too) > Once the regex is compiled, you should have a state machine pattern > matcher. It should be O(n) in the length of the input to figure out > that it doesn't match as far as "ENQ". And that's exactly how long it > should take for "if 'ENQ' not in line" to run as well. You should read again on the O(...) notation. It's an asymptotic complexity, it tells you nothing about the exact function values at different data points. So you can have two O(n) routines, one of which always twice faster than the other. Regards Antoine.
[toc] | [prev] | [next] | [standalone]
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-06-18 13:23 -0700 |
| Message-ID | <67bebb63-5f5c-4824-bf5b-e61547425449@googlegroups.com> |
| In reply to | #48665 |
On Tuesday, June 18, 2013 4:05:25 PM UTC-4, Antoine Pitrou wrote:
> One invokes a fast special-purpose substring searching routine (the
> str.__contains__ operator), the other a generic matching engine able to
> process complex patterns. It's hardly a surprise for the specialized routine
> to be faster.
Except that the complexity in regexes is compiling the pattern down to a FSM. Once you've got the FSM built, the inner loop should be pretty quick. In C, the inner loop for executing a FSM should be something like:
for(char* p = input; p; ++p) {
next_state = current_state[*p];
if (next_state == MATCH) {
break;
}
}
which should compile down to a couple of machine instructions which run entirely in the instruction pipeline cache. But I'm probably simplifying it more than I should :-)
> (to be fair, on CPython there's also the fact that operators are faster
> than method calls, so some overhead is added by that too)
I've been doing some experimenting, and I'm inclined to believe this is indeed a significant part of it. I also took some ideas from André Malo and factored out some name lookups from the inner loop. That bummed me another 10% in speed.
[toc] | [prev] | [next] | [standalone]
| From | Duncan Booth <duncan.booth@invalid.invalid> |
|---|---|
| Date | 2013-06-19 13:21 +0000 |
| Message-ID | <XnsA1E492148973Aduncanbooth@127.0.0.1> |
| In reply to | #48667 |
Roy Smith <roy@panix.com> wrote:
> Except that the complexity in regexes is compiling the pattern down to
> a FSM. Once you've got the FSM built, the inner loop should be pretty
> quick. In C, the inner loop for executing a FSM should be something
> like:
>
> for(char* p = input; p; ++p) {
> next_state = current_state[*p];
> if (next_state == MATCH) {
> break;
> }
> }
>
> which should compile down to a couple of machine instructions which
> run entirely in the instruction pipeline cache. But I'm probably
> simplifying it more than I should :-)
I'd just like to point out that your simple loop is looking at every
character of the input string. The simple "'ENQ' not in line" test can look
at the third character of the string and if it's none of 'E', 'N' or 'Q'
skip to checking the 6th and then the 9th. It doesn't have to touch the
intervening characters at all.
Or as the source puts it: it's a mix between Boyer-Moore and Horspool, with
a few more bells and whistles on the top.
Also the regex library has to do a whole lot more than just figuring out if
it got a match, so you have massively over-simplified it.
--
Duncan Booth http://kupuguy.blogspot.com
[toc] | [prev] | [next] | [standalone]
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-06-19 12:55 -0700 |
| Message-ID | <272ab565-823a-4974-9473-0d2185066108@googlegroups.com> |
| In reply to | #48712 |
On Wednesday, June 19, 2013 9:21:43 AM UTC-4, Duncan Booth wrote:
> I'd just like to point out that your simple loop is looking at every
> character of the input string. The simple "'ENQ' not in line" test can look
> at the third character of the string and if it's none of 'E', 'N' or 'Q'
> skip to checking the 6th and then the 9th. It doesn't have to touch the
> intervening characters at all.
It's been a while since I looked at boyer-moore in detail. Looking at Objects/stringlib/fastsearch.h from the 2.7.4 source, it occurs to me that:
/* create compressed boyer-moore delta 1 table */
/* process pattern[:-1] */
for (i = 0; i < mlast; i++) {
STRINGLIB_BLOOM_ADD(mask, p[i]);
if (p[i] == p[mlast])
skip = mlast - i - 1;
}
/* process pattern[-1] outside the loop */
STRINGLIB_BLOOM_ADD(mask, p[mlast]);
is essentially (well, sort-if) the same as the compile() step of a regex. For the (presumably) common use case of searching many strings for the same substring (which is what we're doing here), it seems like it would be a win to cache the mask and reuse it if the search string id is the same as the last search string id. The overhead on cache misses would be a single pointer comparison. Has anybody looked at doing that?
[toc] | [prev] | [next] | [standalone]
| From | Grant Edwards <invalid@invalid.invalid> |
|---|---|
| Date | 2013-06-18 20:30 +0000 |
| Message-ID | <kpqg0s$4q0$1@reader1.panix.com> |
| In reply to | #48665 |
On 2013-06-18, Antoine Pitrou <solipsis@pitrou.net> wrote:
> Roy Smith <roy <at> panix.com> writes:
>
> You should read again on the O(...) notation. It's an asymptotic complexity,
> it tells you nothing about the exact function values at different data points.
> So you can have two O(n) routines, one of which always twice faster than the
> other.
And you can have two O(n) routines, one of which is twice as fast for
one value of n and the other is twice as fast for a different value of
n (and that's true for any value of 'twice': 2X 10X 100X).
All the O() tells you is the general shape of the line. It doesn't
tell you where the line is or how steep the slope is (except in the
case of O(1), where you do know the slope is 0. It's perfectly
feasible that for the range of values of n that you care about in a
particular application, there's an O(n^2) algorithm that's way faster
than another O(log(n)) algorithm. [Though that becomes a lot less
likely as n gets large.]
--
Grant Edwards grant.b.edwards Yow! Where's SANDY DUNCAN?
at
gmail.com
[toc] | [prev] | [next] | [standalone]
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2013-06-18 17:29 -0400 |
| Message-ID | <mailman.3559.1371590994.3114.python-list@python.org> |
| In reply to | #48668 |
On 6/18/2013 4:30 PM, Grant Edwards wrote: > On 2013-06-18, Antoine Pitrou <solipsis@pitrou.net> wrote: >> Roy Smith <roy <at> panix.com> writes: >> >> You should read again on the O(...) notation. It's an asymptotic complexity, >> it tells you nothing about the exact function values at different data points. >> So you can have two O(n) routines, one of which always twice faster than the >> other. Or one that is a million times as fast. > And you can have two O(n) routines, one of which is twice as fast for > one value of n and the other is twice as fast for a different value of > n (and that's true for any value of 'twice': 2X 10X 100X). > > All the O() tells you is the general shape of the line. It doesn't > tell you where the line is or how steep the slope is (except in the > case of O(1), where you do know the slope is 0. It's perfectly > feasible that for the range of values of n that you care about in a > particular application, there's an O(n^2) algorithm that's way faster > than another O(log(n)) algorithm. In fact, Tim Peters put together two facts to create the current list.sort. 1. O(n*n) binary insert sort is faster than O(n*logn) merge sort, with both competently coded in C, for n up to about 64. Part of the reason is that binary insert sort is actually O(n*logn) (n binary searches) + O(n*n) (n insertions with a shift averaging n/2 items). The multiplier for the O(n*n) part is much smaller because on modern CPUs, the shift needed for the insertion is a single machine instruction. 2. O(n*logn) sorts have a lower assymtotic complexity because they divide the sequence roughly in half about logn times. In other words, they are 'fast' because they split a list into lots of little pieces. So Tim's aha moment was to think 'Lets stop splitting when pieces are less than or equal to 64, rather than splitting all the way down to 1 or 2". -- Terry Jan Reedy
[toc] | [prev] | [next] | [standalone]
| From | Johannes Bauer <dfnsonfsduifb@gmx.de> |
|---|---|
| Date | 2013-06-19 10:29 +0200 |
| Message-ID | <kprq65$bq8$1@news.albasani.net> |
| In reply to | #48668 |
On 18.06.2013 22:30, Grant Edwards wrote: > All the O() tells you is the general shape of the line. Nitpick: it only gives an *upper bound* for the complexity. Any function that is within O(n) is also within O(n^2). Usually when people say O() they actually mean capital Thetha (which is the correct term). > It's perfectly > feasible that for the range of values of n that you care about in a > particular application, there's an O(n^2) algorithm that's way faster > than another O(log(n)) algorithm. [Though that becomes a lot less > likely as n gets large.] Since O() only gives upper bounds it's also possible for an algorithm within O(n^2) to always be faster than another algorithm within O(logn). The O(n^2) algorithm could be Thetha(1). Regards, Johannes -- >> Wo hattest Du das Beben nochmal GENAU vorhergesagt? > Zumindest nicht öffentlich! Ah, der neueste und bis heute genialste Streich unsere großen Kosmologen: Die Geheim-Vorhersage. - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>
[toc] | [prev] | [next] | [standalone]
Page 1 of 2 [1] 2 Next page →
Back to top | Article view | comp.lang.python
csiph-web