Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #48641 > unrolled thread
| Started by | roy@panix.com (Roy Smith) |
|---|---|
| First post | 2013-06-18 12:45 -0400 |
| Last post | 2013-06-19 03:16 +0000 |
| Articles | 3 on this page of 23 — 15 participants |
Back to article view | Back to comp.lang.python
Why is regex so slow? roy@panix.com (Roy Smith) - 2013-06-18 12:45 -0400
Re: Why is regex so slow? Skip Montanaro <skip@pobox.com> - 2013-06-18 12:01 -0500
Re: Why is regex so slow? Roy Smith <roy@panix.com> - 2013-06-18 13:08 -0400
Re: Why is regex so slow? Chris Angelico <rosuav@gmail.com> - 2013-06-19 03:20 +1000
Re: Why is regex so slow? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2013-06-18 20:10 +0200
Re: Why is regex so slow? Roy Smith <roy@panix.com> - 2013-06-18 12:40 -0700
Re: Why is regex so slow? André Malo <ndparker@gmail.com> - 2013-06-18 21:59 +0200
Re: Why is regex so slow? André Malo <ndparker@gmail.com> - 2013-06-18 22:13 +0200
Re: Why is regex so slow? MRAB <python@mrabarnett.plus.com> - 2013-06-18 18:31 +0100
Re: Why is regex so slow? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-06-18 18:34 +0100
Re: Why is regex so slow? roy@panix.com (Roy Smith) - 2013-06-18 15:21 -0400
Re: Why is regex so slow? MRAB <python@mrabarnett.plus.com> - 2013-06-18 20:49 +0100
Re: Why is regex so slow? Rick Johnson <rantingrickjohnson@gmail.com> - 2013-06-18 12:21 -0700
Re: Why is regex so slow? Antoine Pitrou <solipsis@pitrou.net> - 2013-06-18 20:05 +0000
Re: Why is regex so slow? Roy Smith <roy@panix.com> - 2013-06-18 13:23 -0700
Re: Why is regex so slow? Duncan Booth <duncan.booth@invalid.invalid> - 2013-06-19 13:21 +0000
Re: Why is regex so slow? Roy Smith <roy@panix.com> - 2013-06-19 12:55 -0700
Re: Why is regex so slow? Grant Edwards <invalid@invalid.invalid> - 2013-06-18 20:30 +0000
Re: Why is regex so slow? Terry Reedy <tjreedy@udel.edu> - 2013-06-18 17:29 -0400
Re: Why is regex so slow? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2013-06-19 10:29 +0200
Re: Why is regex so slow? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-19 01:51 +0000
Re: Why is regex so slow? Dave Angel <davea@davea.name> - 2013-06-18 22:11 -0400
Re: Why is regex so slow? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-19 03:16 +0000
Page 2 of 2 — ← Prev page 1 [2]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-06-19 01:51 +0000 |
| Message-ID | <51c10e9e$0$29973$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #48641 |
On Tue, 18 Jun 2013 12:45:29 -0400, Roy Smith wrote:
> I've got a 170 MB file I want to search for lines that look like:
>
> [2010-10-20 16:47:50.339229 -04:00] INFO (6): songza.amie.history -
> ENQUEUEING: /listen/the-station-one
>
> This code runs in 1.3 seconds:
>
> ------------------------------
> import re
>
> pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
> count = 0
> for line in open('error.log'):
> m = pattern.search(line)
> if m:
> count += 1
>
> print count
> ------------------------------
>
> If I add a pre-filter before the regex, it runs in 0.78 seconds (about
> twice the speed!)
You say that as if it were surprising. It's not. Regex searches have
higher overhead than bog standard dumb `substring in string` tests, so
reducing the number of high-overhead regex searches calls with a low-
overhead `in` test is often a nett win.
Not always, it depends on how many hits/misses you have, and the relative
overhead of each, but as a general rule, I would always try pre-filtering
as the obvious way to optimize a regex.
Even if the regex engine is just as efficient at doing simple character
matching as `in`, and it probably isn't, your regex tries to match all
eleven characters of "ENQUEUEING" while the `in` test only has to match
three, "ENQ".
[...]
> Once the regex is compiled, you should have a state machine pattern
> matcher. It should be O(n) in the length of the input to figure out
> that it doesn't match as far as "ENQ".
Yes, but O(N) doesn't tell you how long it takes to run, only how it
*scales*.
class MyDumbString(str):
def __contains__(self, substr):
time.sleep(1000)
return super(MyDumbString, self).__contains__(substring)
MyDumbString `in` is also O(N), but it is a wee less efficient than the
built-in version...
Regexes do a lot of work, because they are far more complex than dumb
string __contains__. It should not surprise you that the overhead is
greater, even when matching a plain-ol' substring with no metacharacters.
Especially since Python does not traditionally use regexes for
everything, like Perl does, and so has not spent the effort to complicate
the implementation in order to squeeze every last microsecond out of it.
> And that's exactly how long it
> should take for "if 'ENQ' not in line" to run as well. Why is doing
> twice the work also twice the speed?
It's not doing twice the work. It's doing less work, overall, by spending
a moment to trivially filter out the lines which cannot possibly match,
before spending a few moments to do a more careful match. If the number
of non-matching lines is high, as it is in your data, then the cheaper
pre-filter pays for itself and then some.
--
Steven
[toc] | [prev] | [next] | [standalone]
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2013-06-18 22:11 -0400 |
| Message-ID | <mailman.3566.1371607883.3114.python-list@python.org> |
| In reply to | #48675 |
On 06/18/2013 09:51 PM, Steven D'Aprano wrote:
<SNIP>
>
> Even if the regex engine is just as efficient at doing simple character
> matching as `in`, and it probably isn't, your regex tries to match all
> eleven characters of "ENQUEUEING" while the `in` test only has to match
> three, "ENQ".
>
The rest of your post was valid, and useful, but there's a misconception
in this paragraph; I hope you don't mind me pointing it out.
In general, for simple substring searches, you can search for a large
string faster than you can search for a smaller one. I'd expect
if "ENQUEUING" in bigbuffer
to be faster than
if "ENQ" in bigbuffer
assuming that all occurrences of ENQ will actually match the whole
thing. If CPython's implementation doesn't show the speed difference,
maybe there's some room for optimization.
See Boyer-Moore if you want a peek at the algorithm.
When I was writiing a simple search program, I could typically search
for a 4-character string faster than REP SCASB could match a one
character string. And that's a single instruction (with prefix).
--
DaveA
[toc] | [prev] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-06-19 03:16 +0000 |
| Message-ID | <51c12297$0$29973$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #48676 |
On Tue, 18 Jun 2013 22:11:01 -0400, Dave Angel wrote: > On 06/18/2013 09:51 PM, Steven D'Aprano wrote: > > <SNIP> >> >> Even if the regex engine is just as efficient at doing simple character >> matching as `in`, and it probably isn't, your regex tries to match all >> eleven characters of "ENQUEUEING" while the `in` test only has to match >> three, "ENQ". >> >> > The rest of your post was valid, and useful, but there's a misconception > in this paragraph; I hope you don't mind me pointing it out. Of course not, I'm always happy to learn if I'm mistaken. > In general, for simple substring searches, you can search for a large > string faster than you can search for a smaller one. I'd expect > > if "ENQUEUING" in bigbuffer > > to be faster than > > if "ENQ" in bigbuffer And so it is: steve@runes:~$ python2.7 -m timeit -s "sub = 'ENQ'" \ > -s "s = 'blah '*10000 + 'ENQUIRING blah blah blah'" \ > "sub in s" 10000 loops, best of 3: 38.3 usec per loop steve@runes:~$ python2.7 -m timeit -s "sub = 'ENQUIRING'" \ > -s "s = 'blah '*10000 + 'ENQUIRING blah blah blah'" \ > "sub in s" 100000 loops, best of 3: 15.4 usec per loop Thank you. -- Steven
[toc] | [prev] | [standalone]
Page 2 of 2 — ← Prev page 1 [2]
Back to top | Article view | comp.lang.python
csiph-web