Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.python > #48665

Re: Why is regex so slow?

From Antoine Pitrou <solipsis@pitrou.net>
Subject Re: Why is regex so slow?
Date 2013-06-18 20:05 +0000
References <kpq2r9$gg6$1@panix2.panix.com>
Newsgroups comp.lang.python
Message-ID <mailman.3557.1371585950.3114.python-list@python.org> (permalink)

Show all headers | View raw


Roy Smith <roy <at> panix.com> writes:
> 
> Every line which contains 'ENQ' also matches the full regex (61425
> lines match, out of 2.1 million total).  I don't understand why the
> first way is so much slower.

One invokes a fast special-purpose substring searching routine (the
str.__contains__ operator), the other a generic matching engine able to
process complex patterns. It's hardly a surprise for the specialized routine
to be faster. That's like saying "I don't understand why my CPU is slower
than my GPU at calculating 3D structures".

That said, there may be ways to improve the regex engine to deal with such
special cases in a speedier way. But there will still be some overhead related
to the fact that you are invoking a powerful generic engine rather than a
lean and mean specialized routine.

(to be fair, on CPython there's also the fact that operators are faster
than method calls, so some overhead is added by that too)

> Once the regex is compiled, you should have a state machine pattern
> matcher.  It should be O(n) in the length of the input to figure out
> that it doesn't match as far as "ENQ".  And that's exactly how long it
> should take for "if 'ENQ' not in line" to run as well.

You should read again on the O(...) notation. It's an asymptotic complexity,
it tells you nothing about the exact function values at different data points.
So you can have two O(n) routines, one of which always twice faster than the
other.

Regards

Antoine.

Back to comp.lang.python | Previous | NextPrevious in thread | Next in thread | Find similar | Unroll thread


Thread

Why is regex so slow? roy@panix.com (Roy Smith) - 2013-06-18 12:45 -0400
  Re: Why is regex so slow? Skip Montanaro <skip@pobox.com> - 2013-06-18 12:01 -0500
  Re: Why is regex so slow? Roy Smith <roy@panix.com> - 2013-06-18 13:08 -0400
  Re: Why is regex so slow? Chris Angelico <rosuav@gmail.com> - 2013-06-19 03:20 +1000
    Re: Why is regex so slow? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2013-06-18 20:10 +0200
      Re: Why is regex so slow? Roy Smith <roy@panix.com> - 2013-06-18 12:40 -0700
      Re: Why is regex so slow? André Malo <ndparker@gmail.com> - 2013-06-18 21:59 +0200
        Re: Why is regex so slow? André Malo <ndparker@gmail.com> - 2013-06-18 22:13 +0200
  Re: Why is regex so slow? MRAB <python@mrabarnett.plus.com> - 2013-06-18 18:31 +0100
  Re: Why is regex so slow? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-06-18 18:34 +0100
    Re: Why is regex so slow? roy@panix.com (Roy Smith) - 2013-06-18 15:21 -0400
      Re: Why is regex so slow? MRAB <python@mrabarnett.plus.com> - 2013-06-18 20:49 +0100
  Re: Why is regex so slow? Rick Johnson <rantingrickjohnson@gmail.com> - 2013-06-18 12:21 -0700
  Re: Why is regex so slow? Antoine Pitrou <solipsis@pitrou.net> - 2013-06-18 20:05 +0000
    Re: Why is regex so slow? Roy Smith <roy@panix.com> - 2013-06-18 13:23 -0700
      Re: Why is regex so slow? Duncan Booth <duncan.booth@invalid.invalid> - 2013-06-19 13:21 +0000
        Re: Why is regex so slow? Roy Smith <roy@panix.com> - 2013-06-19 12:55 -0700
    Re: Why is regex so slow? Grant Edwards <invalid@invalid.invalid> - 2013-06-18 20:30 +0000
      Re: Why is regex so slow? Terry Reedy <tjreedy@udel.edu> - 2013-06-18 17:29 -0400
      Re: Why is regex so slow? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2013-06-19 10:29 +0200
  Re: Why is regex so slow? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-19 01:51 +0000
    Re: Why is regex so slow? Dave Angel <davea@davea.name> - 2013-06-18 22:11 -0400
      Re: Why is regex so slow? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-19 03:16 +0000

csiph-web