Path: csiph.com!usenet.pasdenom.info!gegeweb.org!newsfeed.kamp.net!newsfeed.kamp.net!87.79.20.101.MISMATCH!newsreader4.netcologne.de!news.netcologne.de!xlned.com!feeder1.xlned.com!newsfeed.xs4all.nl!newsfeed3.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail Return-Path: X-Original-To: python-list@python.org Delivered-To: python-list@mail.python.org X-Spam-Status: OK 0.003 X-Spam-Evidence: '*H*': 0.99; '*S*': 0.00; 'cpython': 0.05; 'matches': 0.07; 'patterns.': 0.07; '"if': 0.09; 'calculating': 0.09; 'received:80.91': 0.09; 'received:80.91.229': 0.09; 'received:gmane.org': 0.09; 'received:list': 0.09; 'subject:Why': 0.09; 'complexity,': 0.16; 'from:addr:pitrou.net': 0.16; 'from:addr:solipsis': 0.16; 'from:name:antoine pitrou': 0.16; 'invokes': 0.16; 'invoking': 0.16; 'message-id:@post.gmane.org': 0.16; 'notation.': 0.16; 'received:213.41.240': 0.16; 'received:213.41.240.54': 0.16; 'received:80.91.229.3': 0.16; 'received:charlus.yi.org': 0.16; 'received:plane.gmane.org': 0.16; 'received:yi.org': 0.16; 'roy': 0.16; 'subject:slow': 0.16; 'substring': 0.16; '(the': 0.22; 'machine': 0.22; 'input': 0.22; 'saying': 0.22; 'header:User-Agent:1': 0.23; '2.1': 0.24; 'tells': 0.24; 'values': 0.27; 'header:X-Complaints-To:1': 0.27; 'function': 0.29; 'related': 0.29; "doesn't": 0.30; 'matching': 0.30; 'said,': 0.30; 'lines': 0.31; 'operators': 0.31; 'overhead': 0.31; 'routine': 0.31; 'writes:': 0.31; 'figure': 0.32; 'run': 0.32; 'cases': 0.33; 'but': 0.35; 'there': 0.35; 'method': 0.36; 'charset:us-ascii': 0.36; 'subject:?': 0.36; 'should': 0.36; 'searching': 0.37; 'two': 0.37; 'generic': 0.38; 'to:addr:python- list': 0.38; 'fact': 0.38; 'rather': 0.38; 'to:addr:python.org': 0.39; 'received:org': 0.40; 'how': 0.40; 'read': 0.60; 'length': 0.61; 'full': 0.61; 'first': 0.61; 'such': 0.63; 'different': 0.65; 'specialized': 0.65; 'smith': 0.68; 'surprise': 0.74; 'special': 0.74; 'million': 0.74; 'other.': 0.75; 'antoine.': 0.84; 'calls,': 0.84; 'faster.': 0.84; 'hardly': 0.84; 'lean': 0.84 X-Injected-Via-Gmane: http://gmane.org/ To: python-list@python.org From: Antoine Pitrou Subject: Re: Why is regex so slow? Date: Tue, 18 Jun 2013 20:05:25 +0000 (UTC) References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Gmane-NNTP-Posting-Host: sea.gmane.org User-Agent: Loom/3.14 (http://gmane.org/) X-Loom-IP: 213.41.240.54 (Mozilla/5.0 (X11; Linux x86_64; rv:21.0) Gecko/20100101 Firefox/21.0) X-BeenThere: python-list@python.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: General discussion list for the Python programming language List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Newsgroups: comp.lang.python Message-ID: Lines: 35 NNTP-Posting-Host: 2001:888:2000:d::a6 X-Trace: 1371585950 news.xs4all.nl 15892 [2001:888:2000:d::a6]:42097 X-Complaints-To: abuse@xs4all.nl Xref: csiph.com comp.lang.python:48665 Roy Smith panix.com> writes: > > Every line which contains 'ENQ' also matches the full regex (61425 > lines match, out of 2.1 million total). I don't understand why the > first way is so much slower. One invokes a fast special-purpose substring searching routine (the str.__contains__ operator), the other a generic matching engine able to process complex patterns. It's hardly a surprise for the specialized routine to be faster. That's like saying "I don't understand why my CPU is slower than my GPU at calculating 3D structures". That said, there may be ways to improve the regex engine to deal with such special cases in a speedier way. But there will still be some overhead related to the fact that you are invoking a powerful generic engine rather than a lean and mean specialized routine. (to be fair, on CPython there's also the fact that operators are faster than method calls, so some overhead is added by that too) > Once the regex is compiled, you should have a state machine pattern > matcher. It should be O(n) in the length of the input to figure out > that it doesn't match as far as "ENQ". And that's exactly how long it > should take for "if 'ENQ' not in line" to run as well. You should read again on the O(...) notation. It's an asymptotic complexity, it tells you nothing about the exact function values at different data points. So you can have two O(n) routines, one of which always twice faster than the other. Regards Antoine.