From: Cameron Simpson
Newsgroups: comp.lang.python
Subject: Re: psss...I want to move from Perl to Python
Date: Mon, 1 Feb 2016 08:16:46 +1100
References: <877fipr64m.fsf@jester.gateway.pace.com>

On 31Jan2016 09:49, Paul Rubin wrote:
>Cameron Simpson writes:
>> Adzapper. It has many many regexps matching URLs. (Actually a more
>> globlike syntax, but it gets turned into a regexp.) You plug it into
>> your squid proxy.
>
>Oh cool, is that out there in circulation?

Yes:

  http://adzapper.sourceforge.net/

which includes the installation instructions (install script, add a line to
squid.conf).

However, my publication workflow is broken. (And SourceForge isn't what it
used to be.) I need to get the update process improved. I'm happy to send the
latest copy to people by private email.

>It sounds like the approach of merging all the regexes into one and
>compiling to a FSM could be a big win. I wouldn't expect too big a
>state space explosion.

Perhaps so. The existing script (a) merges the regexps for successive patterns
of the same class and (b) uses Perl's "study" function, which examines a
string that will have several regexps applied to it. I gather it notes things
like character positions, which are then used in the matching process. Since
the zapper applies all the rules to most URLs, this is a performance win.

Cheers,
Cameron Simpson
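
For anyone following along from the Python side, here is a minimal sketch of point (a): turning glob-like patterns into regexps and merging successive patterns of the same class into one alternation. It is not the adzapper code (which is Perl). The rule list, the class names, the glob rules ('**' matches anything, '*' stops at '/') and the helper names are all assumptions made up for illustration. Python's re module is a backtracking engine, so this does not give a true FSM, and Python has no counterpart to Perl's "study"; the merge just cuts the number of separate match calls per URL.

```python
import re
from itertools import groupby

# Hypothetical (class, glob) rules in file order; not adzapper's real rules.
RULES = [
    ("AD", "http://ads.example.com/**"),
    ("AD", "http://*.example.net/banners/**"),
    ("COUNTER", "http://example.org/counter/*.gif"),
]


def glob_to_regex(glob):
    """Translate a glob: '**' matches anything, '*' matches within a path part."""
    parts = []
    i = 0
    while i < len(glob):
        if glob.startswith("**", i):
            parts.append(".*")
            i += 2
        elif glob[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            parts.append(re.escape(glob[i]))
            i += 1
    return "".join(parts)


def compile_rules(rules):
    """Merge each run of successive same-class patterns into one regexp."""
    compiled = []
    for cls, run in groupby(rules, key=lambda rule: rule[0]):
        alternation = "|".join(glob_to_regex(glob) for _, glob in run)
        compiled.append((cls, re.compile("(?:" + alternation + ")")))
    return compiled


def classify(url, compiled):
    """Return the class of the first merged regexp matching the whole URL."""
    for cls, regex in compiled:
        if regex.fullmatch(url):
            return cls
    return None


COMPILED = compile_rules(RULES)
print(classify("http://ads.example.com/x/y.gif", COMPILED))
print(classify("http://example.org/index.html", COMPILED))
```

Grouping only successive runs, rather than every pattern of a class, keeps the original rule order intact, so an earlier rule still wins over a later one of a different class.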