From: Ian Kelly
Date: Tue, 6 Mar 2012 16:35:32 -0700
Subject: Re: What's the best way to write this regular expression?
To: John Salerno
Cc: python-list@python.org
Newsgroups: comp.lang.python

On Tue, Mar 6, 2012 at 4:05 PM, John Salerno wrote:
>> Anything that allows me NOT to use REs is welcome news, so I look forward to learning about something new! :)
>
> I should ask though...are there alternatives already bundled with Python that I could use? Now that you mention it, I remember something called HTMLParser (or something like that) and I have no idea why I never looked into that before I messed with REs.

HTMLParser is pretty basic, although it may be sufficient for your needs.
It just converts an HTML document into a stream of start tags, end tags, and text, with no guarantee that the tags will actually correspond in any meaningful way. lxml can be used to output an actual hierarchical structure that may be easier to manipulate and extract data from.

Cheers,
Ian
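
To illustrate the point about HTMLParser's flat event stream, here is a minimal sketch. It assumes Python 3, where the class lives in the html.parser module (in the Python 2 of this thread it was the HTMLParser module). The EventCollector class and the collect_events helper are made-up names for illustration only, not part of any library.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper: record parser callbacks as a flat list of events."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text so the output stays readable.
        if data.strip():
            self.events.append(("text", data.strip()))


def collect_events(markup):
    """Feed markup to the parser and return the recorded events."""
    parser = EventCollector()
    parser.feed(markup)
    parser.close()
    return parser.events


# Deliberately mismatched markup: <b> is never closed, </i> was never opened.
events = collect_events("<p>Hello <b>world</p></i>")
for event in events:
    print(event)
```

The parser simply reports each tag as it appears, including the stray </i>, and says nothing about the unclosed <b>. Pairing tags up into a hierarchy is left to the caller, which is the kind of work a tree-building library such as lxml takes on instead.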