From: Chris Rebert
To: John Salerno
Cc: python-list@python.org
Newsgroups: comp.lang.python
Subject: Re: What's the best way to write this regular expression?
Date: Tue, 6 Mar 2012 14:52:10 -0800

On Tue, Mar 6, 2012 at 2:43 PM, John Salerno wrote:
> I sort of have to work with what the website gives me (as you'll see below), but today I encountered an exception to my RE. Let me just give all the specific information first. The point of my script is to go to the specified URL and extract song information from it.
>
> This is my RE:
>
> song_pattern = re.compile(r'([0-9]{1,2}:[0-9]{2} [a|p].m.).*?(.*?).*?(.*?)', re.DOTALL)

I would advise against using regular expressions to "parse" HTML:
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

lxml is a popular choice for parsing HTML in Python:
http://lxml.de

Cheers,
Chris
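
To make the "use a parser instead of a regex" advice concrete, here is a minimal sketch. It is not lxml: it uses the standard library's html.parser so that it runs without any extra install, and the same idea carries over to lxml. The markup in SAMPLE_HTML, the three-cell row layout (time, artist, title), and the names RowCollector and extract_songs are all assumptions made up for illustration, since the actual page structure is not shown in the thread.

```python
from html.parser import HTMLParser

# Hypothetical markup: the real page's structure is not shown in the thread.
SAMPLE_HTML = """
<table>
  <tr><td>1:05 p.m.</td><td>Some Artist</td><td>Some Song</td></tr>
  <tr><td>11:42 a.m.</td><td>Another Artist</td><td>Another Song</td></tr>
</table>
"""


class RowCollector(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the open cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <td>.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_songs(html):
    """Return (time, artist, title) tuples for rows with exactly three cells."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    return [tuple(row) for row in parser.rows if len(row) == 3]


songs = extract_songs(SAMPLE_HTML)
for time_played, artist, title in songs:
    print(time_played, artist, title)
```

Because the parser tracks tags rather than character patterns, extra attributes, whitespace, or line breaks inside the rows should not require changing the extraction logic, which is the kind of exception a hand-written RE tends to trip over.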