Path: csiph.com!usenet.pasdenom.info!news.albasani.net!news.mixmin.net!hq-usenetpeers.eweka.nl!81.171.88.250.MISMATCH!newsfeed.eweka.nl!eweka.nl!feeder3.eweka.nl!newsfeed.xs4all.nl!newsfeed5.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
Received-SPF: pass (google.com: domain of johnjsal@gmail.com designates 10.50.153.234 as permitted sender) client-ip=10.50.153.234;
MIME-Version: 1.0
In-Reply-To: <CALwzid=H=kk++fGt6MJUHHek_4Qfa_jNo6H0mjmZ_aOf99_8zg@mail.gmail.com>
References: <12783654.1174.1331073814011.JavaMail.geo-discussion-forums@yner4> <mailman.442.1331074333.3037.python-list@python.org> <mailman.443.1331074966.3037.python-list@python.org> <28285433.1413.1331075139309.JavaMail.geo-discussion-forums@ynbq18> <CALwzid=H=kk++fGt6MJUHHek_4Qfa_jNo6H0mjmZ_aOf99_8zg@mail.gmail.com>
From: John Salerno <johnjsal@gmail.com>
Date: Tue, 6 Mar 2012 17:39:42 -0600
Subject: Re: What's the best way to write this regular expression?
To: Ian Kelly <ian.g.kelly@gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Cc: python-list@python.org
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.448.1331077211.3037.python-list@python.org>
Lines: 28
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:21290

Thanks. I'm thinking the choice might be between lxml and Beautiful
Soup, but since BS can use lxml as its parser, I'm trying to figure out
the difference between them. I don't necessarily need the simplest
option (html.parser), but I want to choose one that is simple enough
yet powerful enough that I won't have to learn another method later.




On Tue, Mar 6, 2012 at 5:35 PM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> On Tue, Mar 6, 2012 at 4:05 PM, John Salerno <johnjsal@gmail.com> wrote:
>>> Anything that allows me NOT to use REs is welcome news, so I look
>>> forward to learning about something new! :)
>>
>> I should ask though...are there alternatives already bundled with
>> Python that I could use? Now that you mention it, I remember
>> something called HTMLParser (or something like that) and I have no
>> idea why I never looked into that before I messed with REs.
>
> HTMLParser is pretty basic, although it may be sufficient for your
> needs. It just converts an html document into a stream of start tags,
> end tags, and text, with no guarantee that the tags will actually
> correspond in any meaningful way. lxml can be used to output an
> actual hierarchical structure that may be easier to manipulate and
> extract data from.
>
> Cheers,
> Ian
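
To make the "stream of start tags, end tags, and text" point concrete,
here is a minimal sketch using only the standard library's html.parser
module (the Python 3 name for HTMLParser). The EventLogger class and the
sample markup are made up for illustration; they are not from the thread.
The markup is deliberately mismatched to show that the parser reports
tags as it sees them without checking that they pair up.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: record each callback as a (kind, value) tuple."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text so the event list stays short.
        if data.strip():
            self.events.append(("text", data.strip()))


# Mismatched on purpose: <b> is never closed and </i> was never opened.
parser = EventLogger()
parser.feed("<p>Hello <b>world</i></p>")
parser.close()

for event in parser.events:
    print(event)
```

The parser simply hands over each tag in document order, so pairing
start and end tags (and deciding what to do with strays like the </i>
above) is left to the calling code. A tree-building library such as
lxml does that bookkeeping itself and returns a nested structure.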