Path: csiph.com!fu-berlin.de!uni-berlin.de!not-for-mail
From: Chris Angelico <rosuav@gmail.com>
Newsgroups: comp.lang.python
Subject: Re: Screen scraper to get all 'a title' elements
Date: Thu, 26 Nov 2015 10:54:04 +1100
Lines: 36
Message-ID: <mailman.105.1448495654.20593.python-list@python.org>
References: <23ed6f4b-0ef2-4c9e-ade6-e597e7e03ca2@googlegroups.com> <mailman.96.1448484959.20593.python-list@python.org> <n35ckk$9q0$1@dont-email.me> <c1e43997-0da3-4b93-b9af-98a2568eff9d@googlegroups.com> <mailman.103.1448492791.20593.python-list@python.org> <n35h0v$stn$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
In-Reply-To: <n35h0v$stn$1@dont-email.me>
Precedence: list
Xref: csiph.com comp.lang.python:99510

On Thu, Nov 26, 2015 at 10:44 AM, Grobu <snailcoder@retrosite.invalid> wrote:
> On 26/11/15 00:06, Chris Angelico wrote:
>>
>> On Thu, Nov 26, 2015 at 9:48 AM, ryguy7272 <ryanshuell@gmail.com> wrote:
>>>
>>> Thanks!!  Is that regex?  Can you explain exactly what it is doing?
>>> Also, it seems to pick up a lot more than just the list I wanted, but
>>> that's ok, I can see why it does that.
>>>
>>> Can you just please explain what it's doing???
>>
>>
>> It's a trap!
>>
>> Don't use a regex to parse HTML, unless you're deliberately trying to
>> entice young and innocent programmers to the dark side.
>>
>> ChrisA
>>
>
> Sorry, I wasn't aware of regex being on the dark side :-)
> Now that you mention it, I suppose that their being complex and
> error-inducing could lead to broken code all too easily when there is a
> reliable, ready-made solution like BeautifulSoup.

Regular expressions have their uses, but parsing HTML is not one of
them. The most important use of a regex is letting an end user control
the search pattern; it's a compact language for describing a variety
of text search concepts. For hard-coded regular expressions, there are
some places where they're very good, and a lot of places where they're
the wrong tool for the job. One of those wrong-tool-for-job places is
parsing formats that fundamentally cannot be parsed with regexes, such
as HTML: tags nest to arbitrary depth, and attributes can come in any
order, case, or quoting style. You _need_ a proper parser, which is
what Beautiful Soup is for.
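
To make that concrete, here's a minimal sketch of pulling the title
attribute out of every <a> tag with a real parser. I've swapped in the
standard library's html.parser so it runs without installing anything;
it is not Beautiful Soup. The names TitleCollector, a_titles and sample
are made up for illustration, and the sample markup is invented, not
the page from this thread.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the title attribute of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names and strips the
        # quotes from values; attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "title" and value is not None:
                    self.titles.append(value)


def a_titles(html):
    """Return the title attributes of all <a> tags in an HTML string."""
    parser = TitleCollector()
    parser.feed(html)
    parser.close()
    return parser.titles


# Invented sample: mixed case, mixed quoting, an attribute list split
# across lines, a link with no title, and a commented-out link.
sample = """
<ul>
  <li><A HREF="/one" title='First "quoted" item'>one</A></li>
  <li><a title="Second item"
         href="/two">two</a></li>
  <li><a href="/three">no title here</a></li>
  <!-- <a title="commented out">ignored</a> -->
</ul>
"""

print(a_titles(sample))
```

The parser copes with the case, quoting and line-break variations and
skips the commented-out link, which is exactly where a hand-rolled
regex tends to fall over. With Beautiful Soup installed, the equivalent
should be roughly soup.find_all("a", title=True) and then reading
tag["title"] from each result.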

ChrisA