


Re: What's the best way to parse this HTML tag?

From Roy Smith <roy@panix.com>
Newsgroups comp.lang.python
Subject Re: What's the best way to parse this HTML tag?
Date 2012-03-12 09:27 -0400
Organization PANIX Public Access Internet and UNIX, NYC
Message-ID <roy-4FE287.09270212032012@news.panix.com>
References <239c4ad7-ac93-45c5-98d6-71a434e1c5aa@r21g2000yqa.googlegroups.com> <roy-0C82FF.20283311032012@news.panix.com> <bb1a55fa-3dcf-4480-ae87-be30a1a65bf7@h9g2000yqe.googlegroups.com>



In article 
<bb1a55fa-3dcf-4480-ae87-be30a1a65bf7@h9g2000yqe.googlegroups.com>,
 John Salerno <johnjsal@gmail.com> wrote:

> Well, I had considered exactly that method, but I don't know for sure
> if the titles and names will always have links like that, so I didn't
> want to tie my programming to something so specific. But perhaps it's
> still better than just taking the first two strings.

Such is the nature of screen scraping.  For the most part, web pages are 
not meant to be parsed.  If you decide to go down the road of trying to 
extract data from them, all bets are off.  You look at the markup, take 
your best guess, and go for it.

There's no magic here.  Nobody can look at this HTML and come up with 
some hard-and-fast rule for how you're supposed to parse it.  And, even 
if they could, it's all likely to change tomorrow when the site rolls 
out their next UI makeover.
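
One way to picture "take your best guess" is the minimal Python sketch below. 
It is not taken from the page under discussion: the HTML fragments, the 
class name CellText, and the function extract_title_and_name are all 
hypothetical. It assumes the title and name are the first two links in a 
fragment, falls back to the first two text strings when they are not, and 
fails loudly when neither guess fits.

```python
from html.parser import HTMLParser


class CellText(HTMLParser):
    """Collect non-empty text chunks, noting which ones sit inside an <a> tag."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0        # > 0 while we are inside an <a> element
        self.link_strings = []     # text found inside links
        self.all_strings = []      # every non-empty text chunk, in order

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        self.all_strings.append(text)
        if self.link_depth:
            self.link_strings.append(text)


def extract_title_and_name(fragment):
    """Return (title, name) from a hypothetical HTML fragment.

    Assumption: the title and name are the first two links. If the
    links are missing, fall back to the first two text strings.
    """
    parser = CellText()
    parser.feed(fragment)
    parser.close()
    if len(parser.link_strings) >= 2:
        return tuple(parser.link_strings[:2])
    if len(parser.all_strings) >= 2:
        return tuple(parser.all_strings[:2])
    # Neither guess fits: the markup has probably changed.
    raise ValueError("unexpected markup: could not find a title and a name")


# Hypothetical fragments, invented for illustration only.
linked = '<td><a href="/t/1">Some Title</a> by <a href="/n/2">Some Name</a></td>'
unlinked = '<td><b>Some Title</b><br>Some Name</td>'

print(extract_title_and_name(linked))
print(extract_title_and_name(unlinked))
```

In the linked fragment, the first two plain strings would be the title 
and the word "by", which is why the sketch tries the more specific guess 
first. The ValueError is the useful part: when the site changes its 
markup, the scraper stops instead of quietly returning wrong data.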



