What's the best way to parse this HTML tag?

Started by	John Salerno <johnjsal@gmail.com>
First post	2012-03-11 15:53 -0700
Last post	2012-03-12 09:27 -0400
Articles	4 — 2 participants

  What's the best way to parse this HTML tag? John Salerno <johnjsal@gmail.com> - 2012-03-11 15:53 -0700
    Re: What's the best way to parse this HTML tag? Roy Smith <roy@panix.com> - 2012-03-11 20:28 -0400
      Re: What's the best way to parse this HTML tag? John Salerno <johnjsal@gmail.com> - 2012-03-11 19:35 -0700
        Re: What's the best way to parse this HTML tag? Roy Smith <roy@panix.com> - 2012-03-12 09:27 -0400

#21510 — What's the best way to parse this HTML tag?

From	John Salerno <johnjsal@gmail.com>
Date	2012-03-11 15:53 -0700
Subject	What's the best way to parse this HTML tag?
Message-ID	<239c4ad7-ac93-45c5-98d6-71a434e1c5aa@r21g2000yqa.googlegroups.com>

I'm using Beautiful Soup to extract some song information from a radio
station's website that lists the songs it plays as it plays them.
Getting the time that the song is played is easy, because the time is
wrapped in a <div> tag all by itself with a class attribute that has a
specific value I can search for. But the actual song title and artist
information is harder, because the HTML isn't quite as precise. Here's
a sample:

<div class="cmPlaylistContent">
 <strong>
  <a href="/lsp/t2995/">
   Love Without End, Amen
  </a>
 </strong>
 <br/>
 <a href="/lsp/a436/">
  George Strait
 </a>
 <br/>
 <span class="sprite iconDownload">
 </span>
 Download Song:
 <a href="http://itunes.apple.com/us/album/love-without-end-amen/id71416?i=71404&amp;uo=4">
  iTunes
 </a>
 |
 <a href="http://www.amazon.com/Love-Without-End-Amen/dp/B000V638BQ?SubscriptionId=1NXYFBZST44V8CCDK182&amp;tag=coxradiointer-20&amp;linkCode=xm2&amp;camp=2025&amp;creative=165953&amp;creativeASIN=B000V638BQ">
  Amazon MP3
 </a>
 <br/>
 <span class="sprite iconComments">
  Comments  (1)
 </span>
 <span class="sprite iconVoteUp">
  Votes  (1)
 </span>
</div>

This is about as far as I can drill down without getting TOO specific.
I simply find the <div> tags with the "cmPlaylistContent" class. This
tag contains both the song title and the artist name, and sometimes
miscellaneous other information as well, like a way to vote for the
song or links to purchase it from iTunes or Amazon.

So my question is, given the above HTML, how can I best extract the
song title and artist name? It SEEMS like they are always the first
two pieces of information in the tag, such that:

for item in div.stripped_strings: print(item)

Love Without End, Amen
George Strait
Download Song:
iTunes
|
Amazon MP3
Comments  (1)
Votes  (1)

and I could simply get the first two items returned by that generator.
It's not quite as clean as I'd like, because I have no idea if
anything could ever be inserted before either of these items, thus
messing it all up.
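
For what it's worth, the "first two items" idea might look roughly like the sketch below. The helper name first_two is made up for illustration, and the sample list merely stands in for div.stripped_strings so the snippet runs on its own; it only guards against there being fewer than two strings, not against something being inserted ahead of the title.

```python
from itertools import islice


def first_two(strings):
    """Return (title, artist) from the first two strings, or None if fewer than two exist."""
    head = list(islice(strings, 2))  # consume at most two items from the generator
    if len(head) < 2:
        return None
    return head[0], head[1]


# Stand-in for div.stripped_strings (hypothetical sample data taken from the output above).
sample = iter(["Love Without End, Amen", "George Strait", "Download Song:", "iTunes"])
print(first_two(sample))
```

With a real Beautiful Soup tag, the call would presumably be first_two(div.stripped_strings).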

I also don't want to rely on the <strong> tag, which makes me shudder,
or the <a> tag, because I don't know if they will always have an href.
Ideally, the <a> tags would also have had an attribute that labeled the
title as the title and the artist as the artist, but alas.

Therefore, I appeal to your greater wisdom in these matters. Given
this HTML, is there a "best practice" for how to refer to the song
title and artist?

Thanks!

#21514

From	Roy Smith <roy@panix.com>
Date	2012-03-11 20:28 -0400
Message-ID	<roy-0C82FF.20283311032012@news.panix.com>
In reply to	#21510

In article 
<239c4ad7-ac93-45c5-98d6-71a434e1c5aa@r21g2000yqa.googlegroups.com>,
 John Salerno <johnjsal@gmail.com> wrote:

> Getting the time that the song is played is easy, because the time is
> wrapped in a <div> tag all by itself with a class attribute that has a
> specific value I can search for. But the actual song title and artist
> information is harder, because the HTML isn't quite as precise. Here's
> a sample:
> 
> <div class="cmPlaylistContent">
>  <strong>
>   <a href="/lsp/t2995/">
>    Love Without End, Amen
>   </a>
>  </strong>
>  <br/>
>  <a href="/lsp/a436/">
>   George Strait
>  </a>
> [...]
> Therefore, I appeal to your greater wisdom in these matters. Given
> this HTML, is there a "best practice" for how to refer to the song
> title and artist?

Obviously, any attempt at screen scraping is fraught with peril.  
Beautiful Soup is a great tool but it doesn't negate the fact that 
you've made a pact with the devil.  That being said, if I had to guess, 
here's your puppy:

>   <a href="/lsp/t2995/">
>    Love Without End, Amen
>   </a>

the thing to look for is an "a" element with an href that starts with 
"/lsp/t", where "t" is for "track".  Likewise:

>  <a href="/lsp/a436/">
>   George Strait
>  </a>

an href starting with "/lsp/a" is probably an artist link.
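
A minimal sketch of that guess, assuming Beautiful Soup 4 with the stdlib "html.parser" backend. The helper name link_text is invented here, and the "/lsp/t" and "/lsp/a" prefixes are inferred from the one sample above, not from anything the site documents.

```python
import re

from bs4 import BeautifulSoup

# Trimmed copy of the sample markup from the original post.
HTML = """
<div class="cmPlaylistContent">
 <strong>
  <a href="/lsp/t2995/">
   Love Without End, Amen
  </a>
 </strong>
 <br/>
 <a href="/lsp/a436/">
  George Strait
 </a>
</div>
"""


def link_text(div, prefix):
    """Text of the first <a> in div whose href starts with prefix, else None."""
    link = div.find("a", href=re.compile("^" + re.escape(prefix)))
    return link.get_text(strip=True) if link else None


soup = BeautifulSoup(HTML, "html.parser")
div = soup.find("div", class_="cmPlaylistContent")
title = link_text(div, "/lsp/t")   # assumed: "t" marks a track link
artist = link_text(div, "/lsp/a")  # assumed: "a" marks an artist link
print(title, "-", artist)
```

If either link is missing, link_text returns None rather than raising, so the caller can decide what to do about it.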

You owe the Oracle three helpings of tag soup.

#21519

From	John Salerno <johnjsal@gmail.com>
Date	2012-03-11 19:35 -0700
Message-ID	<bb1a55fa-3dcf-4480-ae87-be30a1a65bf7@h9g2000yqe.googlegroups.com>
In reply to	#21514

On Mar 11, 7:28 pm, Roy Smith <r...@panix.com> wrote:
> In article
> <239c4ad7-ac93-45c5-98d6-71a434e1c...@r21g2000yqa.googlegroups.com>,
>  John Salerno <johnj...@gmail.com> wrote:
>
> > Getting the time that the song is played is easy, because the time is
> > wrapped in a <div> tag all by itself with a class attribute that has a
> > specific value I can search for. But the actual song title and artist
> > information is harder, because the HTML isn't quite as precise. Here's
> > a sample:
>
> > <div class="cmPlaylistContent">
> >  <strong>
> >   <a href="/lsp/t2995/">
> >    Love Without End, Amen
> >   </a>
> >  </strong>
> >  <br/>
> >  <a href="/lsp/a436/">
> >   George Strait
> >  </a>
> > [...]
> > Therefore, I appeal to your greater wisdom in these matters. Given
> > this HTML, is there a "best practice" for how to refer to the song
> > title and artist?
>
> Obviously, any attempt at screen scraping is fraught with peril.
> Beautiful Soup is a great tool but it doesn't negate the fact that
> you've made a pact with the devil.  That being said, if I had to guess,
> here's your puppy:
>
> >   <a href="/lsp/t2995/">
> >    Love Without End, Amen
> >   </a>
>
> the thing to look for is an "a" element with an href that starts with
> "/lsp/t", where "t" is for "track".  Likewise:
>
> >  <a href="/lsp/a436/">
> >   George Strait
> >  </a>
>
> an href starting with "/lsp/a" is probably an artist link.
>
> You owe the Oracle three helpings of tag soup.

Well, I had considered exactly that method, but I don't know for sure
if the titles and names will always have links like that, so I didn't
want to tie my programming to something so specific. But perhaps it's
still better than just taking the first two strings.
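
One way to hedge between the two approaches, sketched under the same assumptions as the rest of the thread (Beautiful Soup 4, href prefixes guessed from a single sample): try the links first and only fall back to the first two strings when they are missing. The function name extract_song and both test snippets are hypothetical.

```python
import re
from itertools import islice

from bs4 import BeautifulSoup


def extract_song(div):
    """Return (title, artist): prefer the /lsp/ links, fall back to string position."""
    track = div.find("a", href=re.compile(r"^/lsp/t"))
    artist = div.find("a", href=re.compile(r"^/lsp/a"))
    if track and artist:
        return track.get_text(strip=True), artist.get_text(strip=True)
    # Fallback: assume title and artist are the first two strings in the tag.
    head = list(islice(div.stripped_strings, 2))
    return tuple(head) if len(head) == 2 else None


# Hypothetical markup: one entry with the links, one without.
linked = BeautifulSoup(
    '<div><strong><a href="/lsp/t2995/">Love Without End, Amen</a></strong>'
    '<br/><a href="/lsp/a436/">George Strait</a></div>',
    "html.parser",
).div
plain = BeautifulSoup(
    "<div><strong>Love Without End, Amen</strong><br/>George Strait<br/>Votes (1)</div>",
    "html.parser",
).div

print(extract_song(linked))
print(extract_song(plain))
```

Neither branch is guaranteed to survive a site redesign; the fallback just keeps the script limping along when the links disappear.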

#21534

From	Roy Smith <roy@panix.com>
Date	2012-03-12 09:27 -0400
Message-ID	<roy-4FE287.09270212032012@news.panix.com>
In reply to	#21519

In article 
<bb1a55fa-3dcf-4480-ae87-be30a1a65bf7@h9g2000yqe.googlegroups.com>,
 John Salerno <johnjsal@gmail.com> wrote:

> Well, I had considered exactly that method, but I don't know for sure
> if the titles and names will always have links like that, so I didn't
> want to tie my programming to something so specific. But perhaps it's
> still better than just taking the first two strings.

Such is the nature of screen scraping.  For the most part, web pages are 
not meant to be parsed.  If you decide to go down the road of trying to 
extract data from them, all bets are off.  You look at the markup, take 
your best guess, and go for it.

There's no magic here.  Nobody can look at this HTML and come up with 
some hard and fast rule for how you're supposed to parse it.  And, even 
if they could, it's all likely to change tomorrow when the site rolls 
out their next UI makeover.
