Groups > comp.lang.python > #21510 > unrolled thread
| Started by | John Salerno <johnjsal@gmail.com> |
|---|---|
| First post | 2012-03-11 15:53 -0700 |
| Last post | 2012-03-12 09:27 -0400 |
| Articles | 4 — 2 participants |
What's the best way to parse this HTML tag? John Salerno <johnjsal@gmail.com> - 2012-03-11 15:53 -0700
Re: What's the best way to parse this HTML tag? Roy Smith <roy@panix.com> - 2012-03-11 20:28 -0400
Re: What's the best way to parse this HTML tag? John Salerno <johnjsal@gmail.com> - 2012-03-11 19:35 -0700
Re: What's the best way to parse this HTML tag? Roy Smith <roy@panix.com> - 2012-03-12 09:27 -0400
| From | John Salerno <johnjsal@gmail.com> |
|---|---|
| Date | 2012-03-11 15:53 -0700 |
| Subject | What's the best way to parse this HTML tag? |
| Message-ID | <239c4ad7-ac93-45c5-98d6-71a434e1c5aa@r21g2000yqa.googlegroups.com> |
I'm using Beautiful Soup to extract some song information from a radio station's website that lists the songs it plays as it plays them. Getting the time that the song is played is easy, because the time is wrapped in a <div> tag all by itself with a class attribute that has a specific value I can search for. But the actual song title and artist information is harder, because the HTML isn't quite as precise. Here's a sample:

    <div class="cmPlaylistContent">
     <strong>
      <a href="/lsp/t2995/">
       Love Without End, Amen
      </a>
     </strong>
     <br/>
     <a href="/lsp/a436/">
      George Strait
     </a>
     <br/>
     <span class="sprite iconDownload">
     </span>
     Download Song:
     <a href="http://itunes.apple.com/us/album/love-without-end-amen/
    id71416?i=71404&uo=4">
      iTunes
     </a>
     |
     <a href="http://www.amazon.com/Love-Without-End-Amen/dp/B000V638BQ?
    SubscriptionId=1NXYFBZST44V8CCDK182&tag=coxradiointer-20&linkCode=xm2&camp=2025&creative=165953&creativeASIN=B000V638BQ">
      Amazon MP3
     </a>
     <br/>
     <span class="sprite iconComments">
      Comments (1)
     </span>
     <span class="sprite iconVoteUp">
      Votes (1)
     </span>
    </div>

This is about as far as I can drill down without getting TOO specific. I simply find the <div> tags with the "cmPlaylistContent" class. This tag contains both the song title and the artist name, and sometimes miscellaneous other information as well, like a way to vote for the song or links to purchase it from iTunes or Amazon.

So my question is, given the above HTML, how can I best extract the song title and artist name? It SEEMS like they are always the first two pieces of information in the tag, such that:

    for item in div.stripped_strings:
        print(item)

    Love Without End, Amen
    George Strait
    Download Song:
    iTunes
    |
    Amazon MP3
    Comments (1)
    Votes (1)

and I could simply get the first two items returned by that generator. It's not quite as clean as I'd like, because I have no idea if anything could ever be inserted before either of these items, thus messing it all up.
I also don't want to rely on the <strong> tag, which makes me shudder, or the <a> tag, because I don't know if they will always have an href. Ideally, the <a> tag would also have had an attribute that labeled the title as the title and the artist as the artist, but alas...

Therefore, I appeal to your greater wisdom in these matters. Given this HTML, is there a "best practice" for how to refer to the song title and artist?

Thanks!
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2012-03-11 20:28 -0400 |
| Message-ID | <roy-0C82FF.20283311032012@news.panix.com> |
| In reply to | #21510 |
In article <239c4ad7-ac93-45c5-98d6-71a434e1c5aa@r21g2000yqa.googlegroups.com>, John Salerno <johnjsal@gmail.com> wrote:

> Getting the time that the song is played is easy, because the time is
> wrapped in a <div> tag all by itself with a class attribute that has a
> specific value I can search for. But the actual song title and artist
> information is harder, because the HTML isn't quite as precise. Here's
> a sample:
>
> <div class="cmPlaylistContent">
> <strong>
> <a href="/lsp/t2995/">
> Love Without End, Amen
> </a>
> </strong>
> <br/>
> <a href="/lsp/a436/">
> George Strait
> </a>
> [...]
> Therefore, I appeal to your greater wisdom in these matters. Given
> this HTML, is there a "best practice" for how to refer to the song
> title and artist?

Obviously, any attempt at screen scraping is fraught with peril. Beautiful Soup is a great tool but it doesn't negate the fact that you've made a pact with the devil. That being said, if I had to guess, here's your puppy:

> <a href="/lsp/t2995/">
> Love Without End, Amen
> </a>

the thing to look for is an "a" element with an href that starts with "/lsp/t", where "t" is for "track". Likewise:

> <a href="/lsp/a436/">
> George Strait
> </a>

an href starting with "/lsp/a" is probably an artist link.

You owe the Oracle three helpings of tag soup.
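A minimal sketch of the href-prefix idea above, for anyone who wants to try it. It uses the standard library's html.parser rather than Beautiful Soup, purely so it runs without installing anything; the class name SongLinkParser, the PREFIXES table, and the trimmed SAMPLE markup are made up for illustration, and the "/lsp/t" and "/lsp/a" prefixes are the guesses from the post, not anything the site guarantees.

```python
from html.parser import HTMLParser

# Trimmed copy of the sample markup from the original post.
SAMPLE = """
<div class="cmPlaylistContent">
  <strong>
    <a href="/lsp/t2995/">
      Love Without End, Amen
    </a>
  </strong>
  <br/>
  <a href="/lsp/a436/">
    George Strait
  </a>
  <br/>
  Download Song:
</div>
"""


class SongLinkParser(HTMLParser):
    """Collect the text of <a> tags whose href starts with a known prefix."""

    # Assumed mapping: "t" for track, "a" for artist (a guess from the markup).
    PREFIXES = {"/lsp/t": "title", "/lsp/a": "artist"}

    def __init__(self):
        super().__init__()
        self.found = {}      # field name -> link text
        self._field = None   # field currently being collected, if any
        self._chunks = []    # text pieces seen inside the current link

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        for prefix, field in self.PREFIXES.items():
            # Only the first matching link per field is kept.
            if href.startswith(prefix) and field not in self.found:
                self._field = field
                self._chunks = []

    def handle_data(self, data):
        if self._field is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._field is not None:
            # Collapse the newlines and indentation around the link text.
            self.found[self._field] = " ".join("".join(self._chunks).split())
            self._field = None


parser = SongLinkParser()
parser.feed(SAMPLE)
parser.close()
print(parser.found)
# {'title': 'Love Without End, Amen', 'artist': 'George Strait'}
```

With Beautiful Soup itself, the equivalent would presumably be a search for "a" tags filtered on the href attribute; the parsing logic is the same either way.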
| From | John Salerno <johnjsal@gmail.com> |
|---|---|
| Date | 2012-03-11 19:35 -0700 |
| Message-ID | <bb1a55fa-3dcf-4480-ae87-be30a1a65bf7@h9g2000yqe.googlegroups.com> |
| In reply to | #21514 |
On Mar 11, 7:28 pm, Roy Smith <r...@panix.com> wrote:

> In article
> <239c4ad7-ac93-45c5-98d6-71a434e1c...@r21g2000yqa.googlegroups.com>,
> John Salerno <johnj...@gmail.com> wrote:
>
> > Getting the time that the song is played is easy, because the time is
> > wrapped in a <div> tag all by itself with a class attribute that has a
> > specific value I can search for. But the actual song title and artist
> > information is harder, because the HTML isn't quite as precise. Here's
> > a sample:
> >
> > <div class="cmPlaylistContent">
> > <strong>
> > <a href="/lsp/t2995/">
> > Love Without End, Amen
> > </a>
> > </strong>
> > <br/>
> > <a href="/lsp/a436/">
> > George Strait
> > </a>
> > [...]
> > Therefore, I appeal to your greater wisdom in these matters. Given
> > this HTML, is there a "best practice" for how to refer to the song
> > title and artist?
>
> Obviously, any attempt at screen scraping is fraught with peril.
> Beautiful Soup is a great tool but it doesn't negate the fact that
> you've made a pact with the devil. That being said, if I had to guess,
> here's your puppy:
>
> > <a href="/lsp/t2995/">
> > Love Without End, Amen
> > </a>
>
> the thing to look for is an "a" element with an href that starts with
> "/lsp/t", where "t" is for "track". Likewise:
>
> > <a href="/lsp/a436/">
> > George Strait
> > </a>
>
> an href starting with "/lsp/a" is probably an artist link.
>
> You owe the Oracle three helpings of tag soup.

Well, I had considered exactly that method, but I don't know for sure if the titles and names will always have links like that, so I didn't want to tie my programming to something so specific. But perhaps it's still better than just taking the first two strings.
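One way to hedge between the two approaches discussed here is to prefer the href match and fall back to the first two strings only when the links are missing. The sketch below is an illustration under assumptions, not a recommendation from either poster: it again uses the standard library's html.parser instead of Beautiful Soup so it runs as-is, and PlaylistEntryParser, extract_song, and the UNLINKED markup are hypothetical names and data invented for the example.

```python
from html.parser import HTMLParser


class PlaylistEntryParser(HTMLParser):
    """Record every non-blank string plus any strings inside /lsp/ links."""

    def __init__(self):
        super().__init__()
        self.strings = []   # non-blank text chunks, in document order
        self.by_href = {}   # "title"/"artist" -> text, when an href matches
        self._href = None   # href of the <a> we are inside, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

    def handle_data(self, data):
        text = " ".join(data.split())  # strip and collapse whitespace
        if not text:
            return
        self.strings.append(text)
        if self._href is not None:
            # Prefixes are the guesses from the thread, not a documented scheme.
            if self._href.startswith("/lsp/t"):
                self.by_href.setdefault("title", text)
            elif self._href.startswith("/lsp/a"):
                self.by_href.setdefault("artist", text)


def extract_song(html):
    """Return (title, artist, method), or None if nothing usable is found."""
    parser = PlaylistEntryParser()
    parser.feed(html)
    parser.close()
    if "title" in parser.by_href and "artist" in parser.by_href:
        return parser.by_href["title"], parser.by_href["artist"], "href"
    if len(parser.strings) >= 2:
        # Fallback: assume the first two strings are title and artist.
        return parser.strings[0], parser.strings[1], "position"
    return None


LINKED = (
    '<div class="cmPlaylistContent"><strong><a href="/lsp/t2995/">'
    "Love Without End, Amen</a></strong><br/>"
    '<a href="/lsp/a436/">George Strait</a><br/>Download Song:</div>'
)
# Hypothetical variant where the site dropped the links.
UNLINKED = (
    '<div class="cmPlaylistContent"><strong>Love Without End, Amen</strong>'
    "<br/>George Strait<br/>Download Song:</div>"
)

print(extract_song(LINKED))
print(extract_song(UNLINKED))
```

Returning the method alongside the data makes it possible to log how often the positional fallback fires, which is a cheap signal that the markup has changed.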
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2012-03-12 09:27 -0400 |
| Message-ID | <roy-4FE287.09270212032012@news.panix.com> |
| In reply to | #21519 |
In article <bb1a55fa-3dcf-4480-ae87-be30a1a65bf7@h9g2000yqe.googlegroups.com>, John Salerno <johnjsal@gmail.com> wrote:

> Well, I had considered exactly that method, but I don't know for sure
> if the titles and names will always have links like that, so I didn't
> want to tie my programming to something so specific. But perhaps it's
> still better than just taking the first two strings.

Such is the nature of screen scraping. For the most part, web pages are not meant to be parsed. If you decide to go down the road of trying to extract data from them, all bets are off. You look at the markup, take your best guess, and go for it.

There's no magic here. Nobody can look at this HTML and come up with some hard and fast rule for how you're supposed to parse it. And, even if they could, it's all likely to change tomorrow when the site rolls out their next UI makeover.