Groups > comp.lang.python > #26780 > unrolled thread

Re: Beautiful Soup Table Parsing

Started by: Andreas Perstinger <andipersti@gmail.com>
First post: 2012-08-09 09:25 +0200
Last post: 2012-08-09 09:25 +0200
Articles: 1 (1 participant)

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.


#26780 — Re: Beautiful Soup Table Parsing

From: Andreas Perstinger <andipersti@gmail.com>
Date: 2012-08-09 09:25 +0200
Subject: Re: Beautiful Soup Table Parsing
Message-ID: <mailman.3095.1344497153.4697.python-list@python.org>

On 09.08.2012 01:58, Tom Russell wrote:
> For instance this code below:
>
> soup = BeautifulSoup(urlopen('http://online.wsj.com/mdc/public/page/2_3021-tradingdiary2.html?mod=mdc_pastcalendar'))
>
> table = soup.find("table",{"class": "mdcTable"})
> for row in table.findAll("tr"):
>      for cell in row.findAll("td"):
>          print cell.findAll(text=True)
>
> brings in a list that looks like this:

[snip]

> What I want to do is only be getting the data for NYSE and nothing
> else so I do not know if that's possible or not. Also I want to do
> something like:
>
> If cell.contents[0] == "Advances":
>      Advances = next cell or whatever??---> this part I am not sure how to do.
>
> Can someone help point me in the right direction to get the first data
> point for the Advances row? I have others I will get as well but
> figure once I understand how to do this I can do the rest.

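To get the header row you could do something like this (the check for
None skips tags that have no <td> child, for which tag.td is None):

header_row = table.find(lambda tag: tag.name == "tr" and tag.td is not None
                        and tag.td.string == "NYSE")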

From there you can look for the next row you are interested in:

advances_row = header_row.findNextSibling(
    lambda tag: tag.td is not None and tag.td.string == "Advances")

You could also iterate through all next siblings of the header_row:

for row in header_row.findNextSiblings("tr"):
    # do something with each row, e.g. collect the text of its cells
    cells = row.findAll(text=True)
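
If you want to check the "find the header row, then walk the rows after it" logic without fetching the page, here is a rough sketch that uses only the standard library's html.parser instead of Beautiful Soup (Python 3 syntax). Everything in it is an assumption made for illustration: the sample markup, the numbers in it, and the names RowCollector and value_after_header are all made up, and the real WSJ table may be laid out differently.

```python
from html.parser import HTMLParser

# Hypothetical sample markup with placeholder numbers; the real page layout
# is not shown in this thread.
SAMPLE_HTML = """
<table class="mdcTable">
  <tr><td>NYSE</td><td>Latest close</td></tr>
  <tr><td>Issues traded</td><td>3,100</td></tr>
  <tr><td>Advances</td><td>1,500</td></tr>
  <tr><td>Declines</td><td>1,400</td></tr>
  <tr><td>Nasdaq</td><td>Latest close</td></tr>
  <tr><td>Advances</td><td>1,200</td></tr>
</table>
"""


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being parsed, or None
        self._cell = None  # text fragments of the cell being parsed, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def value_after_header(rows, header, label):
    """Return the second cell of the first `label` row after the `header` row."""
    in_section = False
    for row in rows:
        if not row:
            continue
        if row[0] == header:
            in_section = True
        elif in_section and row[0] == label:
            return row[1] if len(row) > 1 else None
    return None


collector = RowCollector()
collector.feed(SAMPLE_HTML)
nyse_advances = value_after_header(collector.rows, "NYSE", "Advances")
print(nyse_advances)
```

The point is the same as with findNextSibling: the "Advances" label appears once per exchange, so you only accept it after you have seen the "NYSE" header row.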

Bye, Andreas
