Date: Thu, 09 Aug 2012 09:25:49 +0200
From: Andreas Perstinger <andipersti@gmail.com>
To: python-list@python.org
Subject: Re: Beautiful Soup Table Parsing
Newsgroups: comp.lang.python
Message-ID: <mailman.3095.1344497153.4697.python-list@python.org>

On 09.08.2012 01:58, Tom Russell wrote:
> For instance this code below:
>
> soup = BeautifulSoup(urlopen('http://online.wsj.com/mdc/public/page/2_3021-tradingdiary2.html?mod=mdc_pastcalendar'))
>
> table = soup.find("table",{"class": "mdcTable"})
> for row in table.findAll("tr"):
>      for cell in row.findAll("td"):
>          print cell.findAll(text=True)
>
> brings in a list that looks like this:

[snip]

> What I want to do is only be getting the data for NYSE and nothing
> else so I do not know if that's possible or not. Also I want to do
> something like:
>
> If cell.contents[0] == "Advances":
>      Advances = next cell or whatever??---> this part I am not sure how to do.
>
> Can someone help point me in the right direction to get the first data
> point for the Advances row? I have others I will get as well but
> figure once I understand how to do this I can do the rest.

To get the header row you could do something like:

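# tag.td is None for tags that contain no <td>, so check it first
header_row = table.find(
    lambda tag: tag.name == "tr" and tag.td is not None
    and tag.td.string == "NYSE")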
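
As a rough, self-contained sketch of that lookup: the markup below is made up for illustration (the real page is likely structured differently), and it assumes the bs4 package (Beautiful Soup 4) on Python 3 rather than the older BeautifulSoup 3 API used in your snippet.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
HTML = """
<table class="mdcTable">
<tr><th>Markets Diary</th></tr>
<tr><td>NYSE</td><td>Latest close</td></tr>
<tr><td>Advances</td><td>1,500</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("table", {"class": "mdcTable"})


def is_nyse_header(tag):
    # The filter is called for every tag inside the table, so rule out
    # anything that is not a <tr> or has no <td> before reading tag.td.string.
    return (tag.name == "tr" and tag.td is not None
            and tag.td.string == "NYSE")


header_row = table.find(is_nyse_header)
print([td.get_text() for td in header_row.find_all("td")])
```

The first row here has only a <th>, so without the "tag.td is not None" check the filter would fail with an AttributeError before it ever reached the NYSE row.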

From there you can look for the next row you are interested in:

# skip sibling rows without a <td>, where tag.td is None
advances_row = header_row.findNextSibling(
    lambda tag: tag.td is not None and tag.td.string == "Advances")

You could also iterate through all next siblings of the header_row:

for row in header_row.findNextSiblings("tr"):
    pass  # do something with each row
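
Putting both approaches together, here is a minimal sketch under some assumptions: the table markup and the numbers are invented, "Nasdaq" is only my guess at the label of the next block, and the calls are the bs4 (Beautiful Soup 4) spellings find_next_sibling / find_next_siblings of the methods above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup and figures; the real page may differ.
HTML = """
<table class="mdcTable">
<tr><td>NYSE</td><td>Latest close</td><td>Previous close</td></tr>
<tr><td>Issues traded</td><td>3,150</td><td>3,140</td></tr>
<tr><td>Advances</td><td>1,500</td><td>1,600</td></tr>
<tr><td>Declines</td><td>1,550</td><td>1,440</td></tr>
<tr><td>Nasdaq</td><td>Latest close</td><td>Previous close</td></tr>
<tr><td>Advances</td><td>1,200</td><td>1,100</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("table", {"class": "mdcTable"})


def first_cell_is(label):
    """Return a filter matching <tr> tags whose first <td> reads `label`."""
    def matches(tag):
        return (tag.name == "tr" and tag.td is not None
                and tag.td.get_text(strip=True) == label)
    return matches


header_row = table.find(first_cell_is("NYSE"))

# Direct lookup: the first following row labelled "Advances".
advances_row = header_row.find_next_sibling(first_cell_is("Advances"))
advances = advances_row.find_all("td")[1].get_text(strip=True)

# Iteration: collect every NYSE row, stopping at the next section header.
OTHER_SECTIONS = {"Nasdaq"}  # assumed label of the following block
nyse = {}
for row in header_row.find_next_siblings("tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if not cells:
        continue  # row without data cells
    if cells[0] in OTHER_SECTIONS:
        break  # reached the next exchange's block
    nyse[cells[0]] = cells[1:]

print(advances)
print(nyse["Advances"][0])
```

The direct lookup is shorter, but it would happily return the Nasdaq "Advances" row if the NYSE block had none. The loop avoids that by stopping at the next section header, and it gives you the other rows (Declines and so on) in the same pass.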

Bye, Andreas