Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder4.news.weretis.net!news.mixmin.net!eweka.nl!hq-usenetpeers.eweka.nl!xlned.com!feeder1.xlned.com!newsfeed.xs4all.nl!newsfeed6.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
To: python-list@python.org
From: Dieter Maurer <dieter@handshake.de>
Subject: Re: Beautiful Soup Table Parsing
Date: Thu, 09 Aug 2012 07:43:51 +0200
References: <CAKfBN+eDYVx+Q1m8ee8PQqABtMDpZV07+KcopnNH9T-ETnEpUg@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
User-Agent: Gnus/5.1008 (Gnus v5.10.8) XEmacs/21.4.22 (linux)
Cancel-Lock: sha1:clpyYlCHOGav5fzPYWqL0nP3+Ac=
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.3093.1344491046.4697.python-list@python.org>
Lines: 43
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:26778

Tom Russell <tsrdatatech@gmail.com> writes:

> I am parsing out a web page at
> http://online.wsj.com/mdc/public/page/2_3021-tradingdiary2.html?mod=mdc_pastcalendar
> using BeautifulSoup.
>
> My problem is that I can parse into the table where the data I want
> resides but I cannot seem to figure out how to go about grabbing the
> contents of the cell next to my row header I want.
>
> For instance this code below:
>
> soup = BeautifulSoup(urlopen('http://online.wsj.com/mdc/public/page/2_3021-tradingdiary2.html?mod=mdc_pastcalendar'))
>
> table = soup.find("table",{"class": "mdcTable"})
> for row in table.findAll("tr"):
>     for cell in row.findAll("td"):
>         print cell.findAll(text=True)
>
> brings in a list that looks like this:
>
> [u'NYSE']
> [u'Latest close']
> [u'Previous close']
> ...
>
> What I want to do is only be getting the data for NYSE and nothing
> else so I do not know if that's possible or not.

I am quite confident that it is possible (though I do not know
the details).

First thing to note: you can use the "break" statement in order
to leave a loop "before time". As you have a nested loop,
you might need a "break" on both levels, the outer loop's "break"
probably controlled by a variable which indicates "success".

Second thing to note: the "BeautifulSoup" documentation might
tell you something about the return values of its methods.
I assume "BeautifulSoup" builds upon "lxml" and the return values
are "lxml" related. Then the "lxml" documentation would tell you
how to inspect further details about the html structure.