Re: Beautiful Soup Table Parsing

From	Dieter Maurer <dieter@handshake.de>
Subject	Re: Beautiful Soup Table Parsing
Date	Thu, 09 Aug 2012 07:43:51 +0200
Newsgroups	comp.lang.python
Message-ID	<mailman.3093.1344491046.4697.python-list@python.org>

Tom Russell <tsrdatatech@gmail.com> writes:

> I am parsing out a web page at
> http://online.wsj.com/mdc/public/page/2_3021-tradingdiary2.html?mod=mdc_pastcalendar
> using BeautifulSoup.
>
> My problem is that I can parse into the table where the data I want
> resides but I cannot seem to figure out how to go about grabbing the
> contents of the cell next to my row header I want.
>
> For instance this code below:
>
> soup = BeautifulSoup(urlopen('http://online.wsj.com/mdc/public/page/2_3021-tradingdiary2.html?mod=mdc_pastcalendar'))
>
> table = soup.find("table",{"class": "mdcTable"})
> for row in table.findAll("tr"):
>     for cell in row.findAll("td"):
>         print cell.findAll(text=True)
>
> brings in a list that looks like this:
>
> [u'NYSE']
> [u'Latest close']
> [u'Previous close']
> ...
>
> What I want to do is only be getting the data for NYSE and nothing
> else so I do not know if that's possible or not.

I am quite confident that it is possible (though I do not know
the details).

First thing to note: you can use the "break" statement to leave
a loop early. Since your loops are nested, you might need a "break"
on both levels, with the outer loop's "break" controlled by a
variable that indicates "success".
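
A minimal sketch of that two-level "break", using plain lists instead of
a parsed page. The row data and the function name "cell_after" are
hypothetical and only stand in for the text of each table row's cells:

```python
# Hypothetical rows standing in for the text found in each table row.
rows = [
    ["Header", "Latest close", "Previous close"],
    ["NYSE", "value A", "value B"],
    ["Nasdaq", "value C", "value D"],
]


def cell_after(rows, header):
    """Return the cell right after the first cell equal to `header`, or None."""
    result = None
    found = False  # the "success" variable that controls the outer break
    for row in rows:
        for index, cell in enumerate(row):
            if cell == header and index + 1 < len(row):
                result = row[index + 1]
                found = True
                break  # leave the inner loop
        if found:
            break  # leave the outer loop as well
    return result


print(cell_after(rows, "NYSE"))
```

Wrapping the loops in a function and using "return" at the point of
success would avoid the flag altogether; the flag is shown here because
it is the approach described above.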

Second thing to note: the "BeautifulSoup" documentation should
tell you about the return values of its methods. As far as I can
tell, "BeautifulSoup" returns its own tree objects (tags and strings)
rather than "lxml" objects, even where "lxml" is used as the
underlying parser, so the "BeautifulSoup" documentation is the place
to learn how to inspect further details of the HTML structure.
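
One way to see what a "BeautifulSoup" method returns is to print the
type of the result and then work with the cells of a single row. The
sketch below assumes the newer "bs4" package (the quoted code uses the
older "findAll" spelling) and a small made-up table; the real page's
markup and values may well differ, and "cells_after_header" is a
hypothetical helper name:

```python
from bs4 import BeautifulSoup  # assumption: the bs4 package is installed

# Hypothetical markup with placeholder values, not the real WSJ page.
HTML = """
<table class="mdcTable">
  <tr><td>NYSE</td><td>Latest close</td><td>Previous close</td></tr>
  <tr><td>Issues traded</td><td>A</td><td>B</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("table", {"class": "mdcTable"})

# Inspect what find() hands back before relying on it.
first_cell = table.find("td")
print(type(first_cell).__name__)
print(first_cell.get_text(strip=True))


def cells_after_header(table, header):
    """Return the text of the cells following the row header, or None."""
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if cells and cells[0].get_text(strip=True) == header:
            # Returning here ends both loops, so no "break" is needed.
            return [cell.get_text(strip=True) for cell in cells[1:]]
    return None


print(cells_after_header(table, "Issues traded"))
```

Whether the row header on the real page sits in the first "td" of its
row is an assumption; printing a few rows first would confirm it.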
