Re: Beautiful Soup Table Parsing

From Dieter Maurer <dieter@handshake.de>
Subject Re: Beautiful Soup Table Parsing
Date Thu, 09 Aug 2012 07:43:51 +0200
Newsgroups comp.lang.python

Tom Russell <tsrdatatech@gmail.com> writes:

> I am parsing out a web page at
> http://online.wsj.com/mdc/public/page/2_3021-tradingdiary2.html?mod=mdc_pastcalendar
> using BeautifulSoup.
>
> My problem is that I can parse into the table where the data I want
> resides, but I cannot figure out how to grab the contents of the
> cell next to the row header I want.
>
> For instance this code below:
>
> soup = BeautifulSoup(urlopen('http://online.wsj.com/mdc/public/page/2_3021-tradingdiary2.html?mod=mdc_pastcalendar'))
>
> table = soup.find("table",{"class": "mdcTable"})
> for row in table.findAll("tr"):
>     for cell in row.findAll("td"):
>         print cell.findAll(text=True)
>
> brings in a list that looks like this:
>
> [u'NYSE']
> [u'Latest close']
> [u'Previous close']
> ...
>
> What I want is to get only the data for NYSE and nothing else,
> but I do not know whether that is possible.

I am quite confident that it is possible (though I do not know
the details).

First thing to note: you can use the "break" statement to leave a
loop early. As you have a nested loop, you might need a "break" on
both levels, with the outer loop's "break" controlled by a variable
that indicates "success".
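
A minimal sketch of that pattern, written for Python 3. The "rows"
list is a hypothetical stand-in for the parsed table, and its labels
and numbers are invented for illustration; with BeautifulSoup you
would loop over the rows and cells as in your code instead.

```python
# Hypothetical stand-in for the parsed table: one list of cell texts per row.
# The labels and numbers are invented, not taken from the real page.
rows = [
    ["Issues traded", "10", "11"],
    ["Advances", "20", "21"],
    ["Declines", "30", "31"],
]

wanted = "Advances"  # row header to look for (illustrative)
found = None         # the "success" variable that controls the outer break

for row in rows:
    for index, cell in enumerate(row):
        if cell == wanted:
            # Take the cell next to the header, if the row has one.
            if index + 1 < len(row):
                found = row[index + 1]
            break  # leave the inner loop
    if found is not None:
        break  # leave the outer loop as well

print(found)
```

Using None as the initial value lets the same variable serve both as
the "success" indicator and as the result.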

Second thing to note: the "BeautifulSoup" documentation describes
the return values of its methods. "BeautifulSoup" does not return
"lxml" objects; "find" and "findAll" return its own tag and string
objects ("lxml" is at most an optional parser backend in newer
versions). The "BeautifulSoup" documentation therefore also tells you
how to inspect further details of the HTML structure.
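
A rough sketch of inspecting what the methods return, assuming
BeautifulSoup 4 (the "bs4" package) on Python 3 rather than the older
version used in your code. The HTML fragment is made up; the markup of
the real page may well differ, so treat the navigation step as an
illustration only.

```python
from bs4 import BeautifulSoup  # assumes the bs4 package is installed

# Made-up fragment for illustration; the real page's markup may differ.
html = (
    '<table class="mdcTable">'
    "<tr><td>NYSE</td><td>Latest close</td><td>Previous close</td></tr>"
    "<tr><td>Advances</td><td>20</td><td>21</td></tr>"
    "</table>"
)

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "mdcTable"})

# Look at what find() handed back before deciding how to navigate it.
print(type(table).__name__)

# Each cell is again a tag object with a name and text content.
header = table.find("td")
print(header.name, header.get_text())

# From a header cell, step to the cell next to it in the same row.
neighbour = header.find_next_sibling("td")
neighbour_text = neighbour.get_text(strip=True)
print(neighbour_text)
```

Printing the type of a return value, or calling help() on it in an
interactive session, is usually the quickest way to see which
attributes and methods are available for the next step.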
