


Re: Beautiful Soup Table Parsing

Started by: Dieter Maurer <dieter@handshake.de>
First post: 2012-08-09 07:43 +0200
Last post: 2012-08-09 07:43 +0200
Articles: 1 (1 participant)


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.



#26778 — Re: Beautiful Soup Table Parsing

From: Dieter Maurer <dieter@handshake.de>
Date: 2012-08-09 07:43 +0200
Subject: Re: Beautiful Soup Table Parsing
Message-ID: <mailman.3093.1344491046.4697.python-list@python.org>

Tom Russell <tsrdatatech@gmail.com> writes:

> I am parsing out a web page at
> http://online.wsj.com/mdc/public/page/2_3021-tradingdiary2.html?mod=mdc_pastcalendar
> using BeautifulSoup.
>
> My problem is that I can parse into the table where the data I want
> resides but I cannot seem to figure out how to go about grabbing the
> contents of the cell next to my row header I want.
>
> For instance this code below:
>
> soup = BeautifulSoup(urlopen('http://online.wsj.com/mdc/public/page/2_3021-tradingdiary2.html?mod=mdc_pastcalendar'))
>
> table = soup.find("table",{"class": "mdcTable"})
> for row in table.findAll("tr"):
>     for cell in row.findAll("td"):
>         print cell.findAll(text=True)
>
> brings in a list that looks like this:
>
> [u'NYSE']
> [u'Latest close']
> [u'Previous close']
> ...
>
> What I want to do is only be getting the data for NYSE and nothing
> else so I do not know if that's possible or not.

I am quite confident that it is possible (though I do not know
the details).

First thing to note: you can use the "break" statement to leave
a loop early. As you have a nested loop, you will need a "break"
on both levels: one in the inner loop when the wanted cell is found,
and one in the outer loop, controlled by a variable which indicates
"success".
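
A minimal sketch of that pattern, using plain lists so it runs without
any parsing library. The table contents, the "Advancing" label and the
names "wanted", "value" and "found" are made up for illustration; they
are not taken from the real page.

```python
# Hypothetical stand-in for the parsed table: each inner list holds
# the cell texts of one row. The values are placeholders, not real data.
rows = [
    ["Issues", "NYSE"],
    ["Advancing", "a1"],
    ["Declining", "d1"],
]

wanted = "Advancing"  # row header to look for (assumed label)
value = None          # will hold the text of the neighbouring cell
found = False         # the "success" variable controlling the outer break

for row in rows:
    for i, cell in enumerate(row):
        if cell == wanted and i + 1 < len(row):
            value = row[i + 1]  # the cell right next to the row header
            found = True
            break               # leave the inner loop
    if found:
        break                   # leave the outer loop as well

print(value)
```

Wrapping the loops in a function and using "return" instead of the two
"break" statements would avoid the flag, if that fits your code better.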

Second thing to note: the "BeautifulSoup" documentation describes
the return values of its methods. "find" returns a single "Tag"
object (or "None" if nothing matches) and "findAll" returns a list
of them; these are "BeautifulSoup"'s own classes, not "lxml" objects.
The part of the documentation on navigating the parse tree tells you
how to inspect further details about the html structure.
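
One way to check what the methods return is to ask the objects
themselves. The sketch below assumes the newer "bs4" package on
Python 3 (where "findAll" is spelled "find_all") rather than the
setup in the quoted code, and it parses a made-up HTML snippet; only
the class name "mdcTable" comes from the original post, while the row
labels and values are placeholders.

```python
from bs4 import BeautifulSoup  # third-party "beautifulsoup4", assumed installed

# Made-up HTML standing in for the real page.
html = """
<table class="mdcTable">
  <tr><td>Issues</td><td>NYSE</td></tr>
  <tr><td>Advancing</td><td>a1</td></tr>
  <tr><td>Declining</td><td>d1</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "mdcTable"})

result = None
for row in table.find_all("tr"):
    cells = row.find_all("td")  # list-like collection of Tag objects
    texts = [cell.get_text(strip=True) for cell in cells]
    # Keep the second cell of the row whose first cell is the wanted header.
    if len(texts) > 1 and texts[0] == "Advancing":
        result = texts[1]
        break

print(type(table).__name__)  # class of the object returned by find()
print(result)
```

Collecting the cell texts of a row into a list first makes "the cell
next to the row header" a simple matter of indexing.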
