| Started by | Dieter Maurer <dieter@handshake.de> |
|---|---|
| First post | 2012-08-09 07:43 +0200 |
| Last post | 2012-08-09 07:43 +0200 |
| Articles | 1 — 1 participant |
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: Beautiful Soup Table Parsing Dieter Maurer <dieter@handshake.de> - 2012-08-09 07:43 +0200
| From | Dieter Maurer <dieter@handshake.de> |
|---|---|
| Date | 2012-08-09 07:43 +0200 |
| Subject | Re: Beautiful Soup Table Parsing |
| Message-ID | <mailman.3093.1344491046.4697.python-list@python.org> |
Tom Russell <tsrdatatech@gmail.com> writes:
> I am parsing out a web page at
> http://online.wsj.com/mdc/public/page/2_3021-tradingdiary2.html?mod=mdc_pastcalendar
> using BeautifulSoup.
>
> My problem is that I can parse into the table where the data I want
> resides but I cannot seem to figure out how to go about grabbing the
> contents of the cell next to my row header I want.
>
> For instance this code below:
>
> soup = BeautifulSoup(urlopen('http://online.wsj.com/mdc/public/page/2_3021-tradingdiary2.html?mod=mdc_pastcalendar'))
>
> table = soup.find("table",{"class": "mdcTable"})
> for row in table.findAll("tr"):
>     for cell in row.findAll("td"):
>         print cell.findAll(text=True)
>
> brings in a list that looks like this:
>
> [u'NYSE']
> [u'Latest close']
> [u'Previous close']
> ...
>
> What I want to do is only be getting the data for NYSE and nothing
> else so I do not know if that's possible or not.
I am quite confident that it is possible (though I do not know
the details).
First thing to note: you can use the "break" statement to leave
a loop early. As you have a nested loop, you might need a "break"
on both levels, with the outer loop's "break" controlled by
a variable which indicates "success".
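A minimal sketch of that pattern, using plain lists instead of a parsed page. The `rows` data and the helper name `cell_next_to` are made up for illustration; the cell values are placeholders, not real market data.

```python
# Hypothetical stand-in for the table: one list of cell texts per row.
rows = [
    ["NYSE", "Latest close", "Previous close"],
    ["Advances", "a1", "a2"],
    ["Declines", "d1", "d2"],
    ["NASDAQ", "Latest close", "Previous close"],
    ["Advances", "x1", "x2"],
]


def cell_next_to(rows, wanted):
    """Return the cell to the right of the first cell equal to `wanted`."""
    value = None
    success = False  # set once the wanted cell has been found
    for row in rows:
        for index, cell in enumerate(row):
            if cell == wanted and index + 1 < len(row):
                value = row[index + 1]
                success = True
                break  # leave the inner loop
        if success:
            break  # leave the outer loop as well
    return value


print(cell_next_to(rows, "Advances"))
```

Inside a function, a plain "return" at the point of success would leave both loops at once; the flag is mainly needed when the loops sit at module level, as in the quoted code.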
Second thing to note: the "BeautifulSoup" documentation might
tell you something about the return values of its methods.
I assume "BeautifulSoup" builds upon "lxml" and the return values
are "lxml" related. Then the "lxml" documentation would tell you
how to inspect further details about the html structure.
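As a hedged illustration of inspecting return values: the sketch below assumes the current "bs4" package on Python 3 with its built-in "html.parser" backend (the quoted code appears to use the older BeautifulSoup 3 API). As far as I know, BeautifulSoup returns its own "Tag" objects rather than "lxml" elements; "lxml" can optionally serve as the parser. The HTML snippet is invented and the cell values are placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page's table; not the real WSJ markup.
html = """
<table class="mdcTable">
  <tr><td>NYSE</td><td>Latest close</td><td>Previous close</td></tr>
  <tr><td>Advances</td><td>a1</td><td>a2</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "mdcTable"})

# Ask the interpreter what kind of object came back.
print(type(table).__name__)

# Starting from the header cell, walk to the cells that follow it in the same row.
header_cell = table.find("td")
cells = [td.get_text(strip=True) for td in header_cell.find_next_siblings("td")]
print(cells)
```

Calling "help(table)" or "dir(table)" in an interactive session lists the navigation methods available on the returned object.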