| Subject | Beautiful Soup Table Parsing |
|---|---|
| From | Tom Russell <tsrdatatech@gmail.com> |
| Date | Wed, 8 Aug 2012 19:58:56 -0400 |
| To | python-list@python.org |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.3087.1344470340.4697.python-list@python.org> |
I am parsing a web page at
http://online.wsj.com/mdc/public/page/2_3021-tradingdiary2.html?mod=mdc_pastcalendar
using BeautifulSoup.
My problem is that I can get to the table where the data I want
resides, but I cannot figure out how to grab the contents of the
cell next to the row header I want.
For instance, this code:
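# Python 2; the imports are an assumption (urllib2 and BeautifulSoup 3)
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urlopen('http://online.wsj.com/mdc/public/page/2_3021-tradingdiary2.html?mod=mdc_pastcalendar'))
table = soup.find("table", {"class": "mdcTable"})
for row in table.findAll("tr"):
    for cell in row.findAll("td"):
        print cell.findAll(text=True)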
prints one list per cell, which looks like this:
[u'NYSE']
[u'Latest close']
[u'Previous close']
[u'Week ago']
[u'Issues traded']
[u'3,114']
[u'3,136']
[u'3,134']
[u'Advances']
[u'1,529']
[u'1,959']
[u'1,142']
[u'Declines']
[u'1,473']
[u'1,070']
[u'1,881']
[u'Unchanged']
[u'112']
[u'107']
[u'111']
[u'New highs']
[u'141']
[u'202']
[u'222']
[u'New lows']
[u'15']
[u'11']
[u'42']
[u'Adv. volume*']
[u'375,422,072']
[u'502,402,887']
[u'345,372,893']
[u'Decl. volume*']
[u'245,106,870']
[u'216,507,612']
[u'661,578,907']
[u'Total volume*']
[u'637,047,653']
[u'728,170,765']
[u'1,027,754,710']
[u'Closing tick']
[u'+131']
[u'+102']
[u'-505']
[u'Closing Arms (TRIN)\x86']
[u'0.62']
[u'0.77']
[u'1.20']
[u'Block trades*']
[u'3,874']
[u'4,106']
[u'4,463']
[u'Adv. volume']
[u'1,920,440,454']
[u'2,541,919,125']
[u'1,425,279,645']
[u'Decl. volume']
[u'1,149,672,387']
[u'1,063,007,504']
[u'2,812,073,564']
[u'Total volume']
[u'3,186,154,537']
[u'3,643,871,536']
[u'4,322,541,539']
[u'Nasdaq']
[u'Latest close']
[u'Previous close']
[u'Week ago']
[u'Issues traded']
[u'2,607']
[u'2,604']
[u'2,554']
[u'Advances']
[u'1,085']
[u'1,596']
[u'633']
[u'Declines']
[u'1,390']
[u'880']
[u'1,814']
[u'Unchanged']
[u'132']
[u'128']
[u'107']
[u'New highs']
[u'67']
[u'87']
[u'41']
[u'New lows']
[u'36']
[u'36']
[u'83']
[u'Closing tick']
[u'+225']
[u'+252']
[u'+588']
[u'Closing Arms (TRIN)\x86']
[u'0.48']
[u'0.46']
[u'0.69']
[u'Block trades']
[u'10,790']
[u'8,961']
[u'5,890']
[u'Adv. volume']
[u'1,114,620,628']
[u'1,486,955,619']
[u'566,904,549']
[u'Decl. volume']
[u'692,473,754']
[u'377,852,362']
[u'1,122,931,683']
[u'Total volume']
[u'1,856,979,279']
[u'1,883,468,274']
[u'1,714,837,606']
[u'NYSE Amex']
[u'Latest close']
[u'Previous close']
[u'Week ago']
[u'Issues traded']
[u'434']
[u'432']
[u'439']
[u'Advances']
[u'185']
[u'204']
[u'202']
[u'Declines']
[u'228']
[u'202']
[u'210']
[u'Unchanged']
[u'21']
[u'26']
[u'27']
[u'New highs']
[u'10']
[u'12']
[u'29']
[u'New lows']
[u'4']
[u'7']
[u'13']
[u'Adv. volume*']
[u'2,365,755']
[u'5,581,737']
[u'11,992,771']
[u'Decl. volume*']
[u'4,935,335']
[u'4,619,515']
[u'15,944,286']
[u'Total volume*']
[u'7,430,052']
[u'10,835,106']
[u'28,152,571']
[u'Closing tick']
[u'+32']
[u'+24']
[u'+24']
[u'Closing Arms (TRIN)\x86']
[u'1.63']
[u'0.64']
[u'1.12']
[u'Block trades*']
[u'75']
[u'113']
[u'171']
[u'NYSE Arca']
[u'Latest close']
[u'Previous close']
[u'Week ago']
[u'Issues traded']
[u'1,188']
[u'1,205']
[u'1,176']
[u'Advances']
[u'580']
[u'825']
[u'423']
[u'Declines']
[u'562']
[u'361']
[u'730']
[u'Unchanged']
[u'46']
[u'19']
[u'23']
[u'New highs']
[u'17']
[u'45']
[u'42']
[u'New lows']
[u'5']
[u'25']
[u'12']
[u'Adv. volume*']
[u'72,982,336']
[u'140,815,734']
[u'73,868,550']
[u'Decl. volume*']
[u'58,099,822']
[u'31,998,976']
[u'185,213,281']
[u'Total volume*']
[u'146,162,965']
[u'175,440,329']
[u'260,075,071']
[u'Closing tick']
[u'+213']
[u'+165']
[u'+83']
[u'Closing Arms (TRIN)\x86']
[u'0.86']
[u'0.73']
[u'1.37']
[u'Block trades*']
[u'834']
[u'1,043']
[u'1,593']
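The listing above is flat: the loop prints every cell on its own line, so the row boundaries are lost. As a rough sketch only, and assuming every row holds one label cell followed by three value cells (which matches the output shown, but is not guaranteed by the page), the rows can be rebuilt by slicing. The sample list is just the first few NYSE entries copied from the output, and the code is written for Python 3 rather than the Python 2 used in the post.

```python
# Flat cell texts as printed by the loop above (first few NYSE entries only).
cells = ["NYSE", "Latest close", "Previous close", "Week ago",
         "Issues traded", "3,114", "3,136", "3,134",
         "Advances", "1,529", "1,959", "1,142",
         "Declines", "1,473", "1,070", "1,881"]

ROW_WIDTH = 4  # assumption: one label cell followed by three value cells

# Rebuild rows by slicing the flat list into fixed-width chunks.
rows = [cells[i:i + ROW_WIDTH] for i in range(0, len(cells), ROW_WIDTH)]

# Map each row label to its values, skipping the section header row.
table = {row[0]: row[1:] for row in rows[1:]}

print(table["Advances"][0])  # first data point of the Advances row
```

Slicing is fragile if the page ever adds a row with a different number of cells, so walking the table row by row is usually the safer route.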
What I want is to get only the data for NYSE and nothing else, and I
do not know whether that's possible. I also want to do something
like:
if cell.contents[0] == "Advances":
    Advances = next cell or whatever??  <--- this part I am not sure how to do.
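One possible reading of "next cell" is the sibling `<td>` that follows the label cell. The sketch below assumes Python 3 and the `bs4` package (Beautiful Soup 4) rather than the setup in the post, and it runs against a hypothetical one-row fragment instead of the live page, whose real markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for one row of the real table.
html = ("<table><tr><td>Advances</td><td>1,529</td>"
        "<td>1,959</td><td>1,142</td></tr></table>")
soup = BeautifulSoup(html, "html.parser")

advances = None
for cell in soup.find_all("td"):
    if cell.get_text(strip=True) == "Advances":
        # find_next_sibling("td") returns the next <td> in the same row,
        # skipping any whitespace text nodes in between.
        advances = cell.find_next_sibling("td").get_text(strip=True)
        break  # stop at the first match

print(advances)
```

Because the loop stops at the first match, on the full table it would pick up whichever "Advances" row comes first, which in the output shown is the NYSE one.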
Can someone help point me in the right direction to get the first data
point for the Advances row? I have others I will get as well, but I
figure once I understand how to do this I can do the rest.
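For restricting the result to NYSE and collecting several rows at once, one option is to walk the table row by row and keep track of which exchange section the current row belongs to. This is only a sketch under stated assumptions: Python 3 with `bs4`, a hypothetical trimmed-down `SAMPLE_HTML` in place of the real page, a made-up helper name `section_rows`, and the guess that a section header row is the one whose second cell reads "Latest close".

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed stand-in for the page; the real markup may differ.
SAMPLE_HTML = """
<table class="mdcTable">
  <tr><td>NYSE</td><td>Latest close</td><td>Previous close</td><td>Week ago</td></tr>
  <tr><td>Issues traded</td><td>3,114</td><td>3,136</td><td>3,134</td></tr>
  <tr><td>Advances</td><td>1,529</td><td>1,959</td><td>1,142</td></tr>
  <tr><td>Nasdaq</td><td>Latest close</td><td>Previous close</td><td>Week ago</td></tr>
  <tr><td>Advances</td><td>1,085</td><td>1,596</td><td>633</td></tr>
</table>
"""

def section_rows(html, section):
    """Return {row label: [values]} for one exchange section of the table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "mdcTable"})
    result = {}
    in_section = False
    for row in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) < 2:
            continue  # ignore rows without a label and at least one value
        # Assumption: a section header row has "Latest close" as its second cell.
        if cells[1] == "Latest close":
            in_section = cells[0] == section
        elif in_section:
            result[cells[0]] = cells[1:]
    return result

nyse = section_rows(SAMPLE_HTML, "NYSE")
print(nyse["Advances"][0])  # first data point of the NYSE Advances row
```

The other rows come from the same dictionary, for example `nyse["Issues traded"][0]`.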
Thanks,
Tom