From: Denis McMahon
Newsgroups: comp.lang.python
Subject: Re: Suitable Python code to scrape specific details from web pages.
Date: Wed, 13 Aug 2014 14:53:41 +0000 (UTC)

On Tue, 12 Aug 2014 13:00:30 -0700, Simon Evans wrote:

> in accessing from the 'Racing Post' on a daily basis. Anyhow, the code

Following is some starter code. You will have to look at the output, compare it to the web page, and work out how you want to process it further.

Note that I use beautifulsoup and requests. The output is the HTML for each cell in the table, with a line of "+" characters at each table row break.

I suggest you look at the beautifulsoup documentation at

http://www.crummy.com/software/BeautifulSoup/bs4/doc/

to work out how to select the table cells that contain the data you are interested in, and how to extract it.

#!/usr/bin/env python3
"""
Program to extract data from racingpost.
"""

from bs4 import BeautifulSoup
import requests

# Adjacent string literals are joined by Python, so the URL survives
# line wrapping in news clients without a break inside the string.
url = ("http://www.racingpost.com/horses2/cards/card.sd"
       "?race_id=607466&r_date=2014-08-13#raceTabs=sc_")
r = requests.get(url)

if r.status_code == 200:
    # Name the parser explicitly so the result does not depend on
    # which optional parsers happen to be installed.
    soup = BeautifulSoup(r.content, "html.parser")
    table = soup.find("table", id="sc_horseCard")
    if table is None:
        print("No table with id sc_horseCard found")
    else:
        for row in table.find_all("tr"):
            for cell in row.find_all("td"):
                print(cell)
            # Separator line at each table row break.
            print("+++++++++++++++++++++++++++++++++++++")
else:
    print("HTTP Status", r.status_code)

-- 
Denis McMahon, denismfmcmahon@gmail.com
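
As a possible next step after the starter code, here is a rough sketch of turning each table row into a list of plain-text cell values instead of printing raw HTML. It runs against a small made-up table so it needs no network access. The sample HTML, the horse names, and the helper name rows_as_text are all hypothetical; only the table id sc_horseCard comes from the code above, and the real page's layout may well differ.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the real page (an assumption for illustration only).
SAMPLE_HTML = """
<table id="sc_horseCard">
  <tr><th>No.</th><th>Horse</th></tr>
  <tr><td>1</td><td><a href="#">Example Runner</a></td></tr>
  <tr><td>2</td><td><a href="#">Another Runner</a></td></tr>
</table>
"""


def rows_as_text(html, table_id="sc_horseCard"):
    """Return one list of stripped cell strings per table row that has <td> cells."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        # No such table in the document: nothing to extract.
        return []
    rows = []
    for row in table.find_all("tr"):
        # get_text() drops the markup and keeps only the text content.
        cells = [cell.get_text(" ", strip=True) for cell in row.find_all("td")]
        if cells:  # header rows contain only <th>, so they are skipped
            rows.append(cells)
    return rows


if __name__ == "__main__":
    for cells in rows_as_text(SAMPLE_HTML):
        print(cells)
```

To try it on the live page, pass r.content from the requests call in place of SAMPLE_HTML, then pick out the list positions that hold the fields you care about.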