From: Rob Gaddi
Newsgroups: comp.lang.python
Subject: Re: Suitable Python code to scrape specific details from web pages.
Date: Tue, 12 Aug 2014 13:11:47 -0700
Organization: Highland Technology, Inc.

On Tue, 12 Aug 2014 13:00:30 -0700 (PDT)
Simon Evans wrote:

> Dear Programmers,
> I have been looking at the You tube 'Web Scraping Tutorials' of Chris Reeves. I have tried a few of his python programs in the Python27 command prompt, but altered them from accessing data using links say from the Dow Jones index, to accessing the details I would be interested in accessing from the 'Racing Post' on a daily basis. Anyhow, the code it returns is not in the example I am going to give, is not the information I am seeking, instead of returning the given odds on a horse, it only returns a [], which isn't much use.
> I would be glad if you could tell me where I am going wrong.
> Yours faithfully
> Simon Evans.
> --------------------------------------------------------------------------------
> >>>import urllib
> >>>import re
> >>>htmlfile = urllib.urlopen("http://www.racingpost.com/horses2/cards/card.sd?
> > race_id=600048r_date=2014-05-08#raceTabs=sc_")
> htmltext = htmlfile.read()
> regex = '1Lively
> > Baron9/4F
> '
> >>>pattern = re.compile(regex)
> >>>odds=re.findall(pattern,htmltext)
> >>>print odds
> []
> >>>
> --------------------------------------------------------------------------------
> >>>import urllib
> >>>import re
> >>>htmlfile = urllib.urlopen("http://www.racingpost.com/horses2/cards/card.sd?
> > >>>race_id=600048r_date=2014-05-08#raceTabs=sc_")
> >>>htmltext = htmlfile.read()
> >>>regex = ''
> >>>pattern = re.compile(regex)
> >>>odds=re.findall(pattern,htmltext)
> >>>print odds
> []
> >>>
> -------------------------------------------------------------------------------

If you want web scraping, you want to use
http://www.crummy.com/software/BeautifulSoup/. End of story.

-- 
Rob Gaddi, Highland Technology -- www.highlandtechnology.com
Email address domain is currently out of order. See above to fix.
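
For illustration, here is a rough sketch of what the BeautifulSoup route might look like. It is written for Python 3 with the bs4 package, not the Python 2.7 session quoted above. The markup in SAMPLE_HTML, the class names "runner", "name" and "odds", the second horse, and the helper extract_odds are all invented for this example; the real Racing Post page will be structured differently, so the selectors would have to be adapted after inspecting the actual HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a race card. The tag layout and
# class names are assumptions, not the real Racing Post page structure.
SAMPLE_HTML = """
<table id="card">
  <tr class="runner"><td class="name">Lively Baron</td><td class="odds">9/4F</td></tr>
  <tr class="runner"><td class="name">Second Horse</td><td class="odds">5/1</td></tr>
</table>
"""


def extract_odds(html):
    """Return a dict mapping horse name to odds text (hypothetical helper)."""
    # "html.parser" is the parser bundled with Python, so no extra
    # parser library is needed.
    soup = BeautifulSoup(html, "html.parser")
    odds = {}
    for row in soup.find_all("tr", class_="runner"):
        name = row.find("td", class_="name")
        price = row.find("td", class_="odds")
        # Skip rows that lack either cell instead of raising an error.
        if name is not None and price is not None:
            odds[name.get_text(strip=True)] = price.get_text(strip=True)
    return odds


if __name__ == "__main__":
    for horse, price in extract_odds(SAMPLE_HTML).items():
        print(horse, price)
```

The point of walking the parsed tree rather than matching a regex against the raw text is that small changes in whitespace or attribute order no longer turn the result into [].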