Newsgroups: comp.lang.python
Date: Tue, 12 Aug 2014 15:44:58 -0700 (PDT)
Subject: Re: Suitable Python code to scrape specific details from web pages.
From: Simon Evans

On Tuesday, August 12, 2014 9:00:30 PM UTC+1, Simon Evans wrote:
> Dear Programmers,
>
> I have been looking at the YouTube 'Web Scraping Tutorials' of Chris Reeves. I have tried a few of his Python programs in the Python27 command prompt, but altered them from accessing data using links, say from the Dow Jones index, to accessing the details I would be interested in accessing from the 'Racing Post' on a daily basis. Anyhow, the result returned in the example I am going to give is not the information I am seeking: instead of returning the given odds on a horse, it only returns a [], which isn't much use.
>
> I would be glad if you could tell me where I am going wrong.
>
> Yours faithfully
>
> Simon Evans.
>
> --------------------------------------------------------------------------------
> >>>import urllib
> >>>import re
> >>>htmlfile = urllib.urlopen("http://www.racingpost.com/horses2/cards/card.sd?
> race_id=600048r_date=2014-05-08#raceTabs=sc_")
> htmltext = htmlfile.read()
> regex = '1
> horse_id=758752"onclick="scorecards.send("horse_name":):return Html.popup(this,
> {width:695,height:800})"title="Full details about this HORSE">Lively
> Baron9/4F
'
> >>>pattern = re.compile(regex)
> >>>odds=re.findall(pattern,htmltext)
> >>>print odds
> []
> >>>
> --------------------------------------------------------------------------------
> >>>import urllib
> >>>import re
> >>>htmlfile = urllib.urlopen("http://www.racingpost.com/horses2/cards/card.sd?
> >>>race_id=600048r_date=2014-05-08#raceTabs=sc_")
> >>>htmltext = htmlfile.read()
> >>>regex = ''
> >>>pattern = re.compile(regex)
> >>>odds=re.findall(pattern,htmltext)
> >>>print odds
> []
> >>>
> --------------------------------------------------------------------------------

Dear Programmers,

Thank you for your responses. I have installed 'Beautiful Soup' and I have the 'Getting Started with Beautiful Soup' book, but I can't seem to make any progress with it; I am too thick to make much use of it. I was hoping I could scrape specified stuff off web pages without using it. I have also installed 'Requests'. Is there any code you can suggest that can access the sort of web page values I have referred to, such as odds and names of runners, from the 'inspect element' or 'source' HTML pages on www.Racingpost.com?
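
For what it is worth, here is a minimal sketch of one way this could be done without Beautiful Soup, using only the standard library's HTML parser. It is written for Python 3 (the session above is Python 2.7), and almost everything about the page layout is an assumption: SAMPLE_HTML is made-up markup, the "odds" class name is hypothetical, and only the title text "Full details about this HORSE" is taken from the regex quoted above. The names RunnerParser, extract_runners and fetch_html are illustrative. The real page source would need to be inspected and the tag tests adjusted to match it.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Made-up markup for illustration only; the real page structure is an assumption.
SAMPLE_HTML = """
<table>
  <tr>
    <td><a href="#" title="Full details about this HORSE">Lively
        Baron</a></td>
    <td><span class="odds">9/4F</span></td>
  </tr>
  <tr>
    <td><a href="#" title="Full details about this HORSE">Example Runner</a></td>
    <td><span class="odds">5/1</span></td>
  </tr>
</table>
"""


class RunnerParser(HTMLParser):
    """Collect runner names and odds from tags that match simple tests."""

    def __init__(self):
        super().__init__()
        self.names = []
        self.odds = []
        self._target = None  # list that the next piece of text belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("title") == "Full details about this HORSE":
            self._target = self.names
        elif tag == "span" and attrs.get("class") == "odds":  # hypothetical class
            self._target = self.odds

    def handle_data(self, data):
        if self._target is not None and data.strip():
            # Collapse line breaks and runs of spaces inside the text.
            self._target.append(" ".join(data.split()))
            self._target = None

    def handle_endtag(self, tag):
        self._target = None


def extract_runners(html):
    """Return a list of (name, odds) pairs found in the given HTML text."""
    parser = RunnerParser()
    parser.feed(html)
    parser.close()
    return list(zip(parser.names, parser.odds))


def fetch_html(url):
    """Download a page as text; the UTF-8 encoding is an assumption."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    for name, odds in extract_runners(SAMPLE_HTML):
        print(name, odds)
```

Testing on tag names and attributes, rather than matching one long literal string with re.findall, means the match does not break when the page has different spacing or line breaks than the pattern. To try it on a live page, the text returned by fetch_html(url) would be passed to extract_runners in place of SAMPLE_HTML.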