X-Received: by 10.224.45.131 with SMTP id e3mr521429qaf.1.1407883498444; Tue, 12 Aug 2014 15:44:58 -0700 (PDT)
X-Received: by 10.140.32.227 with SMTP id h90mr8068qgh.26.1407883498428; Tue, 12 Aug 2014 15:44:58 -0700 (PDT)
Path: csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!usenet.blueworldhosting.com!feeder01.blueworldhosting.com!peer02.iad.highwinds-media.com!news.highwinds-media.com!feed-me.highwinds-media.com!j15no6667950qaq.0!news-out.google.com!b3ni24360qac.1!nntp.google.com!j15no6667947qaq.0!postnews.google.com!glegroupsg2000goo.googlegroups.com!not-for-mail
Newsgroups: comp.lang.python
Date: Tue, 12 Aug 2014 15:44:58 -0700 (PDT)
In-Reply-To: <a8f10c4f-d4a0-48ed-ae92-2a43e9a094c3@googlegroups.com>
Complaints-To: groups-abuse@google.com
Injection-Info: glegroupsg2000goo.googlegroups.com; posting-host=146.90.214.3; posting-account=59tTfwoAAACIDa2nz1oVlQJc3aCJi_5b
NNTP-Posting-Host: 146.90.214.3
References: <a8f10c4f-d4a0-48ed-ae92-2a43e9a094c3@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <e2011de5-10fa-4de1-89fa-4e41882a6646@googlegroups.com>
Subject: Re: Suitable Python code to scrape specific details from  web pages.
From: Simon Evans <musicalhacksaw@yahoo.co.uk>
Injection-Date: Tue, 12 Aug 2014 22:44:58 +0000
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 4017
X-Received-Body-CRC: 1676937530
Xref: csiph.com comp.lang.python:76154

On Tuesday, August 12, 2014 9:00:30 PM UTC+1, Simon Evans wrote:
> Dear Programmers,
>
> I have been watching Chris Reeves's 'Web Scraping Tutorials' on YouTube. I
> have tried a few of his Python programs at the Python 2.7 command prompt,
> but altered them so that, instead of reading data from links such as the
> Dow Jones index, they fetch the details I am interested in from the
> 'Racing Post' on a daily basis. However, the code in the example below
> does not return the information I am seeking: instead of the given odds
> on a horse, it returns only [], which isn't much use.
>
> I would be glad if you could tell me where I am going wrong.
>
> Yours faithfully
>
> Simon Evans.
>
> --------------------------------------------------------------------------------
> >>> import urllib
> >>> import re
> >>> htmlfile = urllib.urlopen("http://www.racingpost.com/horses2/cards/card.sd?race_id=600048r_date=2014-05-08#raceTabs=sc_")
> >>> htmltext = htmlfile.read()
> >>> regex = '<strong>1<a href="http://www.racingpost.com/horses/horse_home.sd?horse_id=758752"onclick="scorecards.send(&quot;horse_name&quot:):return Html.popup(this,{width:695,height:800})"title="Full details about this HORSE">Lively Baron</a>9/4F</strong><br/>'
> >>> pattern = re.compile(regex)
> >>> odds = re.findall(pattern, htmltext)
> >>> print odds
> []
> >>>
>
> --------------------------------------------------------------------------------
> >>> import urllib
> >>> import re
> >>> htmlfile = urllib.urlopen("http://www.racingpost.com/horses2/cards/card.sd?race_id=600048r_date=2014-05-08#raceTabs=sc_")
> >>> htmltext = htmlfile.read()
> >>> regex = '<a></a>'
> >>> pattern = re.compile(regex)
> >>> odds = re.findall(pattern, htmltext)
> >>> print odds
> []
> >>>
>
> --------------------------------------------------------------------------------
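
A note on the two quoted sessions: one likely reason both print [] can be
shown without touching the network. The pattern is compiled as a regular
expression, so characters such as ? and . in the pasted markup act as
operators rather than literal text, and '<a></a>' only matches an anchor
that has no attributes and no text. The sample string below is an
assumption modelled on the markup in the quoted regex; the live page may
well differ.

```python
import re

# Assumed sample, modelled on the markup pasted into the quoted regex.
sample = ('<a href="http://www.racingpost.com/horses/horse_home.sd'
          '?horse_id=758752">Lively Baron</a>')

fragment = 'horse_home.sd?horse_id=758752'

# As a regex, "d?" means "an optional d", so the literal "?" is never matched.
print(re.findall(fragment, sample))

# re.escape() makes every special character in the fragment literal.
print(re.findall(re.escape(fragment), sample))

# '<a></a>' needs an anchor with no attributes and no text: nothing here.
print(re.findall('<a></a>', sample))
```

Even with escaping, matching a whole pasted line is fragile, since any
change in spacing or attribute order on the page breaks it.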

Dear Programmers,

Thank you for your responses. I have installed 'Beautiful Soup' and I have
the 'Getting Started with Beautiful Soup' book, but I can't seem to make any
progress with it; I am too thick to make much use of it. I was hoping I
could scrape specified items off Web pages without using it. I have also
installed 'Requests'. Can you suggest any code that can access the sort of
Web page values I have referred to, such as odds and names of runners, as
they appear in the 'inspect element' or 'source' HTML views of pages on
www.Racingpost.com?