Path: csiph.com!usenet.pasdenom.info!aioe.org!news.stack.nl!newsfeed.xs4all.nl!newsfeed1.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
Newsgroups: comp.lang.python
Date: Sun, 24 Feb 2013 17:32:29 -0800 (PST)
In-Reply-To: <mailman.2463.1361752083.2939.python-list@python.org>
Complaints-To: groups-abuse@google.com
Injection-Info: glegroupsg2000goo.googlegroups.com; posting-host=69.80.108.19; posting-account=hEeMqAoAAAAN2L2NtWcUUUG7LStm2lEM
References: <e3e061a0-493b-4b8b-992b-a175dbecd4ac@googlegroups.com> <mailman.2463.1361752083.2939.python-list@python.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Subject: Re: using urllib on a more complex site
From: "Adam W." <AWasilenko@gmail.com>
To: comp.lang.python@googlegroups.com
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Cc: "python-list@python.org" <python-list@python.org>, "Adam W." <AWasilenko@gmail.com>
Precedence: list
Message-ID: <mailman.2474.1361755952.2939.python-list@python.org>
Lines: 68
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:39852

On Sunday, February 24, 2013 7:27:54 PM UTC-5, Chris Rebert wrote:
> On Sunday, February 24, 2013, Adam W.  wrote:
> I'm trying to write a simple script to scrape http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day
> in order to send myself an email every day of the 99c movie of the day.
>
> However, using a simple command like (in Python 3.0):
>
> urllib.request.urlopen('http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day').read()
>
> I don't get all the source I need; it's just the navigation buttons. Now I assume they are using some CSS/JavaScript witchcraft to load all the useful data later, so my question is: how do I make urllib "wait" and grab that data as well?
>
> urllib isn't a web browser. It just requests the single (in this case, HTML) file from the given URL. It does not parse the HTML (indeed, it doesn't care what kind of file you're dealing with); therefore, it obviously does not retrieve the other resources linked within the document (CSS, JS, images, etc.) nor does it run any JavaScript. So, there's nothing to "wait" for; urllib is already doing everything it was designed to do.
>
> Your best bet is to open the page in a web browser yourself and use the developer tools/inspectors to watch what XHR requests the page's scripts are making, find the one(s) that have the data you care about, and then make those requests instead via urllib (or the `requests` 3rd-party lib, or whatever). If the URL(s) vary, reverse-engineering the scheme used to generate them will also be required.
>
> Alternatively, you could use something like Selenium, which lets you drive an actual full web browser (e.g. Firefox) from Python.
>
> Cheers,
> Chris
>
> --
> http://rebertia.com

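For anyone finding this later, here is a small sketch of one more reason the original URL could never have returned the movie data: everything after the "#" is a fragment, which is handled by the browser and is not part of the HTTP request. The variable names are just mine for illustration.

```python
from urllib.parse import urldefrag, urlsplit

url = ('http://www.vudu.com/movies/'
       '#tag/99centOfTheDay/99c%20Rental%20of%20the%20day')

parts = urlsplit(url)

# The path is all the server is asked for.
print(parts.path)

# The fragment stays on the client; the page's JavaScript reads it
# and then issues its own requests for the data.
print(parts.fragment)

# The URL with the fragment removed, i.e. what is effectively requested.
server_url = urldefrag(url).url
print(server_url)
```

So the request was effectively for /movies/ alone, which would explain getting only the navigation shell.
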
Huzzah! Found it: http://apicache.vudu.com/api2/claimedAppId/myvudu/format/application*2Fjson/callback/DirectorSequentialCallback/_type/contentSearch/count/30/dimensionality/any/followup/ratingsSummaries/followup/totalCount/offset/0/tag/99centOfTheDay/type/program/type/season/type/episode/type/bundle

Thanks for the tip about XHRs.
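
In case it helps someone else, a rough sketch of requesting that URL with urllib. I am assuming, based only on the "callback/DirectorSequentialCallback" part of the URL, that the response is JSON wrapped in a JavaScript callback call (JSONP); I have not documented the actual response layout here, so strip_jsonp() and fetch_listing() are hypothetical helpers and the sample string at the bottom is made up.

```python
import json
import re
import urllib.request

API_URL = ('http://apicache.vudu.com/api2/claimedAppId/myvudu/format/'
           'application*2Fjson/callback/DirectorSequentialCallback/_type/'
           'contentSearch/count/30/dimensionality/any/followup/'
           'ratingsSummaries/followup/totalCount/offset/0/tag/'
           '99centOfTheDay/type/program/type/season/type/episode/type/bundle')


def strip_jsonp(text):
    """Return the text inside a `name(...)` wrapper, or text unchanged.

    Assumption: the body looks like `SomeCallback({...});`. If it does
    not match that shape, the input is returned as-is.
    """
    match = re.match(r'^\s*[\w.$]+\s*\((.*)\)\s*;?\s*$', text, re.DOTALL)
    return match.group(1) if match else text


def fetch_listing(url=API_URL):
    """Download the URL and parse the (assumed JSONP) body as JSON."""
    with urllib.request.urlopen(url) as response:
        charset = response.headers.get_content_charset() or 'utf-8'
        body = response.read().decode(charset)
    return json.loads(strip_jsonp(body))


# Offline check with a made-up payload (not real Vudu data):
sample = 'DirectorSequentialCallback({"example": [1, 2, 3]});'
parsed = json.loads(strip_jsonp(sample))
print(parsed)
```

Calling fetch_listing() does the real request; what to pull out of the parsed result depends on the structure the service actually returns.
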