Path: csiph.com!newsfeed.hal-mli.net!feeder3.hal-mli.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!newsfeed.xs4all.nl!newsfeed2.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
Newsgroups: comp.lang.python
Date: Sun, 24 Feb 2013 17:28:00 -0800 (PST)
In-Reply-To: <mailman.2465.1361752225.2939.python-list@python.org>
Complaints-To: groups-abuse@google.com
Injection-Info: glegroupsg2000goo.googlegroups.com; posting-host=69.80.108.19; posting-account=hEeMqAoAAAAN2L2NtWcUUUG7LStm2lEM
References: <e3e061a0-493b-4b8b-992b-a175dbecd4ac@googlegroups.com> <mailman.2465.1361752225.2939.python-list@python.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Subject: Re: using urllib on a more complex site
From: "Adam W." <AWasilenko@gmail.com>
To: comp.lang.python@googlegroups.com
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Cc: python-list@python.org
Precedence: list
Message-ID: <mailman.2473.1361755689.2939.python-list@python.org>
Lines: 102
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:39849

On Sunday, February 24, 2013 7:30:00 PM UTC-5, Dave Angel wrote:
> On 02/24/2013 07:02 PM, Adam W. wrote:
> > I'm trying to write a simple script to scrape http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day
> >
> > in order to send myself an email every day of the 99c movie of the day.
> >
> > However, using a simple command like (in Python 3.0):
> > urllib.request.urlopen('http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day').read()
> >
> > I don't get the all the source I need, its just the navigation buttons.  Now I assume they are using some CSS/javascript witchcraft to load all the useful data later, so my question is how do I make urllib "wait" and grab that data as well?
> >
>
> The CSS and the jpegs, and many other aspects of a web "page" are loaded
> explicitly, by the browser, when parsing the tags of the page you
> downloaded.  There is no sooner or later.  The website won't send the
> other files until you request them.
>
> For example, that site at the moment has one image (prob. jpeg)
> highlighted,
>
> <img class="gwt-Image" src="http://images2.vudu.com/poster2/179186-m"
> alt="Sex and the City: The Movie (Theatrical)">
>
> if you want to look at that jpeg, you need to download the file url
> specified by the src attribute of that img element.
>
> Or perhaps you can just look at the 'alt' attribute, which is mainly
> there for browsers who don't happen to do graphics, for example, the
> ones for the blind.
>
> Naturally, there may be dozens of images on the page, and there's no
> guarantee that the website author is trying to make it easy for you.
> Why not check if there's a defined api for extracting the information
> you want?  Check the site, or send a message to the webmaster.
>
> No guarantee that tomorrow, the information won't be buried in some
> javascript fragment.  Again, if you want to see that, you might need to
> write a javascript interpreter.  it could use any algorithm at all to
> build webpage information, and the encoding could change day by day, or
> hour by hour.
>
> --
> DaveA

The problem is, the image URL you found is not in the data urllib returns.  To be clear, I know what urllib is supposed to do (i.e., it does not download image data when loading a page).  I've used it many times before; I just never had to jump through hoops to get at the content I needed.
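
For what it's worth, here is a minimal sketch of one likely reason the page comes back nearly empty. It assumes the "#tag/..." part of the address is an ordinary URL fragment, which a client keeps to itself and does not send to the server; the helper name split_fragment is made up for illustration. If that assumption holds, the server only ever sees the bare /movies/ address, and the rest is filled in by scripts in the browser.

```python
from urllib.parse import urldefrag

URL = ("http://www.vudu.com/movies/"
       "#tag/99centOfTheDay/99c%20Rental%20of%20the%20day")


def split_fragment(url):
    """Split a URL into the part sent to the server and the fragment.

    The fragment (everything after '#') is handled on the client side,
    so the server's response cannot depend on it.
    """
    base, fragment = urldefrag(url)
    return base, fragment


base, fragment = split_fragment(URL)
print("requested from server:", base)
print("kept by the client:   ", fragment)
```

So fetching the full address and fetching the bare base address would be expected to return the same HTML, which fits with only getting the navigation buttons.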

I'll look into how to find XHR requests in Chrome.  I didn't know what that after-the-fact loading was called, so now my searching will be more productive.
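
A rough sketch of what I expect the next step to look like once the XHR request shows up in Chrome's Network tab. Everything specific here is an assumption: XHR_URL is a placeholder rather than a real vudu.com endpoint, and the JSON shape ({"movies": [{"title": ...}]}) is invented purely to show the idea. The real response may not be JSON at all.

```python
import json
import urllib.request

# Placeholder only: replace with the request URL copied from the Network tab.
XHR_URL = "http://example.invalid/placeholder"


def fetch_bytes(url, timeout=10):
    """Fetch the raw response body, sending a browser-like User-Agent."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


def extract_titles(raw):
    """Pull titles out of a JSON payload.

    Assumes a hypothetical shape: {"movies": [{"title": "..."}, ...]}.
    """
    data = json.loads(raw.decode("utf-8"))
    return [movie["title"] for movie in data.get("movies", [])]


# Usage, once a real endpoint is known:
#     titles = extract_titles(fetch_bytes(XHR_URL))
```

Keeping the fetch and the parsing separate means the parsing can be tried against a response saved from the browser before any live requests are made.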