Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder1.news.weretis.net!feeder.erje.net!eu.feeder.erje.net!xlned.com!feeder1.xlned.com!news2.euro.net!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
Sender: chris@rebertia.com
In-Reply-To: <e3e061a0-493b-4b8b-992b-a175dbecd4ac@googlegroups.com>
References: <e3e061a0-493b-4b8b-992b-a175dbecd4ac@googlegroups.com>
Date: Sun, 24 Feb 2013 16:27:54 -0800
Subject: Re: using urllib on a more complex site
From: Chris Rebert <clp2@rebertia.com>
To: "Adam W." <AWasilenko@gmail.com>
Content-Type: multipart/alternative; boundary=f46d0402ac45b3214c04d6819cf7
Cc: "python-list@python.org" <python-list@python.org>
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.2463.1361752083.2939.python-list@python.org>
Lines: 95
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:39828

--f46d0402ac45b3214c04d6819cf7
Content-Type: text/plain; charset=UTF-8

On Sunday, February 24, 2013, Adam W. wrote:

> I'm trying to write a simple script to scrape
> http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day
>
> in order to send myself an email every day of the 99c movie of the day.
>
> However, using a simple command like (in Python 3.0):
> urllib.request.urlopen('http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day').read()
>
> I don't get all the source I need; it's just the navigation buttons.
> Now I assume they are using some CSS/JavaScript witchcraft to load all the
> useful data later, so my question is how do I make urllib "wait" and grab
> that data as well?
>

urllib isn't a web browser. It just requests the single file (in this
case, HTML) at the given URL. It does not parse the HTML (indeed, it
doesn't care what kind of file you're dealing with), so it does not
retrieve the other resources linked within the document (CSS, JS, images,
etc.), nor does it run any JavaScript. So there's nothing to "wait" for;
urllib is already doing everything it was designed to do.
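
A quick way to see this for your URL: everything after the `#` is a
fragment, which is meant for the browser (here, presumably the page's
scripts) and is not part of what gets requested from the server. A minimal
sketch using only the standard library:

```python
from urllib.parse import urlsplit

url = ("http://www.vudu.com/movies/"
       "#tag/99centOfTheDay/99c%20Rental%20of%20the%20day")

parts = urlsplit(url)

# The pieces that identify the resource on the server.
print(parts.scheme, parts.netloc, parts.path)

# The fragment: left for the browser and the page's JavaScript to interpret.
print(parts.fragment)
```

So the server only ever sees a request for /movies/, which would explain why
all you get back is the navigation shell.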

Your best bet is to open the page in a web browser yourself and use the
developer tools/inspectors to watch which XHR requests the page's scripts
make, find the one(s) that return the data you care about, and then make
those requests directly via urllib (or the `requests` third-party lib, or
whatever). If the URL(s) vary, you will also need to reverse-engineer the
scheme used to generate them.
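
Roughly, the result might look like the sketch below. I haven't inspected
the site, so everything specific here is an assumption: the endpoint
(`API_BASE`), the query parameters, and the idea that the response is JSON
are all hypothetical stand-ins for whatever you actually see in the
browser's network tab.

```python
import json
import urllib.request
from urllib.parse import urlencode

# HYPOTHETICAL endpoint: replace with the XHR URL seen in the developer tools.
API_BASE = "http://example.com/api/search"


def build_url(tag, base=API_BASE):
    """Rebuild the request URL from the part that varies (assumed: a tag)."""
    # The parameter names "tag" and "format" are made up for illustration.
    return base + "?" + urlencode({"tag": tag, "format": "json"})


def fetch_json(url, opener=urllib.request.urlopen):
    """Fetch url and decode the body, assuming a UTF-8 JSON response."""
    with opener(url) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only prints the URL; call fetch_json(...) once the real one is known.
    print(build_url("99centOfTheDay"))
```

Keeping the URL construction in its own function makes it easy to adjust
once you've worked out which parts of the real URL change from day to day.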

Alternatively, you could use something like Selenium, which lets you drive
an actual full web browser (e.g. Firefox) from Python.
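
A minimal sketch of that approach, assuming the third-party `selenium`
package and Firefox are installed. The function name and the CSS selector
you pass in are placeholders; you'd pick a selector for some element that
only exists once the scripts have loaded the movie data.

```python
def fetch_rendered_html(url, css_selector, timeout=30):
    """Load url in Firefox, wait for css_selector to appear, return the HTML."""
    # Imported here so this module still loads where Selenium isn't installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Block until the element exists in the DOM, or raise after timeout.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        # The DOM as it stands after the page's scripts have run.
        return driver.page_source
    finally:
        # Always close the browser, even if the wait timed out.
        driver.quit()
```

This is the closest thing to making the fetch "wait": the browser runs the
JavaScript, and you only read the page once the content has shown up.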

Cheers,
Chris


--
http://rebertia.com

--f46d0402ac45b3214c04d6819cf7--