Groups > comp.lang.python > #39825 > unrolled thread

using urllib on a more complex site

Started by	"Adam W." <AWasilenko@gmail.com>
First post	2013-02-24 16:02 -0800
Last post	2013-02-24 17:32 -0800
Articles	7 — 3 participants

  using urllib on a more complex site "Adam W." <AWasilenko@gmail.com> - 2013-02-24 16:02 -0800
    Re: using urllib on a more complex site Chris Rebert <clp2@rebertia.com> - 2013-02-24 16:27 -0800
      Re: using urllib on a more complex site "Adam W." <AWasilenko@gmail.com> - 2013-02-24 17:32 -0800
      Re: using urllib on a more complex site "Adam W." <AWasilenko@gmail.com> - 2013-02-24 17:32 -0800
    Re: using urllib on a more complex site Dave Angel <davea@davea.name> - 2013-02-24 19:30 -0500
      Re: using urllib on a more complex site "Adam W." <AWasilenko@gmail.com> - 2013-02-24 17:28 -0800
      Re: using urllib on a more complex site "Adam W." <AWasilenko@gmail.com> - 2013-02-24 17:28 -0800

#39825 — using urllib on a more complex site

From	"Adam W." <AWasilenko@gmail.com>
Date	2013-02-24 16:02 -0800
Subject	using urllib on a more complex site
Message-ID	<e3e061a0-493b-4b8b-992b-a175dbecd4ac@googlegroups.com>

I'm trying to write a simple script to scrape http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day

in order to send myself an email every day about the 99c movie of the day.

However, using a simple command like (in Python 3.0):
urllib.request.urlopen('http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day').read()

I don't get all the source I need; it's just the navigation buttons.  Now I assume they are using some CSS/JavaScript witchcraft to load all the useful data later, so my question is: how do I make urllib "wait" and grab that data as well?

#39828

From	Chris Rebert <clp2@rebertia.com>
Date	2013-02-24 16:27 -0800
Message-ID	<mailman.2463.1361752083.2939.python-list@python.org>
In reply to	#39825

On Sunday, February 24, 2013, Adam W. wrote:

> I'm trying to write a simple script to scrape
> http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day
>
> in order to send myself an email every day about the 99c movie of the day.
>
> However, using a simple command like (in Python 3.0):
> urllib.request.urlopen('http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day').read()
>
> I don't get all the source I need; it's just the navigation buttons.
> Now I assume they are using some CSS/JavaScript witchcraft to load all the
> useful data later, so my question is: how do I make urllib "wait" and grab
> that data as well?
>

urllib isn't a web browser. It just requests the single (in this case,
HTML) file from the given URL. It does not parse the HTML (indeed, it
doesn't care what kind of file you're dealing with); therefore, it
obviously does not retrieve the other resources linked within the document
(CSS, JS, images, etc.) nor does it run any JavaScript. So, there's nothing
to "wait" for; urllib is already doing everything it was designed to do.
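
As a small illustration of the "single file" point, here is a sketch (not part of the original exchange) that splits the URL from the question into the part an HTTP client actually requests and the `#...` fragment. The fragment is handled on the client side and is not sent to the server, so the tag-specific part of that address never reaches vudu.com. The names `PAGE_URL`, `requested_url` and `fragment` are only illustrative.

```python
from urllib.parse import urldefrag

# The address from the original question.
PAGE_URL = "http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day"

# urldefrag() separates the requestable URL from the client-side fragment.
requested_url, fragment = urldefrag(PAGE_URL)

print("requested from the server:", requested_url)
print("kept on the client side:  ", fragment)
```

Everything after the `#` is presumably interpreted by the page's own JavaScript once it runs in a browser, which urllib never does.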

Your best bet is to open the page in a web browser yourself and use the
developer tools/inspectors to watch what XHR requests the page's scripts
are making, find the one(s) that have the data you care about, and then
make those requests instead via urllib (or the `requests` 3rd-party lib, or
whatever). If the URL(s) vary, reverse-engineering the scheme used to
generate them will also be required.
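
Once a data URL has been spotted in the browser's network panel, requesting it from Python could look roughly like the sketch below. It assumes the endpoint returns JSON and accepts a plain GET; the function name `fetch_json`, the User-Agent string and the timeout are illustrative choices, not anything the site is known to require.

```python
import json
import urllib.request


def fetch_json(url, timeout=10):
    """Request one URL and decode the body as JSON (assumes a JSON response)."""
    request = urllib.request.Request(
        url,
        # Hypothetical header value; some servers reject urllib's default agent.
        headers={"User-Agent": "Mozilla/5.0 (example script)"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return json.loads(response.read().decode(charset))


# Usage, with a URL copied from the developer tools:
# data = fetch_json("http://example.com/some/xhr/endpoint")
```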

Alternatively, you could use something like Selenium, which lets you drive
an actual full web browser (e.g. Firefox) from Python.
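
A minimal sketch of the Selenium route, assuming the `selenium` package and a Firefox driver are installed. The helper name `fetch_rendered_html` and the idea of waiting for a particular CSS selector are assumptions made for illustration; which selector signals "the data has loaded" has to be found by inspecting the page.

```python
def fetch_rendered_html(url, css_selector, timeout=15):
    """Load a page in Firefox, wait for an element, return the rendered HTML."""
    # Imported here so the sketch can be loaded even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Block until the page's scripts have inserted the element we care about.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()


# Usage (the selector is a guess and must be checked against the live page):
# html = fetch_rendered_html("http://www.vudu.com/movies/", "img")
```

This is much heavier than a plain HTTP request, so it is usually the fallback when the underlying XHR cannot be reproduced directly.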

Cheers,
Chris

-- 
http://rebertia.com

#39851

From	"Adam W." <AWasilenko@gmail.com>
Date	2013-02-24 17:32 -0800
Message-ID	<03299449-cba6-413e-9957-8bde6187838a@googlegroups.com>
In reply to	#39828

On Sunday, February 24, 2013 7:27:54 PM UTC-5, Chris Rebert wrote:
> On Sunday, February 24, 2013, Adam W. wrote:
> > I'm trying to write a simple script to scrape http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day
> >
> > in order to send myself an email every day about the 99c movie of the day.
> >
> > However, using a simple command like (in Python 3.0):
> > urllib.request.urlopen('http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day').read()
> >
> > I don't get all the source I need; it's just the navigation buttons.  Now I assume they are using some CSS/JavaScript witchcraft to load all the useful data later, so my question is: how do I make urllib "wait" and grab that data as well?
>
> urllib isn't a web browser. It just requests the single (in this case, HTML) file from the given URL. It does not parse the HTML (indeed, it doesn't care what kind of file you're dealing with); therefore, it obviously does not retrieve the other resources linked within the document (CSS, JS, images, etc.) nor does it run any JavaScript. So, there's nothing to "wait" for; urllib is already doing everything it was designed to do.
>
> Your best bet is to open the page in a web browser yourself and use the developer tools/inspectors to watch what XHR requests the page's scripts are making, find the one(s) that have the data you care about, and then make those requests instead via urllib (or the `requests` 3rd-party lib, or whatever). If the URL(s) vary, reverse-engineering the scheme used to generate them will also be required.
>
> Alternatively, you could use something like Selenium, which lets you drive an actual full web browser (e.g. Firefox) from Python.
>
> Cheers,
> Chris
> --
> http://rebertia.com

Huzzah! Found it: http://apicache.vudu.com/api2/claimedAppId/myvudu/format/application*2Fjson/callback/DirectorSequentialCallback/_type/contentSearch/count/30/dimensionality/any/followup/ratingsSummaries/followup/totalCount/offset/0/tag/99centOfTheDay/type/program/type/season/type/episode/type/bundle

Thanks for the tip about XHRs.
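
For anyone following along, here is a hedged sketch of how a response from that endpoint might be decoded. The `callback/DirectorSequentialCallback` segment in the URL suggests, though nothing in this thread confirms, that the body is JSONP, i.e. JSON wrapped in a function call. The helper `unwrap_jsonp` and its regular expression are illustrative; the actual response format should be checked in the browser first.

```python
import json
import re

# Endpoint copied from the post above; it is not requested by this sketch.
ENDPOINT = (
    "http://apicache.vudu.com/api2/claimedAppId/myvudu/format/application*2Fjson"
    "/callback/DirectorSequentialCallback/_type/contentSearch/count/30"
    "/dimensionality/any/followup/ratingsSummaries/followup/totalCount"
    "/offset/0/tag/99centOfTheDay/type/program/type/season/type/episode/type/bundle"
)

# Matches "someCallback( ... )" with an optional trailing semicolon.
_JSONP = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def unwrap_jsonp(text):
    """Decode JSON, first stripping a JSONP callback wrapper if one is present."""
    match = _JSONP.match(text)
    payload = match.group(1) if match else text
    return json.loads(payload)


# Made-up sample body, only to show the shape of the transformation.
sample = 'DirectorSequentialCallback({"totalCount": [1]});'
print(unwrap_jsonp(sample))
```

If the body turns out to be plain JSON, the function simply falls through to `json.loads`.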

#39830

From	Dave Angel <davea@davea.name>
Date	2013-02-24 19:30 -0500
Message-ID	<mailman.2465.1361752225.2939.python-list@python.org>
In reply to	#39825

On 02/24/2013 07:02 PM, Adam W. wrote:
> I'm trying to write a simple script to scrape http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day
>
> in order to send myself an email every day of the 99c movie of the day.
>
> However, using a simple command like (in Python 3.0):
> urllib.request.urlopen('http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day').read()
>
> I don't get all the source I need; it's just the navigation buttons.  Now I assume they are using some CSS/JavaScript witchcraft to load all the useful data later, so my question is: how do I make urllib "wait" and grab that data as well?
>

The CSS and the jpegs, and many other aspects of a web "page" are loaded 
explicitly, by the browser, when parsing the tags of the page you 
downloaded.  There is no sooner or later.  The website won't send the 
other files until you request them.

For example, that site at the moment has one image (probably a JPEG) 
highlighted:

<img class="gwt-Image" src="http://images2.vudu.com/poster2/179186-m" 
alt="Sex and the City: The Movie (Theatrical)">

If you want to look at that JPEG, you need to download the file at the URL 
specified by the src attribute of that img element.

Or perhaps you can just look at the 'alt' attribute, which is mainly 
there for browsers that don't render graphics, for example, screen 
readers for the blind.
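
Pulling the src and alt attributes out of markup like the tag quoted above can be sketched with the standard library's HTML parser. This only helps if the tag is actually present in the HTML that was downloaded; the class name `ImageCollector` and the `SAMPLE` string are illustrative.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect (src, alt) pairs from every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attributes = dict(attrs)
            self.images.append((attributes.get("src"), attributes.get("alt")))


# The tag quoted above, joined onto one line.
SAMPLE = (
    '<img class="gwt-Image" src="http://images2.vudu.com/poster2/179186-m" '
    'alt="Sex and the City: The Movie (Theatrical)">'
)

collector = ImageCollector()
collector.feed(SAMPLE)
for src, alt in collector.images:
    print(alt, "->", src)
```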

Naturally, there may be dozens of images on the page, and there's no 
guarantee that the website author is trying to make it easy for you. 
Why not check if there's a defined API for extracting the information 
you want?  Check the site, or send a message to the webmaster.

There's no guarantee that tomorrow the information won't be buried in some 
JavaScript fragment.  If you want to see that, you might need to 
write a JavaScript interpreter.  The script could use any algorithm at all to 
build the webpage information, and the encoding could change day by day, or 
hour by hour.

-- 
DaveA

#39848

From	"Adam W." <AWasilenko@gmail.com>
Date	2013-02-24 17:28 -0800
Message-ID	<6f637533-9d65-43dd-a810-08f8cf9a88b9@googlegroups.com>
In reply to	#39830

On Sunday, February 24, 2013 7:30:00 PM UTC-5, Dave Angel wrote:
> On 02/24/2013 07:02 PM, Adam W. wrote:
> > I'm trying to write a simple script to scrape http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day
> >
> > in order to send myself an email every day about the 99c movie of the day.
> >
> > However, using a simple command like (in Python 3.0):
> > urllib.request.urlopen('http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day').read()
> >
> > I don't get all the source I need; it's just the navigation buttons.  Now I assume they are using some CSS/JavaScript witchcraft to load all the useful data later, so my question is: how do I make urllib "wait" and grab that data as well?
>
> The CSS and the jpegs, and many other aspects of a web "page" are loaded
> explicitly, by the browser, when parsing the tags of the page you
> downloaded.  There is no sooner or later.  The website won't send the
> other files until you request them.
>
> For example, that site at the moment has one image (probably a JPEG)
> highlighted:
>
> <img class="gwt-Image" src="http://images2.vudu.com/poster2/179186-m"
> alt="Sex and the City: The Movie (Theatrical)">
>
> If you want to look at that JPEG, you need to download the file at the URL
> specified by the src attribute of that img element.
>
> Or perhaps you can just look at the 'alt' attribute, which is mainly
> there for browsers that don't render graphics, for example, screen
> readers for the blind.
>
> Naturally, there may be dozens of images on the page, and there's no
> guarantee that the website author is trying to make it easy for you.
> Why not check if there's a defined API for extracting the information
> you want?  Check the site, or send a message to the webmaster.
>
> There's no guarantee that tomorrow the information won't be buried in some
> JavaScript fragment.  If you want to see that, you might need to
> write a JavaScript interpreter.  The script could use any algorithm at all to
> build the webpage information, and the encoding could change day by day, or
> hour by hour.
>
> --
> DaveA

The problem is, the image URL you found is not returned in the data urllib grabs.  To be clear, I was aware of what urllib is supposed to do (i.e. not download image data when loading a page). I've used it many times before; I just never had to jump through hoops to get at the content I needed.

I'll look into how to find XHR requests in Chrome. I didn't know what that after-the-fact loading was called, so now my searching will be more productive.
