| Newsgroups | comp.lang.python |
|---|---|
| Date | 2013-02-24 17:28 -0800 |
| References | <e3e061a0-493b-4b8b-992b-a175dbecd4ac@googlegroups.com> <mailman.2465.1361752225.2939.python-list@python.org> |
| Subject | Re: using urllib on a more complex site |
| From | "Adam W." <AWasilenko@gmail.com> |
| Message-ID | <mailman.2473.1361755689.2939.python-list@python.org> |
On Sunday, February 24, 2013 7:30:00 PM UTC-5, Dave Angel wrote:
> On 02/24/2013 07:02 PM, Adam W. wrote:
> > I'm trying to write a simple script to scrape http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day
> > in order to send myself an email every day of the 99c movie of the day.
> >
> > However, using a simple command like (in Python 3.0):
> > urllib.request.urlopen('http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day').read()
> >
> > I don't get all the source I need; it's just the navigation buttons. Now I assume they are using some CSS/JavaScript witchcraft to load all the useful data later, so my question is how do I make urllib "wait" and grab that data as well?
>
> The CSS and the jpegs, and many other aspects of a web "page" are loaded
> explicitly, by the browser, when parsing the tags of the page you
> downloaded. There is no sooner or later. The website won't send the
> other files until you request them.
>
> For example, that site at the moment has one image (prob. jpeg)
> highlighted,
>
> <img class="gwt-Image" src="http://images2.vudu.com/poster2/179186-m"
> alt="Sex and the City: The Movie (Theatrical)">
>
> If you want to look at that jpeg, you need to download the file URL
> specified by the src attribute of that img element.
>
> Or perhaps you can just look at the 'alt' attribute, which is mainly
> there for browsers that don't happen to do graphics, for example, the
> ones for the blind.
>
> Naturally, there may be dozens of images on the page, and there's no
> guarantee that the website author is trying to make it easy for you.
> Why not check if there's a defined API for extracting the information
> you want? Check the site, or send a message to the webmaster.
>
> No guarantee that tomorrow, the information won't be buried in some
> JavaScript fragment. Again, if you want to see that, you might need to
> write a JavaScript interpreter. It could use any algorithm at all to
> build webpage information, and the encoding could change day by day, or
> hour by hour.
>
> --
> DaveA
The problem is that the image URL you found is not in the data urllib returns. To be clear, I know what urllib is supposed to do (i.e. not download image data when loading a page) and I've used it many times before; I've just never had to jump through hoops to get at the content I needed.
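One detail about the URL itself may explain part of this: everything after the `#` is a fragment, which the browser keeps for the page's own JavaScript and does not send to the server. The sketch below uses only the standard library's `urllib.parse.urldefrag` to show which part actually goes over the wire; the variable names are just for illustration.

```python
from urllib.parse import urldefrag

PAGE_URL = ("http://www.vudu.com/movies/"
            "#tag/99centOfTheDay/99c%20Rental%20of%20the%20day")

# Split the URL into the part that is requested and the client-side fragment.
requested_url, fragment = urldefrag(PAGE_URL)

print(requested_url)  # the only part the server is asked for
print(fragment)       # left for the page's JavaScript to interpret
```

So whatever tag follows the `#`, the server is asked for the same bare page, which is presumably why only the navigation comes back.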
I'll look into how to find XHR requests in Chrome. I didn't know what that after-the-fact loading was called, so now my searching will be more productive.
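Once the background request has been spotted in Chrome's developer tools (Network tab, XHR filter), the rough idea would be to call that URL directly instead of the page. This is only a sketch under assumptions: `XHR_URL`, the `X-Requested-With` header, and the JSON layout with `movies` and `title` keys are all hypothetical placeholders, not anything the site is known to use.

```python
import json
import urllib.request

# Hypothetical endpoint: replace with whatever URL shows up under
# Network > XHR in Chrome's developer tools when the page loads.
XHR_URL = "http://example.com/api/movies?tag=99centOfTheDay"


def build_request(url):
    """Build a request resembling the browser's background call."""
    return urllib.request.Request(
        url,
        headers={
            "Accept": "application/json",
            # Some sites check for this header; whether this one does is unknown.
            "X-Requested-With": "XMLHttpRequest",
        },
    )


def extract_titles(payload):
    """Pull titles out of a response body.

    Assumes (hypothetically) a JSON body shaped like
    {"movies": [{"title": "..."}]}; the real layout will differ.
    """
    data = json.loads(payload.decode("utf-8"))
    return [movie["title"] for movie in data.get("movies", [])]


def fetch_titles(url=XHR_URL):
    """Fetch the endpoint and return the titles it lists."""
    with urllib.request.urlopen(build_request(url)) as response:
        return extract_titles(response.read())


# Offline check with a made-up payload, so no network access is needed.
sample = b'{"movies": [{"title": "Example Movie"}]}'
print(extract_titles(sample))
```

Keeping the parsing in `extract_titles` separate from the network call means it can be tried against a response body saved from the browser before wiring up the daily email.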
using urllib on a more complex site "Adam W." <AWasilenko@gmail.com> - 2013-02-24 16:02 -0800
Re: using urllib on a more complex site Chris Rebert <clp2@rebertia.com> - 2013-02-24 16:27 -0800
Re: using urllib on a more complex site "Adam W." <AWasilenko@gmail.com> - 2013-02-24 17:32 -0800
Re: using urllib on a more complex site Dave Angel <davea@davea.name> - 2013-02-24 19:30 -0500
Re: using urllib on a more complex site "Adam W." <AWasilenko@gmail.com> - 2013-02-24 17:28 -0800