Re: using urllib on a more complex site

Newsgroups	comp.lang.python
Date	Sun, 24 Feb 2013 17:28:00 -0800 (PST)
Subject	Re: using urllib on a more complex site
From	"Adam W." <AWasilenko@gmail.com>
Message-ID	<mailman.2473.1361755689.2939.python-list@python.org>

On Sunday, February 24, 2013 7:30:00 PM UTC-5, Dave Angel wrote:
> On 02/24/2013 07:02 PM, Adam W. wrote:
> > I'm trying to write a simple script to scrape http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day
> >
> > in order to send myself an email every day of the 99c movie of the day.
> >
> > However, using a simple command like (in Python 3.0):
> > urllib.request.urlopen('http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day').read()
> >
> > I don't get all the source I need; it's just the navigation buttons.  Now I assume they are using some CSS/JavaScript witchcraft to load all the useful data later, so my question is: how do I make urllib "wait" and grab that data as well?
> >
>
> The CSS and the jpegs, and many other aspects of a web "page" are loaded
> explicitly, by the browser, when parsing the tags of the page you
> downloaded.  There is no sooner or later.  The website won't send the
> other files until you request them.
>
> For example, that site at the moment has one image (prob. jpeg)
> highlighted,
>
> <img class="gwt-Image" src="http://images2.vudu.com/poster2/179186-m"
> alt="Sex and the City: The Movie (Theatrical)">
>
> If you want to look at that jpeg, you need to download the file URL
> specified by the src attribute of that img element.
>
> Or perhaps you can just look at the 'alt' attribute, which is mainly
> there for browsers that don't happen to do graphics, for example, the
> ones for the blind.
>
> Naturally, there may be dozens of images on the page, and there's no
> guarantee that the website author is trying to make it easy for you.
> Why not check if there's a defined API for extracting the information
> you want?  Check the site, or send a message to the webmaster.
>
> There's no guarantee that tomorrow the information won't be buried in
> some JavaScript fragment.  Again, if you want to see that, you might
> need to write a JavaScript interpreter.  It could use any algorithm at
> all to build webpage information, and the encoding could change day by
> day, or hour by hour.
>
> -- 
> DaveA

The problem is that the image URL you found is not in the data urllib returns.  To be clear, I know what urllib is supposed to do (i.e., it does not download image data when loading a page).  I've used it many times before; I've just never had to jump through hoops to get at the content I needed.
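
One way to confirm this is to parse whatever urllib returns and list the <img> tags in it. The sketch below uses only the standard library. ImgCollector, find_images, STATIC_SHELL, and RENDERED_SNIPPET are illustrative names; STATIC_SHELL is a made-up stand-in for the bare page, not vudu.com's real markup, and only the img tag is taken from the quoted message. It also splits off the part of the URL after '#', which is a fragment: fragments are handled by the browser and are not part of the HTTP request, so the server is only ever asked for the base page.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class ImgCollector(HTMLParser):
    """Collect (src, alt) pairs from every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples; tag names arrive lowercased.
        if tag == "img":
            attr_map = dict(attrs)
            self.images.append((attr_map.get("src"), attr_map.get("alt")))


def find_images(html_text):
    """Return the (src, alt) pairs found in html_text."""
    parser = ImgCollector()
    parser.feed(html_text)
    parser.close()
    return parser.images


# Split the URL from the original question into the part that is requested
# and the fragment that stays in the browser.
page_url, fragment = urldefrag(
    "http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day"
)

# Hypothetical stand-in for the HTML that comes back before any script runs.
STATIC_SHELL = (
    "<html><body><div id='nav'>Movies</div>"
    "<script src='app.js'></script></body></html>"
)

# The img element quoted above, as a browser shows it after scripts have run.
RENDERED_SNIPPET = (
    '<img class="gwt-Image" src="http://images2.vudu.com/poster2/179186-m" '
    'alt="Sex and the City: The Movie (Theatrical)">'
)

print(page_url)
print(find_images(STATIC_SHELL))
print(find_images(RENDERED_SNIPPET))
```

Running find_images on the real output of urllib.request.urlopen(page_url).read().decode(...) would show the same thing for the live page: if the list comes back empty, the images are being added by script after the initial download.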

I'll look into how to find XHR requests in Chrome.  I didn't know what that after-the-fact loading was called, so now my searching will be more productive.
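
Once an XHR request shows up in the browser's network tools, the same URL can usually be requested directly with urllib. The following is a rough sketch under several assumptions: the endpoint returns UTF-8 JSON, it needs no cookies or authentication, and the URL in the usage comment is a placeholder, not a real vudu.com endpoint. fetch_json and its opener parameter are hypothetical names; opener exists only so the function can be exercised without touching the network.

```python
import json
import urllib.request


def fetch_json(url, opener=urllib.request.urlopen):
    """Request url the way a script on the page might, and decode JSON.

    Assumes the response body is UTF-8 encoded JSON. Some sites wrap or
    prefix their responses, in which case the body needs cleaning first.
    """
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/json",
            # Many JavaScript libraries send this header with XHR calls;
            # whether a given server cares about it is site-specific.
            "X-Requested-With": "XMLHttpRequest",
        },
    )
    with opener(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage, with a placeholder URL copied from the browser's network tools:
# data = fetch_json("http://example.invalid/path/seen/in/network/tab")
```

If the response turns out not to be plain JSON, print the raw bytes first and work out the format from there before parsing.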

Thread

using urllib on a more complex site "Adam W." <AWasilenko@gmail.com> - 2013-02-24 16:02 -0800
  Re: using urllib on a more complex site Chris Rebert <clp2@rebertia.com> - 2013-02-24 16:27 -0800
    Re: using urllib on a more complex site "Adam W." <AWasilenko@gmail.com> - 2013-02-24 17:32 -0800
  Re: using urllib on a more complex site Dave Angel <davea@davea.name> - 2013-02-24 19:30 -0500
    Re: using urllib on a more complex site "Adam W." <AWasilenko@gmail.com> - 2013-02-24 17:28 -0800
