From: "Adam W."
Newsgroups: comp.lang.python
Subject: Re: using urllib on a more complex site
Date: Sun, 24 Feb 2013 17:28:00 -0800 (PST)

On Sunday, February 24, 2013 7:30:00 PM UTC-5, Dave Angel wrote:
> On 02/24/2013 07:02 PM, Adam W. wrote:
> > I'm trying to write a simple script to scrape http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day
> >
> > in order to send myself an email every day of the 99c movie of the day.
> >
> > However, using a simple command like (in Python 3.0):
> > urllib.request.urlopen('http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day').read()
> >
> > I don't get the all the source I need, its just the navigation buttons. Now I assume they are using some CSS/javascript witchcraft to load all the useful data later, so my question is how do I make urllib "wait" and grab that data as well?
> >
>
> The CSS and the jpegs, and many other aspects of a web "page" are loaded
> explicitly, by the browser, when parsing the tags of the page you
> downloaded. There is no sooner or later. The website won't send the
> other files until you request them.
>
> For example, that site at the moment has one image (prob. jpeg)
> highlighted,
>
> alt="Sex and the City: The Movie (Theatrical)">
>
> if you want to look at that jpeg, you need to download the file url
> specified by the src attribute of that img element.
>
> Or perhaps you can just look at the 'alt' attribute, which is mainly
> there for browsers who don't happen to do graphics, for example, the
> ones for the blind.
>
> Naturally, there may be dozens of images on the page, and there's no
> guarantee that the website author is trying to make it easy for you.
> Why not check if there's a defined api for extracting the information
> you want? Check the site, or send a message to the webmaster.
>
> No guarantee that tomorrow, the information won't be buried in some
> javascript fragment. Again, if you want to see that, you might need to
> write a javascript interpreter. it could use any algorithm at all to
> build webpage information, and the encoding could change day by day, or
> hour by hour.
>
> --
> DaveA

The problem is that the image URL you found is not returned in the data urllib grabs. To be clear, I was aware of what urllib is supposed to do (i.e. not download image data when loading a page). I've used it many times before; I just never had to jump through hoops to get at the content I needed.

I'll look into how to find XHR requests in Chrome. I didn't know what that after-the-fact loading was called, so now my searching will be more productive.
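
For anyone following along, here is a rough sketch of the approach I have in mind. One thing worth noting: everything after the '#' in that URL is a fragment, and fragments are never sent to the server, so urllib only ever asks for http://www.vudu.com/movies/ and gets the bare navigation shell back. The helper names, the endpoint argument, and the JSON shape below are my own placeholders, not anything Vudu documents; the real request URL and response format would have to be read off Chrome's Network tab.

```python
import json
import urllib.request
from urllib.parse import urldefrag

PAGE_URL = ("http://www.vudu.com/movies/"
            "#tag/99centOfTheDay/99c%20Rental%20of%20the%20day")


def url_sent_to_server(url):
    """Return the URL with its '#fragment' removed.

    The fragment is handled client-side (here, presumably by the site's
    JavaScript), so the server never sees it.
    """
    return urldefrag(url).url


def fetch_json(endpoint, timeout=10):
    """GET an XHR-style endpoint and decode its JSON body.

    `endpoint` is whatever URL shows up in the browser's Network tab;
    nothing here is specific to any real site API.
    """
    request = urllib.request.Request(
        endpoint, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return json.loads(response.read().decode(charset))


def first_title(payload):
    """Extract a title from a HYPOTHETICAL payload shaped like
    {"content": [{"title": "..."}]}; adjust to the real response.
    """
    items = payload.get("content", [])
    return items[0]["title"] if items else None


if __name__ == "__main__":
    # Shows what urllib actually requests for the page URL above.
    print(url_sent_to_server(PAGE_URL))
```

Once the real XHR URL is known, something like first_title(fetch_json(that_url)) should be all the daily email script needs, assuming the response really is JSON.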