Newsgroups: comp.lang.python
Date: Sun, 24 Feb 2013 17:32:29 -0800 (PST)
Subject: Re: using urllib on a more complex site
From: "Adam W."

On Sunday, February 24, 2013 7:27:54 PM UTC-5, Chris Rebert wrote:
> On Sunday, February 24, 2013, Adam W. wrote:
> > I'm trying to write a simple script to scrape http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day
> > in order to send myself an email every day of the 99c movie of the day.
> >
> > However, using a simple command like (in Python 3.0):
> >
> > urllib.request.urlopen('http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day').read()
> >
> > I don't get all the source I need; it's just the navigation buttons. Now I assume they are using some CSS/javascript witchcraft to load all the useful data later, so my question is how do I make urllib "wait" and grab that data as well?
>
> urllib isn't a web browser. It just requests the single (in this case, HTML) file from the given URL.
> It does not parse the HTML (indeed, it doesn't care what kind of file you're dealing with); therefore, it obviously does not retrieve the other resources linked within the document (CSS, JS, images, etc.) nor does it run any JavaScript. So, there's nothing to "wait" for; urllib is already doing everything it was designed to do.
>
> Your best bet is to open the page in a web browser yourself and use the developer tools/inspectors to watch what XHR requests the page's scripts are making, find the one(s) that have the data you care about, and then make those requests instead via urllib (or the `requests` 3rd-party lib, or whatever). If the URL(s) vary, reverse-engineering the scheme used to generate them will also be required.
>
> Alternatively, you could use something like Selenium, which lets you drive an actual full web browser (e.g. Firefox) from Python.
>
> Cheers,
> Chris
> --
> http://rebertia.com

Huzzah! Found it:

http://apicache.vudu.com/api2/claimedAppId/myvudu/format/application*2Fjson/callback/DirectorSequentialCallback/_type/contentSearch/count/30/dimensionality/any/followup/ratingsSummaries/followup/totalCount/offset/0/tag/99centOfTheDay/type/program/type/season/type/episode/type/bundle

Thanks for the tip about XHRs.
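
For anyone following along, here is a rough sketch of requesting that endpoint directly with urllib instead of the page URL. Several things in it are assumptions, not anything confirmed in this thread: the `callback/DirectorSequentialCallback` segment suggests the server wraps its JSON in a JavaScript function call (JSONP), so the sketch strips such a wrapper if one is present; the helper names `strip_jsonp` and `fetch_json` are made up for illustration; and the endpoint itself may have changed or disappeared since this was posted.

```python
import json
import re
import urllib.request

# The endpoint found above, copied verbatim from the post.
API_URL = (
    "http://apicache.vudu.com/api2/claimedAppId/myvudu/format/"
    "application*2Fjson/callback/DirectorSequentialCallback/_type/contentSearch/"
    "count/30/dimensionality/any/followup/ratingsSummaries/followup/totalCount/"
    "offset/0/tag/99centOfTheDay/type/program/type/season/type/episode/type/bundle"
)


def strip_jsonp(text):
    """Return the payload inside 'callback(...)', or the text unchanged.

    Assumption: if the response is JSONP, it looks like a single
    function call wrapping the JSON, optionally ending in a semicolon.
    """
    match = re.fullmatch(r"\s*[\w.$]+\s*\((.*)\)\s*;?\s*", text, re.DOTALL)
    return match.group(1) if match else text


def fetch_json(url, timeout=10):
    """Request the URL with urllib and decode the (possibly JSONP) body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 if the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        body = response.read().decode(charset)
    return json.loads(strip_jsonp(body))
```

Calling `fetch_json(API_URL)` should then hand back ordinary Python dicts and lists. The layout of that data is not described anywhere in this thread, so inspect the result before relying on any particular key.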