Re: using urllib on a more complex site

Newsgroups comp.lang.python
Date Sun, 24 Feb 2013 17:32:29 -0800 (PST)
Subject Re: using urllib on a more complex site
From "Adam W." <AWasilenko@gmail.com>
Message-ID <mailman.2474.1361755952.2939.python-list@python.org> (permalink)

On Sunday, February 24, 2013 7:27:54 PM UTC-5, Chris Rebert wrote:
> On Sunday, February 24, 2013, Adam W. wrote:
>> I'm trying to write a simple script to scrape http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day in order to send myself an email every day with the 99c movie of the day.
>>
>> However, using a simple command like (in Python 3.0):
>>
>> urllib.request.urlopen('http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day').read()
>>
>> I don't get all the source I need; it's just the navigation buttons. I assume they are using some CSS/JavaScript witchcraft to load all the useful data later, so my question is: how do I make urllib "wait" and grab that data as well?
>
> urllib isn't a web browser. It just requests the single (in this case, HTML) file from the given URL. It does not parse the HTML (indeed, it doesn't care what kind of file you're dealing with); therefore, it obviously does not retrieve the other resources linked within the document (CSS, JS, images, etc.) nor does it run any JavaScript. So, there's nothing to "wait" for; urllib is already doing everything it was designed to do.
>
> Your best bet is to open the page in a web browser yourself and use the developer tools/inspectors to watch what XHR requests the page's scripts are making, find the one(s) that have the data you care about, and then make those requests instead via urllib (or the `requests` 3rd-party lib, or whatever). If the URL(s) vary, reverse-engineering the scheme used to generate them will also be required.
>
> Alternatively, you could use something like Selenium, which lets you drive an actual full web browser (e.g. Firefox) from Python.
>
> Cheers,
> Chris
> --
> http://rebertia.com

Huzzah! Found it: http://apicache.vudu.com/api2/claimedAppId/myvudu/format/application*2Fjson/callback/DirectorSequentialCallback/_type/contentSearch/count/30/dimensionality/any/followup/ratingsSummaries/followup/totalCount/offset/0/tag/99centOfTheDay/type/program/type/season/type/episode/type/bundle

Thanks for the tip about XHRs.
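
For anyone following the same route, here is a rough sketch of requesting that endpoint directly with urllib. The helper names `strip_jsonp` and `fetch_json` are made up for illustration. The URL contains `callback/DirectorSequentialCallback`, so I am assuming (not verified) that the body may come back wrapped JSONP-style as `DirectorSequentialCallback(...)`; the real wrapper and the layout of the returned data may well differ, so inspect the raw response first.

```python
import json
import re
import urllib.request

# URL copied verbatim from the post above; whether it still responds is unknown.
API_URL = (
    "http://apicache.vudu.com/api2/claimedAppId/myvudu/format/application*2Fjson"
    "/callback/DirectorSequentialCallback/_type/contentSearch/count/30"
    "/dimensionality/any/followup/ratingsSummaries/followup/totalCount"
    "/offset/0/tag/99centOfTheDay/type/program/type/season/type/episode/type/bundle"
)


def strip_jsonp(text):
    """Return the payload of 'callback(<json>)'; plain JSON passes through unchanged.

    The 'name(...)' wrapper is an assumption about the response format.
    """
    match = re.match(r"^\s*[\w.$]+\s*\((.*)\)\s*;?\s*$", text, re.DOTALL)
    return match.group(1) if match else text.strip()


def fetch_json(url, timeout=10):
    """Request the URL with urllib and parse the body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 if the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        body = response.read().decode(charset)
    return json.loads(strip_jsonp(body))
```

Calling `fetch_json(API_URL)` should then give a regular Python object to dig through for the movie title, without urllib having to run any JavaScript.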

Thread

using urllib on a more complex site "Adam W." <AWasilenko@gmail.com> - 2013-02-24 16:02 -0800
  Re: using urllib on a more complex site Chris Rebert <clp2@rebertia.com> - 2013-02-24 16:27 -0800
    Re: using urllib on a more complex site "Adam W." <AWasilenko@gmail.com> - 2013-02-24 17:32 -0800
  Re: using urllib on a more complex site Dave Angel <davea@davea.name> - 2013-02-24 19:30 -0500
    Re: using urllib on a more complex site "Adam W." <AWasilenko@gmail.com> - 2013-02-24 17:28 -0800