


Re: using urllib on a more complex site

Date Sun, 24 Feb 2013 16:27:54 -0800
Subject Re: using urllib on a more complex site
From Chris Rebert <clp2@rebertia.com>
To "Adam W." <AWasilenko@gmail.com>
Newsgroups comp.lang.python
Message-ID <mailman.2463.1361752083.2939.python-list@python.org> (permalink)


On Sunday, February 24, 2013, Adam W. wrote:

> I'm trying to write a simple script to scrape
> http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day
>
> in order to send myself an email every day of the 99c movie of the day.
>
> However, using a simple command like (in Python 3.0):
> urllib.request.urlopen('http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day').read()
>
> I don't get all the source I need; it's just the navigation buttons.
> Now I assume they are using some CSS/JavaScript witchcraft to load all the
> useful data later, so my question is how do I make urllib "wait" and grab
> that data as well?
>

urllib isn't a web browser. It just requests the single (in this case,
HTML) file from the given URL. It does not parse the HTML (indeed, it
doesn't care what kind of file you're dealing with); therefore, it
obviously does not retrieve the other resources linked within the document
(CSS, JS, images, etc.) nor does it run any JavaScript. So, there's nothing
to "wait" for; urllib is already doing everything it was designed to do.
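A small local experiment may make this concrete. The sketch below is only an illustration: the page content, the `/app.js` script name, and the `#tag/demo` fragment are made-up stand-ins, and the server is a throwaway one on localhost. It records every path the server is asked for, then fetches the page with urllib.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener

# Stand-in page (made up): an empty container that a script would fill in.
PAGE = (b'<html><body><div id="movies"></div>'
        b'<script src="/app.js"></script></body></html>')
requested_paths = []  # every path the local server is asked for


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        requested_paths.append(self.path)
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_once():
    # Port 0 asks the OS for any free port.
    server = HTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = "http://127.0.0.1:%d/#tag/demo" % server.server_port
        # An empty ProxyHandler bypasses any proxy settings in the environment.
        opener = build_opener(ProxyHandler({}))
        with opener.open(url) as resp:
            return resp.read()
    finally:
        server.shutdown()
        server.server_close()


body = fetch_once()
print(requested_paths)
print(body.decode("ascii"))
```

The server sees exactly one request, for `/`. The `#tag/demo` part is a URL fragment, which is never sent over the wire, and `/app.js` is never requested because nothing parses the HTML or runs the script. What comes back is the raw markup with the empty `div`.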

Your best bet is to open the page in a web browser yourself and use the
developer tools/inspectors to watch what XHR requests the page's scripts
are making, find the one(s) that have the data you care about, and then
make those requests instead via urllib (or the `requests` 3rd-party lib, or
whatever). If the URL(s) vary, reverse-engineering the scheme used to
generate them will also be required.
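As a rough sketch of that approach, assuming the developer tools reveal a JSON endpoint: the URL, the query parameters (`tag`, `format`), and the response shape (`items` with a `title` each) below are all hypothetical placeholders, not the site's real API. You would substitute whatever the network inspector actually shows.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint; replace with the XHR URL seen in the browser's tools.
API_URL = "https://example.invalid/api/search"


def build_request(tag):
    """Build the same kind of request the page's script would make."""
    query = urlencode({"tag": tag, "format": "json"})  # assumed parameters
    return Request(API_URL + "?" + query,
                   headers={"Accept": "application/json"})


def extract_titles(payload):
    """Pull titles out of an assumed {"items": [{"title": ...}]} response."""
    data = json.loads(payload)
    return [item["title"] for item in data.get("items", [])]


def fetch_titles(tag):
    """Perform the request and parse the reply (needs a real endpoint)."""
    with urlopen(build_request(tag)) as resp:
        return extract_titles(resp.read().decode("utf-8"))
```

Keeping the request building and the parsing separate from the network call makes it easier to adjust each piece once the real URL scheme and response format are known.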

Alternatively, you could use something like Selenium, which lets you drive
an actual full web browser (e.g. Firefox) from Python.
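A minimal sketch of the Selenium route, assuming the third-party `selenium` package plus Firefox and its driver are installed. The CSS selector is a hypothetical placeholder for whatever element holds the data once the page's scripts have run; the imports sit inside the function so the module loads even where Selenium is missing.

```python
def fetch_rendered_html(url, css_selector, timeout=10):
    """Load url in a real browser and return the HTML after scripts have run."""
    # Third-party dependency (assumed installed): pip install selenium
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Wait until the (hypothetical) element appears in the rendered page.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if the wait times out
```

Here the waiting the original question asked about happens in the browser rather than in urllib: the explicit wait polls until the selector matches or the timeout expires.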

Cheers,
Chris


-- 
http://rebertia.com
