Re: using urllib on a more complex site

Date Sun, 24 Feb 2013 19:30:00 -0500
From Dave Angel <davea@davea.name>
Subject Re: using urllib on a more complex site
Newsgroups comp.lang.python



On 02/24/2013 07:02 PM, Adam W. wrote:
> I'm trying to write a simple script to scrape http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day
>
> in order to send myself an email every day of the 99c movie of the day.
>
> However, using a simple command like (in Python 3.0):
> urllib.request.urlopen('http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day').read()
>
> I don't get all the source I need; it's just the navigation buttons.  Now I assume they are using some CSS/JavaScript witchcraft to load all the useful data later, so my question is: how do I make urllib "wait" and grab that data as well?
>

The CSS, the jpegs, and many other parts of a web "page" are requested 
explicitly by the browser as it parses the tags of the page you 
downloaded.  There is no "later" for urllib to wait for: the website 
won't send the other files until you request them.
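
As a rough sketch of that second step, assuming Python 3's standard 
html.parser module: the class name ResourceCollector, the list of tags, 
and the sample markup are all made up for illustration, and a real 
browser follows many more kinds of references than these three.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tag/attribute pairs that make a browser issue a follow-up request.
# (A partial list, chosen for illustration.)
RESOURCE_ATTRS = {"img": "src", "script": "src", "link": "href"}


class ResourceCollector(HTMLParser):
    """Collect the URLs a browser would request after parsing a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = RESOURCE_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative references against the page URL.
                self.urls.append(urljoin(self.base_url, value))


# Hypothetical page source; a real script would get it from urlopen().read().
SAMPLE_PAGE = """
<html><head>
<link rel="stylesheet" href="/css/site.css">
<script src="js/app.js"></script>
</head><body>
<img src="http://images2.vudu.com/poster2/179186-m" alt="poster">
</body></html>
"""

collector = ResourceCollector("http://www.vudu.com/movies/")
collector.feed(SAMPLE_PAGE)
for url in collector.urls:
    print(url)
```

Each URL printed is a separate request that urllib will only make if 
your script asks for it.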

For example, that site at the moment has one image (probably a jpeg) 
highlighted:

<img class="gwt-Image" src="http://images2.vudu.com/poster2/179186-m" 
alt="Sex and the City: The Movie (Theatrical)">

If you want to look at that jpeg, you need to download the file at the 
URL specified by the src attribute of that img element.

Or perhaps you can just look at the 'alt' attribute, which is mainly 
there for browsers that don't render graphics, for example, the 
ones for the blind.
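
Here is a minimal sketch of reading both attributes with the standard 
html.parser module.  PosterFinder and fetch_image are hypothetical names, 
and matching on class="gwt-Image" assumes the site keeps using that 
class, which it may not.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class PosterFinder(HTMLParser):
    """Record (src, alt) for each <img class="gwt-Image"> element."""

    def __init__(self):
        super().__init__()
        self.posters = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("class") == "gwt-Image":
            self.posters.append((attrs.get("src"), attrs.get("alt")))


def fetch_image(src):
    """Download the image bytes; this is the second, explicit request."""
    with urlopen(src) as response:
        return response.read()


# The img tag as it appeared on the site when this was written.
TAG = ('<img class="gwt-Image" src="http://images2.vudu.com/poster2/179186-m" '
       'alt="Sex and the City: The Movie (Theatrical)">')

finder = PosterFinder()
finder.feed(TAG)
src, alt = finder.posters[0]
print(alt)  # the title text; no image download is needed for this
```

fetch_image(src) is only needed if you want the picture itself; for an 
email with the title, the alt text may be enough.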

Naturally, there may be dozens of images on the page, and there's no 
guarantee that the website author is trying to make it easy for you. 
Why not check whether there's a defined API for extracting the 
information you want?  Check the site, or send a message to the webmaster.

There's also no guarantee that tomorrow the information won't be buried 
in some JavaScript fragment.  If you want to see that, you might need to 
write a JavaScript interpreter.  The script could use any algorithm at 
all to build the page's information, and the encoding could change day 
by day, or hour by hour.
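
To illustrate how fragile this gets, here is a sketch against an 
entirely hypothetical page where the title exists only inside a script. 
The variable name "deal" and its layout are invented; this only works 
when the data happens to sit in the source as a literal that is also 
valid JSON, and it breaks as soon as the script computes the value 
instead.

```python
import json
import re

# Hypothetical page: the title appears only inside a script, not in any tag.
PAGE = """
<html><body>
<div id="content"></div>
<script>
var deal = {"title": "Some Movie", "price": 0.99};
render(deal);
</script>
</body></html>
"""

# There is no <img> element, so an alt-attribute scraper finds nothing.
print(re.findall(r"<img\b", PAGE))

# Brittle workaround: cut out the object literal and hope it parses as JSON.
match = re.search(r"var deal = (\{.*?\});", PAGE)
deal = json.loads(match.group(1)) if match else None
print(deal)
```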

-- 
DaveA



Thread

using urllib on a more complex site "Adam W." <AWasilenko@gmail.com> - 2013-02-24 16:02 -0800
  Re: using urllib on a more complex site Chris Rebert <clp2@rebertia.com> - 2013-02-24 16:27 -0800
    Re: using urllib on a more complex site "Adam W." <AWasilenko@gmail.com> - 2013-02-24 17:32 -0800
  Re: using urllib on a more complex site Dave Angel <davea@davea.name> - 2013-02-24 19:30 -0500
    Re: using urllib on a more complex site "Adam W." <AWasilenko@gmail.com> - 2013-02-24 17:28 -0800
