| From | "Adam W." <AWasilenko@gmail.com> |
|---|---|
| Subject | Re: using urllib on a more complex site |
| Newsgroups | comp.lang.python |
| Cc | python-list@python.org |
| Date | Sun, 24 Feb 2013 17:28:00 -0800 (PST) |
| In-Reply-To | <mailman.2465.1361752225.2939.python-list@python.org> |
| References | <e3e061a0-493b-4b8b-992b-a175dbecd4ac@googlegroups.com> <mailman.2465.1361752225.2939.python-list@python.org> |
| Message-ID | <mailman.2473.1361755689.2939.python-list@python.org> |
On Sunday, February 24, 2013 7:30:00 PM UTC-5, Dave Angel wrote:
> On 02/24/2013 07:02 PM, Adam W. wrote:
> > I'm trying to write a simple script to scrape http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day
> > in order to send myself an email every day of the 99c movie of the day.
> >
> > However, using a simple command like (in Python 3.0):
> > urllib.request.urlopen('http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day').read()
> >
> > I don't get all the source I need, it's just the navigation buttons. Now I assume they are using some CSS/javascript witchcraft to load all the useful data later, so my question is how do I make urllib "wait" and grab that data as well?
>
> The CSS and the jpegs, and many other aspects of a web "page" are loaded
> explicitly, by the browser, when parsing the tags of the page you
> downloaded. There is no sooner or later. The website won't send the
> other files until you request them.
>
> For example, that site at the moment has one image (prob. jpeg)
> highlighted,
>
> <img class="gwt-Image" src="http://images2.vudu.com/poster2/179186-m"
> alt="Sex and the City: The Movie (Theatrical)">
>
> If you want to look at that jpeg, you need to download the file url
> specified by the src attribute of that img element.
>
> Or perhaps you can just look at the 'alt' attribute, which is mainly
> there for browsers that don't happen to do graphics, for example, the
> ones for the blind.
>
> Naturally, there may be dozens of images on the page, and there's no
> guarantee that the website author is trying to make it easy for you.
> Why not check if there's a defined API for extracting the information
> you want? Check the site, or send a message to the webmaster.
>
> No guarantee that tomorrow, the information won't be buried in some
> javascript fragment. Again, if you want to see that, you might need to
> write a javascript interpreter. It could use any algorithm at all to
> build webpage information, and the encoding could change day by day, or
> hour by hour.
>
> --
> DaveA
The problem is that the image URL you found is not in the data urllib returns. To be clear, I was aware of what urllib is supposed to do (i.e., not download image data when loading a page). I've used it many times before; I've just never had to jump through hoops to get at the content I needed.
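
To illustrate the point, here is a minimal sketch of why the img tag never shows up. Everything after the '#' in the URL is a fragment that is handled in the browser and not sent to the server, so urllib only ever requests http://www.vudu.com/movies/. The `static_html` string below is a hypothetical stand-in for the navigation-only markup that comes back, not the site's real response, and `ImgCollector` and `find_images` are names made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag

PAGE_URL = ("http://www.vudu.com/movies/"
            "#tag/99centOfTheDay/99c%20Rental%20of%20the%20day")

# Split off the fragment: only url_sent is requested from the server.
url_sent, fragment = urldefrag(PAGE_URL)


class ImgCollector(HTMLParser):
    """Collect (src, alt) pairs from every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr_map = dict(attrs)
            self.images.append((attr_map.get("src"), attr_map.get("alt")))


def find_images(html_text):
    """Return a list of (src, alt) tuples found in html_text."""
    parser = ImgCollector()
    parser.feed(html_text)
    parser.close()
    return parser.images


# Hypothetical stand-in for the raw download (navigation markup only).
static_html = '<div id="nav"><a href="/movies/">Movies</a></div>'

# The tag quoted above, as it appears once the browser has run the scripts.
rendered_html = ('<img class="gwt-Image" '
                 'src="http://images2.vudu.com/poster2/179186-m" '
                 'alt="Sex and the City: The Movie (Theatrical)">')

print(url_sent)
print(find_images(static_html))
print(find_images(rendered_html))
```

Running the same `find_images` over the bytes from `urllib.request.urlopen(url_sent).read().decode()` is a quick way to confirm whether the data is in the initial response at all.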
I'll look into how to find XHR requests in Chrome. I didn't know what that after-the-fact loading was called, so now my searching will be more productive.
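
For anyone following along, a rough sketch of what the next step might look like once the request shows up in Chrome's Network tab. The endpoint URL, the extra headers, and the JSON shape (`{"items": [{"title": ...}]}`) are all assumptions for illustration; I have not checked what the site actually sends, and the real response may not be JSON at all.

```python
import json
import urllib.request

# Hypothetical endpoint: substitute the request URL seen in the Network tab.
XHR_URL = "http://example.invalid/api/search?tag=99centOfTheDay"


def build_request(url):
    """Build a GET request with headers similar to a browser's XHR call."""
    return urllib.request.Request(url, headers={
        "Accept": "application/json",
        "X-Requested-With": "XMLHttpRequest",  # some servers check this
        "User-Agent": "Mozilla/5.0",
    })


def extract_titles(payload):
    """Pull titles out of JSON text, assuming {"items": [{"title": ...}]}."""
    data = json.loads(payload)
    return [item["title"] for item in data.get("items", [])]


def fetch_titles(url=XHR_URL):
    """Fetch the endpoint and return the titles (performs a network call)."""
    with urllib.request.urlopen(build_request(url), timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return extract_titles(resp.read().decode(charset))


# Offline check of the parsing step using made-up sample data.
sample = '{"items": [{"title": "Sex and the City: The Movie (Theatrical)"}]}'
print(extract_titles(sample))
```

Only `fetch_titles` touches the network; keeping the parsing in `extract_titles` makes it easy to adjust once the real response format is known.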
using urllib on a more complex site "Adam W." <AWasilenko@gmail.com> - 2013-02-24 16:02 -0800
Re: using urllib on a more complex site Chris Rebert <clp2@rebertia.com> - 2013-02-24 16:27 -0800
Re: using urllib on a more complex site "Adam W." <AWasilenko@gmail.com> - 2013-02-24 17:32 -0800
Re: using urllib on a more complex site Dave Angel <davea@davea.name> - 2013-02-24 19:30 -0500
Re: using urllib on a more complex site "Adam W." <AWasilenko@gmail.com> - 2013-02-24 17:28 -0800