MIME-Version: 1.0
Sender: chris@rebertia.com
In-Reply-To: <BANLkTimNciwLaZ=WmALJyAH26XzhM-hFWQ@mail.gmail.com>
References: <BANLkTimNciwLaZ=WmALJyAH26XzhM-hFWQ@mail.gmail.com>
Date: Tue, 12 Apr 2011 11:30:39 -0700
Subject: Re: download web pages that are updated by ajax
From: Chris Rebert <clp2@rebertia.com>
To: Jabba Laci <jabba.laci@gmail.com>
Content-Type: text/plain; charset=UTF-8
Cc: Python mailing list <python-list@python.org>
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.273.1302633041.9059.python-list@python.org>

On Tue, Apr 12, 2011 at 7:47 AM, Jabba Laci <jabba.laci@gmail.com> wrote:
> Hi,
>
> I want to download a web page that is updated by AJAX. The page
> requires no human interaction, it is updated automatically:
> http://www.ncbi.nlm.nih.gov/nuccore/CP002059.1
>
> If I download it with wget, I get a file of size 97 KB. The source is
> full of AJAX calls, i.e. the content of the page is not expanded.
> If I open it in a browser and save it manually, the result is a file
> of almost 5 MB whose content is expanded.
>
> (1) How to download such a page with Python? I need the post-AJAX
> version of the page.

I've heard you can drive a web browser using Selenium
(http://code.google.com/p/selenium/ ), have it visit the webpage and
run the JavaScript on it, and then grab the final result.
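
Something along these lines might work. It's an untested sketch, not
something I've run against that NCBI page. It assumes the Selenium Python
bindings plus Firefox are installed. The helper name fetch_rendered_html
and its parameters are made up for illustration, and "the page size has
stopped changing" is only a crude guess at when the AJAX calls are done:

```python
import time


def fetch_rendered_html(url, driver=None, settle_checks=3,
                        poll_interval=1.0, timeout=60.0):
    """Load `url` in a real browser and return the page source after
    its scripts have had a chance to run.

    Heuristic (an assumption, not a guarantee): the page is treated as
    finished once len(page_source) is unchanged for `settle_checks`
    consecutive polls, or when `timeout` seconds have passed.
    """
    own_driver = driver is None
    if own_driver:
        # Imported lazily so the helper can also be tried with a stand-in
        # driver object when Selenium is not installed.
        from selenium import webdriver
        driver = webdriver.Firefox()  # assumes Firefox is available
    try:
        driver.get(url)
        deadline = time.monotonic() + timeout
        html = driver.page_source
        last_size, stable = len(html), 0
        while stable < settle_checks and time.monotonic() < deadline:
            time.sleep(poll_interval)
            html = driver.page_source
            if len(html) == last_size:
                stable += 1          # no growth since the last poll
            else:
                last_size, stable = len(html), 0  # still changing; start over
        return html
    finally:
        if own_driver:
            driver.quit()  # only close a browser we opened ourselves


# Possible usage (opens a browser window and needs network access):
#   page = fetch_rendered_html("http://www.ncbi.nlm.nih.gov/nuccore/CP002059.1")
#   with open("CP002059.1.html", "w", encoding="utf-8") as f:
#       f.write(page)
```

If the page has a known element that only appears once loading is done,
waiting for that element would be more reliable than watching the size.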

Cheers,
Chris