From: Michal Kleczek
Newsgroups: comp.lang.java.programmer
Subject: Re: JavaScript and Screenscraping
Date: Wed, 30 Mar 2011 16:27:23 +0200

Roedy Green wrote:

> I am working on a screenscraping project that is turning out to be much
> more time-consuming than I thought it would be. I am trying to gather
> a database of information about all the motherboards sold by major
> manufacturers. The idea is to eventually create a comparison shopper
> to help you narrow down models that fit your needs.
>
> Oddly, motherboard manufacturers don't use a database to generate
> their specification pages. These are all hand-compiled with theme and
> a dozen variations on every field. This I can handle.
>
> However, Asus decided to obfuscate their web pages with JavaScript.
> There are no data on them.
>
> I wondered if there exists a tool that is like a browser in that it will
> read a page and render the JavaScript, but unlike a browser, it would
> not show the information on the screen, just dump the generated HTML
> or raw text and accept a script of pages to analyse.
>

http://htmlunit.sourceforge.net/

-- 
Michal
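For illustration, here is a rough sketch of how HtmlUnit might be used for the task described in the quoted question: load each page headlessly, let its JavaScript run, and dump the generated HTML. It is not taken from the original post. It assumes an HtmlUnit 2.x release recent enough that WebClient is AutoCloseable and has getOptions() (the package is com.gargoylesoftware.htmlunit in 2.x; later releases moved to org.htmlunit). The class name ScrapeSketch, the helper names, the one-URL-per-line "script" format, the 5-second wait, and the example.com URL are all made up for the example.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ScrapeSketch {

    /**
     * Parses a hypothetical "script of pages": one URL per line.
     * Blank lines and lines starting with '#' are skipped.
     */
    static List<String> parseUrlList(String script) {
        List<String> urls = new ArrayList<>();
        for (String line : script.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !trimmed.startsWith("#")) {
                urls.add(trimmed);
            }
        }
        return urls;
    }

    /**
     * Loads a page, gives background JavaScript some time to finish,
     * and returns the resulting DOM serialized as XML.
     * Assumes the URL returns HTML; other content types would fail the cast.
     */
    static String renderToHtml(WebClient client, String url) throws IOException {
        HtmlPage page = client.getPage(url);
        // Assumed timeout: wait up to 5 seconds for scripts started by the page.
        client.waitForBackgroundJavaScript(5_000);
        return page.asXml();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical page list; replace with the real pages to analyse.
        String script = "# pages to scrape\nhttp://example.com/\n";

        try (WebClient client = new WebClient()) {
            // Real-world pages often contain script errors; do not abort on them.
            client.getOptions().setThrowExceptionOnScriptError(false);
            for (String url : parseUrlList(script)) {
                System.out.println(renderToHtml(client, url));
            }
        }
    }
}
```

One WebClient is reused for all pages so that cookies and cached scripts carry over between requests. Whether a fixed wait is enough depends on how the target site loads its data, so the timeout would likely need tuning per site.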