From: Bent C Dalager <bcd@pvv.ntnu.no>
Newsgroups: comp.lang.java.programmer
Subject: Re: Looking for Java web crawler api
Date: Tue, 12 Jul 2011 09:44:38 +0000 (UTC)
Organization: Norwegian university of science and technology
Lines: 10
Message-ID: <slrnj1o5s6.e9i.bcd@microbel.pvv.ntnu.no>
References: <4e1bf464$0$314$14726298@news.sunsite.dk>

I have found jsoup (http://jsoup.org) to be a fine library for web
scraping. It lets you easily set cookies and headers, fetches the URL
for you, and parses the tangled mess of HTML you tend to receive into
a well-formed, DOM-like document model.
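
To give a rough idea of what that looks like, here is a minimal sketch.
It assumes the jsoup jar is on your classpath. The class name, the
cookie and header values, the user agent string and the example.com
URL are all made up for illustration; swap in whatever your target
site actually needs.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSketch {

    // Fetch a page, sending a cookie and a header with the request.
    // The cookie, header and user agent values are hypothetical.
    static Document fetch(String url) throws IOException {
        return Jsoup.connect(url)
                .cookie("session", "abc123")
                .header("Accept-Language", "en")
                .userAgent("example-crawler/0.1")
                .get();
    }

    // Parse sloppy markup (unclosed tags, unquoted attributes) and
    // collect each link as "text -> absolute URL".
    static List<String> links(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> out = new ArrayList<String>();
        for (Element a : doc.select("a[href]")) {
            // The "abs:" prefix resolves the href against baseUri.
            out.add(a.text() + " -> " + a.attr("abs:href"));
        }
        return out;
    }

    public static void main(String[] args) {
        // No network needed here: parse a messy in-memory snippet.
        String messy = "<p>Unclosed paragraph<p><a href=/docs>Docs</a>";
        System.out.println(links(messy, "http://example.com/"));
    }
}
```

For a crawler you would call fetch() on each URL and feed the
"abs:href" values back into your own queue of pages to visit. The
queueing, politeness delays and robots.txt handling are up to you;
as far as I know jsoup only does the fetching and parsing.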

Cheers,
	Bent D.
-- 
Bent Dalager - bcd@pvv.org - http://www.pvv.org/~bcd
                                    powered by emacs