Path: csiph.com!x330-a1.tempe.blueboxinc.net!usenet.pasdenom.info!aioe.org!newsfeed1.swip.net!uio.no!ntnu.no!not-for-mail
From: Bent C Dalager
Newsgroups: comp.lang.java.programmer
Subject: Re: Looking for Java web crawler api
Date: Tue, 12 Jul 2011 09:44:38 +0000 (UTC)
Organization: Norwegian university of science and technology
Lines: 10
Message-ID:
References: <4e1bf464$0$314$14726298@news.sunsite.dk>
NNTP-Posting-Host: microbel.pvv.ntnu.no
X-Trace: orkan.itea.ntnu.no 1310463878 9075 129.241.210.179 (12 Jul 2011 09:44:38 GMT)
X-Complaints-To: usenet@ntnu.no
NNTP-Posting-Date: Tue, 12 Jul 2011 09:44:38 +0000 (UTC)
User-Agent: slrn/pre1.0.0-18 (Linux)
Xref: x330-a1.tempe.blueboxinc.net comp.lang.java.programmer:6107

I found JSoup (jsoup.org) to be a fine library for web scraping. It
lets you easily set cookies and headers, fetches the URL for you, and
converts the tangled mess of HTML you tend to receive into a
well-formed XML document model.

Cheers,
Bent D.
-- 
Bent Dalager - bcd@pvv.org - http://www.pvv.org/~bcd
powered by emacs
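
For anyone who wants to see roughly what that looks like, here is a minimal sketch, assuming jsoup is on the classpath. The class name, the user-agent string, the cookie name and value, and the example.com URLs are all placeholders made up for illustration; only the jsoup calls themselves (Jsoup.connect, Jsoup.parse, select, absUrl) come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class JsoupSketch {

    // Fetches a page over HTTP with a header and a cookie set on the request.
    // The user agent, header value, and cookie are hypothetical placeholders.
    static Document fetch(String url) throws IOException {
        return Jsoup.connect(url)
                .userAgent("example-crawler/0.1")
                .header("Accept-Language", "en")
                .cookie("session", "placeholder")
                .timeout(10_000) // milliseconds
                .get();
    }

    // Collects the absolute target of every <a href> in the parsed document.
    static List<String> extractLinks(Document doc) {
        List<String> links = new ArrayList<>();
        for (Element a : doc.select("a[href]")) {
            links.add(a.absUrl("href")); // resolved against the document's base URI
        }
        return links;
    }

    public static void main(String[] args) throws IOException {
        // Offline demonstration: unclosed <p> and <a> tags still parse into a tree.
        String messy = "<p>Unclosed paragraph<a href='/docs'>Docs</a>"
                + "<p>Second <a href='http://example.org/x'>X";
        Document doc = Jsoup.parse(messy, "http://example.com/");
        System.out.println(extractLinks(doc));

        // Optional network fetch, only when a URL is passed on the command line.
        if (args.length > 0) {
            System.out.println(extractLinks(fetch(args[0])));
        }
    }
}
```

Run without arguments, it only parses the inline string, so no network access is needed to try it. The link extraction is kept separate from the fetch so the same method works on a document parsed from a string, a file, or a live response.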