From: Eric Sosman
Newsgroups: comp.lang.java.programmer
Subject: Re: screen scraping gotcha
Date: Wed, 14 Dec 2011 08:34:34 -0500

On 12/14/2011 3:52 AM, Roedy Green wrote:
> I used a thread pool to speed up the screenscraping I use to find out
> which bookstores carry which books. Then I discovered some bookstores
> sometimes were returning 403 forbidden codes. I think they do this if
> you have more than one request outstanding from a given IP. I later
> discovered that Xenu link checker was getting 403 codes that
> BrokenLinks (which does one probe at a time) was finding were 200
> (ok).
>
> So I think screenscraping/link checking etc code needs some mechanism
> to optionally avoid hitting a site with more than one request at a
> time or perhaps even with a pause of X seconds between requests.
>
> It might do that with an explicit Semaphore, ordering the requests to
> increased distance between probes to the same site, reducing the pool
> size... ??

I'd suggest making the request scheduling explicit in the data structures, and not burying it in the locking mechanisms.
Maintain a pool of "requests contemplated" and another of "requests in progress," and limit the number of in-progress requests for any one site. When the in-progress pool completes a site S request, it can fish in the contemplated pool for another S request, but not for a T request.

If you want to get fancier, you could try to discover each site's throttling mechanism on the fly, by observing the 403's. But I think keeping things simple to start with would be better -- after all, you are only hypothesizing about the natures of the throttles!

-- 
Eric Sosman
esosman@ieee-dot-org.invalid
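
For illustration, the two-pool scheme described above might look roughly like the following in Java. This is only a sketch under assumptions: the class name SiteScheduler, its methods, and the use of plain strings for sites and URLs are all hypothetical, and it does no HTTP work itself. It only decides which request may start and which must wait.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch: tracks "contemplated" and "in progress" requests per site.
final class SiteScheduler {
    private final int maxPerSite;
    // Requests contemplated, grouped by site, in arrival order.
    private final Map<String, Queue<String>> contemplated = new HashMap<>();
    // Count of requests currently in progress for each site.
    private final Map<String, Integer> inProgress = new HashMap<>();

    SiteScheduler(int maxPerSite) {
        this.maxPerSite = maxPerSite;
    }

    // Returns true if the caller may start the request now;
    // otherwise the URL is parked in the contemplated pool.
    synchronized boolean submit(String site, String url) {
        int active = inProgress.getOrDefault(site, 0);
        if (active < maxPerSite) {
            inProgress.put(site, active + 1);
            return true;
        }
        contemplated.computeIfAbsent(site, k -> new ArrayDeque<>()).add(url);
        return false;
    }

    // Call when a request to `site` finishes. Returns the next parked URL
    // for the SAME site (which the caller should now start), or null.
    synchronized String complete(String site) {
        Queue<String> waiting = contemplated.get(site);
        if (waiting != null && !waiting.isEmpty()) {
            // The freed slot passes straight to the next request for this
            // site, so the in-progress count stays the same.
            return waiting.poll();
        }
        int active = inProgress.getOrDefault(site, 0);
        if (active <= 1) {
            inProgress.remove(site);
        } else {
            inProgress.put(site, active - 1);
        }
        return null;
    }

    synchronized int inProgressCount(String site) {
        return inProgress.getOrDefault(site, 0);
    }
}
```

A worker thread would call submit() before fetching and complete() afterwards, starting whatever URL complete() hands back. Because a finished S request can only pull another S request, a site with a limit of 1 never sees two probes at once, while requests to other sites proceed independently. A pause of X seconds between requests could be layered on top, but that is left out here to keep things simple.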