Re: screen scraping gotcha

From Eric Sosman <esosman@ieee-dot-org.invalid>
Newsgroups comp.lang.java.programmer
Subject Re: screen scraping gotcha
Date Wed, 14 Dec 2011 08:34:34 -0500
Organization A noiseless patient Spider
Lines 32
Message-ID <jca8ld$jud$1@dont-email.me> (permalink)
References <ucoge7heqtrc4ju4jlg40b9c1pt0th421d@4ax.com>


On 12/14/2011 3:52 AM, Roedy Green wrote:
> I used a thread pool to speed up the screenscraping I use to find out
> which bookstores carry which books. Then I discovered some bookstores
> sometimes were returning 403 forbidden codes.  I think they do this if
> you have more than one request outstanding from a given IP.  I later
> discovered that Xenu link checker was getting 403 codes that
> BrokenLinks (which does one probe at a time) was finding were 200
> (ok).
>
> So I think screenscraping/link checking etc code needs some mechanism
> to optionally avoid hitting a site with more than one request at a
> time or perhaps even with a pause of X seconds between requests.
>
> It might do that with an explicit Semaphore, ordering the requests to
> increase the distance between probes to the same site, reducing the
> pool size... ??

     I'd suggest making the request scheduling explicit in the data
structures, and not burying it in the locking mechanisms.  Maintain
a pool of "requests contemplated" and another of "requests in progress,"
and limit the number of in-progress requests for any one site.  When
the in-progress pool completes a site S request, it can fish in the
contemplated pool for another S request, but not for a T request.
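
     One way the two pools might look in Java, as a minimal sketch
rather than anyone's actual code: the class name PerSiteScheduler, the
maxPerSite limit, and the use of a plain String as the site key are all
assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.Executor;

/** Hypothetical sketch: at most maxPerSite requests run per site; the rest wait. */
class PerSiteScheduler {
    private final Executor executor;
    private final int maxPerSite;
    // "Requests contemplated": waiting tasks, grouped by site.
    private final Map<String, Queue<Runnable>> contemplated = new HashMap<>();
    // "Requests in progress": number of running tasks per site.
    private final Map<String, Integer> inProgress = new HashMap<>();

    PerSiteScheduler(Executor executor, int maxPerSite) {
        if (maxPerSite < 1) {
            throw new IllegalArgumentException("maxPerSite must be >= 1");
        }
        this.executor = executor;
        this.maxPerSite = maxPerSite;
    }

    /** Starts the task now if the site has a free slot, otherwise queues it. */
    void submit(String site, Runnable task) {
        synchronized (this) {
            int running = inProgress.getOrDefault(site, 0);
            if (running >= maxPerSite) {
                contemplated.computeIfAbsent(site, s -> new ArrayDeque<>()).add(task);
                return;
            }
            inProgress.put(site, running + 1);
        }
        dispatch(site, task); // outside the lock, so a slow task never blocks submit()
    }

    private void dispatch(String site, Runnable task) {
        executor.execute(() -> {
            Runnable current = task;
            while (current != null) {
                try {
                    current.run();
                } catch (RuntimeException e) {
                    // A failed request must not leak the site's slot;
                    // real code would log or report the failure here.
                }
                // Fish for another request to the SAME site, never a different one.
                current = nextFor(site);
            }
        });
    }

    /** Returns the next waiting task for the site, or frees the slot if none is left. */
    private synchronized Runnable nextFor(String site) {
        Queue<Runnable> waiting = contemplated.get(site);
        Runnable next = (waiting == null) ? null : waiting.poll();
        if (next == null) {
            contemplated.remove(site);
            int running = inProgress.get(site) - 1;
            if (running == 0) {
                inProgress.remove(site);
            } else {
                inProgress.put(site, running);
            }
        }
        return next;
    }
}
```

     With a real pool you would pass something like
Executors.newFixedThreadPool(n) as the executor and call
submit(host, fetchTask) for each URL. Note one consequence of this
design: a worker stays with its site until that site's queue drains,
so the pool should have at least as many threads as sites you want
probed concurrently.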

     If you want to get fancier, you could try to discover each site's
throttling mechanism on the fly, by observing the 403's.  But I think
keeping things simple to start with would be better -- after all, you
are only hypothesizing about the natures of the throttles!
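
     For what the fancier variant could look like, here is a hedged
sketch that assumes (and it is only an assumption) that a 403 means
"slow down": the per-site pause doubles on each 403 and halves on any
other response. The class name AdaptiveDelay and the 1-second and
60-second bounds are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: guesses a per-site pause from observed status codes. */
class AdaptiveDelay {
    private static final long INITIAL_MILLIS = 1_000; // assumed first pause after a 403
    private static final long MAX_MILLIS = 60_000;    // assumed upper bound on the pause
    private final Map<String, Long> delayMillis = new HashMap<>();

    /** Doubles the site's pause on a 403 (up to the cap); halves it on anything else. */
    synchronized void record(String site, int statusCode) {
        long current = delayMillis.getOrDefault(site, 0L);
        long updated = (statusCode == 403)
                ? Math.min(MAX_MILLIS, Math.max(INITIAL_MILLIS, current * 2))
                : current / 2;
        delayMillis.put(site, updated);
    }

    /** Pause to observe before the next request to this site; 0 if none is known. */
    synchronized long delayFor(String site) {
        return delayMillis.getOrDefault(site, 0L);
    }
}
```

     The caller would sleep for delayFor(site) before each request and
call record(site, code) after it. Nothing here knows what the site
really does; it only reacts to what was observed.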

-- 
Eric Sosman
esosman@ieee-dot-org.invalid
