Re: screen scraping gotcha

Date 2011-12-14 06:18 -0800
From Patricia Shanahan <pats@acm.org>
Newsgroups comp.lang.java.programmer
Subject Re: screen scraping gotcha
References <ucoge7heqtrc4ju4jlg40b9c1pt0th421d@4ax.com>
Message-ID <jLWdnVA8Bte8LXXTnZ2dnUVZ_vydnZ2d@earthlink.com>

On 12/14/2011 12:52 AM, Roedy Green wrote:
> I used a thread pool to speed up the screenscraping I use to find out
> which bookstores carry which books. Then I discovered some bookstores
> sometimes were returning 403 forbidden codes.  I think they do this if
> you have more than one request outstanding from a given IP.  I later
> discovered that Xenu link checker was getting 403 codes that
> BrokenLinks (which does one probe at a time) was finding were 200
> (ok).
>
> So I think screenscraping/link checking etc code needs some mechanism
> to optionally avoid hitting a site with more than one request at a
> time or perhaps even with a pause of X seconds between requests.
>
> It might do that with an explicit Semaphore, ordering the requests to
> increase the distance between probes to the same site, reducing the
> pool size... ??
>

What percentage of a thread's time is spent doing work while it has no
request outstanding?

If that is small, then the number of outstanding requests is likely to
be the critical resource, and should have a site-appropriate limit, in
some cases one. That would limit the thread count, for that site, to one.
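
As a rough illustration of the one-thread-per-site case, here is a minimal
sketch that routes all requests for a host through that host's own
single-thread executor, so at most one request per host is ever in flight.
The class name PerHostExecutors and its methods are hypothetical, not from
any library, and the sketch assumes Java 8 or later (for computeIfAbsent and
lambdas).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper: one worker thread per host, so a limit of one request per host. */
class PerHostExecutors {
    private final ConcurrentMap<String, ExecutorService> byHost =
            new ConcurrentHashMap<>();

    /** Queues the request on the single thread dedicated to this host. */
    <T> Future<T> submit(String host, Callable<T> request) {
        ExecutorService executor =
                byHost.computeIfAbsent(host, h -> Executors.newSingleThreadExecutor());
        return executor.submit(request);
    }

    /** Lets already-queued requests finish, then stops all worker threads. */
    void shutdown() {
        for (ExecutorService executor : byHost.values()) {
            executor.shutdown();
        }
    }
}
```

Requests to different hosts still run in parallel; only requests to the same
host are serialized.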

If a thread spends a significant amount of time doing other work, then
it might make sense to have more threads than the request limit and use
a semaphore to restrict the requests.
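
A minimal sketch of the semaphore variant, assuming a shared worker pool that
is larger than the per-site limit: each worker takes a permit for the host
only around the request itself and does its other work (parsing, storing
results) without holding one. HostLimiter, withPermit and available are
hypothetical names, and Java 8 or later is assumed.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical helper: caps the number of outstanding requests per host. */
class HostLimiter {
    private final ConcurrentMap<String, Semaphore> permits = new ConcurrentHashMap<>();
    private final int limitPerHost;

    HostLimiter(int limitPerHost) {
        this.limitPerHost = limitPerHost;
    }

    /** Runs the request while holding one of the host's permits. */
    <T> T withPermit(String host, Supplier<T> request) {
        Semaphore semaphore =
                permits.computeIfAbsent(host, h -> new Semaphore(limitPerHost));
        // Blocks until a permit is free; ignores interrupts to keep the sketch short.
        semaphore.acquireUninterruptibly();
        try {
            return request.get();
        } finally {
            semaphore.release(); // always hand the permit back, even on failure
        }
    }

    /** Permits currently free for the host (the full limit if the host is unseen). */
    int available(String host) {
        Semaphore semaphore = permits.get(host);
        return semaphore == null ? limitPerHost : semaphore.availablePermits();
    }
}
```

A worker would wrap only the fetch in withPermit(host, ...) and process the
response after it returns, so a limit of one still allows other threads to
stay busy on non-request work.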

Patricia

Thread

screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-14 00:52 -0800
  Re: screen scraping gotcha Eric Sosman <esosman@ieee-dot-org.invalid> - 2011-12-14 08:34 -0500
  Re: screen scraping gotcha Patricia Shanahan <pats@acm.org> - 2011-12-14 06:18 -0800
  Re: screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-14 09:29 -0800
    Re: screen scraping gotcha Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-12-14 10:28 -0800
      Re: screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-15 06:18 -0800
        Re: screen scraping gotcha Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-12-15 16:19 -0800