


screen scraping gotcha

From Roedy Green <see_website@mindprod.com.invalid>
Newsgroups comp.lang.java.programmer
Subject screen scraping gotcha
Date 2011-12-14 00:52 -0800
Organization Canadian Mind Products
Message-ID <ucoge7heqtrc4ju4jlg40b9c1pt0th421d@4ax.com>



I used a thread pool to speed up the screen scraping I use to find out
which bookstores carry which books. Then I discovered that some bookstores
were sometimes returning 403 (Forbidden) codes.  I think they do this if
you have more than one request outstanding from a given IP.  I later
discovered that Xenu link checker was getting 403 codes for links that
BrokenLinks (which does one probe at a time) found to be 200
(OK).

So I think screen scraping, link checking and similar code needs some
mechanism to optionally avoid hitting a site with more than one request
at a time, or perhaps even to pause X seconds between requests.
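
The "pause of X seconds" idea could be sketched roughly as below. This is
only an illustration, not tested against any real bookstore site: the class
name HostPacer and its methods are hypothetical, and it assumes the host name
from the URL is a good enough stand-in for "a site". It spaces out the start
of each request to a host; it does not by itself guarantee that a slow
request has finished before the next one begins.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: spaces out request start times per host. */
final class HostPacer {
    private final long minGapMillis;
    // For each host, the earliest time (ms) at which the next request may start.
    private final Map<String, Long> nextFree = new HashMap<>();

    HostPacer(long minGapMillis) {
        this.minGapMillis = minGapMillis;
    }

    /**
     * Reserves the next start slot for the host and returns how many
     * milliseconds the caller should wait before sending.
     */
    synchronized long reserveDelay(String host, long nowMillis) {
        Long next = nextFree.get(host);
        long slot = (next == null) ? nowMillis : Math.max(nowMillis, next.longValue());
        nextFree.put(host, slot + minGapMillis);
        return slot - nowMillis;
    }

    /** Blocks the calling pool thread until its reserved slot arrives. */
    void pace(URI uri) throws InterruptedException {
        long delay = reserveDelay(uri.getHost(), System.currentTimeMillis());
        if (delay > 0) {
            Thread.sleep(delay);
        }
    }
}
```

Each pool task would call pace(uri) just before opening its connection, so
requests to different hosts still run in parallel while requests to the same
host are held X seconds apart.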

It might do that with an explicit Semaphore, by ordering the requests to
increase the distance between probes to the same site, by reducing the
pool size... ??
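
A minimal sketch of the explicit Semaphore option, assuming one Semaphore per
host name so the pool can stay large while each site sees only a limited
number of requests at once. The class PerHostGate and its methods are
hypothetical names made up for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Semaphore;

/** Hypothetical helper: caps the number of in-flight requests per host. */
final class PerHostGate {
    private final int permitsPerHost;
    private final ConcurrentMap<String, Semaphore> gates = new ConcurrentHashMap<>();

    PerHostGate(int permitsPerHost) {
        this.permitsPerHost = permitsPerHost;
    }

    /**
     * Runs the request while holding one permit for the host.
     * The host must be non-null. The permit is released even if
     * the request throws.
     */
    <T> T call(String host, Callable<T> request) throws Exception {
        // Fair semaphore, so waiting threads proceed in arrival order.
        Semaphore gate = gates.computeIfAbsent(host, h -> new Semaphore(permitsPerHost, true));
        gate.acquire();
        try {
            return request.call();
        } finally {
            gate.release();
        }
    }

    /** Diagnostic only: permits currently free for the host. */
    int availablePermits(String host) {
        Semaphore gate = gates.get(host);
        return (gate == null) ? permitsPerHost : gate.availablePermits();
    }
}
```

With new PerHostGate(1), a pool task would wrap its fetch along the lines of
gate.call(uri.getHost(), () -> fetch(uri)), where fetch stands for whatever
method does the actual probe. A thread blocked on one busy host still ties up
a pool thread, which is one reason ordering the work queue so that
consecutive probes go to different sites may be worth combining with this.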
 
-- 
Roedy Green Canadian Mind Products
http://mindprod.com
For me, the appeal of computer programming is that
even though I am quite a klutz,
I can still produce something, in a sense
perfect, because the computer gives me as many
chances as I please to get it right.
 



Thread

screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-14 00:52 -0800
  Re: screen scraping gotcha Eric Sosman <esosman@ieee-dot-org.invalid> - 2011-12-14 08:34 -0500
  Re: screen scraping gotcha Patricia Shanahan <pats@acm.org> - 2011-12-14 06:18 -0800
  Re: screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-14 09:29 -0800
    Re: screen scraping gotcha Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-12-14 10:28 -0800
      Re: screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-15 06:18 -0800
        Re: screen scraping gotcha Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-12-15 16:19 -0800
