| From | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| Newsgroups | comp.lang.java.programmer |
| Subject | screen scraping gotcha |
| Date | 2011-12-14 00:52 -0800 |
| Organization | Canadian Mind Products |
| Message-ID | <ucoge7heqtrc4ju4jlg40b9c1pt0th421d@4ax.com> |
I used a thread pool to speed up the screen scraping I use to find out which bookstores carry which books. Then I discovered that some bookstores were sometimes returning 403 (Forbidden) codes. I think they do this if you have more than one request outstanding from a given IP.

I later discovered that Xenu link checker was getting 403 codes for links that BrokenLinks (which does one probe at a time) found to be 200 (OK).

So I think screen scraping, link checking and similar code needs some mechanism to optionally avoid hitting a site with more than one request at a time, or perhaps even to pause X seconds between requests.

It might do that with an explicit Semaphore, by ordering the requests to increase the distance between probes to the same site, by reducing the pool size... ??

--
Roedy Green Canadian Mind Products
http://mindprod.com
For me, the appeal of computer programming is that even though I am quite a klutz, I can still produce something, in a sense perfect, because the computer gives me as many chances as I please to get it right.
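For what it's worth, here is a rough sketch of the explicit-Semaphore idea combined with the pause between requests. It is only an illustration under assumptions, not code from the original post: the class name `HostThrottle`, the method `call`, and the `minGapMillis` parameter are all made up for the example, and the caller is assumed to pass in the host name and a `Callable` that does the actual HTTP probe.

```java
import java.util.Locale;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical helper: allows at most one outstanding request per host
 * and enforces a minimum pause between consecutive probes to that host.
 */
final class HostThrottle {
    /** Per-host state. The two fields are only touched while the permit is held. */
    private static final class HostState {
        final Semaphore permit = new Semaphore(1, true); // one request at a time, FIFO
        boolean used;            // false until the first probe has finished
        long lastFinishedNanos;  // System.nanoTime() when the last probe finished
    }

    private final ConcurrentHashMap<String, HostState> hosts = new ConcurrentHashMap<>();
    private final long minGapNanos;

    /** @param minGapMillis assumed minimum pause between probes to one host; 0 disables the pause */
    HostThrottle(long minGapMillis) {
        this.minGapNanos = TimeUnit.MILLISECONDS.toNanos(minGapMillis);
    }

    /** Runs the probe once no other probe to the same host is in flight. */
    <T> T call(String host, Callable<T> probe) throws Exception {
        HostState state = hosts.computeIfAbsent(
                host.toLowerCase(Locale.ROOT), h -> new HostState());
        state.permit.acquire();
        try {
            if (state.used) {
                // Sleep off whatever remains of the minimum gap since the last probe.
                long waitNanos = minGapNanos - (System.nanoTime() - state.lastFinishedNanos);
                if (waitNanos > 0) {
                    TimeUnit.NANOSECONDS.sleep(waitNanos);
                }
            }
            return probe.call();
        } finally {
            state.used = true;
            state.lastFinishedNanos = System.nanoTime();
            state.permit.release();
        }
    }
}
```

The pool threads would wrap each fetch in something like `throttle.call(url.getHost(), () -> fetch(url))`, where `fetch` stands in for whatever does the real request. Requests to different hosts still run in parallel; only probes to the same host are serialized. One drawback of this approach is that a pool thread sits blocked while it waits for a busy host, so ordering the work queue to spread out same-site probes may still be worth doing on top of it.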
screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-14 00:52 -0800
Re: screen scraping gotcha Eric Sosman <esosman@ieee-dot-org.invalid> - 2011-12-14 08:34 -0500
Re: screen scraping gotcha Patricia Shanahan <pats@acm.org> - 2011-12-14 06:18 -0800
Re: screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-14 09:29 -0800
Re: screen scraping gotcha Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-12-14 10:28 -0800
Re: screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-15 06:18 -0800
Re: screen scraping gotcha Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-12-15 16:19 -0800