| Date | 2011-12-14 06:18 -0800 |
|---|---|
| From | Patricia Shanahan <pats@acm.org> |
| Newsgroups | comp.lang.java.programmer |
| Subject | Re: screen scraping gotcha |
| References | <ucoge7heqtrc4ju4jlg40b9c1pt0th421d@4ax.com> |
| Message-ID | <jLWdnVA8Bte8LXXTnZ2dnUVZ_vydnZ2d@earthlink.com> |
On 12/14/2011 12:52 AM, Roedy Green wrote:
> I used a thread pool to speed up the screenscraping I use to find out
> which bookstores carry which books. Then I discovered some bookstores
> sometimes were returning 403 forbidden codes. I think they do this if
> you have more than one request outstanding from a given IP. I later
> discovered that Xenu link checker was getting 403 codes that
> BrokenLinks (which does one probe at a time) was finding were 200
> (ok).
>
> So I think screenscraping/link checking etc code needs some mechanism
> to optionally avoid hitting a site with more than one request at a
> time or perhaps even with a pause of X seconds between requests.
>
> It might do that with an explicit Semaphore, ordering the requests to
> increased distance between probes to the same site, reducing the pool
> size... ??
>

What percentage of a thread's time is spent doing work but without
having an outstanding request?

If that is small, then the number of outstanding requests is likely to
be the critical resource, and should have a site-appropriate limit, in
some cases one. That would limit the thread count, for that site, to
one.

If a thread spends a significant amount of time doing other work, then
it might make sense to have more threads than the request limit and use
a semaphore to restrict the requests.

Patricia
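For illustration only, here is a rough sketch of the semaphore variant: more worker threads than the request limit, with one Semaphore per site gating the actual request. The class name SiteLimiter, its call method, and the idea of keying on a host string are all made up for this example and do not come from anyone's code in this thread. It also assumes Java 8 or later for computeIfAbsent and lambdas.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Semaphore;

/**
 * Hypothetical helper: caps the number of outstanding requests per site.
 * Worker threads may outnumber the limit; only the request itself is gated.
 */
class SiteLimiter {
    private final int maxPerSite;

    // One semaphore per host, created lazily on first use.
    private final ConcurrentMap<String, Semaphore> permits =
            new ConcurrentHashMap<>();

    SiteLimiter(int maxPerSite) {
        this.maxPerSite = maxPerSite;
    }

    /**
     * Runs the request while holding one of the host's permits.
     * Blocks if maxPerSite requests to that host are already outstanding.
     */
    <T> T call(String host, Callable<T> request) throws Exception {
        Semaphore gate =
                permits.computeIfAbsent(host, h -> new Semaphore(maxPerSite));
        gate.acquire();
        try {
            return request.call();
        } finally {
            // Release even if the request throws, so the site is not
            // locked out by one failed probe.
            gate.release();
        }
    }
}
```

A pool task would wrap only the fetch in limiter.call(host, ...) and do its parsing outside it, so the other threads stay busy while one request per site is in flight. A pause between probes to the same site would need something extra, such as sleeping before the release.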
screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-14 00:52 -0800
Re: screen scraping gotcha Eric Sosman <esosman@ieee-dot-org.invalid> - 2011-12-14 08:34 -0500
Re: screen scraping gotcha Patricia Shanahan <pats@acm.org> - 2011-12-14 06:18 -0800
Re: screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-14 09:29 -0800
Re: screen scraping gotcha Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-12-14 10:28 -0800
Re: screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-15 06:18 -0800
Re: screen scraping gotcha Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-12-15 16:19 -0800