| From | Daniel Pitts <newsgroup.nospam@virtualinfinity.net> |
|---|---|
| Newsgroups | comp.lang.java.programmer |
| Subject | Re: screen scraping gotcha |
| References | <ucoge7heqtrc4ju4jlg40b9c1pt0th421d@4ax.com> <uqmhe71ovk82uojpe6l72673tcafuq6g9h@4ax.com> |
| Message-ID | <LV5Gq.18341$2e7.12664@newsfe18.iad> |
| Date | 2011-12-14 10:28 -0800 |
On 12/14/11 9:29 AM, Roedy Green wrote:
> On Wed, 14 Dec 2011 00:52:08 -0800, Roedy Green
> <see_website@mindprod.com.invalid> wrote, quoted or indirectly quoted
> someone who said :
>
>>
>> It might do that with an explicit Semaphore, ordering the requests to
>> increased distance between probes to the same site, reducing the pool
>> size... ??
>
> I have tried throttling so that requests are separated by 30 seconds,
> it is still sending me 403s. Yet when I hit the site with browser,
> instantly all is forgiven.
>
> The stupid buggers don't seem to realise I am trying to HELP them sell
> books. If they had half a brain they would give me a SOAP interface
> where I could submit a list of ISBNs and they would give me back a
> list of booleans telling me which ones they have in stock.
>
>
> Most online stores go to extreme lengths to foil screen scraping. Many
> affiliate programs want you to go to their site and spend ten minutes
> to set up the html just to sell one product.
>
> Allposters.com invented a SOAP interface, but then left out sizes,
> formats and prices, and it was not in sync with the web site. Not
> even the sizes of jpgs were correct. I have to start positing malice
> the incompetence is so extreme.
>

If you're going to violate the TOS and robots.txt, you might as well do it right:

Make sure you spoof an appropriate "Referer" header and User-Agent header. Keep track of cookies.

If possible, pre-process which requests you will make, and then build a thread-per-site thread pool, each thread with a Queue of requests to make and a randomized delay between each request.

Also, I would recommend supporting cache headers of various sorts (ETags, Expires, time to live, etc.). This reduces load on the remote server, bandwidth, and processing time.
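The thread-per-site idea with a randomized delay could look roughly like the sketch below. It is only an illustration under assumptions: the class name `PerSiteScheduler`, its methods, and the delay bounds are all made up here, and a single-threaded executor per host stands in for the "thread with a Queue".

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: one single-threaded worker, and so one FIFO queue, per host.
final class PerSiteScheduler {
    private final Map<String, ExecutorService> workers = new ConcurrentHashMap<>();
    private final long minDelayMillis;
    private final long maxDelayMillis;

    PerSiteScheduler(long minDelayMillis, long maxDelayMillis) {
        if (minDelayMillis < 0 || maxDelayMillis < minDelayMillis) {
            throw new IllegalArgumentException("need 0 <= min <= max");
        }
        this.minDelayMillis = minDelayMillis;
        this.maxDelayMillis = maxDelayMillis;
    }

    // Picks the pause after a request uniformly from [min, max] milliseconds.
    long nextDelayMillis() {
        return ThreadLocalRandom.current().nextLong(minDelayMillis, maxDelayMillis + 1);
    }

    // Queues the request on its host's worker; requests to one host never overlap.
    void submit(URI uri, Runnable request) {
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URI has no host: " + uri);
        }
        ExecutorService worker =
                workers.computeIfAbsent(host, h -> Executors.newSingleThreadExecutor());
        worker.execute(() -> {
            try {
                request.run();
            } finally {
                pause(); // keep the gap even if the request failed
            }
        });
    }

    private void pause() {
        try {
            Thread.sleep(nextDelayMillis());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the executor
        }
    }

    // Stops accepting work and waits for queued requests; false on timeout or interrupt.
    boolean shutdownAndWait(long timeout, TimeUnit unit) {
        for (ExecutorService worker : workers.values()) {
            worker.shutdown();
        }
        try {
            for (ExecutorService worker : workers.values()) {
                if (!worker.awaitTermination(timeout, unit)) {
                    return false;
                }
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Because each host gets its own worker, a slow or heavily throttled site does not hold up requests to the others, while requests to the same site stay in submission order with a random pause after each one.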
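For the cache-header suggestion, one possible shape is a cache entry that remembers the validators the server sent and turns them into a conditional GET. This is a sketch, not a full HTTP cache: `CachedPage` and its fields are hypothetical names, and computing the expiry time from `Expires` or `Cache-Control: max-age` is assumed to happen elsewhere.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical cache entry holding a page plus the validators from its last response.
final class CachedPage {
    final String body;
    final String etag;          // value of the ETag response header, or null if none
    final String lastModified;  // value of the Last-Modified response header, or null
    final long expiresAtMillis; // absolute expiry time (assumed computed by the caller)

    CachedPage(String body, String etag, String lastModified, long expiresAtMillis) {
        this.body = body;
        this.etag = etag;
        this.lastModified = lastModified;
        this.expiresAtMillis = expiresAtMillis;
    }

    // While fresh, the copy can be reused without contacting the server at all.
    boolean isFresh(long nowMillis) {
        return nowMillis < expiresAtMillis;
    }

    // Request headers for a conditional GET once the copy is stale.
    // A server that honours them replies 304 Not Modified when nothing changed.
    Map<String, String> revalidationHeaders() {
        Map<String, String> headers = new LinkedHashMap<>();
        if (etag != null) {
            headers.put("If-None-Match", etag);
        }
        if (lastModified != null) {
            headers.put("If-Modified-Since", lastModified);
        }
        return headers;
    }
}
```

Each entry of the returned map can be passed to `HttpURLConnection.setRequestProperty`; if the response code is `HttpURLConnection.HTTP_NOT_MODIFIED`, the cached body is still good and nothing needs to be downloaded or parsed again.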
screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-14 00:52 -0800
Re: screen scraping gotcha Eric Sosman <esosman@ieee-dot-org.invalid> - 2011-12-14 08:34 -0500
Re: screen scraping gotcha Patricia Shanahan <pats@acm.org> - 2011-12-14 06:18 -0800
Re: screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-14 09:29 -0800
Re: screen scraping gotcha Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-12-14 10:28 -0800
Re: screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-15 06:18 -0800
Re: screen scraping gotcha Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-12-15 16:19 -0800