Re: screen scraping gotcha

From Daniel Pitts <newsgroup.nospam@virtualinfinity.net>
Newsgroups comp.lang.java.programmer
Subject Re: screen scraping gotcha
References <ucoge7heqtrc4ju4jlg40b9c1pt0th421d@4ax.com> <uqmhe71ovk82uojpe6l72673tcafuq6g9h@4ax.com>
Message-ID <LV5Gq.18341$2e7.12664@newsfe18.iad>
Date 2011-12-14 10:28 -0800


On 12/14/11 9:29 AM, Roedy Green wrote:
> On Wed, 14 Dec 2011 00:52:08 -0800, Roedy Green
> <see_website@mindprod.com.invalid>  wrote, quoted or indirectly quoted
> someone who said :
>
>>
>> It might do that with an explicit Semaphore, ordering the requests to
>> increase the distance between probes to the same site, reducing the
>> pool size... ??
>
> I have tried throttling so that requests are separated by 30 seconds;
> it is still sending me 403s.  Yet when I hit the site with a browser,
> instantly all is forgiven.
>
> The stupid buggers don't seem to realise I am trying to HELP them sell
> books. If they had half a brain they would give me a SOAP interface
> where I could submit a list of ISBNs and they would give me back a
> list of booleans telling me which ones they have in stock.
>
>
> Most online stores go to extreme lengths to foil screen scraping. Many
> affiliate programs want you to go to their site and spend ten minutes
> to set up the HTML just to sell one product.
>
> Allposters.com invented a SOAP interface, but then left out sizes,
> formats and prices, and it was not in sync with the web site.  Not
> even the sizes of the JPGs were correct. I have to start positing
> malice, the incompetence is so extreme.
>

If you're going to violate the TOS and robots.txt, you might as well do 
it right:

Make sure you spoof an appropriate "Referer" header (HTTP spells it
with a single "r") and User-Agent header. Keep track of cookies.  If
possible, work out in advance which requests you will make, and then
build a thread pool with one thread per site, each thread with its own
Queue of requests and a randomized delay between requests.
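
The one-thread-per-site arrangement could be sketched roughly as
follows. This is only an illustration, assuming Java 8 or later; the
class name PerSiteScheduler, its methods, and the delay bounds are all
made up for the example. Each host gets its own single-thread
executor, whose internal queue plays the role of the per-site Queue,
and the worker sleeps for a random interval after every request.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: one worker thread per host, random pause after each request. */
public class PerSiteScheduler {
    // One single-thread executor per host; its internal queue keeps requests in order.
    private final Map<String, ExecutorService> workers = new ConcurrentHashMap<>();
    private final long minDelayMillis;
    private final long maxDelayMillis;

    public PerSiteScheduler(long minDelayMillis, long maxDelayMillis) {
        if (minDelayMillis < 0 || maxDelayMillis < minDelayMillis) {
            throw new IllegalArgumentException("need 0 <= min <= max");
        }
        this.minDelayMillis = minDelayMillis;
        this.maxDelayMillis = maxDelayMillis;
    }

    /** Picks a delay uniformly from [minDelayMillis, maxDelayMillis]. */
    public long nextDelayMillis() {
        return ThreadLocalRandom.current().nextLong(minDelayMillis, maxDelayMillis + 1);
    }

    /** Queues a request for the given host; requests to one host never overlap. */
    public void submit(String host, Runnable request) {
        ExecutorService worker =
                workers.computeIfAbsent(host, h -> Executors.newSingleThreadExecutor());
        worker.execute(() -> {
            try {
                request.run();
            } finally {
                pause(); // wait even if the request failed
            }
        });
    }

    private void pause() {
        try {
            Thread.sleep(nextDelayMillis());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the executor
        }
    }

    /** Stops accepting work and waits for queued requests; true if every worker finished. */
    public boolean shutdownAndWait(long timeout, TimeUnit unit) {
        for (ExecutorService w : workers.values()) {
            w.shutdown();
        }
        boolean allDone = true;
        try {
            for (ExecutorService w : workers.values()) {
                allDone &= w.awaitTermination(timeout, unit);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return allDone;
    }
}
```

Requests to different hosts run in parallel, while requests to the same
host are strictly serialized with a pause between them, so slowing one
site down does not stall the others.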

Also, I would recommend supporting cache headers of various sorts
(ETag, Last-Modified, Expires, Cache-Control max-age, etc.).  This
reduces load on the remote server, bandwidth, and processing time.
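
As a rough sketch of the bookkeeping this involves, the class below
remembers a page's validators and freshness deadline. The name
CacheEntry and its methods are invented for the example, and it only
handles the max-age directive; a real client would also need to honour
Expires, no-store, and friends.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical cache record for one URL: body plus the headers needed to revalidate it. */
public class CacheEntry {
    public final String body;
    public final String etag;           // value of the ETag response header, or null
    public final String lastModified;   // value of the Last-Modified header, or null
    public final long expiresAtMillis;  // the copy is fresh strictly before this instant

    public CacheEntry(String body, String etag, String lastModified, long expiresAtMillis) {
        this.body = body;
        this.etag = etag;
        this.lastModified = lastModified;
        this.expiresAtMillis = expiresAtMillis;
    }

    /** While fresh, the cached body can be reused without contacting the server. */
    public boolean isFresh(long nowMillis) {
        return nowMillis < expiresAtMillis;
    }

    /** Request headers for a conditional GET once the entry has gone stale. */
    public Map<String, String> validators() {
        Map<String, String> headers = new LinkedHashMap<>();
        if (etag != null) {
            headers.put("If-None-Match", etag);
        }
        if (lastModified != null) {
            headers.put("If-Modified-Since", lastModified);
        }
        return headers;
    }

    /** Extracts max-age (seconds) from a Cache-Control value; -1 if absent or malformed. */
    public static long parseMaxAgeSeconds(String cacheControl) {
        if (cacheControl == null) {
            return -1;
        }
        for (String directive : cacheControl.split(",")) {
            String d = directive.trim().toLowerCase(Locale.ROOT);
            if (d.startsWith("max-age=")) {
                try {
                    return Long.parseLong(d.substring("max-age=".length()).trim());
                } catch (NumberFormatException e) {
                    return -1;
                }
            }
        }
        return -1;
    }
}
```

With java.net.HttpURLConnection, the validators would be applied via
setRequestProperty before connecting; a response code equal to
HttpURLConnection.HTTP_NOT_MODIFIED (304) means the cached body is
still good and nothing needs to be downloaded or parsed again.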



Thread

screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-14 00:52 -0800
  Re: screen scraping gotcha Eric Sosman <esosman@ieee-dot-org.invalid> - 2011-12-14 08:34 -0500
  Re: screen scraping gotcha Patricia Shanahan <pats@acm.org> - 2011-12-14 06:18 -0800
  Re: screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-14 09:29 -0800
    Re: screen scraping gotcha Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-12-14 10:28 -0800
      Re: screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-15 06:18 -0800
        Re: screen scraping gotcha Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-12-15 16:19 -0800
