Re: screen scraping gotcha

From Daniel Pitts <newsgroup.nospam@virtualinfinity.net>
Newsgroups comp.lang.java.programmer
Subject Re: screen scraping gotcha
References <ucoge7heqtrc4ju4jlg40b9c1pt0th421d@4ax.com> <uqmhe71ovk82uojpe6l72673tcafuq6g9h@4ax.com> <LV5Gq.18341$2e7.12664@newsfe18.iad> <bsvje7t8drio985goeapea3lngum0tovhr@4ax.com>
Message-ID <18wGq.26283$cN1.1142@newsfe12.iad>
Date 2011-12-15 16:19 -0800


On 12/15/11 6:18 AM, Roedy Green wrote:
> On Wed, 14 Dec 2011 10:28:57 -0800, Daniel Pitts
> <newsgroup.nospam@virtualinfinity.net>  wrote, quoted or indirectly
> quoted someone who said :
>
>>
>> If you're going to violate the TOS and robots.txt, you might as well do
>> it right:
> TOS = Terms of Service
>
>> Make sure you spoof an appropriate "Referer" header and User-Agent
>> header. Keep track of cookies.  If possible, pre-process which requests
>> you will make, and then build a thread-per-site thread pool each with a
>> Queue of requests to make, and a randomized delay between each request.
>
> I figured I did not need a referrer since browsers don't send one.  I
> can try supporting cookies.  I figured they too would not be necessary
> since many browsers refuse them.
>
>> Also, I would recommend supporting Cache headers of various sorts
>> (etags, Expires on, time to live, etc...)  This reduces load on the
>> remote server, bandwidth, and processing time.
>
> It seems to me those are about asking for the same page more than once.  I
> don't understand what my app would do differently.
>
> I have written Abe Books asking for a computer friendly interface
> arguing it would attract more bulk book displayers, bookfinders, and
> that the bandwidth would be much lower than screenscraping.
>
> I once wrote the ASP people about their giant list of PADsites,
> suggesting some things to make it more computer friendly.  They
> responded that they did not want anyone USING the list, just casually
> looking at small parts of it.  So their goofy formatting was
> deliberately designed to frustrate those trying to import information
> from it.  I was baffled by the dog-in-the-manger attitude. Why go to all
> that work and then not let people use it?
>
> I maintain a similar list, http://mindprod.com/jgloss/hassle.html
> better pruned of deadwood.  I let people view it in HTML or download
> it as csv files.
>
Well, in any case, a few hours with Wireshark might help you
understand how your requests differ from a browser's.  The rest of my
advice, about the queues and so on, is still potentially worth looking at.
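
As a rough illustration of the queue advice quoted above (pre-process the
requests, then one queue and one worker thread per site, with a randomized
pause between requests), something along these lines might do.  The names
PerSiteQueues, groupByHost, randomDelayMillis and runAll, and the fetcher
callback, are all made up for this sketch; the delay bounds are whatever
the caller picks.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical helper class, not part of any library.
public class PerSiteQueues {

    /** Pre-process the request list: one FIFO queue per host, in first-seen order. */
    static Map<String, Queue<URI>> groupByHost(List<URI> requests) {
        Map<String, Queue<URI>> queues = new LinkedHashMap<>();
        for (URI uri : requests) {
            // getHost() is null for URIs without a host; those all share one queue.
            queues.computeIfAbsent(uri.getHost(), h -> new ArrayDeque<>()).add(uri);
        }
        return queues;
    }

    /** Random pause in [minMillis, maxMillis], so requests are not evenly spaced. */
    static long randomDelayMillis(long minMillis, long maxMillis) {
        return ThreadLocalRandom.current().nextLong(minMillis, maxMillis + 1);
    }

    /**
     * One worker thread per site. Each worker drains only its own queue and
     * sleeps a random time after every request. The fetcher does the actual
     * HTTP work and is supplied by the caller.
     */
    static void runAll(Map<String, Queue<URI>> queues, Consumer<URI> fetcher,
                       long minMillis, long maxMillis) {
        if (queues.isEmpty()) {
            return; // a fixed thread pool cannot have zero threads
        }
        ExecutorService pool = Executors.newFixedThreadPool(queues.size());
        for (Queue<URI> queue : queues.values()) {
            pool.submit(() -> {
                URI next;
                while ((next = queue.poll()) != null) {
                    fetcher.accept(next);
                    try {
                        Thread.sleep(randomDelayMillis(minMillis, maxMillis));
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.HOURS); // arbitrary upper bound
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Because each site has its own thread, a slow site only holds up its own
queue, and no single server sees more than one request at a time from you.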

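On the question of what an app would do differently with cache headers:
the usual answer is a conditional GET.  You keep the ETag and
Last-Modified values from the previous response and send them back as
If-None-Match and If-Modified-Since; a server that supports this replies
304 Not Modified with no body, and you reuse your stored copy.  A minimal
sketch, assuming a hypothetical CachedPage holder and ConditionalFetch
class (only the java.net calls are real API), and Java 9+ for
readAllBytes:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class, not part of any library.
public class ConditionalFetch {

    /** What was remembered from the previous response for one URL. */
    static final class CachedPage {
        final String etag;         // ETag response header, may be null
        final String lastModified; // Last-Modified response header, may be null
        final byte[] body;

        CachedPage(String etag, String lastModified, byte[] body) {
            this.etag = etag;
            this.lastModified = lastModified;
            this.body = body;
        }
    }

    /** Validator headers that let the server answer 304 Not Modified. */
    static Map<String, String> validators(CachedPage cached) {
        Map<String, String> headers = new LinkedHashMap<>();
        if (cached == null) {
            return headers; // first visit: plain unconditional GET
        }
        if (cached.etag != null) {
            headers.put("If-None-Match", cached.etag);
        }
        if (cached.lastModified != null) {
            headers.put("If-Modified-Since", cached.lastModified);
        }
        return headers;
    }

    /** Returns the cached copy on 304, otherwise downloads and returns a fresh one. */
    static CachedPage fetch(URL url, CachedPage cached) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        validators(cached).forEach(conn::setRequestProperty);
        try {
            if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
                return cached; // nothing changed, no body was transferred
            }
            try (InputStream in = conn.getInputStream()) {
                return new CachedPage(conn.getHeaderField("ETag"),
                                      conn.getHeaderField("Last-Modified"),
                                      in.readAllBytes());
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

This only pays off if the same pages are fetched again on later runs, and
only against servers that actually honor the validators.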