


Re: screen scraping gotcha

From Roedy Green <see_website@mindprod.com.invalid>
Newsgroups comp.lang.java.programmer
Subject Re: screen scraping gotcha
Date 2011-12-15 06:18 -0800
Organization Canadian Mind Products
Message-ID <bsvje7t8drio985goeapea3lngum0tovhr@4ax.com>
References <ucoge7heqtrc4ju4jlg40b9c1pt0th421d@4ax.com> <uqmhe71ovk82uojpe6l72673tcafuq6g9h@4ax.com> <LV5Gq.18341$2e7.12664@newsfe18.iad>



On Wed, 14 Dec 2011 10:28:57 -0800, Daniel Pitts
<newsgroup.nospam@virtualinfinity.net> wrote, quoted or indirectly
quoted someone who said :

>
>If you're going to violate the TOS and robots.txt, you might as well do 
>it right:
TOS = Terms of Service

>Make sure you spoof an appropriate "Referrer" header and User Agent 
>header. Keep track of cookies.  If possible, pre-process which requests 
>you will make, and then build a thread-per-site thread pool each with a 
>Queue of requests to make, and a randomized delay between each request.
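
The per-site queue with a randomized delay quoted above might be sketched
roughly as follows. PoliteScheduler, its method names, and the delay bounds
are all hypothetical, made up for illustration. The sketch keeps one
single-threaded executor per host, so requests to the same site run one at a
time in submission order, and it pauses for a random interval after each one.

```java
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper: one worker thread per host, random pause after each request. */
final class PoliteScheduler {
    private final Map<String, ExecutorService> workers = new ConcurrentHashMap<>();
    private final Random random = new Random();
    private final long minDelayMillis;
    private final long maxDelayMillis;

    PoliteScheduler(long minDelayMillis, long maxDelayMillis) {
        if (minDelayMillis < 0 || maxDelayMillis < minDelayMillis) {
            throw new IllegalArgumentException("need 0 <= min <= max");
        }
        this.minDelayMillis = minDelayMillis;
        this.maxDelayMillis = maxDelayMillis;
    }

    /** Picks a delay between minDelayMillis and maxDelayMillis, inclusive. */
    long nextDelayMillis() {
        long span = maxDelayMillis - minDelayMillis + 1;
        return minDelayMillis + (long) (random.nextDouble() * span);
    }

    /** Queues a request behind any earlier requests for the same host. */
    Future<?> submit(String host, Runnable request) {
        ExecutorService worker =
                workers.computeIfAbsent(host, h -> Executors.newSingleThreadExecutor());
        return worker.submit(() -> {
            try {
                request.run();
            } finally {
                pause(); // the next queued request for this host waits this long
            }
        });
    }

    private void pause() {
        try {
            Thread.sleep(nextDelayMillis());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt visible to the executor
        }
    }

    /** Lets the worker threads exit once their queues are drained. */
    void shutdown() {
        for (ExecutorService worker : workers.values()) {
            worker.shutdown();
        }
    }
}
```

Note that the returned Future completes only after the pause, and that the
worker threads keep the JVM alive until shutdown() is called.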

I figured I did not need a referrer since browsers don't send one.  I
can try supporting cookies.  I figured they too would not be necessary
since many browsers refuse them.
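
For anyone who does want to try the header and cookie advice, here is a
minimal sketch using only java.net. BrowserLikeRequest and the User-Agent
string are assumptions chosen for illustration, not anything from a real
app. Note that HTTP spells the header name "Referer".

```java
import java.io.IOException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.HttpURLConnection;
import java.net.URL;

/** Hypothetical helper that makes a request look more like a browser's. */
final class BrowserLikeRequest {
    // Assumed value: substitute the string of whatever browser is being imitated.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0";

    /** Installs a JVM-wide in-memory cookie store; call once at startup. */
    static void enableCookies() {
        // The default CookieManager only accepts cookies from the original server.
        CookieHandler.setDefault(new CookieManager());
    }

    /** Opens (but does not yet connect) a connection with the headers set. */
    static HttpURLConnection prepare(URL page, String referer) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) page.openConnection();
        conn.setRequestProperty("User-Agent", USER_AGENT);
        if (referer != null) {
            conn.setRequestProperty("Referer", referer);
        }
        return conn;
    }
}
```

Once enableCookies() has run, later HttpURLConnection requests store and
resend cookies without further code; reading from conn.getInputStream() is
what actually sends the request.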

>Also, I would recommend supporting Cache headers of various sorts 
>(etags, Expires on, time to live, etc...)  This reduces load on the 
>remote server, bandwidth, and processing time.

It seems to me those are about asking for the same page more than once.
I don't understand what my app would do differently.
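
For reference, the usual way an app acts on those headers is a conditional
GET, which only matters when re-fetching a page already seen: the app keeps
the ETag and Last-Modified values from the previous response, sends them
back as If-None-Match and If-Modified-Since, and reuses its saved copy if
the server answers 304 Not Modified. A sketch, in which ConditionalFetch and
Validators are hypothetical names and where and how the saved copy is stored
is left open:

```java
import java.net.HttpURLConnection;

/** Hypothetical helper for re-fetching a page only if it has changed. */
final class ConditionalFetch {

    /** Values remembered from the previous response for one URL. */
    static final class Validators {
        final String etag;        // ETag response header, or null if none was sent
        final long lastModified;  // Last-Modified in epoch millis, or 0 if absent

        Validators(String etag, long lastModified) {
            this.etag = etag;
            this.lastModified = lastModified;
        }
    }

    /** Adds the conditional headers; must be called before connecting. */
    static void addValidators(HttpURLConnection conn, Validators previous) {
        if (previous.etag != null) {
            conn.setRequestProperty("If-None-Match", previous.etag);
        }
        if (previous.lastModified > 0) {
            conn.setIfModifiedSince(previous.lastModified);
        }
    }

    /** True when the server says the saved copy is still good (no body follows). */
    static boolean isUnchanged(int responseCode) {
        return responseCode == HttpURLConnection.HTTP_NOT_MODIFIED;
    }

    /** Captures the validators of a completed response for the next run. */
    static Validators remember(HttpURLConnection conn) {
        return new Validators(conn.getHeaderField("ETag"), conn.getLastModified());
    }
}
```

A run would call addValidators() on each connection, check
isUnchanged(conn.getResponseCode()), and either skip the download or read
the new body and call remember().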

I have written to Abe Books asking for a computer-friendly interface,
arguing that it would attract more bulk book displayers and bookfinders,
and that the bandwidth would be much lower than with screen scraping.

I once wrote the ASP people about their giant list of PADsites,
suggesting some things to make it more computer-friendly.  They
responded that they did not want anyone USING the list, just casually
looking at small parts of it.  So their goofy formatting was
deliberately designed to frustrate those trying to import information
from it.  I was baffled by the dog-in-the-manger attitude.  Why go to
all that work and then not let people use it?

I maintain a similar list, http://mindprod.com/jgloss/hassle.html ,
better pruned of deadwood.  I let people view it in HTML or download
it as CSV files.

-- 
Roedy Green Canadian Mind Products
http://mindprod.com
For me, the appeal of computer programming is that
even though I am quite a klutz,
I can still produce something, in a sense
perfect, because the computer gives me as many
chances as I please to get it right.
 



Thread

screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-14 00:52 -0800
  Re: screen scraping gotcha Eric Sosman <esosman@ieee-dot-org.invalid> - 2011-12-14 08:34 -0500
  Re: screen scraping gotcha Patricia Shanahan <pats@acm.org> - 2011-12-14 06:18 -0800
  Re: screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-14 09:29 -0800
    Re: screen scraping gotcha Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-12-14 10:28 -0800
      Re: screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-15 06:18 -0800
        Re: screen scraping gotcha Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-12-15 16:19 -0800
