
screen scraping gotcha

Started by	Roedy Green <see_website@mindprod.com.invalid>
First post	2011-12-14 00:52 -0800
Last post	2011-12-15 16:19 -0800
Articles	7 — 4 participants

  screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-14 00:52 -0800
    Re: screen scraping gotcha Eric Sosman <esosman@ieee-dot-org.invalid> - 2011-12-14 08:34 -0500
    Re: screen scraping gotcha Patricia Shanahan <pats@acm.org> - 2011-12-14 06:18 -0800
    Re: screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-14 09:29 -0800
      Re: screen scraping gotcha Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-12-14 10:28 -0800
        Re: screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-15 06:18 -0800
          Re: screen scraping gotcha Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-12-15 16:19 -0800

#10722 — screen scraping gotcha

From	Roedy Green <see_website@mindprod.com.invalid>
Date	2011-12-14 00:52 -0800
Subject	screen scraping gotcha
Message-ID	<ucoge7heqtrc4ju4jlg40b9c1pt0th421d@4ax.com>

I used a thread pool to speed up the screenscraping I use to find out
which bookstores carry which books. Then I discovered some bookstores
sometimes were returning 403 forbidden codes.  I think they do this if
you have more than one request outstanding from a given IP.  I later
discovered that Xenu link checker was getting 403 codes that
BrokenLinks (which does one probe at a time) was finding were 200
(ok). 

So I think screenscraping/link checking etc code needs some mechanism
to optionally avoid hitting a site with more than one request at a
time or perhaps even with a pause of X seconds between requests.

It might do that with an explicit Semaphore, by ordering the requests to
increase the distance between probes to the same site, by reducing the pool
size... ??
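
A minimal sketch of the "pause of X seconds between requests" idea, for
illustration only. The class name HostThrottle and its methods are
hypothetical, not taken from any library, and the sketch assumes each pool
thread calls await(host) just before sending. It spaces out the start times
of requests to one host and leaves other hosts alone; it does not by itself
guarantee that only one request is outstanding.

```java
import java.util.HashMap;
import java.util.Map;

/** Hands out per-host start slots so probes to one host are spaced apart. */
class HostThrottle {
    private final long minGapMillis;
    // Earliest time (ms) at which the next request to each host may start.
    private final Map<String, Long> nextFree = new HashMap<>();

    HostThrottle(long minGapMillis) {
        this.minGapMillis = minGapMillis;
    }

    /**
     * Reserves the next slot for the host and returns how long (ms) the
     * caller should wait before sending. Other hosts are unaffected.
     */
    synchronized long reserve(String host, long nowMillis) {
        long start = Math.max(nowMillis, nextFree.getOrDefault(host, nowMillis));
        nextFree.put(host, start + minGapMillis);
        return start - nowMillis;
    }

    /** Convenience wrapper: blocks the calling thread until its slot. */
    void await(String host) throws InterruptedException {
        long wait = reserve(host, System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
    }
}
```

A worker would call something like throttle.await(url.getHost()) before
opening the connection. Sleeping inside a pool thread ties that thread up,
so this only suits small pools.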
 
-- 
Roedy Green Canadian Mind Products
http://mindprod.com
For me, the appeal of computer programming is that
even though I am quite a klutz,
I can still produce something, in a sense
perfect, because the computer gives me as many
chances as I please to get it right.

#10726

From	Eric Sosman <esosman@ieee-dot-org.invalid>
Date	2011-12-14 08:34 -0500
Message-ID	<jca8ld$jud$1@dont-email.me>
In reply to	#10722

On 12/14/2011 3:52 AM, Roedy Green wrote:
> I used a thread pool to speed up the screenscraping I use to find out
> which bookstores carry which books. Then I discovered some bookstores
> sometimes were returning 403 forbidden codes.  I think they do this if
> you have more than one request outstanding from a given IP.  I later
> discovered that Xenu link checker was getting 403 codes that
> BrokenLinks (which does one probe at a time) was finding were 200
> (ok).
>
> So I think screenscraping/link checking etc code needs some mechanism
> to optionally avoid hitting a site with more than one request at a
> time or perhaps even with a pause of X seconds between requests.
>
> It might do that with an explicit Semaphore, by ordering the requests to
> increase the distance between probes to the same site, by reducing the pool
> size... ??

     I'd suggest making the request scheduling explicit in the data
structures, and not burying it in the locking mechanisms.  Maintain
a pool of "requests contemplated" and another of "requests in progress,"
and limit the number of in-progress requests for any one site.  When
the in-progress pool completes a site S request, it can fish in the
contemplated pool for another S request, but not for a T request.
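
One way the two pools could look in code, as a rough sketch rather than a
finished design. SiteScheduler, tryStart and completed are made-up names,
requests are represented as plain URL strings, and the same limit is
assumed for every site.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Queue;

class SiteScheduler {
    private final int maxPerSite;
    // "Requests contemplated": pending URLs grouped by site.
    private final Map<String, Queue<String>> contemplated = new HashMap<>();
    // "Requests in progress": number of active requests per site.
    private final Map<String, Integer> inProgress = new HashMap<>();

    SiteScheduler(int maxPerSite) {
        this.maxPerSite = maxPerSite;
    }

    synchronized void add(String site, String url) {
        contemplated.computeIfAbsent(site, s -> new ArrayDeque<>()).add(url);
    }

    /** Starts a pending request for the site if it is below its limit. */
    synchronized Optional<String> tryStart(String site) {
        Queue<String> pending = contemplated.get(site);
        if (pending == null || pending.isEmpty()
                || inProgress.getOrDefault(site, 0) >= maxPerSite) {
            return Optional.empty();
        }
        inProgress.merge(site, 1, Integer::sum);
        return Optional.of(pending.remove());
    }

    /**
     * Call when a request to the site finishes. The freed slot may only be
     * refilled by another request for the same site, never a different one.
     */
    synchronized Optional<String> completed(String site) {
        inProgress.merge(site, -1, Integer::sum);
        return tryStart(site);
    }
}
```

The scheduling policy lives in two ordinary maps, so it can be inspected
and tested without any threads running.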

     If you want to get fancier, you could try to discover each site's
throttling mechanism on the fly, by observing the 403's.  But I think
keeping things simple to start with would be better -- after all, you
are only hypothesizing about the natures of the throttles!
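
If one did want to experiment with the fancier variant, a crude starting
point might be a per-site delay that grows on a 403 and shrinks on a 200.
The class AdaptiveDelay and the doubling and three-quarters factors are
arbitrary assumptions for illustration; nothing in the thread says the
sites actually throttle this way.

```java
/** Per-site delay that backs off on 403 and relaxes on 200. */
class AdaptiveDelay {
    private final long minMillis; // must be > 0 for doubling to have effect
    private final long maxMillis;
    private long currentMillis;

    AdaptiveDelay(long minMillis, long maxMillis) {
        this.minMillis = minMillis;
        this.maxMillis = maxMillis;
        this.currentMillis = minMillis;
    }

    /** Records a response status and returns the delay to use next. */
    synchronized long record(int statusCode) {
        if (statusCode == 403) {
            // Refused: double the delay, up to the ceiling.
            currentMillis = Math.min(maxMillis, currentMillis * 2);
        } else if (statusCode == 200) {
            // Accepted: ease back toward the floor.
            currentMillis = Math.max(minMillis, currentMillis * 3 / 4);
        }
        // Any other status leaves the delay unchanged.
        return currentMillis;
    }
}
```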

-- 
Eric Sosman
esosman@ieee-dot-org.invalid

#10727

From	Patricia Shanahan <pats@acm.org>
Date	2011-12-14 06:18 -0800
Message-ID	<jLWdnVA8Bte8LXXTnZ2dnUVZ_vydnZ2d@earthlink.com>
In reply to	#10722

On 12/14/2011 12:52 AM, Roedy Green wrote:
> I used a thread pool to speed up the screenscraping I use to find out
> which bookstores carry which books. Then I discovered some bookstores
> sometimes were returning 403 forbidden codes.  I think they do this if
> you have more than one request outstanding from a given IP.  I later
> discovered that Xenu link checker was getting 403 codes that
> BrokenLinks (which does one probe at a time) was finding were 200
> (ok).
>
> So I think screenscraping/link checking etc code needs some mechanism
> to optionally avoid hitting a site with more than one request at a
> time or perhaps even with a pause of X seconds between requests.
>
> It might do that with an explicit Semaphore, by ordering the requests to
> increase the distance between probes to the same site, by reducing the pool
> size... ??
>

What percentage of a thread's time is spent doing work but without
having an outstanding request?

If that is small, then the number of outstanding requests is likely to
be the critical resource, and should have a site-appropriate limit, in
some cases one. That would limit the thread count, for that site, to one.

If a thread spends a significant amount of time doing other work, then
it might make sense to have more threads than the request limit and use
a semaphore to restrict the requests.
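
The second case might be sketched as follows: a java.util.concurrent.Semaphore
guards only the network part of the job, so threads that are busy parsing
do not count against the site's request limit. LimitedFetcher and
fetchThenProcess are hypothetical names, and the fetch and process steps
are passed in as plain functions to keep the sketch independent of any
HTTP library.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Function;
import java.util.function.Supplier;

/** Limits outstanding requests to one site while allowing more worker threads. */
class LimitedFetcher {
    private final Semaphore permits;

    LimitedFetcher(int maxOutstanding) {
        this.permits = new Semaphore(maxOutstanding);
    }

    /**
     * Runs fetch while holding a permit, then releases the permit before
     * running process, so another thread can issue its request meanwhile.
     */
    <R, T> T fetchThenProcess(Supplier<R> fetch, Function<R, T> process) {
        R raw;
        permits.acquireUninterruptibly();
        try {
            raw = fetch.get();
        } finally {
            permits.release();
        }
        return process.apply(raw);
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

With a limit of one, this matches the first case as well: only one
request to the site is ever in flight, however many threads there are.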

Patricia

#10729

From	Roedy Green <see_website@mindprod.com.invalid>
Date	2011-12-14 09:29 -0800
Message-ID	<uqmhe71ovk82uojpe6l72673tcafuq6g9h@4ax.com>
In reply to	#10722

On Wed, 14 Dec 2011 00:52:08 -0800, Roedy Green
<see_website@mindprod.com.invalid> wrote, quoted or indirectly quoted
someone who said :

>
>It might do that with an explicit Semaphore, by ordering the requests to
>increase the distance between probes to the same site, by reducing the pool
>size... ??

I have tried throttling so that requests are separated by 30 seconds,
but it is still sending me 403s.  Yet when I hit the site with a browser,
instantly all is forgiven.

The stupid buggers don't seem to realise I am trying to HELP them sell
books. If they had half a brain they would give me a SOAP interface
where I could submit a list of ISBNs and they would give me back a
list of booleans telling me which ones they have in stock.


Most online stores go to extreme lengths to foil screen scraping. Many
affiliate programs want you to go to their site and spend ten minutes
setting up the HTML just to sell one product.

Allposters.com invented a SOAP interface, but then left out sizes,
formats and prices, and it was not in sync with the web site.  Not
even the sizes of the JPGs were correct.  The incompetence is so extreme
that I have to start positing malice.

#10732

From	Daniel Pitts <newsgroup.nospam@virtualinfinity.net>
Date	2011-12-14 10:28 -0800
Message-ID	<LV5Gq.18341$2e7.12664@newsfe18.iad>
In reply to	#10729

On 12/14/11 9:29 AM, Roedy Green wrote:
> On Wed, 14 Dec 2011 00:52:08 -0800, Roedy Green
> <see_website@mindprod.com.invalid>  wrote, quoted or indirectly quoted
> someone who said :
>
>>
>> It might do that with an explicit Semaphore, by ordering the requests to
>> increase the distance between probes to the same site, by reducing the pool
>> size... ??
>
> I have tried throttling so that requests are separated by 30 seconds,
> but it is still sending me 403s.  Yet when I hit the site with a browser,
> instantly all is forgiven.
>
> The stupid buggers don't seem to realise I am trying to HELP them sell
> books. If they had half a brain they would give me a SOAP interface
> where I could submit a list of ISBNs and they would give me back a
> list of booleans telling me which ones they have in stock.
>
>
> Most online stores go to extreme lengths to foil screen scraping. Many
> affiliate programs want you to go to their site and spend ten minutes
> setting up the HTML just to sell one product.
>
> Allposters.com invented a SOAP interface, but then left out sizes,
> formats and prices, and it was not in sync with the web site.  Not
> even the sizes of the JPGs were correct.  The incompetence is so extreme
> that I have to start positing malice.
>

If you're going to violate the TOS and robots.txt, you might as well do 
it right:

Make sure you spoof an appropriate "Referrer" header and User Agent 
header. Keep track of cookies.  If possible, pre-process which requests 
you will make, and then build a thread-per-site thread pool each with a 
Queue of requests to make, and a randomized delay between each request.
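
The thread-per-site part of this could be sketched with one
single-threaded executor per site, where the executor's own work queue
plays the role of the Queue and each task pauses for a random time after
its request. PerSiteWorkers is a hypothetical name, the delay bounds are
placeholders, and the sketch assumes 0 <= minDelayMillis <= maxDelayMillis.
Headers and cookies are left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadLocalRandom;

/** One worker thread per site; requests to a site run one at a time, in order. */
class PerSiteWorkers {
    private final Map<String, ExecutorService> workers = new ConcurrentHashMap<>();
    private final long minDelayMillis;
    private final long maxDelayMillis;

    PerSiteWorkers(long minDelayMillis, long maxDelayMillis) {
        this.minDelayMillis = minDelayMillis;
        this.maxDelayMillis = maxDelayMillis;
    }

    /** Queues a request on the site's own thread. */
    Future<?> submit(String site, Runnable request) {
        ExecutorService worker =
                workers.computeIfAbsent(site, s -> Executors.newSingleThreadExecutor());
        return worker.submit(() -> {
            request.run();
            pause(); // keeps the site's thread idle before its next request
        });
    }

    /** Sleeps for a random time between the two bounds, inclusive. */
    private void pause() {
        long delay = ThreadLocalRandom.current().nextLong(minDelayMillis, maxDelayMillis + 1);
        try {
            Thread.sleep(delay);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Lets queued work finish, then allows the worker threads to exit. */
    void shutdown() {
        workers.values().forEach(ExecutorService::shutdown);
    }
}
```

Because each site has exactly one thread, no site ever sees two requests
at once, while different sites still proceed in parallel.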

Also, I would recommend supporting Cache headers of various sorts 
(etags, Expires on, time to live, etc...)  This reduces load on the 
remote server, bandwidth, and processing time.
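
For the ETag case, a sketch of the bookkeeping a scraper might do if it
re-checks the same pages on later runs: remember each page's ETag, send
it back in an If-None-Match header, and reuse the stored body when the
server answers 304 Not Modified. EtagCache and its methods are
hypothetical names, the cache is in memory only, and the actual HTTP
transport is deliberately left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Remembers ETags and bodies so repeat fetches can be made conditional. */
class EtagCache {
    private static final class Entry {
        final String etag;
        final String body;

        Entry(String etag, String body) {
            this.etag = etag;
            this.body = body;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();

    /** Extra request headers for the URL: If-None-Match when an ETag is known. */
    synchronized Map<String, String> conditionalHeaders(String url) {
        Map<String, String> headers = new HashMap<>();
        Entry entry = entries.get(url);
        if (entry != null) {
            headers.put("If-None-Match", entry.etag);
        }
        return headers;
    }

    /**
     * Returns the body to use. On 304 that is the stored copy (or null if
     * none is stored); on 200 with an ETag the new body is stored first.
     */
    synchronized String onResponse(String url, int status, String etag, String body) {
        if (status == 304) {
            Entry entry = entries.get(url);
            return entry == null ? null : entry.body;
        }
        if (status == 200 && etag != null) {
            entries.put(url, new Entry(etag, body));
        }
        return body;
    }
}
```

A 304 response carries no body, so an unchanged page costs the server
and the scraper far less than a full fetch.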

#10774

From	Roedy Green <see_website@mindprod.com.invalid>
Date	2011-12-15 06:18 -0800
Message-ID	<bsvje7t8drio985goeapea3lngum0tovhr@4ax.com>
In reply to	#10732

On Wed, 14 Dec 2011 10:28:57 -0800, Daniel Pitts
<newsgroup.nospam@virtualinfinity.net> wrote, quoted or indirectly
quoted someone who said :

>
>If you're going to violate the TOS and robots.txt, you might as well do 
>it right:
TOS = Terms of Service

>Make sure you spoof an appropriate "Referrer" header and User Agent 
>header. Keep track of cookies.  If possible, pre-process which requests 
>you will make, and then build a thread-per-site thread pool each with a 
>Queue of requests to make, and a randomized delay between each request.

I figured I did not need a referrer since browsers don't send one.  I
can try supporting cookies.  I figured they too would not be necessary
since many browsers refuse them.

>Also, I would recommend supporting Cache headers of various sorts 
>(etags, Expires on, time to live, etc...)  This reduces load on the 
>remote server, bandwidth, and processing time.

It seems to me those are about asking for the same page more than once.  I
don't understand what my app would do differently.

I have written Abe Books asking for a computer friendly interface
arguing it would attract more bulk book displayers, bookfinders, and
that the bandwidth would be much lower than screenscraping.

I once wrote the ASP people about their giant list of PADsites,
suggesting some things to make it more computer friendly.  They
responded that they did not want anyone USING the list, just casually
looking at small parts of it.  So their goofy formatting was
deliberately designed to frustrate those trying to import information
from it.  I was baffled by the dog-in-the-manger attitude. Why go to all
that work and then not let people use it?

I maintain a similar list, http://mindprod.com/jgloss/hassle.html
better pruned of deadwood.  I let people view it in HTML or download
it as csv files. 

#10783

From	Daniel Pitts <newsgroup.nospam@virtualinfinity.net>
Date	2011-12-15 16:19 -0800
Message-ID	<18wGq.26283$cN1.1142@newsfe12.iad>
In reply to	#10774

On 12/15/11 6:18 AM, Roedy Green wrote:
> On Wed, 14 Dec 2011 10:28:57 -0800, Daniel Pitts
> <newsgroup.nospam@virtualinfinity.net>  wrote, quoted or indirectly
> quoted someone who said :
>
>>
>> If you're going to violate the TOS and robots.txt, you might as well do
>> it right:
> TOS = Terms of Service
>
>> Make sure you spoof an appropriate "Referrer" header and User Agent
>> header. Keep track of cookies.  If possible, pre-process which requests
>> you will make, and then build a thread-per-site thread pool each with a
>> Queue of requests to make, and a randomized delay between each request.
>
> I figured I did not need a referrer since browsers don't send one.  I
> can try supporting cookies.  I figured they too would not be necessary
> since many browsers refuse them.
>
>> Also, I would recommend supporting Cache headers of various sorts
>> (etags, Expires on, time to live, etc...)  This reduces load on the
>> remote server, bandwidth, and processing time.
>
> It seems to me those are about asking for the same page more than once.  I
> don't understand what my app would do differently.
>
> I have written Abe Books asking for a computer friendly interface
> arguing it would attract more bulk book displayers, bookfinders, and
> that the bandwidth would be much lower than screenscraping.
>
> I once wrote the ASP people about their giant list of PADsites,
> suggesting some things to make it more computer friendly.  They
> responded that they did not want anyone USING the list, just casually
> looking at small parts of it.  So their goofy formatting was
> deliberately designed to frustrate those trying to import information
> from it.  I was baffled by the dog-in-the-manger attitude. Why go to all
> that work and then not let people use it?
>
> I maintain a similar list, http://mindprod.com/jgloss/hassle.html
> better pruned of deadwood.  I let people view it in HTML or download
> it as csv files.
>
Well, in whatever case, a few hours with Wireshark might help you 
understand what is different.  The rest of my advice involving queues 
and all is potentially worth looking at.
