Groups > comp.lang.java.programmer > #10722 > unrolled thread
| Started by | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| First post | 2011-12-14 00:52 -0800 |
| Last post | 2011-12-15 16:19 -0800 |
| Articles | 7 — 4 participants |
screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-14 00:52 -0800
Re: screen scraping gotcha Eric Sosman <esosman@ieee-dot-org.invalid> - 2011-12-14 08:34 -0500
Re: screen scraping gotcha Patricia Shanahan <pats@acm.org> - 2011-12-14 06:18 -0800
Re: screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-14 09:29 -0800
Re: screen scraping gotcha Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-12-14 10:28 -0800
Re: screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-15 06:18 -0800
Re: screen scraping gotcha Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-12-15 16:19 -0800
| From | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| Date | 2011-12-14 00:52 -0800 |
| Subject | screen scraping gotcha |
| Message-ID | <ucoge7heqtrc4ju4jlg40b9c1pt0th421d@4ax.com> |
I used a thread pool to speed up the screenscraping I use to find out which bookstores carry which books. Then I discovered some bookstores sometimes were returning 403 forbidden codes. I think they do this if you have more than one request outstanding from a given IP. I later discovered that Xenu link checker was getting 403 codes that BrokenLinks (which does one probe at a time) was finding were 200 (ok).

So I think screenscraping/link checking etc code needs some mechanism to optionally avoid hitting a site with more than one request at a time or perhaps even with a pause of X seconds between requests.

It might do that with an explicit Semaphore, ordering the requests to increased distance between probes to the same site, reducing the pool size... ??
--
Roedy Green Canadian Mind Products http://mindprod.com
For me, the appeal of computer programming is that even though I am quite a klutz, I can still produce something, in a sense perfect, because the computer gives me as many chances as I please to get it right.
| From | Eric Sosman <esosman@ieee-dot-org.invalid> |
|---|---|
| Date | 2011-12-14 08:34 -0500 |
| Message-ID | <jca8ld$jud$1@dont-email.me> |
| In reply to | #10722 |
On 12/14/2011 3:52 AM, Roedy Green wrote:
> I used a thread pool to speed up the screenscraping I use to find out
> which bookstores carry which books. Then I discovered some bookstores
> sometimes were returning 403 forbidden codes. I think they do this if
> you have more than one request outstanding from a given IP. I later
> discovered that Xenu link checker was getting 403 codes that
> BrokenLinks (which does one probe at a time) was finding were 200
> (ok).
>
> So I think screenscraping/link checking etc code needs some mechanism
> to optionally avoid hitting a site with more than one request at a
> time or perhaps even with a pause of X seconds between requests.
>
> It might do that with an explicit Semaphore, ordering the requests to
> increased distance between probes to the same site, reducing the pool
> size... ??
I'd suggest making the request scheduling explicit in the data
structures, and not burying it in the locking mechanisms. Maintain
a pool of "requests contemplated" and another of "requests in progress,"
and limit the number of in-progress requests for any one site. When
the in-progress pool completes a site S request, it can fish in the
contemplated pool for another S request, but not for a T request.
If you want to get fancier, you could try to discover each site's
throttling mechanism on the fly, by observing the 403's. But I think
keeping things simple to start with would be better -- after all, you
are only hypothesizing about the natures of the throttles!
--
Eric Sosman
esosman@ieee-dot-org.invalid
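
The two-pool idea in the post above can be sketched roughly as follows. This is only an illustration under assumptions: the class name `SiteScheduler`, its method names, and the use of plain strings for sites and URLs are hypothetical and do not come from the thread. The `synchronized` keyword is there only to keep the maps consistent across threads; the scheduling policy itself lives in the two data structures.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Hypothetical sketch: per-site scheduling kept explicit in data structures. */
public class SiteScheduler {
    private final int maxInProgressPerSite;
    // "Requests contemplated": pending URLs, grouped by site.
    private final Map<String, Queue<String>> contemplated = new HashMap<>();
    // "Requests in progress": number of outstanding requests per site.
    private final Map<String, Integer> inProgress = new HashMap<>();

    public SiteScheduler(int maxInProgressPerSite) {
        this.maxInProgressPerSite = maxInProgressPerSite;
    }

    /** Adds a request to the contemplated pool. */
    public synchronized void contemplate(String site, String url) {
        contemplated.computeIfAbsent(site, s -> new ArrayDeque<>()).add(url);
    }

    /**
     * Moves one request for the site from "contemplated" to "in progress".
     * Returns null if the site is at its limit or has nothing pending.
     */
    public synchronized String tryStart(String site) {
        Queue<String> pending = contemplated.get(site);
        int running = inProgress.getOrDefault(site, 0);
        if (pending == null || pending.isEmpty() || running >= maxInProgressPerSite) {
            return null;
        }
        inProgress.put(site, running + 1);
        return pending.remove();
    }

    /** Marks one request done, then fishes for another request for the same site only. */
    public synchronized String completeAndFetchNext(String site) {
        int running = inProgress.getOrDefault(site, 0);
        inProgress.put(site, Math.max(0, running - 1));
        return tryStart(site);
    }
}
```

A worker that finishes a request for site S would call `completeAndFetchNext("S")` and, if it gets a URL back, carry on with it; a pending request for site T is never handed out through that call.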
| From | Patricia Shanahan <pats@acm.org> |
|---|---|
| Date | 2011-12-14 06:18 -0800 |
| Message-ID | <jLWdnVA8Bte8LXXTnZ2dnUVZ_vydnZ2d@earthlink.com> |
| In reply to | #10722 |
On 12/14/2011 12:52 AM, Roedy Green wrote:
> I used a thread pool to speed up the screenscraping I use to find out
> which bookstores carry which books. Then I discovered some bookstores
> sometimes were returning 403 forbidden codes. I think they do this if
> you have more than one request outstanding from a given IP. I later
> discovered that Xenu link checker was getting 403 codes that
> BrokenLinks (which does one probe at a time) was finding were 200
> (ok).
>
> So I think screenscraping/link checking etc code needs some mechanism
> to optionally avoid hitting a site with more than one request at a
> time or perhaps even with a pause of X seconds between requests.
>
> It might do that with an explicit Semaphore, ordering the requests to
> increased distance between probes to the same site, reducing the pool
> size... ??
>
What percentage of a thread's time is spent doing work but without having an outstanding request? If that is small, then the number of outstanding requests is likely to be the critical resource, and should have a site-appropriate limit, in some cases one. That would limit the thread count, for that site, to one.

If a thread spends a significant amount of time doing other work, then it might make sense to have more threads than the request limit and use a semaphore to restrict the requests.

Patricia
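
The semaphore variant described above (more threads than the request limit, with a per-site semaphore guarding only the request itself) might look something like this. It is a sketch under assumptions: `SiteLimiter`, `withPermit` and `availablePermits` are hypothetical names, and the request is modelled as a `Supplier` so the example stays free of checked exceptions; real HTTP code would have to wrap its `IOException`.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical sketch: limits outstanding requests per site with one Semaphore per site. */
public class SiteLimiter {
    private final int permitsPerSite;
    private final ConcurrentMap<String, Semaphore> semaphores = new ConcurrentHashMap<>();

    public SiteLimiter(int permitsPerSite) {
        this.permitsPerSite = permitsPerSite;
    }

    private Semaphore semaphoreFor(String site) {
        return semaphores.computeIfAbsent(site, s -> new Semaphore(permitsPerSite));
    }

    /** Runs the request while holding one of the site's permits; other work can run outside. */
    public <T> T withPermit(String site, Supplier<T> request) {
        Semaphore semaphore = semaphoreFor(site);
        semaphore.acquireUninterruptibly(); // blocks while the site is at its limit
        try {
            return request.get();
        } finally {
            semaphore.release(); // always hand the permit back, even if the request throws
        }
    }

    /** Permits currently free for the site (useful for diagnostics). */
    public int availablePermits(String site) {
        return semaphoreFor(site).availablePermits();
    }
}
```

With `new SiteLimiter(1)`, any number of pool threads can parse pages in parallel, but only one of them at a time is inside `withPermit` for a given site.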
| From | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| Date | 2011-12-14 09:29 -0800 |
| Message-ID | <uqmhe71ovk82uojpe6l72673tcafuq6g9h@4ax.com> |
| In reply to | #10722 |
On Wed, 14 Dec 2011 00:52:08 -0800, Roedy Green <see_website@mindprod.com.invalid> wrote, quoted or indirectly quoted someone who said :

>
>It might do that with an explicit Semaphore, ordering the requests to
>increased distance between probes to the same site, reducing the pool
>size... ??

I have tried throttling so that requests are separated by 30 seconds, but it is still sending me 403s. Yet when I hit the site with a browser, instantly all is forgiven.

The stupid buggers don't seem to realise I am trying to HELP them sell books. If they had half a brain they would give me a SOAP interface where I could submit a list of ISBNs and they would give me back a list of booleans telling me which ones they have in stock.

Most online stores go to extreme lengths to foil screen scraping. Many affiliate programs want you to go to their site and spend ten minutes to set up the html just to sell one product.

Allposters.com invented a SOAP interface, but then left out sizes, formats and prices, and it was not in sync with the web site. Not even the sizes of jpgs were correct. I have to start positing malice; the incompetence is so extreme.
| From | Daniel Pitts <newsgroup.nospam@virtualinfinity.net> |
|---|---|
| Date | 2011-12-14 10:28 -0800 |
| Message-ID | <LV5Gq.18341$2e7.12664@newsfe18.iad> |
| In reply to | #10729 |
On 12/14/11 9:29 AM, Roedy Green wrote:
> On Wed, 14 Dec 2011 00:52:08 -0800, Roedy Green
> <see_website@mindprod.com.invalid> wrote, quoted or indirectly quoted
> someone who said :
>
>>
>> It might do that with an explicit Semaphore, ordering the requests to
>> increased distance between probes to the same site, reducing the pool
>> size... ??
>
> I have tried throttling so that requests are separated by 30 seconds,
> but it is still sending me 403s. Yet when I hit the site with a browser,
> instantly all is forgiven.
>
> The stupid buggers don't seem to realise I am trying to HELP them sell
> books. If they had half a brain they would give me a SOAP interface
> where I could submit a list of ISBNs and they would give me back a
> list of booleans telling me which ones they have in stock.
>
>
> Most online stores go to extreme lengths to foil screen scraping. Many
> affiliate programs want you to go to their site and spend ten minutes
> to set up the html just to sell one product.
>
> Allposters.com invented a SOAP interface, but then left out sizes,
> formats and prices, and it was not in sync with the web site. Not
> even the sizes of jpgs were correct. I have to start positing malice;
> the incompetence is so extreme.
>
If you're going to violate the TOS and robots.txt, you might as well do it right:

Make sure you spoof an appropriate "Referrer" header and User Agent header. Keep track of cookies. If possible, pre-process which requests you will make, and then build a thread-per-site thread pool each with a Queue of requests to make, and a randomized delay between each request.

Also, I would recommend supporting Cache headers of various sorts (etags, Expires on, time to live, etc...) This reduces load on the remote server, bandwidth, and processing time.
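
One possible reading of the "thread-per-site, each with a queue and a randomized delay" suggestion above is sketched below. The class name `PerSiteQueues` and its methods are hypothetical, and the delay bounds are whatever the caller picks. Each site gets its own single-thread executor, whose internal FIFO queue plays the role of the per-site request queue. For simplicity the random pause is taken before every request, including the first, and worker threads are daemons so a forgotten `shutdown()` does not keep the JVM alive.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

/** Hypothetical sketch: one worker thread and FIFO queue per site, random pause before each request. */
public class PerSiteQueues {
    private final long minDelayMillis;
    private final long maxDelayMillis;
    private final Map<String, ExecutorService> workers = new ConcurrentHashMap<>();

    public PerSiteQueues(long minDelayMillis, long maxDelayMillis) {
        if (minDelayMillis < 0 || maxDelayMillis < minDelayMillis) {
            throw new IllegalArgumentException("need 0 <= min <= max");
        }
        this.minDelayMillis = minDelayMillis;
        this.maxDelayMillis = maxDelayMillis;
    }

    /** Queues the request on the site's own worker, so requests to one site never overlap. */
    public <T> CompletableFuture<T> submit(String site, Supplier<T> request) {
        ExecutorService worker = workers.computeIfAbsent(site, s -> newDaemonWorker());
        return CompletableFuture.supplyAsync(() -> {
            pause();
            return request.get();
        }, worker);
    }

    private static ExecutorService newDaemonWorker() {
        return Executors.newSingleThreadExecutor(runnable -> {
            Thread thread = new Thread(runnable);
            thread.setDaemon(true);
            return thread;
        });
    }

    /** Sleeps for a random time in [minDelayMillis, maxDelayMillis]. */
    private void pause() {
        long delay = ThreadLocalRandom.current().nextLong(minDelayMillis, maxDelayMillis + 1);
        try {
            Thread.sleep(delay);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while pacing requests", e);
        }
    }

    public void shutdown() {
        workers.values().forEach(ExecutorService::shutdown);
    }
}
```

Different sites proceed in parallel on their own threads, while requests to any single site are serialized and spaced out.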
| From | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| Date | 2011-12-15 06:18 -0800 |
| Message-ID | <bsvje7t8drio985goeapea3lngum0tovhr@4ax.com> |
| In reply to | #10732 |
On Wed, 14 Dec 2011 10:28:57 -0800, Daniel Pitts <newsgroup.nospam@virtualinfinity.net> wrote, quoted or indirectly quoted someone who said :

>
>If you're going to violate the TOS and robots.txt, you might as well do
>it right:

TOS = Terms of Service

>Make sure you spoof an appropriate "Referrer" header and User Agent
>header. Keep track of cookies. If possible, pre-process which requests
>you will make, and then build a thread-per-site thread pool each with a
>Queue of requests to make, and a randomized delay between each request.

I figured I did not need a referrer since browsers don't send one. I can try supporting cookies. I figured they too would not be necessary since many browsers refuse them.

>Also, I would recommend supporting Cache headers of various sorts
>(etags, Expires on, time to live, etc...) This reduces load on the
>remote server, bandwidth, and processing time.

It seems to me those are about asking for the same page more than once. I don't understand what my app would do differently.

I have written Abe Books asking for a computer friendly interface, arguing it would attract more bulk book displayers and bookfinders, and that the bandwidth would be much lower than screenscraping.

I once wrote the ASP people about their giant list of PADsites, suggesting some things to make it more computer friendly. They responded that they did not want anyone USING the list, just casually looking at small parts of it. So their goofy formatting was deliberately designed to frustrate those trying to import information from it. I was baffled by the dog in a manger attitude. Why go to all that work and then not let people use it?

I maintain a similar list, http://mindprod.com/jgloss/hassle.html better pruned of deadwood. I let people view it in HTML or download it as csv files.
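
On the question of what an app would do differently with cache headers: a scraper that rechecks the same pages on later runs can remember each page's `ETag`, send it back in an `If-None-Match` request header, and reuse its stored result when the server answers 304 Not Modified with no body. The sketch below shows only that bookkeeping, with no networking. `ConditionalGetCache` and its method names are hypothetical, and the in-memory map is an assumption; a real scraper would probably persist the entries between runs.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: ETag bookkeeping for conditional GETs, without any networking. */
public class ConditionalGetCache {
    private static final class Entry {
        final String etag;
        final String body;

        Entry(String etag, String body) {
            this.etag = etag;
            this.body = body;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();

    /** Extra request headers to send for this URL (empty if the page was never fetched). */
    public Map<String, String> requestHeaders(String url) {
        Map<String, String> headers = new HashMap<>();
        Entry entry = entries.get(url);
        if (entry != null) {
            headers.put("If-None-Match", entry.etag);
        }
        return headers;
    }

    /** Returns the page body to use, given the status code, ETag header and body received. */
    public String handleResponse(String url, int status, String etag, String body) {
        if (status == 304) {
            // Not Modified: the server sent no body, so reuse the stored copy.
            Entry entry = entries.get(url);
            if (entry == null) {
                throw new IllegalStateException("304 received for a URL that was never cached: " + url);
            }
            return entry.body;
        }
        if (status == 200 && etag != null) {
            entries.put(url, new Entry(etag, body));
        }
        return body;
    }
}
```

The caller would add `requestHeaders(url)` to the outgoing request and pass the response status, `ETag` header value and body to `handleResponse`. For pages that have not changed, the transfer shrinks to headers only.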
| From | Daniel Pitts <newsgroup.nospam@virtualinfinity.net> |
|---|---|
| Date | 2011-12-15 16:19 -0800 |
| Message-ID | <18wGq.26283$cN1.1142@newsfe12.iad> |
| In reply to | #10774 |
On 12/15/11 6:18 AM, Roedy Green wrote:
> On Wed, 14 Dec 2011 10:28:57 -0800, Daniel Pitts
> <newsgroup.nospam@virtualinfinity.net> wrote, quoted or indirectly
> quoted someone who said :
>
>>
>> If you're going to violate the TOS and robots.txt, you might as well do
>> it right:
> TOS = Terms of Service
>
>> Make sure you spoof an appropriate "Referrer" header and User Agent
>> header. Keep track of cookies. If possible, pre-process which requests
>> you will make, and then build a thread-per-site thread pool each with a
>> Queue of requests to make, and a randomized delay between each request.
>
> I figured I did not need a referrer since browsers don't send one. I
> can try supporting cookies. I figured they too would not be necessary
> since many browsers refuse them.
>
>> Also, I would recommend supporting Cache headers of various sorts
>> (etags, Expires on, time to live, etc...) This reduces load on the
>> remote server, bandwidth, and processing time.
>
> It seems to me those are about asking for the same page more than once.
> I don't understand what my app would do differently.
>
> I have written Abe Books asking for a computer friendly interface,
> arguing it would attract more bulk book displayers and bookfinders, and
> that the bandwidth would be much lower than screenscraping.
>
> I once wrote the ASP people about their giant list of PADsites,
> suggesting some things to make it more computer friendly. They
> responded that they did not want anyone USING the list, just casually
> looking at small parts of it. So their goofy formatting was
> deliberately designed to frustrate those trying to import information
> from it. I was baffled by the dog in a manger attitude. Why go to all
> that work and then not let people use it?
>
> I maintain a similar list, http://mindprod.com/jgloss/hassle.html
> better pruned of deadwood. I let people view it in HTML or download
> it as csv files.
>
Well, in whatever case, a few hours with Wireshark might help you understand what is different. The rest of my advice involving queues and all is potentially worth looking at.