| From | Daniel Pitts <newsgroup.nospam@virtualinfinity.net> |
|---|---|
| Newsgroups | comp.lang.java.programmer |
| Subject | Re: screen scraping gotcha |
| References | <ucoge7heqtrc4ju4jlg40b9c1pt0th421d@4ax.com> <uqmhe71ovk82uojpe6l72673tcafuq6g9h@4ax.com> <LV5Gq.18341$2e7.12664@newsfe18.iad> <bsvje7t8drio985goeapea3lngum0tovhr@4ax.com> |
| Message-ID | <18wGq.26283$cN1.1142@newsfe12.iad> |
| Date | 2011-12-15 16:19 -0800 |
On 12/15/11 6:18 AM, Roedy Green wrote:
> On Wed, 14 Dec 2011 10:28:57 -0800, Daniel Pitts
> <newsgroup.nospam@virtualinfinity.net> wrote, quoted or indirectly
> quoted someone who said :
>
>>
>> If you're going to violate the TOS and robots.txt, you might as well do
>> it right:
> TOS = Terms of Service
>
>> Make sure you spoof an appropriate "Referrer" header and User Agent
>> header. Keep track of cookies. If possible, pre-process which requests
>> you will make, and then build a thread-per-site thread pool each with a
>> Queue of requests to make, and a randomized delay between each request.
>
> I figured I did not need a referrer since browsers don't send one . I
> can try supporting cookies. I figured they too would not be necessary
> since many browsers refuse them.
>
>> Also, I would recommend supporting Cache headers of various sorts
>> (etags, Expires on, time to live, etc...) This reduces load on the
>> remote server, bandwidth, and processing time.
>
> It seems to me those are about asking for the same page more once. I
> don't understand what my app would do differently.
>
> I have written Abe Books asking for a computer friendly interface
> arguing it would attract more bulk book displayers, bookfinders, and
> that the bandwidth would be much lower than screenscraping.
>
> I once wrote the ASP people about their giant list of PADsites,
> suggesting some things to make it more computer friendly. They
> responded that they did not want anyone USING the list, just casually
> looking at small parts of it. So their goofy formatting was
> deliberately designed to frustrate those trying to import information
> from it. I was baffled by the dog in a manger attitude. Why go to all
> that work then not let people use it.
>
> I maintain a similar list, http://mindprod.com/jgloss/hassle.html
> better pruned of deadwood. I let people view it in HTML or download
> it as csv files.

Well, in any case, a few hours with Wireshark might help you understand what is different between your requests and a browser's. The rest of my advice, involving queues and all, is potentially worth looking at.
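
For what it's worth, here is a rough sketch of the queue idea: group the pre-processed requests by host, give each host its own worker, and pause a random interval between requests to the same host. Treat it as an illustration under assumptions, not a finished crawler. The names `PerSiteScheduler`, `groupByHost`, and the `fetcher` callback are all made up for this example, and the actual HTTP fetch is left to whatever you pass in.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical sketch: one worker thread per site, each with its own request queue. */
public class PerSiteScheduler {
    private final long minDelayMillis;
    private final long maxDelayMillis;

    public PerSiteScheduler(long minDelayMillis, long maxDelayMillis) {
        if (minDelayMillis < 0 || maxDelayMillis < minDelayMillis) {
            throw new IllegalArgumentException("need 0 <= min <= max");
        }
        this.minDelayMillis = minDelayMillis;
        this.maxDelayMillis = maxDelayMillis;
    }

    /** Groups URLs by host, preserving the original order within each host. */
    public static Map<String, List<URI>> groupByHost(List<URI> urls) {
        Map<String, List<URI>> bySite = new LinkedHashMap<>();
        for (URI url : urls) {
            bySite.computeIfAbsent(url.getHost(), h -> new ArrayList<>()).add(url);
        }
        return bySite;
    }

    /** Picks a random delay in [minDelayMillis, maxDelayMillis]. */
    long nextDelayMillis() {
        return ThreadLocalRandom.current().nextLong(minDelayMillis, maxDelayMillis + 1);
    }

    /** Runs every request; requests to one host run sequentially with a random pause between them. */
    public void run(List<URI> urls, Consumer<URI> fetcher) throws InterruptedException {
        Map<String, List<URI>> bySite = groupByHost(urls);
        if (bySite.isEmpty()) {
            return; // newFixedThreadPool rejects a size of zero
        }
        ExecutorService pool = Executors.newFixedThreadPool(bySite.size());
        for (List<URI> queue : bySite.values()) {
            pool.submit(() -> {
                try {
                    for (int i = 0; i < queue.size(); i++) {
                        if (i > 0) {
                            Thread.sleep(nextDelayMillis()); // pause between hits on the same site
                        }
                        fetcher.accept(queue.get(i));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // stop this site's worker quietly
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS); // arbitrary upper bound for the sketch
    }
}
```

You would call it along the lines of `new PerSiteScheduler(2000, 8000).run(urls, url -> fetch(url))`, where `fetch` stands in for your own download code. Note that an exception thrown by the callback ends that site's worker without being reported, so real code would want to catch and log inside the loop.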
screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-14 00:52 -0800
Re: screen scraping gotcha Eric Sosman <esosman@ieee-dot-org.invalid> - 2011-12-14 08:34 -0500
Re: screen scraping gotcha Patricia Shanahan <pats@acm.org> - 2011-12-14 06:18 -0800
Re: screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-14 09:29 -0800
Re: screen scraping gotcha Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-12-14 10:28 -0800
Re: screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-15 06:18 -0800
Re: screen scraping gotcha Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-12-15 16:19 -0800