| From | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| Newsgroups | comp.lang.java.programmer |
| Subject | Re: screen scraping gotcha |
| Date | 2011-12-15 06:18 -0800 |
| Organization | Canadian Mind Products |
| Message-ID | <bsvje7t8drio985goeapea3lngum0tovhr@4ax.com> |
| References | <ucoge7heqtrc4ju4jlg40b9c1pt0th421d@4ax.com> <uqmhe71ovk82uojpe6l72673tcafuq6g9h@4ax.com> <LV5Gq.18341$2e7.12664@newsfe18.iad> |
On Wed, 14 Dec 2011 10:28:57 -0800, Daniel Pitts
<newsgroup.nospam@virtualinfinity.net> wrote, quoted or indirectly quoted someone who said :

>If you're going to violate the TOS and robots.txt, you might as well do
>it right:

TOS = Terms of Service

>Make sure you spoof an appropriate "Referrer" header and User Agent
>header. Keep track of cookies. If possible, pre-process which requests
>you will make, and then build a thread-per-site thread pool each with a
>Queue of requests to make, and a randomized delay between each request.

I figured I did not need a referrer since browsers don't send one. I can try supporting cookies. I figured they too would not be necessary, since many browsers refuse them.

>Also, I would recommend supporting Cache headers of various sorts
>(etags, Expires on, time to live, etc...) This reduces load on the
>remote server, bandwidth, and processing time.

It seems to me those are about asking for the same page more than once. I don't understand what my app would do differently.

I have written to Abe Books asking for a computer-friendly interface, arguing that it would attract more bulk book displayers and bookfinders, and that the bandwidth would be much lower than with screen scraping.

I once wrote to the ASP people about their giant list of PAD sites, suggesting some things to make it more computer-friendly. They responded that they did not want anyone USING the list, just casually looking at small parts of it. So their goofy formatting was deliberately designed to frustrate those trying to import information from it. I was baffled by the dog-in-the-manger attitude. Why go to all that work and then not let people use it?

I maintain a similar list, http://mindprod.com/jgloss/hassle.html, better pruned of deadwood. I let people view it in HTML or download it as CSV files.
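On the cache-header question above, here is a minimal sketch of what an app might do differently when it asks for the same page a second time. It is not taken from the thread: the class name ConditionalFetch, the Validators holder and the method names are all hypothetical, and the sketch assumes the site actually sends ETag or Last-Modified headers. The idea is to remember those two values from the previous fetch and send them back as If-None-Match and If-Modified-Since, so the server can answer 304 Not Modified instead of sending the whole page again.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for conditional GETs; names are illustrative only. */
final class ConditionalFetch {

    /** Validators remembered from the previous successful fetch of a page. */
    static final class Validators {
        final String etag;         // ETag response header value, or null if none was sent
        final String lastModified; // Last-Modified response header value, or null

        Validators(String etag, String lastModified) {
            this.etag = etag;
            this.lastModified = lastModified;
        }
    }

    /** Builds the request headers that turn a plain GET into a conditional GET. */
    static Map<String, String> conditionalHeaders(Validators v) {
        Map<String, String> headers = new LinkedHashMap<String, String>();
        if (v != null && v.etag != null) {
            headers.put("If-None-Match", v.etag);
        }
        if (v != null && v.lastModified != null) {
            headers.put("If-Modified-Since", v.lastModified);
        }
        return headers;
    }

    /** True when the server reports that the cached copy is still current (304). */
    static boolean canReuseCachedCopy(int status) {
        return status == HttpURLConnection.HTTP_NOT_MODIFIED;
    }

    /** Sends the conditional request and returns the HTTP status code. */
    static int fetchStatus(URL url, Validators v) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        for (Map.Entry<String, String> e : conditionalHeaders(v).entrySet()) {
            conn.setRequestProperty(e.getKey(), e.getValue());
        }
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

A 304 response carries no body, so the scraper can skip both the download and the re-parse and reuse whatever it extracted last time. A 200 means the page changed, and the new ETag and Last-Modified values should be stored for the next run. If the site sends neither header, this buys nothing and the app behaves exactly as before.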
-- 
Roedy Green Canadian Mind Products
http://mindprod.com
For me, the appeal of computer programming is that even though I am quite a klutz, I can still produce something, in a sense perfect, because the computer gives me as many chances as I please to get it right.
screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-14 00:52 -0800
Re: screen scraping gotcha Eric Sosman <esosman@ieee-dot-org.invalid> - 2011-12-14 08:34 -0500
Re: screen scraping gotcha Patricia Shanahan <pats@acm.org> - 2011-12-14 06:18 -0800
Re: screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-14 09:29 -0800
Re: screen scraping gotcha Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-12-14 10:28 -0800
Re: screen scraping gotcha Roedy Green <see_website@mindprod.com.invalid> - 2011-12-15 06:18 -0800
Re: screen scraping gotcha Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-12-15 16:19 -0800