| From | Nomen Nescio <nobody@dizum.com> |
|---|---|
| Newsgroups | comp.lang.python |
| Subject | a few questions about scrapy |
| Message-ID | <762702f8ad5d3fc831efd1e8b3c8e97c@dizum.com> (permalink) |
| Date | 2012-09-18 13:36 +0200 |
| Organization | mail2news@dizum.com |
I've installed scrapy and gotten a basic set-up working, and I have a few odd questions whose answers I haven't been able to find in the documentation.

I plan to run it occasionally from the command line or as a cron job, to scrape new content from a few sites. To avoid duplication, I keep two sets of longs in memory, holding the md5 hashes of the URLs and files crawled, and the spider ignores any that it has seen before. I need to load them from two disk files when the scrapy job starts, and save them to disk when it ends. Are there hooks or something similar for start-up and shut-down tasks?

How can I put a random waiting interval between HTTP GET calls?

Is there any way to set the proxy configuration in my Python code, or do I have to set the environment variables http_proxy and https_proxy before running scrapy?

Thanks
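
To make the load/save part concrete, here is a rough sketch of the persistence described above, using only the standard library. It assumes Python 3, one hex-encoded hash per line in each file, and the helper names `url_hash`, `load_hashes` and `save_hashes`, all of which are made up for illustration and are not part of scrapy.

```python
import hashlib
import os


def url_hash(url):
    """Return the md5 digest of a URL as an integer (a dedup key, not a security measure)."""
    return int(hashlib.md5(url.encode("utf-8")).hexdigest(), 16)


def load_hashes(path):
    """Read one hex-encoded hash per line; a missing file yields an empty set."""
    if not os.path.exists(path):
        return set()
    with open(path, "r", encoding="ascii") as f:
        return {int(line, 16) for line in f if line.strip()}


def save_hashes(path, hashes):
    """Write the set to a temporary file, then rename it over the old file."""
    tmp = path + ".tmp"  # hypothetical naming scheme for the temp file
    with open(tmp, "w", encoding="ascii") as f:
        for h in sorted(hashes):
            f.write(format(h, "032x") + "\n")
    os.replace(tmp, path)
```

The load call would run once at start-up and the save call once at shut-down. In scrapy itself, the `spider_opened` and `spider_closed` signals look like the natural places for these two calls, the `DOWNLOAD_DELAY` and `RANDOMIZE_DOWNLOAD_DELAY` settings appear to cover the random wait, and a proxy can reportedly be set per request through `request.meta['proxy']`; all of this is worth checking against the documentation for the installed version.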