


a few questions about scrapy

Started by: Nomen Nescio <nobody@dizum.com>
First post: 2012-09-18 13:36 +0200
Last post: 2012-09-18 13:36 +0200
Articles: 1 (1 participant)




Message-ID: <762702f8ad5d3fc831efd1e8b3c8e97c@dizum.com>

I've installed scrapy and gotten a basic set-up working, and I have a
few odd questions whose answers I haven't been able to find in the
documentation.


I plan to run it occasionally from the command line or as a cron job,
to scrape new content from a few sites. To avoid duplication, I keep
two sets of long integers in memory, holding the md5 hashes of the
URLs and files crawled, and the spider ignores any that it has seen
before. I need to load them from two disk files when the scrapy job
starts, and save them to disk when it ends. Are there hooks or
something similar for start-up and shut-down tasks?
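
For reference, the bookkeeping described above looks roughly like the
following. This is only a minimal standalone sketch that does not touch
scrapy itself; the function names and the one-hash-per-line file format
are illustrative assumptions, not anything scrapy provides.

```python
import hashlib
import os


def md5_as_int(data):
    """Return the MD5 digest of a bytes value as a Python integer."""
    return int(hashlib.md5(data).hexdigest(), 16)


def load_seen(path):
    """Load one decimal hash per line; a missing file gives an empty set."""
    if not os.path.exists(path):
        return set()
    with open(path) as handle:
        return {int(line) for line in handle if line.strip()}


def save_seen(path, seen):
    """Write the set back to disk, one decimal hash per line."""
    with open(path, "w") as handle:
        for value in sorted(seen):
            handle.write("%d\n" % value)


def is_new(seen, data):
    """Record the hash of data and report whether it was unseen so far."""
    digest = md5_as_int(data)
    if digest in seen:
        return False
    seen.add(digest)
    return True
```

The idea is that load_seen() would run once at start-up for each of the
two files, is_new() would be called for every URL or file body, and
save_seen() would run once at shut-down, which is why start-up and
shut-down hooks are needed.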

How can I put a random waiting interval between HTTP GET calls?

Is there any way to set the proxy configuration in my Python code, or
do I have to set the environment variables http_proxy and https_proxy
before running scrapy?

thanks




