Re: How to safely maintain a status file

References <CAOV1wRVtm27yWez1HZuN8=ia-TyM2aXp9QCUbSZ5aZExP_ZChA@mail.gmail.com> <jtbsb1$8fj$1@dough.gmane.org> <mailman.1920.1341766046.4697.python-list@python.org> <a3448382-831e-4fe8-ad89-0e7b71452e76@oo8g2000pbc.googlegroups.com>
Date 2012-07-09 16:47 -0400
Subject Re: How to safely maintain a status file
From Michael Hrivnak <mhrivnak@hrivnak.org>
Newsgroups comp.lang.python
Message-ID <mailman.1957.1341866844.4697.python-list@python.org> (permalink)

Please consider batching this data and doing larger writes.  Thrashing
the hard drive is bad for both performance and hardware longevity.
For example, crawl an entire FQDN and then write out the results in
one operation.  If your job fails in the middle and you have to start
that FQDN over, no big deal.  If that's too big a chunk for your
purposes, perhaps break each FQDN up into top-level directories and
crawl each of those in one operation before writing to disk.
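
To make that concrete, here is a rough sketch of what I mean, not a
drop-in implementation.  The names crawl_all, write_status and
fetch_urls are made up for illustration, and fetch_urls stands in for
whatever your crawler does to collect the URLs of one FQDN.  Writing to
a temporary file and renaming it over the old one is my own assumption
about how you'd want to keep the status file intact if the job dies
mid-write.

```python
import json
import os
import tempfile


def write_status(path, data):
    """Write data as JSON to path with one write and one rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on
    # one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(data, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Readers see either the old file or the new one, never a
        # half-written file.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def crawl_all(fqdns, fetch_urls, status_path):
    """Crawl each FQDN fully in memory, then write once per FQDN.

    fetch_urls is a hypothetical callable: fqdn -> iterable of URLs.
    """
    done = {}
    for fqdn in fqdns:
        # Nothing touches the disk until the whole FQDN is finished.
        done[fqdn] = list(fetch_urls(fqdn))
        write_status(status_path, done)
    return done
```

If the job is killed partway through an FQDN, the status file still
holds every FQDN that finished, and you only redo the one in progress.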

There are existing solutions for managing job queues, so you can
choose whichever you like.  If you're unfamiliar with them, maybe
start by looking at Celery.

Michael

On Mon, Jul 9, 2012 at 1:52 AM, Plumo <richardbp@gmail.com> wrote:
>> What are you keeping in this status file that needs to be saved
>> several times per second?  Depending on what type of state you're
>> storing and how persistent it needs to be, there may be a better way
>> to store it.
>>
>> Michael
>
> This is for a threaded web crawler. I want to cache which URLs are
> currently in the queue so that, if terminated, the crawler can
> continue next time from the same point.
