

Groups > comp.lang.python > #25044 > unrolled thread

Re: How to safely maintain a status file

Started by: Michael Hrivnak <mhrivnak@hrivnak.org>
First post: 2012-07-08 12:47 -0400
Last post: 2012-07-09 16:47 -0400
Articles: 3, 2 participants


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.


Contents

  Re: How to safely maintain a status file Michael Hrivnak <mhrivnak@hrivnak.org> - 2012-07-08 12:47 -0400
    Re: How to safely maintain a status file Plumo <richardbp@gmail.com> - 2012-07-08 22:52 -0700
      Re: How to safely maintain a status file Michael Hrivnak <mhrivnak@hrivnak.org> - 2012-07-09 16:47 -0400

#25044 — Re: How to safely maintain a status file

From: Michael Hrivnak <mhrivnak@hrivnak.org>
Date: 2012-07-08 12:47 -0400
Subject: Re: How to safely maintain a status file
Message-ID: <mailman.1920.1341766046.4697.python-list@python.org>

What are you keeping in this status file that needs to be saved
several times per second?  Depending on what type of state you're
storing and how persistent it needs to be, there may be a better way
to store it.

Michael

On Sun, Jul 8, 2012 at 7:53 AM, Christian Heimes <lists@cheimes.de> wrote:
> Am 08.07.2012 13:29, schrieb Richard Baron Penman:
>> My initial solution was a thread that writes status to a tmp file
>> first and then renames:
>>
>> open(tmp_file, 'w').write(status)
>> os.rename(tmp_file, status_file)
>
> Your algorithm may not write and flush all data to disk. You need to do
> additional work. You must also store the tmpfile on the same partition
> (better: same directory) as the status file.
>
> with open(tmp_file, "w") as f:
>     f.write(status)
>     # flush buffer and write data/metadata to disk
>     f.flush()
>     os.fsync(f.fileno())
>
> # now rename the file
> os.rename(tmp_file, status_file)
>
> # finally flush metadata of directory to disk
> dirfd = os.open(os.path.dirname(status_file), os.O_RDONLY)
> try:
>     os.fsync(dirfd)
> finally:
>     os.close(dirfd)
>
>
>> This works well on Linux but Windows raises an error when status_file
>> already exists.
>> http://docs.python.org/library/os.html#os.rename
>
> Windows doesn't support atomic renames if the right side exists.  I
> suggest that you implement two code paths:
>
> if os.name == "posix":
>     rename = os.rename
> else:
>     def rename(a, b):
>         try:
>             os.rename(a, b)
>         except OSError as e:
>             if e.errno != 183:
>                 raise
>             os.unlink(b)
>             os.rename(a, b)
>
> Christian
>
> --
> http://mail.python.org/mailman/listinfo/python-list
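
For readers on Python 3.3 or later, the two code paths quoted above can probably be collapsed into one, because os.replace() overwrites an existing destination on both POSIX and Windows. The following is only a sketch under that assumption, not Christian's code: the function name write_status_atomically and the ".status-" temp-file prefix are made up for illustration, and on Windows the replace can still fail if another process holds the status file open.

```python
import os
import tempfile


def write_status_atomically(status_file, status):
    """Write the string `status` to `status_file` via a temp file and a rename."""
    directory = os.path.dirname(os.path.abspath(status_file))
    # Create the temp file in the same directory so the rename never crosses
    # a filesystem boundary.
    fd, tmp_file = tempfile.mkstemp(dir=directory, prefix=".status-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(status)
            # Flush Python's buffer, then ask the OS to write the data to disk.
            f.flush()
            os.fsync(f.fileno())
        # os.replace() overwrites an existing destination on POSIX and Windows.
        os.replace(tmp_file, status_file)
    except BaseException:
        # Remove the temp file if anything failed before the rename completed.
        try:
            os.unlink(tmp_file)
        except FileNotFoundError:
            pass
        raise
    if os.name == "posix":
        # Flush the directory entry as well, as in the quoted code.
        dirfd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dirfd)
        finally:
            os.close(dirfd)
```

Calling write_status_atomically("crawler.status", "some text") repeatedly should leave either the old or the new contents in place, never a half-written file. The directory fsync is skipped on Windows because opening a directory with os.open() is not supported there.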



#25064

From: Plumo <richardbp@gmail.com>
Date: 2012-07-08 22:52 -0700
Message-ID: <a3448382-831e-4fe8-ad89-0e7b71452e76@oo8g2000pbc.googlegroups.com>
In reply to: #25044

> What are you keeping in this status file that needs to be saved
> several times per second?  Depending on what type of state you're
> storing and how persistent it needs to be, there may be a better way
> to store it.
>
> Michael

This is for a threaded web crawler. I want to cache which URLs are
currently in the queue so that, if terminated, the crawler can continue
next time from the same point.
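
A minimal sketch of that idea, assuming Python 3.3+ and a JSON status file. The class name PersistentURLQueue and its methods are hypothetical, chosen only to illustrate a lock-protected queue whose pending URLs can be snapshotted and restored; this is not the crawler's actual code.

```python
import json
import os
import threading
from collections import deque


class PersistentURLQueue:
    """Thread-safe FIFO of URLs that can be saved to and reloaded from a file."""

    def __init__(self, status_file):
        self._status_file = status_file
        self._lock = threading.Lock()       # guards the in-memory queue
        self._save_lock = threading.Lock()  # serialises writers of the temp file
        self._urls = deque(self._load())

    def _load(self):
        # Resume from the previous run if a status file exists.
        try:
            with open(self._status_file) as f:
                return json.load(f)
        except FileNotFoundError:
            return []

    def put(self, url):
        with self._lock:
            self._urls.append(url)

    def get(self):
        # Returns None when the queue is empty.
        with self._lock:
            return self._urls.popleft() if self._urls else None

    def save(self):
        # Snapshot under the queue lock, then write without blocking workers.
        with self._lock:
            pending = list(self._urls)
        with self._save_lock:
            tmp_file = self._status_file + ".tmp"
            with open(tmp_file, "w") as f:
                json.dump(pending, f)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp_file, self._status_file)
```

One limitation of this sketch: a URL that a worker has already taken with get() but not finished crawling is absent from the snapshot, so it would be lost if the crawler died at that moment.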



#25097

From: Michael Hrivnak <mhrivnak@hrivnak.org>
Date: 2012-07-09 16:47 -0400
Message-ID: <mailman.1957.1341866844.4697.python-list@python.org>
In reply to: #25064

Please consider batching this data and doing larger writes.  Thrashing
the hard drive is not a good plan for performance or hardware
longevity.  For example, crawl an entire FQDN and then write out the
results in one operation.  If your job fails in the middle and you
have to start that FQDN over, no big deal.  If that's too big of a
chunk for your purposes, perhaps break each FQDN up into top-level
directories and crawl each of those in one operation before writing to
disk.
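
As a rough illustration of the per-FQDN batching described above, here is a sketch assuming Python 3.3+. The helper names (group_by_host, load_done_hosts, crawl_by_host) and the crawl_url callback are hypothetical; the status file records only which hosts have been fully crawled, so it is written once per host instead of several times per second.

```python
import json
import os
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host(urls):
    """Group URLs by hostname, preserving first-seen order of hosts."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlsplit(url).hostname or ""].append(url)
    return dict(groups)


def load_done_hosts(status_file):
    """Return the set of hosts already completed in a previous run."""
    try:
        with open(status_file) as f:
            return set(json.load(f))
    except FileNotFoundError:
        return set()


def crawl_by_host(urls, crawl_url, status_file):
    """Crawl one host at a time and write the status file once per host."""
    done = load_done_hosts(status_file)
    for host, host_urls in group_by_host(urls).items():
        if host in done:
            continue  # finished in an earlier run
        for url in host_urls:
            crawl_url(url)
        done.add(host)
        # One write per completed host, via temp file plus rename.
        tmp_file = status_file + ".tmp"
        with open(tmp_file, "w") as f:
            json.dump(sorted(done), f)
        os.replace(tmp_file, status_file)
```

If the job dies partway through a host, that host is simply crawled again on the next run, which matches the "no big deal" trade-off above.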

There are existing solutions for managing job queues, so you can
choose what you like.  If you're unfamiliar, maybe start by looking at
celery.

Michael

On Mon, Jul 9, 2012 at 1:52 AM, Plumo <richardbp@gmail.com> wrote:
>> What are you keeping in this status file that needs to be saved
>> several times per second?  Depending on what type of state you're
>> storing and how persistent it needs to be, there may be a better way
>> to store it.
>>
>> Michael
>
> This is for a threaded web crawler. I want to cache which URLs are
> currently in the queue so that, if terminated, the crawler can continue
> next time from the same point.
> --
> http://mail.python.org/mailman/listinfo/python-list


