Groups > comp.lang.python > #25044 > unrolled thread
| Started by | Michael Hrivnak <mhrivnak@hrivnak.org> |
|---|---|
| First post | 2012-07-08 12:47 -0400 |
| Last post | 2012-07-09 16:47 -0400 |
| Articles | 3 — 2 participants |
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: How to safely maintain a status file Michael Hrivnak <mhrivnak@hrivnak.org> - 2012-07-08 12:47 -0400
Re: How to safely maintain a status file Plumo <richardbp@gmail.com> - 2012-07-08 22:52 -0700
Re: How to safely maintain a status file Michael Hrivnak <mhrivnak@hrivnak.org> - 2012-07-09 16:47 -0400
| From | Michael Hrivnak <mhrivnak@hrivnak.org> |
|---|---|
| Date | 2012-07-08 12:47 -0400 |
| Subject | Re: How to safely maintain a status file |
| Message-ID | <mailman.1920.1341766046.4697.python-list@python.org> |
What are you keeping in this status file that needs to be saved several times per second? Depending on what type of state you're storing and how persistent it needs to be, there may be a better way to store it.

Michael

On Sun, Jul 8, 2012 at 7:53 AM, Christian Heimes <lists@cheimes.de> wrote:
> On 08.07.2012 13:29, Richard Baron Penman wrote:
>> My initial solution was a thread that writes status to a tmp file
>> first and then renames:
>>
>> open(tmp_file, 'w').write(status)
>> os.rename(tmp_file, status_file)
>
> Your algorithm may not write and flush all data to disk. You need to do
> additional work. You must also store the tmpfile on the same partition
> (better: same directory) as the status file
>
> with open(tmp_file, "w") as f:
>     f.write(status)
>     # flush buffer and write data/metadata to disk
>     f.flush()
>     os.fsync(f.fileno())
>
> # now rename the file
> os.rename(tmp_file, status_file)
>
> # finally flush metadata of directory to disk
> dirfd = os.open(os.path.dirname(status_file), os.O_RDONLY)
> try:
>     os.fsync(dirfd)
> finally:
>     os.close(dirfd)
>
>
>> This works well on Linux but Windows raises an error when status_file
>> already exists.
>> http://docs.python.org/library/os.html#os.rename
>
> Windows doesn't support atomic renames if the right side exists. I
> suggest that you implement two code paths:
>
> if os.name == "posix":
>     rename = os.rename
> else:
>     def rename(a, b):
>         try:
>             os.rename(a, b)
>         except OSError as e:
>             if e.errno != 183:
>                 raise
>             os.unlink(b)
>             os.rename(a, b)
>
> Christian
>
> --
> http://mail.python.org/mailman/listinfo/python-list
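
For readers on newer interpreters, the write-then-rename approach quoted above might be sketched roughly as follows. This is only an illustration, not code from the thread: it assumes Python 3.3 or later, where `os.replace` overwrites an existing destination on both POSIX and Windows, and the function name `write_status_atomically` and the temp-file prefix are made up for the example.

```python
import os
import tempfile


def write_status_atomically(status_file, status):
    """Write the string `status` to `status_file` via a temp file and a rename.

    Hypothetical helper; assumes Python 3.3+ for os.replace().
    """
    directory = os.path.dirname(os.path.abspath(status_file))
    # Create the temp file next to the target so the rename never crosses
    # a filesystem boundary.
    fd, tmp_file = tempfile.mkstemp(dir=directory, prefix=".status-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(status)
            # Push Python's buffer to the OS, then ask the OS to write to disk.
            f.flush()
            os.fsync(f.fileno())
        # Replace the old status file in one step, even if it already exists.
        os.replace(tmp_file, status_file)
    except BaseException:
        # Remove the leftover temp file if anything failed before the rename.
        try:
            os.unlink(tmp_file)
        except FileNotFoundError:
            pass
        raise
    if os.name == "posix":
        # Flush the directory entry too, as in the quoted post (POSIX only).
        dirfd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dirfd)
        finally:
            os.close(dirfd)
```

Calling `write_status_atomically("crawler.status", "some state")` repeatedly should leave either the old or the new contents on disk, never a half-written file. Note that `tempfile.mkstemp` creates the file with owner-only permissions, which may or may not suit a given setup.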
| From | Plumo <richardbp@gmail.com> |
|---|---|
| Date | 2012-07-08 22:52 -0700 |
| Message-ID | <a3448382-831e-4fe8-ad89-0e7b71452e76@oo8g2000pbc.googlegroups.com> |
| In reply to | #25044 |
> What are you keeping in this status file that needs to be saved
> several times per second? Depending on what type of state you're
> storing and how persistent it needs to be, there may be a better way
> to store it.
>
> Michael

This is for a threaded web crawler. I want to cache which URLs are currently in the queue so that, if terminated, the crawler can continue next time from the same point.
| From | Michael Hrivnak <mhrivnak@hrivnak.org> |
|---|---|
| Date | 2012-07-09 16:47 -0400 |
| Message-ID | <mailman.1957.1341866844.4697.python-list@python.org> |
| In reply to | #25064 |
Please consider batching this data and doing larger writes. Thrashing the hard drive is not a good plan for performance or hardware longevity.

For example, crawl an entire FQDN and then write out the results in one operation. If your job fails in the middle and you have to start that FQDN over, no big deal. If that's too big a chunk for your purposes, perhaps break each FQDN up into top-level directories and crawl each of those in one operation before writing to disk.

There are existing solutions for managing job queues, so you can choose what you like. If you're unfamiliar, maybe start by looking at celery.

Michael

On Mon, Jul 9, 2012 at 1:52 AM, Plumo <richardbp@gmail.com> wrote:
>> What are you keeping in this status file that needs to be saved
>> several times per second? Depending on what type of state you're
>> storing and how persistent it needs to be, there may be a better way
>> to store it.
>>
>> Michael
>
> This is for a threaded web crawler. I want to cache which URLs are
> currently in the queue so that, if terminated, the crawler can continue
> next time from the same point.
> --
> http://mail.python.org/mailman/listinfo/python-list
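
One way the batching advice could look in a threaded crawler is sketched below. Everything here is an assumption made for illustration: the class name `BatchedQueueSnapshot`, the JSON file format, and the choice to batch by number of queue changes rather than by FQDN. It also assumes Python 3.3 or later for `os.replace`.

```python
import json
import os
import threading


class BatchedQueueSnapshot:
    """Track pending URLs in memory and write them to disk only every
    `batch_size` changes (hypothetical helper, not from the thread)."""

    def __init__(self, path, batch_size=100):
        self.path = path
        self.batch_size = batch_size
        self._pending = set()
        self._changes = 0
        self._lock = threading.Lock()  # crawler threads share this object

    def add(self, url):
        """Record that `url` has been queued."""
        with self._lock:
            self._pending.add(url)
            self._note_change()

    def done(self, url):
        """Record that `url` has been crawled and left the queue."""
        with self._lock:
            self._pending.discard(url)
            self._note_change()

    def flush(self):
        """Force a write, e.g. on clean shutdown."""
        with self._lock:
            if self._changes:
                self._write()

    def _note_change(self):
        # Caller holds the lock. Write only once enough changes pile up.
        self._changes += 1
        if self._changes >= self.batch_size:
            self._write()

    def _write(self):
        # Caller holds the lock. Write to a temp file, then rename over
        # the old snapshot so readers never see a partial file.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(sorted(self._pending), f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)
        self._changes = 0
```

With `batch_size=100`, a crash loses at most the last 99 queue changes, which matches the "start that chunk over, no big deal" trade-off described above. A larger batch means fewer disk writes but more repeated work after a restart.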