Groups > comp.lang.python > #19951 > unrolled thread
| Started by | silentnights <silentquote@gmail.com> |
|---|---|
| First post | 2012-02-07 01:33 -0800 |
| Last post | 2012-02-08 09:53 -0500 |
| Articles | 3 — 3 participants |
python file synchronization silentnights <silentquote@gmail.com> - 2012-02-07 01:33 -0800
Re: python file synchronization Cameron Simpson <cs@zip.com.au> - 2012-02-08 14:40 +1100
Re: python file synchronization Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-02-08 09:53 -0500
| From | silentnights <silentquote@gmail.com> |
|---|---|
| Date | 2012-02-07 01:33 -0800 |
| Subject | python file synchronization |
| Message-ID | <24cb8965-9211-4601-b096-4bf482aead18@h6g2000yqk.googlegroups.com> |
Hi All,

I have the following problem. I have an appliance (A) which generates records and writes them into a file (X); the appliance is accessible through FTP from a server (B). I have another central server (C) that runs a Django app, which needs to continuously get the records from file (X).

The problems are as follows:
1. (A) writes heavily to the file, so copying the file will result in an incomplete line at the end.
2. I have many (A)s and (B)s that I need to get the data from.
3. I can't afford to lose any records from file (X).

My current implementation is as follows:
1. Server (B) copies the file (X) through FTP.
2. Server (B) makes a copy of file (X) to file (Y.time_stamp), ignoring the last line to avoid incomplete lines.
3. Server (B) periodically makes copies of file (X) and copies the lines, starting from the previously ignored line, to file (Y.time_stamp).
4. Server (C) mounts the diffs_dir locally.
5. Server (C) creates file (Y.time_stamp.lock) in target_dir, then copies file (Y.time_stamp) to the local target_dir, then deletes (Y.time_stamp.lock).
6. A daemon running on Server (C) reads the file list from target_dir and processes those files that don't have a matching *.lock file. This procedure avoids reading a file until it is completely copied.

The above is implemented and working; the problem is that it requires so many syncs, has a high overhead, and is hard to debug.

I greatly appreciate your thoughts and suggestions.

Lastly, I want to note that I am not a programming guru, still a noob, but I am trying to learn from the experts. :-)
| From | Cameron Simpson <cs@zip.com.au> |
|---|---|
| Date | 2012-02-08 14:40 +1100 |
| Message-ID | <mailman.5532.1328672437.27778.python-list@python.org> |
| In reply to | #19951 |
On 07Feb2012 01:33, silentnights <silentquote@gmail.com> wrote:
| I have the following problem, I have an appliance (A) which generates
| records and write them into file (X), the appliance is accessible
| throw ftp from a server (B). I have another central server (C) that
| runs a Django App, that I need to get continuously the records from
| file (A).
|
| The problems are as follows:
| 1. (A) is heavily writing to the file, so copying the file will result
| of uncompleted line at the end.
| 2. I have many (A)s and (B)s that I need to get the data from.
| 3. I can't afford losing any records from file (X)
[...]
| The above is implemented and working, the problem is that It required
| so many syncs and has a high overhead and It's hard to debug.
Yep.
I would change the file discipline. Accept that FTP is slow and has no
locking. Accept that reading records from an actively growing file is
often tricky and sometimes impossible depending on the record format.
So don't. Hand off completed files regularly and keep the incomplete
file small.
Have (A) write records to a file whose name clearly shows the file to be
incomplete. Eg "data.new". Every so often (even once a second), _if_ the
file is not empty: close it, _rename_ to "data.timestamp" or
"data.sequence-number", open a new "data.new" for new records.
Have the FTP client fetch only the completed files.
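The writer-side hand-off described above could be sketched roughly as follows. This is only an illustration and assumes the code writing the records on (A) can be changed and is written in Python, which the thread does not confirm. The class name `RotatingRecordWriter`, the "data" base name, and the timestamp-plus-sequence naming scheme are all hypothetical choices made for the example.

```python
import os
import time


class RotatingRecordWriter:
    """Append records to "<basename>.new"; hand off completed files by renaming.

    Hypothetical helper for illustration, not code from the thread.
    """

    def __init__(self, directory, basename="data"):
        self.directory = directory
        self.basename = basename
        self.active_path = os.path.join(directory, basename + ".new")
        self.sequence = 0
        self.fh = open(self.active_path, "a")

    def write(self, record):
        # One record per line; the line terminator is added here.
        self.fh.write(record.rstrip("\n") + "\n")

    def rotate(self):
        """Close and rename the active file, but only if it is not empty.

        Returns the completed file's path, or None if there was nothing
        to hand off.
        """
        self.fh.flush()
        if os.path.getsize(self.active_path) == 0:
            return None
        self.fh.close()
        self.sequence += 1
        # Assumed naming scheme: data.<unix time>.<sequence number>
        done_name = "%s.%d.%06d" % (self.basename, int(time.time()), self.sequence)
        done_path = os.path.join(self.directory, done_name)
        # Same directory, so this is a rename rather than a copy.
        os.rename(self.active_path, done_path)
        self.fh = open(self.active_path, "a")
        return done_path

    def close(self):
        self.fh.close()
```

A caller would invoke `rotate()` on a timer (say once a second); anything not ending in ".new" is then safe for the FTP client to fetch.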
You can perform a similar effort for the socket daemon: look only for
completed data files. Reading the filenames from a directory is very
fast if you don't stat() them (i.e. just os.listdir). Just open and scan
any new files that appear.
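A minimal sketch of the reader side, under the same assumptions: completed files never end in ".new", each record is one line, and the daemon remembers which names it has already handled in an in-memory set. The function names and the `handle_record` callback are hypothetical; a real daemon would probably persist the "seen" set or move processed files elsewhere.

```python
import os


def find_completed(directory, seen, incomplete_suffix=".new"):
    """Return unprocessed, completed file names in sorted order.

    Uses only os.listdir(), so no per-file stat() calls are made.
    """
    names = [
        name
        for name in os.listdir(directory)
        if not name.endswith(incomplete_suffix) and name not in seen
    ]
    # Sorted by name; this matches creation order only if the naming
    # scheme sorts that way (an assumption of this sketch).
    return sorted(names)


def process_new_files(directory, seen, handle_record):
    """Scan every new completed file and pass each record to handle_record."""
    for name in find_completed(directory, seen):
        with open(os.path.join(directory, name)) as fh:
            for line in fh:
                handle_record(line.rstrip("\n"))
        # Mark as done only after the whole file has been read.
        seen.add(name)
```

Calling `process_new_files(directory, seen, handler)` in a loop with a short sleep gives the polling behaviour; files still named "data.new" are never opened.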
That would be my first cut.
--
Cameron Simpson <cs@zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/
Performing random acts of moral ambiguity.
- Jeff Miller <jxmill2@gonix.com>
| From | Dennis Lee Bieber <wlfraed@ix.netcom.com> |
|---|---|
| Date | 2012-02-08 09:53 -0500 |
| Message-ID | <mailman.5544.1328712858.27778.python-list@python.org> |
| In reply to | #19951 |
On Wed, 8 Feb 2012 08:57:43 +0200, Sherif Shehab Aldin
<silentquote@gmail.com> wrote:
>
>After searching more yesterday, I found that local mv is atomic, so instead
>of creating the lock files, I will copy the new diffs to tmp dir, and after
>the copy is over, mv it to actual diffs dir, that will avoid reading It
>while It's still being copied.
>
Are your tmp directory and your "diffs" directory on the same
physical volume? If so, "mv" is a rename operation, that only affects
the directory information. If the volumes are different, then "mv"
reverts to a copy/delete file operation.
To avoid problems in the future (say the "diffs" machine is
reconfigured with an additional drive and "tmp" is now mounted on the
new drive) you might be better off taking part of the suggestion to use
a special file name to indicate an "in-work" file...
diffs.timestamp.part
say, and when ready, just
mv diffs.timestamp.part diffs.timestamp
This leaves them in the same physical location and directory.
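In Python the same idea might look like the sketch below: copy to a ".part" name inside the destination directory, then rename in place. The function name `publish` and its parameters are hypothetical. Because both names live in the same directory, `os.rename` stays a pure rename; on POSIX systems that rename is atomic, while on Windows it raises an error if the final name already exists.

```python
import os
import shutil


def publish(src_path, dest_dir, final_name):
    """Copy src_path into dest_dir under a ".part" name, then rename it.

    Readers that skip "*.part" never see a half-copied file.
    """
    part_path = os.path.join(dest_dir, final_name + ".part")
    final_path = os.path.join(dest_dir, final_name)
    # The slow step: the file is visible only under its in-work name.
    shutil.copyfile(src_path, part_path)
    # The fast step: same directory, hence same volume, so just a rename.
    os.rename(part_path, final_path)
    return final_path
```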
--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com HTTP://wlfraed.home.netcom.com/