python file synchronization

Started by	silentnights <silentquote@gmail.com>
First post	2012-02-07 01:33 -0800
Last post	2012-02-08 09:53 -0500
Articles	3 — 3 participants

  python file synchronization silentnights <silentquote@gmail.com> - 2012-02-07 01:33 -0800
    Re: python file synchronization Cameron Simpson <cs@zip.com.au> - 2012-02-08 14:40 +1100
    Re: python file synchronization Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-02-08 09:53 -0500

#19951 — python file synchronization

From	silentnights <silentquote@gmail.com>
Date	2012-02-07 01:33 -0800
Subject	python file synchronization
Message-ID	<24cb8965-9211-4601-b096-4bf482aead18@h6g2000yqk.googlegroups.com>

Hi All,

I have the following problem: I have an appliance (A) which generates
records and writes them into a file (X). The appliance is accessible
through FTP from a server (B). I have another central server (C) that
runs a Django app, which needs to continuously get the records from
file (X).

The problems are as follows:
1. (A) is heavily writing to the file, so copying the file will result
in an incomplete line at the end.
2. I have many (A)s and (B)s that I need to get the data from.
3. I can't afford to lose any records from file (X).

My current implementation is as follows:
1. Server (B) copies the file (X) through FTP.
2. Server (B) makes a copy of file (X) to file (Y.time_stamp), ignoring
the last line to avoid incomplete lines.
3. Server (B) periodically makes copies of file (X) and copies the lines
starting from the previously ignored line to file (Y.time_stamp).

4. Server (C) mounts the diffs_dir locally.
5. Server (C) creates file (Y.time_stamp.lock) in target_dir, then copies
file (Y.time_stamp) to the local target_dir, then deletes
(Y.time_stamp.lock).

6. A daemon running on Server (C) reads the file list from target_dir
and processes those files that don't have a matching *.lock file. This
procedure avoids reading a file until it is completely copied.
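
Steps 2 and 3 above could be sketched in Python roughly as follows. This is only an illustration, not the code actually running on (B): the function name copy_complete_lines, the idea of remembering a byte offset between runs, and the assumption that every record ends with a newline are all hypothetical.

```python
def copy_complete_lines(src_path, dst_path, offset=0):
    """Append to dst_path the complete lines of src_path found after offset.

    Returns the new offset, i.e. the start of the first incomplete line,
    so the next run can continue from there without losing or repeating data.
    """
    with open(src_path, "rb") as src:
        src.seek(offset)
        chunk = src.read()
    # rfind() returns -1 when there is no newline, so end becomes 0
    # and nothing is copied until the line is finished.
    end = chunk.rfind(b"\n") + 1
    if end:
        with open(dst_path, "ab") as dst:
            dst.write(chunk[:end])
    return offset + end
```

The caller would have to store the returned offset somewhere durable (a small state file, for example) so that a restart does not re-send or skip records.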

The above is implemented and working; the problem is that it requires
so many syncs, has a high overhead, and is hard to debug.

I greatly appreciate your thoughts and suggestions.

Lastly, I want to note that I am not a programming guru, still a noob,
but I am trying to learn from the experts. :-)

#20005

From	Cameron Simpson <cs@zip.com.au>
Date	2012-02-08 14:40 +1100
Message-ID	<mailman.5532.1328672437.27778.python-list@python.org>
In reply to	#19951

On 07Feb2012 01:33, silentnights <silentquote@gmail.com> wrote:
| I have the following problem: I have an appliance (A) which generates
| records and writes them into a file (X). The appliance is accessible
| through FTP from a server (B). I have another central server (C) that
| runs a Django app, which needs to continuously get the records from
| file (X).
| 
| The problems are as follows:
| 1. (A) is heavily writing to the file, so copying the file will result
| in an incomplete line at the end.
| 2. I have many (A)s and (B)s that I need to get the data from.
| 3. I can't afford to lose any records from file (X).
[...]
| The above is implemented and working; the problem is that it requires
| so many syncs, has a high overhead, and is hard to debug.

Yep.

I would change the file discipline. Accept that FTP is slow and has no
locking. Accept that reading records from an actively growing file is
often tricky and sometimes impossible depending on the record format.
So don't. Hand off completed files regularly and keep the incomplete
file small.

Have (A) write records to a file whose name clearly shows the file to be
incomplete. Eg "data.new". Every so often (even once a second), _if_ the
file is not empty: close it, _rename_ to "data.timestamp" or
"data.sequence-number", open a new "data.new" for new records. 

Have the FTP client fetch only the completed files.
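
The hand-off discipline described above (write to "data.new", rename when non-empty, start a fresh "data.new") might look like this if the writer were a Python process. That is an assumption for illustration only; the appliance may not be scriptable at all. The class name RotatingRecordFile and the zero-padded sequence number are hypothetical choices.

```python
import os

class RotatingRecordFile:
    """Write records to <basename>.new and hand off completed files by rename."""

    def __init__(self, directory, basename="data"):
        self.directory = directory
        self.basename = basename
        self.seq = 0
        self.active = os.path.join(directory, basename + ".new")
        self.fp = open(self.active, "a")

    def write(self, record):
        self.fp.write(record + "\n")

    def rotate(self):
        """Rename the active file if it is non-empty.

        Returns the path of the completed file, or None if there was
        nothing to hand off.
        """
        self.fp.flush()
        if os.path.getsize(self.active) == 0:
            return None
        self.fp.close()
        self.seq += 1
        done = os.path.join(self.directory,
                            "%s.%06d" % (self.basename, self.seq))
        os.rename(self.active, done)      # same directory, so a plain rename
        self.fp = open(self.active, "a")  # fresh file for new records
        return done
```

Calling rotate() from a timer (say once a second) keeps the incomplete file small, and anything not named "*.new" is safe to fetch.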

You can perform a similar effort for the socket daemon: look only for
completed data files. Reading the filenames from a directory is very
fast if you don't stat() them (i.e. just os.listdir). Just open and scan
any new files that appear.
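
A minimal sketch of that scan, assuming the incomplete file is recognisable by a ".new" suffix and that the caller keeps a set of names it has already processed (the function name new_completed_files and both parameters are made up for this example):

```python
import os

def new_completed_files(directory, seen, incomplete_suffix=".new"):
    """Return names of completed files not seen before, oldest name first.

    Only os.listdir() is used, so no stat() call is made per file.
    The 'seen' set is updated in place.
    """
    names = [name for name in os.listdir(directory)
             if not name.endswith(incomplete_suffix) and name not in seen]
    names.sort()  # timestamp or sequence-number names sort chronologically
    seen.update(names)
    return names
```

The sort only gives chronological order if the timestamp or sequence number in the name is fixed-width.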

That would be my first cut.
-- 
Cameron Simpson <cs@zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/

Performing random acts of moral ambiguity.
        - Jeff Miller <jxmill2@gonix.com>

#20024

From	Dennis Lee Bieber <wlfraed@ix.netcom.com>
Date	2012-02-08 09:53 -0500
Message-ID	<mailman.5544.1328712858.27778.python-list@python.org>
In reply to	#19951

On Wed, 8 Feb 2012 08:57:43 +0200, Sherif Shehab Aldin
<silentquote@gmail.com> wrote:

>
>After searching more yesterday, I found that a local mv is atomic, so instead
>of creating the lock files, I will copy the new diffs to a tmp dir and, after
>the copy is over, mv them to the actual diffs dir. That will avoid reading a
>file while it's still being copied.
>
	Are your tmp directory and your "diffs" directory on the same
physical volume? If so, "mv" is a rename operation, that only affects
the directory information. If the volumes are different, then "mv"
reverts to a copy/delete file operation.
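
	One way to check this from Python, as a rough sketch: compare the device numbers that os.stat() reports for the two directories. The helper name same_filesystem is made up here, and the check reflects the mounts only at the moment it runs.

```python
import os

def same_filesystem(path_a, path_b):
    """Return True if both paths report the same device number.

    When this is False, a move between the two paths cannot be a plain
    rename and has to fall back to copying the data.
    """
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev
```

	On POSIX systems os.rename() raises OSError when source and destination are on different filesystems, so a script that relies on rename semantics would fail loudly there rather than copy silently.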

	To avoid problems in the future (say the "diffs" machine is
reconfigured with an additional drive and "tmp" is now mounted on the
new drive) you might be better off taking part of the suggestion to use
a special file name to indicate an "in-work" file...

diffs.timestamp.part

say, and when ready, just

mv diffs.timestamp.part diffs.timestamp

	This leaves them in the same physical location and directory.
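
	In Python the same idea could be sketched like this. The function name publish and its arguments are hypothetical; the only points being illustrated are that the copy goes to a ".part" name inside the destination directory and that the final step is a rename within that directory.

```python
import os
import shutil

def publish(src_path, dest_dir, name):
    """Copy src_path into dest_dir under a temporary .part name, then rename.

    Readers that skip *.part files never see a half-copied file, and the
    rename stays inside one directory, so it cannot cross volumes.
    """
    part = os.path.join(dest_dir, name + ".part")
    final = os.path.join(dest_dir, name)
    shutil.copyfile(src_path, part)
    os.rename(part, final)
    return final
```

	The reading daemon then only needs to ignore names ending in ".part"; no lock files are involved.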
-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
        wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/
