Re: program to generate data helpful in finding duplicate large files

Started by	Peter Otten <__peter__@web.de>
First post	2014-09-18 22:23 +0200
Last post	2014-09-18 22:23 +0200
Articles	1 — 1 participant

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

#78043 — Re: program to generate data helpful in finding duplicate large files

From	Peter Otten <__peter__@web.de>
Date	2014-09-18 22:23 +0200
Subject	Re: program to generate data helpful in finding duplicate large files
Message-ID	<mailman.14123.1411071819.18130.python-list@python.org>

David Alban wrote:

> *    sep = ascii_nul*
> 
> *    print "%s%c%s%c%d%c%d%c%d%c%d%c%s" % ( thishost, sep, md5sum, sep,
> dev, sep, ino, sep, nlink, sep, size, sep, file_path )*

file_path may contain newlines, so you should probably use "\0" to separate
the records. The other fields cannot contain whitespace, so it's safe to use
" " as the field separator. When you deserialize a record you can keep
file_path from being broken apart by passing maxsplit to the str.split()
method; with seven fields per record, maxsplit=6 leaves everything after the
sixth space in the last field:

for record in infile.read().split("\0"):
    print(record.split(" ", 6))
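
As a rough sketch of the full round trip: the helper names format_record and parse_records below are hypothetical and do not appear in the original program, and the sample values are made up. The writing side ends each record with "\0", and the reading side splits with maxsplit=6 so that a path containing spaces or newlines stays in one piece.

```python
FIELD_SEP = " "
RECORD_SEP = "\0"


def format_record(host, md5sum, dev, ino, nlink, size, file_path):
    """Serialize one record; file_path goes last because it may contain
    spaces or newlines (but never NUL)."""
    fields = [host, md5sum, str(dev), str(ino), str(nlink), str(size), file_path]
    return FIELD_SEP.join(fields) + RECORD_SEP


def parse_records(data):
    """Split data into records, then each record into at most seven fields."""
    for record in data.split(RECORD_SEP):
        if record:  # skip the empty string after the final terminator
            yield record.split(FIELD_SEP, 6)


# Made-up sample record with a space and a newline in the path.
data = format_record(
    "hostA", "d41d8cd98f00b204e9800998ecf8427e", 2049, 12345, 1, 0,
    "/tmp/odd name\nwith newline",
)
for fields in parse_records(data):
    print(fields)
```

Because this sketch writes "\0" after every record rather than only between records, the reader has to skip the empty string that follows the last terminator.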

Splitting into records without reading the whole data into memory left as an 
exercise ;)
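
One possible way to approach that exercise, offered only as a sketch: read the file in fixed-size chunks and hold on to whatever follows the last "\0" until the next chunk arrives. The function name iter_records and the chunk_size parameter are assumptions made for illustration, and the code expects a file object opened in text mode under Python 3.

```python
import io


def iter_records(infile, sep="\0", chunk_size=65536):
    """Yield sep-delimited records from a text-mode file, reading
    chunk_size characters at a time instead of the whole file."""
    buffer = ""
    while True:
        chunk = infile.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last separator is complete; the remainder
        # may be the start of a record that continues in the next chunk.
        *complete, buffer = buffer.split(sep)
        for record in complete:
            yield record
    if buffer:
        yield buffer  # final record that was not followed by a separator


# Made-up sample data; a tiny chunk_size forces records to span chunks.
sample = io.StringIO("h m 1 2 3 4 /a b\0h m 1 2 3 5 /c\nd\0")
for record in iter_records(sample, chunk_size=8):
    print(record.split(" ", 6))
```

Memory use is then bounded by the chunk size plus the longest single record rather than by the size of the whole file.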
