From: Chris Kaynor
Date: Thu, 18 Sep 2014 11:45:41 -0700
Subject: Re: program to generate data helpful in finding duplicate large files
To: "python-list@python.org"
Newsgroups: comp.lang.python

On Thu, Sep 18, 2014 at 11:11 AM, David Alban wrote:

> #!/usr/bin/python
>
> import argparse
> import hashlib
> import os
> import re
> import socket
> import sys
>
> from stat import *

Generally, "from <module> import *" imports are discouraged, as they tend
to clutter your namespace and can accidentally override names you have
already imported or defined. It's more Pythonic to use the other import
forms ("import <module>" or "import <module> as <name>") and reference
names through the module's namespace, as you are doing everywhere else.

The main case where "from <module> import *" is recommended is API imports
(for example, importing the API of one module into another, such as for
inter-platform, inter-version, or accelerator support).
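
As a small illustration of the namespaced style, here is a minimal sketch of
the regular-file check written with "import stat" instead of the star import.
The helper name is_regular_file is hypothetical and not part of the original
script.

```python
import os
import stat


def is_regular_file(path):
    """Return True if path is a regular file (hypothetical helper)."""
    # lstat does not follow symbolic links, so a link is reported as a
    # link and never as the regular file it may point to.
    mode = os.lstat(path).st_mode
    # S_ISREG is reached through the stat module, so it is obvious where
    # the name comes from and nothing is dumped into this namespace.
    return stat.S_ISREG(mode)
```

Because os.lstat is used, no separate S_ISLNK test is needed for this check.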

> ascii_nul = chr(0)
>
>      # from: http://stackoverflow.com/questions/1131220/get-md5-hash-of-big-files-in-python
>      # except that i use hexdigest() rather than digest()
> def md5_for_file(f, block_size=2**20):
>   md5 = hashlib.md5()
>   while True:
>     data = f.read(block_size)
>     if not data:
>       break
>     md5.update(data)
>   return md5.hexdigest()
>
> thishost = socket.gethostname()
>
> parser = argparse.ArgumentParser(description='scan files in a tree and print a line of information about each regular file')
> parser.add_argument('--start-directory', '-d', default='.', help='specifies the root of the filesystem tree to be processed')
> args = parser.parse_args()
>
> start_directory = re.sub( '/+$', '', args.start_directory )

I'm not sure this is actually needed. It's also not platform-independent,
as some platforms (e.g., Windows) primarily use "\" instead.

> for directory_path, directory_names, file_names in os.walk( start_directory ):
>   for file_name in file_names:
>     file_path = "%s/%s" % ( directory_path, file_name )

os.path.join would be more cross-platform than the string formatting.
Basically, this line would become:

    file_path = os.path.join(directory_path, file_name)

os.path.join also inserts the separator only when it is needed, so a
directory path that already ends in a slash does not pick up a second one.

>     lstat_info = os.lstat( file_path )
>
>     mode = lstat_info.st_mode
>
>     if not S_ISREG( mode ) or S_ISLNK( mode ):
>       continue
>
>     f = open( file_path, 'r' )
>     md5sum = md5_for_file( f )

The Pythonic thing to do here would be to use a "with" statement to ensure
the file is closed in a timely manner. This requires Python 2.6 or newer
(2.5 works as well with "from __future__ import with_statement"). The two
lines above would become:

    with open( file_path, 'r' ) as f:
        md5sum = md5_for_file( f )

I do note that you never explicitly close the files (which is done via the
with statement in my example).
While generally fine, as CPython will close them automatically when they
are no longer referenced, it's not a good practice to get into. Other
implementations of Python may delay closing the file, which could then
result in errors (such as running out of file handles) when processing a
huge number of files. The with statement will ensure the file is closed
immediately after the md5 computation finishes, even if there is an error
computing the md5. Note that in any case, the OS should automatically close
the file when the process exits, but relying on that is likely even worse
than relying on Python to close them for you.

Additionally, you may want to specify binary mode by using
open(file_path, 'rb') to ensure platform-independence ('r' opens the file
in text mode, so on Windows "\r\n" is translated to "\n" while reading,
which changes the data being hashed). Some platforms also treat binary
files differently in other ways.

You may also want to put some additional error handling in here. For
example, the file could be deleted between the "walk" call and the "open"
call, or the file may not be readable (locked by other processes, incorrect
permissions, etc). Without knowing your use case, you may need to deal with
those cases, or maybe having the script fail out with an error message is
good enough.

>     dev   = lstat_info.st_dev
>     ino   = lstat_info.st_ino
>     nlink = lstat_info.st_nlink
>     size  = lstat_info.st_size
>
>     sep = ascii_nul
>
>     print "%s%c%s%c%d%c%d%c%d%c%d%c%s" % ( thishost, sep, md5sum, sep, dev, sep, ino, sep, nlink, sep, size, sep, file_path )

You could use sep.join([thishost, md5sum, dev, ino, nlink, size, file_path])
rather than a string format here (note that join takes a single sequence),
presuming all the input values are strings (you can call the str function
on the values to convert them, which will do the same as the "%s"
formatter).
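
The points about "with", binary mode, error handling, and sep.join can be
sketched together as follows. This is only one possible arrangement: the
names md5_for_path and format_record are hypothetical, and returning None
for an unreadable file is an assumption made for the example, not something
the original script does.

```python
import hashlib


def md5_for_path(path, block_size=2**20):
    """Return the hex MD5 of the file at path, or None if it cannot be read."""
    md5 = hashlib.md5()
    try:
        # 'rb' avoids any newline translation, and "with" closes the file
        # even if read() raises part-way through.
        with open(path, 'rb') as f:
            for block in iter(lambda: f.read(block_size), b''):
                md5.update(block)
    except (IOError, OSError):
        # Assumption for this sketch: a file that was deleted since the
        # walk, is locked, or is not permitted is reported as None.
        return None
    return md5.hexdigest()


def format_record(fields, sep='\0'):
    """Join fields with sep, converting each one with str() first."""
    return sep.join(str(field) for field in fields)
```

A caller could then skip any file for which md5_for_path returns None, or
treat that as a fatal error, depending on the use case.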

I don't know how much control you have over the output format (you said you
intend to pipe this output into other code), but if you can change it, I
would suggest either using a pure binary format, using a more
human-readable separator than chr(0), or at least providing an argument to
the script to set the separator (I believe many Linux command-line tools
have a -0 argument for this).

Also, it seems odd that you include socket.gethostname() in the output, as
that will always be the system you are running the code on, and not
necessarily the system you are retrieving data for (os.walk will work on
network paths, including UNC paths).

> exit( 0 )

The only other thing I see is that I would probably break the code into a
few additional functions, and put the argument parsing and initial call
into an "if __name__ == '__main__':" block. This would allow your code to
be imported in the future and called by other Python scripts as a module,
as well as allowing it to be executed as a script from the command line.
This will not matter if you only ever intend to use this script as a
command-line call, but could be useful if you want to reuse the code later
in a larger project.

To do this, however, you would need to make the function yield/return the
results, rather than directly print.
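
A rough sketch of that structure is shown below, under a few assumptions:
the names iter_regular_files and main are hypothetical, the directories are
taken as positional arguments rather than via --start-directory purely to
keep the example short, and only the size and path are printed instead of
the full record.

```python
import argparse
import os
import stat


def iter_regular_files(start_directory):
    """Yield (file_path, lstat_result) for each regular file in the tree."""
    for directory_path, _directory_names, file_names in os.walk(start_directory):
        for file_name in file_names:
            file_path = os.path.join(directory_path, file_name)
            try:
                lstat_info = os.lstat(file_path)
            except OSError:
                # The file vanished between the walk and the lstat call.
                continue
            if stat.S_ISREG(lstat_info.st_mode):
                yield file_path, lstat_info


def main(argv=None):
    parser = argparse.ArgumentParser(
        description='print the size and path of each regular file')
    parser.add_argument('directories', nargs='*',
                        help='roots of the trees to scan')
    args = parser.parse_args(argv)
    for directory in args.directories:
        for file_path, lstat_info in iter_regular_files(directory):
            print('%d\t%s' % (lstat_info.st_size, file_path))


if __name__ == '__main__':
    # Runs only when executed as a script, not when imported as a module.
    main()
```

Because iter_regular_files yields its results instead of printing them,
another script can import the module and consume the generator directly.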