Path: csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!feeder.erje.net!eu.feeder.erje.net!newsfeed.xs4all.nl!newsfeed2.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <CALDD_==AYbQNPu29jRoLFp8WPZaZ9mMs79334-m_z3dgdxZRJw@mail.gmail.com>
References: <CALDD_==AYbQNPu29jRoLFp8WPZaZ9mMs79334-m_z3dgdxZRJw@mail.gmail.com>
From: Chris Kaynor <ckaynor@zindagigames.com>
Date: Thu, 18 Sep 2014 11:45:41 -0700
Subject: Re: program to generate data helpful in finding duplicate large files
To: "python-list@python.org" <python-list@python.org>
Content-Type: multipart/alternative; boundary=047d7bae483664bc3505035b6589
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.14115.1411065969.18130.python-list@python.org>
Lines: 340
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:78032

--047d7bae483664bc3505035b6589
Content-Type: text/plain; charset=UTF-8

On Thu, Sep 18, 2014 at 11:11 AM, David Alban <extasia@extasia.org> wrote:

> *#!/usr/bin/python*
>
> *import argparse*
> *import hashlib*
> *import os*
> *import re*
> *import socket*
> *import sys*
>
> *from stat import **
>

Generally, "from module import *" imports are discouraged because they
pollute your namespace and can silently override names you have already
defined or imported. It's more Pythonic to use a plain import (or "import
... as") and reference names through the module namespace, as you are doing
everywhere else. The main case where "from module import *" is recommended
is re-exporting an API (for example, importing the API of one module into
another, such as for inter-platform, inter-version, or accelerator support).
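
For instance, a minimal sketch of the namespaced form for the stat checks
used later in the script (the helper name is_regular_file is hypothetical,
not something from the original code):

```python
import os
import stat


def is_regular_file(path):
    """Return True if path is a regular file and not a symbolic link."""
    mode = os.lstat(path).st_mode
    # os.lstat does not follow symlinks, so a link reports S_ISLNK here,
    # never S_ISREG; a separate S_ISLNK test is not needed.
    return stat.S_ISREG(mode)
```

With this form it is obvious at the call site where S_ISREG comes from.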


>
> *ascii_nul = chr(0)*
>
> *     # from:
> http://stackoverflow.com/questions/1131220/get-md5-hash-of-big-files-in-python*
> *     # except that i use hexdigest() rather than digest()*
> *def md5_for_file(f, block_size=2**20):*
> *  md5 = hashlib.md5()*
> *  while True:*
> *    data = f.read(block_size)*
> *    if not data:*
> *      break*
> *    md5.update(data)*
> *  return md5.hexdigest()*
>
> *thishost = socket.gethostname()*
>
> *parser = argparse.ArgumentParser(description='scan files in a tree and
> print a line of information about each regular file')*
> *parser.add_argument('--start-directory', '-d', default='.',
> help='specifies the root of the filesystem tree to be processed')*
> *args = parser.parse_args()*
>
> *start_directory = re.sub( '/+$', '', args.start_directory )*
>

I'm not sure this is actually needed. It's also not platform-independent, as
some platforms (e.g., Windows) primarily use "\" instead.
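
If you do want to strip trailing separators, something along these lines
would work with whatever separators the current platform uses. This is only
a sketch; the function name strip_trailing_separators is my own invention:

```python
import os


def strip_trailing_separators(path):
    """Drop trailing path separators, keeping a bare root such as '/'."""
    # os.altsep is None on platforms with only one separator.
    seps = os.sep + (os.altsep or '')
    # Keep a Windows drive prefix such as 'C:' out of the stripping.
    drive, rest = os.path.splitdrive(path)
    stripped = rest.rstrip(seps)
    if rest and not stripped:
        # The path was nothing but separators (a root); keep one of them.
        stripped = rest[0]
    return drive + stripped
```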


>
> *for directory_path, directory_names, file_names in os.walk(
> start_directory ):*
> *  for file_name in file_names:*
> *    file_path = "%s/%s" % ( directory_path, file_name )*
>

os.path.join would be more cross-platform than the string formatting.
Basically, this line would become

file_path = os.path.join(directory_path, file_name)

os.path.join will also avoid inserting a second separator when
directory_path already ends with one.
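
To illustrate the behaviour, here is a small sketch that calls posixpath and
ntpath directly. Those are the modules behind os.path on POSIX and Windows,
so the results do not depend on where you run it. The paths are made-up
examples:

```python
import ntpath
import posixpath

# A trailing separator on the directory is not doubled.
joined_posix = posixpath.join('/data/', 'report.txt')

# On Windows the backslash is used as the separator.
joined_windows = ntpath.join('C:\\data', 'report.txt')

# Caveat: an absolute second argument discards the first one entirely.
joined_absolute = posixpath.join('/data', '/etc/hosts')
```

Since os.walk only yields bare file names, the last case should not come up
in your loop, but it is worth knowing about.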


> *    lstat_info = os.lstat( file_path )*
>
> *    mode = lstat_info.st_mode*
>
> *    if not S_ISREG( mode ) or S_ISLNK( mode ):*
> *      continue*
>
> *    f = open( file_path, 'r' )*
>
> *    md5sum = md5_for_file( f )*
>

The Pythonic thing to do here would be to use a "with" statement to ensure
the file is closed in a timely manner. This requires Python 2.6 or newer
(2.5 works as well with "from __future__ import with_statement").
The above two lines would become:

with open( file_path, 'r' ) as f:
    md5sum = md5_for_file( f )


I do note that you never explicitly close the files (the with statement in
my example does this for you). While this generally works, as CPython closes
a file as soon as it is no longer referenced, it's not a good habit to get
into. Other implementations of Python (PyPy or Jython, for example) may
delay closing the file until a garbage collection pass, which could result
in errors, such as running out of file descriptors, when processing a huge
number of files. The with statement ensures the file is closed immediately
after the md5 computation finishes, even if there is an error computing the
md5. Note that in any case the OS will close the files when the process
exits, but relying on that is even worse than relying on Python to close
them for you.

Additionally, you should specify binary mode by using open(file_path, 'rb')
to ensure platform-independence. 'r' opens the file in text mode, which on
Windows converts "\r\n" to "\n" while reading, so the md5 would not match
the bytes actually on disk. On platforms that make no distinction between
text and binary files, 'rb' is harmless.

You may also want to put some additional error handling in here. For
example, the file could be deleted between the "walk" call and the "open"
call, or the file may not be readable (locked by another process, incorrect
permissions, etc). Depending on your use case, you may need to deal with
those cases, or maybe having the script fail with an error message is good
enough.
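
Putting those three points together (with statement, binary mode, error
handling), a rough sketch could look like the following. The name
md5_of_path and the policy of returning None for unreadable files are my
assumptions; you may well prefer to log the error or let it propagate:

```python
import hashlib


def md5_of_path(file_path, block_size=2**20):
    """Return the hex MD5 of file_path, or None if it cannot be read."""
    md5 = hashlib.md5()
    try:
        # 'rb' avoids newline translation; 'with' closes the file promptly.
        with open(file_path, 'rb') as f:
            # Read fixed-size blocks until read() returns an empty string.
            for block in iter(lambda: f.read(block_size), b''):
                md5.update(block)
    except (IOError, OSError):
        # Assumed policy: skip files that vanished or are unreadable.
        return None
    return md5.hexdigest()
```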


> *    dev   = lstat_info.st_dev*
> *    ino   = lstat_info.st_ino*
> *    nlink = lstat_info.st_nlink*
> *    size  = lstat_info.st_size*
>
> *    sep = ascii_nul*
>
> *    print "%s%c%s%c%d%c%d%c%d%c%d%c%s" % ( thishost, sep, md5sum, sep,
> dev, sep, ino, sep, nlink, sep, size, sep, file_path )*
>

You could use sep.join([thishost, md5sum, str(dev), str(ino), str(nlink),
str(size), file_path]) rather than a string format here. Note that join
takes a single sequence and requires every item to be a string, so the
numeric values need to be converted with the str function first (which does
the same as the "%s" formatter).
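
A small helper makes the conversion less repetitive. This is just a sketch,
and format_record is a hypothetical name:

```python
def format_record(fields, sep='\0'):
    """Join fields of any type into one sep-delimited string."""
    # str() is applied to every field, so ints and strings can be mixed.
    return sep.join(str(field) for field in fields)
```

You would then call it as format_record([thishost, md5sum, dev, ino, nlink,
size, file_path]) inside the loop.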

I don't know how much control you have over the output format (you said you
intend to pipe this output into other code), but if you can change it, I
would suggest either using a pure binary format, using a more
human-readable separator than chr(0), or at least providing an argument to
the script to set the separator (several Unix tools have a similar switch
for NUL-separated data, such as xargs -0).
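
As a sketch of the last option, the separator could be exposed through
argparse. The --separator/-s option is hypothetical and not part of your
script; the default keeps the current NUL behaviour:

```python
import argparse


def build_parser():
    """Build the command-line parser, including a separator option."""
    parser = argparse.ArgumentParser(
        description='scan files in a tree and print a line of information '
                    'about each regular file')
    parser.add_argument(
        '--start-directory', '-d', default='.',
        help='specifies the root of the filesystem tree to be processed')
    # Hypothetical option: lets the caller choose the field separator.
    parser.add_argument(
        '--separator', '-s', default='\0',
        help='string placed between output fields (default: NUL)')
    return parser
```

NUL has to stay the default rather than something the user passes in,
because a command-line argument cannot contain a NUL character.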

Also, it seems odd that you include socket.gethostname() in the output, as
that will always be the system you are running the code on, and not the
system you are retrieving data for (os.walk will work on network paths,
including UNC paths).

> *exit( 0 )*
>

The only other thing I see is that I would probably break the code into a
few additional functions, and put the argument parsing and initial call
into an "if __name__ == '__main__':" block. This would allow your code to be
imported and called by other Python scripts as a module, while still
allowing it to be executed as a script from the command line. This will not
matter if you only ever intend to use this script from the command line, but
could be useful if you want to reuse the code later in a larger project.

To do this, however, you would need to make the functions yield or return
the results, rather than print them directly.
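
A rough sketch of that layout follows. The names scan_tree and main are my
own; to keep it short I left out the hostname field and take the directory
as a plain positional argument instead of using argparse:

```python
import hashlib
import os
import stat
import sys


def md5_for_file(f, block_size=2**20):
    """Return the hex MD5 of an open binary file object."""
    md5 = hashlib.md5()
    for block in iter(lambda: f.read(block_size), b''):
        md5.update(block)
    return md5.hexdigest()


def scan_tree(start_directory):
    """Yield (md5sum, dev, ino, nlink, size, file_path) per regular file."""
    for directory_path, _, file_names in os.walk(start_directory):
        for file_name in file_names:
            file_path = os.path.join(directory_path, file_name)
            info = os.lstat(file_path)
            # Skip symlinks, sockets, devices and other non-regular files.
            if not stat.S_ISREG(info.st_mode):
                continue
            with open(file_path, 'rb') as f:
                md5sum = md5_for_file(f)
            yield (md5sum, info.st_dev, info.st_ino, info.st_nlink,
                   info.st_size, file_path)


def main(start_directory, sep='\0'):
    """Print one sep-delimited line per regular file under the directory."""
    for record in scan_tree(start_directory):
        print(sep.join(str(field) for field in record))


if __name__ == '__main__':
    # Runs only when executed as a script, not when imported as a module.
    if len(sys.argv) > 1:
        main(sys.argv[1])
    else:
        sys.stderr.write('usage: %s DIRECTORY\n' % sys.argv[0])
```

Another script can then do "for record in scan_tree(path):" and decide for
itself what to do with each tuple.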


--047d7bae483664bc3505035b6589--