program to generate data helpful in finding duplicate large files

Started by	David Alban <extasia@extasia.org>
First post	2014-09-18 11:11 -0700
Last post	2014-09-20 03:36 +1000
Articles	9 — 4 participants

  program to generate data helpful in finding duplicate large files David Alban <extasia@extasia.org> - 2014-09-18 11:11 -0700
    Re: program to generate data helpful in finding duplicate large files Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-19 15:45 +1000
      Re: program to generate data helpful in finding duplicate large files Chris Angelico <rosuav@gmail.com> - 2014-09-19 16:45 +1000
        Re: program to generate data helpful in finding duplicate large files Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-19 21:04 +1000
          Re: program to generate data helpful in finding duplicate large files Chris Angelico <rosuav@gmail.com> - 2014-09-19 21:36 +1000
            Re: program to generate data helpful in finding duplicate large files Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-20 09:33 +1000
              Re: program to generate data helpful in finding duplicate large files Chris Angelico <rosuav@gmail.com> - 2014-09-20 14:47 +1000
      Re: program to generate data helpful in finding duplicate large files Ian Kelly <ian.g.kelly@gmail.com> - 2014-09-19 11:20 -0600
      Re: program to generate data helpful in finding duplicate large files Chris Angelico <rosuav@gmail.com> - 2014-09-20 03:36 +1000

#78031 — program to generate data helpful in finding duplicate large files

From	David Alban <extasia@extasia.org>
Date	2014-09-18 11:11 -0700
Subject	program to generate data helpful in finding duplicate large files
Message-ID	<mailman.14114.1411063879.18130.python-list@python.org>

greetings,

i'm a long time perl programmer who is learning python.  i'd be interested
in any comments you might have on my code below.  feel free to respond
privately if you prefer.  i'd like to know if i'm on the right track.  the
program works, and does what i want it to do.  is there a different way a
seasoned python programmer would have done things?  i would like to learn
the culture as well as the language.  am i missing anything?  i know i'm
not doing error checking below.  i suppose comments would help, too.

i wanted a program to scan a tree and for each regular file, print a line
of text to stdout with information about the file.  this will be data for
another program i want to write which finds sets of duplicate files larger
than a parameter size.  that is, using output from this program, the sets
of files i want to find are on the same filesystem on the same host
(obviously, but i include hostname in the data to be sure), and must have
the same md5 sum, but different inode numbers.
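
A minimal sketch of how the second program described above might consume this data. It assumes one record per line with seven NUL-separated fields in the order host, md5, dev, ino, nlink, size, path (the order used by the print statement in the code below). The function name `find_duplicate_sets` and the `min_size` parameter are made up for illustration, and paths containing newlines would break the one-record-per-line assumption.

```python
from collections import defaultdict


def find_duplicate_sets(lines, min_size=0):
    """Return lists of paths that look like duplicates of each other.

    Assumed record layout: host, md5, dev, ino, nlink, size, path,
    separated by NUL characters, one record per line.
    """
    groups = defaultdict(dict)  # (host, dev, md5) -> {ino: first path seen}
    for line in lines:
        fields = line.rstrip("\n").split("\0", 6)
        host, md5, dev, ino, nlink, size, path = fields
        if int(size) <= min_size:
            continue  # only files larger than the parameter size
        # Hard links share an inode, so keep a single path per inode.
        groups[(host, dev, md5)].setdefault(ino, path)
    # A duplicate set needs the same checksum under two or more inodes.
    return [sorted(by_ino.values())
            for by_ino in groups.values() if len(by_ino) > 1]
```

Grouping on (host, dev, md5) keeps the comparison on one filesystem of one host; a matching md5 sum is a strong hint, not proof, so a byte-for-byte comparison before deleting anything would be prudent.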

the output of the code below is easier for a human to read when paged
through 'less', which on my mac renders the ascii nuls as "^@" in reverse
video.

thanks,
david


*usage: dupscan [-h] [--start-directory START_DIRECTORY]*

*scan files in a tree and print a line of information about each regular
file*

*optional arguments:*
*  -h, --help            show this help message and exit*
*  --start-directory START_DIRECTORY, -d START_DIRECTORY*
*                        specifies the root of the filesystem tree to be*
*                        processed*




*#!/usr/bin/python*

*import argparse*
*import hashlib*
*import os*
*import re*
*import socket*
*import sys*

*from stat import **

*ascii_nul = chr(0)*

*     # from:
http://stackoverflow.com/questions/1131220/get-md5-hash-of-big-files-in-python*
*     # except that i use hexdigest() rather than digest()*
*def md5_for_file(f, block_size=2**20):*
*  md5 = hashlib.md5()*
*  while True:*
*    data = f.read(block_size)*
*    if not data:*
*      break*
*    md5.update(data)*
*  return md5.hexdigest()*

*thishost = socket.gethostname()*

*parser = argparse.ArgumentParser(description='scan files in a tree and
print a line of information about each regular file')*
*parser.add_argument('--start-directory', '-d', default='.',
help='specifies the root of the filesystem tree to be processed')*
*args = parser.parse_args()*

*start_directory = re.sub( '/+$', '', args.start_directory )*

*for directory_path, directory_names, file_names in os.walk(
start_directory ):*
*  for file_name in file_names:*
*    file_path = "%s/%s" % ( directory_path, file_name )*

*    lstat_info = os.lstat( file_path )*

*    mode = lstat_info.st_mode*

*    if not S_ISREG( mode ) or S_ISLNK( mode ):*
*      continue*

*    f = open( file_path, 'r' )*
*    md5sum = md5_for_file( f )*

*    dev   = lstat_info.st_dev*
*    ino   = lstat_info.st_ino*
*    nlink = lstat_info.st_nlink*
*    size  = lstat_info.st_size*

*    sep = ascii_nul*

*    print "%s%c%s%c%d%c%d%c%d%c%d%c%s" % ( thishost, sep, md5sum, sep,
dev, sep, ino, sep, nlink, sep, size, sep, file_path )*

*exit( 0 )*



-- 
Our decisions are the most important things in our lives.
***
Live in a world of your own, but always welcome visitors.

#78060

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2014-09-19 15:45 +1000
Message-ID	<541bc310$0$29975$c3e8da3$5496439d@news.astraweb.com>
In reply to	#78031

David Alban wrote:

> *#!/usr/bin/python*
> 
> *import argparse*
> *import hashlib*
> *import os*
> *import re*
> *import socket*
> *import sys*

Um, how did you end up with leading and trailing asterisks? That's going to
stop your code from running.

> *from stat import **

"import *" is slightly discouraged. It's not that it's bad, per se, it's
mostly designed for use at the interactive interpreter, and it can lead to
a few annoyances if you don't know what you are doing. So be careful of
using it when you don't need to.

[...]
> *start_directory = re.sub( '/+$', '', args.start_directory )*

I don't think you need to do that, and you certainly don't need to pull out
the nuclear-powered bulldozer of regular expressions just to crack the
peanut of stripping trailing slashes from a string.

start_directory = args.start_directory.rstrip("/")

ought to do the job.
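
A quick sketch comparing the two for ordinary paths; the helper name `strip_trailing_slashes` is only for illustration. Note that both approaches turn a bare "/" into an empty string, so the filesystem root would need special handling either way.

```python
import re


def strip_trailing_slashes(path):
    """Remove any run of trailing slashes without a regular expression."""
    return path.rstrip("/")


# For ordinary paths the result matches the original re.sub() call.
for p in ["/tmp/photos///", "/tmp/photos", ".", "/"]:
    assert strip_trailing_slashes(p) == re.sub("/+$", "", p)
```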

[...]
> *    f = open( file_path, 'r' )*
> *    md5sum = md5_for_file( f )*

You never close the file, which means Python will close it for you, when it
is good and ready. In the case of some Python implementations, that might
not be until the interpreter shuts down, which could mean that you run out
of file handles!

Better is to explicitly close the file:

    f = open(file_path, 'r')
    md5sum = md5_for_file(f)
    f.close()

or if you are using a recent version of Python and don't need to support
Python 2.4 or older:

    with open(file_path, 'r') as f:
        md5sum = md5_for_file(f)

(The "with" block automatically closes the file when you exit the indented
block.)
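
Putting that together with the hashing loop, a sketch might look like the following. The name `md5_for_path` is made up here, and opening the file in binary mode ("rb") is a change from the original 'r': on Python 3, `md5.update()` needs bytes, and binary mode also avoids newline translation on platforms that do it.

```python
import hashlib


def md5_for_path(file_path, block_size=2**20):
    """Return the hex MD5 digest of a file, read in fixed-size blocks."""
    md5 = hashlib.md5()
    # The with block closes the file even if read() raises.
    with open(file_path, "rb") as f:
        # iter() with a sentinel calls f.read() until it returns b"".
        for block in iter(lambda: f.read(block_size), b""):
            md5.update(block)
    return md5.hexdigest()
```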

> *    sep = ascii_nul*

Seems a strange choice of a delimiter.

> *    print "%s%c%s%c%d%c%d%c%d%c%d%c%s" % ( thishost, sep, md5sum, sep,
> dev, sep, ino, sep, nlink, sep, size, sep, file_path )*

Arggh, my brain! *wink*

Try this instead:

    s = '\0'.join([thishost, md5sum, dev, ino, nlink, size, file_path])
    print s

> *exit( 0 )*

No need to explicitly call sys.exit (just exit won't work) at the end of
your code. If you exit by falling off the end of your program, Python uses
an exit code of zero. Normally, you should only call sys.exit to:

- exit with a non-zero code;

- exit early.
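
As a sketch of both cases at once, a check like this exits early with a non-zero code when the start directory is unusable. The helper name `require_directory` and the choice of exit status 2 are assumptions for illustration, not something from the original program.

```python
import os
import sys


def require_directory(path):
    """Return path if it is a directory; otherwise exit with status 2."""
    if not os.path.isdir(path):
        sys.stderr.write("not a directory: %s\n" % path)
        sys.exit(2)  # exit early, with a non-zero code
    return path
```

`sys.exit()` works by raising SystemExit, so calling code can still catch it if it needs to clean up first.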

-- 
Steven

#78061

From	Chris Angelico <rosuav@gmail.com>
Date	2014-09-19 16:45 +1000
Message-ID	<mailman.14135.1411109125.18130.python-list@python.org>
In reply to	#78060

On Fri, Sep 19, 2014 at 3:45 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> David Alban wrote:
>> *import sys*
>
> Um, how did you end up with leading and trailing asterisks? That's going to
> stop your code from running.

They're not part of the code, they're part of the mangling of the
formatting. So this isn't a code issue, it's a mailing list /
newsgroup one. David, if you set your mail/news client to send plain
text only (not rich text or HTML or formatted or anything like that),
you'll avoid these problems.

>> *    sep = ascii_nul*
>
> Seems a strange choice of a delimiter.

But one that he explained in his body :)

>> *    print "%s%c%s%c%d%c%d%c%d%c%d%c%s" % ( thishost, sep, md5sum, sep,
>> dev, sep, ino, sep, nlink, sep, size, sep, file_path )*
>
> Arggh, my brain! *wink*
>
> Try this instead:
>
>     s = '\0'.join([thishost, md5sum, dev, ino, nlink, size, file_path])
>     print s

That won't work on its own; several of the values are integers. So
either they need to be str()'d or something in the output system needs
to know to convert them to strings. I'm inclined to the latter option,
which simply means importing print_function from __future__ and
setting sep=chr(0).
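
A sketch of that approach, written so it also runs on Python 3. On Python 2, `from __future__ import print_function` has to be the first statement in the file. The helper name `print_record` and its `out` parameter are made up for illustration.

```python
# On Python 2, put "from __future__ import print_function" at the very top.
import sys


def print_record(fields, out=sys.stdout):
    """Write one record; print() converts each field with str() itself."""
    print(*fields, sep=chr(0), file=out)
```

Called as `print_record([thishost, md5sum, dev, ino, nlink, size, file_path])`, this leaves the integers alone and lets the print function do the conversion.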

>> *exit( 0 )*
>
> No need to explicitly call sys.exit (just exit won't work) at the end of
> your code.

Hmm, you sure exit won't work? I normally use sys.exit to set return
values (though as you say, it's unnecessary at the end of the
program), but I tested it (Python 2.7.3 on Debian) and it does seem to
be functional. Do you know what provides it?

ChrisA

#78068

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2014-09-19 21:04 +1000
Message-ID	<541c0dc9$0$29992$c3e8da3$5496439d@news.astraweb.com>
In reply to	#78061

Chris Angelico wrote:

> On Fri, Sep 19, 2014 at 3:45 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:

>>     s = '\0'.join([thishost, md5sum, dev, ino, nlink, size, file_path])
>>     print s
> 
> That won't work on its own; several of the values are integers. 

Ah, so they are!

> So 
> either they need to be str()'d or something in the output system needs
> to know to convert them to strings. I'm inclined to the latter option,
> which simply means importing print_function from __future__ and
> setting sep=chr(0).
> 
>>> *exit( 0 )*
>>
>> No need to explicitly call sys.exit (just exit won't work) at the end of
>> your code.
> 
> Hmm, you sure exit won't work? 

In the interactive interpreter, exit is bound to a special helper object:

py> exit
Use exit() or Ctrl-D (i.e. EOF) to exit

Otherwise, you'll get NameError.

> I normally use sys.exit 

Like I said, sys.exit is fine :-)

Of course you can "from sys import exit", or "exit = sys.exit", but the OP's
code didn't include either of those.


-- 
Steven

#78071

From	Chris Angelico <rosuav@gmail.com>
Date	2014-09-19 21:36 +1000
Message-ID	<mailman.14142.1411127046.18130.python-list@python.org>
In reply to	#78068

On Fri, Sep 19, 2014 at 9:04 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
>> Hmm, you sure exit won't work?
>
> In the interactive interpreter, exit is bound to a special helper object:
>
> py> exit
> Use exit() or Ctrl-D (i.e. EOF) to exit
>
> Otherwise, you'll get NameError.

It's not the interactive interpreter alone. I tried it in a script
before posting.

Python 2.7.3 on Linux, 2.6.8 on Linux, 3.5.0ish Linux, 2.7.8 Windows,
2.6.5 Windows, 3.3.0 Windows, and 3.4.0 Windows, all work perfectly,
with (AFAIK) default settings. The only one that I tried that doesn't
is:

C:\>type canIexit.py
import sys
print(sys.version)
print(exit)
print(type(exit))
exit(1)
C:\>python canIexit.py
2.4.5 (#1, Jul 22 2011, 02:01:04)
[GCC 4.1.1]
Use Ctrl-Z plus Return to exit.
<type 'str'>
Traceback (most recent call last):
  File "canIexit.py", line 5, in ?
    exit(1)
TypeError: 'str' object is not callable

I've no idea how far back to go before it comes up with a NameError.
However, this is provided (as is made clear by the type lines) by
site.py, and so can be disabled. But with default settings, it is
possible to use exit(1) to set your return value.

ChrisA

#78090

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2014-09-20 09:33 +1000
Message-ID	<541cbd3c$0$29988$c3e8da3$5496439d@news.astraweb.com>
In reply to	#78071

Chris Angelico wrote:

> On Fri, Sep 19, 2014 at 9:04 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>>> Hmm, you sure exit won't work?
>>
>> In the interactive interpreter, exit is bound to a special helper object:
>>
>> py> exit
>> Use exit() or Ctrl-D (i.e. EOF) to exit
>>
>> Otherwise, you'll get NameError.
> 
> It's not the interactive interpreter alone. I tried it in a script
> before posting.

Well I'll be mogadored.

Serves me right for not testing before posting.

[...]
> I've no idea how far back to go before it comes up with a NameError.
> However, this is provided (as is made clear by the type lines) by
> site.py, and so can be disabled. But with default settings, it is
> possible to use exit(1) to set your return value.

It's a bad idea to rely on features added to site.py, since they aren't
necessarily going to be available at all sites or in all implementations:

steve@orac:/home/steve$ ipy
IronPython 2.6 Beta 2 DEBUG (2.6.0.20) on .NET 2.0.50727.1433
Type "help", "copyright", "credits" or "license" for more information.
>>> exit(2)
steve@orac:/home/steve$

Bugger me, I'm going home!



-- 
Steven

#78095

From	Chris Angelico <rosuav@gmail.com>
Date	2014-09-20 14:47 +1000
Message-ID	<mailman.14159.1411188477.18130.python-list@python.org>
In reply to	#78090

On Sat, Sep 20, 2014 at 9:33 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> It's a bad idea to rely on features added to site.py, since they aren't
> necessarily going to be available at all sites or in all implementations:
>
> steve@orac:/home/steve$ ipy
> IronPython 2.6 Beta 2 DEBUG (2.6.0.20) on .NET 2.0.50727.1433
> Type "help", "copyright", "credits" or "license" for more information.
> >>> exit(2)
> steve@orac:/home/steve$
>
> Bugger me, I'm going home!

This is the real reason for not relying on site.py:

rosuav@sikorsky:~$ python -S
Python 2.7.3 (default, Mar 13 2014, 11:03:55)
[GCC 4.7.2] on linux2
>>> exit
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'exit' is not defined
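
To try the same thing from a script, a small sketch like this runs a one-line program in a fresh interpreter, with or without -S. The helper name `run_snippet` is made up for illustration, and it assumes `sys.executable` points at a CPython-style interpreter that accepts the -S and -c options.

```python
import subprocess
import sys


def run_snippet(code, use_site=True):
    """Run code in a new interpreter; return (exit status, stderr text)."""
    cmd = [sys.executable]
    if not use_site:
        cmd.append("-S")  # skip "import site", so no exit/quit helpers
    cmd += ["-c", code]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    _, err = proc.communicate()
    return proc.returncode, err.decode("utf-8", "replace")
```

With `use_site=False`, `run_snippet("exit(3)")` should report a NameError on stderr, while `run_snippet("import sys; sys.exit(3)", use_site=False)` should return status 3, which is the argument for always spelling it sys.exit in scripts.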

ChrisA

#78083

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2014-09-19 11:20 -0600
Message-ID	<mailman.14150.1411147263.18130.python-list@python.org>
In reply to	#78060

On Fri, Sep 19, 2014 at 12:45 AM, Chris Angelico <rosuav@gmail.com> wrote:
> On Fri, Sep 19, 2014 at 3:45 PM, Steven D'Aprano
>>     s = '\0'.join([thishost, md5sum, dev, ino, nlink, size, file_path])
>>     print s
>
> That won't work on its own; several of the values are integers. So
> either they need to be str()'d or something in the output system needs
> to know to convert them to strings. I'm inclined to the latter option,
> which simply means importing print_function from __future__ and
> setting sep=chr(0).

Personally, I lean toward converting them with map in this case:

    s = '\0'.join(map(str, [thishost, md5sum, dev, ino, nlink, size,
file_path]))
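
Spelled out as a runnable sketch, with made-up sample values standing in for the real host name, checksum and stat fields (the helper name `join_record` is also just for illustration):

```python
def join_record(fields, sep="\0"):
    """Join mixed str/int fields into one NUL-separated string."""
    return sep.join(map(str, fields))


# Sample values only; a real record would come from lstat() and md5.
fields = ["myhost", "abc123", 2049, 131, 1, 0, "./empty"]

try:
    "\0".join(fields)  # fails: join() accepts only strings
except TypeError:
    pass

line = join_record(fields)
# Splitting on NUL recovers every field, as text.
assert line.split("\0") == [str(f) for f in fields]
```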

#78084

From	Chris Angelico <rosuav@gmail.com>
Date	2014-09-20 03:36 +1000
Message-ID	<mailman.14151.1411148195.18130.python-list@python.org>
In reply to	#78060

On Sat, Sep 20, 2014 at 3:20 AM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> On Fri, Sep 19, 2014 at 12:45 AM, Chris Angelico <rosuav@gmail.com> wrote:
>> On Fri, Sep 19, 2014 at 3:45 PM, Steven D'Aprano
>>>     s = '\0'.join([thishost, md5sum, dev, ino, nlink, size, file_path])
>>>     print s
>>
>> That won't work on its own; several of the values are integers. So
>> either they need to be str()'d or something in the output system needs
>> to know to convert them to strings. I'm inclined to the latter option,
>> which simply means importing print_function from __future__ and
>> setting sep=chr(0).
>
> Personally, I lean toward converting them with map in this case:
>
>     s = '\0'.join(map(str, [thishost, md5sum, dev, ino, nlink, size,
> file_path]))

There are many ways to do it. I'm not seeing this as particularly less
ugly than the original formatting code, tbh, but it does work.

ChrisA
