| Started by | David Alban <extasia@extasia.org> |
|---|---|
| First post | 2014-09-18 11:11 -0700 |
| Last post | 2014-09-20 03:36 +1000 |
| Articles | 9 — 4 participants |
program to generate data helpful in finding duplicate large files David Alban <extasia@extasia.org> - 2014-09-18 11:11 -0700
Re: program to generate data helpful in finding duplicate large files Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-19 15:45 +1000
Re: program to generate data helpful in finding duplicate large files Chris Angelico <rosuav@gmail.com> - 2014-09-19 16:45 +1000
Re: program to generate data helpful in finding duplicate large files Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-19 21:04 +1000
Re: program to generate data helpful in finding duplicate large files Chris Angelico <rosuav@gmail.com> - 2014-09-19 21:36 +1000
Re: program to generate data helpful in finding duplicate large files Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-20 09:33 +1000
Re: program to generate data helpful in finding duplicate large files Chris Angelico <rosuav@gmail.com> - 2014-09-20 14:47 +1000
Re: program to generate data helpful in finding duplicate large files Ian Kelly <ian.g.kelly@gmail.com> - 2014-09-19 11:20 -0600
Re: program to generate data helpful in finding duplicate large files Chris Angelico <rosuav@gmail.com> - 2014-09-20 03:36 +1000
| From | David Alban <extasia@extasia.org> |
|---|---|
| Date | 2014-09-18 11:11 -0700 |
| Subject | program to generate data helpful in finding duplicate large files |
| Message-ID | <mailman.14114.1411063879.18130.python-list@python.org> |
greetings,
i'm a long time perl programmer who is learning python. i'd be interested
in any comments you might have on my code below. feel free to respond
privately if you prefer. i'd like to know if i'm on the right track. the
program works, and does what i want it to do. is there a different way a
seasoned python programmer would have done things? i would like to learn
the culture as well as the language. am i missing anything? i know i'm
not doing error checking below. i suppose comments would help, too.
i wanted a program to scan a tree and for each regular file, print a line
of text to stdout with information about the file. this will be data for
another program i want to write which finds sets of duplicate files larger
than a parameter size. that is, using output from this program, the sets
of files i want to find are on the same filesystem on the same host
(obviously, but i include hostname in the data to be sure), and must have
the same md5 sum, but different inode numbers.
the output of the code below is easier for a human to read when paged
through 'less', which on my mac renders the ascii nuls as "^@" in reverse
video.
thanks,
david
*usage: dupscan [-h] [--start-directory START_DIRECTORY]*
*scan files in a tree and print a line of information about each regular
file*
*optional arguments:*
* -h, --help show this help message and exit*
* --start-directory START_DIRECTORY, -d START_DIRECTORY*
* specifies the root of the filesystem tree to be*
* processed*
*#!/usr/bin/python*
*import argparse*
*import hashlib*
*import os*
*import re*
*import socket*
*import sys*
*from stat import **
*ascii_nul = chr(0)*
* # from:
http://stackoverflow.com/questions/1131220/get-md5-hash-of-big-files-in-python
<http://stackoverflow.com/questions/1131220/get-md5-hash-of-big-files-in-python>*
* # except that i use hexdigest() rather than digest()*
*def md5_for_file(f, block_size=2**20):*
* md5 = hashlib.md5()*
* while True:*
* data = f.read(block_size)*
* if not data:*
* break*
* md5.update(data)*
* return md5.hexdigest()*
*thishost = socket.gethostname()*
*parser = argparse.ArgumentParser(description='scan files in a tree and
print a line of information about each regular file')*
*parser.add_argument('--start-directory', '-d', default='.',
help='specifies the root of the filesystem tree to be processed')*
*args = parser.parse_args()*
*start_directory = re.sub( '/+$', '', args.start_directory )*
*for directory_path, directory_names, file_names in os.walk(
start_directory ):*
* for file_name in file_names:*
* file_path = "%s/%s" % ( directory_path, file_name )*
* lstat_info = os.lstat( file_path )*
* mode = lstat_info.st_mode*
* if not S_ISREG( mode ) or S_ISLNK( mode ):*
* continue*
* f = open( file_path, 'r' )*
* md5sum = md5_for_file( f )*
* dev = lstat_info.st_dev*
* ino = lstat_info.st_ino*
* nlink = lstat_info.st_nlink*
* size = lstat_info.st_size*
* sep = ascii_nul*
* print "%s%c%s%c%d%c%d%c%d%c%d%c%s" % ( thishost, sep, md5sum, sep,
dev, sep, ino, sep, nlink, sep, size, sep, file_path )*
*exit( 0 )*
--
Our decisions are the most important things in our lives.
***
Live in a world of your own, but always welcome visitors.
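The follow-up program described in the post (group records by host, device and MD5 sum, then keep the groups that contain more than one distinct inode) might look roughly like the sketch below. It is written for Python 3. The names `parse_records` and `find_duplicate_sets` and the `min_size` parameter are hypothetical, since the post does not show this second program. The sketch assumes one record per line, with NUL-separated fields in the order the scanner prints them, and it would be confused by file names that contain newlines.

```python
from collections import defaultdict

# Field order matches the scanner's print statement.
FIELDS = ("host", "md5", "dev", "ino", "nlink", "size", "path")

def parse_records(lines):
    """Yield one dict per NUL-separated line of scanner output."""
    for line in lines:
        parts = line.rstrip("\n").split("\0")
        if len(parts) != len(FIELDS):
            continue  # skip blank or malformed lines
        record = dict(zip(FIELDS, parts))
        for key in ("dev", "ino", "nlink", "size"):
            record[key] = int(record[key])
        yield record

def find_duplicate_sets(records, min_size=0):
    """Map (host, dev, md5) to {inode: [paths]} for files larger than min_size.

    Only groups that span at least two different inodes are returned.
    """
    groups = defaultdict(lambda: defaultdict(list))
    for rec in records:
        if rec["size"] <= min_size:
            continue
        key = (rec["host"], rec["dev"], rec["md5"])
        # Hard links share an inode, so they collect under one entry.
        groups[key][rec["ino"]].append(rec["path"])
    return {key: dict(by_ino)
            for key, by_ino in groups.items() if len(by_ino) > 1}
```

With the default `min_size=0`, empty files are skipped, which is probably desirable because every empty file has the same MD5 sum.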
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-09-19 15:45 +1000 |
| Message-ID | <541bc310$0$29975$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #78031 |
David Alban wrote:
> *#!/usr/bin/python*
>
> *import argparse*
> *import hashlib*
> *import os*
> *import re*
> *import socket*
> *import sys*
Um, how did you end up with leading and trailing asterisks? That's going to
stop your code from running.
> *from stat import **
"import *" is slightly discouraged. It's not that it's bad, per se, it's
mostly designed for use at the interactive interpreter, and it can lead to
a few annoyances if you don't know what you are doing. So be careful of
using it when you don't need to.
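One way to avoid the star import is to name only what the script uses. The sketch below is Python 3, and the function name `is_regular_file` is made up for illustration. It also relies on the fact that `os.lstat` does not follow symbolic links, so a symlink never passes `S_ISREG` and the separate `S_ISLNK` test in the original becomes unnecessary.

```python
import os
from stat import S_ISREG

def is_regular_file(path):
    """True for regular files; False for symlinks, directories, and so on."""
    # lstat() describes the link itself rather than the file it points to.
    mode = os.lstat(path).st_mode
    return S_ISREG(mode)
```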
[...]
> *start_directory = re.sub( '/+$', '', args.start_directory )*
I don't think you need to do that, and you certainly don't need to pull out
the nuclear-powered bulldozer of regular expressions just to crack the
peanut of stripping trailing slashes from a string.
start_directory = args.start_directory.rstrip("/")
ought to do the job.
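A quick check that the two spellings agree, as a small Python 3 sketch over made-up sample paths. One edge case worth knowing: both reduce a bare "/" to the empty string, so a script that may be pointed at the filesystem root would need to handle that separately.

```python
import re

samples = ["/var/tmp", "/var/tmp/", "/var/tmp///", ".", "/"]

for path in samples:
    with_regex = re.sub("/+$", "", path)
    with_rstrip = path.rstrip("/")
    # On these samples both approaches strip every trailing slash and nothing else.
    assert with_regex == with_rstrip
    print(repr(path), "->", repr(with_rstrip))
```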
[...]
> * f = open( file_path, 'r' )*
> * md5sum = md5_for_file( f )*
You never close the file, which means Python will close it for you, when it
is good and ready. In the case of some Python implementations, that might
not be until the interpreter shuts down, which could mean that you run out
of file handles!
Better is to explicitly close the file:
f = open(file_path, 'r')
md5sum = md5_for_file(f)
f.close()
or if you are using a recent version of Python and don't need to support
Python 2.4 or older:
with open(file_path, 'r') as f:
    md5sum = md5_for_file(f)
(The "with" block automatically closes the file when you exit the indented
block.)
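Combined with the hashing helper from the original post, the "with" form might look like the following Python 3 sketch. The wrapper name `md5_of_path` is invented here. Opening the file in binary mode (`"rb"`) is a deliberate change from the original `'r'`, because in Python 3 the hash object's `update()` requires bytes.

```python
import hashlib

def md5_for_file(f, block_size=2**20):
    """Hash an open binary file in chunks so large files need little memory."""
    md5 = hashlib.md5()
    while True:
        data = f.read(block_size)
        if not data:
            break
        md5.update(data)
    return md5.hexdigest()

def md5_of_path(file_path):
    # The file is closed when the block exits, even if read() raises.
    with open(file_path, "rb") as f:
        return md5_for_file(f)
```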
> * sep = ascii_nul*
Seems a strange choice of a delimiter.
> * print "%s%c%s%c%d%c%d%c%d%c%d%c%s" % ( thishost, sep, md5sum, sep,
> dev, sep, ino, sep, nlink, sep, size, sep, file_path )*
Arggh, my brain! *wink*
Try this instead:
s = '\0'.join([thishost, md5sum, dev, ino, nlink, size, file_path])
print s
> *exit( 0 )*
No need to explicitly call sys.exit (just exit won't work) at the end of
your code. If you exit by falling off the end of your program, Python uses
an exit code of zero. Normally, you should only call sys.exit to:
- exit with a non-zero code;
- exit early.
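A sketch of those two cases in Python 3. The function `main` and its `start_directory` argument are hypothetical and not taken from the original script. A bad argument triggers an early exit with a non-zero status, and the success path has no explicit exit at all.

```python
import os
import sys

def main(start_directory):
    if not os.path.isdir(start_directory):
        # Passing a string to sys.exit() prints it to stderr and
        # makes the process exit with status 1.
        sys.exit("not a directory: %s" % start_directory)
    for name in sorted(os.listdir(start_directory)):
        print(name)
    # No sys.exit(0) here: when the script simply runs to its end,
    # the interpreter exits with status 0.
```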
--
Steven
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-09-19 16:45 +1000 |
| Message-ID | <mailman.14135.1411109125.18130.python-list@python.org> |
| In reply to | #78060 |
On Fri, Sep 19, 2014 at 3:45 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> David Alban wrote:
>> *import sys*
>
> Um, how did you end up with leading and trailing asterisks? That's going to
> stop your code from running.
They're not part of the code, they're part of the mangling of the
formatting. So this isn't a code issue, it's a mailing list /
newsgroup one. David, if you set your mail/news client to send plain
text only (not rich text or HTML or formatted or anything like that),
you'll avoid these problems.
>> * sep = ascii_nul*
>
> Seems a strange choice of a delimiter.
But one that he explained in his body :)
>> * print "%s%c%s%c%d%c%d%c%d%c%d%c%s" % ( thishost, sep, md5sum, sep,
>> dev, sep, ino, sep, nlink, sep, size, sep, file_path )*
>
> Arggh, my brain! *wink*
>
> Try this instead:
>
> s = '\0'.join([thishost, md5sum, dev, ino, nlink, size, file_path])
> print s
That won't work on its own; several of the values are integers. So
either they need to be str()'d or something in the output system needs
to know to convert them to strings. I'm inclined to the latter option,
which simply means importing print_function from __future__ and
setting sep=chr(0).
>> *exit( 0 )*
>
> No need to explicitly call sys.exit (just exit won't work) at the end of
> your code.
Hmm, you sure exit won't work? I normally use sys.exit to set return
values (though as you say, it's unnecessary at the end of the
program), but I tested it (Python 2.7.3 on Debian) and it does seem to
be functional. Do you know what provides it?
ChrisA
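The print-based approach described in this message could be sketched as follows. The helper name `print_record` and the field values are invented sample data. The code runs as-is on Python 3, and the comment notes the `__future__` import that Python 2.6 or 2.7 would need.

```python
import sys

# On Python 2.6 or 2.7 this file would have to start with:
#     from __future__ import print_function

def print_record(fields, out=sys.stdout):
    # print() applies str() to each argument before joining with sep,
    # so the integer fields need no explicit conversion.
    print(*fields, sep=chr(0), file=out)

print_record(["examplehost", "d41d8cd98f00b204e9800998ecf8427e",
              2049, 131077, 1, 0, "./empty.txt"])
```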
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-09-19 21:04 +1000 |
| Message-ID | <541c0dc9$0$29992$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #78061 |
Chris Angelico wrote:
> On Fri, Sep 19, 2014 at 3:45 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> s = '\0'.join([thishost, md5sum, dev, ino, nlink, size, file_path])
>> print s
>
> That won't work on its own; several of the values are integers.
Ah, so they are!
> So
> either they need to be str()'d or something in the output system needs
> to know to convert them to strings. I'm inclined to the latter option,
> which simply means importing print_function from __future__ and
> setting sep=chr(0).
>
>>> *exit( 0 )*
>>
>> No need to explicitly call sys.exit (just exit won't work) at the end of
>> your code.
>
> Hmm, you sure exit won't work?
In the interactive interpreter, exit is bound to a special helper object:
py> exit
Use exit() or Ctrl-D (i.e. EOF) to exit
Otherwise, you'll get NameError.
> I normally use sys.exit
Like I said, sys.exit is fine :-)
Of course you can "from sys import exit", or "exit = sys.exit", but the OP's
code didn't include either of those.
--
Steven
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-09-19 21:36 +1000 |
| Message-ID | <mailman.14142.1411127046.18130.python-list@python.org> |
| In reply to | #78068 |
On Fri, Sep 19, 2014 at 9:04 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
>> Hmm, you sure exit won't work?
>
> In the interactive interpreter, exit is bound to a special helper object:
>
> py> exit
> Use exit() or Ctrl-D (i.e. EOF) to exit
>
> Otherwise, you'll get NameError.
It's not the interactive interpreter alone. I tried it in a script
before posting.
Python 2.7.3 on Linux, 2.6.8 on Linux, 3.5.0ish Linux, 2.7.8 Windows,
2.6.5 Windows, 3.3.0 Windows, and 3.4.0 Windows, all work perfectly,
with (AFAIK) default settings. The only one that I tried that doesn't
is:
C:\>type canIexit.py
import sys
print(sys.version)
print(exit)
print(type(exit))
exit(1)
C:\>python canIexit.py
2.4.5 (#1, Jul 22 2011, 02:01:04)
[GCC 4.1.1]
Use Ctrl-Z plus Return to exit.
<type 'str'>
Traceback (most recent call last):
File "canIexit.py", line 5, in ?
exit(1)
TypeError: 'str' object is not callable
I've no idea how far back to go before it comes up with a NameError.
However, this is provided (as is made clear by the type lines) by
site.py, and so can be disabled. But with default settings, it is
possible to use exit(1) to set your return value.
ChrisA
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-09-20 09:33 +1000 |
| Message-ID | <541cbd3c$0$29988$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #78071 |
Chris Angelico wrote:
> On Fri, Sep 19, 2014 at 9:04 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>>> Hmm, you sure exit won't work?
>>
>> In the interactive interpreter, exit is bound to a special helper object:
>>
>> py> exit
>> Use exit() or Ctrl-D (i.e. EOF) to exit
>>
>> Otherwise, you'll get NameError.
>
> It's not the interactive interpreter alone. I tried it in a script
> before posting.
Well I'll be mogadored. Serves me right for not testing before posting.
[...]
> I've no idea how far back to go before it comes up with a NameError.
> However, this is provided (as is made clear by the type lines) by
> site.py, and so can be disabled. But with default settings, it is
> possible to use exit(1) to set your return value.
It's a bad idea to rely on features added to site.py, since they aren't
necessarily going to be available at all sites or in all implementations:
steve@orac:/home/steve$ ipy
IronPython 2.6 Beta 2 DEBUG (2.6.0.20) on .NET 2.0.50727.1433
Type "help", "copyright", "credits" or "license" for more information.
>>> exit(2)
steve@orac:/home/steve$
Bugger me, I'm going home!
--
Steven
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-09-20 14:47 +1000 |
| Message-ID | <mailman.14159.1411188477.18130.python-list@python.org> |
| In reply to | #78090 |
On Sat, Sep 20, 2014 at 9:33 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> It's a bad idea to rely on features added to site.py, since they aren't
> necessarily going to be available at all sites or in all implementations:
>
> steve@orac:/home/steve$ ipy
> IronPython 2.6 Beta 2 DEBUG (2.6.0.20) on .NET 2.0.50727.1433
> Type "help", "copyright", "credits" or "license" for more information.
>>>> exit(2)
> steve@orac:/home/steve$
>
> Bugger me, I'm going home!
This is the real reason for not relying on site.py:
rosuav@sikorsky:~$ python -S
Python 2.7.3 (default, Mar 13 2014, 11:03:55)
[GCC 4.7.2] on linux2
>>> exit
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'exit' is not defined
ChrisA
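The same experiment can be scripted. The sketch below (Python 3, standard library only) runs a one-line program that calls the bare `exit` in a child interpreter, once normally and once with `-S`, which skips the import of `site`. The helper name `run_exit_script` is made up for this example, and the results assume a standard CPython whose `site` module has not been customized.

```python
import subprocess
import sys

def run_exit_script(*interpreter_flags):
    """Run `exit(3)` in a child interpreter; return (status, stderr text)."""
    cmd = [sys.executable] + list(interpreter_flags) + ["-c", "exit(3)"]
    proc = subprocess.run(cmd, stderr=subprocess.PIPE, universal_newlines=True)
    return proc.returncode, proc.stderr

# With site imported, exit() exists and sets the exit status.
status_with_site, _ = run_exit_script()
print("with site:   ", status_with_site)

# With -S the name is never defined, so the child dies with a NameError.
status_without_site, err = run_exit_script("-S")
print("without site:", status_without_site)
print(err.strip().splitlines()[-1])
```

Calling `sys.exit` instead works in both cases, which is the practical argument for preferring it in scripts.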
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2014-09-19 11:20 -0600 |
| Message-ID | <mailman.14150.1411147263.18130.python-list@python.org> |
| In reply to | #78060 |
On Fri, Sep 19, 2014 at 12:45 AM, Chris Angelico <rosuav@gmail.com> wrote:
> On Fri, Sep 19, 2014 at 3:45 PM, Steven D'Aprano
>> s = '\0'.join([thishost, md5sum, dev, ino, nlink, size, file_path])
>> print s
>
> That won't work on its own; several of the values are integers. So
> either they need to be str()'d or something in the output system needs
> to know to convert them to strings. I'm inclined to the latter option,
> which simply means importing print_function from __future__ and
> setting sep=chr(0).
Personally, I lean toward converting them with map in this case:
s = '\0'.join(map(str, [thishost, md5sum, dev, ino, nlink, size,
file_path]))
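For comparison, here is the map-based version in runnable form, with invented sample values. Unlike the print-based approach it builds the record as a string first, which is convenient if the line is going somewhere other than stdout.

```python
thishost, md5sum = "examplehost", "d41d8cd98f00b204e9800998ecf8427e"
dev, ino, nlink, size = 2049, 131077, 1, 0
file_path = "./empty.txt"

# str.join() accepts only strings, so map(str, ...) converts the integers first.
s = "\0".join(map(str, [thishost, md5sum, dev, ino, nlink, size, file_path]))

# Splitting on NUL recovers the seven fields, all of them as strings.
fields = s.split("\0")
print(fields)
```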
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-09-20 03:36 +1000 |
| Message-ID | <mailman.14151.1411148195.18130.python-list@python.org> |
| In reply to | #78060 |
On Sat, Sep 20, 2014 at 3:20 AM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> On Fri, Sep 19, 2014 at 12:45 AM, Chris Angelico <rosuav@gmail.com> wrote:
>> On Fri, Sep 19, 2014 at 3:45 PM, Steven D'Aprano
>>> s = '\0'.join([thishost, md5sum, dev, ino, nlink, size, file_path])
>>> print s
>>
>> That won't work on its own; several of the values are integers. So
>> either they need to be str()'d or something in the output system needs
>> to know to convert them to strings. I'm inclined to the latter option,
>> which simply means importing print_function from __future__ and
>> setting sep=chr(0).
>
> Personally, I lean toward converting them with map in this case:
>
> s = '\0'.join(map(str, [thishost, md5sum, dev, ino, nlink, size,
> file_path]))
There are many ways to do it. I'm not seeing this as particularly less
ugly than the original formatting code, tbh, but it does work.
ChrisA