Path: csiph.com!fu-berlin.de!uni-berlin.de!not-for-mail
From: Cem Karan <cfkaran2@gmail.com>
Newsgroups: comp.lang.python
Subject: Re: A sets algorithm
Date: Sun, 7 Feb 2016 20:07:40 -0500
Lines: 55
Message-ID: <mailman.83.1454893663.2317.python-list@python.org>
References: <n98e0f$15lj$1@gioia.aioe.org>
Mime-Version: 1.0 (Mac OS X Mail 6.6 \(1510\))
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: quoted-printable
In-Reply-To: <n98e0f$15lj$1@gioia.aioe.org>
Precedence: list
Xref: csiph.com comp.lang.python:102643


On Feb 7, 2016, at 4:46 PM, Paulo da Silva =
<p_s_d_a_s_i_l_v_a_ns@netcabo.pt> wrote:

> Hello!
>=20
> This may not be a strict python question, but ...
>=20
> Suppose I have already a class MyFile that has an efficient method (or
> operator) to compare two MyFile s for equality.
>=20
> What is the most efficient way to obtain all sets of equal files (of
> course each set must have more than one file - all single files are
> discarded)?
>=20
> Thanks for any suggestions.

If you're after strict equality (every byte in a pair of files is =
identical), then here are a few heuristics that may help you:

1) Test for file length, without reading in the whole file.  You can use =
os.path.getsize() to do this (I hope that this is a constant-time =
operation, but I haven't tested it).  As Oscar Benjamin suggested, you =
can create a defaultdict(list) which will make it possible to gather =
lists of files of equal size.  This should help you gather your =
potentially identical files quickly.

2) Once you have your dictionary from above, you can iterate its values, =
each of which will be a list.  If a list has only one file in it, you =
know its unique, and you don't have to do any more work on it.  If there =
are two files in the list, then you have several different options:
	a) Use Chris Angelico's suggestion and hash each of the files =
(use the standard library's 'hashlib' for this).  Identical files will =
always have identical hashes, but there may be false positives, so =
you'll need to verify that files that have identical hashes are indeed =
identical.
	b) If your files tend to have sections that are very different =
(e.g., the first 32 bytes tend to be different), then you pretend that =
section of the file is its hash.  You can then do the same trick as =
above. (the advantage of this is that you will read in a lot less data =
than if you have to hash the entire file).
	c) You may be able to do something clever by reading portions of =
each file.  That is, use zip() combined with read(1024) to read each of =
the files in sections, while keeping hashes of the files.  Or, maybe =
you'll be able to read portions of them and sort the list as you're =
reading.  In either case, if any files are NOT identical, then you'll be =
able to stop work as soon as you figure this out, rather than having to =
read the entire file at once.

The main purpose of these suggestions is to reduce the amount of reading =
you're doing.  Storage tends to be slow, and any tricks that reduce the =
number of bytes you need to read in will be helpful to you.

Good luck!
Cem Karan=