Path: csiph.com!news.swapon.de!fu-berlin.de!uni-berlin.de!not-for-mail
From: Chris Angelico <rosuav@gmail.com>
Newsgroups: comp.lang.python
Subject: Re: A sets algorithm
Date: Mon, 8 Feb 2016 08:58:28 +1100
Lines: 38
Message-ID: <mailman.79.1454882316.2317.python-list@python.org>
References: <n98e0f$15lj$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
In-Reply-To: <n98e0f$15lj$1@gioia.aioe.org>
Precedence: list
Xref: csiph.com comp.lang.python:102637

On Mon, Feb 8, 2016 at 8:46 AM, Paulo da Silva
<p_s_d_a_s_i_l_v_a_ns@netcabo.pt> wrote:
> Hello!
>
> This may not be a strict python question, but ...
>
> Suppose I have already a class MyFile that has an efficient method (or
> operator) to compare two MyFile s for equality.
>
> What is the most efficient way to obtain all sets of equal files (of
> course each set must have more than one file - all single files are
> discarded)?
>
> Thanks for any suggestions.

Hash them in some way.

This has two costs:
1) You need to figure out some hashing algorithm such that any two
equal files have the same hash, and ideally, that unequal files will
generally have unequal hashes.
2) Hash each file once, and then do your comparisons on the hashes,
finally testing for actual equality only on those with the same hash.
(The last step takes care of hash collisions - where different files
happen to have the same hash - so ideally, you shouldn't have to do
this often.)

If your definition of "equal" among MyFiles is simply based on the
file content, it's easy - just hash the content. But if there are ways
for files to be considered equal without being bit-for-bit identical,
you'll have to reflect that in the hash. For instance, if you consider
files equal if they differ only in whitespace, then you'd need to
convert all whitespace to a single space before hashing; or if you
don't care about the order of lines in the file, you could either hash
the lines separately and sum/xor the hashes, or sort the lines before
hashing. But the rest of the job is pretty straight-forward.

ChrisA