Re: A sets algorithm

From	Chris Angelico <rosuav@gmail.com>
Newsgroups	comp.lang.python
Subject	Re: A sets algorithm
Date	2016-02-08 08:58 +1100
Message-ID	<mailman.79.1454882316.2317.python-list@python.org> (permalink)
References	<n98e0f$15lj$1@gioia.aioe.org>

Show all headers | View raw

On Mon, Feb 8, 2016 at 8:46 AM, Paulo da Silva
<p_s_d_a_s_i_l_v_a_ns@netcabo.pt> wrote:
> Hello!
>
> This may not be a strict python question, but ...
>
> Suppose I have already a class MyFile that has an efficient method (or
> operator) to compare two MyFile s for equality.
>
> What is the most efficient way to obtain all sets of equal files (of
> course each set must have more than one file - all single files are
> discarded)?
>
> Thanks for any suggestions.

Hash them in some way.

This has two costs:
1) You need to figure out some hashing algorithm such that any two
equal files have the same hash, and ideally, that unequal files will
generally have unequal hashes.
2) Hash each file once, and then do your comparisons on the hashes,
finally testing for actual equality only on those with the same hash.
(The last step takes care of hash collisions - where different files
happen to have the same hash - so ideally, you shouldn't have to do
this often.)

If your definition of "equal" among MyFiles is simply based on the
file content, it's easy - just hash the content. But if there are ways
for files to be considered equal without being bit-for-bit identical,
you'll have to reflect that in the hash. For instance, if you consider
files equal if they differ only in whitespace, then you'd need to
convert all whitespace to a single space before hashing; or if you
don't care about the order of lines in the file, you could either hash
the lines separately and sum/xor the hashes, or sort the lines before
hashing. But the rest of the job is pretty straight-forward.

ChrisA

Thread

A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-07 21:46 +0000
  Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-08 08:58 +1100
  Re: A sets algorithm Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2016-02-07 22:03 +0000
  Re: A sets algorithm Tim Chase <python.list@tim.thechases.com> - 2016-02-07 16:17 -0600
    Re: A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-08 00:05 +0000
      Re: A sets algorithm Tim Chase <python.list@tim.thechases.com> - 2016-02-07 18:20 -0600
  Re: A sets algorithm Cem Karan <cfkaran2@gmail.com> - 2016-02-07 20:07 -0500
  Re: A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-08 02:22 +0000
  Re: A sets algorithm Random832 <random832@fastmail.com> - 2016-02-08 09:49 -0500
  Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-09 02:11 +1100
    Re: A sets algorithm Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-02-09 15:13 +1100
      Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-09 15:27 +1100
        Re: A sets algorithm Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2016-02-09 17:48 +1300

csiph-web