Path: csiph.com!fu-berlin.de!uni-berlin.de!not-for-mail
From: Tim Chase <python.list@tim.thechases.com>
Newsgroups: comp.lang.python
Subject: Re: A sets algorithm
Date: Sun, 7 Feb 2016 16:17:54 -0600
Lines: 43
Message-ID: <mailman.81.1454885914.2317.python-list@python.org>
References: <n98e0f$15lj$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
In-Reply-To: <n98e0f$15lj$1@gioia.aioe.org>
Precedence: list
Xref: csiph.com comp.lang.python:102639

On 2016-02-07 21:46, Paulo da Silva wrote:
> Suppose I have already a class MyFile that has an efficient method
> (or operator) to compare two MyFile s for equality.
> 
> What is the most efficient way to obtain all sets of equal files (of
> course each set must have more than one file - all single files are
> discarded)?

If your MyFile grows a __hash__ method that is based off the
uniqueness-qualities that you're comparing, you can use
collections.Counter and filter the results:

  from collections import Counter
  interesting = [
    my_file
    for my_file, count
    in Counter(generate_MyFile_objects()).iteritems()
    if count > 1
    ]

Which should be roughly O(n).  It also has the advantage of iterating
over your generator once.

If you the MyFile objects can be unique but compare for equality
(e.g. two files on the file-system that have the same SHA1 hash, but
you want to know the file-names), you'd have to do a paired search
which would have worse performance and would need to iterate over the
data multiple times:

  all_files = list(generate_MyFile_objects())
  interesting = [
    (my_file1, my_file2)
    for i, my_file1
    in enumerate(all_files, 1)
    for my_file2
    in all_files[i:]
    if my_file1 == my_file2
    ]

-tkc