Path: csiph.com!fu-berlin.de!uni-berlin.de!not-for-mail From: Tim Chase Newsgroups: comp.lang.python Subject: Re: A sets algorithm Date: Sun, 7 Feb 2016 16:17:54 -0600 Lines: 43 Message-ID: References: Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Trace: news.uni-berlin.de v6fg1kUlOEmGDMMp6i/jbQwEj393nODke5rMRRN2VQag== Return-Path: X-Original-To: python-list@python.org Delivered-To: python-list@mail.python.org X-Spam-Status: OK 0.004 X-Spam-Evidence: '*H*': 0.99; '*S*': 0.00; '(of': 0.07; 'collections': 0.09; 'iterate': 0.09; 'worse': 0.09; '-tkc': 0.16; 'equality.': 0.16; 'from:addr:python.list': 0.16; 'from:addr:tim.thechases.com': 0.16; 'from:name:tim chase': 0.16; 'iterating': 0.16; 'paulo': 0.16; 'received:io': 0.16; 'received:psf.io': 0.16; 'sha1': 0.16; 'wrote:': 0.16; 'skip:l 30': 0.18; 'suppose': 0.22; '(or': 0.23; 'sets': 0.23; 'import': 0.24; 'header:In-Reply-To:1': 0.24; '(e.g.': 0.27; 'compare': 0.27; 'skip:u 20': 0.28; 'equality': 0.29; 'once.': 0.29; 'objects': 0.29; 'class': 0.33; 'equal': 0.34; 'file': 0.34; 'filter': 0.35; 'but': 0.36; 'should': 0.36; 'to:addr:python- list': 0.36; 'subject:: ': 0.37; 'received:10': 0.37; 'two': 0.37; 'method': 0.37; 'charset:us-ascii': 0.37; 'files': 0.38; 'data': 0.39; 'skip:e 20': 0.39; 'to:addr:python.org': 0.40; 'your': 0.60; 'course': 0.62; 'more': 0.63; 'received:23': 0.84 X-Sender-Id: wwwh|x-authuser|tim@thechases.com X-Sender-Id: wwwh|x-authuser|tim@thechases.com X-MC-Relay: Neutral X-MailChannels-SenderId: wwwh|x-authuser|tim@thechases.com X-MailChannels-Auth-Id: wwwh X-MC-Loop-Signature: 1454883632463:2637940544 X-MC-Ingress-Time: 1454883632463 In-Reply-To: X-Mailer: Claws Mail 3.11.1 (GTK+ 2.24.25; x86_64-pc-linux-gnu) X-AuthUser: tim@thechases.com X-BeenThere: python-list@python.org X-Mailman-Version: 2.1.21rc2 Precedence: list List-Id: General discussion list for the Python programming language List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Xref: csiph.com comp.lang.python:102639 On 2016-02-07 21:46, Paulo da Silva wrote: > Suppose I have already a class MyFile that has an efficient method > (or operator) to compare two MyFile s for equality. > > What is the most efficient way to obtain all sets of equal files (of > course each set must have more than one file - all single files are > discarded)? If your MyFile grows a __hash__ method that is based off the uniqueness-qualities that you're comparing, you can use collections.Counter and filter the results: from collections import Counter interesting = [ my_file for my_file, count in Counter(generate_MyFile_objects()).iteritems() if count > 1 ] Which should be roughly O(n). It also has the advantage of iterating over your generator once. If you the MyFile objects can be unique but compare for equality (e.g. two files on the file-system that have the same SHA1 hash, but you want to know the file-names), you'd have to do a paired search which would have worse performance and would need to iterate over the data multiple times: all_files = list(generate_MyFile_objects()) interesting = [ (my_file1, my_file2) for i, my_file1 in enumerate(all_files, 1) for my_file2 in all_files[i:] if my_file1 == my_file2 ] -tkc