Path: csiph.com!news.swapon.de!fu-berlin.de!uni-berlin.de!not-for-mail From: Chris Angelico Newsgroups: comp.lang.python Subject: Re: A sets algorithm Date: Mon, 8 Feb 2016 08:58:28 +1100 Lines: 38 Message-ID: References: Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 X-Trace: news.uni-berlin.de 1AdWJ0RfWeuUxv4Z56y/FA1vweJ32gGQqq3gLTHPkfkw== Return-Path: X-Original-To: python-list@python.org Delivered-To: python-list@mail.python.org X-Spam-Status: OK 0.005 X-Spam-Evidence: '*H*': 0.99; '*S*': 0.00; '(of': 0.07; 'cc:addr :python-list': 0.09; 'separately': 0.09; 'suggestions.': 0.09; 'python': 0.10; 'file,': 0.15; '2016': 0.16; 'equality.': 0.16; 'from:addr:rosuav': 0.16; 'from:name:chris angelico': 0.16; 'ideally,': 0.16; 'paulo': 0.16; 'received:io': 0.16; 'received:psf.io': 0.16; 'unequal': 0.16; 'wrote:': 0.16; 'differ': 0.18; 'instance,': 0.18; "shouldn't": 0.18; 'cc:2**0': 0.20; 'cc:addr:python.org': 0.20; 'algorithm': 0.20; '(the': 0.22; 'strict': 0.22; 'suppose': 0.22; 'am,': 0.23; '(or': 0.23; 'feb': 0.23; 'sets': 0.23; 'header:In-Reply-To:1': 0.24; 'mon,': 0.24; 'sort': 0.25; 'testing': 0.25; 'rest': 0.26; 'figure': 0.27; 'compare': 0.27; 'message-id:@mail.gmail.com': 0.27; 'reflect': 0.27; 'actual': 0.28; 'equality': 0.29; 'hash': 0.29; 'once,': 0.29; 'whitespace': 0.29; 'convert': 0.29; 'generally': 0.32; 'class': 0.33; 'definition': 0.34; 'equal': 0.34; 'file': 0.34; 'received:google.com': 0.35; 'could': 0.35; 'question,': 0.35; 'step': 0.36; 'but': 0.36; 'there': 0.36; 'lines': 0.36; 'received:209.85': 0.36; 'subject:: ': 0.37; 'two': 0.37; 'being': 0.37; 'method': 0.37; 'thanks': 0.37; 'received:209.85.213': 0.37; 'received:209': 0.38; 'files': 0.38; 'takes': 0.39; 'where': 0.40; 'space': 0.40; 'some': 0.40; 'easy': 0.60; 'care': 0.60; 'your': 0.60; "you'll": 0.61; 'course': 0.62; 'more': 0.63; 'different': 0.63; 'content,': 0.66; 'hello!': 0.66; 'finally': 0.70; 'chrisa': 0.84; 'to:none': 0.91 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:cc :content-type; bh=7SDXM/dw879+o4oY6W/RYSr2Q5yPdGVu6axQoAHIoJo=; b=w+t8qJ1gLOBxI/ORwwbb6Ge0lrWUs+0BhxQQsc1/BvxUsyMwuSZ1OhCiFeoMceY/Gm L8Wpenptr+CO19JzG1psw/o3vwK7HDXZSOraNRavTTEQibQTV1pPmexcaC0zJn3s/Mk4 /biTFeDURZX3L/rk3X35LC5TRuZbib8s4W3NPdv+99HKCuMlsSoKXNMjyFxZvlLOkPzv 2hwAEP1PpwVE2QMo2FOCAoteMoWLZKwS8cr5m2/ZPiZFW05WdkLpSyHmK85+36mVR0wC RJDNE6q6EJZhqPI9XzWwthRXvVprw5fH19r1VRtMiBN40yK8Q6hGBlUFacG2crvoZ3Ab Axfg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:cc:content-type; bh=7SDXM/dw879+o4oY6W/RYSr2Q5yPdGVu6axQoAHIoJo=; b=mHfJ8mD8WRrlUmKGC4bizsGCNBv2e4ABl+U+/cjLiGe0nKDxWuBN1HAzNp1PmsioET qGyJu8IWDPx4HOnlG2KLLRDvVAjnn60vvkXfzyAbGOG5erqkkQeuMSxgDNHa39IhD+vy gqG4Hd57FSFp8XLClebMyN+DL3DF/E3115Llvy4qwOCPw2GdUL3NwzGvlWrT1IWpZV2A rQPJbK1B3Uj/nCdDOMb/BXFChrLTklZ4K/HMIRt/uV9OlNmFsTRvyyQ/6Npxrc+NxHIZ zlzN1AmVUVyOF1xVXpozuP0PSGCn4saGlt1r7KQOKzXAkk5ctX1OV+qvnaUO+4+XjD9N FV7Q== X-Gm-Message-State: AG10YOQEvUvf1vL6N2KqfK4RM7vOjfk8EX3NVPMHeeiCanCFDKvcs7HQI6nTeoxfXNFrlGLeGc5WHsb04MXCog== X-Received: by 10.50.28.105 with SMTP id a9mr21588911igh.94.1454882308584; Sun, 07 Feb 2016 13:58:28 -0800 (PST) In-Reply-To: X-BeenThere: python-list@python.org X-Mailman-Version: 2.1.21rc2 Precedence: list List-Id: General discussion list for the Python programming language List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Xref: csiph.com comp.lang.python:102637 On Mon, Feb 8, 2016 at 8:46 AM, Paulo da Silva wrote: > Hello! > > This may not be a strict python question, but ... > > Suppose I have already a class MyFile that has an efficient method (or > operator) to compare two MyFile s for equality. > > What is the most efficient way to obtain all sets of equal files (of > course each set must have more than one file - all single files are > discarded)? > > Thanks for any suggestions. Hash them in some way. This has two costs: 1) You need to figure out some hashing algorithm such that any two equal files have the same hash, and ideally, that unequal files will generally have unequal hashes. 2) Hash each file once, and then do your comparisons on the hashes, finally testing for actual equality only on those with the same hash. (The last step takes care of hash collisions - where different files happen to have the same hash - so ideally, you shouldn't have to do this often.) If your definition of "equal" among MyFiles is simply based on the file content, it's easy - just hash the content. But if there are ways for files to be considered equal without being bit-for-bit identical, you'll have to reflect that in the hash. For instance, if you consider files equal if they differ only in whitespace, then you'd need to convert all whitespace to a single space before hashing; or if you don't care about the order of lines in the file, you could either hash the lines separately and sum/xor the hashes, or sort the lines before hashing. But the rest of the job is pretty straight-forward. ChrisA