Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.python > #102637

Re: A sets algorithm

Path csiph.com!news.swapon.de!fu-berlin.de!uni-berlin.de!not-for-mail
From Chris Angelico <rosuav@gmail.com>
Newsgroups comp.lang.python
Subject Re: A sets algorithm
Date Mon, 8 Feb 2016 08:58:28 +1100
Lines 38
Message-ID <mailman.79.1454882316.2317.python-list@python.org> (permalink)
References <n98e0f$15lj$1@gioia.aioe.org>
Mime-Version 1.0
Content-Type text/plain; charset=UTF-8
X-Trace news.uni-berlin.de 1AdWJ0RfWeuUxv4Z56y/FA1vweJ32gGQqq3gLTHPkfkw==
Return-Path <rosuav@gmail.com>
X-Original-To python-list@python.org
Delivered-To python-list@mail.python.org
X-Spam-Status OK 0.005
X-Spam-Evidence '*H*': 0.99; '*S*': 0.00; '(of': 0.07; 'cc:addr :python-list': 0.09; 'separately': 0.09; 'suggestions.': 0.09; 'python': 0.10; 'file,': 0.15; '2016': 0.16; 'equality.': 0.16; 'from:addr:rosuav': 0.16; 'from:name:chris angelico': 0.16; 'ideally,': 0.16; 'paulo': 0.16; 'received:io': 0.16; 'received:psf.io': 0.16; 'unequal': 0.16; 'wrote:': 0.16; 'differ': 0.18; 'instance,': 0.18; "shouldn't": 0.18; 'cc:2**0': 0.20; 'cc:addr:python.org': 0.20; 'algorithm': 0.20; '(the': 0.22; 'strict': 0.22; 'suppose': 0.22; 'am,': 0.23; '(or': 0.23; 'feb': 0.23; 'sets': 0.23; 'header:In-Reply-To:1': 0.24; 'mon,': 0.24; 'sort': 0.25; 'testing': 0.25; 'rest': 0.26; 'figure': 0.27; 'compare': 0.27; 'message-id:@mail.gmail.com': 0.27; 'reflect': 0.27; 'actual': 0.28; 'equality': 0.29; 'hash': 0.29; 'once,': 0.29; 'whitespace': 0.29; 'convert': 0.29; 'generally': 0.32; 'class': 0.33; 'definition': 0.34; 'equal': 0.34; 'file': 0.34; 'received:google.com': 0.35; 'could': 0.35; 'question,': 0.35; 'step': 0.36; 'but': 0.36; 'there': 0.36; 'lines': 0.36; 'received:209.85': 0.36; 'subject:: ': 0.37; 'two': 0.37; 'being': 0.37; 'method': 0.37; 'thanks': 0.37; 'received:209.85.213': 0.37; 'received:209': 0.38; 'files': 0.38; 'takes': 0.39; 'where': 0.40; 'space': 0.40; 'some': 0.40; 'easy': 0.60; 'care': 0.60; 'your': 0.60; "you'll": 0.61; 'course': 0.62; 'more': 0.63; 'different': 0.63; 'content,': 0.66; 'hello!': 0.66; 'finally': 0.70; 'chrisa': 0.84; 'to:none': 0.91
DKIM-Signature v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:cc :content-type; bh=7SDXM/dw879+o4oY6W/RYSr2Q5yPdGVu6axQoAHIoJo=; b=w+t8qJ1gLOBxI/ORwwbb6Ge0lrWUs+0BhxQQsc1/BvxUsyMwuSZ1OhCiFeoMceY/Gm L8Wpenptr+CO19JzG1psw/o3vwK7HDXZSOraNRavTTEQibQTV1pPmexcaC0zJn3s/Mk4 /biTFeDURZX3L/rk3X35LC5TRuZbib8s4W3NPdv+99HKCuMlsSoKXNMjyFxZvlLOkPzv 2hwAEP1PpwVE2QMo2FOCAoteMoWLZKwS8cr5m2/ZPiZFW05WdkLpSyHmK85+36mVR0wC RJDNE6q6EJZhqPI9XzWwthRXvVprw5fH19r1VRtMiBN40yK8Q6hGBlUFacG2crvoZ3Ab Axfg==
X-Google-DKIM-Signature v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:cc:content-type; bh=7SDXM/dw879+o4oY6W/RYSr2Q5yPdGVu6axQoAHIoJo=; b=mHfJ8mD8WRrlUmKGC4bizsGCNBv2e4ABl+U+/cjLiGe0nKDxWuBN1HAzNp1PmsioET qGyJu8IWDPx4HOnlG2KLLRDvVAjnn60vvkXfzyAbGOG5erqkkQeuMSxgDNHa39IhD+vy gqG4Hd57FSFp8XLClebMyN+DL3DF/E3115Llvy4qwOCPw2GdUL3NwzGvlWrT1IWpZV2A rQPJbK1B3Uj/nCdDOMb/BXFChrLTklZ4K/HMIRt/uV9OlNmFsTRvyyQ/6Npxrc+NxHIZ zlzN1AmVUVyOF1xVXpozuP0PSGCn4saGlt1r7KQOKzXAkk5ctX1OV+qvnaUO+4+XjD9N FV7Q==
X-Gm-Message-State AG10YOQEvUvf1vL6N2KqfK4RM7vOjfk8EX3NVPMHeeiCanCFDKvcs7HQI6nTeoxfXNFrlGLeGc5WHsb04MXCog==
X-Received by 10.50.28.105 with SMTP id a9mr21588911igh.94.1454882308584; Sun, 07 Feb 2016 13:58:28 -0800 (PST)
In-Reply-To <n98e0f$15lj$1@gioia.aioe.org>
X-BeenThere python-list@python.org
X-Mailman-Version 2.1.21rc2
Precedence list
List-Id General discussion list for the Python programming language <python-list.python.org>
List-Unsubscribe <https://mail.python.org/mailman/options/python-list>, <mailto:python-list-request@python.org?subject=unsubscribe>
List-Archive <http://mail.python.org/pipermail/python-list/>
List-Post <mailto:python-list@python.org>
List-Help <mailto:python-list-request@python.org?subject=help>
List-Subscribe <https://mail.python.org/mailman/listinfo/python-list>, <mailto:python-list-request@python.org?subject=subscribe>
Xref csiph.com comp.lang.python:102637

Show key headers only | View raw


On Mon, Feb 8, 2016 at 8:46 AM, Paulo da Silva
<p_s_d_a_s_i_l_v_a_ns@netcabo.pt> wrote:
> Hello!
>
> This may not be a strict python question, but ...
>
> Suppose I have already a class MyFile that has an efficient method (or
> operator) to compare two MyFile s for equality.
>
> What is the most efficient way to obtain all sets of equal files (of
> course each set must have more than one file - all single files are
> discarded)?
>
> Thanks for any suggestions.

Hash them in some way.

This has two costs:
1) You need to figure out some hashing algorithm such that any two
equal files have the same hash, and ideally, that unequal files will
generally have unequal hashes.
2) Hash each file once, and then do your comparisons on the hashes,
finally testing for actual equality only on those with the same hash.
(The last step takes care of hash collisions - where different files
happen to have the same hash - so ideally, you shouldn't have to do
this often.)

If your definition of "equal" among MyFiles is simply based on the
file content, it's easy - just hash the content. But if there are ways
for files to be considered equal without being bit-for-bit identical,
you'll have to reflect that in the hash. For instance, if you consider
files equal if they differ only in whitespace, then you'd need to
convert all whitespace to a single space before hashing; or if you
don't care about the order of lines in the file, you could either hash
the lines separately and sum/xor the hashes, or sort the lines before
hashing. But the rest of the job is pretty straight-forward.

ChrisA

Back to comp.lang.python | Previous | NextPrevious in thread | Next in thread | Find similar | Unroll thread


Thread

A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-07 21:46 +0000
  Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-08 08:58 +1100
  Re: A sets algorithm Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2016-02-07 22:03 +0000
  Re: A sets algorithm Tim Chase <python.list@tim.thechases.com> - 2016-02-07 16:17 -0600
    Re: A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-08 00:05 +0000
      Re: A sets algorithm Tim Chase <python.list@tim.thechases.com> - 2016-02-07 18:20 -0600
  Re: A sets algorithm Cem Karan <cfkaran2@gmail.com> - 2016-02-07 20:07 -0500
  Re: A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-08 02:22 +0000
  Re: A sets algorithm Random832 <random832@fastmail.com> - 2016-02-08 09:49 -0500
  Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-09 02:11 +1100
    Re: A sets algorithm Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-02-09 15:13 +1100
      Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-09 15:27 +1100
        Re: A sets algorithm Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2016-02-09 17:48 +1300

csiph-web