Re: A sets algorithm

Path	csiph.com!fu-berlin.de!uni-berlin.de!not-for-mail
From	Cem Karan <cfkaran2@gmail.com>
Newsgroups	comp.lang.python
Subject	Re: A sets algorithm
Date	Sun, 7 Feb 2016 20:07:40 -0500
Lines	55
Message-ID	<mailman.83.1454893663.2317.python-list@python.org> (permalink)
References	<n98e0f$15lj$1@gioia.aioe.org>
Mime-Version	1.0 (Mac OS X Mail 6.6 $1510$)
Content-Type	text/plain; charset=us-ascii
Content-Transfer-Encoding	quoted-printable
X-Trace	news.uni-berlin.de eCrSvOejXjpnN+Q+DJKPyQ66medlOVTeZmgZxeWVwV3Q==
Return-Path	<cfkaran2@gmail.com>
X-Original-To	python-list@python.org
Delivered-To	python-list@mail.python.org
X-Spam-Status	OK 0.001
X-Spam-Evidence	'H': 1.00; 'S': 0.00; '(of': 0.07; 'pretend': 0.07; 'cc:addr:python-list': 0.09; '(use': 0.09; 'benjamin': 0.09; 'iterate': 0.09; 'portions': 0.09; 'suggestions.': 0.09; 'tends': 0.09; 'python': 0.10; 'files.': 0.13; 'sections': 0.13; 'equality.': 0.16; 'file).': 0.16; 'hashes': 0.16; 'length,': 0.16; 'luck!': 0.16; 'operation,': 0.16; 'paulo': 0.16; 'received:192.168.1.4': 0.16; 'received:io': 0.16; 'received:psf.io': 0.16; 'slow,': 0.16; 'suggested,': 0.16; 'this).': 0.16; 'wrote:': 0.16; 'byte': 0.18; 'bytes': 0.18; 'pfxlen:0': 0.18; 'cc:20': 0.20; 'cc:addr:python.org': 0.20; '(the': 0.22; 'gather': 0.22; 'strict': 0.22; 'suppose': 0.22; 'file.': 0.22; 'cc:no real name:20': 0.22; '(or': 0.23; 'feb': 0.23; 'sets': 0.23; "haven't": 0.24; 'header:In-Reply-To:1': 0.24; 'sort': 0.25; 'chris': 0.26; 'figure': 0.27; 'helpful': 0.27; 'compare': 0.27; '(e.g.,': 0.27; 'tend': 0.27; 'clever': 0.29; 'dictionary': 0.29; 'equality': 0.29; 'hash': 0.29; 'once.': 0.29; 'received:209.85.220.174': 0.29; 'unique,': 0.29; 'maybe': 0.33; 'class': 0.33; 'message-id:@gmail.com': 0.34; 'case,': 0.34; 'equal': 0.34; 'file': 0.34; 'lists': 0.34; 'list': 0.34; 'received:google.com': 0.35; 'false': 0.35; 'options:': 0.35; 'question,': 0.35; 'something': 0.35; 'but': 0.36; 'list,': 0.36; 'should': 0.36; 'there': 0.36; 'received:209.85': 0.36; 'possible': 0.36; 'pm,': 0.36; 'subject:: ': 0.37; 'two': 0.37; 'method': 0.37; 'thanks': 0.37; 'suggestion': 0.37; 'charset:us- ascii': 0.37; 'list.': 0.37; 'received:209': 0.38; 'several': 0.38; 'files': 0.38; 'or,': 0.38; 'received:209.85.220': 0.38; 'test': 0.39; 'data': 0.39; 'received:192': 0.39; 'rather': 0.39; 'your': 0.60; "you'll": 0.61; 'hope': 0.61; 'entire': 0.61; 'header:Message-Id:1': 0.61; 'course': 0.62; 'above,': 0.63; 'more': 0.63; 'different': 0.63; 'you.': 0.64; 'soon': 0.65; 'hello!': 0.66; 'here': 0.66; 'potentially': 0.67; 'you:': 0.79; 'oscar': 0.84
DKIM-Signature	v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=content-type:mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=WpRlNcG8E4UUtHYX0/JTdV64k3C7sR35cRH3/rhHexw=; b=vQ7v6jMxqWIOJbAvsfpVTBdg7IBKt6C8mBau8Dr8MNIkFxzcRa2VPq4Hf7rBkvJZ0l Wn+0nHwy2XDzg+33NCvIEsQ8R1t0VvFc1UOlg1fhyUpkEjcvObU0zcy13Q07Zfk7D6/u 68GTmJUZizBL6OtZF4yL2DKld6QsNFkpXgcIfJZmY6u6TCgORR0boNxfcFG5umZ4aOMv pUMbPfy1siJrw943BIDkZiq3+vWcl6lhU+WTj0IZCq/AoDyZbluwEHM5OiDwkI55zIKv 9pBh2JuL92Wze2DligvqeLeeNYxB8eo9RvN7z4zGyZo9EJE3Wo0wqmk/tE+rdC7rgCX0 ZsIQ==
X-Google-DKIM-Signature	v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:content-type:mime-version:subject:from :in-reply-to:date:cc:content-transfer-encoding:message-id:references :to; bh=WpRlNcG8E4UUtHYX0/JTdV64k3C7sR35cRH3/rhHexw=; b=PjVPKddS8Ok8R+2POPuQQcI7dwi5J87y8U82AuEHiUAra38iwyotvSFBtCXcQ35ySV EfbR7egdxAIj8r8OOiAujcErw/B/qCAaqJ5hrQBVWvNcfLi3gkA5uJIlyX+c5PJmzuhZ jhY3YWQlTg7jxaAi/ft562s1VqmTHxYzvO6NHW1rICyStrAAsOLRU/F/471BXgS5tJ3L 1g7gxWk6aoMSQpGw0B4TrqMbNVe+3NTnpJpINoveBm12pOehtYSFzPj5V4rgrnek24QU 5GTzPtfgjgQkz/vpniLZmh70DFakwx9zLGogBOZUt7xBIF6kF2Xb12TN8HUJ79XFzkJ7 Q4AQ==
X-Gm-Message-State	AG10YOT2V9um9ayEq7g/x5cjhiGbu+srUHENk73jqM+SAwJkwWKf3XmLxCJnwnWZXLnYmA==
X-Received	by 10.55.41.7 with SMTP id p7mr5317512qkh.86.1454893661832; Sun, 07 Feb 2016 17:07:41 -0800 (PST)
In-Reply-To	<n98e0f$15lj$1@gioia.aioe.org>
X-Mailer	Apple Mail (2.1510)
X-BeenThere	python-list@python.org
X-Mailman-Version	2.1.21rc2
Precedence	list
List-Id	General discussion list for the Python programming language <python-list.python.org>
List-Unsubscribe	<https://mail.python.org/mailman/options/python-list>, <mailto:python-list-request@python.org?subject=unsubscribe>
List-Archive	<http://mail.python.org/pipermail/python-list/>
List-Post	<mailto:python-list@python.org>
List-Help	<mailto:python-list-request@python.org?subject=help>
List-Subscribe	<https://mail.python.org/mailman/listinfo/python-list>, <mailto:python-list-request@python.org?subject=subscribe>
Xref	csiph.com comp.lang.python:102643

Show key headers only | View raw

On Feb 7, 2016, at 4:46 PM, Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> wrote:

> Hello!
> 
> This may not be a strict python question, but ...
> 
> Suppose I have already a class MyFile that has an efficient method (or
> operator) to compare two MyFile s for equality.
> 
> What is the most efficient way to obtain all sets of equal files (of
> course each set must have more than one file - all single files are
> discarded)?
> 
> Thanks for any suggestions.

If you're after strict equality (every byte in a pair of files is identical), then here are a few heuristics that may help you:

1) Test for file length, without reading in the whole file.  You can use os.path.getsize() to do this (I hope that this is a constant-time operation, but I haven't tested it).  As Oscar Benjamin suggested, you can create a defaultdict(list) which will make it possible to gather lists of files of equal size.  This should help you gather your potentially identical files quickly.

2) Once you have your dictionary from above, you can iterate its values, each of which will be a list.  If a list has only one file in it, you know its unique, and you don't have to do any more work on it.  If there are two files in the list, then you have several different options:
	a) Use Chris Angelico's suggestion and hash each of the files (use the standard library's 'hashlib' for this).  Identical files will always have identical hashes, but there may be false positives, so you'll need to verify that files that have identical hashes are indeed identical.
	b) If your files tend to have sections that are very different (e.g., the first 32 bytes tend to be different), then you pretend that section of the file is its hash.  You can then do the same trick as above. (the advantage of this is that you will read in a lot less data than if you have to hash the entire file).
	c) You may be able to do something clever by reading portions of each file.  That is, use zip() combined with read(1024) to read each of the files in sections, while keeping hashes of the files.  Or, maybe you'll be able to read portions of them and sort the list as you're reading.  In either case, if any files are NOT identical, then you'll be able to stop work as soon as you figure this out, rather than having to read the entire file at once.

The main purpose of these suggestions is to reduce the amount of reading you're doing.  Storage tends to be slow, and any tricks that reduce the number of bytes you need to read in will be helpful to you.

Good luck!
Cem Karan

Thread

A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-07 21:46 +0000
  Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-08 08:58 +1100
  Re: A sets algorithm Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2016-02-07 22:03 +0000
  Re: A sets algorithm Tim Chase <python.list@tim.thechases.com> - 2016-02-07 16:17 -0600
    Re: A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-08 00:05 +0000
      Re: A sets algorithm Tim Chase <python.list@tim.thechases.com> - 2016-02-07 18:20 -0600
  Re: A sets algorithm Cem Karan <cfkaran2@gmail.com> - 2016-02-07 20:07 -0500
  Re: A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-08 02:22 +0000
  Re: A sets algorithm Random832 <random832@fastmail.com> - 2016-02-08 09:49 -0500
  Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-09 02:11 +1100
    Re: A sets algorithm Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-02-09 15:13 +1100
      Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-09 15:27 +1100
        Re: A sets algorithm Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2016-02-09 17:48 +1300

csiph-web