Path: csiph.com!fu-berlin.de!uni-berlin.de!not-for-mail From: Cem Karan Newsgroups: comp.lang.python Subject: Re: A sets algorithm Date: Sun, 7 Feb 2016 20:07:40 -0500 Lines: 55 Message-ID: References: Mime-Version: 1.0 (Mac OS X Mail 6.6 \(1510\)) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable X-Trace: news.uni-berlin.de eCrSvOejXjpnN+Q+DJKPyQ66medlOVTeZmgZxeWVwV3Q== Return-Path: X-Original-To: python-list@python.org Delivered-To: python-list@mail.python.org X-Spam-Status: OK 0.001 X-Spam-Evidence: '*H*': 1.00; '*S*': 0.00; '(of': 0.07; 'pretend': 0.07; 'cc:addr:python-list': 0.09; '(use': 0.09; 'benjamin': 0.09; 'iterate': 0.09; 'portions': 0.09; 'suggestions.': 0.09; 'tends': 0.09; 'python': 0.10; 'files.': 0.13; 'sections': 0.13; 'equality.': 0.16; 'file).': 0.16; 'hashes': 0.16; 'length,': 0.16; 'luck!': 0.16; 'operation,': 0.16; 'paulo': 0.16; 'received:192.168.1.4': 0.16; 'received:io': 0.16; 'received:psf.io': 0.16; 'slow,': 0.16; 'suggested,': 0.16; 'this).': 0.16; 'wrote:': 0.16; 'byte': 0.18; 'bytes': 0.18; 'pfxlen:0': 0.18; 'cc:2**0': 0.20; 'cc:addr:python.org': 0.20; '(the': 0.22; 'gather': 0.22; 'strict': 0.22; 'suppose': 0.22; 'file.': 0.22; 'cc:no real name:2**0': 0.22; '(or': 0.23; 'feb': 0.23; 'sets': 0.23; "haven't": 0.24; 'header:In-Reply-To:1': 0.24; 'sort': 0.25; 'chris': 0.26; 'figure': 0.27; 'helpful': 0.27; 'compare': 0.27; '(e.g.,': 0.27; 'tend': 0.27; 'clever': 0.29; 'dictionary': 0.29; 'equality': 0.29; 'hash': 0.29; 'once.': 0.29; 'received:209.85.220.174': 0.29; 'unique,': 0.29; 'maybe': 0.33; 'class': 0.33; 'message-id:@gmail.com': 0.34; 'case,': 0.34; 'equal': 0.34; 'file': 0.34; 'lists': 0.34; 'list': 0.34; 'received:google.com': 0.35; 'false': 0.35; 'options:': 0.35; 'question,': 0.35; 'something': 0.35; 'but': 0.36; 'list,': 0.36; 'should': 0.36; 'there': 0.36; 'received:209.85': 0.36; 'possible': 0.36; 'pm,': 0.36; 'subject:: ': 0.37; 'two': 0.37; 'method': 0.37; 'thanks': 0.37; 'suggestion': 0.37; 'charset:us- ascii': 0.37; 'list.': 0.37; 'received:209': 0.38; 'several': 0.38; 'files': 0.38; 'or,': 0.38; 'received:209.85.220': 0.38; 'test': 0.39; 'data': 0.39; 'received:192': 0.39; 'rather': 0.39; 'your': 0.60; "you'll": 0.61; 'hope': 0.61; 'entire': 0.61; 'header:Message-Id:1': 0.61; 'course': 0.62; 'above,': 0.63; 'more': 0.63; 'different': 0.63; 'you.': 0.64; 'soon': 0.65; 'hello!': 0.66; 'here': 0.66; 'potentially': 0.67; 'you:': 0.79; 'oscar': 0.84 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=content-type:mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=WpRlNcG8E4UUtHYX0/JTdV64k3C7sR35cRH3/rhHexw=; b=vQ7v6jMxqWIOJbAvsfpVTBdg7IBKt6C8mBau8Dr8MNIkFxzcRa2VPq4Hf7rBkvJZ0l Wn+0nHwy2XDzg+33NCvIEsQ8R1t0VvFc1UOlg1fhyUpkEjcvObU0zcy13Q07Zfk7D6/u 68GTmJUZizBL6OtZF4yL2DKld6QsNFkpXgcIfJZmY6u6TCgORR0boNxfcFG5umZ4aOMv pUMbPfy1siJrw943BIDkZiq3+vWcl6lhU+WTj0IZCq/AoDyZbluwEHM5OiDwkI55zIKv 9pBh2JuL92Wze2DligvqeLeeNYxB8eo9RvN7z4zGyZo9EJE3Wo0wqmk/tE+rdC7rgCX0 ZsIQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:content-type:mime-version:subject:from :in-reply-to:date:cc:content-transfer-encoding:message-id:references :to; bh=WpRlNcG8E4UUtHYX0/JTdV64k3C7sR35cRH3/rhHexw=; b=PjVPKddS8Ok8R+2POPuQQcI7dwi5J87y8U82AuEHiUAra38iwyotvSFBtCXcQ35ySV EfbR7egdxAIj8r8OOiAujcErw/B/qCAaqJ5hrQBVWvNcfLi3gkA5uJIlyX+c5PJmzuhZ jhY3YWQlTg7jxaAi/ft562s1VqmTHxYzvO6NHW1rICyStrAAsOLRU/F/471BXgS5tJ3L 1g7gxWk6aoMSQpGw0B4TrqMbNVe+3NTnpJpINoveBm12pOehtYSFzPj5V4rgrnek24QU 5GTzPtfgjgQkz/vpniLZmh70DFakwx9zLGogBOZUt7xBIF6kF2Xb12TN8HUJ79XFzkJ7 Q4AQ== X-Gm-Message-State: AG10YOT2V9um9ayEq7g/x5cjhiGbu+srUHENk73jqM+SAwJkwWKf3XmLxCJnwnWZXLnYmA== X-Received: by 10.55.41.7 with SMTP id p7mr5317512qkh.86.1454893661832; Sun, 07 Feb 2016 17:07:41 -0800 (PST) In-Reply-To: X-Mailer: Apple Mail (2.1510) X-BeenThere: python-list@python.org X-Mailman-Version: 2.1.21rc2 Precedence: list List-Id: General discussion list for the Python programming language List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Xref: csiph.com comp.lang.python:102643 On Feb 7, 2016, at 4:46 PM, Paulo da Silva = wrote: > Hello! >=20 > This may not be a strict python question, but ... >=20 > Suppose I have already a class MyFile that has an efficient method (or > operator) to compare two MyFile s for equality. >=20 > What is the most efficient way to obtain all sets of equal files (of > course each set must have more than one file - all single files are > discarded)? >=20 > Thanks for any suggestions. If you're after strict equality (every byte in a pair of files is = identical), then here are a few heuristics that may help you: 1) Test for file length, without reading in the whole file. You can use = os.path.getsize() to do this (I hope that this is a constant-time = operation, but I haven't tested it). As Oscar Benjamin suggested, you = can create a defaultdict(list) which will make it possible to gather = lists of files of equal size. This should help you gather your = potentially identical files quickly. 2) Once you have your dictionary from above, you can iterate its values, = each of which will be a list. If a list has only one file in it, you = know its unique, and you don't have to do any more work on it. If there = are two files in the list, then you have several different options: a) Use Chris Angelico's suggestion and hash each of the files = (use the standard library's 'hashlib' for this). Identical files will = always have identical hashes, but there may be false positives, so = you'll need to verify that files that have identical hashes are indeed = identical. b) If your files tend to have sections that are very different = (e.g., the first 32 bytes tend to be different), then you pretend that = section of the file is its hash. You can then do the same trick as = above. (the advantage of this is that you will read in a lot less data = than if you have to hash the entire file). c) You may be able to do something clever by reading portions of = each file. That is, use zip() combined with read(1024) to read each of = the files in sections, while keeping hashes of the files. Or, maybe = you'll be able to read portions of them and sort the list as you're = reading. In either case, if any files are NOT identical, then you'll be = able to stop work as soon as you figure this out, rather than having to = read the entire file at once. The main purpose of these suggestions is to reduce the amount of reading = you're doing. Storage tends to be slow, and any tricks that reduce the = number of bytes you need to read in will be helpful to you. Good luck! Cem Karan=