Path: csiph.com!fu-berlin.de!uni-berlin.de!not-for-mail From: Chris Angelico Newsgroups: comp.lang.python Subject: Re: A sets algorithm Date: Tue, 9 Feb 2016 15:27:40 +1100 Lines: 57 Message-ID: References: <1454942992.2532814.515120466.5DA4A683@webmail.messagingengine.com> <56b96755$0$2909$c3e8da3$76491128@news.astraweb.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 X-Trace: news.uni-berlin.de +6zrtK42zSuoLOsQvtFdVQQXXGxh/D7DXlTzmECA5I4g== Return-Path: X-Original-To: python-list@python.org Delivered-To: python-list@mail.python.org X-Spam-Status: OK 0.000 X-Spam-Evidence: '*H*': 1.00; '*S*': 0.00; 'received:209.85.223': 0.03; 'cc:addr:python-list': 0.09; 'executes': 0.09; 'files:': 0.09; 'insertion': 0.09; 'integers': 0.09; 'lookup': 0.09; 'themselves,': 0.09; 'will,': 0.09; ':-)': 0.12; 'files.': 0.13; 'size,': 0.13; 'file,': 0.15; '2016': 0.16; 'bucket': 0.16; 'correspond.': 0.16; 'dictionary.': 0.16; 'elsewhere.': 0.16; 'equality.': 0.16; 'from:addr:rosuav': 0.16; 'from:name:chris angelico': 0.16; 'hashes': 0.16; 'ignores': 0.16; 'n),': 0.16; 'naive': 0.16; 'operation,': 0.16; 'pairs': 0.16; 'rather,': 0.16; 'received:io': 0.16; 'received:psf.io': 0.16; 'whatsoever': 0.16; 'wrote:': 0.16; 'cheap': 0.18; 'comparing': 0.18; 'differ': 0.18; 'exists': 0.18; 'cc:2**0': 0.20; 'cc:addr:python.org': 0.20; 'work,': 0.21; 'fairly': 0.22; 'meant': 0.22; 'saying': 0.22; 'assuming': 0.22; 'constant': 0.22; 'file.': 0.22; 'feb': 0.23; 'component': 0.23; 'elements': 0.23; 'this:': 0.23; 'second': 0.24; 'header:In-Reply-To:1': 0.24; 'sort': 0.25; '(which': 0.26; 'chris': 0.26; 'compare': 0.27; 'message-id:@mail.gmail.com': 0.27; 'fine': 0.28; 'comparison': 0.29; 'dictionary': 0.29; 'hash': 0.29; 'itself,': 0.29; 'other,': 0.29; "i'm": 0.30; 'code': 0.30; 'certainly': 0.30; 'possibly': 0.32; 'call,': 0.33; "d'aprano": 0.33; 'stands': 0.33; 'steven': 0.33; 'tue,': 0.34; 'file': 0.34; 'that,': 0.34; 'received:google.com': 0.35; 'could': 0.35; 'quite': 0.35; 'something': 0.35; "isn't": 0.35; 'step': 0.36; 'but': 0.36; 'too': 0.36; 'there': 0.36; 'received:209.85': 0.36; 'possible': 0.36; 'pm,': 0.36; 'subject:: ': 0.37; 'two': 0.37; 'being': 0.37; 'received:209': 0.38; 'wrong': 0.38; 'log': 0.38; 'files': 0.38; 'mean': 0.38; 'sure': 0.39; 'where': 0.40; 'still': 0.40; 'called': 0.40; 'some': 0.40; 'secure': 0.60; 'skip:u 10': 0.61; 'times': 0.63; 'other.': 0.64; 'card.': 0.67; 'overall': 0.72; 'gathering': 0.76; 'chrisa': 0.84; 'to:none': 0.91; 'mistake': 0.91; 'thing,': 0.93 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:cc :content-type; bh=iTQDesSCXQK2IplTuf0ZqHyy5jpZXl1bBXSSK+bglto=; b=nZbwvLjqQzGSGlI81DOrKr8CT97yddicdwGZt+wKXE1g4PLaN3/okAuDJ4mvWr+aQ6 0tM0yqggPCtcljwHuCJuKhx+Aq5CslLaQqurLz3GD3H/RbnfXBiWCYihmKGPjW/rmekb T1FfcGUpqclPmTmojp1YoiRKX875EnRm4Z/u00dFH9NREfUC7t6DD82zVztDpQozgvRZ TVGGqdbkuxkLwZGkzpskKo923iBwBn1xPnWMG0IX4Re+Ve0r2DiL7N3bK9v1vjbJSuaT gs5DFA/P1IEdM7nrLGsREJrZmB+apKlRFM3zjsBlxYNtIGHFnUCGDVfmxEioSxI8iEXW YLUQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:cc:content-type; bh=iTQDesSCXQK2IplTuf0ZqHyy5jpZXl1bBXSSK+bglto=; b=kTyjQpGJz3v+zMa45TJQMzLh5r5idXrIhvo3w7hWRHBOQRmgTS0JVu8/K8JeQ1QqJy S4AOAt2olyLHoFR6PHuSdnM92SG/qpVr8HXoKZ9p1a2ANKlwg5Fz5dk0vikPe4EzAlND Zi/mr3HI8EGHDWGZzfHxhLm3Ff2iUnzThwI9YDy8URk1mwAcD2N0soR4kJva85/mJHRm WMfebwBkBacD1QZN2A92QHhqSKlH6qG9FYppkvhUkaeNc/Koc4Oye8AH8dLrjBuonnwq rTlRxuz0LtSEWaIsN7tSYSTd1RxX9Ugdb61QQThXABFDr1Rki4eOQX6NoMrcw+uZbOiO RhdA== X-Gm-Message-State: AG10YOQWAocAVArdZG1ApzFYnsR/WLQwGGncE70wUCe7PuoDeSknQ4PnahTZp9JHDXhobeyS4QNZ1hBfO/LLRw== X-Received: by 10.107.14.73 with SMTP id 70mr30733536ioo.31.1454992060454; Mon, 08 Feb 2016 20:27:40 -0800 (PST) In-Reply-To: <56b96755$0$2909$c3e8da3$76491128@news.astraweb.com> X-BeenThere: python-list@python.org X-Mailman-Version: 2.1.21rc2 Precedence: list List-Id: General discussion list for the Python programming language List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Xref: csiph.com comp.lang.python:102704 On Tue, Feb 9, 2016 at 3:13 PM, Steven D'Aprano wrote: > On Tuesday 09 February 2016 02:11, Chris Angelico wrote: > >> That's fine for comparing one file against one other. He started out >> by saying he already had a way to compare files for equality. What he >> wants is a way to capitalize on that to find all the identical files >> in a group. A naive approach would simply compare every file against >> every other, for O(N*N) comparisons - but a hash lookup can make that >> O(N) on the files themselves, plus (I think) an O(N log N) hash >> comparison job, which has much lower constant factors. > > You're mixing up N's there :-) In the first two (compare every file against > every other file, and hash lookup), N stands for the number of files. But in > the hash comparison, well, I'm not sure what you mean by that, unless you > mean the calculation of the hash itself, which will be O(M) where M is the > size of the file. > > Unfortunately, even using a hash, you still have to do O(N**2) work, or > rather, O(N*M) where N is the number of files and M is their size. > My intention was that every N was "number of files being compared", but it's possible I made a mistake elsewhere. Assuming the hashes become integers too large for a bucket sort (which they certainly will, unless you want inordinate numbers of hash collisions), the code would look something like this: hash_to_filename = defaultdict(list) for fn in files: # Step 1: Hash every file. hash = calculate_hash(fn) # Step 2: Locate all pairs of files with identical hashes hash_to_filename[hash].append(fn) This loop executes once for each file, so calculate_hash() is called once for each file. The cost of that is O(N), or possibly O(N*M). If you're fairly sure the files are going to differ in size, you could use the file size *as* the hash - it's cheap to calculate (no M component whatsoever - just a stat call, and even that could be optimized away in some cases, eg if you're scanning a whole directory), but isn't cryptographically secure and ignores file content altogether. The second step involves one dictionary lookup or insertion for each file. That's an O(log N) operation, where N is the number of elements in the dictionary. So yes, it's not quite the same as the number of files (if every file exists exactly twice, this N will be half the other N), but it's broadly going to correspond. And an O(log N) operation performed N times has an overall cost of O(N log N). I might have something wrong here, but definitely I meant to have the Ns all mean the same thing, like X on a Magic: The Gathering card. :) ChrisA