Re: A sets algorithm

Path	csiph.com!fu-berlin.de!uni-berlin.de!not-for-mail
From	Chris Angelico <rosuav@gmail.com>
Newsgroups	comp.lang.python
Subject	Re: A sets algorithm
Date	Tue, 9 Feb 2016 15:27:40 +1100
Lines	57
Message-ID	<mailman.119.1454992069.2317.python-list@python.org> (permalink)
References	<n98e0f$15lj$1@gioia.aioe.org> <CC00410F-D160-4C34-A933-C1810614A178@gmail.com> <1454942992.2532814.515120466.5DA4A683@webmail.messagingengine.com> <mailman.100.1454944293.2317.python-list@python.org> <56b96755$0$2909$c3e8da3$76491128@news.astraweb.com>
Mime-Version	1.0
Content-Type	text/plain; charset=UTF-8
X-Trace	news.uni-berlin.de +6zrtK42zSuoLOsQvtFdVQQXXGxh/D7DXlTzmECA5I4g==
Return-Path	<rosuav@gmail.com>
X-Original-To	python-list@python.org
Delivered-To	python-list@mail.python.org
X-Spam-Status	OK 0.000
X-Spam-Evidence	'H': 1.00; 'S': 0.00; 'received:209.85.223': 0.03; 'cc:addr:python-list': 0.09; 'executes': 0.09; 'files:': 0.09; 'insertion': 0.09; 'integers': 0.09; 'lookup': 0.09; 'themselves,': 0.09; 'will,': 0.09; ':-)': 0.12; 'files.': 0.13; 'size,': 0.13; 'file,': 0.15; '2016': 0.16; 'bucket': 0.16; 'correspond.': 0.16; 'dictionary.': 0.16; 'elsewhere.': 0.16; 'equality.': 0.16; 'from:addr:rosuav': 0.16; 'from:name:chris angelico': 0.16; 'hashes': 0.16; 'ignores': 0.16; 'n),': 0.16; 'naive': 0.16; 'operation,': 0.16; 'pairs': 0.16; 'rather,': 0.16; 'received:io': 0.16; 'received:psf.io': 0.16; 'whatsoever': 0.16; 'wrote:': 0.16; 'cheap': 0.18; 'comparing': 0.18; 'differ': 0.18; 'exists': 0.18; 'cc:2**0': 0.20; 'cc:addr:python.org': 0.20; 'work,': 0.21; 'fairly': 0.22; 'meant': 0.22; 'saying': 0.22; 'assuming': 0.22; 'constant': 0.22; 'file.': 0.22; 'feb': 0.23; 'component': 0.23; 'elements': 0.23; 'this:': 0.23; 'second': 0.24; 'header:In-Reply-To:1': 0.24; 'sort': 0.25; '(which': 0.26; 'chris': 0.26; 'compare': 0.27; 'message-id:@mail.gmail.com': 0.27; 'fine': 0.28; 'comparison': 0.29; 'dictionary': 0.29; 'hash': 0.29; 'itself,': 0.29; 'other,': 0.29; "i'm": 0.30; 'code': 0.30; 'certainly': 0.30; 'possibly': 0.32; 'call,': 0.33; "d'aprano": 0.33; 'stands': 0.33; 'steven': 0.33; 'tue,': 0.34; 'file': 0.34; 'that,': 0.34; 'received:google.com': 0.35; 'could': 0.35; 'quite': 0.35; 'something': 0.35; "isn't": 0.35; 'step': 0.36; 'but': 0.36; 'too': 0.36; 'there': 0.36; 'received:209.85': 0.36; 'possible': 0.36; 'pm,': 0.36; 'subject:: ': 0.37; 'two': 0.37; 'being': 0.37; 'received:209': 0.38; 'wrong': 0.38; 'log': 0.38; 'files': 0.38; 'mean': 0.38; 'sure': 0.39; 'where': 0.40; 'still': 0.40; 'called': 0.40; 'some': 0.40; 'secure': 0.60; 'skip:u 10': 0.61; 'times': 0.63; 'other.': 0.64; 'card.': 0.67; 'overall': 0.72; 'gathering': 0.76; 'chrisa': 0.84; 'to:none': 0.91; 'mistake': 0.91; 'thing,': 0.93
DKIM-Signature	v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:cc :content-type; bh=iTQDesSCXQK2IplTuf0ZqHyy5jpZXl1bBXSSK+bglto=; b=nZbwvLjqQzGSGlI81DOrKr8CT97yddicdwGZt+wKXE1g4PLaN3/okAuDJ4mvWr+aQ6 0tM0yqggPCtcljwHuCJuKhx+Aq5CslLaQqurLz3GD3H/RbnfXBiWCYihmKGPjW/rmekb T1FfcGUpqclPmTmojp1YoiRKX875EnRm4Z/u00dFH9NREfUC7t6DD82zVztDpQozgvRZ TVGGqdbkuxkLwZGkzpskKo923iBwBn1xPnWMG0IX4Re+Ve0r2DiL7N3bK9v1vjbJSuaT gs5DFA/P1IEdM7nrLGsREJrZmB+apKlRFM3zjsBlxYNtIGHFnUCGDVfmxEioSxI8iEXW YLUQ==
X-Google-DKIM-Signature	v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:cc:content-type; bh=iTQDesSCXQK2IplTuf0ZqHyy5jpZXl1bBXSSK+bglto=; b=kTyjQpGJz3v+zMa45TJQMzLh5r5idXrIhvo3w7hWRHBOQRmgTS0JVu8/K8JeQ1QqJy S4AOAt2olyLHoFR6PHuSdnM92SG/qpVr8HXoKZ9p1a2ANKlwg5Fz5dk0vikPe4EzAlND Zi/mr3HI8EGHDWGZzfHxhLm3Ff2iUnzThwI9YDy8URk1mwAcD2N0soR4kJva85/mJHRm WMfebwBkBacD1QZN2A92QHhqSKlH6qG9FYppkvhUkaeNc/Koc4Oye8AH8dLrjBuonnwq rTlRxuz0LtSEWaIsN7tSYSTd1RxX9Ugdb61QQThXABFDr1Rki4eOQX6NoMrcw+uZbOiO RhdA==
X-Gm-Message-State	AG10YOQWAocAVArdZG1ApzFYnsR/WLQwGGncE70wUCe7PuoDeSknQ4PnahTZp9JHDXhobeyS4QNZ1hBfO/LLRw==
X-Received	by 10.107.14.73 with SMTP id 70mr30733536ioo.31.1454992060454; Mon, 08 Feb 2016 20:27:40 -0800 (PST)
In-Reply-To	<56b96755$0$2909$c3e8da3$76491128@news.astraweb.com>
X-BeenThere	python-list@python.org
X-Mailman-Version	2.1.21rc2
Precedence	list
List-Id	General discussion list for the Python programming language <python-list.python.org>
List-Unsubscribe	<https://mail.python.org/mailman/options/python-list>, <mailto:python-list-request@python.org?subject=unsubscribe>
List-Archive	<http://mail.python.org/pipermail/python-list/>
List-Post	<mailto:python-list@python.org>
List-Help	<mailto:python-list-request@python.org?subject=help>
List-Subscribe	<https://mail.python.org/mailman/listinfo/python-list>, <mailto:python-list-request@python.org?subject=subscribe>
Xref	csiph.com comp.lang.python:102704

Show key headers only | View raw

On Tue, Feb 9, 2016 at 3:13 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On Tuesday 09 February 2016 02:11, Chris Angelico wrote:
>
>> That's fine for comparing one file against one other. He started out
>> by saying he already had a way to compare files for equality. What he
>> wants is a way to capitalize on that to find all the identical files
>> in a group. A naive approach would simply compare every file against
>> every other, for O(N*N) comparisons - but a hash lookup can make that
>> O(N) on the files themselves, plus (I think) an O(N log N) hash
>> comparison job, which has much lower constant factors.
>
> You're mixing up N's there :-) In the first two (compare every file against
> every other file, and hash lookup), N stands for the number of files. But in
> the hash comparison, well, I'm not sure what you mean by that, unless you
> mean the calculation of the hash itself, which will be O(M) where M is the
> size of the file.
>
> Unfortunately, even using a hash, you still have to do O(N**2) work, or
> rather, O(N*M) where N is the number of files and M is their size.
>

My intention was that every N was "number of files being compared",
but it's possible I made a mistake elsewhere. Assuming the hashes
become integers too large for a bucket sort (which they certainly
will, unless you want inordinate numbers of hash collisions), the code
would look something like this:

hash_to_filename = defaultdict(list)
for fn in files:
    # Step 1: Hash every file.
    hash = calculate_hash(fn)

    # Step 2: Locate all pairs of files with identical hashes
    hash_to_filename[hash].append(fn)

This loop executes once for each file, so calculate_hash() is called
once for each file. The cost of that is O(N), or possibly O(N*M). If
you're fairly sure the files are going to differ in size, you could
use the file size *as* the hash - it's cheap to calculate (no M
component whatsoever - just a stat call, and even that could be
optimized away in some cases, eg if you're scanning a whole
directory), but isn't cryptographically secure and ignores file
content altogether.

The second step involves one dictionary lookup or insertion for each
file. That's an O(log N) operation, where N is the number of elements
in the dictionary. So yes, it's not quite the same as the number of
files (if every file exists exactly twice, this N will be half the
other N), but it's broadly going to correspond. And an O(log N)
operation performed N times has an overall cost of O(N log N).

I might have something wrong here, but definitely I meant to have the
Ns all mean the same thing, like X on a Magic: The Gathering card. :)

ChrisA

Thread

A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-07 21:46 +0000
  Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-08 08:58 +1100
  Re: A sets algorithm Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2016-02-07 22:03 +0000
  Re: A sets algorithm Tim Chase <python.list@tim.thechases.com> - 2016-02-07 16:17 -0600
    Re: A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-08 00:05 +0000
      Re: A sets algorithm Tim Chase <python.list@tim.thechases.com> - 2016-02-07 18:20 -0600
  Re: A sets algorithm Cem Karan <cfkaran2@gmail.com> - 2016-02-07 20:07 -0500
  Re: A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-08 02:22 +0000
  Re: A sets algorithm Random832 <random832@fastmail.com> - 2016-02-08 09:49 -0500
  Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-09 02:11 +1100
    Re: A sets algorithm Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-02-09 15:13 +1100
      Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-09 15:27 +1100
        Re: A sets algorithm Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2016-02-09 17:48 +1300

csiph-web