Path: csiph.com!fu-berlin.de!uni-berlin.de!not-for-mail
From: Chris Angelico <rosuav@gmail.com>
Newsgroups: comp.lang.python
Subject: Re: A sets algorithm
Date: Tue, 9 Feb 2016 15:27:40 +1100
Lines: 57
Message-ID: <mailman.119.1454992069.2317.python-list@python.org>
References: <n98e0f$15lj$1@gioia.aioe.org> <CC00410F-D160-4C34-A933-C1810614A178@gmail.com> <1454942992.2532814.515120466.5DA4A683@webmail.messagingengine.com> <mailman.100.1454944293.2317.python-list@python.org> <56b96755$0$2909$c3e8da3$76491128@news.astraweb.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
In-Reply-To: <56b96755$0$2909$c3e8da3$76491128@news.astraweb.com>
Precedence: list
Xref: csiph.com comp.lang.python:102704

On Tue, Feb 9, 2016 at 3:13 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On Tuesday 09 February 2016 02:11, Chris Angelico wrote:
>
>> That's fine for comparing one file against one other. He started out
>> by saying he already had a way to compare files for equality. What he
>> wants is a way to capitalize on that to find all the identical files
>> in a group. A naive approach would simply compare every file against
>> every other, for O(N*N) comparisons - but a hash lookup can make that
>> O(N) on the files themselves, plus (I think) an O(N log N) hash
>> comparison job, which has much lower constant factors.
>
> You're mixing up N's there :-) In the first two (compare every file against
> every other file, and hash lookup), N stands for the number of files. But in
> the hash comparison, well, I'm not sure what you mean by that, unless you
> mean the calculation of the hash itself, which will be O(M) where M is the
> size of the file.
>
> Unfortunately, even using a hash, you still have to do O(N**2) work, or
> rather, O(N*M) where N is the number of files and M is their size.
>

My intention was that every N was "number of files being compared",
but it's possible I made a mistake elsewhere. Assuming the hashes
become integers too large for a bucket sort (which they certainly
will, unless you want inordinate numbers of hash collisions), the code
would look something like this:

hash_to_filename = defaultdict(list)
for fn in files:
    # Step 1: Hash every file.
    hash = calculate_hash(fn)

    # Step 2: Locate all pairs of files with identical hashes
    hash_to_filename[hash].append(fn)


This loop executes once for each file, so calculate_hash() is called
once for each file. The cost of that is O(N), or possibly O(N*M). If
you're fairly sure the files are going to differ in size, you could
use the file size *as* the hash - it's cheap to calculate (no M
component whatsoever - just a stat call, and even that could be
optimized away in some cases, eg if you're scanning a whole
directory), but isn't cryptographically secure and ignores file
content altogether.

The second step involves one dictionary lookup or insertion for each
file. That's an O(log N) operation, where N is the number of elements
in the dictionary. So yes, it's not quite the same as the number of
files (if every file exists exactly twice, this N will be half the
other N), but it's broadly going to correspond. And an O(log N)
operation performed N times has an overall cost of O(N log N).

I might have something wrong here, but definitely I meant to have the
Ns all mean the same thing, like X on a Magic: The Gathering card. :)

ChrisA