Path: csiph.com!fu-berlin.de!uni-berlin.de!not-for-mail
From: Chris Angelico <rosuav@gmail.com>
Newsgroups: comp.lang.python
Subject: Re: A sets algorithm
Date: Tue, 9 Feb 2016 02:11:30 +1100
Lines: 38
Message-ID: <mailman.100.1454944293.2317.python-list@python.org>
References: <n98e0f$15lj$1@gioia.aioe.org> <CC00410F-D160-4C34-A933-C1810614A178@gmail.com> <1454942992.2532814.515120466.5DA4A683@webmail.messagingengine.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
In-Reply-To: <1454942992.2532814.515120466.5DA4A683@webmail.messagingengine.com>
Precedence: list
Xref: csiph.com comp.lang.python:102676

On Tue, Feb 9, 2016 at 1:49 AM, Random832 <random832@fastmail.com> wrote:
> On Sun, Feb 7, 2016, at 20:07, Cem Karan wrote:
>>       a) Use Chris Angelico's suggestion and hash each of the files (use
>> the standard library's 'hashlib' for this).  Identical files will always
>> have identical hashes, but there may be false positives, so you'll need
>> to verify that files that have identical hashes are indeed identical.
>>       b) If your files tend to have sections that are very different
>> (e.g., the first 32 bytes tend to be different), then you pretend that
>> section of the file is its hash.  You can then do the same trick as
>> above. (the advantage of this is that you will read in a lot less data
>> than if you have to hash the entire file).
>>       c) You may be able to do something clever by reading portions of
>> each file.  That is, use zip() combined with read(1024) to read each of
>> the files in sections, while keeping hashes of the files.  Or, maybe
>> you'll be able to read portions of them and sort the list as you're
>> reading.  In either case, if any files are NOT identical, then you'll be
>> able to stop work as soon as you figure this out, rather than having to
>> read the entire file at once.
>>
>> The main purpose of these suggestions is to reduce the amount of reading
>> you're doing.
>
> hashing a file using a conventional hashing algorithm requires reading
> the whole file. Unless the files are very likely to be identical _until_
> near the end, you're better off just reading the first N bytes of both
> files, then the next N bytes, etc, until you find somewhere they're
> different. The filecmp module may be useful for this.
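
For reference, the chunk-by-chunk comparison described above might look
roughly like this. It's only a sketch: the name files_equal and the 64 KiB
chunk size are made up for illustration, and in practice
filecmp.cmp(a, b, shallow=False) from the standard library does much the
same job.

```python
import os


def files_equal(path_a, path_b, chunk_size=64 * 1024):
    """Return True if the two files have identical contents.

    Hypothetical helper: reads both files in lockstep and stops at the
    first chunk that differs, so unequal files are usually rejected early.
    """
    # Files of different sizes can never be equal; this reads no data.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both files hit end-of-file together
                return True
```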

That's fine for comparing one file against one other. He started out
by saying he already had a way to compare files for equality. What he
wants is a way to capitalize on that to find all the identical files
in a group. A naive approach would simply compare every file against
every other, for O(N*N) comparisons - but a hash lookup can make that
O(N) on the files themselves, plus (I think) an O(N log N) hash
comparison job, which has much lower constant factors. The key here is
the hashing algorithm though.
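
A rough sketch of what I mean, assuming SHA-256 from hashlib as the hash
(any decent hash would do) and filecmp for the final byte-for-byte check.
The names file_digest and find_identical are invented for this example:

```python
import filecmp
import hashlib
from collections import defaultdict


def file_digest(path, chunk_size=64 * 1024):
    """Hash one file in chunks so it never has to fit in memory."""
    h = hashlib.sha256()  # assumption: SHA-256; swap in another hash if preferred
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_identical(paths):
    """Return a list of groups (lists of paths) whose contents are identical."""
    # Pass 1: each file is read once and bucketed by its digest.
    by_digest = defaultdict(list)
    for path in paths:
        by_digest[file_digest(path)].append(path)

    # Pass 2: only files that share a digest are compared directly,
    # which guards against hash collisions (false positives).
    groups = []
    for candidates in by_digest.values():
        classes = []
        for path in candidates:
            for cls in classes:
                if filecmp.cmp(cls[0], path, shallow=False):
                    cls.append(path)
                    break
            else:
                classes.append([path])
        groups.extend(cls for cls in classes if len(cls) > 1)
    return groups
```

Calling find_identical() on a list of file paths gives back only the
groups that contain two or more identical files. The expensive pairwise
comparison happens only inside a bucket, and with a good hash a bucket
almost always holds nothing but true duplicates.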

ChrisA