Groups > comp.lang.python > #102636 > unrolled thread

A sets algorithm

Started by	Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt>
First post	2016-02-07 21:46 +0000
Last post	2016-02-09 17:48 +1300
Articles	13 — 8 participants

Back to article view | Back to comp.lang.python

  A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-07 21:46 +0000
    Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-08 08:58 +1100
    Re: A sets algorithm Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2016-02-07 22:03 +0000
    Re: A sets algorithm Tim Chase <python.list@tim.thechases.com> - 2016-02-07 16:17 -0600
      Re: A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-08 00:05 +0000
        Re: A sets algorithm Tim Chase <python.list@tim.thechases.com> - 2016-02-07 18:20 -0600
    Re: A sets algorithm Cem Karan <cfkaran2@gmail.com> - 2016-02-07 20:07 -0500
    Re: A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-08 02:22 +0000
    Re: A sets algorithm Random832 <random832@fastmail.com> - 2016-02-08 09:49 -0500
    Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-09 02:11 +1100
      Re: A sets algorithm Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-02-09 15:13 +1100
        Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-09 15:27 +1100
          Re: A sets algorithm Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2016-02-09 17:48 +1300

#102636 — A sets algorithm

From	Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt>
Date	2016-02-07 21:46 +0000
Subject	A sets algorithm
Message-ID	<n98e0f$15lj$1@gioia.aioe.org>

Hello!

This may not be a strict python question, but ...

Suppose I have already a class MyFile that has an efficient method (or
operator) to compare two MyFile s for equality.

What is the most efficient way to obtain all sets of equal files (of
course each set must have more than one file - all single files are
discarded)?

Thanks for any suggestions.
Paulo

[toc] | [next] | [standalone]

#102637

From	Chris Angelico <rosuav@gmail.com>
Date	2016-02-08 08:58 +1100
Message-ID	<mailman.79.1454882316.2317.python-list@python.org>
In reply to	#102636

On Mon, Feb 8, 2016 at 8:46 AM, Paulo da Silva
<p_s_d_a_s_i_l_v_a_ns@netcabo.pt> wrote:
> Hello!
>
> This may not be a strict python question, but ...
>
> Suppose I have already a class MyFile that has an efficient method (or
> operator) to compare two MyFile s for equality.
>
> What is the most efficient way to obtain all sets of equal files (of
> course each set must have more than one file - all single files are
> discarded)?
>
> Thanks for any suggestions.

Hash them in some way.

This has two costs:
1) You need to figure out some hashing algorithm such that any two
equal files have the same hash, and ideally, that unequal files will
generally have unequal hashes.
2) Hash each file once, and then do your comparisons on the hashes,
finally testing for actual equality only on those with the same hash.
(The last step takes care of hash collisions - where different files
happen to have the same hash - so ideally, you shouldn't have to do
this often.)

If your definition of "equal" among MyFiles is simply based on the
file content, it's easy - just hash the content. But if there are ways
for files to be considered equal without being bit-for-bit identical,
you'll have to reflect that in the hash. For instance, if you consider
files equal if they differ only in whitespace, then you'd need to
convert all whitespace to a single space before hashing; or if you
don't care about the order of lines in the file, you could either hash
the lines separately and sum/xor the hashes, or sort the lines before
hashing. But the rest of the job is pretty straight-forward.

ChrisA

[toc] | [prev] | [next] | [standalone]

#102638

From	Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date	2016-02-07 22:03 +0000
Message-ID	<mailman.80.1454882599.2317.python-list@python.org>
In reply to	#102636

On 7 Feb 2016 21:51, "Paulo da Silva" <p_s_d_a_s_i_l_v_a_ns@netcabo.pt>
wrote:
>
> Hello!
>
> This may not be a strict python question, but ...
>
> Suppose I have already a class MyFile that has an efficient method (or
> operator) to compare two MyFile s for equality.
>
> What is the most efficient way to obtain all sets of equal files (of
> course each set must have more than one file - all single files are
> discarded)?

If you can make the MyFiles hashable then this is easily done with
defaultdict(list).

--
Oscar

[toc] | [prev] | [next] | [standalone]

#102639

From	Tim Chase <python.list@tim.thechases.com>
Date	2016-02-07 16:17 -0600
Message-ID	<mailman.81.1454885914.2317.python-list@python.org>
In reply to	#102636

On 2016-02-07 21:46, Paulo da Silva wrote:
> Suppose I have already a class MyFile that has an efficient method
> (or operator) to compare two MyFile s for equality.
> 
> What is the most efficient way to obtain all sets of equal files (of
> course each set must have more than one file - all single files are
> discarded)?

If your MyFile grows a __hash__ method that is based off the
uniqueness-qualities that you're comparing, you can use
collections.Counter and filter the results:

  from collections import Counter
  interesting = [
    my_file
    for my_file, count
    in Counter(generate_MyFile_objects()).iteritems()
    if count > 1
    ]

Which should be roughly O(n).  It also has the advantage of iterating
over your generator once.

If you the MyFile objects can be unique but compare for equality
(e.g. two files on the file-system that have the same SHA1 hash, but
you want to know the file-names), you'd have to do a paired search
which would have worse performance and would need to iterate over the
data multiple times:

  all_files = list(generate_MyFile_objects())
  interesting = [
    (my_file1, my_file2)
    for i, my_file1
    in enumerate(all_files, 1)
    for my_file2
    in all_files[i:]
    if my_file1 == my_file2
    ]

-tkc

[toc] | [prev] | [next] | [standalone]

#102640

From	Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt>
Date	2016-02-08 00:05 +0000
Message-ID	<n98m3o$1h8k$1@gioia.aioe.org>
In reply to	#102639

Às 22:17 de 07-02-2016, Tim Chase escreveu:
> On 2016-02-07 21:46, Paulo da Silva wrote:
...

> 
> If you the MyFile objects can be unique but compare for equality
> (e.g. two files on the file-system that have the same SHA1 hash, but
> you want to know the file-names), you'd have to do a paired search
> which would have worse performance and would need to iterate over the
> data multiple times:
> 
>   all_files = list(generate_MyFile_objects())
>   interesting = [
>     (my_file1, my_file2)
>     for i, my_file1
>     in enumerate(all_files, 1)
>     for my_file2
>     in all_files[i:]
>     if my_file1 == my_file2
>     ]
> 
"my_file1 == my_file2" can be implemented into MyFile class taking
advantage of caching sizes (if different files are different), hashes or
even content (for small files) or file headers (first n bytes).
However this seems to have a problem:
all_files: a b c d e ...
If a==b then comparing b with c,d,e is useless.

May be using several steps with dict - sizes, then hashes for same sizes
files, etc ...

Another solution I thought of, could be defining some methods (I still
don't know which ones) in MyFile so that I could use sets intersection.
Would this one be a faster solution?

Thanks

[toc] | [prev] | [next] | [standalone]

#102642

From	Tim Chase <python.list@tim.thechases.com>
Date	2016-02-07 18:20 -0600
Message-ID	<mailman.82.1454892109.2317.python-list@python.org>
In reply to	#102640

On 2016-02-08 00:05, Paulo da Silva wrote:
> Às 22:17 de 07-02-2016, Tim Chase escreveu:
>>   all_files = list(generate_MyFile_objects())
>>   interesting = [
>>     (my_file1, my_file2)
>>     for i, my_file1
>>     in enumerate(all_files, 1)
>>     for my_file2
>>     in all_files[i:]
>>     if my_file1 == my_file2
>>     ]
> 
> "my_file1 == my_file2" can be implemented into MyFile class taking
> advantage of caching sizes (if different files are different),
> hashes or even content (for small files) or file headers (first n
> bytes). However this seems to have a problem:
> all_files: a b c d e ...
> If a==b then comparing b with c,d,e is useless.

Depends on what the OP wants to have happen if more than one input
file is equal. I.e., a == b == c.  Does one just want "a has
duplicates" (and optionally "and here's one of them"), or does one
want "a == b", "a == c" and "b == c" in the output?

> Another solution I thought of, could be defining some methods (I
> still don't know which ones) in MyFile so that I could use sets
> intersection. Would this one be a faster solution?

Adding __hash__ would allow for the set operations, but would
require (as ChrisA points out) knowing how to create a hash function
that encompasses the information you want to compare.

-tkc

[toc] | [prev] | [next] | [standalone]

#102643

From	Cem Karan <cfkaran2@gmail.com>
Date	2016-02-07 20:07 -0500
Message-ID	<mailman.83.1454893663.2317.python-list@python.org>
In reply to	#102636

On Feb 7, 2016, at 4:46 PM, Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> wrote:

> Hello!
> 
> This may not be a strict python question, but ...
> 
> Suppose I have already a class MyFile that has an efficient method (or
> operator) to compare two MyFile s for equality.
> 
> What is the most efficient way to obtain all sets of equal files (of
> course each set must have more than one file - all single files are
> discarded)?
> 
> Thanks for any suggestions.

If you're after strict equality (every byte in a pair of files is identical), then here are a few heuristics that may help you:

1) Test for file length, without reading in the whole file.  You can use os.path.getsize() to do this (I hope that this is a constant-time operation, but I haven't tested it).  As Oscar Benjamin suggested, you can create a defaultdict(list) which will make it possible to gather lists of files of equal size.  This should help you gather your potentially identical files quickly.

2) Once you have your dictionary from above, you can iterate its values, each of which will be a list.  If a list has only one file in it, you know its unique, and you don't have to do any more work on it.  If there are two files in the list, then you have several different options:
	a) Use Chris Angelico's suggestion and hash each of the files (use the standard library's 'hashlib' for this).  Identical files will always have identical hashes, but there may be false positives, so you'll need to verify that files that have identical hashes are indeed identical.
	b) If your files tend to have sections that are very different (e.g., the first 32 bytes tend to be different), then you pretend that section of the file is its hash.  You can then do the same trick as above. (the advantage of this is that you will read in a lot less data than if you have to hash the entire file).
	c) You may be able to do something clever by reading portions of each file.  That is, use zip() combined with read(1024) to read each of the files in sections, while keeping hashes of the files.  Or, maybe you'll be able to read portions of them and sort the list as you're reading.  In either case, if any files are NOT identical, then you'll be able to stop work as soon as you figure this out, rather than having to read the entire file at once.

The main purpose of these suggestions is to reduce the amount of reading you're doing.  Storage tends to be slow, and any tricks that reduce the number of bytes you need to read in will be helpful to you.

Good luck!
Cem Karan

[toc] | [prev] | [next] | [standalone]

#102645

From	Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt>
Date	2016-02-08 02:22 +0000
Message-ID	<n98u5p$1ovh$1@gioia.aioe.org>
In reply to	#102636

Às 21:46 de 07-02-2016, Paulo da Silva escreveu:
> Hello!
> 
> This may not be a strict python question, but ...
> 
> Suppose I have already a class MyFile that has an efficient method (or
> operator) to compare two MyFile s for equality.
> 
> What is the most efficient way to obtain all sets of equal files (of
> course each set must have more than one file - all single files are
> discarded)?
> 

After reading all suggestions I decided to try first the
defaultdict(list), as first suggested by Oscar, in several steps. First
with sizes and then with other partial contents or/and "strong" hashes
as suggested by Cem.

Thank you very much to all who responded for all helpful suggestions.
If I find something better I'll report here.
Paulo

[toc] | [prev] | [next] | [standalone]

#102672

From	Random832 <random832@fastmail.com>
Date	2016-02-08 09:49 -0500
Message-ID	<mailman.97.1454942995.2317.python-list@python.org>
In reply to	#102636

On Sun, Feb 7, 2016, at 20:07, Cem Karan wrote:
> 	a) Use Chris Angelico's suggestion and hash each of the files (use the standard library's 'hashlib' for this).  Identical files will always have identical hashes, but there may be false positives, so you'll need to verify that files that have identical hashes are indeed identical.
> 	b) If your files tend to have sections that are very different (e.g., the first 32 bytes tend to be different), then you pretend that section of the file is its hash.  You can then do the same trick as above. (the advantage of this is that you will read in a lot less data than if you have to hash the entire file).
> 	c) You may be able to do something clever by reading portions of each file.  That is, use zip() combined with read(1024) to read each of the files in sections, while keeping hashes of the files.  Or, maybe you'll be able to read portions of them and sort the list as you're reading.  In either case, if any files are NOT identical, then you'll be able to stop work as soon as you figure this out, rather than having to read the entire file at once.
> 
> The main purpose of these suggestions is to reduce the amount of reading
> you're doing.

hashing a file using a conventional hashing algorithm requires reading
the whole file. Unless the files are very likely to be identical _until_
near the end, you're better off just reading the first N bytes of both
files, then the next N bytes, etc, until you find somewhere they're
different. The filecmp module may be useful for this.

[toc] | [prev] | [next] | [standalone]

#102676

From	Chris Angelico <rosuav@gmail.com>
Date	2016-02-09 02:11 +1100
Message-ID	<mailman.100.1454944293.2317.python-list@python.org>
In reply to	#102636

On Tue, Feb 9, 2016 at 1:49 AM, Random832 <random832@fastmail.com> wrote:
> On Sun, Feb 7, 2016, at 20:07, Cem Karan wrote:
>>       a) Use Chris Angelico's suggestion and hash each of the files (use the standard library's 'hashlib' for this).  Identical files will always have identical hashes, but there may be false positives, so you'll need to verify that files that have identical hashes are indeed identical.
>>       b) If your files tend to have sections that are very different (e.g., the first 32 bytes tend to be different), then you pretend that section of the file is its hash.  You can then do the same trick as above. (the advantage of this is that you will read in a lot less data than if you have to hash the entire file).
>>       c) You may be able to do something clever by reading portions of each file.  That is, use zip() combined with read(1024) to read each of the files in sections, while keeping hashes of the files.  Or, maybe you'll be able to read portions of them and sort the list as you're reading.  In either case, if any files are NOT identical, then you'll be able to stop work as soon as you figure this out, rather than having to read the entire file at once.
>>
>> The main purpose of these suggestions is to reduce the amount of reading
>> you're doing.
>
> hashing a file using a conventional hashing algorithm requires reading
> the whole file. Unless the files are very likely to be identical _until_
> near the end, you're better off just reading the first N bytes of both
> files, then the next N bytes, etc, until you find somewhere they're
> different. The filecmp module may be useful for this.

That's fine for comparing one file against one other. He started out
by saying he already had a way to compare files for equality. What he
wants is a way to capitalize on that to find all the identical files
in a group. A naive approach would simply compare every file against
every other, for O(N*N) comparisons - but a hash lookup can make that
O(N) on the files themselves, plus (I think) an O(N log N) hash
comparison job, which has much lower constant factors. The key here is
the hashing algorithm though.

ChrisA

[toc] | [prev] | [next] | [standalone]

#102701

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2016-02-09 15:13 +1100
Message-ID	<56b96755$0$2909$c3e8da3$76491128@news.astraweb.com>
In reply to	#102676

On Tuesday 09 February 2016 02:11, Chris Angelico wrote:

> That's fine for comparing one file against one other. He started out
> by saying he already had a way to compare files for equality. What he
> wants is a way to capitalize on that to find all the identical files
> in a group. A naive approach would simply compare every file against
> every other, for O(N*N) comparisons - but a hash lookup can make that
> O(N) on the files themselves, plus (I think) an O(N log N) hash
> comparison job, which has much lower constant factors.

You're mixing up N's there :-) In the first two (compare every file against 
every other file, and hash lookup), N stands for the number of files. But in 
the hash comparison, well, I'm not sure what you mean by that, unless you 
mean the calculation of the hash itself, which will be O(M) where M is the 
size of the file.

Unfortunately, even using a hash, you still have to do O(N**2) work, or 
rather, O(N*M) where N is the number of files and M is their size.

Assuming that duplicate files are relatively rare, the best way to do this 
is to collate files with the same size. Looking up the file size ought to be 
pretty fast. It is an IO operation, but it is approximately constant time in 
the sense that it doesn't actually depend on the size of the file.

(If you're using a programming language or operating system where the only 
way to find out how big a file is to open it and count the bytes, this 
obviously won't work for you.)

So let's collect files into a separate bucket for each unique file size:

bysize = {}
for file in files:
    bysize.setdefault(file.getsize(), []).append(file)

In practice, you may find that nearly all files have different sizes, and 
most of the bucks have only a single file in them. Those files you can just 
ignore, as they must be unique.

So instead of having to compare N files against each of the others, you are 
left with a much smaller number of buckets, each with (hopefully) only two 
or three files of the same size. You might have gone from N=1000000 to 
thirty buckets with five files each.

And most of those files will probably differ on the very first byte, or at 
least within the first 16K of bytes, so it's hardly worth hashing them 
(especially if the hash algorithm is an expensive cryptographic hash, or if 
the file is large).

I would delay the hash comparison as long as possible, something like this:

def __hash__(self):
    if self._hash is None:
        self._hash = calculate_hash()  # expensive, so we cache it
    return self._hash

def __eq__(self, other):
    size = self.getsize()
    if size != other.getsize():
        return False
    if self._hash and other._hash:
        # Both hashes are already done, so might as well use them.
        return (
            hash(self) == hash(other) 
            # in case of collisions...
            and self.read() == other.read()
            )
    if size < MAXSIZE:
        return self.read(size) == other.read(size)
    else:
        if self.read(MAXSIZE) != other.read(MAXSIZE):
            return False
        else:
             return (
                hash(self) == hash(other) 
                # in case of collisions...
                and self.read() == other.read()
                )

If your hash is strong enough that collisions are vanishingly rare, you 
could ignore the `and self.read() == other.read()` check.

-- 
Steve

[toc] | [prev] | [next] | [standalone]

#102704

From	Chris Angelico <rosuav@gmail.com>
Date	2016-02-09 15:27 +1100
Message-ID	<mailman.119.1454992069.2317.python-list@python.org>
In reply to	#102701

On Tue, Feb 9, 2016 at 3:13 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On Tuesday 09 February 2016 02:11, Chris Angelico wrote:
>
>> That's fine for comparing one file against one other. He started out
>> by saying he already had a way to compare files for equality. What he
>> wants is a way to capitalize on that to find all the identical files
>> in a group. A naive approach would simply compare every file against
>> every other, for O(N*N) comparisons - but a hash lookup can make that
>> O(N) on the files themselves, plus (I think) an O(N log N) hash
>> comparison job, which has much lower constant factors.
>
> You're mixing up N's there :-) In the first two (compare every file against
> every other file, and hash lookup), N stands for the number of files. But in
> the hash comparison, well, I'm not sure what you mean by that, unless you
> mean the calculation of the hash itself, which will be O(M) where M is the
> size of the file.
>
> Unfortunately, even using a hash, you still have to do O(N**2) work, or
> rather, O(N*M) where N is the number of files and M is their size.
>

My intention was that every N was "number of files being compared",
but it's possible I made a mistake elsewhere. Assuming the hashes
become integers too large for a bucket sort (which they certainly
will, unless you want inordinate numbers of hash collisions), the code
would look something like this:

hash_to_filename = defaultdict(list)
for fn in files:
    # Step 1: Hash every file.
    hash = calculate_hash(fn)

    # Step 2: Locate all pairs of files with identical hashes
    hash_to_filename[hash].append(fn)

This loop executes once for each file, so calculate_hash() is called
once for each file. The cost of that is O(N), or possibly O(N*M). If
you're fairly sure the files are going to differ in size, you could
use the file size *as* the hash - it's cheap to calculate (no M
component whatsoever - just a stat call, and even that could be
optimized away in some cases, eg if you're scanning a whole
directory), but isn't cryptographically secure and ignores file
content altogether.

The second step involves one dictionary lookup or insertion for each
file. That's an O(log N) operation, where N is the number of elements
in the dictionary. So yes, it's not quite the same as the number of
files (if every file exists exactly twice, this N will be half the
other N), but it's broadly going to correspond. And an O(log N)
operation performed N times has an overall cost of O(N log N).

I might have something wrong here, but definitely I meant to have the
Ns all mean the same thing, like X on a Magic: The Gathering card. :)

ChrisA

[toc] | [prev] | [next] | [standalone]

#102705

From	Gregory Ewing <greg.ewing@canterbury.ac.nz>
Date	2016-02-09 17:48 +1300
Message-ID	<dht9dsFoqflU1@mid.individual.net>
In reply to	#102704

Chris Angelico wrote:
> hash_to_filename = defaultdict(list)
> for fn in files:
>     # Step 1: Hash every file.
>     hash = calculate_hash(fn)
> 
>     # Step 2: Locate all pairs of files with identical hashes
>     hash_to_filename[hash].append(fn)

I think you can avoid hashing the files altogether.

First divide the files into groups of the same size. Then for
each group, open all the files at once, read them in parallel
and compare them with each other.

As soon as you find a difference, split the group into
smaller groups. When a group gets down to just one file, you
can stop reading that file.

Assuming that most of the differing files differ near the
beginning, this strategy means that you will hardly ever
have to read a whole file. Hashing the files beforehand,
in contrast, requires reading all of every file every
time.

-- 
Greg

[toc] | [prev] | [standalone]

csiph-web

A sets algorithm

Contents

#102636 — A sets algorithm

#102637

#102638

#102639

#102640

#102642

#102643

#102645

#102672

#102676

#102701

#102704

#102705