Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #102636 > unrolled thread
| Started by | Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> |
|---|---|
| First post | 2016-02-07 21:46 +0000 |
| Last post | 2016-02-09 17:48 +1300 |
| Articles | 13 — 8 participants |
Back to article view | Back to comp.lang.python
A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-07 21:46 +0000
Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-08 08:58 +1100
Re: A sets algorithm Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2016-02-07 22:03 +0000
Re: A sets algorithm Tim Chase <python.list@tim.thechases.com> - 2016-02-07 16:17 -0600
Re: A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-08 00:05 +0000
Re: A sets algorithm Tim Chase <python.list@tim.thechases.com> - 2016-02-07 18:20 -0600
Re: A sets algorithm Cem Karan <cfkaran2@gmail.com> - 2016-02-07 20:07 -0500
Re: A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-08 02:22 +0000
Re: A sets algorithm Random832 <random832@fastmail.com> - 2016-02-08 09:49 -0500
Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-09 02:11 +1100
Re: A sets algorithm Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-02-09 15:13 +1100
Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-09 15:27 +1100
Re: A sets algorithm Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2016-02-09 17:48 +1300
| From | Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> |
|---|---|
| Date | 2016-02-07 21:46 +0000 |
| Subject | A sets algorithm |
| Message-ID | <n98e0f$15lj$1@gioia.aioe.org> |
Hello! This may not be a strict python question, but ... Suppose I have already a class MyFile that has an efficient method (or operator) to compare two MyFile s for equality. What is the most efficient way to obtain all sets of equal files (of course each set must have more than one file - all single files are discarded)? Thanks for any suggestions. Paulo
[toc] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2016-02-08 08:58 +1100 |
| Message-ID | <mailman.79.1454882316.2317.python-list@python.org> |
| In reply to | #102636 |
On Mon, Feb 8, 2016 at 8:46 AM, Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> wrote: > Hello! > > This may not be a strict python question, but ... > > Suppose I have already a class MyFile that has an efficient method (or > operator) to compare two MyFile s for equality. > > What is the most efficient way to obtain all sets of equal files (of > course each set must have more than one file - all single files are > discarded)? > > Thanks for any suggestions. Hash them in some way. This has two costs: 1) You need to figure out some hashing algorithm such that any two equal files have the same hash, and ideally, that unequal files will generally have unequal hashes. 2) Hash each file once, and then do your comparisons on the hashes, finally testing for actual equality only on those with the same hash. (The last step takes care of hash collisions - where different files happen to have the same hash - so ideally, you shouldn't have to do this often.) If your definition of "equal" among MyFiles is simply based on the file content, it's easy - just hash the content. But if there are ways for files to be considered equal without being bit-for-bit identical, you'll have to reflect that in the hash. For instance, if you consider files equal if they differ only in whitespace, then you'd need to convert all whitespace to a single space before hashing; or if you don't care about the order of lines in the file, you could either hash the lines separately and sum/xor the hashes, or sort the lines before hashing. But the rest of the job is pretty straight-forward. ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Oscar Benjamin <oscar.j.benjamin@gmail.com> |
|---|---|
| Date | 2016-02-07 22:03 +0000 |
| Message-ID | <mailman.80.1454882599.2317.python-list@python.org> |
| In reply to | #102636 |
On 7 Feb 2016 21:51, "Paulo da Silva" <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> wrote: > > Hello! > > This may not be a strict python question, but ... > > Suppose I have already a class MyFile that has an efficient method (or > operator) to compare two MyFile s for equality. > > What is the most efficient way to obtain all sets of equal files (of > course each set must have more than one file - all single files are > discarded)? If you can make the MyFiles hashable then this is easily done with defaultdict(list). -- Oscar
[toc] | [prev] | [next] | [standalone]
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2016-02-07 16:17 -0600 |
| Message-ID | <mailman.81.1454885914.2317.python-list@python.org> |
| In reply to | #102636 |
On 2016-02-07 21:46, Paulo da Silva wrote:
> Suppose I have already a class MyFile that has an efficient method
> (or operator) to compare two MyFile s for equality.
>
> What is the most efficient way to obtain all sets of equal files (of
> course each set must have more than one file - all single files are
> discarded)?
If your MyFile grows a __hash__ method that is based off the
uniqueness-qualities that you're comparing, you can use
collections.Counter and filter the results:
from collections import Counter
interesting = [
my_file
for my_file, count
in Counter(generate_MyFile_objects()).iteritems()
if count > 1
]
Which should be roughly O(n). It also has the advantage of iterating
over your generator once.
If you the MyFile objects can be unique but compare for equality
(e.g. two files on the file-system that have the same SHA1 hash, but
you want to know the file-names), you'd have to do a paired search
which would have worse performance and would need to iterate over the
data multiple times:
all_files = list(generate_MyFile_objects())
interesting = [
(my_file1, my_file2)
for i, my_file1
in enumerate(all_files, 1)
for my_file2
in all_files[i:]
if my_file1 == my_file2
]
-tkc
[toc] | [prev] | [next] | [standalone]
| From | Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> |
|---|---|
| Date | 2016-02-08 00:05 +0000 |
| Message-ID | <n98m3o$1h8k$1@gioia.aioe.org> |
| In reply to | #102639 |
Às 22:17 de 07-02-2016, Tim Chase escreveu: > On 2016-02-07 21:46, Paulo da Silva wrote: ... > > If you the MyFile objects can be unique but compare for equality > (e.g. two files on the file-system that have the same SHA1 hash, but > you want to know the file-names), you'd have to do a paired search > which would have worse performance and would need to iterate over the > data multiple times: > > all_files = list(generate_MyFile_objects()) > interesting = [ > (my_file1, my_file2) > for i, my_file1 > in enumerate(all_files, 1) > for my_file2 > in all_files[i:] > if my_file1 == my_file2 > ] > "my_file1 == my_file2" can be implemented into MyFile class taking advantage of caching sizes (if different files are different), hashes or even content (for small files) or file headers (first n bytes). However this seems to have a problem: all_files: a b c d e ... If a==b then comparing b with c,d,e is useless. May be using several steps with dict - sizes, then hashes for same sizes files, etc ... Another solution I thought of, could be defining some methods (I still don't know which ones) in MyFile so that I could use sets intersection. Would this one be a faster solution? Thanks
[toc] | [prev] | [next] | [standalone]
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2016-02-07 18:20 -0600 |
| Message-ID | <mailman.82.1454892109.2317.python-list@python.org> |
| In reply to | #102640 |
On 2016-02-08 00:05, Paulo da Silva wrote: > Às 22:17 de 07-02-2016, Tim Chase escreveu: >> all_files = list(generate_MyFile_objects()) >> interesting = [ >> (my_file1, my_file2) >> for i, my_file1 >> in enumerate(all_files, 1) >> for my_file2 >> in all_files[i:] >> if my_file1 == my_file2 >> ] > > "my_file1 == my_file2" can be implemented into MyFile class taking > advantage of caching sizes (if different files are different), > hashes or even content (for small files) or file headers (first n > bytes). However this seems to have a problem: > all_files: a b c d e ... > If a==b then comparing b with c,d,e is useless. Depends on what the OP wants to have happen if more than one input file is equal. I.e., a == b == c. Does one just want "a has duplicates" (and optionally "and here's one of them"), or does one want "a == b", "a == c" and "b == c" in the output? > Another solution I thought of, could be defining some methods (I > still don't know which ones) in MyFile so that I could use sets > intersection. Would this one be a faster solution? Adding __hash__ would allow for the set operations, but would require (as ChrisA points out) knowing how to create a hash function that encompasses the information you want to compare. -tkc
[toc] | [prev] | [next] | [standalone]
| From | Cem Karan <cfkaran2@gmail.com> |
|---|---|
| Date | 2016-02-07 20:07 -0500 |
| Message-ID | <mailman.83.1454893663.2317.python-list@python.org> |
| In reply to | #102636 |
On Feb 7, 2016, at 4:46 PM, Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> wrote: > Hello! > > This may not be a strict python question, but ... > > Suppose I have already a class MyFile that has an efficient method (or > operator) to compare two MyFile s for equality. > > What is the most efficient way to obtain all sets of equal files (of > course each set must have more than one file - all single files are > discarded)? > > Thanks for any suggestions. If you're after strict equality (every byte in a pair of files is identical), then here are a few heuristics that may help you: 1) Test for file length, without reading in the whole file. You can use os.path.getsize() to do this (I hope that this is a constant-time operation, but I haven't tested it). As Oscar Benjamin suggested, you can create a defaultdict(list) which will make it possible to gather lists of files of equal size. This should help you gather your potentially identical files quickly. 2) Once you have your dictionary from above, you can iterate its values, each of which will be a list. If a list has only one file in it, you know its unique, and you don't have to do any more work on it. If there are two files in the list, then you have several different options: a) Use Chris Angelico's suggestion and hash each of the files (use the standard library's 'hashlib' for this). Identical files will always have identical hashes, but there may be false positives, so you'll need to verify that files that have identical hashes are indeed identical. b) If your files tend to have sections that are very different (e.g., the first 32 bytes tend to be different), then you pretend that section of the file is its hash. You can then do the same trick as above. (the advantage of this is that you will read in a lot less data than if you have to hash the entire file). c) You may be able to do something clever by reading portions of each file. That is, use zip() combined with read(1024) to read each of the files in sections, while keeping hashes of the files. Or, maybe you'll be able to read portions of them and sort the list as you're reading. In either case, if any files are NOT identical, then you'll be able to stop work as soon as you figure this out, rather than having to read the entire file at once. The main purpose of these suggestions is to reduce the amount of reading you're doing. Storage tends to be slow, and any tricks that reduce the number of bytes you need to read in will be helpful to you. Good luck! Cem Karan
[toc] | [prev] | [next] | [standalone]
| From | Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> |
|---|---|
| Date | 2016-02-08 02:22 +0000 |
| Message-ID | <n98u5p$1ovh$1@gioia.aioe.org> |
| In reply to | #102636 |
Às 21:46 de 07-02-2016, Paulo da Silva escreveu: > Hello! > > This may not be a strict python question, but ... > > Suppose I have already a class MyFile that has an efficient method (or > operator) to compare two MyFile s for equality. > > What is the most efficient way to obtain all sets of equal files (of > course each set must have more than one file - all single files are > discarded)? > After reading all suggestions I decided to try first the defaultdict(list), as first suggested by Oscar, in several steps. First with sizes and then with other partial contents or/and "strong" hashes as suggested by Cem. Thank you very much to all who responded for all helpful suggestions. If I find something better I'll report here. Paulo
[toc] | [prev] | [next] | [standalone]
| From | Random832 <random832@fastmail.com> |
|---|---|
| Date | 2016-02-08 09:49 -0500 |
| Message-ID | <mailman.97.1454942995.2317.python-list@python.org> |
| In reply to | #102636 |
On Sun, Feb 7, 2016, at 20:07, Cem Karan wrote: > a) Use Chris Angelico's suggestion and hash each of the files (use the standard library's 'hashlib' for this). Identical files will always have identical hashes, but there may be false positives, so you'll need to verify that files that have identical hashes are indeed identical. > b) If your files tend to have sections that are very different (e.g., the first 32 bytes tend to be different), then you pretend that section of the file is its hash. You can then do the same trick as above. (the advantage of this is that you will read in a lot less data than if you have to hash the entire file). > c) You may be able to do something clever by reading portions of each file. That is, use zip() combined with read(1024) to read each of the files in sections, while keeping hashes of the files. Or, maybe you'll be able to read portions of them and sort the list as you're reading. In either case, if any files are NOT identical, then you'll be able to stop work as soon as you figure this out, rather than having to read the entire file at once. > > The main purpose of these suggestions is to reduce the amount of reading > you're doing. hashing a file using a conventional hashing algorithm requires reading the whole file. Unless the files are very likely to be identical _until_ near the end, you're better off just reading the first N bytes of both files, then the next N bytes, etc, until you find somewhere they're different. The filecmp module may be useful for this.
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2016-02-09 02:11 +1100 |
| Message-ID | <mailman.100.1454944293.2317.python-list@python.org> |
| In reply to | #102636 |
On Tue, Feb 9, 2016 at 1:49 AM, Random832 <random832@fastmail.com> wrote: > On Sun, Feb 7, 2016, at 20:07, Cem Karan wrote: >> a) Use Chris Angelico's suggestion and hash each of the files (use the standard library's 'hashlib' for this). Identical files will always have identical hashes, but there may be false positives, so you'll need to verify that files that have identical hashes are indeed identical. >> b) If your files tend to have sections that are very different (e.g., the first 32 bytes tend to be different), then you pretend that section of the file is its hash. You can then do the same trick as above. (the advantage of this is that you will read in a lot less data than if you have to hash the entire file). >> c) You may be able to do something clever by reading portions of each file. That is, use zip() combined with read(1024) to read each of the files in sections, while keeping hashes of the files. Or, maybe you'll be able to read portions of them and sort the list as you're reading. In either case, if any files are NOT identical, then you'll be able to stop work as soon as you figure this out, rather than having to read the entire file at once. >> >> The main purpose of these suggestions is to reduce the amount of reading >> you're doing. > > hashing a file using a conventional hashing algorithm requires reading > the whole file. Unless the files are very likely to be identical _until_ > near the end, you're better off just reading the first N bytes of both > files, then the next N bytes, etc, until you find somewhere they're > different. The filecmp module may be useful for this. That's fine for comparing one file against one other. He started out by saying he already had a way to compare files for equality. What he wants is a way to capitalize on that to find all the identical files in a group. A naive approach would simply compare every file against every other, for O(N*N) comparisons - but a hash lookup can make that O(N) on the files themselves, plus (I think) an O(N log N) hash comparison job, which has much lower constant factors. The key here is the hashing algorithm though. ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2016-02-09 15:13 +1100 |
| Message-ID | <56b96755$0$2909$c3e8da3$76491128@news.astraweb.com> |
| In reply to | #102676 |
On Tuesday 09 February 2016 02:11, Chris Angelico wrote:
> That's fine for comparing one file against one other. He started out
> by saying he already had a way to compare files for equality. What he
> wants is a way to capitalize on that to find all the identical files
> in a group. A naive approach would simply compare every file against
> every other, for O(N*N) comparisons - but a hash lookup can make that
> O(N) on the files themselves, plus (I think) an O(N log N) hash
> comparison job, which has much lower constant factors.
You're mixing up N's there :-) In the first two (compare every file against
every other file, and hash lookup), N stands for the number of files. But in
the hash comparison, well, I'm not sure what you mean by that, unless you
mean the calculation of the hash itself, which will be O(M) where M is the
size of the file.
Unfortunately, even using a hash, you still have to do O(N**2) work, or
rather, O(N*M) where N is the number of files and M is their size.
Assuming that duplicate files are relatively rare, the best way to do this
is to collate files with the same size. Looking up the file size ought to be
pretty fast. It is an IO operation, but it is approximately constant time in
the sense that it doesn't actually depend on the size of the file.
(If you're using a programming language or operating system where the only
way to find out how big a file is to open it and count the bytes, this
obviously won't work for you.)
So let's collect files into a separate bucket for each unique file size:
bysize = {}
for file in files:
bysize.setdefault(file.getsize(), []).append(file)
In practice, you may find that nearly all files have different sizes, and
most of the bucks have only a single file in them. Those files you can just
ignore, as they must be unique.
So instead of having to compare N files against each of the others, you are
left with a much smaller number of buckets, each with (hopefully) only two
or three files of the same size. You might have gone from N=1000000 to
thirty buckets with five files each.
And most of those files will probably differ on the very first byte, or at
least within the first 16K of bytes, so it's hardly worth hashing them
(especially if the hash algorithm is an expensive cryptographic hash, or if
the file is large).
I would delay the hash comparison as long as possible, something like this:
def __hash__(self):
if self._hash is None:
self._hash = calculate_hash() # expensive, so we cache it
return self._hash
def __eq__(self, other):
size = self.getsize()
if size != other.getsize():
return False
if self._hash and other._hash:
# Both hashes are already done, so might as well use them.
return (
hash(self) == hash(other)
# in case of collisions...
and self.read() == other.read()
)
if size < MAXSIZE:
return self.read(size) == other.read(size)
else:
if self.read(MAXSIZE) != other.read(MAXSIZE):
return False
else:
return (
hash(self) == hash(other)
# in case of collisions...
and self.read() == other.read()
)
If your hash is strong enough that collisions are vanishingly rare, you
could ignore the `and self.read() == other.read()` check.
--
Steve
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2016-02-09 15:27 +1100 |
| Message-ID | <mailman.119.1454992069.2317.python-list@python.org> |
| In reply to | #102701 |
On Tue, Feb 9, 2016 at 3:13 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On Tuesday 09 February 2016 02:11, Chris Angelico wrote:
>
>> That's fine for comparing one file against one other. He started out
>> by saying he already had a way to compare files for equality. What he
>> wants is a way to capitalize on that to find all the identical files
>> in a group. A naive approach would simply compare every file against
>> every other, for O(N*N) comparisons - but a hash lookup can make that
>> O(N) on the files themselves, plus (I think) an O(N log N) hash
>> comparison job, which has much lower constant factors.
>
> You're mixing up N's there :-) In the first two (compare every file against
> every other file, and hash lookup), N stands for the number of files. But in
> the hash comparison, well, I'm not sure what you mean by that, unless you
> mean the calculation of the hash itself, which will be O(M) where M is the
> size of the file.
>
> Unfortunately, even using a hash, you still have to do O(N**2) work, or
> rather, O(N*M) where N is the number of files and M is their size.
>
My intention was that every N was "number of files being compared",
but it's possible I made a mistake elsewhere. Assuming the hashes
become integers too large for a bucket sort (which they certainly
will, unless you want inordinate numbers of hash collisions), the code
would look something like this:
hash_to_filename = defaultdict(list)
for fn in files:
# Step 1: Hash every file.
hash = calculate_hash(fn)
# Step 2: Locate all pairs of files with identical hashes
hash_to_filename[hash].append(fn)
This loop executes once for each file, so calculate_hash() is called
once for each file. The cost of that is O(N), or possibly O(N*M). If
you're fairly sure the files are going to differ in size, you could
use the file size *as* the hash - it's cheap to calculate (no M
component whatsoever - just a stat call, and even that could be
optimized away in some cases, eg if you're scanning a whole
directory), but isn't cryptographically secure and ignores file
content altogether.
The second step involves one dictionary lookup or insertion for each
file. That's an O(log N) operation, where N is the number of elements
in the dictionary. So yes, it's not quite the same as the number of
files (if every file exists exactly twice, this N will be half the
other N), but it's broadly going to correspond. And an O(log N)
operation performed N times has an overall cost of O(N log N).
I might have something wrong here, but definitely I meant to have the
Ns all mean the same thing, like X on a Magic: The Gathering card. :)
ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Gregory Ewing <greg.ewing@canterbury.ac.nz> |
|---|---|
| Date | 2016-02-09 17:48 +1300 |
| Message-ID | <dht9dsFoqflU1@mid.individual.net> |
| In reply to | #102704 |
Chris Angelico wrote: > hash_to_filename = defaultdict(list) > for fn in files: > # Step 1: Hash every file. > hash = calculate_hash(fn) > > # Step 2: Locate all pairs of files with identical hashes > hash_to_filename[hash].append(fn) I think you can avoid hashing the files altogether. First divide the files into groups of the same size. Then for each group, open all the files at once, read them in parallel and compare them with each other. As soon as you find a difference, split the group into smaller groups. When a group gets down to just one file, you can stop reading that file. Assuming that most of the differing files differ near the beginning, this strategy means that you will hardly ever have to read a whole file. Hashing the files beforehand, in contrast, requires reading all of every file every time. -- Greg
[toc] | [prev] | [standalone]
Back to top | Article view | comp.lang.python
csiph-web