Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #102705
| Path | csiph.com!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail |
|---|---|
| From | Gregory Ewing <greg.ewing@canterbury.ac.nz> |
| Newsgroups | comp.lang.python |
| Subject | Re: A sets algorithm |
| Date | Tue, 09 Feb 2016 17:48:58 +1300 |
| Lines | 27 |
| Message-ID | <dht9dsFoqflU1@mid.individual.net> (permalink) |
| References | <n98e0f$15lj$1@gioia.aioe.org> <CC00410F-D160-4C34-A933-C1810614A178@gmail.com> <1454942992.2532814.515120466.5DA4A683@webmail.messagingengine.com> <mailman.100.1454944293.2317.python-list@python.org> <56b96755$0$2909$c3e8da3$76491128@news.astraweb.com> <mailman.119.1454992069.2317.python-list@python.org> |
| Mime-Version | 1.0 |
| Content-Type | text/plain; charset=ISO-8859-1; format=flowed |
| Content-Transfer-Encoding | 7bit |
| X-Trace | individual.net doyQ0W0zCA4378o5xeGf/gAbrShR5cxtEEAFf/EZI+SJnuP/lS |
| Cancel-Lock | sha1:o59fJ9vl6IbGtBgdtJE6wB6beGQ= |
| User-Agent | Mozilla Thunderbird 1.0.5 (Macintosh/20050711) |
| X-Accept-Language | en-us, en |
| In-Reply-To | <mailman.119.1454992069.2317.python-list@python.org> |
| Xref | csiph.com comp.lang.python:102705 |
Show key headers only | View raw
Chris Angelico wrote: > hash_to_filename = defaultdict(list) > for fn in files: > # Step 1: Hash every file. > hash = calculate_hash(fn) > > # Step 2: Locate all pairs of files with identical hashes > hash_to_filename[hash].append(fn) I think you can avoid hashing the files altogether. First divide the files into groups of the same size. Then for each group, open all the files at once, read them in parallel and compare them with each other. As soon as you find a difference, split the group into smaller groups. When a group gets down to just one file, you can stop reading that file. Assuming that most of the differing files differ near the beginning, this strategy means that you will hardly ever have to read a whole file. Hashing the files beforehand, in contrast, requires reading all of every file every time. -- Greg
Back to comp.lang.python | Previous | Next — Previous in thread | Find similar | Unroll thread
A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-07 21:46 +0000
Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-08 08:58 +1100
Re: A sets algorithm Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2016-02-07 22:03 +0000
Re: A sets algorithm Tim Chase <python.list@tim.thechases.com> - 2016-02-07 16:17 -0600
Re: A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-08 00:05 +0000
Re: A sets algorithm Tim Chase <python.list@tim.thechases.com> - 2016-02-07 18:20 -0600
Re: A sets algorithm Cem Karan <cfkaran2@gmail.com> - 2016-02-07 20:07 -0500
Re: A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-08 02:22 +0000
Re: A sets algorithm Random832 <random832@fastmail.com> - 2016-02-08 09:49 -0500
Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-09 02:11 +1100
Re: A sets algorithm Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-02-09 15:13 +1100
Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-09 15:27 +1100
Re: A sets algorithm Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2016-02-09 17:48 +1300
csiph-web