Re: A sets algorithm

Path	csiph.com!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail
From	Gregory Ewing <greg.ewing@canterbury.ac.nz>
Newsgroups	comp.lang.python
Subject	Re: A sets algorithm
Date	Tue, 09 Feb 2016 17:48:58 +1300
Lines	27
Message-ID	<dht9dsFoqflU1@mid.individual.net> (permalink)
References	<n98e0f$15lj$1@gioia.aioe.org> <CC00410F-D160-4C34-A933-C1810614A178@gmail.com> <1454942992.2532814.515120466.5DA4A683@webmail.messagingengine.com> <mailman.100.1454944293.2317.python-list@python.org> <56b96755$0$2909$c3e8da3$76491128@news.astraweb.com> <mailman.119.1454992069.2317.python-list@python.org>
Mime-Version	1.0
Content-Type	text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding	7bit
X-Trace	individual.net doyQ0W0zCA4378o5xeGf/gAbrShR5cxtEEAFf/EZI+SJnuP/lS
Cancel-Lock	sha1:o59fJ9vl6IbGtBgdtJE6wB6beGQ=
User-Agent	Mozilla Thunderbird 1.0.5 (Macintosh/20050711)
X-Accept-Language	en-us, en
In-Reply-To	<mailman.119.1454992069.2317.python-list@python.org>
Xref	csiph.com comp.lang.python:102705

Show key headers only | View raw

Chris Angelico wrote:
> hash_to_filename = defaultdict(list)
> for fn in files:
>     # Step 1: Hash every file.
>     hash = calculate_hash(fn)
> 
>     # Step 2: Locate all pairs of files with identical hashes
>     hash_to_filename[hash].append(fn)

I think you can avoid hashing the files altogether.

First divide the files into groups of the same size. Then for
each group, open all the files at once, read them in parallel
and compare them with each other.

As soon as you find a difference, split the group into
smaller groups. When a group gets down to just one file, you
can stop reading that file.

Assuming that most of the differing files differ near the
beginning, this strategy means that you will hardly ever
have to read a whole file. Hashing the files beforehand,
in contrast, requires reading all of every file every
time.

-- 
Greg

Back to comp.lang.python | Previous | Next — Previous in thread | Find similar | Unroll thread

Thread

A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-07 21:46 +0000
  Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-08 08:58 +1100
  Re: A sets algorithm Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2016-02-07 22:03 +0000
  Re: A sets algorithm Tim Chase <python.list@tim.thechases.com> - 2016-02-07 16:17 -0600
    Re: A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-08 00:05 +0000
      Re: A sets algorithm Tim Chase <python.list@tim.thechases.com> - 2016-02-07 18:20 -0600
  Re: A sets algorithm Cem Karan <cfkaran2@gmail.com> - 2016-02-07 20:07 -0500
  Re: A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-08 02:22 +0000
  Re: A sets algorithm Random832 <random832@fastmail.com> - 2016-02-08 09:49 -0500
  Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-09 02:11 +1100
    Re: A sets algorithm Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-02-09 15:13 +1100
      Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-09 15:27 +1100
        Re: A sets algorithm Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2016-02-09 17:48 +1300

csiph-web