Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.python > #102705

Re: A sets algorithm

Path csiph.com!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail
From Gregory Ewing <greg.ewing@canterbury.ac.nz>
Newsgroups comp.lang.python
Subject Re: A sets algorithm
Date Tue, 09 Feb 2016 17:48:58 +1300
Lines 27
Message-ID <dht9dsFoqflU1@mid.individual.net> (permalink)
References <n98e0f$15lj$1@gioia.aioe.org> <CC00410F-D160-4C34-A933-C1810614A178@gmail.com> <1454942992.2532814.515120466.5DA4A683@webmail.messagingengine.com> <mailman.100.1454944293.2317.python-list@python.org> <56b96755$0$2909$c3e8da3$76491128@news.astraweb.com> <mailman.119.1454992069.2317.python-list@python.org>
Mime-Version 1.0
Content-Type text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding 7bit
X-Trace individual.net doyQ0W0zCA4378o5xeGf/gAbrShR5cxtEEAFf/EZI+SJnuP/lS
Cancel-Lock sha1:o59fJ9vl6IbGtBgdtJE6wB6beGQ=
User-Agent Mozilla Thunderbird 1.0.5 (Macintosh/20050711)
X-Accept-Language en-us, en
In-Reply-To <mailman.119.1454992069.2317.python-list@python.org>
Xref csiph.com comp.lang.python:102705

Show key headers only | View raw


Chris Angelico wrote:
> hash_to_filename = defaultdict(list)
> for fn in files:
>     # Step 1: Hash every file.
>     hash = calculate_hash(fn)
> 
>     # Step 2: Locate all pairs of files with identical hashes
>     hash_to_filename[hash].append(fn)

I think you can avoid hashing the files altogether.

First divide the files into groups of the same size. Then for
each group, open all the files at once, read them in parallel
and compare them with each other.

As soon as you find a difference, split the group into
smaller groups. When a group gets down to just one file, you
can stop reading that file.

Assuming that most of the differing files differ near the
beginning, this strategy means that you will hardly ever
have to read a whole file. Hashing the files beforehand,
in contrast, requires reading all of every file every
time.

-- 
Greg

Back to comp.lang.python | Previous | NextPrevious in thread | Find similar | Unroll thread


Thread

A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-07 21:46 +0000
  Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-08 08:58 +1100
  Re: A sets algorithm Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2016-02-07 22:03 +0000
  Re: A sets algorithm Tim Chase <python.list@tim.thechases.com> - 2016-02-07 16:17 -0600
    Re: A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-08 00:05 +0000
      Re: A sets algorithm Tim Chase <python.list@tim.thechases.com> - 2016-02-07 18:20 -0600
  Re: A sets algorithm Cem Karan <cfkaran2@gmail.com> - 2016-02-07 20:07 -0500
  Re: A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-08 02:22 +0000
  Re: A sets algorithm Random832 <random832@fastmail.com> - 2016-02-08 09:49 -0500
  Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-09 02:11 +1100
    Re: A sets algorithm Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-02-09 15:13 +1100
      Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-09 15:27 +1100
        Re: A sets algorithm Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2016-02-09 17:48 +1300

csiph-web