


Re: A sets algorithm

From Random832 <random832@fastmail.com>
Newsgroups comp.lang.python
Subject Re: A sets algorithm
Date Mon, 08 Feb 2016 09:49:52 -0500
Message-ID <mailman.97.1454942995.2317.python-list@python.org> (permalink)
References <n98e0f$15lj$1@gioia.aioe.org> <CC00410F-D160-4C34-A933-C1810614A178@gmail.com>
In-Reply-To <CC00410F-D160-4C34-A933-C1810614A178@gmail.com>



On Sun, Feb 7, 2016, at 20:07, Cem Karan wrote:
> 	a) Use Chris Angelico's suggestion and hash each of the files (use the standard library's 'hashlib' for this).  Identical files will always have identical hashes, but there may be false positives, so you'll need to verify that files that have identical hashes are indeed identical.
> 	b) If your files tend to have sections that are very different (e.g., the first 32 bytes tend to be different), then you pretend that section of the file is its hash.  You can then do the same trick as above. (the advantage of this is that you will read in a lot less data than if you have to hash the entire file).
> 	c) You may be able to do something clever by reading portions of each file.  That is, use zip() combined with read(1024) to read each of the files in sections, while keeping hashes of the files.  Or, maybe you'll be able to read portions of them and sort the list as you're reading.  In either case, if any files are NOT identical, then you'll be able to stop work as soon as you figure this out, rather than having to read the entire file at once.
> 
> The main purpose of these suggestions is to reduce the amount of reading
> you're doing.

Hashing a file using a conventional hashing algorithm requires reading
the whole file. Unless the files are very likely to be identical _until_
near the end, you're better off just reading the first N bytes of both
files, then the next N bytes, etc., until you find somewhere they're
different. The filecmp module may be useful for this.
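
A rough sketch of the chunk-by-chunk comparison described above, for
illustration only. The name files_equal and the 64 KiB value for N are
hypothetical, and the up-front size check is an added shortcut that the
post does not mention.

```python
import filecmp
import os

CHUNK_SIZE = 64 * 1024  # hypothetical choice for "N bytes"


def files_equal(path_a, path_b, chunk_size=CHUNK_SIZE):
    """Return True if the two files have identical contents."""
    # Added shortcut (an assumption, not from the post): files of
    # different sizes cannot be identical, so nothing needs reading.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False  # stop at the first differing chunk
            if not chunk_a:
                return True  # both files ended without a difference


def files_equal_stdlib(path_a, path_b):
    """Same question answered with the standard library's filecmp."""
    # shallow=False makes filecmp.cmp compare contents rather than
    # trusting os.stat() signatures alone.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

Either function stops reading as soon as a difference turns up, so two
files that differ early cost only one chunk of I/O each.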



Thread

A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-07 21:46 +0000
  Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-08 08:58 +1100
  Re: A sets algorithm Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2016-02-07 22:03 +0000
  Re: A sets algorithm Tim Chase <python.list@tim.thechases.com> - 2016-02-07 16:17 -0600
    Re: A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-08 00:05 +0000
      Re: A sets algorithm Tim Chase <python.list@tim.thechases.com> - 2016-02-07 18:20 -0600
  Re: A sets algorithm Cem Karan <cfkaran2@gmail.com> - 2016-02-07 20:07 -0500
  Re: A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-08 02:22 +0000
  Re: A sets algorithm Random832 <random832@fastmail.com> - 2016-02-08 09:49 -0500
  Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-09 02:11 +1100
    Re: A sets algorithm Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-02-09 15:13 +1100
      Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-09 15:27 +1100
        Re: A sets algorithm Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2016-02-09 17:48 +1300
