Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #102676
| Path | csiph.com!fu-berlin.de!uni-berlin.de!not-for-mail |
|---|---|
| From | Chris Angelico <rosuav@gmail.com> |
| Newsgroups | comp.lang.python |
| Subject | Re: A sets algorithm |
| Date | Tue, 9 Feb 2016 02:11:30 +1100 |
| Lines | 38 |
| Message-ID | <mailman.100.1454944293.2317.python-list@python.org> (permalink) |
| References | <n98e0f$15lj$1@gioia.aioe.org> <CC00410F-D160-4C34-A933-C1810614A178@gmail.com> <1454942992.2532814.515120466.5DA4A683@webmail.messagingengine.com> |
| Mime-Version | 1.0 |
| Content-Type | text/plain; charset=UTF-8 |
| Content-Transfer-Encoding | quoted-printable |
| X-Trace | news.uni-berlin.de lX8TD6yGCpJTQJopHHWozQBoJaCjL1P4FgyZBgL0Y6Mg== |
| Return-Path | <rosuav@gmail.com> |
| X-Original-To | python-list@python.org |
| Delivered-To | python-list@mail.python.org |
| X-Spam-Status | OK 0.001 |
| X-Spam-Evidence | '*H*': 1.00; '*S*': 0.00; 'received:209.85.223': 0.03; 'pretend': 0.07; 'cc:addr:python-list': 0.09; '(use': 0.09; 'bytes,': 0.09; 'lookup': 0.09; 'portions': 0.09; 'themselves,': 0.09; 'files.': 0.13; 'sections': 0.13; '2016': 0.16; 'equality.': 0.16; 'file).': 0.16; 'from:addr:rosuav': 0.16; 'from:name:chris angelico': 0.16; 'hashes': 0.16; 'naive': 0.16; 'received:io': 0.16; 'received:psf.io': 0.16; 'this).': 0.16; 'wrote:': 0.16; 'bytes': 0.18; 'comparing': 0.18; 'cc:2**0': 0.20; 'cc:addr:python.org': 0.20; 'algorithm': 0.20; '(the': 0.22; 'saying': 0.22; 'constant': 0.22; 'file.': 0.22; 'am,': 0.23; 'feb': 0.23; 'somewhere': 0.24; 'header:In-Reply-To:1': 0.24; 'sort': 0.25; 'module': 0.25; 'chris': 0.26; 'figure': 0.27; 'compare': 0.27; 'message-id:@mail.gmail.com': 0.27; '(e.g.,': 0.27; 'tend': 0.27; 'this.': 0.28; 'fine': 0.28; 'clever': 0.29; 'comparison': 0.29; 'end,': 0.29; 'hash': 0.29; 'once.': 0.29; 'other,': 0.29; 'maybe': 0.33; 'useful': 0.33; 'though.': 0.33; 'case,': 0.34; 'tue,': 0.34; 'file': 0.34; 'list': 0.34; 'received:google.com': 0.35; 'next': 0.35; 'false': 0.35; 'files,': 0.35; 'something': 0.35; 'but': 0.36; 'there': 0.36; 'received:209.85': 0.36; 'subject:: ': 0.37; 'suggestion': 0.37; 'received:209': 0.38; 'log': 0.38; 'files': 0.38; 'or,': 0.38; 'data': 0.39; 'rather': 0.39; 'your': 0.60; "you'll": 0.61; 'entire': 0.61; 'different': 0.63; 'other.': 0.64; 'soon': 0.65; "they're": 0.66; 'here': 0.66; 'chrisa': 0.84; 'etc,': 0.84; 'to:none': 0.91; 'different.': 0.91 |
| DKIM-Signature | v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:cc :content-type:content-transfer-encoding; bh=gKzQIIcpEj76KE9Mz2CUKt42IWlXwkWuj+dxr6OIJmc=; b=swdU5Qvb1HMCXt1qIOu+0MBKEQIIte/TGDMEBm1gUIq3+XR91njLzbfaE9z8wBAjZV 8vacN4W6JJfsUrqXDM6ne8MStxqWekDv+3wDy0Y/idctNsK+5wfAg8X3D8zAW6dqLOEN ZxYiVutZL0RPLgHG7KeqxqUso5NmDZ4EoWFcuSXpFjuJV1xcgkfs2pAquts32agRmjl0 JUTb5xYFJxIh4cMBMqi4XyEFtHRcLyx33/kN27F27epdUo0raJ9/e9EIWZuOyh9Z6LBZ ZSYA+DXf5Jo5dCwii68F0I14arxvh1LUzSEEcM4aFLauQPhcoHEft2aM4wDUkSaGr8uv OhLA== |
| X-Google-DKIM-Signature | v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:cc:content-type:content-transfer-encoding; bh=gKzQIIcpEj76KE9Mz2CUKt42IWlXwkWuj+dxr6OIJmc=; b=G+xIQPnFYZlix1Oskr0EZ0R5zb5xs7Ir5djxX08Xn9t/XNVFqLooAc9Z0O0P9lfy1T SBG58e4UF7vg6p6KpjISZDKOAbE3P8WryxWAh/7hzskRCT3RA3f8HPEC8x7rvvEr4ZDd BoaVG1Gv35mYym07aQuDt8Bc7FlkJ4y7aeUrHNCDueodRWAaMBPgf/ZZlyQCGhNnFwtX faaaGmJIfy/nUJTBvYOhfEnhNqoljOf+OirfA3TXHe/87JWmEkbIf78GFPukQVnrhNwq MNQCWHK/UdW4jbNk+WkPKKSOVdTjZWrLmBAc8NwjYJ2i0WViqULVRyL8geQK+Qik/cR1 7CKg== |
| X-Gm-Message-State | AG10YOSyN6mzKwkvY0RmhOAG5ZBBXH0ws5G1bhLwYMOxSkGshQIEc+4ExAs1Q7PGxoZLnEqmJ4Bv3NaqKHrcjg== |
| X-Received | by 10.107.14.73 with SMTP id 70mr27722322ioo.31.1454944290543; Mon, 08 Feb 2016 07:11:30 -0800 (PST) |
| In-Reply-To | <1454942992.2532814.515120466.5DA4A683@webmail.messagingengine.com> |
| X-BeenThere | python-list@python.org |
| X-Mailman-Version | 2.1.21rc2 |
| Precedence | list |
| List-Id | General discussion list for the Python programming language <python-list.python.org> |
| List-Unsubscribe | <https://mail.python.org/mailman/options/python-list>, <mailto:python-list-request@python.org?subject=unsubscribe> |
| List-Archive | <http://mail.python.org/pipermail/python-list/> |
| List-Post | <mailto:python-list@python.org> |
| List-Help | <mailto:python-list-request@python.org?subject=help> |
| List-Subscribe | <https://mail.python.org/mailman/listinfo/python-list>, <mailto:python-list-request@python.org?subject=subscribe> |
| Xref | csiph.com comp.lang.python:102676 |
Show key headers only | View raw
On Tue, Feb 9, 2016 at 1:49 AM, Random832 <random832@fastmail.com> wrote: > On Sun, Feb 7, 2016, at 20:07, Cem Karan wrote: >> a) Use Chris Angelico's suggestion and hash each of the files (use the standard library's 'hashlib' for this). Identical files will always have identical hashes, but there may be false positives, so you'll need to verify that files that have identical hashes are indeed identical. >> b) If your files tend to have sections that are very different (e.g., the first 32 bytes tend to be different), then you pretend that section of the file is its hash. You can then do the same trick as above. (the advantage of this is that you will read in a lot less data than if you have to hash the entire file). >> c) You may be able to do something clever by reading portions of each file. That is, use zip() combined with read(1024) to read each of the files in sections, while keeping hashes of the files. Or, maybe you'll be able to read portions of them and sort the list as you're reading. In either case, if any files are NOT identical, then you'll be able to stop work as soon as you figure this out, rather than having to read the entire file at once. >> >> The main purpose of these suggestions is to reduce the amount of reading >> you're doing. > > hashing a file using a conventional hashing algorithm requires reading > the whole file. Unless the files are very likely to be identical _until_ > near the end, you're better off just reading the first N bytes of both > files, then the next N bytes, etc, until you find somewhere they're > different. The filecmp module may be useful for this. That's fine for comparing one file against one other. He started out by saying he already had a way to compare files for equality. What he wants is a way to capitalize on that to find all the identical files in a group. A naive approach would simply compare every file against every other, for O(N*N) comparisons - but a hash lookup can make that O(N) on the files themselves, plus (I think) an O(N log N) hash comparison job, which has much lower constant factors. The key here is the hashing algorithm though. ChrisA
Back to comp.lang.python | Previous | Next — Previous in thread | Next in thread | Find similar | Unroll thread
A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-07 21:46 +0000
Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-08 08:58 +1100
Re: A sets algorithm Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2016-02-07 22:03 +0000
Re: A sets algorithm Tim Chase <python.list@tim.thechases.com> - 2016-02-07 16:17 -0600
Re: A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-08 00:05 +0000
Re: A sets algorithm Tim Chase <python.list@tim.thechases.com> - 2016-02-07 18:20 -0600
Re: A sets algorithm Cem Karan <cfkaran2@gmail.com> - 2016-02-07 20:07 -0500
Re: A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-08 02:22 +0000
Re: A sets algorithm Random832 <random832@fastmail.com> - 2016-02-08 09:49 -0500
Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-09 02:11 +1100
Re: A sets algorithm Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-02-09 15:13 +1100
Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-09 15:27 +1100
Re: A sets algorithm Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2016-02-09 17:48 +1300
csiph-web