Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #102704
| Path | csiph.com!fu-berlin.de!uni-berlin.de!not-for-mail |
|---|---|
| From | Chris Angelico <rosuav@gmail.com> |
| Newsgroups | comp.lang.python |
| Subject | Re: A sets algorithm |
| Date | Tue, 9 Feb 2016 15:27:40 +1100 |
| Lines | 57 |
| Message-ID | <mailman.119.1454992069.2317.python-list@python.org> (permalink) |
| References | <n98e0f$15lj$1@gioia.aioe.org> <CC00410F-D160-4C34-A933-C1810614A178@gmail.com> <1454942992.2532814.515120466.5DA4A683@webmail.messagingengine.com> <mailman.100.1454944293.2317.python-list@python.org> <56b96755$0$2909$c3e8da3$76491128@news.astraweb.com> |
| Mime-Version | 1.0 |
| Content-Type | text/plain; charset=UTF-8 |
| X-Trace | news.uni-berlin.de +6zrtK42zSuoLOsQvtFdVQQXXGxh/D7DXlTzmECA5I4g== |
| Return-Path | <rosuav@gmail.com> |
| X-Original-To | python-list@python.org |
| Delivered-To | python-list@mail.python.org |
| X-Spam-Status | OK 0.000 |
| X-Spam-Evidence | '*H*': 1.00; '*S*': 0.00; 'received:209.85.223': 0.03; 'cc:addr:python-list': 0.09; 'executes': 0.09; 'files:': 0.09; 'insertion': 0.09; 'integers': 0.09; 'lookup': 0.09; 'themselves,': 0.09; 'will,': 0.09; ':-)': 0.12; 'files.': 0.13; 'size,': 0.13; 'file,': 0.15; '2016': 0.16; 'bucket': 0.16; 'correspond.': 0.16; 'dictionary.': 0.16; 'elsewhere.': 0.16; 'equality.': 0.16; 'from:addr:rosuav': 0.16; 'from:name:chris angelico': 0.16; 'hashes': 0.16; 'ignores': 0.16; 'n),': 0.16; 'naive': 0.16; 'operation,': 0.16; 'pairs': 0.16; 'rather,': 0.16; 'received:io': 0.16; 'received:psf.io': 0.16; 'whatsoever': 0.16; 'wrote:': 0.16; 'cheap': 0.18; 'comparing': 0.18; 'differ': 0.18; 'exists': 0.18; 'cc:2**0': 0.20; 'cc:addr:python.org': 0.20; 'work,': 0.21; 'fairly': 0.22; 'meant': 0.22; 'saying': 0.22; 'assuming': 0.22; 'constant': 0.22; 'file.': 0.22; 'feb': 0.23; 'component': 0.23; 'elements': 0.23; 'this:': 0.23; 'second': 0.24; 'header:In-Reply-To:1': 0.24; 'sort': 0.25; '(which': 0.26; 'chris': 0.26; 'compare': 0.27; 'message-id:@mail.gmail.com': 0.27; 'fine': 0.28; 'comparison': 0.29; 'dictionary': 0.29; 'hash': 0.29; 'itself,': 0.29; 'other,': 0.29; "i'm": 0.30; 'code': 0.30; 'certainly': 0.30; 'possibly': 0.32; 'call,': 0.33; "d'aprano": 0.33; 'stands': 0.33; 'steven': 0.33; 'tue,': 0.34; 'file': 0.34; 'that,': 0.34; 'received:google.com': 0.35; 'could': 0.35; 'quite': 0.35; 'something': 0.35; "isn't": 0.35; 'step': 0.36; 'but': 0.36; 'too': 0.36; 'there': 0.36; 'received:209.85': 0.36; 'possible': 0.36; 'pm,': 0.36; 'subject:: ': 0.37; 'two': 0.37; 'being': 0.37; 'received:209': 0.38; 'wrong': 0.38; 'log': 0.38; 'files': 0.38; 'mean': 0.38; 'sure': 0.39; 'where': 0.40; 'still': 0.40; 'called': 0.40; 'some': 0.40; 'secure': 0.60; 'skip:u 10': 0.61; 'times': 0.63; 'other.': 0.64; 'card.': 0.67; 'overall': 0.72; 'gathering': 0.76; 'chrisa': 0.84; 'to:none': 0.91; 'mistake': 0.91; 'thing,': 0.93 |
| DKIM-Signature | v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:cc :content-type; bh=iTQDesSCXQK2IplTuf0ZqHyy5jpZXl1bBXSSK+bglto=; b=nZbwvLjqQzGSGlI81DOrKr8CT97yddicdwGZt+wKXE1g4PLaN3/okAuDJ4mvWr+aQ6 0tM0yqggPCtcljwHuCJuKhx+Aq5CslLaQqurLz3GD3H/RbnfXBiWCYihmKGPjW/rmekb T1FfcGUpqclPmTmojp1YoiRKX875EnRm4Z/u00dFH9NREfUC7t6DD82zVztDpQozgvRZ TVGGqdbkuxkLwZGkzpskKo923iBwBn1xPnWMG0IX4Re+Ve0r2DiL7N3bK9v1vjbJSuaT gs5DFA/P1IEdM7nrLGsREJrZmB+apKlRFM3zjsBlxYNtIGHFnUCGDVfmxEioSxI8iEXW YLUQ== |
| X-Google-DKIM-Signature | v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:cc:content-type; bh=iTQDesSCXQK2IplTuf0ZqHyy5jpZXl1bBXSSK+bglto=; b=kTyjQpGJz3v+zMa45TJQMzLh5r5idXrIhvo3w7hWRHBOQRmgTS0JVu8/K8JeQ1QqJy S4AOAt2olyLHoFR6PHuSdnM92SG/qpVr8HXoKZ9p1a2ANKlwg5Fz5dk0vikPe4EzAlND Zi/mr3HI8EGHDWGZzfHxhLm3Ff2iUnzThwI9YDy8URk1mwAcD2N0soR4kJva85/mJHRm WMfebwBkBacD1QZN2A92QHhqSKlH6qG9FYppkvhUkaeNc/Koc4Oye8AH8dLrjBuonnwq rTlRxuz0LtSEWaIsN7tSYSTd1RxX9Ugdb61QQThXABFDr1Rki4eOQX6NoMrcw+uZbOiO RhdA== |
| X-Gm-Message-State | AG10YOQWAocAVArdZG1ApzFYnsR/WLQwGGncE70wUCe7PuoDeSknQ4PnahTZp9JHDXhobeyS4QNZ1hBfO/LLRw== |
| X-Received | by 10.107.14.73 with SMTP id 70mr30733536ioo.31.1454992060454; Mon, 08 Feb 2016 20:27:40 -0800 (PST) |
| In-Reply-To | <56b96755$0$2909$c3e8da3$76491128@news.astraweb.com> |
| X-BeenThere | python-list@python.org |
| X-Mailman-Version | 2.1.21rc2 |
| Precedence | list |
| List-Id | General discussion list for the Python programming language <python-list.python.org> |
| List-Unsubscribe | <https://mail.python.org/mailman/options/python-list>, <mailto:python-list-request@python.org?subject=unsubscribe> |
| List-Archive | <http://mail.python.org/pipermail/python-list/> |
| List-Post | <mailto:python-list@python.org> |
| List-Help | <mailto:python-list-request@python.org?subject=help> |
| List-Subscribe | <https://mail.python.org/mailman/listinfo/python-list>, <mailto:python-list-request@python.org?subject=subscribe> |
| Xref | csiph.com comp.lang.python:102704 |
Show key headers only | View raw
On Tue, Feb 9, 2016 at 3:13 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On Tuesday 09 February 2016 02:11, Chris Angelico wrote:
>
>> That's fine for comparing one file against one other. He started out
>> by saying he already had a way to compare files for equality. What he
>> wants is a way to capitalize on that to find all the identical files
>> in a group. A naive approach would simply compare every file against
>> every other, for O(N*N) comparisons - but a hash lookup can make that
>> O(N) on the files themselves, plus (I think) an O(N log N) hash
>> comparison job, which has much lower constant factors.
>
> You're mixing up N's there :-) In the first two (compare every file against
> every other file, and hash lookup), N stands for the number of files. But in
> the hash comparison, well, I'm not sure what you mean by that, unless you
> mean the calculation of the hash itself, which will be O(M) where M is the
> size of the file.
>
> Unfortunately, even using a hash, you still have to do O(N**2) work, or
> rather, O(N*M) where N is the number of files and M is their size.
>
My intention was that every N was "number of files being compared",
but it's possible I made a mistake elsewhere. Assuming the hashes
become integers too large for a bucket sort (which they certainly
will, unless you want inordinate numbers of hash collisions), the code
would look something like this:
hash_to_filename = defaultdict(list)
for fn in files:
# Step 1: Hash every file.
hash = calculate_hash(fn)
# Step 2: Locate all pairs of files with identical hashes
hash_to_filename[hash].append(fn)
This loop executes once for each file, so calculate_hash() is called
once for each file. The cost of that is O(N), or possibly O(N*M). If
you're fairly sure the files are going to differ in size, you could
use the file size *as* the hash - it's cheap to calculate (no M
component whatsoever - just a stat call, and even that could be
optimized away in some cases, eg if you're scanning a whole
directory), but isn't cryptographically secure and ignores file
content altogether.
The second step involves one dictionary lookup or insertion for each
file. That's an O(log N) operation, where N is the number of elements
in the dictionary. So yes, it's not quite the same as the number of
files (if every file exists exactly twice, this N will be half the
other N), but it's broadly going to correspond. And an O(log N)
operation performed N times has an overall cost of O(N log N).
I might have something wrong here, but definitely I meant to have the
Ns all mean the same thing, like X on a Magic: The Gathering card. :)
ChrisA
Back to comp.lang.python | Previous | Next — Previous in thread | Next in thread | Find similar | Unroll thread
A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-07 21:46 +0000
Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-08 08:58 +1100
Re: A sets algorithm Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2016-02-07 22:03 +0000
Re: A sets algorithm Tim Chase <python.list@tim.thechases.com> - 2016-02-07 16:17 -0600
Re: A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-08 00:05 +0000
Re: A sets algorithm Tim Chase <python.list@tim.thechases.com> - 2016-02-07 18:20 -0600
Re: A sets algorithm Cem Karan <cfkaran2@gmail.com> - 2016-02-07 20:07 -0500
Re: A sets algorithm Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-08 02:22 +0000
Re: A sets algorithm Random832 <random832@fastmail.com> - 2016-02-08 09:49 -0500
Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-09 02:11 +1100
Re: A sets algorithm Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-02-09 15:13 +1100
Re: A sets algorithm Chris Angelico <rosuav@gmail.com> - 2016-02-09 15:27 +1100
Re: A sets algorithm Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2016-02-09 17:48 +1300
csiph-web