From: Tim Chase
Newsgroups: comp.lang.python
Subject: Re: A sets algorithm
Date: Sun, 7 Feb 2016 18:20:50 -0600

On 2016-02-08 00:05, Paulo da Silva wrote:
> At 22:17 on 07-02-2016, Tim Chase wrote:
>> all_files = list(generate_MyFile_objects())
>> interesting = [
>>   (my_file1, my_file2)
>>   for i, my_file1
>>   in enumerate(all_files, 1)
>>   for my_file2
>>   in all_files[i:]
>>   if my_file1 == my_file2
>>   ]
>
> "my_file1 == my_file2" can be implemented into MyFile class taking
> advantage of caching sizes (if different files are different),
> hashes or even content (for small files) or file headers (first n
> bytes). However this seems to have a problem:
> all_files: a b c d e ...
> If a==b then comparing b with c,d,e is useless.

That depends on what the OP wants to happen when more than two input
files are equal, i.e., a == b == c. Does one just want "a has
duplicates" (and optionally "and here's one of them"), or does one
want "a == b", "a == c" and "b == c" all in the output?

> Another solution I thought of, could be defining some methods (I
> still don't know which ones) in MyFile so that I could use sets
> intersection. Would this one be a faster solution?

Adding __hash__ (alongside __eq__) would allow for the set operations,
but would require (as ChrisA points out) knowing how to create a hash
function that encompasses the information you want to compare.

-tkc
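
P.S. A minimal sketch of the __hash__ idea, under assumptions that are
mine and not the OP's: "equal" is taken to mean "same size and same
SHA-256 digest of the content", and the MyFile class, its key() method
and the duplicate_groups() helper are all hypothetical names made up
for illustration. Bucketing by key also sidesteps the "comparing b
with c, d, e is useless" concern, since each file is looked at once
rather than pairwise.

```python
import hashlib
import os
from collections import defaultdict


class MyFile:
    """Hypothetical stand-in for the MyFile class discussed in the thread."""

    def __init__(self, path):
        self.path = path
        self._key = None  # cached (size, digest) pair, computed on first use

    def key(self):
        # Assumption: two files are "equal" when they have the same size
        # and the same SHA-256 digest of their content.
        if self._key is None:
            size = os.path.getsize(self.path)
            digest = hashlib.sha256()
            with open(self.path, "rb") as f:
                # Read in chunks so large files are not loaded whole.
                for chunk in iter(lambda: f.read(65536), b""):
                    digest.update(chunk)
            self._key = (size, digest.hexdigest())
        return self._key

    def __eq__(self, other):
        if not isinstance(other, MyFile):
            return NotImplemented
        return self.key() == other.key()

    def __hash__(self):
        # Must agree with __eq__: equal objects need equal hashes.
        return hash(self.key())

    def __repr__(self):
        return "MyFile(%r)" % self.path


def duplicate_groups(files):
    """Bucket files by key; return only the buckets with more than one file."""
    buckets = defaultdict(list)
    for f in files:
        buckets[f.key()].append(f)
    return [group for group in buckets.values() if len(group) > 1]
```

With something like this, set(all_files) collapses duplicates, and
each list from duplicate_groups(all_files) answers the "a has
duplicates" form of the question. If the pairwise "a == b", "a == c",
"b == c" output is wanted instead, itertools.combinations(group, 2)
over each group would produce it.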