Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.python > #7541

Re: What is the most efficient way to compare similar contents in two lists?

Path csiph.com!x330-a1.tempe.blueboxinc.net!usenet.pasdenom.info!aioe.org!feeder.news-service.com!feeder1.cambriumusenet.nl!feed.tweaknews.nl!194.109.133.84.MISMATCH!newsfeed.xs4all.nl!newsfeed5.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
Return-Path <rosuav@gmail.com>
X-Original-To python-list@python.org
Delivered-To python-list@mail.python.org
X-Spam-Status OK 0.000
X-Spam-Evidence '*H*': 1.00; '*S*': 0.00; 'python.': 0.04; 'context': 0.04; 'modified': 0.05; 'pointer': 0.05; 'dictionary': 0.07; 'subject:two': 0.07; 'deletion': 0.09; 'given,': 0.09; 'list.\xa0': 0.09; 'match.': 0.09; 'operator,': 0.09; 'solution,': 0.09; 'subject:compare': 0.09; 'this:': 0.10; 'am,': 0.14; 'received:209.85.214.174': 0.14; 'received:mail- iw0-f174.google.com': 0.14; 'wrote:': 0.14; 'defined': 0.14; 'angelico': 0.16; 'consecutive': 0.16; 'duplicates': 0.16; 'from:addr:rosuav': 0.16; 'from:name:chris angelico': 0.16; 'guessing': 0.16; 'list;': 0.16; 'pointers': 0.16; 'reached,': 0.16; 'sorting': 0.16; 'subject:similar': 0.16; 'algorithm': 0.16; 'matching': 0.16; 'tue,': 0.17; 'simpler': 0.19; 'written,': 0.19; 'header:In-Reply-To:1': 0.21; 'wrote': 0.22; 'assume': 0.23; 'assumes': 0.23; 'found,': 0.23; 'keys': 0.23; 'lines,': 0.23; "doesn't": 0.25; 'string': 0.26; 'example': 0.27; "i'm": 0.27; 'message-id:@mail.gmail.com': 0.28; 'received:209.85.214': 0.28; 'subject:?': 0.29; 'anyway.': 0.29; 'version': 0.29; '(since': 0.30; 'sort': 0.31; 'yet': 0.32; 'headers': 0.32; 'to:addr:python- list': 0.33; 'list': 0.33; 'lines': 0.33; 'chris': 0.34; 'source': 0.34; 'file': 0.34; 'duplicate': 0.35; 'identical': 0.35; 'matter,': 0.35; 'subject:What': 0.35; 'using': 0.35; 'difference': 0.37; 'subject:lists': 0.37; 'similar': 0.37; 'received:google.com': 0.37; 'something': 0.37; 'received:209.85': 0.37; 'assuming': 0.37; 'comparing': 0.37; 'two': 0.37; 'but': 0.38; 'subject:: ': 0.38; 'received:209': 0.39; 'sets': 0.39; 'finding': 0.39; 'to:addr:python.org': 0.39; 'current': 0.40; 'simply': 0.60; 'within': 0.60; 'order': 0.62; 'beautiful': 0.77; 'mandates': 0.84
DKIM-Signature v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:in-reply-to:references:date :message-id:subject:from:to:content-type:content-transfer-encoding; bh=7g6xpupsV5BUFEIhfpzSZrlVIlrvodM8LFPxLVtkhLY=; b=lIyb9IaUCp/hhGyC/zAy2OfSIJi3E1KdKVDZuBCSEDEjBRpYKEnV5BrOlLCAWlRxXV A+k6+PMrNh4mtEL0sldtUsuuWocQLvNXwIPlLS6vqqXVqknkhzoTOKrhm5mwXk03+Ydy NDnEKXCIJEm9fjwZfQCdOYJttyEc1XAVv7cH8=
DomainKey-Signature a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type:content-transfer-encoding; b=ooHOh1ij1w1cW08BxtmRAPJ7Z4gHscHCXOnWTcWrxO93BUlG898IYvPdmXiM7qmKS5 ozaLPzmr90aHwcODcgEAqCJuK5eO9O+UQu1ZZBimvxxgZYKwa/9KyAVfAx3qo31REs3I F/5mJMXZh8fw8fwmUUWOA+EIv6L5CQeqbLGTM=
MIME-Version 1.0
In-Reply-To <BANLkTikny0T7KMzvAd709+NZtvsFQgvnxw@mail.gmail.com>
References <3ff2f4cc-7c26-4f59-9463-d6f0874174d8@dn9g2000vbb.googlegroups.com> <BANLkTik+KBYb_yuJ7mw9=X2OBPVN=bXyjQ@mail.gmail.com> <BANLkTikny0T7KMzvAd709+NZtvsFQgvnxw@mail.gmail.com>
Date Tue, 14 Jun 2011 04:11:14 +1000
Subject Re: What is the most efficient way to compare similar contents in two lists?
From Chris Angelico <rosuav@gmail.com>
To python-list@python.org
Content-Type text/plain; charset=ISO-8859-1
Content-Transfer-Encoding quoted-printable
X-BeenThere python-list@python.org
X-Mailman-Version 2.1.12
Precedence list
List-Id General discussion list for the Python programming language <python-list.python.org>
List-Unsubscribe <http://mail.python.org/mailman/options/python-list>, <mailto:python-list-request@python.org?subject=unsubscribe>
List-Archive <http://mail.python.org/pipermail/python-list>
List-Post <mailto:python-list@python.org>
List-Help <mailto:python-list-request@python.org?subject=help>
List-Subscribe <http://mail.python.org/mailman/listinfo/python-list>, <mailto:python-list-request@python.org?subject=subscribe>
Newsgroups comp.lang.python
Message-ID <mailman.188.1307988677.11593.python-list@python.org> (permalink)
Lines 41
NNTP-Posting-Host 82.94.164.166
X-Trace 1307988677 news.xs4all.nl 49180 [::ffff:82.94.164.166]:35237
X-Complaints-To abuse@xs4all.nl
Xref x330-a1.tempe.blueboxinc.net comp.lang.python:7541

Show key headers only | View raw


On Tue, Jun 14, 2011 at 3:58 AM, Dan Stromberg <drsalists@gmail.com> wrote:
>
> This is a beautiful solution, and yet I feel compelled to mention that it
> disregards duplicates within a given list.  If you need duplicate
> detection/differencing, it's better to sort each list and then use an
> algorithm similar to the merge step of mergesort.

The original example used the 'in' operator, which is effectively a
set operation anyway. As written, it would count all duplicates in the
source headers but only count one in the target. I'm guessing that the
context mandates no duplicates (say, if they're dictionary keys or
something - let's assume float("nan") is not a key).

> Using sets as above is O(n), while the sorting version is O(nlogn) usually.
> O(n) is better than O(nlogn).
>
> And of course, the version based on sorting assumes order doesn't matter.

If order and duplicates matter, then you want a completely different
diff. I wrote one a while back, but not in Python. The algorithm went
something like this:

* Start with pointers to beginnings of both lists.
* See if current string is identical in each list; if so, increment
pointers and iterate.
* Once a difference is found, try to find a re-match by incrementing
pointer #1 until the two match. If the end of the first list is
reached, emit current line of #2 as a deletion and point pointer #1 to
just after where it started.
* On finding a re-match, emit all list #1 lines from first difference
to rematch as insertions.

(Since that was for comparing a source file with a user-supplied
modified source - a sort of diff/patch - a re-match was defined by N
consecutive matching lines, but in this, a re-match can simply be two
identical strings.)

But for the situation given, I'm assuming that a simpler solution suffices.

Chris Angelico

Back to comp.lang.python | Previous | NextPrevious in thread | Next in thread | Find similar | Unroll thread


Thread

What is the most efficient way to compare similar contents in two lists? Zachary Dziura <zcdziura@gmail.com> - 2011-06-13 07:58 -0700
  Re: What is the most efficient way to compare similar contents in two lists? Chris Angelico <rosuav@gmail.com> - 2011-06-14 01:09 +1000
    Re: What is the most efficient way to compare similar contents in two lists? Zachary Dziura <zcdziura@gmail.com> - 2011-06-13 08:21 -0700
      Re: What is the most efficient way to compare similar contents in two lists? Chris Angelico <rosuav@gmail.com> - 2011-06-14 01:39 +1000
        Re: What is the most efficient way to compare similar contents in two lists? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-06-13 16:04 +0000
          Re: What is the most efficient way to compare similar contents in two lists? Chris Angelico <rosuav@gmail.com> - 2011-06-14 02:30 +1000
          Re: What is the most efficient way to compare similar contents in two lists? geremy condra <debatem1@gmail.com> - 2011-06-13 10:46 -0700
          Re: What is the most efficient way to compare similar contents in two lists? Chris Angelico <rosuav@gmail.com> - 2011-06-14 03:50 +1000
          Re: What is the most efficient way to compare similar contents in two lists? geremy condra <debatem1@gmail.com> - 2011-06-13 11:20 -0700
          Re: What is the most efficient way to compare similar contents in two lists? Chris Angelico <rosuav@gmail.com> - 2011-06-14 04:24 +1000
          Re: What is the most efficient way to compare similar contents in two lists? Ethan Furman <ethan@stoneleaf.us> - 2011-06-13 12:11 -0700
            Re: What is the most efficient way to compare similar contents in two lists? Zachary Dziura <zcdziura@gmail.com> - 2011-06-13 12:21 -0700
          Re: What is the most efficient way to compare similar contents in two lists? Chris Angelico <rosuav@gmail.com> - 2011-06-14 07:33 +1000
  Re: What is the most efficient way to compare similar contents in two lists? Chris Angelico <rosuav@gmail.com> - 2011-06-14 04:11 +1000
    Re: What is the most efficient way to compare similar contents in two        lists? Chris Torek <nospam@torek.net> - 2011-06-13 18:40 +0000
  Re: What is the most efficient way to compare similar contents in two lists? Chris Angelico <rosuav@gmail.com> - 2011-06-14 04:12 +1000

csiph-web