Groups > comp.lang.python > #7517 > unrolled thread

What is the most efficient way to compare similar contents in two lists?

Started by	Zachary Dziura <zcdziura@gmail.com>
First post	2011-06-13 07:58 -0700
Last post	2011-06-14 04:12 +1000
Articles	16 — 6 participants

Back to article view | Back to comp.lang.python

#7517 — What is the most efficient way to compare similar contents in two lists?

From	Zachary Dziura <zcdziura@gmail.com>
Date	2011-06-13 07:58 -0700
Subject	What is the most efficient way to compare similar contents in two lists?
Message-ID	<3ff2f4cc-7c26-4f59-9463-d6f0874174d8@dn9g2000vbb.googlegroups.com>

similar_headers = 0
different_headers = 0
source_headers = sorted(source_mapping.headers)
target_headers = sorted(target_mapping.headers)

# Check if the headers between the two mappings are the same
if set(source_headers) == set(target_headers):
    similar_headers = len(source_headers)
else:
    # We're going to do two run-throughs of the tables, to find the
    # different and similar header names. Start with the source
    # headers...
    for source_header in source_headers:
        if source_header in target_headers:
            similar_headers += 1
        else:
            different_headers += 1
    # Now check target headers for any differences
    for target_header in target_headers:
        if target_header in source_headers:
            pass
        else:
            different_headers += 1

As you can probably tell, I make two iterations: one for the
'source_headers' list, and another for the 'target_headers' list.
During the first iteration, if a specific header (mapped to a variable
'source_header') exists in both lists, then the 'similar_headers'
variable is incremented by one. Similarly, if it doesn't exist in both
lists, 'different_headers' is incremented by one. For the second
iteration, it only checks for different headers.

My code works as expected and there are no bugs, however I get the
feeling that I'm not doing this comparison in the most efficient way
possible. Is there another way that I can make this same comparison
while making my code more Pythonic and efficient? I would prefer not
to have to install an external module from elsewhere, though if I have
to then I will.

Thanks in advance for any and all answers!

[toc] | [next] | [standalone]

#7519

From	Chris Angelico <rosuav@gmail.com>
Date	2011-06-14 01:09 +1000
Message-ID	<mailman.169.1307977750.11593.python-list@python.org>
In reply to	#7517

On Tue, Jun 14, 2011 at 12:58 AM, Zachary Dziura <zcdziura@gmail.com> wrote:
> if set(source_headers) == set(target_headers):
>    similar_headers = len(source_headers)

Since you're making sets already, I'd recommend using set operations -
same_headers is the (length of the) intersection of those two sets,
and different_headers is the XOR.

# If you need the lists afterwards, use different variable names
source_headers = set(source_headers)
target_headers = set(target_headers)
similar_headers = len(source_headers & target_headers)
different_headers = len(source_headers ^ target_headers)

Chris Angelico

[toc] | [prev] | [next] | [standalone]

#7520

From	Zachary Dziura <zcdziura@gmail.com>
Date	2011-06-13 08:21 -0700
Message-ID	<bad3a56e-efba-4ef4-a159-64273cda5a08@gh5g2000vbb.googlegroups.com>
In reply to	#7519

On Jun 13, 11:09 am, Chris Angelico <ros...@gmail.com> wrote:
> On Tue, Jun 14, 2011 at 12:58 AM, Zachary Dziura <zcdzi...@gmail.com> wrote:
> > if set(source_headers) == set(target_headers):
> >    similar_headers = len(source_headers)
>
> Since you're making sets already, I'd recommend using set operations -
> same_headers is the (length of the) intersection of those two sets,
> and different_headers is the XOR.
>
> # If you need the lists afterwards, use different variable names
> source_headers = set(source_headers)
> target_headers = set(target_headers)
> similar_headers = len(source_headers & target_headers)
> different_headers = len(source_headers ^ target_headers)
>
> Chris Angelico

Wow! That was a lot easier than I thought it would be! I guess I
should have done a little bit more research into such operations.
Thanks a bunch!!

[toc] | [prev] | [next] | [standalone]

#7524

From	Chris Angelico <rosuav@gmail.com>
Date	2011-06-14 01:39 +1000
Message-ID	<mailman.176.1307979593.11593.python-list@python.org>
In reply to	#7520

On Tue, Jun 14, 2011 at 1:21 AM, Zachary Dziura <zcdziura@gmail.com> wrote:
> Wow! That was a lot easier than I thought it would be! I guess I
> should have done a little bit more research into such operations.
> Thanks a bunch!!

:)

Python: "Batteries Included".

(Although Python 3 is "Most of the batteries you're used to, included".)

ChrisA

[toc] | [prev] | [next] | [standalone]

#7529

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2011-06-13 16:04 +0000
Message-ID	<4df634f0$0$30002$c3e8da3$5496439d@news.astraweb.com>
In reply to	#7524

On Tue, 14 Jun 2011 01:39:50 +1000, Chris Angelico wrote:

> Python: "Batteries Included".
> 
> (Although Python 3 is "Most of the batteries you're used to, included".)

Oh gods, you're not going to bring up sort(cmp=...) again are you????


/me ducks and covers



-- 
Steven

[toc] | [prev] | [next] | [standalone]

#7534

From	Chris Angelico <rosuav@gmail.com>
Date	2011-06-14 02:30 +1000
Message-ID	<mailman.182.1307982662.11593.python-list@python.org>
In reply to	#7529

On Tue, Jun 14, 2011 at 2:04 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On Tue, 14 Jun 2011 01:39:50 +1000, Chris Angelico wrote:
>
>> Python: "Batteries Included".
>>
>> (Although Python 3 is "Most of the batteries you're used to, included".)
>
> Oh gods, you're not going to bring up sort(cmp=...) again are you????
>
> /me ducks and covers

Ha! That *lengthy* thread started fairly soon after I joined this
list. It was highly... informative. I learned a bit about Python, and
a lot about python-list, from the posts there.

Chris Angelico

[toc] | [prev] | [next] | [standalone]

#7537

From	geremy condra <debatem1@gmail.com>
Date	2011-06-13 10:46 -0700
Message-ID	<mailman.184.1307987170.11593.python-list@python.org>
In reply to	#7529

On Mon, Jun 13, 2011 at 9:30 AM, Chris Angelico <rosuav@gmail.com> wrote:
> On Tue, Jun 14, 2011 at 2:04 AM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> On Tue, 14 Jun 2011 01:39:50 +1000, Chris Angelico wrote:
>>
>>> Python: "Batteries Included".
>>>
>>> (Although Python 3 is "Most of the batteries you're used to, included".)
>>
>> Oh gods, you're not going to bring up sort(cmp=...) again are you????
>>
>> /me ducks and covers
>
> Ha! That *lengthy* thread started fairly soon after I joined this
> list. It was highly... informative. I learned a bit about Python, and
> a lot about python-list, from the posts there.

So what are we talking about here? Reduce?

Geremy Condra

[toc] | [prev] | [next] | [standalone]

#7539

From	Chris Angelico <rosuav@gmail.com>
Date	2011-06-14 03:50 +1000
Message-ID	<mailman.186.1307987418.11593.python-list@python.org>
In reply to	#7529

On Tue, Jun 14, 2011 at 3:46 AM, geremy condra <debatem1@gmail.com> wrote:
> On Mon, Jun 13, 2011 at 9:30 AM, Chris Angelico <rosuav@gmail.com> wrote:
>> Ha! That *lengthy* thread started fairly soon after I joined this
>> list. It was highly... informative. I learned a bit about Python, and
>> a lot about python-list, from the posts there.
>
> So what are we talking about here? Reduce?

There was a huge thread about the removal of the cmp parameter to
sort() in Python 3.x. It was quite a wide-ranging discussion, touched
on many things. I think it got up to 90 posts in the first day, or
something. The main thing it taught me was that python-list is a high
traffic list; the next most important thing was who the courteous
posters are.

In contrast, the thread on NaNs in Python has taught me more about the
difference between floating point and 'real numbers' in a week or two
than all my previous computing training put together. Yep, this is an
interesting list.

Chris Angelico

[toc] | [prev] | [next] | [standalone]

#7544

From	geremy condra <debatem1@gmail.com>
Date	2011-06-13 11:20 -0700
Message-ID	<mailman.190.1307989254.11593.python-list@python.org>
In reply to	#7529

On Mon, Jun 13, 2011 at 10:50 AM, Chris Angelico <rosuav@gmail.com> wrote:
> On Tue, Jun 14, 2011 at 3:46 AM, geremy condra <debatem1@gmail.com> wrote:
>> On Mon, Jun 13, 2011 at 9:30 AM, Chris Angelico <rosuav@gmail.com> wrote:
>>> Ha! That *lengthy* thread started fairly soon after I joined this
>>> list. It was highly... informative. I learned a bit about Python, and
>>> a lot about python-list, from the posts there.
>>
>> So what are we talking about here? Reduce?
>
> There was a huge thread about the removal of the cmp parameter to
> sort() in Python 3.x. It was quite a wide-ranging discussion, touched
> on many things. I think it got up to 90 posts in the first day, or
> something. The main thing it taught me was that python-list is a high
> traffic list; the next most important thing was who the courteous
> posters are.

I know that, but I mean what were you talking about before if you
weren't talking about cmp?

> In contrast, the thread on NaNs in Python has taught me more about the
> difference between floating point and 'real numbers' in a week or two
> than all my previous computing training put together. Yep, this is an
> interesting list.

I liked that thread. There should be more of those.

Geremy Condra

[toc] | [prev] | [next] | [standalone]

#7545

From	Chris Angelico <rosuav@gmail.com>
Date	2011-06-14 04:24 +1000
Message-ID	<mailman.191.1307989471.11593.python-list@python.org>
In reply to	#7529

On Tue, Jun 14, 2011 at 4:20 AM, geremy condra <debatem1@gmail.com> wrote:
> I know that, but I mean what were you talking about before if you
> weren't talking about cmp?

Not sure what you mean. There were other threads before the cmp thread
started. That thread started with, if my memory serves me correctly,
something coming back from the python-dev list about restoring that
parameter to Python 3 (it was one of the things removed, but is still
available in Python 2).

> I liked that thread [on NaN]. There should be more of those.

Seconded. Someone just needs to ask the question. We had a small
thread a while ago that started with the question (I'm paraphrasing)
"How does Python multiply?".

ChrisA

[toc] | [prev] | [next] | [standalone]

#7549

From	Ethan Furman <ethan@stoneleaf.us>
Date	2011-06-13 12:11 -0700
Message-ID	<mailman.193.1307991482.11593.python-list@python.org>
In reply to	#7529

Chris Angelico wrote:
> On Tue, Jun 14, 2011 at 4:20 AM, geremy condra <debatem1@gmail.com> wrote:
>> I know that, but I mean what were you talking about before if you
>> weren't talking about cmp?
> 
> Not sure what you mean. 

I suspect Geremy is referring to your "most of the batteries you're used 
to, included" comment -- which batteries are missing?

~Ethan~

[toc] | [prev] | [next] | [standalone]

#7551

From	Zachary Dziura <zcdziura@gmail.com>
Date	2011-06-13 12:21 -0700
Message-ID	<6b8519ff-67f9-4c7a-bfe2-c5e71acbee5f@c26g2000vbq.googlegroups.com>
In reply to	#7549

For this script, it's guaranteed that whatever tables the script goes
through and processes, there will be no duplicate headers. I didn't
include functionality to deal with duplicates because there won't be
any to begin with! I just wanted to find out the most efficient way of
checking for similar headers, which Chris A. told me! Thank you again,
Chris! You saved me a lot of headache!

[toc] | [prev] | [next] | [standalone]

#7561

From	Chris Angelico <rosuav@gmail.com>
Date	2011-06-14 07:33 +1000
Message-ID	<mailman.197.1308000815.11593.python-list@python.org>
In reply to	#7529

On Tue, Jun 14, 2011 at 5:11 AM, Ethan Furman <ethan@stoneleaf.us> wrote:
> I suspect Geremy is referring to your "most of the batteries you're used to,
> included" comment -- which batteries are missing?

Oh! There's a handful of modules that aren't yet available in 3.x,
which might surprise someone who's moving from 2.x. But hey, I didn't
know about difflib (although I probably would have found it if I'd
gone searching). Python has more than expected, not less!

And Zachary: You're most welcome. Glad to have been of service!

ChrisA

[toc] | [prev] | [next] | [standalone]

#7541

From	Chris Angelico <rosuav@gmail.com>
Date	2011-06-14 04:11 +1000
Message-ID	<mailman.188.1307988677.11593.python-list@python.org>
In reply to	#7517

On Tue, Jun 14, 2011 at 3:58 AM, Dan Stromberg <drsalists@gmail.com> wrote:
>
> This is a beautiful solution, and yet I feel compelled to mention that it
> disregards duplicates within a given list.  If you need duplicate
> detection/differencing, it's better to sort each list and then use an
> algorithm similar to the merge step of mergesort.

The original example used the 'in' operator, which is effectively a
set operation anyway. As written, it would count all duplicates in the
source headers but only count one in the target. I'm guessing that the
context mandates no duplicates (say, if they're dictionary keys or
something - let's assume float("nan") is not a key).

> Using sets as above is O(n), while the sorting version is O(nlogn) usually.
> O(n) is better than O(nlogn).
>
> And of course, the version based on sorting assumes order doesn't matter.

If order and duplicates matter, then you want a completely different
diff. I wrote one a while back, but not in Python. The algorithm went
something like this:

* Start with pointers to beginnings of both lists.
* See if current string is identical in each list; if so, increment
pointers and iterate.
* Once a difference is found, try to find a re-match by incrementing
pointer #1 until the two match. If the end of the first list is
reached, emit current line of #2 as a deletion and point pointer #1 to
just after where it started.
* On finding a re-match, emit all list #1 lines from first difference
to rematch as insertions.

(Since that was for comparing a source file with a user-supplied
modified source - a sort of diff/patch - a re-match was defined by N
consecutive matching lines, but in this, a re-match can simply be two
identical strings.)

But for the situation given, I'm assuming that a simpler solution suffices.

Chris Angelico

[toc] | [prev] | [next] | [standalone]

#7546 — Re: What is the most efficient way to compare similar contents in two lists?

From	Chris Torek <nospam@torek.net>
Date	2011-06-13 18:40 +0000
Subject	Re: What is the most efficient way to compare similar contents in two lists?
Message-ID	<it5lj1030fo@news4.newsguy.com>
In reply to	#7541

In article <mailman.188.1307988677.11593.python-list@python.org>
Chris Angelico  <rosuav@gmail.com> wrote:
>If order and duplicates matter, then you want a completely different
>diff. I wrote one a while back, but not in Python. ...

If order and duplicates matter, one might want to look into
difflib. :-)
-- 
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W)  +1 801 277 2603
email: gmail (figure it out)      http://web.torek.net/torek/index.html

[toc] | [prev] | [next] | [standalone]

#7542

From	Chris Angelico <rosuav@gmail.com>
Date	2011-06-14 04:12 +1000
Message-ID	<mailman.189.1307988768.11593.python-list@python.org>
In reply to	#7517

On Tue, Jun 14, 2011 at 4:11 AM, Chris Angelico <rosuav@gmail.com> wrote:
> The algorithm went
> something like this:
>
> * Start with pointers to beginnings of both lists.

PS. It wasn't C with pointers and the like; iirc it actually used
array indices as the "pointers". Immaterial to the algorithm though.

ChrisA

[toc] | [prev] | [standalone]

csiph-web

What is the most efficient way to compare similar contents in two lists?

Contents

#7517 — What is the most efficient way to compare similar contents in two lists?

#7519

#7520

#7524

#7529

#7534

#7537

#7539

#7544

#7545

#7549

#7551

#7561

#7541

#7546 — Re: What is the most efficient way to compare similar contents in two lists?

#7542