Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #7517 > unrolled thread
| Started by | Zachary Dziura <zcdziura@gmail.com> |
|---|---|
| First post | 2011-06-13 07:58 -0700 |
| Last post | 2011-06-14 04:12 +1000 |
| Articles | 16 — 6 participants |
Back to article view | Back to comp.lang.python
What is the most efficient way to compare similar contents in two lists? Zachary Dziura <zcdziura@gmail.com> - 2011-06-13 07:58 -0700
Re: What is the most efficient way to compare similar contents in two lists? Chris Angelico <rosuav@gmail.com> - 2011-06-14 01:09 +1000
Re: What is the most efficient way to compare similar contents in two lists? Zachary Dziura <zcdziura@gmail.com> - 2011-06-13 08:21 -0700
Re: What is the most efficient way to compare similar contents in two lists? Chris Angelico <rosuav@gmail.com> - 2011-06-14 01:39 +1000
Re: What is the most efficient way to compare similar contents in two lists? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-06-13 16:04 +0000
Re: What is the most efficient way to compare similar contents in two lists? Chris Angelico <rosuav@gmail.com> - 2011-06-14 02:30 +1000
Re: What is the most efficient way to compare similar contents in two lists? geremy condra <debatem1@gmail.com> - 2011-06-13 10:46 -0700
Re: What is the most efficient way to compare similar contents in two lists? Chris Angelico <rosuav@gmail.com> - 2011-06-14 03:50 +1000
Re: What is the most efficient way to compare similar contents in two lists? geremy condra <debatem1@gmail.com> - 2011-06-13 11:20 -0700
Re: What is the most efficient way to compare similar contents in two lists? Chris Angelico <rosuav@gmail.com> - 2011-06-14 04:24 +1000
Re: What is the most efficient way to compare similar contents in two lists? Ethan Furman <ethan@stoneleaf.us> - 2011-06-13 12:11 -0700
Re: What is the most efficient way to compare similar contents in two lists? Zachary Dziura <zcdziura@gmail.com> - 2011-06-13 12:21 -0700
Re: What is the most efficient way to compare similar contents in two lists? Chris Angelico <rosuav@gmail.com> - 2011-06-14 07:33 +1000
Re: What is the most efficient way to compare similar contents in two lists? Chris Angelico <rosuav@gmail.com> - 2011-06-14 04:11 +1000
Re: What is the most efficient way to compare similar contents in two lists? Chris Torek <nospam@torek.net> - 2011-06-13 18:40 +0000
Re: What is the most efficient way to compare similar contents in two lists? Chris Angelico <rosuav@gmail.com> - 2011-06-14 04:12 +1000
| From | Zachary Dziura <zcdziura@gmail.com> |
|---|---|
| Date | 2011-06-13 07:58 -0700 |
| Subject | What is the most efficient way to compare similar contents in two lists? |
| Message-ID | <3ff2f4cc-7c26-4f59-9463-d6f0874174d8@dn9g2000vbb.googlegroups.com> |
similar_headers = 0
different_headers = 0
source_headers = sorted(source_mapping.headers)
target_headers = sorted(target_mapping.headers)
# Check if the headers between the two mappings are the same
if set(source_headers) == set(target_headers):
similar_headers = len(source_headers)
else:
# We're going to do two run-throughs of the tables, to find the
# different and similar header names. Start with the source
# headers...
for source_header in source_headers:
if source_header in target_headers:
similar_headers += 1
else:
different_headers += 1
# Now check target headers for any differences
for target_header in target_headers:
if target_header in source_headers:
pass
else:
different_headers += 1
As you can probably tell, I make two iterations: one for the
'source_headers' list, and another for the 'target_headers' list.
During the first iteration, if a specific header (mapped to a variable
'source_header') exists in both lists, then the 'similar_headers'
variable is incremented by one. Similarly, if it doesn't exist in both
lists, 'different_headers' is incremented by one. For the second
iteration, it only checks for different headers.
My code works as expected and there are no bugs, however I get the
feeling that I'm not doing this comparison in the most efficient way
possible. Is there another way that I can make this same comparison
while making my code more Pythonic and efficient? I would prefer not
to have to install an external module from elsewhere, though if I have
to then I will.
Thanks in advance for any and all answers!
[toc] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2011-06-14 01:09 +1000 |
| Message-ID | <mailman.169.1307977750.11593.python-list@python.org> |
| In reply to | #7517 |
On Tue, Jun 14, 2011 at 12:58 AM, Zachary Dziura <zcdziura@gmail.com> wrote: > if set(source_headers) == set(target_headers): > similar_headers = len(source_headers) Since you're making sets already, I'd recommend using set operations - same_headers is the (length of the) intersection of those two sets, and different_headers is the XOR. # If you need the lists afterwards, use different variable names source_headers = set(source_headers) target_headers = set(target_headers) similar_headers = len(source_headers & target_headers) different_headers = len(source_headers ^ target_headers) Chris Angelico
[toc] | [prev] | [next] | [standalone]
| From | Zachary Dziura <zcdziura@gmail.com> |
|---|---|
| Date | 2011-06-13 08:21 -0700 |
| Message-ID | <bad3a56e-efba-4ef4-a159-64273cda5a08@gh5g2000vbb.googlegroups.com> |
| In reply to | #7519 |
On Jun 13, 11:09 am, Chris Angelico <ros...@gmail.com> wrote: > On Tue, Jun 14, 2011 at 12:58 AM, Zachary Dziura <zcdzi...@gmail.com> wrote: > > if set(source_headers) == set(target_headers): > > similar_headers = len(source_headers) > > Since you're making sets already, I'd recommend using set operations - > same_headers is the (length of the) intersection of those two sets, > and different_headers is the XOR. > > # If you need the lists afterwards, use different variable names > source_headers = set(source_headers) > target_headers = set(target_headers) > similar_headers = len(source_headers & target_headers) > different_headers = len(source_headers ^ target_headers) > > Chris Angelico Wow! That was a lot easier than I thought it would be! I guess I should have done a little bit more research into such operations. Thanks a bunch!!
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2011-06-14 01:39 +1000 |
| Message-ID | <mailman.176.1307979593.11593.python-list@python.org> |
| In reply to | #7520 |
On Tue, Jun 14, 2011 at 1:21 AM, Zachary Dziura <zcdziura@gmail.com> wrote: > Wow! That was a lot easier than I thought it would be! I guess I > should have done a little bit more research into such operations. > Thanks a bunch!! :) Python: "Batteries Included". (Although Python 3 is "Most of the batteries you're used to, included".) ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2011-06-13 16:04 +0000 |
| Message-ID | <4df634f0$0$30002$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #7524 |
On Tue, 14 Jun 2011 01:39:50 +1000, Chris Angelico wrote: > Python: "Batteries Included". > > (Although Python 3 is "Most of the batteries you're used to, included".) Oh gods, you're not going to bring up sort(cmp=...) again are you???? /me ducks and covers -- Steven
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2011-06-14 02:30 +1000 |
| Message-ID | <mailman.182.1307982662.11593.python-list@python.org> |
| In reply to | #7529 |
On Tue, Jun 14, 2011 at 2:04 AM, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote: > On Tue, 14 Jun 2011 01:39:50 +1000, Chris Angelico wrote: > >> Python: "Batteries Included". >> >> (Although Python 3 is "Most of the batteries you're used to, included".) > > Oh gods, you're not going to bring up sort(cmp=...) again are you???? > > /me ducks and covers Ha! That *lengthy* thread started fairly soon after I joined this list. It was highly... informative. I learned a bit about Python, and a lot about python-list, from the posts there. Chris Angelico
[toc] | [prev] | [next] | [standalone]
| From | geremy condra <debatem1@gmail.com> |
|---|---|
| Date | 2011-06-13 10:46 -0700 |
| Message-ID | <mailman.184.1307987170.11593.python-list@python.org> |
| In reply to | #7529 |
On Mon, Jun 13, 2011 at 9:30 AM, Chris Angelico <rosuav@gmail.com> wrote: > On Tue, Jun 14, 2011 at 2:04 AM, Steven D'Aprano > <steve+comp.lang.python@pearwood.info> wrote: >> On Tue, 14 Jun 2011 01:39:50 +1000, Chris Angelico wrote: >> >>> Python: "Batteries Included". >>> >>> (Although Python 3 is "Most of the batteries you're used to, included".) >> >> Oh gods, you're not going to bring up sort(cmp=...) again are you???? >> >> /me ducks and covers > > Ha! That *lengthy* thread started fairly soon after I joined this > list. It was highly... informative. I learned a bit about Python, and > a lot about python-list, from the posts there. So what are we talking about here? Reduce? Geremy Condra
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2011-06-14 03:50 +1000 |
| Message-ID | <mailman.186.1307987418.11593.python-list@python.org> |
| In reply to | #7529 |
On Tue, Jun 14, 2011 at 3:46 AM, geremy condra <debatem1@gmail.com> wrote: > On Mon, Jun 13, 2011 at 9:30 AM, Chris Angelico <rosuav@gmail.com> wrote: >> Ha! That *lengthy* thread started fairly soon after I joined this >> list. It was highly... informative. I learned a bit about Python, and >> a lot about python-list, from the posts there. > > So what are we talking about here? Reduce? There was a huge thread about the removal of the cmp parameter to sort() in Python 3.x. It was quite a wide-ranging discussion, touched on many things. I think it got up to 90 posts in the first day, or something. The main thing it taught me was that python-list is a high traffic list; the next most important thing was who the courteous posters are. In contrast, the thread on NaNs in Python has taught me more about the difference between floating point and 'real numbers' in a week or two than all my previous computing training put together. Yep, this is an interesting list. Chris Angelico
[toc] | [prev] | [next] | [standalone]
| From | geremy condra <debatem1@gmail.com> |
|---|---|
| Date | 2011-06-13 11:20 -0700 |
| Message-ID | <mailman.190.1307989254.11593.python-list@python.org> |
| In reply to | #7529 |
On Mon, Jun 13, 2011 at 10:50 AM, Chris Angelico <rosuav@gmail.com> wrote: > On Tue, Jun 14, 2011 at 3:46 AM, geremy condra <debatem1@gmail.com> wrote: >> On Mon, Jun 13, 2011 at 9:30 AM, Chris Angelico <rosuav@gmail.com> wrote: >>> Ha! That *lengthy* thread started fairly soon after I joined this >>> list. It was highly... informative. I learned a bit about Python, and >>> a lot about python-list, from the posts there. >> >> So what are we talking about here? Reduce? > > There was a huge thread about the removal of the cmp parameter to > sort() in Python 3.x. It was quite a wide-ranging discussion, touched > on many things. I think it got up to 90 posts in the first day, or > something. The main thing it taught me was that python-list is a high > traffic list; the next most important thing was who the courteous > posters are. I know that, but I mean what were you talking about before if you weren't talking about cmp? > In contrast, the thread on NaNs in Python has taught me more about the > difference between floating point and 'real numbers' in a week or two > than all my previous computing training put together. Yep, this is an > interesting list. I liked that thread. There should be more of those. Geremy Condra
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2011-06-14 04:24 +1000 |
| Message-ID | <mailman.191.1307989471.11593.python-list@python.org> |
| In reply to | #7529 |
On Tue, Jun 14, 2011 at 4:20 AM, geremy condra <debatem1@gmail.com> wrote: > I know that, but I mean what were you talking about before if you > weren't talking about cmp? Not sure what you mean. There were other threads before the cmp thread started. That thread started with, if my memory serves me correctly, something coming back from the python-dev list about restoring that parameter to Python 3 (it was one of the things removed, but is still available in Python 2). > I liked that thread [on NaN]. There should be more of those. Seconded. Someone just needs to ask the question. We had a small thread a while ago that started with the question (I'm paraphrasing) "How does Python multiply?". ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Ethan Furman <ethan@stoneleaf.us> |
|---|---|
| Date | 2011-06-13 12:11 -0700 |
| Message-ID | <mailman.193.1307991482.11593.python-list@python.org> |
| In reply to | #7529 |
Chris Angelico wrote: > On Tue, Jun 14, 2011 at 4:20 AM, geremy condra <debatem1@gmail.com> wrote: >> I know that, but I mean what were you talking about before if you >> weren't talking about cmp? > > Not sure what you mean. I suspect Geremy is referring to your "most of the batteries you're used to, included" comment -- which batteries are missing? ~Ethan~
[toc] | [prev] | [next] | [standalone]
| From | Zachary Dziura <zcdziura@gmail.com> |
|---|---|
| Date | 2011-06-13 12:21 -0700 |
| Message-ID | <6b8519ff-67f9-4c7a-bfe2-c5e71acbee5f@c26g2000vbq.googlegroups.com> |
| In reply to | #7549 |
For this script, it's guaranteed that whatever tables the script goes through and processes, there will be no duplicate headers. I didn't include functionality to deal with duplicates because there won't be any to begin with! I just wanted to find out the most efficient way of checking for similar headers, which Chris A. told me! Thank you again, Chris! You saved me a lot of headache!
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2011-06-14 07:33 +1000 |
| Message-ID | <mailman.197.1308000815.11593.python-list@python.org> |
| In reply to | #7529 |
On Tue, Jun 14, 2011 at 5:11 AM, Ethan Furman <ethan@stoneleaf.us> wrote: > I suspect Geremy is referring to your "most of the batteries you're used to, > included" comment -- which batteries are missing? Oh! There's a handful of modules that aren't yet available in 3.x, which might surprise someone who's moving from 2.x. But hey, I didn't know about difflib (although I probably would have found it if I'd gone searching). Python has more than expected, not less! And Zachary: You're most welcome. Glad to have been of service! ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2011-06-14 04:11 +1000 |
| Message-ID | <mailman.188.1307988677.11593.python-list@python.org> |
| In reply to | #7517 |
On Tue, Jun 14, 2011 at 3:58 AM, Dan Stromberg <drsalists@gmail.com> wrote:
>
> This is a beautiful solution, and yet I feel compelled to mention that it
> disregards duplicates within a given list. If you need duplicate
> detection/differencing, it's better to sort each list and then use an
> algorithm similar to the merge step of mergesort.
The original example used the 'in' operator, which is effectively a
set operation anyway. As written, it would count all duplicates in the
source headers but only count one in the target. I'm guessing that the
context mandates no duplicates (say, if they're dictionary keys or
something - let's assume float("nan") is not a key).
> Using sets as above is O(n), while the sorting version is O(nlogn) usually.
> O(n) is better than O(nlogn).
>
> And of course, the version based on sorting assumes order doesn't matter.
If order and duplicates matter, then you want a completely different
diff. I wrote one a while back, but not in Python. The algorithm went
something like this:
* Start with pointers to beginnings of both lists.
* See if current string is identical in each list; if so, increment
pointers and iterate.
* Once a difference is found, try to find a re-match by incrementing
pointer #1 until the two match. If the end of the first list is
reached, emit current line of #2 as a deletion and point pointer #1 to
just after where it started.
* On finding a re-match, emit all list #1 lines from first difference
to rematch as insertions.
(Since that was for comparing a source file with a user-supplied
modified source - a sort of diff/patch - a re-match was defined by N
consecutive matching lines, but in this, a re-match can simply be two
identical strings.)
But for the situation given, I'm assuming that a simpler solution suffices.
Chris Angelico
[toc] | [prev] | [next] | [standalone]
| From | Chris Torek <nospam@torek.net> |
|---|---|
| Date | 2011-06-13 18:40 +0000 |
| Subject | Re: What is the most efficient way to compare similar contents in two lists? |
| Message-ID | <it5lj1030fo@news4.newsguy.com> |
| In reply to | #7541 |
In article <mailman.188.1307988677.11593.python-list@python.org> Chris Angelico <rosuav@gmail.com> wrote: >If order and duplicates matter, then you want a completely different >diff. I wrote one a while back, but not in Python. ... If order and duplicates matter, one might want to look into difflib. :-) -- In-Real-Life: Chris Torek, Wind River Systems Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603 email: gmail (figure it out) http://web.torek.net/torek/index.html
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2011-06-14 04:12 +1000 |
| Message-ID | <mailman.189.1307988768.11593.python-list@python.org> |
| In reply to | #7517 |
On Tue, Jun 14, 2011 at 4:11 AM, Chris Angelico <rosuav@gmail.com> wrote: > The algorithm went > something like this: > > * Start with pointers to beginnings of both lists. PS. It wasn't C with pointers and the like; iirc it actually used array indices as the "pointers". Immaterial to the algorithm though. ChrisA
[toc] | [prev] | [standalone]
Back to top | Article view | comp.lang.python
csiph-web