Groups > comp.lang.python > #67985 > unrolled thread

Re: Tuples and immutability

Started by	Duncan Booth <duncan.booth@invalid.invalid>
First post	2014-03-07 09:33 +0000
Last post	2014-03-08 19:55 -0700
Articles	20 on this page of 108 — 24 participants

Back to article view | Back to comp.lang.python

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: Tuples and immutability Duncan Booth <duncan.booth@invalid.invalid> - 2014-03-07 09:33 +0000
    Re: Tuples and immutability Ben Finney <ben+python@benfinney.id.au> - 2014-03-07 22:04 +1100
    Re: Tuples and immutability Chris Angelico <rosuav@gmail.com> - 2014-03-07 22:11 +1100
    Re: Tuples and immutability Peter Otten <__peter__@web.de> - 2014-03-07 12:38 +0100
    Re: Tuples and immutability Chris Angelico <rosuav@gmail.com> - 2014-03-07 22:45 +1100
    Re: Tuples and immutability Alister <alister.ware@ntlworld.com> - 2014-03-07 11:51 +0000
      Re: Tuples and immutability Ian Kelly <ian.g.kelly@gmail.com> - 2014-03-07 11:23 -0700
    Re: Tuples and immutability Roy Smith <roy@panix.com> - 2014-03-07 08:37 -0500
    Re: Tuples and immutability Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-03-08 15:17 +1300
      Re: Tuples and immutability Ian Kelly <ian.g.kelly@gmail.com> - 2014-03-07 21:17 -0700
        Balanced trees (was: Re: Tuples and immutability) Marko Rauhamaa <marko@pacujo.net> - 2014-03-08 10:34 +0200
          Re: Balanced trees (was: Re: Tuples and immutability) Ian Kelly <ian.g.kelly@gmail.com> - 2014-03-08 04:03 -0700
            Re: Balanced trees Marko Rauhamaa <marko@pacujo.net> - 2014-03-08 13:26 +0200
          Re: Balanced trees (was: Re: Tuples and immutability) Dan Stromberg <drsalists@gmail.com> - 2014-03-08 11:58 -0800
          Re: Balanced trees Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-03-08 20:37 +0000
            Re: Balanced trees Marko Rauhamaa <marko@pacujo.net> - 2014-03-08 23:21 +0200
              Re: Balanced trees Roy Smith <roy@panix.com> - 2014-03-08 17:22 -0500
                Re: Balanced trees Marko Rauhamaa <marko@pacujo.net> - 2014-03-09 11:17 +0200
              Re: Balanced trees Dan Stromberg <drsalists@gmail.com> - 2014-03-08 17:31 -0800
                Re: Balanced trees Marko Rauhamaa <marko@pacujo.net> - 2014-03-09 11:27 +0200
                  Re: Balanced trees Dan Stromberg <drsalists@gmail.com> - 2014-03-09 14:20 -0700
                    Re: Balanced trees Marko Rauhamaa <marko@pacujo.net> - 2014-03-09 23:32 +0200
                      Re: Balanced trees Dan Stromberg <drsalists@gmail.com> - 2014-03-09 14:37 -0700
                        Re: Balanced trees Marko Rauhamaa <marko@pacujo.net> - 2014-03-09 23:43 +0200
                          Re: Balanced trees Dan Stromberg <drsalists@gmail.com> - 2014-03-09 15:08 -0700
                            Re: Balanced trees Marko Rauhamaa <marko@pacujo.net> - 2014-03-10 00:24 +0200
                          Re: Balanced trees Ian Kelly <ian.g.kelly@gmail.com> - 2014-03-09 18:04 -0600
                            Re: Balanced trees Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-03-10 03:24 +0000
                      Re: Balanced trees Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-03-10 03:24 +0000
                        Re: Balanced trees Marko Rauhamaa <marko@pacujo.net> - 2014-03-10 08:16 +0200
                          Re: Balanced trees Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-03-10 08:53 +0000
                            Re: Balanced trees Marko Rauhamaa <marko@pacujo.net> - 2014-03-10 11:41 +0200
                              Re: Balanced trees Ned Batchelder <ned@nedbatchelder.com> - 2014-03-10 06:57 -0400
                                Re: Balanced trees Rustom Mody <rustompmody@gmail.com> - 2014-03-10 09:01 -0700
                                  Re: Balanced trees Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-03-11 02:02 +0000
                                    Re: Balanced trees Roy Smith <roy@panix.com> - 2014-03-10 22:20 -0400
                                      Re: Balanced trees Chris Angelico <rosuav@gmail.com> - 2014-03-11 13:29 +1100
                            Re: Balanced trees Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-03-10 09:51 +0000
                          Re: Balanced trees Roy Smith <roy@panix.com> - 2014-03-10 09:59 -0400
                            Re: Balanced trees Chris Angelico <rosuav@gmail.com> - 2014-03-11 03:20 +1100
                            Re: Balanced trees Chris Angelico <rosuav@gmail.com> - 2014-03-11 03:24 +1100
                              Re: Balanced trees Marko Rauhamaa <marko@pacujo.net> - 2014-03-10 19:08 +0200
                                Re: Balanced trees Chris Angelico <rosuav@gmail.com> - 2014-03-11 04:17 +1100
                                  Re: Balanced trees Marko Rauhamaa <marko@pacujo.net> - 2014-03-10 19:34 +0200
                                    Re: Balanced trees Ian Kelly <ian.g.kelly@gmail.com> - 2014-03-13 12:40 -0600
                                    Re: Balanced trees Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-03-13 23:57 +0000
                                      Re: Balanced trees Dan Stromberg <drsalists@gmail.com> - 2014-03-13 20:12 -0700
                                        Re: Balanced trees Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-03-14 05:02 +0000
                                Re: Balanced trees Roy Smith <roy@panix.com> - 2014-03-10 19:24 -0400
                                  Re: Balanced trees Chris Angelico <rosuav@gmail.com> - 2014-03-11 10:27 +1100
                                  Re: Balanced trees Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-03-11 01:26 +0000
                                    Re: Balanced trees Chris Angelico <rosuav@gmail.com> - 2014-03-11 12:45 +1100
                                    Re: Balanced trees Ian Kelly <ian.g.kelly@gmail.com> - 2014-03-10 21:38 -0600
                                    Re: Balanced trees Chris Angelico <rosuav@gmail.com> - 2014-03-11 15:28 +1100
                                      Re: Balanced trees "Rhodri James" <rhodri@wildebst.org.uk> - 2014-03-12 00:57 +0000
                                    Re: Balanced trees Marko Rauhamaa <marko@pacujo.net> - 2014-03-11 12:12 +0200
                                      Re: Balanced trees alex23 <wuwei23@gmail.com> - 2014-03-12 10:13 +1000
                                  Re: Balanced trees Alister <alister.ware@ntlworld.com> - 2014-03-11 09:25 +0000
                                  Re: Balanced trees Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2014-03-12 10:08 +0100
                            Re: Balanced trees Tim Chase <python.list@tim.thechases.com> - 2014-03-10 11:33 -0500
                            Re: Balanced trees Chris Angelico <rosuav@gmail.com> - 2014-03-11 03:39 +1100
                            Re: Balanced trees Dan Stromberg <drsalists@gmail.com> - 2014-03-10 18:05 -0700
                              Re: Balanced trees Roy Smith <roy@panix.com> - 2014-03-10 22:13 -0400
                        Re: Balanced trees Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2014-03-10 19:57 -0400
          Re: Balanced trees Joshua Landau <joshua@landau.ws> - 2014-03-15 01:13 +0000
            Re: Balanced trees Marko Rauhamaa <marko@pacujo.net> - 2014-03-18 00:05 +0200
              Re: Balanced trees Dan Stromberg <drsalists@gmail.com> - 2014-03-18 12:26 -0700
                Re: Balanced trees Marko Rauhamaa <marko@pacujo.net> - 2014-03-18 22:55 +0200
                  Re: Balanced trees Dan Stromberg <drsalists@gmail.com> - 2014-03-18 14:45 -0700
                    Re: Balanced trees Marko Rauhamaa <marko@pacujo.net> - 2014-03-19 00:03 +0200
                      Re: Balanced trees Dan Stromberg <drsalists@gmail.com> - 2014-03-18 15:21 -0700
                        Re: Balanced trees Marko Rauhamaa <marko@pacujo.net> - 2014-03-19 01:11 +0200
                          Re: Balanced trees Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-03-19 01:15 +0000
                            Re: Balanced trees Marko Rauhamaa <marko@pacujo.net> - 2014-03-19 10:49 +0200
                              Re: Balanced trees Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-03-19 13:42 +0000
                                Re: Balanced trees Marko Rauhamaa <marko@pacujo.net> - 2014-03-19 15:54 +0200
                                Re: Balanced trees Roy Smith <roy@panix.com> - 2014-03-19 10:06 -0400
                        Re: Balanced trees Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-03-19 01:15 +0000
                          Re: Balanced trees Ethan Furman <ethan@stoneleaf.us> - 2014-03-19 08:15 -0700
                    Re: Balanced trees "Rhodri James" <rhodri@wildebst.org.uk> - 2014-03-20 02:16 +0000
                      Re: Balanced trees Dan Stromberg <drsalists@gmail.com> - 2014-03-19 19:34 -0700
                  Re: Balanced trees Chris Kaynor <ckaynor@zindagigames.com> - 2014-03-18 18:02 -0700
              Re: Balanced trees Daniel Stutzbach <stutzbach@google.com> - 2014-03-18 13:18 -0700
          blist in standard library  (was Re: Balanced trees) Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-03-15 12:31 +0000
          Re: Balanced trees Daniel Stutzbach <stutzbach@google.com> - 2014-03-17 14:16 -0700
          Re: Balanced trees Joshua Landau <joshua@landau.ws> - 2014-03-18 00:08 +0000
          Re: Balanced trees Daniel Stutzbach <stutzbach@google.com> - 2014-03-17 18:01 -0700
          Re: Balanced trees Joshua Landau <joshua@landau.ws> - 2014-03-18 07:46 +0000
        Re: Tuples and immutability Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-03-09 13:40 +1300
          Re: Tuples and immutability Ian Kelly <ian.g.kelly@gmail.com> - 2014-03-08 19:39 -0700
            Re: Tuples and immutability Marko Rauhamaa <marko@pacujo.net> - 2014-03-09 11:35 +0200
            Re: Tuples and immutability Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-03-10 11:03 +1300
              Re: Tuples and immutability Terry Reedy <tjreedy@udel.edu> - 2014-03-09 19:00 -0400
              Re: Tuples and immutability Ian Kelly <ian.g.kelly@gmail.com> - 2014-03-09 17:42 -0600
                Re: Tuples and immutability Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-03-10 02:37 +0000
                  Re: Tuples and immutability Ian Kelly <ian.g.kelly@gmail.com> - 2014-03-10 02:35 -0600
                    Re: Tuples and immutability Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-03-10 09:13 +0000
                    Re: Tuples and immutability Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-03-11 18:15 +1300
                Re: Tuples and immutability Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-03-11 18:03 +1300
                  Re: Tuples and immutability Ian Kelly <ian.g.kelly@gmail.com> - 2014-03-11 04:39 -0600
                    Re: Tuples and immutability Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-03-11 16:46 +0000
                      Re: Tuples and immutability Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-03-12 10:23 +1300
                      Re: Tuples and immutability Ian Kelly <ian.g.kelly@gmail.com> - 2014-03-11 17:06 -0600
                        Re: Tuples and immutability Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-03-12 23:20 +0000
                          Re: Tuples and immutability Ian Kelly <ian.g.kelly@gmail.com> - 2014-03-12 19:35 -0600
                          Re: Tuples and immutability Terry Reedy <tjreedy@udel.edu> - 2014-03-12 22:09 -0400
        Re: Tuples and immutability Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-03-09 13:45 +1300
          Re: Tuples and immutability Ian Kelly <ian.g.kelly@gmail.com> - 2014-03-08 19:55 -0700

Page 4 of 6 — ← Prev page 1 2 3 [4] 5 6 Next page →

#68161 — Re: Balanced trees

From	Chris Angelico <rosuav@gmail.com>
Date	2014-03-11 03:39 +1100
Subject	Re: Balanced trees
Message-ID	<mailman.8011.1394469568.18130.python-list@python.org>
In reply to	#68143

On Tue, Mar 11, 2014 at 3:33 AM, Tim Chase
<python.list@tim.thechases.com> wrote:
> On 2014-03-11 03:24, Chris Angelico wrote:
>> Imagine, worst case, all one million records have the same
>> song/user/add_time and you need to do twenty comparisons involving
>> four fields. That's gotta be worse than one hashing of five fields.
>
> And if you have one million songs that are indistinguishable, you
> have Pop Top-40 playlists.  And that is *definitely* worse than
> hashes OR trees.  :-)

LOL!

That's not a Comp Sci problem, I'm afraid. But there is a solution:

https://www.youtube.com/watch?v=jCZzvZEmq_A
https://www.youtube.com/watch?v=DEr_5C6JYB4

A different sort of music.

ChrisA

[toc] | [prev] | [next] | [standalone]

#68186 — Re: Balanced trees

From	Dan Stromberg <drsalists@gmail.com>
Date	2014-03-10 18:05 -0700
Subject	Re: Balanced trees
Message-ID	<mailman.8031.1394499924.18130.python-list@python.org>
In reply to	#68143

On Mon, Mar 10, 2014 at 6:59 AM, Roy Smith <roy@panix.com> wrote:
> On the other hand, log n, for n = 1 million, is just 20.  It's not hard
> to imagine a hash function which costs 20x what a node traversal does,
> in which case, the log n lookup is ahead for all n < 1 million.

FWIW, both the hash table and the tree will have constants.  So a tree
would be c*20 in its most significant term, and the hash table would
be d*1 in its.  The real-world performance depends quite a bit on
those constants at small values of n.  I don't really consider a
million all that big, but the meaning of "big" of course depends.

[toc] | [prev] | [next] | [standalone]

#68191 — Re: Balanced trees

From	Roy Smith <roy@panix.com>
Date	2014-03-10 22:13 -0400
Subject	Re: Balanced trees
Message-ID	<roy-19EAAB.22135410032014@news.panix.com>
In reply to	#68186

In article <mailman.8031.1394499924.18130.python-list@python.org>,
 Dan Stromberg <drsalists@gmail.com> wrote:

> On Mon, Mar 10, 2014 at 6:59 AM, Roy Smith <roy@panix.com> wrote:
> > On the other hand, log n, for n = 1 million, is just 20.  It's not hard
> > to imagine a hash function which costs 20x what a node traversal does,
> > in which case, the log n lookup is ahead for all n < 1 million.
> 
> FWIW, both the hash table and the tree will have constants.  So a tree
> would be c*20 in its most significant term, and the hash table would
> be d*1 in its.  The real-world performance depends quite a bit on
> those constants at small values of n.  I don't really consider a
> million all that big, but the meaning of "big" of course depends.

Well, the largest disk volume I can configure in AWS is 1 TB.  So, I 
guess we can take that to be "big".

Assuming 1-character strings, and no overhead (both strange assumptions, 
but it makes the math easier), that's 10^12 nodes in our binary tree.  
That's still only 40 layers deep.  Log n is your friend.

[toc] | [prev] | [next] | [standalone]

#68183 — Re: Balanced trees

From	Dennis Lee Bieber <wlfraed@ix.netcom.com>
Date	2014-03-10 19:57 -0400
Subject	Re: Balanced trees
Message-ID	<mailman.8028.1394495825.18130.python-list@python.org>
In reply to	#68117

On 10 Mar 2014 03:24:08 GMT, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> declaimed the following:

>
>Proof: I create a hash table that accepts unsigned bytes as keys, where 
>the hash is the value of the byte. So byte 0x42 hashes to 0x42, or 
>decimal 66. I give the hash table 256 slots, numbered from 0 to 255. 
>Every hash takes constant time, there are no collisions at all, lookups, 
>insertions and deletions all take constant time regardless of how full 
>the table is. The best, worst and average cases are not just O(1) but 
>also ?(1).
>
	What you've defined, however, can be replaced with straight list
indexing <G> No hash function needed.

	If there is a finite (and small) number of possible keys, the list will
likely beat the hash.

	OTOH, when the keys exceed reasonable list storage, the hash becomes
useful -- but one now has to be concerned with collisions.

	Back in the day, I was somewhat pleased to discover that the Amiga
directory structure was a real world example of something I'd only seen as
a class assignment in my data structures class: a hashed-head
multiple-linked list.
-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
    wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/

[toc] | [prev] | [next] | [standalone]

#68364 — Re: Balanced trees

From	Joshua Landau <joshua@landau.ws>
Date	2014-03-15 01:13 +0000
Subject	Re: Balanced trees
Message-ID	<mailman.8154.1394846042.18130.python-list@python.org>
In reply to	#68026

On 8 March 2014 20:37, Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:
> I've found this link useful http://kmike.ru/python-data-structures/
>
> I also don't want all sorts of data structures added to the Python library.
> I believe that there are advantages to leaving specialist data structures on
> pypi or other sites, plus it means Python in a Nutshell can still fit in
> your pocket and not a 40 ton articulated lorry, unlike the Java equivalent.

The thing we really need is for the blist containers to become stdlib
(but not to replace the current list implementation). The rejected PEP
(http://legacy.python.org/dev/peps/pep-3128/) misses a few important
points, largely in how the "log(n)" has a really large base:
random.choice went from 1.2µs to 1.6µs from n=1 to n=10⁸, vs 1.2µs for
a standard list.

Further, it's worth considering a few advantages:

* copy is O(1), allowing code to avoid mutation by just copying its
input, which is good practice.

* FIFO is effectively O(1), as the time just about doubles from n=1 to
n=10⁸ so will never actually branch that much. There is still a speed
benefit of collections.deque, but it's much, much less significant.
This is very useful when considering usage as a multi-purpose data
structure, and removes demand for explicit linked lists (which have
foolishly been reimplemented loads of times).

* It reduces demand for trees:
    * There are efficient implementations of sortedlist, sortedset and
sorteddict.
    * Slicing, slice assignment and slice deletion are really fast.
    * Addition of lists is sublinear. Instead of
"list(itertools.chain(...))", one can add in a loop and end up
*faster*.

I think blist isn't very popular not because it isn't really good, but
because it isn't a specialised structure. It is, however, almost there
for almost every circumstance. This can help keep the standard library
clean, especially of tree data structures.

Here's what we kill:

* Linked lists and doubly-linked lists, which are scarily popular for
whatever reason. Sometimes people claim that collections.deque isn't
powerful enough for whatever they want, and blist will almost
definitely sate those cases.

* Balanced trees, with blist.sortedlist. This is actually needed right now.

* Poor performance in the cases where a lot of list merging and pruning happens.

* Most uses of bisect.

* Some instances where two data structures are used in parallel in
order to keep performance fast on disparate operations (like `x in y`
and `y[i]`).

Now, I understand there are downsides to blist. Particularly, I've
looked through the "benchmarks" and they seem untruthful. Further,
we'd need a maintainer. Finally, nobody jumps at blists because
they're rarely the obvious solution. Rather, they attempt to be a
different general solution. Hopefully, though, a stdlib inclusion
could make them a lot more obvious, and support in some current
libraries could make them feel more at home.

I don't know whether this is a good idea, but I do feel that it is
more promising and general than having a graph in the standard
library.

[toc] | [prev] | [next] | [standalone]

#68464 — Re: Balanced trees

From	Marko Rauhamaa <marko@pacujo.net>
Date	2014-03-18 00:05 +0200
Subject	Re: Balanced trees
Message-ID	<87mwgoqy4k.fsf@elektro.pacujo.net>
In reply to	#68364

Joshua Landau <joshua@landau.ws>:

> The thing we really need is for the blist containers to become stdlib
> (but not to replace the current list implementation).

Very interesting. Downloaded blist but didn't compile it yet. It *could*
be the missing link.

I would *love* to see some comparative performance results between
blist.sorteddict and an AVL tree. A B+ tree could be nicer to the CPU
cache, but then again, the cache line is only a few pointers wide and
there appears to be a lot of shoving stuff left and right -- based on a
10-second "analysis" of the code:

        while (src >= stop)
                *dst-- = *src--;

Personally, I find it a bit odd to place lists at the center of the
proposal and have trees come out as a side effect. I would rather see it
the other way round: sorteddict -> sortedset et al.

> * It reduces demand for trees:

I would hope it would satisfy the demand for trees.

> nobody jumps at blists because they're rarely the obvious solution.

I haven't jumped at them because they were nowhere to be found on <URL:
http://docs.python.org/3/genindex-B.html>.


Marko

[toc] | [prev] | [next] | [standalone]

#68513 — Re: Balanced trees

From	Dan Stromberg <drsalists@gmail.com>
Date	2014-03-18 12:26 -0700
Subject	Re: Balanced trees
Message-ID	<mailman.8257.1395170778.18130.python-list@python.org>
In reply to	#68464

On Mon, Mar 17, 2014 at 3:05 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
> Joshua Landau <joshua@landau.ws>:
>
>> The thing we really need is for the blist containers to become stdlib
>> (but not to replace the current list implementation).
>
> Very interesting. Downloaded blist but didn't compile it yet. It *could*
> be the missing link.
>
> I would *love* to see some comparative performance results between
> blist.sorteddict and an AVL tree.

I added blist.sorteddict and removed (temporarily) Pypy and Jython.
The results are at
http://stromberg.dnsalias.org/~strombrg/python-tree-and-heap-comparison/2014-03/

In short, blist.sorteddict didn't do that well, despite being in C.
In the random workloads, blist.sorteddict was dead last.  In the
sequential workloads blist.sorteddict fell somewhere in the middle.

I excluded Pypy and Jython because blist.sorteddict probably doesn't
run on them.

HTH

[toc] | [prev] | [next] | [standalone]

#68518 — Re: Balanced trees

From	Marko Rauhamaa <marko@pacujo.net>
Date	2014-03-18 22:55 +0200
Subject	Re: Balanced trees
Message-ID	<8738ifqlaw.fsf@elektro.pacujo.net>
In reply to	#68513

Dan Stromberg <drsalists@gmail.com>:

> The results are at
> http://stromberg.dnsalias.org/~strombrg/python-tree-and-heap-comparison/2014-03/

Unfortunately I'm having a hard time understanding the results.

The 50/50 get/set ratio is most interesting to me.

I'm seeing (under cpython-3.3):


    Size: 1048576, duration:  75.3, dictionary type: dict
    [...]
    Size:  262144, duration:  66.1, dictionary type: AVL_tree
    [...]
    Size:   65536, duration:  77.3, dictionary type: blist.sorteddict 


What does it mean?


Marko

[toc] | [prev] | [next] | [standalone]

#68520 — Re: Balanced trees

From	Dan Stromberg <drsalists@gmail.com>
Date	2014-03-18 14:45 -0700
Subject	Re: Balanced trees
Message-ID	<mailman.8263.1395179161.18130.python-list@python.org>
In reply to	#68518

On Tue, Mar 18, 2014 at 1:55 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
> Dan Stromberg <drsalists@gmail.com>:
>
>> The results are at
>> http://stromberg.dnsalias.org/~strombrg/python-tree-and-heap-comparison/2014-03/
>
> Unfortunately I'm having a hard time understanding the results.
>
> The 50/50 get/set ratio is most interesting to me.
>
> I'm seeing (under cpython-3.3):
>
>
>     Size: 1048576, duration:  75.3, dictionary type: dict
>     [...]
>     Size:  262144, duration:  66.1, dictionary type: AVL_tree
>     [...]
>     Size:   65536, duration:  77.3, dictionary type: blist.sorteddict
>
>
> What does it mean?

dict was able to do 1048576 operations on a dictionary before taking
more than 120 seconds to complete - it took 75.3 seconds to do 1048576
operations.

AVL_tree was able to do 262144 operations on a dictionary before
taking more than 120 seconds to complete - it took 66.1 seconds to do
262144 operations.

blist.sorteddict was able to do 65536 operations on a dictionary
before taking more than 120 seconds to complete  - it took 77.3
seconds to do 65536 operations.

For the 50/50 workload; the "operations" were half adding key, value
pairs; and half lookups of values by keys we know are in the
dictionary.

I used to run all the dictionaries for as long as it took to do 4
million operations, but for (EG) unbalanced binary trees, that would
take far too long in the ordered tests, so I changed the code to try a
given tree type until the time for an operation became prohibitive.

If you look at the graphs (I have to admit they've become a little
cluttered), you can see the slower trees "escaping" rapidly (exceeding
the 120 second threshold), while the better performing trees grow more
slowly and are allowed to continue proving themselves longer.
Inspecting these graphs may help in developing an intuition for how
the tests were conducted.

The code implementing this method of testing is in
http://stromberg.dnsalias.org/svn/python-tree-and-heap-comparison/trunk/tester

HTH

[toc] | [prev] | [next] | [standalone]

#68522 — Re: Balanced trees

From	Marko Rauhamaa <marko@pacujo.net>
Date	2014-03-19 00:03 +0200
Subject	Re: Balanced trees
Message-ID	<87wqfrp3jk.fsf@elektro.pacujo.net>
In reply to	#68520

Dan Stromberg <drsalists@gmail.com>:

> dict was able to do 1048576 operations on a dictionary before taking
> more than 120 seconds to complete - it took 75.3 seconds to do 1048576
> operations.
>
> AVL_tree was able to do 262144 operations on a dictionary before
> taking more than 120 seconds to complete - it took 66.1 seconds to do
> 262144 operations.
>
> blist.sorteddict was able to do 65536 operations on a dictionary
> before taking more than 120 seconds to complete - it took 77.3 seconds
> to do 65536 operations.

For a proper comparison, I'd like a fixed, identical dataset and set of
operations run against each data structure.

How about this test program:

   generate random datasets of 100, 10000, 1000000 and 100000000 elements
   generate random testset of 1000000 elements
   for each data structure:
        for each dataset:
             initialize data structure with dataset
             head, tail = testset[:100], testset[100:]
             t0 = current timestamp
             for each element in head:
                  add element to data structure
             for each element in tail:
                  add element to data structure
                  append element to head
                  remove head.pop(0) from data structure
             for each element in head:
                  remove element from data structure
             t1 = current timestamp
             report data structure type, dataset size, t1 - t0


Marko

[toc] | [prev] | [next] | [standalone]

#68523 — Re: Balanced trees

From	Dan Stromberg <drsalists@gmail.com>
Date	2014-03-18 15:21 -0700
Subject	Re: Balanced trees
Message-ID	<mailman.8264.1395181297.18130.python-list@python.org>
In reply to	#68522

On Tue, Mar 18, 2014 at 3:03 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
> Dan Stromberg <drsalists@gmail.com>:
> For a proper comparison, I'd like a fixed, identical dataset and set of
> operations run against each data structure.
>
> How about this test program:

I used to do essentially this, but it was time-prohibitive and
produced harder-to-read graphs - harder to read because the enormous
values of the bad trees were dwarfing the values of the good trees.

Imagine doing 100000000 operation tests for the unbalanced binary
tree. For a series of random keys, it would do quite well (probably
second only to dict), but for a series of sequential keys it would
take longer than anyone would reasonably want to wait because it's
basically a storage-inefficient linked list.

Rather than throw out unbalanced binary tree altogether, it makes more
sense to run it until it gets "too slow".

The workload+interpreter pairs are all tested the same way, it's just
that the ones that are doing badly are thrown out before they're able
to get a lot worse. Studying the graphs will likely help develop an
intuition for what's happening.

[toc] | [prev] | [next] | [standalone]

#68525 — Re: Balanced trees

From	Marko Rauhamaa <marko@pacujo.net>
Date	2014-03-19 01:11 +0200
Subject	Re: Balanced trees
Message-ID	<87r45zf6fu.fsf@elektro.pacujo.net>
In reply to	#68523

Dan Stromberg <drsalists@gmail.com>:

> On Tue, Mar 18, 2014 at 3:03 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> Dan Stromberg <drsalists@gmail.com>:
>> For a proper comparison, I'd like a fixed, identical dataset and set
>> of operations run against each data structure.
>>
>> How about this test program:
>
> I used to do essentially this, but it was time-prohibitive and
> produced harder-to-read graphs - harder to read because the enormous
> values of the bad trees were dwarfing the values of the good trees.
>
> Imagine doing 100000000 operation tests for the unbalanced binary
> tree. For a series of random keys, it would do quite well (probably
> second only to dict), but for a series of sequential keys it would
> take longer than anyone would reasonably want to wait because it's
> basically a storage-inefficient linked list.
>
> Rather than throw out unbalanced binary tree altogether, it makes more
> sense to run it until it gets "too slow".

I disagree strongly. You should throw out unbalanced binary trees and
linked lists and the like and concentrate on solid contenders.

But it's your test. You do as you like.

Anyway, even a well-thought-out test is subject to all kinds of
criticisms due to the CPU architecture, the distribution of the key
values, the quality of the data structure implementation etc etc.

Marko

[toc] | [prev] | [next] | [standalone]

#68530 — Re: Balanced trees

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2014-03-19 01:15 +0000
Subject	Re: Balanced trees
Message-ID	<5328efab$0$29994$c3e8da3$5496439d@news.astraweb.com>
In reply to	#68525

On Wed, 19 Mar 2014 01:11:33 +0200, Marko Rauhamaa wrote:

> Dan Stromberg <drsalists@gmail.com>:

>> Rather than throw out unbalanced binary tree altogether, it makes more
>> sense to run it until it gets "too slow".
> 
> I disagree strongly. You should throw out unbalanced binary trees and
> linked lists and the like and concentrate on solid contenders.

If you are in a position to randomize the data before storing it in the 
tree, an unbalanced binary tree is a solid contender. The overhead will 
likely be much less than any balanced tree, and the probability of 
degenerate behaviour negligible for any amount of data big enough to 
really matter.

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#68539 — Re: Balanced trees

From	Marko Rauhamaa <marko@pacujo.net>
Date	2014-03-19 10:49 +0200
Subject	Re: Balanced trees
Message-ID	<87pplipo82.fsf@elektro.pacujo.net>
In reply to	#68530

Steven D'Aprano <steve+comp.lang.python@pearwood.info>:

> If you are in a position to randomize the data before storing it in
> the tree, an unbalanced binary tree is a solid contender.

Applications that can assume randomly distributed data are exceedingly
rare and even then, you can't easily discount the possibility of
worst-case ordering.

In fact, Dan himself said unbalanced trees performed so badly he
couldn't test them satisfactorily.


Marko

[toc] | [prev] | [next] | [standalone]

#68551 — Re: Balanced trees

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2014-03-19 13:42 +0000
Subject	Re: Balanced trees
Message-ID	<53299eac$0$29994$c3e8da3$5496439d@news.astraweb.com>
In reply to	#68539

On Wed, 19 Mar 2014 10:49:33 +0200, Marko Rauhamaa wrote:

> Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
> 
>> If you are in a position to randomize the data before storing it in the
>> tree, an unbalanced binary tree is a solid contender.
> 
> Applications that can assume randomly distributed data are exceedingly
> rare 

Please re-read what I wrote. I didn't say "if your data comes to you 
fully randomized". I said "if you are in a position to randomize the data 
before storing it". In other words, if you actively randomize the data 
yourself.

> and even then, you can't easily discount the possibility of
> worst-case ordering.

Of course you can. If you have a million items, then the odds that those 
million items happen to be sorted (the worst-case order) are 1 in a 
million factorial. That's a rather small number, small enough that we can 
discount it from ever happening, not in a million lifetimes of the 
Universe.

Of course, you would be right to point out that one would also like to 
avoid *nearly* worst-case ordering. Nevertheless, there are Vastly more 
ways that the data will be sufficiently randomized as to avoid degenerate 
performance than the worst-case poor performance.

> In fact, Dan himself said unbalanced trees performed so badly he
> couldn't test them satisfactorily.

You're misrepresenting what Dan said. What he actually said was that 
unbalanced trees perform second only to dicts with random data, only with 
sorted data do they perform badly. His exact words were:

"For a series of random keys, it would do quite well (probably second only
to dict), but for a series of sequential keys it would take longer than
anyone would reasonably want to wait"

-- 
Steven D'Aprano
http://import-that.dreamwidth.org/

[toc] | [prev] | [next] | [standalone]

#68552 — Re: Balanced trees

From	Marko Rauhamaa <marko@pacujo.net>
Date	2014-03-19 15:54 +0200
Subject	Re: Balanced trees
Message-ID	<87siqe1efz.fsf@elektro.pacujo.net>
In reply to	#68551

Steven D'Aprano <steve+comp.lang.python@pearwood.info>:

> Please re-read what I wrote. I didn't say "if your data comes to you
> fully randomized". I said "if you are in a position to randomize the
> data before storing it". In other words, if you actively randomize the
> data yourself.

Yes, you could implement a "hash table" that way.


Marko

[toc] | [prev] | [next] | [standalone]

#68553 — Re: Balanced trees

From	Roy Smith <roy@panix.com>
Date	2014-03-19 10:06 -0400
Subject	Re: Balanced trees
Message-ID	<roy-0F0BB4.10062919032014@news.panix.com>
In reply to	#68551

In article <53299eac$0$29994$c3e8da3$5496439d@news.astraweb.com>,
 Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:

> If you have a million items, then the odds that those 
> million items happen to be sorted (the worst-case order) are 1 in a 
> million factorial. That's a rather small number, small enough that we can 
> discount it from ever happening, not in a million lifetimes of the 
> Universe.

If the items are coming from some inherently random process, then I 
agree.  But, not all data sources are random.

Imagine you had a web site, and wanted to store various bits of data 
from the stream of input requests.  You decided that the data structure 
you were going to use was a balanced tree.  If your keys were, say, a 
hash of the client's IP address, then it's a pretty good guess they're 
random.  But, what if the keys were the time the request was received?  
Those are obviously not random, and using them as a keys in a balanced 
tree would give you sub-optimum performance.

This is not a hypothetical scenario.  Songza uses MongoDB as our 
database.  The indexes are balanced trees.  One of our biggest 
collections has a multi-component key, one of the components being the 
request time truncated to the hour.  Insertion time into that collection 
has an obvious sawtooth shape, with performance degrading as each hour 
progresses and jumping back up as the time rolls over to the next hour.  
This is due (we believe :-)) to the constant rebalancing of the index 
trees.

Almost-sorted data sets are very common.  For example, here's a pipeline 
I run a lot (from memory, could have gotten some detail wrong):

grep pattern lofgile | awk '{print $7}' | sed 's/:[0-9][0-9]$//' | sort 
| uniq -c

Field 7 is the timestamp for when a request came in.  What I'm doing 
here is counting how many requests of a certain type came in during each 
minute of the day.  Logging isn't exactly in chronological order, so I 
need to sort the times before I count them.  But, it's in *almost* 
chronological order; a sort which had pathologically bad behavior on 
almost sorted data would be unusable here.

[toc] | [prev] | [next] | [standalone]

#68531 — Re: Balanced trees

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2014-03-19 01:15 +0000
Subject	Re: Balanced trees
Message-ID	<5328efc8$0$29994$c3e8da3$5496439d@news.astraweb.com>
In reply to	#68523

On Tue, 18 Mar 2014 15:21:28 -0700, Dan Stromberg wrote:

> On Tue, Mar 18, 2014 at 3:03 PM, Marko Rauhamaa <marko@pacujo.net>
> wrote:
>> Dan Stromberg <drsalists@gmail.com>:
>> For a proper comparison, I'd like a fixed, identical dataset and set of
>> operations run against each data structure.
>>
>> How about this test program:
> 
> I used to do essentially this, but it was time-prohibitive and produced
> harder-to-read graphs - harder to read because the enormous values of
> the bad trees were dwarfing the values of the good trees.

A log graph may be the solution to that: graph the log of the time rather 
than time itself.



-- 
Steven

[toc] | [prev] | [next] | [standalone]

#68558 — Re: Balanced trees

From	Ethan Furman <ethan@stoneleaf.us>
Date	2014-03-19 08:15 -0700
Subject	Re: Balanced trees
Message-ID	<mailman.8282.1395243639.18130.python-list@python.org>
In reply to	#68531

On 03/18/2014 06:15 PM, Steven D'Aprano wrote:
> On Tue, 18 Mar 2014 15:21:28 -0700, Dan Stromberg wrote:
>
>> On Tue, Mar 18, 2014 at 3:03 PM, Marko Rauhamaa <marko@pacujo.net>
>> wrote:
>>> Dan Stromberg <drsalists@gmail.com>:
>>> For a proper comparison, I'd like a fixed, identical dataset and set of
>>> operations run against each data structure.
>>>
>>> How about this test program:
>>
>> I used to do essentially this, but it was time-prohibitive and produced
>> harder-to-read graphs - harder to read because the enormous values of
>> the bad trees were dwarfing the values of the good trees.
>
> A log graph may be the solution to that: graph the log of the time rather
> than time itself.

I don't think that will solve the problem of not wanting to wait three days for the test to finish.  ;)

--
~Ethan~

[toc] | [prev] | [next] | [standalone]

#68565 — Re: Balanced trees

From	"Rhodri James" <rhodri@wildebst.org.uk>
Date	2014-03-20 02:16 +0000
Subject	Re: Balanced trees
Message-ID	<op.xcz0mccp5079vu@gnudebeest>
In reply to	#68520

On Tue, 18 Mar 2014 21:45:52 -0000, Dan Stromberg <drsalists@gmail.com>  
wrote:

> blist.sorteddict was able to do 65536 operations on a dictionary
> before taking more than 120 seconds to complete  - it took 77.3
> seconds to do 65536 operations.

65536 is a suspiciously round number.  You might want to double-
check that there's no 16-bit overflow causing something unexpected.

-- 
Rhodri James *-* Wildebeest Herder to the Masses

[toc] | [prev] | [next] | [standalone]

Page 4 of 6 — ← Prev page 1 2 3 [4] 5 6 Next page →

csiph-web

Re: Tuples and immutability

Contents

#68161 — Re: Balanced trees

#68186 — Re: Balanced trees

#68191 — Re: Balanced trees

#68183 — Re: Balanced trees

#68364 — Re: Balanced trees

#68464 — Re: Balanced trees

#68513 — Re: Balanced trees

#68518 — Re: Balanced trees

#68520 — Re: Balanced trees

#68522 — Re: Balanced trees

#68523 — Re: Balanced trees

#68525 — Re: Balanced trees

#68530 — Re: Balanced trees

#68539 — Re: Balanced trees

#68551 — Re: Balanced trees

#68552 — Re: Balanced trees

#68553 — Re: Balanced trees

#68531 — Re: Balanced trees

#68558 — Re: Balanced trees

#68565 — Re: Balanced trees