Path: csiph.com!newsfeed.hal-mli.net!feeder3.hal-mli.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!newsfeed.xs4all.nl!newsfeed4.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
Sender: joshua.landau.ws@gmail.com
In-Reply-To: <CAMMy=OveYrvz+SZkbk9WvdrmsQMEx3_HVWSkcatbZC4A5jzD8Q@mail.gmail.com>
References: <lennh4$kpm$1@cabale.usenet-fr.net> <CAPTjJmomgZSFj7TanBn8_qXP4ULhCm85abJ5=2PcikXJkrH6GQ@mail.gmail.com> <CAN1F8qU8AJ+Z-Rb4+wGsYWd_bS5aOA3bsjd5NjF46MUvnecSeQ@mail.gmail.com> <mailman.7473.1393598638.18130.python-list@python.org> <XnsA2E95FA1E1EB6duncanbooth@127.0.0.1> <bnvctpF5vanU1@mid.individual.net> <mailman.7920.1394252278.18130.python-list@python.org> <87eh2d3x8h.fsf_-_@elektro.pacujo.net> <CAGGBd_qU3Zp3A4pymnDQfWynWZwFVrdHJpG=U0WZTap4HiymdA@mail.gmail.com> <lffv32$mqo$1@ger.gmane.org> <CAN1F8qU=2K6ysbpnu-JoUtfWfTToRsOaRek0dSQzjPx_sYzPKQ@mail.gmail.com> <CAMMy=OsYUzULRTRHwytiTszd44f3anz43=2DtGALHR3nBGe8JQ@mail.gmail.com> <CAN1F8qUqVnhjxXEUnub36GEwK_DD+ukKK2GwA5wYku0s=wpdgA@mail.gmail.com> <CAMMy=OveYrvz+SZkbk9WvdrmsQMEx3_HVWSkcatbZC4A5jzD8Q@mail.gmail.com>
From: Joshua Landau <joshua@landau.ws>
Date: Tue, 18 Mar 2014 07:46:42 +0000
Subject: Re: Balanced trees
To: python-list <python-list@python.org>
Content-Type: text/plain; charset=UTF-8
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.8245.1395128850.18130.python-list@python.org>
Lines: 67
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:68492

On 18 March 2014 01:01, Daniel Stutzbach <stutzbach@google.com> wrote:
> I would love to have include macro-benchmarks.  I keep waiting for the PyPy
> benchmark suite to get ported to Python 3...

*grins*

>> "Delete a slice" is fudged from its inclusion of multiplication, which
>> is far faster on blists. I admit that it's not obvious how to fix
>> this.
>
> I could move the initialization into the timed part, similar to what I did
> for sort (see below).  That has downsides too, of course, but it might be an
> improvement.

You could try making a baseline and subtracting it:

    timer("del x[len(x)//4:3*len(x)//4]; x *= 2") - timer("x * 2")

Not ideal, but closer, assuming that the multiplication isn't much
larger than the deletion. Error would be summed.

>> "Sort *" are really unfair because they put initialisation in the
>> timed part
>
> That's a limitation of timeit.  The setup step is only executed once.  If I
> put the initialization there, every sort after the first one would be
> sorting a pre-sorted list.  If you compare the "Create form an iterator" and
> "Sort a random list", you'll see that the initialization cost is dwarfed by
> the sorting cost for n > 15 or so.

This argument is slightly less convincing without the overhead of the
keys. It might be worth doing a subtraction and adding some error-bars
as I suggest above. Nevertheless, I do agree for n > some small n,
which is all that matters anyway.

>> and all have keys.
>
> If you use classes with __lt__ methods instead of keys, the cost is
> dominated by the calls to __lt__.  You're right that I should include both,
> though.

This argument doesn't make sense to me. The only time this happens is
when you have a non-primitive and your transformation gives a
primitive which has optimised comparisons. This typically only happens
when the key is a getitem or getattr, in which case it's just
meaningless overhead. I see little reason to care about the key's cost
in those cases.

> That's definitely a cache issue, which is always a risk with
> micro-benchmarks.
>
> I agree it's more interesting to pick items randomly instead of always
> querying the same index.  The overhead of choice() is kind of a problem,
> though.  Since I'm only plotting up to 10**5, I'd expect these to look more
> or less flat.

You could try jumping around to avoid the cache without using random
numbers. Something like "idx = (idx + LARGE_PRIME) % n" might have less
overhead. Further, the subtraction method would probably work fine for
that.

Also, I don't think the cache is all bad. Chances are a lot of list
accesses have a lot of data locality.

> Thanks for all of the feedback.

Thanks in turn for the module :).