Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder4.news.weretis.net!rt.uk.eu.org!newsfeed.xs4all.nl!newsfeed1.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <fd6dnXTDxbbiIOvMnZ2dnUVZ8qGdnZ2d@brightview.co.uk>
References: <sridnffI6YhxuOjMnZ2dnUVZ7tSdnZ2d@brightview.co.uk> <5175377f$0$29977$c3e8da3$5496439d@news.astraweb.com> <517545F7.5090209@nowhere.org> <5175c12f$0$29977$c3e8da3$5496439d@news.astraweb.com> <e4KdnXzMesF-r-vMnZ2dnUVZ8rWdnZ2d@brightview.co.uk> <51769f96$0$29977$c3e8da3$5496439d@news.astraweb.com> <fd6dnXTDxbbiIOvMnZ2dnUVZ8qGdnZ2d@brightview.co.uk>
From: Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date: Tue, 23 Apr 2013 18:45:41 +0100
Subject: Re: List Count
To: Blind Anagram <blindanagram@nowhere.org>
Content-Type: text/plain; charset=ISO-8859-1
Cc: python-list@python.org
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.987.1366739164.3114.python-list@python.org>
Lines: 75
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:44205

On 23 April 2013 17:57, Blind Anagram <blindanagram@nowhere.org> wrote:
> On 23/04/2013 15:49, Steven D'Aprano wrote:
>> On Tue, 23 Apr 2013 08:05:53 +0100, Blind Anagram wrote:
>>
>>> I did a lot of work comparing the overall performance of the sieve when
>>> using either lists or arrays and I found that lists were a lot faster
>>> for the majority use case when the sieve is not large.
>>
>> And when the sieve is large?
>
> I don't know but since the majority use case is when the sieve is small,
> it makes sense to choose a list.

That's an odd comment given what you said at the start of this thread:

Blind Anagram wrote:
> I would be grateful for any advice people can offer on the fastest way
> to count items in a sub-sequence of a large list.
>
> I have a list of boolean values that can contain many hundreds of
> millions of elements for which I want to count the number of True values
> in a sub-sequence, one from the start up to some value (say hi).


>> I don't actually believe that the bottleneck is the cost of taking a list
>> slice. That's pretty fast, even for huge lists, and all efforts to skip
>> making a copy by using itertools.islice actually ended up slower. But
>> suppose it is the bottleneck. Then *sooner or later* arrays will win over
>> lists, simply because they're smaller.
>
> Maybe you have not noticed that, when I am discussing a huge sieve, I am
> simply pushing a sieve designed primarily for a small sieve lengths to
> the absolute limit.  This is most definitely a minority use case.
>
> In pushing the size of the sieve upwards, it is the slice operation that
> is the first thing that 'breaks'.  This is because the slice can be
> almost as big as the primary array so the OS gets driven into memory
> allocation problems for a sieve that is about half the length it would
> otherwise be. It still works but the cost of the slice once this point
> is reached rises from about 20% to over 600% because of all the paging
> going on.

You keep mentioning that you want it to work with a large sieve. I
would much rather compute the same quantities with a small sieve if
possible. If you were using the Lehmer/Meissel algorithm you would be
able to compute the same quantity (i.e. pi(1e9)) using a much smaller
sieve with 30k items instead of 1e9. that would fit *very* comfortably
in memory and you wouldn't even need to slice the list. Or to put it
another way, you could compute pi(~1e18) using your current sieve
without slicing or paging. If you want to lift the limit on computing
pi(x) this is clearly the way to go.

>
> The unavailable list.count(value, limit) function would hence allow the
> sieve length to be up to twice as large before running into problems and
> would also cut the 20% slice cost I am seeing on smaller sieve lengths.
>
> So, all I was doing in asking for advice was to check whether there is
> an easy way of avoiding the slice copy, not because this is critical,
> but rather because it is a pity to limit the performance because Python
> forces a (strictly unnecessary) copy in order to perform a count within
> a part of a list.
>
> In other words, the lack of a list.count(value, limit) function makes
> Python less effective than it would otherwise be.  I haven't looked at
> Python's C code base but I still wonder if there a good reason for NOT
> providing this?

If you feel that this is a good suggestion for an improvement to
Python consider posting it on python-ideas. I wasn't aware of the
equivalent functionality on strings but I see that the tuple.count()
function is the same as list.count().


Oscar