From: Blind Anagram
Newsgroups: comp.lang.python
Subject: Re: List Count
Date: Mon, 22 Apr 2013 15:15:19 +0100
Message-ID: <517545F7.5090209@nowhere.org>
References: <5175377f$0$29977$c3e8da3$5496439d@news.astraweb.com>

On 22/04/2013 14:13, Steven D'Aprano wrote:
> On Mon, 22 Apr 2013 12:58:20 +0100, Blind Anagram wrote:
>
>> I would be grateful for any advice people can offer on the fastest way
>> to count items in a sub-sequence of a large list.
>>
>> I have a list of boolean values that can contain many hundreds of
>> millions of elements for which I want to count the number of True values
>> in a sub-sequence, one from the start up to some value (say hi).
>>
>> I am currently using:
>>
>>    sieve[:hi].count(True)
>>
>> but I believe this may be costly because it copies a possibly large part
>> of the sieve.
>
> Have you timed it? Because Python is a high-level language, it is rarely
> obvious what code will be fast.
> Yes, sieve[:hi] will copy the first hi
> entries, but that's likely to be fast, basically just a memcopy, unless
> sieve is huge and memory is short. In other words, unless your sieve is
> so huge that the operating system cannot find enough memory for it,
> making a copy is likely to be relatively insignificant.
>
> I've just tried seven different techniques to "optimize" this, and the
> simplest, most obvious technique is by far the fastest. Here are the
> seven different code snippets I measured, with results:
>
>
> sieve[:hi].count(True)
> sum(sieve[:hi])
> sum(islice(sieve, hi))
> sum(x for x in islice(sieve, hi) if x)
> sum(x for x in islice(sieve, hi) if x is True)
> sum(1 for x in islice(sieve, hi) if x is True)
> len(list(filter(None, islice(sieve, hi))))

Yes, I did time it, and I agree with your results (where my tests
overlap with yours).

But when using a sub-sequence, I do suffer a significant reduction in
speed for a count compared with a count on the full list. When the list
is small enough not to cause memory allocation issues, the reduction is
about 30% on 100,000,000 items. But when the list is 1,000,000,000
items, OS memory allocation becomes an issue and the cost on my system
rises to over 600%.

I agree that this is not a big issue, but it seems to me a high price to
pay for the lack of a sieve.count(value, limit), which I feel would be a
useful function (given that memoryview operations are not available for
lists).

> Of course. But don't optimize this until you know that you *need* to
> optimize it. Is it really a bottleneck in your code? There's no point in
> saving the 0.1 second it takes to copy the list if it takes 2 seconds to
> count the items regardless.
>
>> Are there any other solutions that will avoid copying a large part of
>> the list?
>
> Yes, but they're slower.
>
> Perhaps a better solution might be to avoid counting anything.
> If you can keep a counter, and each time you add a value to the list
> you update the counter, then getting the number of True values will be
> instantaneous.

Creating the sieve is currently very fast, as it is not done by adding
single items but by adding a large number of items at the same time
using a slice operation. I could count the items in each slice as it is
added, but this would add complexity that I would prefer to avoid,
because the creation of the sieve is quite tricky to get right and I
would prefer not to fiddle with this.

Thank you (and others) for advice on this.
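
For anyone following the thread, here is a rough sketch of two ways the copy might be reduced. Both rest on assumptions not stated above. The first assumes the sieve could be held in a bytearray instead of a list, in which case the built-in bytearray.count(sub, start, end) already acts like the count(value, limit) wished for and does not slice. The second keeps the list and counts it in fixed-size chunks, so only a small slice is copied at any one time. The names make_sieve and count_true_chunked and the chunk size are hypothetical, and make_sieve is a plain textbook sieve, not the sieve discussed in this thread. Neither variant has been timed against sieve[:hi].count(True).

```python
def make_sieve(n):
    """Illustrative sieve of Eratosthenes stored as a bytearray (1 = prime)."""
    sieve = bytearray([1]) * n
    sieve[0:2] = bytes(min(n, 2))          # 0 and 1 are not prime
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            # Clear all multiples of i in one slice assignment;
            # bytes(k) is k zero bytes, matching the slice length.
            sieve[i * i::i] = bytes(len(range(i * i, n, i)))
    return sieve


def count_true_chunked(sieve, hi, chunk=1 << 20):
    """Count True values in sieve[:hi], copying at most `chunk` items at a time."""
    total = 0
    for start in range(0, hi, chunk):
        total += sieve[start:min(start + chunk, hi)].count(True)
    return total


sieve = make_sieve(100)

# bytearray.count takes optional start and end arguments, so no slice is made.
primes_below_30 = sieve.count(b"\x01", 0, 30)

# The same count on a list of booleans, done in small chunks.
bool_sieve = [bool(b) for b in sieve]
primes_below_30_list = count_true_chunked(bool_sieve, 30, chunk=7)
```

A bytearray also uses one byte per entry, where a list holds a full pointer per entry, which may matter at a billion items. Whether either approach is actually faster would need measuring on the real sieve.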