Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #44051 > unrolled thread
| Started by | Blind Anagram <blindanagram@nowhere.org> |
|---|---|
| First post | 2013-04-22 12:58 +0100 |
| Last post | 2013-04-22 06:36 -0700 |
| Articles | 20 on this page of 42 — 8 participants |
Back to article view | Back to comp.lang.python
List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-22 12:58 +0100
Re: List Count Dave Angel <davea@davea.name> - 2013-04-22 08:51 -0400
Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-22 14:03 +0100
Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-22 14:03 +0100
Re: List Count Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-04-22 13:13 +0000
Re: List Count Skip Montanaro <skip@pobox.com> - 2013-04-22 08:57 -0500
Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-22 15:15 +0100
Re: List Count Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-04-22 16:14 +0100
Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-22 16:50 +0100
Re: List Count Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-04-22 17:06 +0100
Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-22 17:38 +0100
Re: List Count Skip Montanaro <skip@pobox.com> - 2013-04-22 12:48 -0500
Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-22 20:22 +0100
Re: List Count Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-04-22 21:18 +0100
Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-22 22:25 +0100
Re: List Count Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-04-23 00:06 +0100
Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-23 07:45 +0100
Re: List Count Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-04-22 23:28 +0000
Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-23 08:00 +0100
Re: List Count Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-04-22 22:03 +0100
Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-22 22:32 +0100
Re: List Count Dave Angel <davea@davea.name> - 2013-04-22 21:47 -0400
Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-23 08:02 +0100
Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-22 17:38 +0100
Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-22 16:50 +0100
Re: List Count Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-04-22 23:01 +0000
Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-23 08:05 +0100
Re: List Count Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-04-23 12:08 +0100
Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-23 12:45 +0100
Re: List Count Terry Jan Reedy <tjreedy@udel.edu> - 2013-04-23 15:01 -0400
Re: List Count Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-04-23 14:49 +0000
Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-23 17:57 +0100
Re: List Count Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-04-23 18:45 +0100
Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-23 19:30 +0100
Re: List Count Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-04-23 20:16 +0100
Re: List Count Terry Jan Reedy <tjreedy@udel.edu> - 2013-04-23 16:00 -0400
Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-23 21:41 +0100
Re: List Count Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-04-23 21:38 +0100
Re: List Count Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-04-24 01:59 +0000
Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-24 10:01 +0100
Re: List Count Peter Otten <__peter__@web.de> - 2013-04-22 15:22 +0200
Re: List Count 88888 Dihedral <dihedral88888@googlemail.com> - 2013-04-22 06:36 -0700
Page 2 of 3 — ← Prev page 1 [2] 3 Next page →
| From | Blind Anagram <blindanagram@nowhere.org> |
|---|---|
| Date | 2013-04-22 22:32 +0100 |
| Message-ID | <MOmdnVeRh4PiMejMnZ2dnUVZ7t6dnZ2d@brightview.co.uk> |
| In reply to | #44113 |
On 22/04/2013 22:03, Oscar Benjamin wrote: > On 22 April 2013 21:18, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote: >> On 22 April 2013 17:38, Blind Anagram <blindanagram@nowhere.org> wrote: >>> On 22/04/2013 17:06, Oscar Benjamin wrote: >>> >>>> I don't know what your application is but I would say that my first >>>> port of call here would be to consider a different algorithmic >>>> approach. An obvious question would be about the sparsity of this data >>>> structure. How frequent are the values that you are trying to count? >>>> Would it make more sense to store a list of their indices? >>> >>> Actually it is no more than a simple prime sieve implemented as a Python >>> class (and, yes, I realize that there are plenty of these around). >> >> If I understand correctly, you have a list of roughly a billion >> True/False values indicating which integers are prime and which are >> not. You would like to discover how many prime numbers there are >> between two numbers a and b. You currently do this by counting the >> number of True values in your list between the indices a and b. >> >> If my description is correct then I would definitely consider using a >> different algorithmic approach. The density of primes from 1 to 1 >> billlion is about 5%. Storing the prime numbers themselves in a sorted >> list would save memory and allow a potentially more efficient way of >> counting the number of primes within some interval. > > In fact it is probably quicker if you don't mind using all that memory > to just store the cumulative sum of your prime True/False indicator > list. This would be the prime counting function pi(n). You can then > count the primes between a and b in constant time with pi[b] - pi[a]. I did wonder whether, after creating the sieve, I should simply go through the list and replace the True values with a count. This would certainly speed up the prime count function, which is where the issue arises. I will try this and see what sort of performance trade-offs this involves. Brian
[toc] | [prev] | [next] | [standalone]
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2013-04-22 21:47 -0400 |
| Message-ID | <mailman.946.1366681733.3114.python-list@python.org> |
| In reply to | #44117 |
On 04/22/2013 05:32 PM, Blind Anagram wrote: > On 22/04/2013 22:03, Oscar Benjamin wrote: >> On 22 April 2013 21:18, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote: >>> On 22 April 2013 17:38, Blind Anagram <blindanagram@nowhere.org> wrote: >>>> On 22/04/2013 17:06, Oscar Benjamin wrote: >>>> >>>>> I don't know what your application is but I would say that my first >>>>> port of call here would be to consider a different algorithmic >>>>> approach. An obvious question would be about the sparsity of this data >>>>> structure. How frequent are the values that you are trying to count? >>>>> Would it make more sense to store a list of their indices? >>>> >>>> Actually it is no more than a simple prime sieve implemented as a Python >>>> class (and, yes, I realize that there are plenty of these around). >>> >>> If I understand correctly, you have a list of roughly a billion >>> True/False values indicating which integers are prime and which are >>> not. You would like to discover how many prime numbers there are >>> between two numbers a and b. You currently do this by counting the >>> number of True values in your list between the indices a and b. >>> >>> If my description is correct then I would definitely consider using a >>> different algorithmic approach. The density of primes from 1 to 1 >>> billlion is about 5%. Storing the prime numbers themselves in a sorted >>> list would save memory and allow a potentially more efficient way of >>> counting the number of primes within some interval. >> >> In fact it is probably quicker if you don't mind using all that memory >> to just store the cumulative sum of your prime True/False indicator >> list. This would be the prime counting function pi(n). You can then >> count the primes between a and b in constant time with pi[b] - pi[a]. > > I did wonder whether, after creating the sieve, I should simply go > through the list and replace the True values with a count. This would > certainly speed up the prime count function, which is where the issue > arises. I will try this and see what sort of performance trade-offs > this involves. > By doing that replacement, you'd increase memory usage manyfold (maybe 3:1, I don't know). As long as you're only using bools in the list, you only have the list overhead to consider, because all the objects involved are already cached (True and False exist only once each). If you have integers, you'll need a new object for each nonzero count. -- DaveA
[toc] | [prev] | [next] | [standalone]
| From | Blind Anagram <blindanagram@nowhere.org> |
|---|---|
| Date | 2013-04-23 08:02 +0100 |
| Message-ID | <jfudneR1BOGOr-vMnZ2dnUVZ8jmdnZ2d@brightview.co.uk> |
| In reply to | #44132 |
On 23/04/2013 02:47, Dave Angel wrote: > On 04/22/2013 05:32 PM, Blind Anagram wrote: >> On 22/04/2013 22:03, Oscar Benjamin wrote: >>> On 22 April 2013 21:18, Oscar Benjamin <oscar.j.benjamin@gmail.com> >>> wrote: >>>> On 22 April 2013 17:38, Blind Anagram <blindanagram@nowhere.org> wrote: >>>>> On 22/04/2013 17:06, Oscar Benjamin wrote: >>>>> >>>>>> I don't know what your application is but I would say that my first >>>>>> port of call here would be to consider a different algorithmic >>>>>> approach. An obvious question would be about the sparsity of this >>>>>> data >>>>>> structure. How frequent are the values that you are trying to count? >>>>>> Would it make more sense to store a list of their indices? >>>>> >>>>> Actually it is no more than a simple prime sieve implemented as a >>>>> Python >>>>> class (and, yes, I realize that there are plenty of these around). >>>> >>>> If I understand correctly, you have a list of roughly a billion >>>> True/False values indicating which integers are prime and which are >>>> not. You would like to discover how many prime numbers there are >>>> between two numbers a and b. You currently do this by counting the >>>> number of True values in your list between the indices a and b. >>>> >>>> If my description is correct then I would definitely consider using a >>>> different algorithmic approach. The density of primes from 1 to 1 >>>> billlion is about 5%. Storing the prime numbers themselves in a sorted >>>> list would save memory and allow a potentially more efficient way of >>>> counting the number of primes within some interval. >>> >>> In fact it is probably quicker if you don't mind using all that memory >>> to just store the cumulative sum of your prime True/False indicator >>> list. This would be the prime counting function pi(n). You can then >>> count the primes between a and b in constant time with pi[b] - pi[a]. >> >> I did wonder whether, after creating the sieve, I should simply go >> through the list and replace the True values with a count. This would >> certainly speed up the prime count function, which is where the issue >> arises. I will try this and see what sort of performance trade-offs >> this involves. >> > > By doing that replacement, you'd increase memory usage manyfold (maybe > 3:1, I don't know). As long as you're only using bools in the list, you > only have the list overhead to consider, because all the objects > involved are already cached (True and False exist only once each). If > you have integers, you'll need a new object for each nonzero count. Thank you, Dave, you have answered a question that I was going to ask before I even asked it!
[toc] | [prev] | [next] | [standalone]
| From | Blind Anagram <blindanagram@nowhere.org> |
|---|---|
| Date | 2013-04-22 17:38 +0100 |
| Message-ID | <mailman.931.1366648708.3114.python-list@python.org> |
| In reply to | #44092 |
On 22/04/2013 17:06, Oscar Benjamin wrote: > On 22 April 2013 16:50, Blind Anagram <blindanagram@nowhere.org> wrote: >>> >>> It would be very easy to subclass list and add this functionality in >>> cython if you decide that you do need a builtin method. > [snip] >> >> But I was really wondering if there was a simple solution that worked >> without people having to add libraries to their basic Python installations. > > There are simple solutions and some have already been listed. You are > attempting to push your program to the limit of your hardware > capabilities and it's natural that in a high-level language you'll > often want special libraries for that. Hi Oscar Yes, but it is a tribute to Python that I can do this quite fast for huge lists provided that I only count on the full list. And, unless I have completely misunderstood Python internals, it would probably be just as fast on a sub-sequence if I had a list.count(value, limit) function (however, I admit that I could be wrong here since the fact that count on lists does not offer this may mean that it is not as easy to implement as it might seem). > I don't know what your application is but I would say that my first > port of call here would be to consider a different algorithmic > approach. An obvious question would be about the sparsity of this data > structure. How frequent are the values that you are trying to count? > Would it make more sense to store a list of their indices? Actually it is no more than a simple prime sieve implemented as a Python class (and, yes, I realize that there are plenty of these around). > If the problem needs to be solved the way that you are currently doing > it and the available methods are not fast enough then you will need to > consider additional libraries. >> >> As I have never tried building an extension with cython, I am inclined >> to try this as a learning exercise if nothing else. > > I definitely recommend this over writing a C extension directly. Thanks again - I will definitely look at this. Brian
[toc] | [prev] | [next] | [standalone]
| From | Blind Anagram <blindanagram@nowhere.org> |
|---|---|
| Date | 2013-04-22 16:50 +0100 |
| Message-ID | <mailman.928.1366645838.3114.python-list@python.org> |
| In reply to | #44089 |
On 22/04/2013 16:14, Oscar Benjamin wrote: > On 22 April 2013 15:15, Blind Anagram <blindanagram@nowhere.org> wrote: >> On 22/04/2013 14:13, Steven D'Aprano wrote: >>> On Mon, 22 Apr 2013 12:58:20 +0100, Blind Anagram wrote: >>> >>>> I would be grateful for any advice people can offer on the fastest way >>>> to count items in a sub-sequence of a large list. >>>> >>>> I have a list of boolean values that can contain many hundreds of >>>> millions of elements for which I want to count the number of True values >>>> in a sub-sequence, one from the start up to some value (say hi). >>>> >>>> I am currently using: >>>> >>>> sieve[:hi].count(True) >>>> >>>> but I believe this may be costly because it copies a possibly large part >>>> of the sieve. > [snip] >> >> But when using a sub-sequence, I do suffer a significant reduction in >> speed for a count when compared with count on the full list. When the >> list is small enough not to cause memory allocation issues this is about >> 30% on 100,000,000 items. But when the list is 1,000,000,000 items, OS >> memory allocation becomes an issue and the cost on my system rises to >> over 600%. > > Have you tried using numpy? I find that it reduces the memory required > to store a list of bools by a factor of 4 on my 32 bit system. I would > expect that to be a factor of 8 on a 64 bit system: > >>>> import sys >>>> a = [True] * 1000000 >>>> sys.getsizeof(a) > 4000036 >>>> import numpy >>>> a = numpy.ndarray(1000000, bool) >>>> sys.getsizeof(a) # This does not include the data buffer > 40 >>>> a.nbytes > 1000000 > > The numpy array also has the advantage that slicing does not actually > copy the data (as has already been mentioned). On this system slicing > a numpy array has a 40 byte overhead regardless of the size of the > slice. > >> I agree that this is not a big issue but it seems to me a high price to >> pay for the lack of a sieve.count(value, limit), which I feel is a >> useful function (given that memoryview operations are not available for >> lists). > > It would be very easy to subclass list and add this functionality in > cython if you decide that you do need a builtin method. Thanks Oscar, I'll take a look at this. But I was really wondering if there was a simple solution that worked without people having to add libraries to their basic Python installations. As I have never tried building an extension with cython, I am inclined to try this as a learning exercise if nothing else.
[toc] | [prev] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-04-22 23:01 +0000 |
| Message-ID | <5175c12f$0$29977$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #44074 |
On Mon, 22 Apr 2013 15:15:19 +0100, Blind Anagram wrote:
> But when using a sub-sequence, I do suffer a significant reduction in
> speed for a count when compared with count on the full list. When the
> list is small enough not to cause memory allocation issues this is about
> 30% on 100,000,000 items. But when the list is 1,000,000,000 items, OS
> memory allocation becomes an issue and the cost on my system rises to
> over 600%.
Buy more memory :-)
> I agree that this is not a big issue but it seems to me a high price to
> pay for the lack of a sieve.count(value, limit), which I feel is a
> useful function (given that memoryview operations are not available for
> lists).
There is no need to complicate the count method for such a specialised
use-case. A more general solution would be to provide list views.
Another solution might be to use arrays rather than lists. Since your
sieve list is homogeneous, you could possibly use an array of 1 or 0
bytes rather than a list of True or False bools. That would reduce the
memory overhead by a factor of four, and similarly reduce the overhead of
any copying:
py> from array import array
py> from sys import getsizeof
py> L = [True, False, False, True]*1000
py> A = array('b', L)
py> getsizeof(L)
16032
py> getsizeof(A)
4032
--
Steven
[toc] | [prev] | [next] | [standalone]
| From | Blind Anagram <blindanagram@nowhere.org> |
|---|---|
| Date | 2013-04-23 08:05 +0100 |
| Message-ID | <e4KdnXzMesF-r-vMnZ2dnUVZ8rWdnZ2d@brightview.co.uk> |
| In reply to | #44120 |
On 23/04/2013 00:01, Steven D'Aprano wrote: > On Mon, 22 Apr 2013 15:15:19 +0100, Blind Anagram wrote: > >> But when using a sub-sequence, I do suffer a significant reduction in >> speed for a count when compared with count on the full list. When the >> list is small enough not to cause memory allocation issues this is about >> 30% on 100,000,000 items. But when the list is 1,000,000,000 items, OS >> memory allocation becomes an issue and the cost on my system rises to >> over 600%. > > Buy more memory :-) > > >> I agree that this is not a big issue but it seems to me a high price to >> pay for the lack of a sieve.count(value, limit), which I feel is a >> useful function (given that memoryview operations are not available for >> lists). > > There is no need to complicate the count method for such a specialised > use-case. A more general solution would be to provide list views. > > Another solution might be to use arrays rather than lists. Since your > sieve list is homogeneous, you could possibly use an array of 1 or 0 > bytes rather than a list of True or False bools. That would reduce the > memory overhead by a factor of four, and similarly reduce the overhead of > any copying: I did a lot of work comparing the overall performance of the sieve when using either lists or arrays and I found that lists were a lot faster for the majority use case when the sieve is not large. Brian
[toc] | [prev] | [next] | [standalone]
| From | Oscar Benjamin <oscar.j.benjamin@gmail.com> |
|---|---|
| Date | 2013-04-23 12:08 +0100 |
| Message-ID | <mailman.967.1366715343.3114.python-list@python.org> |
| In reply to | #44158 |
On 23 April 2013 08:05, Blind Anagram <blindanagram@nowhere.org> wrote:
> On 23/04/2013 00:01, Steven D'Aprano wrote:
>> On Mon, 22 Apr 2013 15:15:19 +0100, Blind Anagram wrote:
>>
>>> But when using a sub-sequence, I do suffer a significant reduction in
>>> speed for a count when compared with count on the full list. When the
>>> list is small enough not to cause memory allocation issues this is about
>>> 30% on 100,000,000 items. But when the list is 1,000,000,000 items, OS
>>> memory allocation becomes an issue and the cost on my system rises to
>>> over 600%.
[snip]
>>
>> Another solution might be to use arrays rather than lists. Since your
>> sieve list is homogeneous, you could possibly use an array of 1 or 0
>> bytes rather than a list of True or False bools. That would reduce the
>> memory overhead by a factor of four, and similarly reduce the overhead of
>> any copying:
>
> I did a lot of work comparing the overall performance of the sieve when
> using either lists or arrays and I found that lists were a lot faster
> for the majority use case when the sieve is not large.
Okay, now I understand. I thought you were trying to optimise for
large lists, but in fact you have already optimised for small lists.
As a result you have made algorithmic choices that don't scale very
well. Since you're still concerned about performance on small lists
you don't want to rethink those choices. Instead you want a
micro-optimisation that would compensate for them.
Elsewhere you said:
> I would not dream of doing this job by copying a slice in any other
> language that I have used so I was simply asking for advice to discover
> whether this copy could be avoided whilst staying with the simple sieve
> design.
So you already knew that there would be problems with this method, but
you've chosen it anyway since it turned out to be fastest for small
lists. You could always just do a different thing when the list is
large:
def pi(self, n):
if n < 1000000:
return self.indicator[:n].sum()
else:
return sum(itertools.islice(self.indicator, n))
However, if you really want to improve performance in computing pi(n)
for large n you should just use one of the existing algorithms having
sublinear space/time complexity. These also use evaluate pi(n) with
sieves but the sieve only needs to be as big as sqrt(n) rather than n
for the obvious method:
http://en.wikipedia.org/wiki/Prime-counting_function#Algorithms_for_evaluating_.CF.80.28x.29
Oscar
[toc] | [prev] | [next] | [standalone]
| From | Blind Anagram <blindanagram@nowhere.org> |
|---|---|
| Date | 2013-04-23 12:45 +0100 |
| Message-ID | <TdOdnQside8Y6evMnZ2dnUVZ8s2dnZ2d@brightview.co.uk> |
| In reply to | #44171 |
On 23/04/2013 12:08, Oscar Benjamin wrote: > On 23 April 2013 08:05, Blind Anagram <blindanagram@nowhere.org> wrote: >> On 23/04/2013 00:01, Steven D'Aprano wrote: >>> On Mon, 22 Apr 2013 15:15:19 +0100, Blind Anagram wrote: >>> >>>> But when using a sub-sequence, I do suffer a significant reduction in >>>> speed for a count when compared with count on the full list. When the >>>> list is small enough not to cause memory allocation issues this is about >>>> 30% on 100,000,000 items. But when the list is 1,000,000,000 items, OS >>>> memory allocation becomes an issue and the cost on my system rises to >>>> over 600%. > [snip] >>> >>> Another solution might be to use arrays rather than lists. Since your >>> sieve list is homogeneous, you could possibly use an array of 1 or 0 >>> bytes rather than a list of True or False bools. That would reduce the >>> memory overhead by a factor of four, and similarly reduce the overhead of >>> any copying: >> >> I did a lot of work comparing the overall performance of the sieve when >> using either lists or arrays and I found that lists were a lot faster >> for the majority use case when the sieve is not large. > > Okay, now I understand. I thought you were trying to optimise for > large lists, but in fact you have already optimised for small lists. > As a result you have made algorithmic choices that don't scale very > well. Since you're still concerned about performance on small lists > you don't want to rethink those choices. Instead you want a > micro-optimisation that would compensate for them. > > Elsewhere you said: > >> I would not dream of doing this job by copying a slice in any other >> language that I have used so I was simply asking for advice to discover >> whether this copy could be avoided whilst staying with the simple sieve >> design. > > So you already knew that there would be problems with this method, but > you've chosen it anyway since it turned out to be fastest for small > lists. You could always just do a different thing when the list is > large: Your analysis of my rationale is sound except that I only found out that I had a problem with counting in a subset of a list when I actually tried this for a huge sieve. It was only then that I discovered that there was no way of setting the upper limit in list.count(x) and hence that I would have to create a slice or find an alternative approach. I then wondered why count for lists has no limits whereas count for other objects (e.g. strings) has these. I also wondered whether there was an easy way of avoiding the slice, not that this is critical, but rather because it is just nice to have a sieve that still actually works for huge values, albeit inefficiently. > def pi(self, n): > if n < 1000000: > return self.indicator[:n].sum() > else: > return sum(itertools.islice(self.indicator, n)) > I have looked at itertools.islice as a complete replacement for count() where, on average, it was a lot slower. But I have not tried hybrid strategies as you suggest - I'll take a look at this. My thanks to you and others for the advice that has been offered. Brian
[toc] | [prev] | [next] | [standalone]
| From | Terry Jan Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2013-04-23 15:01 -0400 |
| Message-ID | <mailman.988.1366743690.3114.python-list@python.org> |
| In reply to | #44174 |
On 4/23/2013 7:45 AM, Blind Anagram wrote: > I then wondered why count for lists has no limits Probably because no one has asked for such, as least partly because it is not really needed. In any case, .count(s) is a generic method. It is part of the definition of a Sequence. It can also be implemented for non-sequence collections, such as a Tree class, that allow multiple occurrences of an item. > whereas count for other objects (e.g. strings) has these. Strings (of unicode or bytes) are exceptional in multiple ways. -- Terry Jan Reedy
[toc] | [prev] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-04-23 14:49 +0000 |
| Message-ID | <51769f96$0$29977$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #44158 |
On Tue, 23 Apr 2013 08:05:53 +0100, Blind Anagram wrote: > I did a lot of work comparing the overall performance of the sieve when > using either lists or arrays and I found that lists were a lot faster > for the majority use case when the sieve is not large. And when the sieve is large? I don't actually believe that the bottleneck is the cost of taking a list slice. That's pretty fast, even for huge lists, and all efforts to skip making a copy by using itertools.islice actually ended up slower. But suppose it is the bottleneck. Then *sooner or later* arrays will win over lists, simply because they're smaller. Of course, "sooner or later" might be much later. I expect that you will not find a single algorithm, or data structure, that works optimally for both small and huge inputs. In general, there are two strategies you might take: 1) Use an algorithm or data structure which is efficient for small inputs with small inputs, and after some cut-off size, swap to a different algorithm which is efficient for large inputs. That swap over may require a one-off conversion cost, but provided your sieve never shrinks, this may not matter. 2) Use only the algorithm for large inputs. For small inputs, the difference between the two is insignificant in absolute terms (who cares if the operation takes 5ms instead of 1ms?), but for large N, there is a clear winner. There's nothing that says you're limited to two algorithms. You may find that to really optimize things, you need three or more algorithms, each one optimized for a particular subset of inputs. Of course, all this added complexity is itself very costly. Is it worth it? -- Steven
[toc] | [prev] | [next] | [standalone]
| From | Blind Anagram <blindanagram@nowhere.org> |
|---|---|
| Date | 2013-04-23 17:57 +0100 |
| Message-ID | <fd6dnXTDxbbiIOvMnZ2dnUVZ8qGdnZ2d@brightview.co.uk> |
| In reply to | #44185 |
On 23/04/2013 15:49, Steven D'Aprano wrote: > On Tue, 23 Apr 2013 08:05:53 +0100, Blind Anagram wrote: > >> I did a lot of work comparing the overall performance of the sieve when >> using either lists or arrays and I found that lists were a lot faster >> for the majority use case when the sieve is not large. > > And when the sieve is large? I don't know but since the majority use case is when the sieve is small, it makes sense to choose a list. > I don't actually believe that the bottleneck is the cost of taking a list > slice. That's pretty fast, even for huge lists, and all efforts to skip > making a copy by using itertools.islice actually ended up slower. But > suppose it is the bottleneck. Then *sooner or later* arrays will win over > lists, simply because they're smaller. Maybe you have not noticed that, when I am discussing a huge sieve, I am simply pushing a sieve designed primarily for a small sieve lengths to the absolute limit. This is most definitely a minority use case. In pushing the size of the sieve upwards, it is the slice operation that is the first thing that 'breaks'. This is because the slice can be almost as big as the primary array so the OS gets driven into memory allocation problems for a sieve that is about half the length it would otherwise be. It still works but the cost of the slice once this point is reached rises from about 20% to over 600% because of all the paging going on. The unavailable list.count(value, limit) function would hence allow the sieve length to be up to twice as large before running into problems and would also cut the 20% slice cost I am seeing on smaller sieve lengths. So, all I was doing in asking for advice was to check whether there is an easy way of avoiding the slice copy, not because this is critical, but rather because it is a pity to limit the performance because Python forces a (strictly unnecessary) copy in order to perform a count within a part of a list. In other words, the lack of a list.count(value, limit) function makes Python less effective than it would otherwise be. I haven't looked at Python's C code base but I still wonder if there a good reason for NOT providing this?
[toc] | [prev] | [next] | [standalone]
| From | Oscar Benjamin <oscar.j.benjamin@gmail.com> |
|---|---|
| Date | 2013-04-23 18:45 +0100 |
| Message-ID | <mailman.987.1366739164.3114.python-list@python.org> |
| In reply to | #44203 |
On 23 April 2013 17:57, Blind Anagram <blindanagram@nowhere.org> wrote: > On 23/04/2013 15:49, Steven D'Aprano wrote: >> On Tue, 23 Apr 2013 08:05:53 +0100, Blind Anagram wrote: >> >>> I did a lot of work comparing the overall performance of the sieve when >>> using either lists or arrays and I found that lists were a lot faster >>> for the majority use case when the sieve is not large. >> >> And when the sieve is large? > > I don't know but since the majority use case is when the sieve is small, > it makes sense to choose a list. That's an odd comment given what you said at the start of this thread: Blind Anagram wrote: > I would be grateful for any advice people can offer on the fastest way > to count items in a sub-sequence of a large list. > > I have a list of boolean values that can contain many hundreds of > millions of elements for which I want to count the number of True values > in a sub-sequence, one from the start up to some value (say hi). >> I don't actually believe that the bottleneck is the cost of taking a list >> slice. That's pretty fast, even for huge lists, and all efforts to skip >> making a copy by using itertools.islice actually ended up slower. But >> suppose it is the bottleneck. Then *sooner or later* arrays will win over >> lists, simply because they're smaller. > > Maybe you have not noticed that, when I am discussing a huge sieve, I am > simply pushing a sieve designed primarily for a small sieve lengths to > the absolute limit. This is most definitely a minority use case. > > In pushing the size of the sieve upwards, it is the slice operation that > is the first thing that 'breaks'. This is because the slice can be > almost as big as the primary array so the OS gets driven into memory > allocation problems for a sieve that is about half the length it would > otherwise be. It still works but the cost of the slice once this point > is reached rises from about 20% to over 600% because of all the paging > going on. You keep mentioning that you want it to work with a large sieve. I would much rather compute the same quantities with a small sieve if possible. If you were using the Lehmer/Meissel algorithm you would be able to compute the same quantity (i.e. pi(1e9)) using a much smaller sieve with 30k items instead of 1e9. that would fit *very* comfortably in memory and you wouldn't even need to slice the list. Or to put it another way, you could compute pi(~1e18) using your current sieve without slicing or paging. If you want to lift the limit on computing pi(x) this is clearly the way to go. > > The unavailable list.count(value, limit) function would hence allow the > sieve length to be up to twice as large before running into problems and > would also cut the 20% slice cost I am seeing on smaller sieve lengths. > > So, all I was doing in asking for advice was to check whether there is > an easy way of avoiding the slice copy, not because this is critical, > but rather because it is a pity to limit the performance because Python > forces a (strictly unnecessary) copy in order to perform a count within > a part of a list. > > In other words, the lack of a list.count(value, limit) function makes > Python less effective than it would otherwise be. I haven't looked at > Python's C code base but I still wonder if there a good reason for NOT > providing this? If you feel that this is a good suggestion for an improvement to Python consider posting it on python-ideas. I wasn't aware of the equivalent functionality on strings but I see that the tuple.count() function is the same as list.count(). Oscar
[toc] | [prev] | [next] | [standalone]
| From | Blind Anagram <blindanagram@nowhere.org> |
|---|---|
| Date | 2013-04-23 19:30 +0100 |
| Message-ID | <hZednRfL7I7YTuvMnZ2dnUVZ7oednZ2d@brightview.co.uk> |
| In reply to | #44205 |
On 23/04/2013 18:45, Oscar Benjamin wrote: >>>> I did a lot of work comparing the overall performance of the sieve when >>>> using either lists or arrays and I found that lists were a lot faster >>>> for the majority use case when the sieve is not large. >>> >>> And when the sieve is large? >> >> I don't know but since the majority use case is when the sieve is small, >> it makes sense to choose a list. > > That's an odd comment given what you said at the start of this thread: > > Blind Anagram wrote: >> I would be grateful for any advice people can offer on the fastest way >> to count items in a sub-sequence of a large list. >> >> I have a list of boolean values that can contain many hundreds of >> millions of elements for which I want to count the number of True values >> in a sub-sequence, one from the start up to some value (say hi). At this early stage in the discussion I was simply explaining the immediate context of the problem on which I was seeking advice. Here I didn't think it was necessary to expand on the wider context since there might have been an very easy way that people could suggest for avoiding the slice copy when counting on a part of a list. But there isn't, so the wider details then became important in explaining why some proposals might work for the limiting case but would not make sense within the overall context of use. And here I have said on more than one occasion that the huge sieve case is a minority use case. [snip] > You keep mentioning that you want it to work with a large sieve. I > would much rather compute the same quantities with a small sieve if > possible. If you were using the Lehmer/Meissel algorithm you would be > able to compute the same quantity (i.e. pi(1e9)) using a much smaller > sieve with 30k items instead of 1e9. that would fit *very* comfortably > in memory and you wouldn't even need to slice the list. Or to put it > another way, you could compute pi(~1e18) using your current sieve > without slicing or paging. If you want to lift the limit on computing > pi(x) this is clearly the way to go. If prime_pi for huge numbers was really important to me I wouldn't be using Python! This is just one of a number of functions implemented in the class. It is nice to have and it is also nice for testing purposes to be able to run it for large sieves. But it is by no means important enough to devote dedicated code to computing it. The prime generator within the class is far more important and is the workhorse for most uses [snip] >> In other words, the lack of a list.count(value, limit) function makes >> Python less effective than it would otherwise be. I haven't looked at >> Python's C code base but I still wonder if there a good reason for NOT >> providing this? > > If you feel that this is a good suggestion for an improvement to > Python consider posting it on python-ideas. I wasn't aware of the > equivalent functionality on strings but I see that the tuple.count() > function is the same as list.count(). To be honest, I am not really in a position to judge whether this is a 'good' suggestion. It has turned up as potentially useful in two cases for me but one of these (the one discussed here) is a minority use case. I would also be a bit worried about launching this into a group of Python experts who will almost certainly be even more demanding of explanations than you folk here! Brian
[toc] | [prev] | [next] | [standalone]
| From | Oscar Benjamin <oscar.j.benjamin@gmail.com> |
|---|---|
| Date | 2013-04-23 20:16 +0100 |
| Message-ID | <mailman.989.1366744639.3114.python-list@python.org> |
| In reply to | #44206 |
On 23 April 2013 19:30, Blind Anagram <blindanagram@nowhere.org> wrote: > On 23/04/2013 18:45, Oscar Benjamin wrote: > > [snip] >> You keep mentioning that you want it to work with a large sieve. I >> would much rather compute the same quantities with a small sieve if >> possible. If you were using the Lehmer/Meissel algorithm you would be >> able to compute the same quantity (i.e. pi(1e9)) using a much smaller >> sieve with 30k items instead of 1e9. that would fit *very* comfortably >> in memory and you wouldn't even need to slice the list. Or to put it >> another way, you could compute pi(~1e18) using your current sieve >> without slicing or paging. If you want to lift the limit on computing >> pi(x) this is clearly the way to go. > > If prime_pi for huge numbers was really important to me I wouldn't be > using Python! I would, at least to begin with. The advantage that Python has for this sort of thing is that it takes a minimal amount of programmer effort to implement these kind of algorithms. This is because you don't have to worry about nuts and bolts problems like integer overflow and memory allocation. You also have an abundance of different data structures to fulfil the big-O requirements of most algorithms. It's easy to make generators that can iterate over infinite or arbitrarily large sequences while still using constant memory. And so on... To implement the algorithm I mentioned in e.g. C would be considerably more work. C would, however, perform much better using your brute force approach and would lift the memory constraints of your program by a small (in asymptotic terms) amount. So I actually find that I can often get a faster program in Python simply because it's much easier to implement fancier algorithms with the optimal asymptotic performance that will ultimately outperform a brute force approach whichever language is used. Although, in saying that, those particular algorithms are probably most naturally implemented in a functional language like Haskell with scalable recursion. >>> In other words, the lack of a list.count(value, limit) function makes >>> Python less effective than it would otherwise be. I haven't looked at >>> Python's C code base but I still wonder if there a good reason for NOT >>> providing this? >> >> If you feel that this is a good suggestion for an improvement to >> Python consider posting it on python-ideas. I wasn't aware of the >> equivalent functionality on strings but I see that the tuple.count() >> function is the same as list.count(). > > To be honest, I am not really in a position to judge whether this is a > 'good' suggestion. It has turned up as potentially useful in two cases > for me but one of these (the one discussed here) is a minority use case. > I would also be a bit worried about launching this into a group of > Python experts who will almost certainly be even more demanding of > explanations than you folk here! I wouldn't worry about that. I've never felt the need for this but then I would probably use numpy to do what you're doing. It's certainly not an outlandish suggestion and I'd be surprised if no one agreed with the idea. Oscar
[toc] | [prev] | [next] | [standalone]
| From | Terry Jan Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2013-04-23 16:00 -0400 |
| Message-ID | <mailman.990.1366747255.3114.python-list@python.org> |
| In reply to | #44203 |
On 4/23/2013 12:57 PM, Blind Anagram wrote: > So, all I was doing in asking for advice was to check whether there is > an easy way of avoiding the slice copy, And there is. > not because this is critical, > but rather because it is a pity to limit the performance because Python > forces a (strictly unnecessary) copy in order to perform a count within > a part of a list. Python does not force that. You have been given several simple no-copy alternatives. They happen to be slower *with CPython* because of the speed difference between Python code and C code. If the same timing tests were done with any of the implementations that execute python code faster, the results would likely be different. I thing str/byte/bytearray.count have more need for optional start,stop boundary parameters because a) people search in long texts and subtexts, more so I think that for other sequences, b) they search for substrings longer than 1 and hence c) the generic no-slice alternatives do not work for counting substrings. That said, I do see that tuple/list.index have had start, stop paramaters added, so doing the same for .count is conceivable. I just do not remember anyone else asking for such. The use case must be very rare. And as I said in my other post, .count(x) applies to any collections, but start,stop would only apply to sequences. > In other words, the lack of a list.count(value, limit) function makes > Python less effective than it would otherwise be. Untrue. The alternatives are just as *effective*.
[toc] | [prev] | [next] | [standalone]
| From | Blind Anagram <blindanagram@nowhere.org> |
|---|---|
| Date | 2013-04-23 21:41 +0100 |
| Message-ID | <ofedneXW2LlsbOvMnZ2dnUVZ8jWdnZ2d@brightview.co.uk> |
| In reply to | #44210 |
On 23/04/2013 21:00, Terry Jan Reedy wrote: > On 4/23/2013 12:57 PM, Blind Anagram wrote: > >> So, all I was doing in asking for advice was to check whether there is >> an easy way of avoiding the slice copy, > > And there is. > >> not because this is critical, >> but rather because it is a pity to limit the performance because Python >> forces a (strictly unnecessary) copy in order to perform a count within >> a part of a list. > > Python does not force that. You have been given several simple no-copy > alternatives. They happen to be slower *with CPython* because of the > speed difference between Python code and C code. If the same timing > tests were done with any of the implementations that execute python code > faster, the results would likely be different. Then being pedantic rather than colloquial, the lack of start, end parameters in the function list.count(value) means that anyone wishing to use this function on a part of a list is forced to slice the list and thereby invoke a possibly costly copy operation, one that is, in principle, not necessary in order to perform the underlying operation. [snip] >> In other words, the lack of a list.count(value, limit) function makes >> Python less effective than it would otherwise be. > > Untrue. The alternatives are just as *effective*. Then I fear that we will have to accept that we disagree on this. Brian
[toc] | [prev] | [next] | [standalone]
| From | Oscar Benjamin <oscar.j.benjamin@gmail.com> |
|---|---|
| Date | 2013-04-23 21:38 +0100 |
| Message-ID | <mailman.992.1366749953.3114.python-list@python.org> |
| In reply to | #44203 |
On 23 April 2013 21:00, Terry Jan Reedy <tjreedy@udel.edu> wrote:
>
> That said, I do see that tuple/list.index have had start, stop paramaters
> added, so doing the same for .count is conceivable.
Are those new? I don't remember them not being there.
You need the start/stop parameters to be able to use index for finding
all occurrences of an item rather than just the first:
def indices(value, sequence):
i = -1
while True:
try:
i = sequence.index(value, i + 1)
except ValueError:
return
yield i
I thought this was a common use case for .index() and that if the
interface had been designed more recently then it might have just been
an .indices() method that returned an iterator.
Oscar
[toc] | [prev] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-04-24 01:59 +0000 |
| Message-ID | <51773c85$0$29977$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #44203 |
On Tue, 23 Apr 2013 17:57:17 +0100, Blind Anagram wrote: > On 23/04/2013 15:49, Steven D'Aprano wrote: >> On Tue, 23 Apr 2013 08:05:53 +0100, Blind Anagram wrote: >> >>> I did a lot of work comparing the overall performance of the sieve >>> when using either lists or arrays and I found that lists were a lot >>> faster for the majority use case when the sieve is not large. >> >> And when the sieve is large? > > I don't know but since the majority use case is when the sieve is small, > it makes sense to choose a list. > >> I don't actually believe that the bottleneck is the cost of taking a >> list slice. That's pretty fast, even for huge lists, and all efforts to >> skip making a copy by using itertools.islice actually ended up slower. >> But suppose it is the bottleneck. Then *sooner or later* arrays will >> win over lists, simply because they're smaller. > > Maybe you have not noticed that, when I am discussing a huge sieve, I am > simply pushing a sieve designed primarily for a small sieve lengths to > the absolute limit. This is most definitely a minority use case. You don't say? I hadn't noticed the first three hundred times you mentioned it :-P In my opinion, it is more important to be efficient for large sieves, not small. As they say, for small N, everything is fast. Nobody is going to care about the difference between small-N taking 1ms or 10ms, but they will care about the difference between big-N taking 1 minute or 1 hour. So, as a general rule, optimize the expensive cases, and if the cheap cases are a little less cheap than they otherwise would have been, nobody will care or notice. Of course, it's your code, and you're free to make whatever trade-offs between time and space and programmer effort that you feel are best. I'm just making a suggestion. [...] > In other words, the lack of a list.count(value, limit) function makes > Python less effective than it would otherwise be. I haven't looked at > Python's C code base but I still wonder if there a good reason for NOT > providing this? The same reasons for not providing any other of an infinite number of potential features. You have to weigh up the potential benefits of the feature against the costs: - Is this a good use of developer time to build the feature? - Every new feature requires somebody to write it, somebody to check it, somebody to write tests for it, somebody to document it. Who is going to do all this work? - Every new feature increases the size and complexity of the language and makes it harder for people to learn. - Does this new feature actually have the advantages claimed? Many so- called optimizations actually turn out to be pessimizations instead. - Will the new feature be an attractive nuisance, fooling programmers into using it inappropriately? In this case, I would say that adding a limit argument to list.count is *absolutely not* worthwhile, because it is insufficiently general. To be general enough to be worthwhile, you would have to add three arguments: list.count(value, start=0, end=None, step=1) But even there I don't think it's general enough to cover the costs. I believe that a better solution would be to create list views to offer a subset of the list without actually copying. -- Steven
[toc] | [prev] | [next] | [standalone]
| From | Blind Anagram <blindanagram@nowhere.org> |
|---|---|
| Date | 2013-04-24 10:01 +0100 |
| Message-ID | <pYGdnfxJWvoeAurMnZ2dnUVZ7vednZ2d@brightview.co.uk> |
| In reply to | #44241 |
On 24/04/2013 02:59, Steven D'Aprano wrote: > On Tue, 23 Apr 2013 17:57:17 +0100, Blind Anagram wrote: [snip] > In my opinion, it is more important to be efficient for large sieves, not > small. As they say, for small N, everything is fast. Nobody is going to > care about the difference between small-N taking 1ms or 10ms, but they > will care about the difference between big-N taking 1 minute or 1 hour. > So, as a general rule, optimize the expensive cases, and if the cheap > cases are a little less cheap than they otherwise would have been, nobody > will care or notice. I agree in general but it is often the case that more sophisticated algorithms only gain traction over simpler ones at much higher points than might be expected from a simple analysis. In my experience a naive sieve is an efficient solution for a general purpose sieve class primarily intended for situations where the sieve length can be large but not huge. As I think you have pointed out, memory is cheap and the memory operations needed to do copying and counting operations are fast. So using up memory is not normally an issue. I have just found a situation where a copy can have a serious impact on performance in an admittedly limiting, minority use case. It the fact that this copy is not, in principle, necessary, that I find disappointing. [snip] > In this case, I would say that adding a limit argument to list.count is > *absolutely not* worthwhile, because it is insufficiently general. To be > general enough to be worthwhile, you would have to add three arguments: > > list.count(value, start=0, end=None, step=1) > > But even there I don't think it's general enough to cover the costs. I > believe that a better solution would be to create list views to offer a > subset of the list without actually copying. I don't know a great deal about Python internals but I suspect that these solutions would offer more but also cost more. So where the balance of advantages would lie is unclear. But I am not pushing for a particular (or, indeed, any) solution.
[toc] | [prev] | [next] | [standalone]
Page 2 of 3 — ← Prev page 1 [2] 3 Next page →
Back to top | Article view | comp.lang.python
csiph-web