Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.python > #44051 > unrolled thread

List Count

Started byBlind Anagram <blindanagram@nowhere.org>
First post2013-04-22 12:58 +0100
Last post2013-04-22 06:36 -0700
Articles 20 on this page of 42 — 8 participants

Back to article view | Back to comp.lang.python


Contents

  List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-22 12:58 +0100
    Re: List Count Dave Angel <davea@davea.name> - 2013-04-22 08:51 -0400
      Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-22 14:03 +0100
      Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-22 14:03 +0100
    Re: List Count Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-04-22 13:13 +0000
      Re: List Count Skip Montanaro <skip@pobox.com> - 2013-04-22 08:57 -0500
      Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-22 15:15 +0100
        Re: List Count Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-04-22 16:14 +0100
          Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-22 16:50 +0100
            Re: List Count Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-04-22 17:06 +0100
              Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-22 17:38 +0100
                Re: List Count Skip Montanaro <skip@pobox.com> - 2013-04-22 12:48 -0500
                  Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-22 20:22 +0100
                Re: List Count Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-04-22 21:18 +0100
                  Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-22 22:25 +0100
                    Re: List Count Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-04-23 00:06 +0100
                      Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-23 07:45 +0100
                    Re: List Count Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-04-22 23:28 +0000
                      Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-23 08:00 +0100
                Re: List Count Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-04-22 22:03 +0100
                  Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-22 22:32 +0100
                    Re: List Count Dave Angel <davea@davea.name> - 2013-04-22 21:47 -0400
                      Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-23 08:02 +0100
              Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-22 17:38 +0100
          Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-22 16:50 +0100
        Re: List Count Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-04-22 23:01 +0000
          Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-23 08:05 +0100
            Re: List Count Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-04-23 12:08 +0100
              Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-23 12:45 +0100
                Re: List Count Terry Jan Reedy <tjreedy@udel.edu> - 2013-04-23 15:01 -0400
            Re: List Count Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-04-23 14:49 +0000
              Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-23 17:57 +0100
                Re: List Count Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-04-23 18:45 +0100
                  Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-23 19:30 +0100
                    Re: List Count Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-04-23 20:16 +0100
                Re: List Count Terry Jan Reedy <tjreedy@udel.edu> - 2013-04-23 16:00 -0400
                  Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-23 21:41 +0100
                Re: List Count Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-04-23 21:38 +0100
                Re: List Count Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-04-24 01:59 +0000
                  Re: List Count Blind Anagram <blindanagram@nowhere.org> - 2013-04-24 10:01 +0100
    Re: List Count Peter Otten <__peter__@web.de> - 2013-04-22 15:22 +0200
    Re: List Count 88888 Dihedral <dihedral88888@googlemail.com> - 2013-04-22 06:36 -0700

Page 1 of 3  [1] 2 3  Next page →


#44051 — List Count

FromBlind Anagram <blindanagram@nowhere.org>
Date2013-04-22 12:58 +0100
SubjectList Count
Message-ID<sridnffI6YhxuOjMnZ2dnUVZ7tSdnZ2d@brightview.co.uk>
I would be grateful for any advice people can offer on the fastest way
to count items in a sub-sequence of a large list.

I have a list of boolean values that can contain many hundreds of
millions of elements for which I want to count the number of True values
in a sub-sequence, one from the start up to some value (say hi).

I am currently using:

   sieve[:hi].count(True)

but I believe this may be costly because it copies a possibly large part
of the sieve.

Ideally I would like to be able to use:

   sieve.count(True, hi)

where 'hi' sets the end of the count but this function is, sadly, not
available for lists.

The use of a bytearray with a memoryview object instead of a list solves
this particular problem but it is not a solution for me as it creates
more problems than it solves in other aspects of the program.

Can I assume that one possible solution would be to sub-class list and
create a C based extension to provide list.count(value, limit)?

Are there any other solutions that will avoid copying a large part of
the list?

[toc] | [next] | [standalone]


#44057

FromDave Angel <davea@davea.name>
Date2013-04-22 08:51 -0400
Message-ID<mailman.907.1366635127.3114.python-list@python.org>
In reply to#44051
On 04/22/2013 07:58 AM, Blind Anagram wrote:
> I would be grateful for any advice people can offer on the fastest way
> to count items in a sub-sequence of a large list.
>
> I have a list of boolean values that can contain many hundreds of
> millions of elements for which I want to count the number of True values
> in a sub-sequence, one from the start up to some value (say hi).
>
> I am currently using:
>
>     sieve[:hi].count(True)
>
> but I believe this may be costly because it copies a possibly large part
> of the sieve.
>
> Ideally I would like to be able to use:
>
>     sieve.count(True, hi)
>
> where 'hi' sets the end of the count but this function is, sadly, not
> available for lists.
>
> The use of a bytearray with a memoryview object instead of a list solves
> this particular problem but it is not a solution for me as it creates
> more problems than it solves in other aspects of the program.
>
> Can I assume that one possible solution would be to sub-class list and
> create a C based extension to provide list.count(value, limit)?
>
> Are there any other solutions that will avoid copying a large part of
> the list?
>

Instead of using the default slice notation, why not use 
itertools.islice() ?

Something like  (untested):

import itertools

it = itertools.islice(sieve, 0, hi)
sum(itertools.imap(bool, it))

I only broke it into two lines for clarity.  It could also be:

sum(itertools.imap(bool, itertools.islice(sieve, 0, hi)))

If you're using Python 3.x, say so, and I'm sure somebody can simplify 
these, since in Python 3, many functions already produce iterators 
instead of lists.


-- 
DaveA

[toc] | [prev] | [next] | [standalone]


#44060

FromBlind Anagram <blindanagram@nowhere.org>
Date2013-04-22 14:03 +0100
Message-ID<5175351D.9070102@nowhere.org>
In reply to#44057
On 22/04/2013 13:51, Dave Angel wrote:
> On 04/22/2013 07:58 AM, Blind Anagram wrote:
>> I would be grateful for any advice people can offer on the fastest way
>> to count items in a sub-sequence of a large list.
>>
>> I have a list of boolean values that can contain many hundreds of
>> millions of elements for which I want to count the number of True values
>> in a sub-sequence, one from the start up to some value (say hi).
>>
>> I am currently using:
>>
>>     sieve[:hi].count(True)
>>
>> but I believe this may be costly because it copies a possibly large part
>> of the sieve.
>>
>> Ideally I would like to be able to use:
>>
>>     sieve.count(True, hi)
>>
>> where 'hi' sets the end of the count but this function is, sadly, not
>> available for lists.
>>
>> The use of a bytearray with a memoryview object instead of a list solves
>> this particular problem but it is not a solution for me as it creates
>> more problems than it solves in other aspects of the program.
>>
>> Can I assume that one possible solution would be to sub-class list and
>> create a C based extension to provide list.count(value, limit)?
>>
>> Are there any other solutions that will avoid copying a large part of
>> the list?
>>
> 
> Instead of using the default slice notation, why not use
> itertools.islice() ?
> 
> Something like  (untested):
> 
> import itertools
> 
> it = itertools.islice(sieve, 0, hi)
> sum(itertools.imap(bool, it))
> 
> I only broke it into two lines for clarity.  It could also be:
> 
> sum(itertools.imap(bool, itertools.islice(sieve, 0, hi)))
> 
> If you're using Python 3.x, say so, and I'm sure somebody can simplify
> these, since in Python 3, many functions already produce iterators
> instead of lists.

Thanks, I'll look at these ideas.  And, yes, my interest is mainly in
Python 3.

[toc] | [prev] | [next] | [standalone]


#44061

FromBlind Anagram <blindanagram@nowhere.org>
Date2013-04-22 14:03 +0100
Message-ID<mailman.910.1366636010.3114.python-list@python.org>
In reply to#44057
On 22/04/2013 13:51, Dave Angel wrote:
> On 04/22/2013 07:58 AM, Blind Anagram wrote:
>> I would be grateful for any advice people can offer on the fastest way
>> to count items in a sub-sequence of a large list.
>>
>> I have a list of boolean values that can contain many hundreds of
>> millions of elements for which I want to count the number of True values
>> in a sub-sequence, one from the start up to some value (say hi).
>>
>> I am currently using:
>>
>>     sieve[:hi].count(True)
>>
>> but I believe this may be costly because it copies a possibly large part
>> of the sieve.
>>
>> Ideally I would like to be able to use:
>>
>>     sieve.count(True, hi)
>>
>> where 'hi' sets the end of the count but this function is, sadly, not
>> available for lists.
>>
>> The use of a bytearray with a memoryview object instead of a list solves
>> this particular problem but it is not a solution for me as it creates
>> more problems than it solves in other aspects of the program.
>>
>> Can I assume that one possible solution would be to sub-class list and
>> create a C based extension to provide list.count(value, limit)?
>>
>> Are there any other solutions that will avoid copying a large part of
>> the list?
>>
> 
> Instead of using the default slice notation, why not use
> itertools.islice() ?
> 
> Something like  (untested):
> 
> import itertools
> 
> it = itertools.islice(sieve, 0, hi)
> sum(itertools.imap(bool, it))
> 
> I only broke it into two lines for clarity.  It could also be:
> 
> sum(itertools.imap(bool, itertools.islice(sieve, 0, hi)))
> 
> If you're using Python 3.x, say so, and I'm sure somebody can simplify
> these, since in Python 3, many functions already produce iterators
> instead of lists.

Thanks, I'll look at these ideas.  And, yes, my interest is mainly in
Python 3.

[toc] | [prev] | [next] | [standalone]


#44062

FromSteven D'Aprano <steve+comp.lang.python@pearwood.info>
Date2013-04-22 13:13 +0000
Message-ID<5175377f$0$29977$c3e8da3$5496439d@news.astraweb.com>
In reply to#44051
On Mon, 22 Apr 2013 12:58:20 +0100, Blind Anagram wrote:

> I would be grateful for any advice people can offer on the fastest way
> to count items in a sub-sequence of a large list.
> 
> I have a list of boolean values that can contain many hundreds of
> millions of elements for which I want to count the number of True values
> in a sub-sequence, one from the start up to some value (say hi).
> 
> I am currently using:
> 
>    sieve[:hi].count(True)
> 
> but I believe this may be costly because it copies a possibly large part
> of the sieve.

Have you timed it? Because Python is a high-level language, it is rarely 
obvious what code will be fast. Yes, sieve[:hi] will copy the first hi 
entries, but that's likely to be fast, basically just a memcopy, unless 
sieve is huge and memory is short. In other words, unless your sieve is 
so huge that the operating system cannot find enough memory for it, 
making a copy is likely to be relatively insignificant.

I've just tried seven different techniques to "optimize" this, and the 
simplest, most obvious technique is by far the fastest. Here are the 
seven different code snippets I measured, with results:


sieve[:hi].count(True)
sum(sieve[:hi])
sum(islice(sieve, hi))
sum(x for x in islice(sieve, hi) if x)
sum(x for x in islice(sieve, hi) if x is True)
sum(1 for x in islice(sieve, hi) if x is True)
len(list(filter(None, islice(sieve, hi))))


Here's the code I used to time them. Just copy and paste into an 
interactive interpreter:

=== cut ===

import random
sieve = [random.random() < 0.5 for i in range(10**7)]

from timeit import Timer
setup = """from __main__ import sieve
from itertools import islice
hi = 7*10**6
"""

t1 = Timer("sieve[:hi].count(True)", setup)
t2 = Timer("sum(sieve[:hi])", setup)
t3 = Timer("sum(islice(sieve, hi))", setup)
t4 = Timer("sum(x for x in islice(sieve, hi) if x)", setup)
t5 = Timer("sum(x for x in islice(sieve, hi) if x is True)", setup)
t6 = Timer("sum(1 for x in islice(sieve, hi) if x is True)", setup)
t7 = Timer("len(list(filter(None, islice(sieve, hi))))", setup)

for t in (t1, t2, t3, t4, t5, t6, t7):
    print( min(t.repeat(number=10)) )

=== cut ===


On my computer, using Python 3.3, here are the timing results I get:

2.3714727330952883
7.96061935601756
7.230580328032374
10.080201900098473
11.544118068180978
9.216834562830627
3.499635103158653


Times shown are in seconds, and are for the best of three trials, each 
trial having 10 repetitions of the code being tested.

As you can see, clever tricks using sum are horrible pessimisations, the 
only thing that comes close to the obvious solution is the one using 
filter.

Although I have only tested a list with ten million items, not hundreds 
of millions, I don't expect that the results will be significantly 
different if you use a larger list, unless you are very short of memory.

[...]
> Can I assume that one possible solution would be to sub-class list and
> create a C based extension to provide list.count(value, limit)?

Of course. But don't optimize this until you know that you *need* to 
optimize it. Is it really a bottleneck in your code? There's no point in 
saving the 0.1 second it takes to copy the list if it takes 2 seconds to 
count the items regardless.


> Are there any other solutions that will avoid copying a large part of
> the list?

Yes, but they're slower.

Perhaps a better solution might be to avoid counting anything. If you can 
keep a counter, and each time you add a value to the list you update the 
counter, then getting the number of True values will be instantaneous.



-- 
Steven

[toc] | [prev] | [next] | [standalone]


#44071

FromSkip Montanaro <skip@pobox.com>
Date2013-04-22 08:57 -0500
Message-ID<mailman.915.1366639056.3114.python-list@python.org>
In reply to#44062
Numpy is a big improvement here.  In Py 2.7 I get this output if I run
Steven's benchmark:

2.10364603996
3.68471002579
4.01849389076
7.41974878311
10.4202470779
9.16782712936
3.36137390137

(confirming his results).  If I then run the numpy idiom for this:

########################
import random
from timeit import Timer

import numpy

sieve = numpy.array([random.random() < 0.5 for i in range(10**7)],
                    dtype=bool)

setup = """from __main__ import sieve
from itertools import islice
hi = 7*10**6
"""

t1 = Timer("(True == sieve[:hi]).sum()", setup)

print(min(t1.repeat(number=10)))
###########################

I get :

0.344316959381

It likely consumes less space as well, since it doesn't store Python
objects in the array.

Skip

[toc] | [prev] | [next] | [standalone]


#44074

FromBlind Anagram <blindanagram@nowhere.org>
Date2013-04-22 15:15 +0100
Message-ID<517545F7.5090209@nowhere.org>
In reply to#44062
On 22/04/2013 14:13, Steven D'Aprano wrote:
> On Mon, 22 Apr 2013 12:58:20 +0100, Blind Anagram wrote:
> 
>> I would be grateful for any advice people can offer on the fastest way
>> to count items in a sub-sequence of a large list.
>>
>> I have a list of boolean values that can contain many hundreds of
>> millions of elements for which I want to count the number of True values
>> in a sub-sequence, one from the start up to some value (say hi).
>>
>> I am currently using:
>>
>>    sieve[:hi].count(True)
>>
>> but I believe this may be costly because it copies a possibly large part
>> of the sieve.
> 
> Have you timed it? Because Python is a high-level language, it is rarely 
> obvious what code will be fast. Yes, sieve[:hi] will copy the first hi 
> entries, but that's likely to be fast, basically just a memcopy, unless 
> sieve is huge and memory is short. In other words, unless your sieve is 
> so huge that the operating system cannot find enough memory for it, 
> making a copy is likely to be relatively insignificant.
> 
> I've just tried seven different techniques to "optimize" this, and the 
> simplest, most obvious technique is by far the fastest. Here are the 
> seven different code snippets I measured, with results:
> 
> 
> sieve[:hi].count(True)
> sum(sieve[:hi])
> sum(islice(sieve, hi))
> sum(x for x in islice(sieve, hi) if x)
> sum(x for x in islice(sieve, hi) if x is True)
> sum(1 for x in islice(sieve, hi) if x is True)
> len(list(filter(None, islice(sieve, hi))))

Yes, I did time it and I agree with your results (where my tests overlap
with yours).

But when using a sub-sequence, I do suffer a significant reduction in
speed for a count when compared with count on the full list.  When the
list is small enough not to cause memory allocation issues this is about
30% on 100,000,000 items.  But when the list is 1,000,000,000 items, OS
memory allocation becomes an issue and the cost on my system rises to
over 600%.

I agree that this is not a big issue but it seems to me a high price to
pay for the lack of a sieve.count(value, limit), which I feel is a
useful function (given that memoryview operations are not available for
lists).

> Of course. But don't optimize this until you know that you *need* to 
> optimize it. Is it really a bottleneck in your code? There's no point in 
> saving the 0.1 second it takes to copy the list if it takes 2 seconds to 
> count the items regardless.
> 
>> Are there any other solutions that will avoid copying a large part of
>> the list?
> 
> Yes, but they're slower.
> 
> Perhaps a better solution might be to avoid counting anything. If you can 
> keep a counter, and each time you add a value to the list you update the 
> counter, then getting the number of True values will be instantaneous.

Creating the sieve is currently very fast as it is not done by adding
single items but by adding a large number of items at the same time
using a slice operation.  I could count the items in each slice as it is
added but this would add complexity that I would prefer to avoid because
the creation of the sieve is quite tricky to get right and I would
prefer not to fiddle with this.

Thank you (and others) for advice on this.

[toc] | [prev] | [next] | [standalone]


#44089

FromOscar Benjamin <oscar.j.benjamin@gmail.com>
Date2013-04-22 16:14 +0100
Message-ID<mailman.927.1366644147.3114.python-list@python.org>
In reply to#44074
On 22 April 2013 15:15, Blind Anagram <blindanagram@nowhere.org> wrote:
> On 22/04/2013 14:13, Steven D'Aprano wrote:
>> On Mon, 22 Apr 2013 12:58:20 +0100, Blind Anagram wrote:
>>
>>> I would be grateful for any advice people can offer on the fastest way
>>> to count items in a sub-sequence of a large list.
>>>
>>> I have a list of boolean values that can contain many hundreds of
>>> millions of elements for which I want to count the number of True values
>>> in a sub-sequence, one from the start up to some value (say hi).
>>>
>>> I am currently using:
>>>
>>>    sieve[:hi].count(True)
>>>
>>> but I believe this may be costly because it copies a possibly large part
>>> of the sieve.
[snip]
>
> But when using a sub-sequence, I do suffer a significant reduction in
> speed for a count when compared with count on the full list.  When the
> list is small enough not to cause memory allocation issues this is about
> 30% on 100,000,000 items.  But when the list is 1,000,000,000 items, OS
> memory allocation becomes an issue and the cost on my system rises to
> over 600%.

Have you tried using numpy? I find that it reduces the memory required
to store a list of bools by a factor of 4 on my 32 bit system. I would
expect that to be a factor of 8 on a 64 bit system:

>>> import sys
>>> a = [True] * 1000000
>>> sys.getsizeof(a)
4000036
>>> import numpy
>>> a = numpy.ndarray(1000000, bool)
>>> sys.getsizeof(a)  # This does not include the data buffer
40
>>> a.nbytes
1000000

The numpy array also has the advantage that slicing does not actually
copy the data (as has already been mentioned). On this system slicing
a numpy array has a 40 byte overhead regardless of the size of the
slice.

> I agree that this is not a big issue but it seems to me a high price to
> pay for the lack of a sieve.count(value, limit), which I feel is a
> useful function (given that memoryview operations are not available for
> lists).

It would be very easy to subclass list and add this functionality in
cython if you decide that you do need a builtin method.


Oscar

[toc] | [prev] | [next] | [standalone]


#44090

FromBlind Anagram <blindanagram@nowhere.org>
Date2013-04-22 16:50 +0100
Message-ID<51755C38.4000204@nowhere.org>
In reply to#44089
On 22/04/2013 16:14, Oscar Benjamin wrote:

> On 22 April 2013 15:15, Blind Anagram <blindanagram@nowhere.org> wrote:
>> On 22/04/2013 14:13, Steven D'Aprano wrote:
>>> On Mon, 22 Apr 2013 12:58:20 +0100, Blind Anagram wrote:
>>>
>>>> I would be grateful for any advice people can offer on the fastest way
>>>> to count items in a sub-sequence of a large list.
>>>>
>>>> I have a list of boolean values that can contain many hundreds of
>>>> millions of elements for which I want to count the number of True values
>>>> in a sub-sequence, one from the start up to some value (say hi).
>>>>
>>>> I am currently using:
>>>>
>>>>    sieve[:hi].count(True)
>>>>
>>>> but I believe this may be costly because it copies a possibly large part
>>>> of the sieve.
> [snip]
>>
>> But when using a sub-sequence, I do suffer a significant reduction in
>> speed for a count when compared with count on the full list.  When the
>> list is small enough not to cause memory allocation issues this is about
>> 30% on 100,000,000 items.  But when the list is 1,000,000,000 items, OS
>> memory allocation becomes an issue and the cost on my system rises to
>> over 600%.
> 
> Have you tried using numpy? I find that it reduces the memory required
> to store a list of bools by a factor of 4 on my 32 bit system. I would
> expect that to be a factor of 8 on a 64 bit system:
> 
>>>> import sys
>>>> a = [True] * 1000000
>>>> sys.getsizeof(a)
> 4000036
>>>> import numpy
>>>> a = numpy.ndarray(1000000, bool)
>>>> sys.getsizeof(a)  # This does not include the data buffer
> 40
>>>> a.nbytes
> 1000000
> 
> The numpy array also has the advantage that slicing does not actually
> copy the data (as has already been mentioned). On this system slicing
> a numpy array has a 40 byte overhead regardless of the size of the
> slice.
> 
>> I agree that this is not a big issue but it seems to me a high price to
>> pay for the lack of a sieve.count(value, limit), which I feel is a
>> useful function (given that memoryview operations are not available for
>> lists).
> 
> It would be very easy to subclass list and add this functionality in
> cython if you decide that you do need a builtin method.

Thanks Oscar, I'll take a look at this.

But I was really wondering if there was a simple solution that worked
without people having to add libraries to their basic Python installations.

As I have never tried building an extension with cython, I am inclined
to try this as a learning exercise if nothing else.

[toc] | [prev] | [next] | [standalone]


#44092

FromOscar Benjamin <oscar.j.benjamin@gmail.com>
Date2013-04-22 17:06 +0100
Message-ID<mailman.929.1366646795.3114.python-list@python.org>
In reply to#44090
On 22 April 2013 16:50, Blind Anagram <blindanagram@nowhere.org> wrote:
>>
>> It would be very easy to subclass list and add this functionality in
>> cython if you decide that you do need a builtin method.
[snip]
>
> But I was really wondering if there was a simple solution that worked
> without people having to add libraries to their basic Python installations.

There are simple solutions and some have already been listed. You are
attempting to push your program to the limit of your hardware
capabilities and it's natural that in a high-level language you'll
often want special libraries for that.

I don't know what your application is but I would say that my first
port of call here would be to consider a different algorithmic
approach. An obvious question would be about the sparsity of this data
structure. How frequent are the values that you are trying to count?
Would it make more sense to store a list of their indices?

If the problem needs to be solved the way that you are currently doing
it and the available methods are not fast enough then you will need to
consider additional libraries.

>
> As I have never tried building an extension with cython, I am inclined
> to try this as a learning exercise if nothing else.

I definitely recommend this over writing a C extension directly.


Oscar

[toc] | [prev] | [next] | [standalone]


#44095

FromBlind Anagram <blindanagram@nowhere.org>
Date2013-04-22 17:38 +0100
Message-ID<51756769.20206@nowhere.org>
In reply to#44092
On 22/04/2013 17:06, Oscar Benjamin wrote:
> On 22 April 2013 16:50, Blind Anagram <blindanagram@nowhere.org> wrote:
>>>
>>> It would be very easy to subclass list and add this functionality in
>>> cython if you decide that you do need a builtin method.
> [snip]
>>
>> But I was really wondering if there was a simple solution that worked
>> without people having to add libraries to their basic Python installations.
> 
> There are simple solutions and some have already been listed. You are
> attempting to push your program to the limit of your hardware
> capabilities and it's natural that in a high-level language you'll
> often want special libraries for that.

Hi Oscar

Yes, but it is a tribute to Python that I can do this quite fast for
huge lists provided that I only count on the full list.

And, unless I have completely misunderstood Python internals, it would
probably be just as fast on a sub-sequence if I had a list.count(value,
limit) function (however, I admit that I could be wrong here since the
fact that count on lists does not offer this may mean that it is not as
easy to implement as it might seem).

> I don't know what your application is but I would say that my first
> port of call here would be to consider a different algorithmic
> approach. An obvious question would be about the sparsity of this data
> structure. How frequent are the values that you are trying to count?
> Would it make more sense to store a list of their indices?

Actually it is no more than a simple prime sieve implemented as a Python
class (and, yes, I realize that there are plenty of these around).

> If the problem needs to be solved the way that you are currently doing
> it and the available methods are not fast enough then you will need to
> consider additional libraries.
>>
>> As I have never tried building an extension with cython, I am inclined
>> to try this as a learning exercise if nothing else.
> 
> I definitely recommend this over writing a C extension directly.

Thanks again - I will definitely look at this.

   Brian

[toc] | [prev] | [next] | [standalone]


#44101

FromSkip Montanaro <skip@pobox.com>
Date2013-04-22 12:48 -0500
Message-ID<mailman.933.1366652894.3114.python-list@python.org>
In reply to#44095
> But I was really wondering if there was a simple solution that worked
> without people having to add libraries to their basic Python installations.

I think installing numpy is approximately

    pip install numpy

assuming you have write access to your site-packages directory.  If
not, install using --prefix and set PYTHONPATH accordingly.

I forgot that Python also has an array module.  With numpy available,
mature, and well-supported, I imagine it doesn't get much love these
days though.  Still, I gave it a whirl:

#######################################
import random
import array
from timeit import Timer

import numpy

stuff = [random.random() < 0.5 for i in range(10**7)]
sieve1 = numpy.array(stuff, dtype=bool)
sieve2 = array.array('B', stuff)

setup = """from __main__ import sieve1, sieve2
from itertools import islice
hi = 7*10**6
"""

t1 = Timer("(True == sieve1[:hi]).sum()", setup)
t2 = Timer("sieve2[:hi].count(True)", setup)
# t3 = Timer("sum(islice(sieve, hi))", setup)
# t4 = Timer("sum(x for x in islice(sieve, hi) if x)", setup)
# t5 = Timer("sum(x for x in islice(sieve, hi) if x is True)", setup)
# t6 = Timer("sum(1 for x in islice(sieve, hi) if x is True)", setup)
# t7 = Timer("len(list(filter(None, islice(sieve, hi))))", setup)

print(min(t1.repeat(number=10)))
print(min(t2.repeat(number=10)))
# for t in (t1, t2, t3, t4, t5, t6, t7):
#     print( min(t.repeat(number=10)) )
#######################################

Performance was not all that impressive:

0.340315103531
5.42102503777

Still, you might fiddle around with it a bit.  Perhaps unsigned ints
instead of unsigned bytes will provide more efficient counting...

Skip

[toc] | [prev] | [next] | [standalone]


#44105

FromBlind Anagram <blindanagram@nowhere.org>
Date2013-04-22 20:22 +0100
Message-ID<c6adnRfaHflnEOjMnZ2dnUVZ8sSdnZ2d@brightview.co.uk>
In reply to#44101
On 22/04/2013 18:48, Skip Montanaro wrote:
>> But I was really wondering if there was a simple solution that worked
>> without people having to add libraries to their basic Python installations.
> 
> I think installing numpy is approximately
> 
>     pip install numpy
> 
> assuming you have write access to your site-packages directory.  If
> not, install using --prefix and set PYTHONPATH accordingly.
> 
> I forgot that Python also has an array module.  With numpy available,
> mature, and well-supported, I imagine it doesn't get much love these
> days though.  Still, I gave it a whirl:
> 
> #######################################
> import random
> import array
> from timeit import Timer
> 
> import numpy
> 
> stuff = [random.random() < 0.5 for i in range(10**7)]
> sieve1 = numpy.array(stuff, dtype=bool)
> sieve2 = array.array('B', stuff)
> 
> setup = """from __main__ import sieve1, sieve2
> from itertools import islice
> hi = 7*10**6
> """
> 
> t1 = Timer("(True == sieve1[:hi]).sum()", setup)
> t2 = Timer("sieve2[:hi].count(True)", setup)
> # t3 = Timer("sum(islice(sieve, hi))", setup)
> # t4 = Timer("sum(x for x in islice(sieve, hi) if x)", setup)
> # t5 = Timer("sum(x for x in islice(sieve, hi) if x is True)", setup)
> # t6 = Timer("sum(1 for x in islice(sieve, hi) if x is True)", setup)
> # t7 = Timer("len(list(filter(None, islice(sieve, hi))))", setup)
> 
> print(min(t1.repeat(number=10)))
> print(min(t2.repeat(number=10)))
> # for t in (t1, t2, t3, t4, t5, t6, t7):
> #     print( min(t.repeat(number=10)) )
> #######################################
> 
> Performance was not all that impressive:
> 
> 0.340315103531
> 5.42102503777
> 
> Still, you might fiddle around with it a bit.  Perhaps unsigned ints
> instead of unsigned bytes will provide more efficient counting...

I spent a lot of time comparing python arrays and lists but found that
lists were always much faster in this application.

I do have numpy installed but I remember that when I did this (some time
ago) it was far from easy with Python 3.x running natively on Windows x64.

  Brian

[toc] | [prev] | [next] | [standalone]


#44108

FromOscar Benjamin <oscar.j.benjamin@gmail.com>
Date2013-04-22 21:18 +0100
Message-ID<mailman.936.1366661918.3114.python-list@python.org>
In reply to#44095
On 22 April 2013 17:38, Blind Anagram <blindanagram@nowhere.org> wrote:
> On 22/04/2013 17:06, Oscar Benjamin wrote:
>
>> I don't know what your application is but I would say that my first
>> port of call here would be to consider a different algorithmic
>> approach. An obvious question would be about the sparsity of this data
>> structure. How frequent are the values that you are trying to count?
>> Would it make more sense to store a list of their indices?
>
> Actually it is no more than a simple prime sieve implemented as a Python
> class (and, yes, I realize that there are plenty of these around).

If I understand correctly, you have a list of roughly a billion
True/False values indicating which integers are prime and which are
not. You would like to discover how many prime numbers there are
between two numbers a and b. You currently do this by counting the
number of True values in your list between the indices a and b.

If my description is correct then I would definitely consider using a
different algorithmic approach. The density of primes from 1 to 1
billlion is about 5%. Storing the prime numbers themselves in a sorted
list would save memory and allow a potentially more efficient way of
counting the number of primes within some interval.

To see how it saves memory (on a 64 bit system):

$ python
Python 2.7.3 (default, Sep 26 2012, 21:51:14)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> a = ([True] + [False]*19) * 50000
>>> len(a)
1000000
>>> sys.getsizeof(a)
8000072
>>> a = list(range(50000))
>>> sys.getsizeof(a)
450120
>>> sum(sys.getsizeof(x) for x in a)
1200000

So you're using about 1/5th of the memory with a list of primes
compared to a list of True/False values. Further savings would be
possible if you used an array to store the primes as 64 bit integers.
In this case it would take about 400MB to store all the primes up to 1
billion.

The more efficient way of counting the primes would then be to use the
bisect module. This gives you a way of counting the primes between a
and b with a cost that is logarithmic in the total number of primes
stored rather than linear in the size of the range (e.g. b-a). For
large enough primes/ranges this is certain to be faster. Whether it
actually works that way for your numbers I can't say.


Oscar

[toc] | [prev] | [next] | [standalone]


#44116

FromBlind Anagram <blindanagram@nowhere.org>
Date2013-04-22 22:25 +0100
Message-ID<_umdnUiJWp9yN-jMnZ2dnUVZ7o-dnZ2d@brightview.co.uk>
In reply to#44108
On 22/04/2013 21:18, Oscar Benjamin wrote:
> On 22 April 2013 17:38, Blind Anagram <blindanagram@nowhere.org> wrote:

[snip]

> If my description is correct then I would definitely consider using a
> different algorithmic approach. The density of primes from 1 to 1
> billlion is about 5%. Storing the prime numbers themselves in a sorted
> list would save memory and allow a potentially more efficient way of
> counting the number of primes within some interval.

That is correct but I need to say that the lengths I have been
describing are limiting cases - almost all of the time the sieve length
will be quite small.

But I was still interested to see if I could push the limit without
changing the essential simplicity of the sieve.

And here the cost of creating the slice (which I have measured) set me
wondering why a list.count(value, limit) function did not exist.

I also wondered whether I had missed any obvious way of avoiding the
slicing cost (intellectually it seemed wrong to me to have to copy the
list in order to count items within it).
[snip]

> 
> So you're using about 1/5th of the memory with a list of primes
> compared to a list of True/False values. Further savings would be
> possible if you used an array to store the primes as 64 bit integers.
> In this case it would take about 400MB to store all the primes up to 1
> billion.

I have looked at solutions based on listing primes and here I have found
that they are very much slower than my existing solution when the sieve
is not large (which is the majority use case).

I have also tried counting using a loop such as:

  while i < limit:
    i = sieve.index(1, i) + 1
    cnt += 1

but this is slower than count even on huge lists.

Thank you again for your advice.

   Brian


[toc] | [prev] | [next] | [standalone]


#44122

FromOscar Benjamin <oscar.j.benjamin@gmail.com>
Date2013-04-23 00:06 +0100
Message-ID<mailman.940.1366672027.3114.python-list@python.org>
In reply to#44116
On 22 April 2013 22:25, Blind Anagram <blindanagram@nowhere.org> wrote:
> On 22/04/2013 21:18, Oscar Benjamin wrote:
>> On 22 April 2013 17:38, Blind Anagram <blindanagram@nowhere.org> wrote:
>
> I also wondered whether I had missed any obvious way of avoiding the
> slicing cost (intellectually it seemed wrong to me to have to copy the
> list in order to count items within it).
 [snip]
>
> I have looked at solutions based on listing primes and here I have found
> that they are very much slower than my existing solution when the sieve
> is not large (which is the majority use case).

What matters is not so much the size of the sieve but the size of the
interval you want to query. You say that slicing cost is somehow
significant which suggests to me that it's not a small interval. An
approach using a sorted list of primes and bisect would have a cost
that is independent of the size of the interval (and depends only
logarithmically on the size of the sieve).


Oscar

[toc] | [prev] | [next] | [standalone]


#44149

FromBlind Anagram <blindanagram@nowhere.org>
Date2013-04-23 07:45 +0100
Message-ID<ipOdnSDret2ls-vMnZ2dnUVZ8rGdnZ2d@brightview.co.uk>
In reply to#44122
On 23/04/2013 00:06, Oscar Benjamin wrote:
> On 22 April 2013 22:25, Blind Anagram <blindanagram@nowhere.org> wrote:
>> On 22/04/2013 21:18, Oscar Benjamin wrote:
>>> On 22 April 2013 17:38, Blind Anagram <blindanagram@nowhere.org> wrote:
>>
>> I also wondered whether I had missed any obvious way of avoiding the
>> slicing cost (intellectually it seemed wrong to me to have to copy the
>> list in order to count items within it).
>  [snip]
>>
>> I have looked at solutions based on listing primes and here I have found
>> that they are very much slower than my existing solution when the sieve
>> is not large (which is the majority use case).
> 
> What matters is not so much the size of the sieve but the size of the
> interval you want to query. You say that slicing cost is somehow
> significant which suggests to me that it's not a small interval. An
> approach using a sorted list of primes and bisect would have a cost
> that is independent of the size of the interval (and depends only
> logarithmically on the size of the sieve).

The issue here is that the prime number count is only one of the
features of the class so I would have to essentially rewrite it to use
the technique you suggest.

And I found in my experiments that creating the sieve in the form of a
list of primes (either directly or by converting a linear sieve) is a
great deal slower than a simple sieve for the majority use cases where
the sieve is not huge.

I don't want to take up peoples time but I am willing to expose the code
to anyone who has an interest as I am sure that it has wrinkles that
others with more experience with Python would find.

But I would also be interested to discover whether there is a rationale
for not offering list.count(value, limit) as this seems to me a useful
function which I have found myself wanting more than once now.

Thank you again for your input, which I appreciate.

   Brian

[toc] | [prev] | [next] | [standalone]


#44123

FromSteven D'Aprano <steve+comp.lang.python@pearwood.info>
Date2013-04-22 23:28 +0000
Message-ID<5175c791$0$29977$c3e8da3$5496439d@news.astraweb.com>
In reply to#44116
On Mon, 22 Apr 2013 22:25:50 +0100, Blind Anagram wrote:

> I have looked at solutions based on listing primes and here I have found
> that they are very much slower than my existing solution when the sieve
> is not large (which is the majority use case).

Yes. This is hardly surprising. Algorithms suitable for dealing with the 
first million primes are not suitable for dealing with the first trillion 
primes, and vice versa. We like to pretend that computer programming is 
an abstraction, and for small enough data we often can get away with 
that, but like all abstractions eventually it breaks and the cost of 
dealing with real hardware becomes significant.

But I must ask, given that the primes are so widely distributed, why are 
you storing them in a list instead of a sparse array (i.e. a dict)? There 
are 50,847,534 primes less than or equal to 1,000,000,000, so you are 
storing roughly 18 False values for every True value. That ratio will 
only get bigger. With a billion entries, you are using 18 times more 
memory than necessary.



-- 
Steven

[toc] | [prev] | [next] | [standalone]


#44154

FromBlind Anagram <blindanagram@nowhere.org>
Date2013-04-23 08:00 +0100
Message-ID<jfudneV1BOEprOvMnZ2dnUVZ8jmdnZ2d@brightview.co.uk>
In reply to#44123
On 23/04/2013 00:28, Steven D'Aprano wrote:
> On Mon, 22 Apr 2013 22:25:50 +0100, Blind Anagram wrote:
> 
>> I have looked at solutions based on listing primes and here I have found
>> that they are very much slower than my existing solution when the sieve
>> is not large (which is the majority use case).
> 
> Yes. This is hardly surprising. Algorithms suitable for dealing with the 
> first million primes are not suitable for dealing with the first trillion 
> primes, and vice versa. We like to pretend that computer programming is 
> an abstraction, and for small enough data we often can get away with 
> that, but like all abstractions eventually it breaks and the cost of 
> dealing with real hardware becomes significant.
> 
> But I must ask, given that the primes are so widely distributed, why are 
> you storing them in a list instead of a sparse array (i.e. a dict)? There 
> are 50,847,534 primes less than or equal to 1,000,000,000, so you are 
> storing roughly 18 False values for every True value. That ratio will 
> only get bigger. With a billion entries, you are using 18 times more 
> memory than necessary.

Because the majority use case for my Prime class is for a sieve that is
not large.  I am just pushing the envelope for a minority use case so
that it still works for huge sieves, albeit inefficiently.

I accept it is inefficient, but the fact remains that I can produce a
sieve that can yield and count a billion primes in a reasonable time but
this fails when I wish to count on a part of the sieve because this can
double the memory requirement for the lack of a list.count(value, limit)
function.

I would not dream of doing this job by copying a slice in any other
language that I have used so I was simply asking for advice to discover
whether this copy could be avoided whilst staying with the simple sieve
design.

Thank you for your input.

   Brian

[toc] | [prev] | [next] | [standalone]


#44113

FromOscar Benjamin <oscar.j.benjamin@gmail.com>
Date2013-04-22 22:03 +0100
Message-ID<mailman.938.1366664643.3114.python-list@python.org>
In reply to#44095
On 22 April 2013 21:18, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
> On 22 April 2013 17:38, Blind Anagram <blindanagram@nowhere.org> wrote:
>> On 22/04/2013 17:06, Oscar Benjamin wrote:
>>
>>> I don't know what your application is but I would say that my first
>>> port of call here would be to consider a different algorithmic
>>> approach. An obvious question would be about the sparsity of this data
>>> structure. How frequent are the values that you are trying to count?
>>> Would it make more sense to store a list of their indices?
>>
>> Actually it is no more than a simple prime sieve implemented as a Python
>> class (and, yes, I realize that there are plenty of these around).
>
> If I understand correctly, you have a list of roughly a billion
> True/False values indicating which integers are prime and which are
> not. You would like to discover how many prime numbers there are
> between two numbers a and b. You currently do this by counting the
> number of True values in your list between the indices a and b.
>
> If my description is correct then I would definitely consider using a
> different algorithmic approach. The density of primes from 1 to 1
> billlion is about 5%. Storing the prime numbers themselves in a sorted
> list would save memory and allow a potentially more efficient way of
> counting the number of primes within some interval.

In fact it is probably quicker if you don't mind using all that memory
to just store the cumulative sum of your prime True/False indicator
list. This would be the prime counting function pi(n). You can then
count the primes between a and b in constant time with pi[b] - pi[a].


Oscar

[toc] | [prev] | [next] | [standalone]


Page 1 of 3  [1] 2 3  Next page →

Back to top | Article view | comp.lang.python


csiph-web