PEP 450 Adding a statistics module to Python

Started by	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
First post	2013-08-10 01:10 +0000
Last post	2013-08-17 21:57 -0600
Articles	9 on this page of 29 — 17 participants

  PEP 450 Adding a statistics module to Python Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-08-10 01:10 +0000
    Re: PEP 450 Adding a statistics module to Python Skip Montanaro <skip@pobox.com> - 2013-08-09 22:14 -0500
      Re: PEP 450 Adding a statistics module to Python Roy Smith <roy@panix.com> - 2013-08-10 07:50 -0400
        Re: PEP 450 Adding a statistics module to Python Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-08-10 13:23 +0100
          Re: PEP 450 Adding a statistics module to Python Roy Smith <roy@panix.com> - 2013-08-10 08:43 -0400
            Re: PEP 450 Adding a statistics module to Python Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-08-10 14:17 +0100
    Re: PEP 450 Adding a statistics module to Python Ben Finney <ben+python@benfinney.id.au> - 2013-08-10 15:05 +1000
    Re: PEP 450 Adding a statistics module to Python Stefan Behnel <stefan_ml@behnel.de> - 2013-08-10 09:55 +0200
    Re: PEP 450 Adding a statistics module to Python Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-08-10 16:19 -0400
    Re: PEP 450 Adding a statistics module to Python Skip Montanaro <skip@pobox.com> - 2013-08-11 06:50 -0500
      Re: PEP 450 Adding a statistics module to Python Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-08-11 13:33 +0000
      Re: PEP 450 Adding a statistics module to Python Roy Smith <roy@panix.com> - 2013-08-11 10:02 -0400
        Re: PEP 450 Adding a statistics module to Python duncan smith <buzzard@invalid.invalid> - 2013-08-11 16:44 +0100
    Re: PEP 450 Adding a statistics module to Python Nicholas Cole <nicholas.cole@gmail.com> - 2013-08-11 13:27 +0100
    Re: PEP 450 Adding a statistics module to Python Wolfgang Keller <feliphil@gmx.net> - 2013-08-13 20:14 +0200
      Re: PEP 450 Adding a statistics module to Python Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-08-13 19:44 +0100
      Re: PEP 450 Adding a statistics module to Python Steven D'Aprano <steve@pearwood.info> - 2013-08-14 06:21 +0000
    Re: PEP 450 Adding a statistics module to Python CM <cmpython@gmail.com> - 2013-08-14 21:26 -0700
      RE: PEP 450 Adding a statistics module to Python "Prasad, Ramit" <ramit.prasad@jpmorgan.com.dmarc.invalid> - 2013-08-16 19:17 +0000
    Re: PEP 450 Adding a statistics module to Python taldcroft@cfa.harvard.edu - 2013-08-16 08:50 -0700
      Re: PEP 450 Adding a statistics module to Python chris.barker@noaa.gov - 2013-08-16 09:31 -0700
        Re: PEP 450 Adding a statistics module to Python Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-08-16 18:15 +0100
          Re: PEP 450 Adding a statistics module to Python chris.barker@noaa.gov - 2013-08-16 12:00 -0700
            Re: PEP 450 Adding a statistics module to Python Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2013-08-16 20:41 +0100
        Re: PEP 450 Adding a statistics module to Python Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-08-16 18:51 +0000
          Re: PEP 450 Adding a statistics module to Python chris.barker@noaa.gov - 2013-08-16 12:48 -0700
      Re: PEP 450 Adding a statistics module to Python Roy Smith <roy@panix.com> - 2013-08-16 22:06 -0400
        Re: PEP 450 Adding a statistics module to Python Josef Pktd <josef.pktd@gmail.com> - 2013-08-17 05:13 -0700
    Re: PEP 450 Adding a statistics module to Python Jason Friedman <jsf80238@gmail.com> - 2013-08-17 21:57 -0600

#52594

From	chris.barker@noaa.gov
Date	2013-08-16 09:31 -0700
Message-ID	<cb7878a8-7700-46a8-b4e8-3f12fdf2cf19@googlegroups.com>
In reply to	#52593

> > I am seeking comments on PEP 450, Adding a statistics module to Python's 

The trick here is that numpy really is the "right" way to do this stuff.

I like to say:
"crunching numbers in python without numpy is like doing text processing without using the string object"

What this is really an argument for is a numpy-lite in the standard library, which could be used to build these sorts of things on. But that's been rejected before...


A few other comments:

1) the numpy folks have been VERY good at providing binaries for Windows and OS-X -- easy point and click installing.

2) I hope we're almost there with standardizing pip and binary wheels, at which point pip install will be painless.

even before (2) -- pip install works fine anywhere the system is set up to build python extensions (granted, not a given on Windows and Mac, but pretty likely on Linux) -- the idea that running pip install wrote out a lot of text (but worked!) is somehow a barrier to entry is absurd -- anyone building their own stuff on Linux is used to that.

(NOTE: you only need Fortran if you want highly optimized linear algebra stuff -- clearly this use-case is for folks that don't need that!)

3) The fact that the numpy functions have optional arguments is NOT a problem -- the simple calls work as expected -- no one who doesn't need the optional arguments has to figure them out -- and if they do need them, they had better be there!

All that being said -- if you do decide to do this, please use a PEP 3118 (enhanced buffer) supporting data type (probably array.array) -- compatibility with numpy and other packages for crunching numbers is very nice.

If someone decides to build a stand-alone stats package -- building it on a ndarray-lite (PEP 3118 compatible) object would be a nice way to go.


One other point -- for performance reasons, it would be nice to have some compiled code in there -- this adds incentive to put it in the stdlib -- external packages that need compiling are what make numpy unacceptable to some folks.

 
-Chris

#52599

From	Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date	2013-08-16 18:15 +0100
Message-ID	<mailman.3.1376673382.23369.python-list@python.org>
In reply to	#52594

On 16 August 2013 17:31,  <chris.barker@noaa.gov> wrote:
>> > I am seeking comments on PEP 450, Adding a statistics module to Python's
>
> The trick here is that numpy really is the "right" way to do this stuff.

Although it doesn't mention this in the PEP, a significant point that
is worth bearing in mind is that numpy is only for CPython, not PyPy,
IronPython, Jython etc. See here for a recent update  on the status of
NumPyPy:
http://morepypy.blogspot.co.uk/2013_08_01_archive.html

> I like to say:
> "crunching numbers in python without numpy is like doing text processing without using the string object"

It depends what kind of number crunching you're doing. Numpy gives
efficient C-style number crunching but it doesn't really give
efficient ways to take advantage of the areas where Python is better
than C such as having efficient infinite range integers, and decimal
and rational arithmetic in the standard library. You can use
dtype=object to use all these things with numpy arrays but in my
experience this is typically not faster than working with Python lists
and is only really useful when you want numpy's multi-dimensional,
view-type slicing.

Here's an example where Steven's statistics module is more accurate:

    >>> numpy.mean([-1e60, 100, 100, 1e60])
    0.0
    >>> statistics.mean([-1e60, 100, 100, 1e60])
    50.0

Okay so that's a toy example but it illustrates that Steven is aiming
for ultra-high accuracy where numpy is primarily aimed at speed. He's
also tried to ensure that it works properly with e.g. fractions:

    >>> from fractions import Fraction as F
    >>> data = [F('1/7'), F('3/7')]
    >>> numpy.mean(data)
    0.2857142857142857
    >>> statistics.mean(data)
    Fraction(2, 7)

and decimals:

    >>> from decimal import Decimal as D
    >>> data = [D('0.1'), D('0.01'), D('0.001')]
    >>> numpy.mean(data)
        ....
    TypeError: unsupported operand type(s) for /: 'decimal.Decimal' and 'float'
    >>> statistics.mean(data)
    Decimal('0.037')
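
For readers wondering how a pure-Python mean can stay exact across floats, fractions and decimals, here is a minimal sketch. It assumes exact rational accumulation with fractions.Fraction; the name exact_mean is hypothetical, and this illustrates the idea only. It is not claimed to be the code in the PEP 450 reference implementation.

```python
from decimal import Decimal
from fractions import Fraction


def exact_mean(data):
    """Return the mean of data as an exact Fraction (hypothetical helper)."""
    values = list(data)
    if not values:
        raise ValueError("exact_mean requires at least one data point")
    # Fraction(x) is exact for int, float, Decimal and Fraction inputs,
    # so nothing is rounded until the caller converts the result.
    total = sum(Fraction(x) for x in values)
    return total / len(values)


print(float(exact_mean([-1e60, 100, 100, 1e60])))
print(exact_mean([Fraction(1, 7), Fraction(3, 7)]))
print(exact_mean([Decimal("0.1"), Decimal("0.01"), Decimal("0.001")]))
```

The price is speed: rational arithmetic on every element is far slower than a C loop over doubles, which is the trade-off being discussed in this thread.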

> What this is really an argument for is a numpy-lite in the standard library, which could be used to build these sorts of things on. But that's been rejected before...

If it's a numpy-lite then it's a numpy-ultra-lite. It really doesn't
provide much of what numpy provides. I would describe it as a Pythonic
implementation of elementary statistical computation rather than a
numpy-lite.

[snip]
>
> All that being said -- if you do decide to do this, please use a PEP 3118 (enhanced buffer) supporting data type (probably array.array) -- compatibility with numpy and other packages for crunching numbers is very nice.
>
> If someone decides to build a stand-alone stats package -- building it on a ndarray-lite (PEP 3118 compatible) object would be a nice way to go.

Why? Yes I'd also like an ndarray-lite or rather an ultra-lite
1-dimensional version but why would it be useful for the statistics
module over using standard Python containers? Note that numpy arrays
do work with the reference implementation of the statistics module
(they're just treated as iterables):

    >>> import numpy
    >>> import statistics
    >>> statistics.mean(numpy.array([1, 2, 3]))
    2.0
    >>> statistics.mean(numpy.array([[1, 2, 3], [4, 5, 6]]))
    array([ 2.5,  3.5,  4.5])

> One other point -- for performance reasons, it would be nice to have some compiled code in there -- this adds incentive to put it in the stdlib -- external packages that need compiling are what make numpy unacceptable to some folks.

It might be good to have a C accelerator one day but actually I think
the pure-Python-ness of it is a strong reason to have it since it
provides accurate statistics functions to all Python implementations
(unlike numpy) at no additional cost.

Oscar

#52603

From	chris.barker@noaa.gov
Date	2013-08-16 12:00 -0700
Message-ID	<55530414-1cf0-46fa-bdce-890d8679b292@googlegroups.com>
In reply to	#52599

On Friday, August 16, 2013 10:15:52 AM UTC-7, Oscar Benjamin wrote:
> On 16 August 2013 17:31,  <chris.barker@noaa.gov> wrote:
> Although it doesn't mention this in the PEP, a significant point that
> is worth bearing in mind is that numpy is only for CPython, not PyPy,
> IronPython, Jython etc. See here for a recent update on the status of

It does mention it, though I think not the additional implementations by name. And yes, the lack of numpy on the other implementations is a major limitation.

> > "crunching numbers in python without numpy is like doing text processing without using the string object"
> 
> It depends what kind of number crunching you're doing.

As it depends on what kind of text processing you're doing... you could go a long way with a pure-python sequence-of-abstract-characters library, but it would be painfully slow -- no one would even try.

I guess there are more people working with, say, hundreds of numbers, than people trying to process an equally tiny amount of text...but this is a digression.

My point about that is that you can only reasonably do string processing with python because python has the concept of a string, not just an arbitrary sequence of characters, and not just for speed's sake, but for the nice semantics.

Anyone that has used an array-oriented language or library is likely to get addicted to the idea that an array of numbers as a first class concept is really, really helpful, for both performance and semantics.

> Numpy gives efficient C-style number crunching

which is by far the most common case. Also, a properly designed algorithm may well need to know something about the internal storage/processing of the data type -- i.e. the best way to compute a given statistic for floating point may not be the same as for integers (or decimal, or...). Maybe you can get a good one that works for most, but....

> You can use  dtype=object to use all these things with numpy arrays but in my 
> experience this is typically not faster than working with Python lists

That's quite true. In fact, often slower.

> and is only really useful when you want numpy's multi-dimensional, 
> view-type slicing.

which is very useful indeed!

> Here's an example where Steven's statistics module is more accurate:

>     >>> numpy.mean([-1e60, 100, 100, 1e60])
>     0.0
>     >>> statistics.mean([-1e60, 100, 100, 1e60])
>     50.0

the wonders of floating point arithmetic! -- but this looks like more of an argument for a better algorithm in numpy than a reason to have something in the stdlib -- in fact, that's been discussed lately: there is talk of using compensated summation in the numpy sum() method -- not sure of the status.
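
To make "compensated summation" concrete, here is a minimal sketch using the Neumaier variant, chosen because plain Kahan summation still loses the two 100s on this particular input. The function names are hypothetical, and nothing here is a claim about what numpy does or plans to do; math.fsum from the standard library is included as an exact reference.

```python
import math


def naive_sum(data):
    """Plain left-to-right float accumulation."""
    total = 0.0
    for x in data:
        total += x
    return total


def neumaier_sum(data):
    """Compensated summation (Neumaier variant): track the rounding error."""
    total = 0.0
    compensation = 0.0  # running sum of the low-order bits lost so far
    for x in data:
        t = total + x
        if abs(total) >= abs(x):
            compensation += (total - t) + x   # low bits of x were lost
        else:
            compensation += (x - t) + total   # low bits of total were lost
        total = t
    return total + compensation


data = [-1e60, 100, 100, 1e60]
print(naive_sum(data) / len(data))
print(neumaier_sum(data) / len(data))
print(math.fsum(data) / len(data))
```

The naive loop is written out by hand rather than calling the built-in sum(), since recent CPython versions already apply compensation to float sums.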

> Okay so that's a toy example but it illustrates that Steven is aiming 
> for ultra-high accuracy where numpy is primarily aimed at speed. 

well, yes, for the most part, numpy does trade accuracy for speed when it has to -- but that's not the case here, I think this is a case of "no one took the time to write a better algorithm"

> He's also tried to ensure that it works properly with e.g. fractions:

That is pretty cool, yes.

> > What this is really an argument for is a numpy-lite in the standard library, which could be used to build these sorts of things on. But that's been rejected before...
> 
> If it's a numpy-lite then it's a numpy-ultra-lite. It really doesn't
> provide much of what numpy provides.

I wasn't clear -- my point was that things like this should be built on a numpy-like array object (numpy-lite) -- so first adding such an object to the stdlib, then building this off it would be nice. But a key problem with that is where do you draw the line that defines numpy-lite? I'd say just the core storage object, but then someone wants to add statistics, and someone else wants to add polynomials, and then random numbers, then ... and pretty soon you've got numpy again!

> > All that being said -- if you do decide to do this, please use a PEP 3118 (enhanced buffer) supporting data type (probably array.array) -- compatibility with numpy and other packages for crunching numbers is very nice.
> 
> > If someone decides to build a stand-alone stats package -- building it on a ndarray-lite (PEP 3118 compatible) object would be a nice way to go.
> 
> Why? Yes I'd also like an ndarray-lite or rather an ultra-lite 
> 1-dimensional version but why would it be useful for the statistics 
> module over using standard Python containers? Note that numpy arrays 
> do work with the reference implementation of the statistics module 
> (they're just treated as iterables):

One of the really great things about numpy is that when you work with a LOT of numbers (which is not rare in this era of Big Data) it stores them efficiently, and you can push them around between different arrays, and other libraries without unpacking and copying data. That's what PEP 3118 is all about.
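
A minimal standard-library sketch of the "no unpacking, no copying" idea: array.array exports a PEP 3118 buffer, and memoryview consumes it. The variable names are illustrative only; a numpy array could be wrapped around the same buffer in a similar way, but that is left out so the example needs nothing beyond the stdlib.

```python
import array

# A compact block of C doubles, not a list of Python float objects.
data = array.array("d", [1.0, 2.0, 3.0])

# memoryview exposes the array's buffer without copying it.
view = memoryview(data)
print(view.format, view.itemsize, view.nbytes)

# Writing through the view changes the original array: same memory.
view[0] = 10.0
print(data[0])
```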

It looks like there is some real care being put into these algorithms, so it would be nice if they could be efficiently used for large data sets and with numpy.

>     >>> import numpy
>     >>> import statistics 
>     >>> statistics.mean(numpy.array([1, 2, 3]))

you'll probably find that this is slower than a python list -- numpy has some overhead when used as a generic sequence.

> > One other point -- for performance reasons, it would be nice to have some compiled code in there -- this adds incentive to put it in the stdlib -- external packages that need compiling are what make numpy unacceptable to some folks.
>  
> It might be good to have a C accelerator one day but actually I think 
> the pure-Python-ness of it is a strong reason to have it since it 
> provides accurate statistics functions to all Python implementations 
> (unlike numpy) at no additional cost.

Well, I'd rather not have a package that is great for education and  toy problems, but not-so-good for the real ones...

I guess my point is this:

This is a way to make the standard python distribution better for some common computational tasks. But rather than think of it as "we need some stats functions in the python stdlib", perhaps we should be thinking: "out of the box python should be better for computation" -- in which case, I'd start with a decent array object.

-Chris

#52606

From	Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date	2013-08-16 20:41 +0100
Message-ID	<mailman.6.1376682090.23369.python-list@python.org>
In reply to	#52603

On 16 August 2013 20:00,  <chris.barker@noaa.gov> wrote:
> > > One other point -- for performance reasons, it would be nice to have some compiled code in there -- this adds incentive to put it in the stdlib -- external packages that need compiling are what make numpy unacceptable to some folks.
>>
>> It might be good to have a C accelerator one day but actually I think
>> the pure-Python-ness of it is a strong reason to have it since it
>> provides accurate statistics functions to all Python implementations
>> (unlike numpy) at no additional cost.
>
> Well, I'd rather not have a package that is great for education and  toy problems, but not-so-good for the real ones...

Again it depends what you mean by "real". From the other lists where
we meet I'd guess that your problems are in the "needs a nuclear
reactor" camp. I doubt that the stdlib will ever be sufficiently
mathematically/computationally oriented to fully service either of our
needs (and I don't mean that as a criticism). I persuaded the IT guys
at my work that we needed the whole Enthought Python Distribution on
all machines just because I didn't want to have to argue about
individual packages.

However in my real work, where I compute means and variances etc. I
very often do work with very small datasets and I know a lot of others
who work almost exclusively with them (think e.g. clinical data where
N is often less than 100).

> I guess my point is this:
>
> This is a way to make the standard python distribution better for some common computational tasks. But rather than think of it as "we need some stats functions in the python stdlib", perhaps we should be thinking: "out of the box python should be better for computation" -- in which case, I'd start with a decent array object.

I think that, whether or not the statistics module gains a C
accelerator, if a fast numerical array type comes along then I'd
expect that the statistics module would use its methods as a fast
path. And if it provides a speed boost without compromising
boundedness or accuracy I'm sure that the array type would be used
internally where appropriate (just as numpy converts collections to
arrays before computation).
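
As a rough illustration of what such a fast path might look like, here is a sketch under stated assumptions: the function mean below is hypothetical, array.array stands in for the "fast numerical array type", and math.fsum stands in for its accelerated method. It is not the statistics module's actual design.

```python
import array
import math
from fractions import Fraction


def mean(data):
    """Hypothetical mean() with a fast path for arrays of C floats."""
    if isinstance(data, array.array) and data.typecode in ("f", "d"):
        if len(data) == 0:
            raise ValueError("mean requires at least one data point")
        # Fast path: math.fsum loops in C and avoids loss of precision
        # from intermediate rounding.
        return math.fsum(data) / len(data)
    # General path: exact rational arithmetic, slower but type-agnostic.
    values = list(data)
    if not values:
        raise ValueError("mean requires at least one data point")
    return float(sum(Fraction(x) for x in values) / len(values))


print(mean(array.array("d", [1, 2, 3, 4])))
print(mean([Fraction(1, 2), Fraction(3, 2)]))
```

Both branches return the same value for the same float data, which is the "without compromising accuracy" condition above.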

Oscar

#52602

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-08-16 18:51 +0000
Message-ID	<520e74c5$0$30000$c3e8da3$5496439d@news.astraweb.com>
In reply to	#52594

On Fri, 16 Aug 2013 09:31:34 -0700, chris.barker wrote:

>> > I am seeking comments on PEP 450, Adding a statistics module to
>> > Python's
> 
> The trick here is that numpy really is the "right" way to do this stuff.

Numpy does not have a monopoly on the correct algorithms for statistics 
functions, and a big, heavyweight library like numpy is overkill for many 
lightweight statistics tasks. One shouldn't need to turn on a nuclear 
reactor just to put the light on in your fridge.

> I like to say:
> "crunching numbers in python without numpy is like doing text processing
> without using the string object"

Your analogy is backwards. String objects actually aren't optimal for 
heavy duty text processing, because they're immutable. If you're serious 
about crunching vast amounts of numbers, you'll use numpy. If you're 
serious about crunching vast amounts of text, say for a text editor or word
processor, you *won't* use strings, you'll use some sort of mutable 
buffer, or ropes, or some other data type. But very unlikely to use 
strings.
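
A small sketch of the immutability point, using bytearray as one readily available stand-in for "some sort of mutable buffer". A real editor would more likely use a specialised structure such as the ropes mentioned above, so treat this as an illustration only.

```python
text = "hello world"
try:
    text[0] = "H"            # str does not support item assignment
except TypeError:
    str_is_immutable = True
else:
    str_is_immutable = False

# A bytearray can be edited in place without building a new object.
buf = bytearray(b"hello world")
buf[0:5] = b"HELLO"          # overwrite a slice in place
buf.extend(b"!")             # grow the same buffer
result = buf.decode("ascii")

print(str_is_immutable, result)
```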

> What this is really an argument for is a numpy-lite in the standard
> library, which could be used to build these sorts of things on. But
> that's been rejected before...

"Numpy-lite". Which parts of numpy? Who maintains it? The numpy release 
schedule is nothing like the standard library's release schedule, so 
which one has to change? Or does somebody fork numpy, giving two 
independent code bases?

What about Jython, IronPython, and other Python implementations? Even 
PyPy doesn't support numpy yet, and Jython and IronPython probably never 
will, since they're not C-based.

> A few other comments:
> 
> 1) the numpy folks have been VERY good at providing binaries for Windows
> and OS-X -- easy point and click installing.
> 
> 2) I hope we're almost there with standardizing pip and binary wheels,
> at which point pip install will be painless.

Yeah, right, sure it will be. I've been waiting a decade for package 
management on Linux to become painless, and it still isn't. There's no 
reason to expect pip will be more painless than aptitude or yum.

But even if it is, installation of software is not just a software 
problem to be solved by better technology. There is also the social 
problem that not everyone is permitted to arbitrarily install software. 
I'm not just talking about security policies on the machine, but security 
policies in real life. People can be sacked for installing software they 
don't have permission to install.

Machines may be locked down, users may have to submit a request before 
software will be installed. That may involve a security audit, legal 
review of licencing, strategy for full roll-back, potentially even a 
complete code audit. (Imagine auditing all of numpy.) Or policy may 
simply say, *no software from unapproved vendors* full stop.

Not everyone is privileged to be permitted to install whatever software 
they like, when they like. Here are two organisations that make software 
installation requests *easy*:

http://www.uhd.edu/computing/acl/SoftwareInstallationRequest.html

http://www.calstatela.edu/its/services/software/instructsoftwarerequest.php/form2.php

Pip install isn't going to fix that.

There are many, many people in a situation where the Python std lib is 
approved, usually because it comes from a vendor with a support contract 
(say, RedHat, Ubuntu, or Suse), but getting third-party packages like 
numpy approved is next to impossible. "Just install numpy" is a solution 
for a privileged few.

> even before (2) -- pip install works fine anywhere the system is set up
> to build python extensions (granted, not a given on Windows and Mac, but
> pretty likely on Linux)

Oh, well that's okay then -- that's three, maybe four percent of the 
computing world taken care of! Problem solved!

Not.

> -- the idea that running pip install wrote out a
> lot of text (but worked!) is somehow a barrier to entry is absurd --
> anyone building their own stuff on Linux is used to that.

Do you realise that not all Python programmers are used to, or able to, 
"build their own stuff on Linux"?

[...]
> All that being said -- if you do decide to do this, please use a PEP
> 3118 (enhanced buffer) supporting data type (probably array.array) --
> compatibility with numpy and other packages for crunching numbers is
> very nice.

py> import array
py> data = array.array('f', range(1000))
py> import statistics
py> statistics.mean(data)
499.5
py> statistics.stdev(data)
288.8194360957494

If the data type supports the sequence protocol, it should work with my 
module. If it fails to work, submit a bug report, and I will fix it.
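
A minimal sketch of what "supports the sequence protocol" means in practice, assuming a Python where the statistics module is available in the standard library (3.4 and later). The Readings class is hypothetical and exists only to show that __len__ and __getitem__ are enough.

```python
import array
import statistics


class Readings:
    """Hypothetical container implementing only the sequence protocol."""

    def __init__(self, values):
        self._values = list(values)

    def __len__(self):
        return len(self._values)

    def __getitem__(self, index):
        # Raises IndexError past the end, which is what ends iteration.
        return self._values[index]


samples = [
    list(range(1000)),
    tuple(range(1000)),
    array.array("f", range(1000)),
    Readings(range(1000)),
]
means = [statistics.mean(sample) for sample in samples]
print(means)
```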

> If someone decides to build a stand-alone stats package -- building it
> on a ndarray-lite (PEP 3118 compatible) object would be a nice way to
> go.
> 
> 
> One other point -- for performance reasons, it would be nice to have some
> compiled code in there -- this adds incentive to put it in the stdlib --
> external packages that need compiling are what make numpy unacceptable
> to some folks.

Like the decimal module, it will probably remain pure-Python for a few 
releases, but I hope that in the future the statistics module will gain a 
C-accelerated version. (Or Java-accelerated for Jython, etc.) I expect 
that PyPy won't need one. But because it's not really aimed at number-
crunching megabytes of data, speed is not the priority.

-- 
Steven

#52607

From	chris.barker@noaa.gov
Date	2013-08-16 12:48 -0700
Message-ID	<ab0be0f5-7d93-4043-9002-965f99c5f234@googlegroups.com>
In reply to	#52602

On Friday, August 16, 2013 11:51:49 AM UTC-7, Steven D'Aprano wrote:
> > The trick here is that numpy really is the "right" way to do this stuff.

> Numpy does not have a monopoly on the correct algorithms for statistics  
> functions,

indeed not -- in fact, a number of them are quite lame, either because of chosen speed vs. accuracy trade offs, or just plain no-one-got-around-to-writing-the-code.

I kind of mis-spoke: what I meant was: "a numpy ndarray-similar object is the 'right' way to do this", not numpy itself.

> and a big, heavyweight library like numpy is overkill for many  
> lightweight statistics tasks. One shouldn't need to turn on a nuclear  
> reactor just to put the light on in your fridge.

sure -- but you are talking stdlib here -- where do we draw the line? a hard choice every time.

> > "crunching numbers in python without numpy is like doing text processing 
> > without using the string object"
> 
> Your analogy is backwards. String objects actually aren't optimal for  
> heavy duty text processing, because they're immutable. If you're serious 
> about crunching vast amounts of numbers, you'll use numpy. If you're 
> serious about crunching vast amounts of text, say for a text editor or word
> processor, you *won't* use strings, you'll use some sort of mutable  
> buffer, or ropes, or some other data type. But very unlikely to use  
> strings.

but you sure as heck won't use arbitrary python sequences of characters, which is what you are doing with this module.

> > What this is really an argument for is a numpy-lite in the standard 
> > library, which could be used to build these sorts of things on. But 
> > that's been rejected before...

> "Numpy-lite". Which parts of numpy? Who maintains it? The numpy release  
> schedule is nothing like the standard library's release schedule, so  
> which one has to change? Or does somebody fork numpy, giving two  
> independent code bases?

yup -- that's why it's been rejected before -- but we did get PEP 3118 as a compromise, so one could build an nd-array-lite that was PEP 3118 compatible, and avoid many of the problems above.

However, as much of a problem as it is to install a third-party compiled package, it's a hell of a lot less work than writing a bunch of new code, so it'll probably never get done.

I myself am trying to write my new stuff to take PEP 3118 buffers, so I can get full high-performing numpy support, but not require users to have numpy -- it is a bit tricky, but can be done. If/when you get to the C-accelerated version, I suggest you consider it.

> What about Jython, IronPython, and other Python implementations? Even  
> PyPy doesn't support numpy yet, and Jython and IronPython probably never 
> will, since they're not C-based.

There is a numpy for IronPython, though I don't think it got beyond the alpha stage. Your point is well taken -- but it's also a reason for an ndarray in the stdlib: then maybe other implementations would support it.

> Yeah, right, sure it will be. I've been waiting a decade for package  
> management on Linux to become painless, and it still isn't. There's no  
> reason to expect pip will be more painless than aptitude or yum.

Probably not, true -- but you needed to get Python from somewhere didn't you? You can't say it's easy to compile that on Windows!

> There is also the social  problem that not everyone is permitted to arbitrarily install software. 

I work for the Federal Government -- believe me, I know.

There's Google App Engine, and things like that too, to support your point....

> complete code audit. (Imagine auditing all of numpy.)

well, the more we add to Python's stdlib, the bigger an issue that will be for all Python users -- another reason to be cautious.

But in the end, I don't think there is a lot you can do with Python without installing some third-party package. How many people do all their code development in IDLE? All their GUIs with tk? No image processing, writing their own web framework from scratch? The list goes on and on. I may have a few simple text processing scripts that don't use any third party packages, but nothing major.

I teach Intro to Python, and while I could probably get away with only the stdlib for the intro class (but sure as heck not the web development class), I don't -- because there is a lot folks should know about to do anything real in Python.

So as much of a pain as it can be to use third-party packages, we can't put everything in the stdlib for that reason.

> There are many, many people in a situation where the Python std lib is  
> approved, usually because it comes from a vendor with a support contract 
> (say, RedHat, Ubuntu, or Suse), but getting third-party packages like  
> numpy approved is next to impossible. 

don't all three of those ship numpy? I haven't used them in ages.

> > to build python extensions (granted, not a given on Windows and Mac, but
> > pretty likely on Linux)
> 
> Oh, well that's okay then -- that's three, maybe four percent of the  
> computing world taken care of! Problem solved!

hence the binaries....

really -- the "I can't install an unapproved package" is a show-stopper. "I can't build it" isn't.

> > anyone building their own stuff on Linux is used to that.
> 
> Do you realise that not all Python programmers are used to, or able to, 
> 
> "build their own stuff on Linux"?

then why not "yum install numpy"? or whatever?

> > All that being said -- if you do decide to do this, please use a PEP
> > 3118 (enhanced buffer) supporting data type (probably array.array) --
> > compatibility with numpy and other packages for crunching numbers is
> > very nice.
> 
> py> import array
> py> data = array.array('f', range(1000))
> py> import statistics
> py> statistics.mean(data)
> 499.5

I realized this after posting -- that is a nice feature, and could help a lot -- hurray for the buffer protocol!  This makes room for compiled optimization down the road, and then you might be able to use your code with numpy arrays efficiently.

> If the data type supports the sequence protocol, it should work with my  
> module. If it fails to work, submit a bug report, and I will fix it.

fair enough.

> Like the decimal module, it will probably remain pure-Python for a few  
> releases, but I hope that in the future the statistics module will gain a  
> C-accelerated version. (Or Java-accelerated for Jython, etc.)

a perfectly reasonable development path.

> I expect
> that PyPy won't need one. But because it's not really aimed at number-
> crunching megabytes of data, speed is not the priority.

I thought one of the key points of PyPy was performance? But anyway, maybe RPython and the JIT will take care of that.

Anyway, this looks like a great project -- not so sure about putting it in the stdlib, and I do hope you'll keep the number crunchers in mind, but great stuff nonetheless.

-Chris

#52614

From	Roy Smith <roy@panix.com>
Date	2013-08-16 22:06 -0400
Message-ID	<roy-83BBEF.22062216082013@news.panix.com>
In reply to	#52593

In article <0d60fd90-eb19-4702-acd5-dd7ba0eddeda@googlegroups.com>,
 taldcroft@cfa.harvard.edu wrote:

>  Python is showing up in high-school and college intro programming 
>  courses here in the U.S. 

Yup.  For the past few years, I've been a judge in the NYC Science and 
Engineering Fair (http://collegenow.cuny.edu/sciencefair/).  By far, the 
most common language I see CS projects done in, is Python.

#52621

From	Josef Pktd <josef.pktd@gmail.com>
Date	2013-08-17 05:13 -0700
Message-ID	<734ccb26-8ca5-4c3e-be46-e1c9470c0a90@googlegroups.com>
In reply to	#52614

I think the install issues in the pep are exaggerated, and are in my opinion not a sufficient reason to get something into the standard lib.

google appengine includes numpy
https://developers.google.com/appengine/docs/python/tools/libraries27

I'm on Windows, and numpy and scipy are just binary installers that install without problems.
There are free binary distributions (for Windows and Ubuntu) that include all the main scientific applications. One-click installer on Windows
http://code.google.com/p/pythonxy/wiki/Welcome
http://code.google.com/p/winpython/

How many Linux distributions don't include numpy? (I have no idea.)

For commercial support Enthought's and Continuum's distributions include all the main packages.

I think having basic descriptive statistics is still useful in a basic python installation. Similarly, almost all the descriptive statistics moved from scipy.stats to numpy.

However, what is the longterm scope of this supposed to be?

I think working with pure python is interesting for educational purposes
http://www.greenteapress.com/thinkstats/
but I don't think it will get very far for more extensive uses. Soon you will need some linear algebra (numpy.linalg and scipy.linalg) and special functions (scipy.special).

You can reimplement them, but what's the point of duplicating them in the standard lib?

For example:

t test: which versions? one-sample, two-sample, paired and unpaired, with and without homogeneous variances, with 3 alternative hypotheses.

If we have t test, shouldn't we also have ANOVA when we want to compare more than two samples?

...

If the Python versions that are not using a C backend need a statistics package and partial numpy replacement, then I don't think it needs to be in the CPython lib.

I think the "nuclear reactor" analogy is misplaced.

A python implementation of statistics is a bicycle, numpy is a car, and if you need some heavier lifting in statistics or machine learning, then the trucks are scipy, scikit-learn and statsmodels (and pandas for the data handling).
And rpy for things that are not directly available in python.

I'm one of the maintainers for scipy.stats and for statsmodels.

We have a similar problem of deciding on the boundaries and scope of numpy, scipy.stats, pandas, patsy, statsmodels and scikit-learn. There is some overlap of functionality where the purpose or use cases are different, but in general we try to avoid too much duplication.

https://pypi.python.org/pypi/statsmodels
https://pypi.python.org/pypi/pandas
https://pypi.python.org/pypi/patsy (R like formulas)
https://pypi.python.org/pypi/scikit-learn

Josef

#52649

From	Jason Friedman <jsf80238@gmail.com>
Date	2013-08-17 21:57 -0600
Message-ID	<mailman.23.1376798234.23369.python-list@python.org>
In reply to	#52289

> NumPy and SciPy are not available for many Python users, including those
> using a Python implementation for which there is no Numpy support
> <URL:http://new.scipy.org/faq.html#python-version-support> and those for
> whom large, dependency-heavy third-party packages are too much burden.
>
> See the Rationale of PEP 450 for more reasons why “install NumPy” is not
> a feasible solution for many use cases, and why having ‘statistics’ as a
> pure-Python, standard-library package is desirable.

+1
