From: Yaşar Arabacı
To: Terry Reedy
Cc: python-list@python.org
Newsgroups: comp.lang.python
Subject: Re: Python package statistics
Date: Sat, 19 Oct 2013 00:07:35 +0300

Hi Terry,

Thanks for pointing it out. matplotlib's hist function wasn't broken after all :)

I published non-parametric statistics here:

http://ysar.net/python/python-package-statistics-additions.html

2013/10/18 Terry Reedy:
> On 10/18/2013 8:41 AM, Yaşar Arabacı wrote:
>>
>> Hi people,
>>
>> I collected some data on PyPI and published some statistics about
>> packages on PyPI. I think you might find it an interesting read:
>>
>> http://ysar.net/python/python-package-statistics.html
>
>
> "b2gpopulate (36MB)
> ...
> Total sizes on packages in PyPI amounted to 4.2 GB. Average package size is
> 161 KB and standard deviation is 1MB."
>
> For such highly skewed data, the mean and especially the standard deviation
> and confidence intervals are meaningless. They are 'parametric' statistics,
> which is to say, were designed for bell-shaped distributions. (I will not
> say 'normal' == Gaussian distributions because they are *not* normal for
> much raw data.)
>
> A better summary is obtained from either 'non-parametric' statistics
> (median, inter-quartile range) or from 'normalizing' the data (if possible).
> For the latter, try taking the square root or log of the sizes and plot the
> distribution. If either works, take the mean and sd of the transformed
> values. Then report those and also the transformed back mean and mean+-sd.
>
> --
> Terry Jan Reedy
>
>
> --
> https://mail.python.org/mailman/listinfo/python-list

-- 
http://ysar.net/
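
For anyone following the thread, here is a minimal sketch of the two summaries Terry describes: the median with inter-quartile range, and the mean and sd of log-transformed sizes reported back in the original units. It assumes Python 3.8+ (for statistics.quantiles) and strictly positive sizes. The function names and the sample values are made up for illustration; this is not the code or the data behind the pages linked above.

```python
import math
import statistics


def robust_summary(sizes):
    """Non-parametric summary: median and inter-quartile range."""
    # quantiles(n=4) returns the three cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(sizes, n=4)
    return {"median": statistics.median(sizes), "iqr": q3 - q1}


def log_summary(sizes):
    """Mean and sd on the log scale, plus back-transformed values.

    Assumes every size is > 0, since log(0) is undefined.
    """
    logs = [math.log(s) for s in sizes]
    mean = statistics.mean(logs)
    sd = statistics.stdev(logs)
    return {
        "log_mean": mean,
        "log_sd": sd,
        # exp(mean of logs) is the geometric mean of the sizes.
        "back_mean": math.exp(mean),
        # mean +- sd on the log scale, mapped back to original units.
        "back_low": math.exp(mean - sd),
        "back_high": math.exp(mean + sd),
    }


if __name__ == "__main__":
    # Hypothetical, heavily skewed package sizes in KB (not real PyPI data).
    sample_kb = [1, 10, 100, 1000, 10000]
    print(robust_summary(sample_kb))
    print(log_summary(sample_kb))
```

Note that the back-transformed interval is asymmetric around the back-transformed mean, which is expected for skewed data; whether the log (or square root) transform is appropriate still has to be checked by plotting the transformed distribution.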