Path: csiph.com!usenet.pasdenom.info!news.redatomik.org!newsfeed.xs4all.nl!newsfeed3a.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <3158703.Lr4HFMbMOd@PointedEars.de>
References: <87oaksowwg.fsf@Equus.decebal.nl> <1451048.pW9z17ilMA@PointedEars.de> <mailman.242.1433677915.13271.python-list@python.org> <3158703.Lr4HFMbMOd@PointedEars.de>
Date: Mon, 8 Jun 2015 02:25:21 +1000
Subject: Re: Testing random
From: Chris Angelico <rosuav@gmail.com>
Cc: "python-list@python.org" <python-list@python.org>
Content-Type: text/plain; charset=UTF-8
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.257.1433694323.13271.python-list@python.org>
Lines: 60
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:92267

On Mon, Jun 8, 2015 at 1:51 AM, Thomas 'PointedEars' Lahn
<PointedEars@web.de> wrote:
> Chris Angelico wrote:
>
>> On Sun, Jun 7, 2015 at 8:40 PM, Thomas 'PointedEars' Lahn
>> <PointedEars@web.de> wrote:
>>> Cecil Westerhof wrote:
>>>> I wrote a very simple function to test random:
>>>>     def test_random(length, multiplier = 10000):
>>>>         number_list = length * [0]
>>>>         for i in range(length * multiplier):
>>>>             number_list[random.randint(0, length - 1)] += 1
>>>>         minimum = min(number_list)
>>>>         maximum = max(number_list)
>>>>         return (minimum, maximum, minimum / maximum)
>>>
>>> As there is no guarantee that every number will occur randomly, using a
>>> dictionary at first should be more efficient than a list:
>>
>> Hmm, I'm not sure that's actually so. His code is aiming to get
>> 'multiplier' values in each box; for any serious multiplier (he starts
>> with 10 in the main code), you can be fairly confident that every
>> number will come up at least once.
>
> The wording shows a common misconception: that random distribution would
> mean that it is guaranteed or more probable that every element of the set
> will occur at least once.  It is another common misconception that
> increasing the number of trials would increase the probability of that
> happening.  But that is not so.

The greater the multiplier, the lower the chance that any element will
have no hits. With uniform distribution, a length of 10, and a
multiplier of 10, there are 100 attempts with a 90% chance each that
any given number will be avoided - which works out to 0.9**100 ==
2.6561398887587544e-05 probability that (say) there'll be no 4s in the
list. Invert that and raise to the 10th power, and you get a
probability of 0.9997344177567317 that there'll be at least one in
every bucket. Raise the multiplier to 100, and the probability of a
single whiff becomes 1.7478712517226947e-46; invert and raise to the
tenth power, and it becomes closer to certainty than IEEE double
precision can represent. Raise the length to 100 and the numbers get
lower again; with multiplier 10, probability 0.9956920878572284 of
having one in every bucket (that's a half a percent chance of a zero
anywhere), and at multiplier 100, still underflows to certainty.

But you'll notice that I wasn't actually talking about certainty here.
I was talking about confidence, at levels sufficient to make data-type
decisions on. Sure, there's no guarantee that every number will occur;
but if there's a 0.4% chance that any number will be omitted, I think
the list is going to work out more efficient.

You'll notice that some of the other posts have been concerned more
about correctness (for instance, using collections.Counter and then
making sure there's a zero slot for every element - otherwise empty
slots would be omitted), and then they _do_ acknowledge the chance
that something will underflow. But with the numbers the OP gave, I
would be fully satisfied with optimizing for the case where every
bucket gets at least something.

ChrisA