Groups > comp.lang.python > #92207 > unrolled thread

Testing random

Started by	Cecil Westerhof <Cecil@decebal.nl>
First post	2015-06-07 08:27 +0200
Last post	2015-06-07 11:04 -0400
Articles	20 on this page of 80 — 22 participants

Back to article view | Back to comp.lang.python

  Testing random Cecil Westerhof <Cecil@decebal.nl> - 2015-06-07 08:27 +0200
    Re: Testing random Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2015-06-07 12:40 +0200
      Re: Testing random Chris Angelico <rosuav@gmail.com> - 2015-06-07 21:51 +1000
        Re: Testing random Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2015-06-07 17:51 +0200
          Re: Testing random Chris Angelico <rosuav@gmail.com> - 2015-06-08 02:25 +1000
            Re: Testing random Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2015-06-07 18:36 +0200
              Re: Testing random Chris Angelico <rosuav@gmail.com> - 2015-06-08 02:44 +1000
                Re: Testing random Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2015-06-07 20:23 +0200
                  Re: Testing random Chris Angelico <rosuav@gmail.com> - 2015-06-08 04:52 +1000
                    Re: Testing random Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2015-06-07 21:41 +0200
                  Re: Testing random Jussi Piitulainen <jpiitula@ling.helsinki.fi> - 2015-06-07 22:08 +0300
                    Re: Testing random Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2015-06-07 21:29 +0200
                      Re: Testing random random832@fastmail.us - 2015-06-07 15:44 -0400
                        Re: Testing random Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2015-06-07 22:09 +0200
                          Re: Testing random random832@fastmail.us - 2015-06-07 16:41 -0400
                            Re: Testing random Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2015-06-07 22:59 +0200
                            Re: Testing random Steven D'Aprano <steve@pearwood.info> - 2015-06-08 11:26 +1000
                            Re: Testing random random832@fastmail.us - 2015-06-07 21:34 -0400
                            Re: Testing random Chris Angelico <rosuav@gmail.com> - 2015-06-08 11:42 +1000
                            Re: Testing random MRAB <python@mrabarnett.plus.com> - 2015-06-08 02:49 +0100
                            Re: Testing random random832@fastmail.us - 2015-06-07 21:57 -0400
                      Re: Testing random Jussi Piitulainen <jpiitula@ling.helsinki.fi> - 2015-06-08 10:40 +0300
                        Re: Testing random Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2015-06-10 19:03 +0200
                          Re: Testing random sohcahtoa82@gmail.com - 2015-06-10 10:52 -0700
                            Re: Testing random Jussi Piitulainen <jpiitula@ling.helsinki.fi> - 2015-06-10 23:00 +0300
                          Re: Testing random Ian Kelly <ian.g.kelly@gmail.com> - 2015-06-10 12:02 -0600
                            Re: Testing random Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2015-06-12 23:32 +0200
                              Re: Testing random alister <alister.nospam.ware@ntlworld.com> - 2015-06-12 21:46 +0000
                              Re: Testing random random832@fastmail.us - 2015-06-12 17:52 -0400
                              Re: Testing random Ian Kelly <ian.g.kelly@gmail.com> - 2015-06-12 16:00 -0600
                                Re: Testing random Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2015-06-13 00:09 +0200
                                  Re: Testing random sohcahtoa82@gmail.com - 2015-06-12 15:55 -0700
                                  Re: Testing random random832@fastmail.us - 2015-06-12 18:57 -0400
                              Re: Testing random Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-06-13 08:53 +0100
                          Re: Testing random random832@fastmail.us - 2015-06-10 14:26 -0400
                  Re: Testing random Ned Batchelder <ned@nedbatchelder.com> - 2015-06-07 14:21 -0700
                    Re: Testing random Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2015-06-16 21:18 +0200
                      Re: Testing random random832@fastmail.us - 2015-06-16 16:23 -0400
                      Re: Testing random Ned Batchelder <ned@nedbatchelder.com> - 2015-06-16 13:48 -0700
                        Re: Testing random Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2015-06-16 23:57 +0200
                          Re: Testing random sohcahtoa82@gmail.com - 2015-06-16 15:30 -0700
                            Re: Testing random Ian Kelly <ian.g.kelly@gmail.com> - 2015-06-16 16:58 -0600
                            Re: Testing random Laura Creighton <lac@openend.se> - 2015-06-17 11:28 +0200
                          Re: Testing random Ned Batchelder <ned@nedbatchelder.com> - 2015-06-16 16:26 -0700
                            Re: Testing random Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2015-06-17 01:45 +0200
                              Re: Testing random sohcahtoa82@gmail.com - 2015-06-16 17:36 -0700
                                Re: Testing random Chris Angelico <rosuav@gmail.com> - 2015-06-17 11:01 +1000
                                Re: Testing random Ethan Furman <ethan@stoneleaf.us> - 2015-06-16 18:32 -0700
                                Re: Testing random Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-06-17 09:41 +0100
                                  Re: Testing random Grant Edwards <invalid@invalid.invalid> - 2015-06-17 14:04 +0000
                                    Re: Testing random Ian Kelly <ian.g.kelly@gmail.com> - 2015-06-17 09:01 -0600
                              Re: Testing random MRAB <python@mrabarnett.plus.com> - 2015-06-17 01:42 +0100
                                Re: Testing random Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2015-06-17 08:53 +0200
                                  Re: Testing random Christian Gollwitzer <auriocus@gmx.de> - 2015-06-17 09:22 +0200
                                    Re: Testing random Chris Angelico <rosuav@gmail.com> - 2015-06-17 17:28 +1000
                                    Re: Testing random Tim Golden <mail@timgolden.me.uk> - 2015-06-17 08:30 +0100
                                      Re: Testing random Cecil Westerhof <Cecil@decebal.nl> - 2015-06-17 11:57 +0200
                              Re: Testing random Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-06-17 01:36 +0000
                              Re: Testing random Laura Creighton <lac@openend.se> - 2015-06-17 12:33 +0200
                                Re: Testing random Steven D'Aprano <steve@pearwood.info> - 2015-06-17 22:47 +1000
                                  Re: Testing random Laura Creighton <lac@openend.se> - 2015-06-17 15:50 +0200
                        Re: Testing random Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-06-17 01:35 +0000
                    Re: Testing random Jussi Piitulainen <jpiitula@ling.helsinki.fi> - 2015-06-17 07:41 +0300
                  Re: Testing random Steven D'Aprano <steve@pearwood.info> - 2015-06-08 11:11 +1000
              Re: Testing random Ian Kelly <ian.g.kelly@gmail.com> - 2015-06-07 11:07 -0600
              Re: Testing random Chris Angelico <rosuav@gmail.com> - 2015-06-08 03:20 +1000
              Re: Testing random "C.D. Reimer" <chris@cdreimer.com> - 2015-06-07 10:36 -0700
              Re: Testing random Steven D'Aprano <steve@pearwood.info> - 2015-06-08 04:28 +1000
                Re: Testing random Chris Angelico <rosuav@gmail.com> - 2015-06-08 04:40 +1000
          Re: Testing random Steven D'Aprano <steve@pearwood.info> - 2015-06-08 04:24 +1000
    Re: Testing random Jonas Wielicki <jonas@wielicki.name> - 2015-06-07 12:41 +0200
    Re: Testing random Steven D'Aprano <steve@pearwood.info> - 2015-06-07 22:52 +1000
      Re: Testing random Steven D'Aprano <steve@pearwood.info> - 2015-06-07 23:06 +1000
      Re: Testing random Peter Otten <__peter__@web.de> - 2015-06-07 15:35 +0200
        Re: Testing random Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2015-06-07 18:36 +0200
          Re: Testing random Peter Otten <__peter__@web.de> - 2015-06-07 18:48 +0200
            Re: Testing random Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2015-06-07 22:15 +0200
        Re: Testing random Steven D'Aprano <steve@pearwood.info> - 2015-06-08 11:35 +1000
    Re: Testing random Christian Gollwitzer <auriocus@gmx.de> - 2015-06-07 14:53 +0200
    Re: Testing random Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2015-06-07 11:04 -0400

Page 4 of 4 — ← Prev page 1 2 3 [4]

#92744

From	Laura Creighton <lac@openend.se>
Date	2015-06-17 15:50 +0200
Message-ID	<mailman.548.1434549061.13271.python-list@python.org>
In reply to	#92738

In a message of Wed, 17 Jun 2015 22:47:37 +1000, "Steven D'Aprano" writes:
>There are magicians who are capable of forcing coins to land the required
>way up, and somebody once built a machine capable of tossing a coin with
>the precise equal force and velocity every single time. Dice are rarely
>unbiased, and neither are roulette wheels. Nevertheless, we ignore those
>factual biases for the sake of simplicity.

Simlplicity is fine.  But if you use this for real world decisions,
such as 'Let's try one more time, I really want a daughter!'
because you didn't know that the simplicity is there, or rather
that what is true in populations isn't true because it is also
true for each individual in a population .. well, that can be tragic.

>Correct. And more Y sperm die than X sperm. Together, these two factors lead
>to a small but real bias towards girl children, at least on countries that
>don't practice wide-spread sex-specific abortion.

It is not clear, last I checked, that 'more Y sperm die' matters, because
most of the time, for reasons we still don't know, the ratio of male
and female sperm produced is not 50/50 either.  So it may matter.  But
maybe not.

>> But when you look at breeding age men and women you will find 
>> that the ratio is a lot closer to 50/50 than the birth age ratio.
>> Human beings, as a population currently produce more males than
>> females.
>
>Surely that depends on where (and when) you are?
>
>I know that, today, both India and China both abort far more female fetuses
>than male ones, leading to a large excess of men. But in countries that
>don't practice selective abortions, my understanding is that there is a
>small excess of women at virtually all ages, especially among the elderly.
>Men tend to die earlier than women in every age bracket.

No.  Before breeding age there is an excess of males.

I am excluding the notion of abortion here.  (Another reason why humans
make lousy experimental subjects for this sort of study.)  What I am
saying is that a whole lot more boy babies get born than female ones.
If infant mortality was sex-neutral, we would expect that, when we
took a census of 16-year-olds we would find that there continued to
be an excess of boys at that level, too.  Instead, we have a small
excess of females, enough to believe that:

       For populations of human beings, when they hit breeding age, it
       is desirable to have a sex ratio near 50/50 but with a small
       excess of females.

'Desirable' here just means that this is the answer that pops out when
you study real populations (or model them with computer simulations).

And to get this Desirable outcome, you have to produce many more male
babies than female ones.

Laura

[toc] | [prev] | [next] | [standalone]

#92718

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2015-06-17 01:35 +0000
Message-ID	<5580ceed$0$1667$c3e8da3$5496439d@news.astraweb.com>
In reply to	#92699

On Tue, 16 Jun 2015 13:48:04 -0700, Ned Batchelder wrote:

> I apologize, I'm sure I've been using the mathematical terms
> imprecisely. We are all intelligent people, so I still believe we
> disagree because we are talking about different things.

Neil, I believe that your actual mistake is assuming that Thomas is 
arguing in good faith. I see no evidence that he is, especially given the 
content of his latest posts.

Multiple people have repeatedly explained the difference between his 
argument and what everyone else is talking about. Others, including me, 
have demonstrated empirically that he is mistaken, using both simulated 
tests and direct calculation of the probabilities.

At this point, his insistence that we are making the gambler's fallacy is 
clearly not a mere misunderstanding due to confusion. It is wilful and 
deliberate misrepresentation of what we are saying.

Thomas is correct for a completely different question. Rather than 
acknowledge that he has misunderstood the question, at every point he 
doubles down harder and insists that he is right and we are wrong. We 
have given *absolutely no reason* to think we have fallen for the 
Gambler's Fallacy. Throughout this thread, Thomas has repeatedly picked 
on trivial and unimportant errors in terminology as an excuse for 
dismissing what others have said, while ignoring the substance of their 
argument. When people have given mathematically indisputable and correct 
arguments, he has ignored them, or misrepresented them.

-- 
Steven D'Aprano

[toc] | [prev] | [next] | [standalone]

#92724

From	Jussi Piitulainen <jpiitula@ling.helsinki.fi>
Date	2015-06-17 07:41 +0300
Message-ID	<lf5k2v3ymj3.fsf@ling.helsinki.fi>
In reply to	#92307

Ned Batchelder writes:

> Thomas: let's say I generate streams of N digits drawn randomly from
> 0-9.  I then consider the probability of a zero *never appearing once*
> in my stream.  Let's call that P(N).  Do you agree that as N
> increases, P(N) decreases?

In probability theory, that could be phrased as the probability that N
unknown digits d_1, ..., d_N are all positive, assuming the digits are
independent (so learning one digit doesn't reveal anything about any
other digit), and for each d_k, the probability p_k of having a positive
digit is the same. Mathematicians often abbreviate these assumptions as
"i.i.d" for "independent" and "identically distributed".

Also assuming uniform distributions, p_1 = p_2 = ... = p_N = 9/10.

P(d_k > 0 for k = 1, ..., N) =
P(d_1 > 0 and d_2 > 0 and ... and d_N > 0) = (by independence)
P(d_1 > 0) * P(d_2 > 0) * ... * P(d_N > 0) =
p_1 * p_2 * ... * p_N = (by identical uniform distribution)
(9/10)^N

In mathematics, (9/10)^N decreases as N increases, so one should indeed
agree. Using more impressive notation and terminology correctly will not
change the analysis.

[toc] | [prev] | [next] | [standalone]

#92317

From	Steven D'Aprano <steve@pearwood.info>
Date	2015-06-08 11:11 +1000
Message-ID	<5574ebb9$0$13009$c3e8da3$5496439d@news.astraweb.com>
In reply to	#92283

On Mon, 8 Jun 2015 04:23 am, Thomas 'PointedEars' Lahn wrote:

> Chris Angelico wrote:
> 
>> On Mon, Jun 8, 2015 at 2:36 AM, Thomas 'PointedEars' Lahn
>> <PointedEars@web.de> wrote:
>>>> The greater the multiplier, the lower the chance that any element will
>>>> have no hits.
>>> Wrong.
>>>
>>>> [ex falso quodlibet]
>> 
>> Huh. Do you want to explain how, mathematically, I am wrong, or do you
>> want to join the RUE in my ignore list?
> 
> I already did; you have overlooked it.  In a nutshell, the probability of
> 
>   1 1 1 1 1 1 1 1 1 1 1
> 
> is *the same* as that of
> 
>   1 2 3 4 5 6 7 8 9 1 2
> 
> and the same as that of
> 
>   8 3 6 3 1 2 6 8 2 1 6.

True, but irrelevant. What is irrelevant is that the probability of:

    1 1 1 1 1 1 1 1 1 1 1

is MUCH LESS than the probability of:

    1 2 3 4 5 6 7 8 9 1 2  or
    8 3 6 3 1 2 6 8 2 1 6  or
    9 3 1 5 5 4 7 1 6 8 1  or
    2 3 3 1 8 9 7 2 5 1 6  or
    7 8 9 4 2 7 7 1 3 8 2  or
    ...
    [enumerate millions of other possibilities]
    ...
    2 1 5 5 8 1 8 7 6 5 1 or
    3 4 9 5 2 8 4 2 3 2 6.

> So, I am sorry to tell you this, but you do _not_ understand probability.

It's certainly true that you don't.

> And you *cannot* understand it intuitively, like you tried to. 

Rubbish. Chris has got it right. Whether he did so "intuitively" or whether
he learned this after years of study is irrelevant, and you have no
evidence for which it is. You are jumping to wrong conclusions -- you are
so sure that you're so much smarter than everyone else that you've ASSUMED
that Chris must be wrong and *completely failed to pay attention* to what
he has actually said, responding to things he hasn't said.

> It is why humans play the lottery using their “lucky numbers” whereas
> other numbers have exactly the same probability of being drawn; 
[snip more irrelevant examples]

None of these things are relevant in the slightest.

> An extension of that misconception is emphasized by an anecdote (which may
> be apocryphal) told about Richard Feynman (I heard it from Lawrence Krauss
> in “A Universe from Nothing”; he can tell that in a much more funny way
> than I am able to reproduce it here [1]):
> 
>   Richard Feynman used to go up to people all the time and he’d say:
> 
>     “You won’t believe what happend to me today!
>      You won’t believe what happend to me today!”
> 
>   And people would say:
> 
>     “What?”
> 
>   And he would reply:
> 
>     “Absolutely nothing!”

And I wouldn't believe it either. *Something happened*. He got up. He put
his shoes on. He washed his face. These are all events, but they are not
memorable events. They are equivalent to a sample like:

    6 8 1 5 9 2 7 6 3 3 2

There are millions of ways to get a non-memorable sample like that, but only
nine ways to get a memorable sample like:

    2 2 2 2 2 2 2 2 2 2 2

Your mistake is that you are comparing that one memorable sample to any
*one* of the non-memorable ones, when you should be comparing the
probability of *any* memorable sample against *any* non-memorable ones.

Actually, your real mistake is hubris. *Nothing* in Chris' comments in this
thread should lead you to the conclusion he doesn't know what he is talking
about, but you're so full of the preconceived notion that he must be wrong
because he is not Thomas Lahn that you've actually made an astonishing
blunder and compounded it with laughable arrogance.

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#92274

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2015-06-07 11:07 -0600
Message-ID	<mailman.261.1433697238.13271.python-list@python.org>
In reply to	#92271

On Sun, Jun 7, 2015 at 10:44 AM, Chris Angelico <rosuav@gmail.com> wrote:
> On Mon, Jun 8, 2015 at 2:36 AM, Thomas 'PointedEars' Lahn
> <PointedEars@web.de> wrote:
>>> The greater the multiplier, the lower the chance that any element will
>>> have no hits.
>>
>> Wrong.
>>
>>> [ex falso quodlibet]
>
> Huh. Do you want to explain how, mathematically, I am wrong, or do you
> want to join the RUE in my ignore list?

My best speculation is that he's either objecting to the generality of
your statement (it's false if the probability of some element
occurring is zero or eventually degrades to zero), or misreading the
word "multiplier" to the conclusion that the value of each element is
being multiplied rather than the number of trials. Or trolling; I
suppose that's always an option too.

[toc] | [prev] | [next] | [standalone]

#92276

From	Chris Angelico <rosuav@gmail.com>
Date	2015-06-08 03:20 +1000
Message-ID	<mailman.263.1433697617.13271.python-list@python.org>
In reply to	#92271

On Mon, Jun 8, 2015 at 3:07 AM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> On Sun, Jun 7, 2015 at 10:44 AM, Chris Angelico <rosuav@gmail.com> wrote:
>> On Mon, Jun 8, 2015 at 2:36 AM, Thomas 'PointedEars' Lahn
>> <PointedEars@web.de> wrote:
>>>> The greater the multiplier, the lower the chance that any element will
>>>> have no hits.
>>>
>>> Wrong.
>>>
>>>> [ex falso quodlibet]
>>
>> Huh. Do you want to explain how, mathematically, I am wrong, or do you
>> want to join the RUE in my ignore list?
>
> My best speculation is that he's either objecting to the generality of
> your statement (it's false if the probability of some element
> occurring is zero or eventually degrades to zero), or misreading the
> word "multiplier" to the conclusion that the value of each element is
> being multiplied rather than the number of trials. Or trolling; I
> suppose that's always an option too.

My first thought was your first option, which is why I specifically
responded with explanation about how data type selection can viably be
about expectations rather than certainties, but he responded with a
one-word "Wrong" so I must have been answering the wrong question.
I've no idea about the misreading of "multiplier". A fourth
possibility is that mathematics works differently for him and for us,
which I suppose is possible; when I visited sci.math a while ago, I
found some people for whom everything I'd learned in grade school was
clearly wrong, and they were doing their best to enlighten the world
about the new truths of mathematics that they'd found.

ChrisA

[toc] | [prev] | [next] | [standalone]

#92279

From	"C.D. Reimer" <chris@cdreimer.com>
Date	2015-06-07 10:36 -0700
Message-ID	<mailman.266.1433698601.13271.python-list@python.org>
In reply to	#92271

On 6/7/2015 10:20 AM, Chris Angelico wrote:
> A fourth
> possibility is that mathematics works differently for him and for us,
> which I suppose is possible; when I visited sci.math a while ago, I
> found some people for whom everything I'd learned in grade school was
> clearly wrong, and they were doing their best to enlighten the world
> about the new truths of mathematics that they'd found.

I had the unfortunate luck of taking "Harvard Calculus" in college 
(circa 1995). The textbook was nothing but word problems from end to 
end. The goal was to get the students away from symbolic thinking of 
traditional calculus into thinking about real world problems that uses 
calculus. (Never mind that the majority of students don't use calculus 
after they get out school.) I bailed out after two weeks. Taking the 
class at 7:30AM probably didn't help.

Chris R.

[toc] | [prev] | [next] | [standalone]

#92284

From	Steven D'Aprano <steve@pearwood.info>
Date	2015-06-08 04:28 +1000
Message-ID	<55748d5f$0$13008$c3e8da3$5496439d@news.astraweb.com>
In reply to	#92271

On Mon, 8 Jun 2015 02:36 am, Thomas 'PointedEars' Lahn wrote:

> Chris Angelico wrote:

>> The greater the multiplier, the lower the chance that any element will
>> have no hits.
> 
> Wrong.

No, Chris is correct. The "multiplier" increases the number of samples. The
larger the number of samples, the lower the chance that any element is not
represented.

If my previous post didn't convince you, consider an even simpler random
distribution: tossing a fair coin. The probability of getting a head is
exactly 1/2 whether you toss the coin once or a thousand times. But if you
toss the coin once, your chances of getting "no heads" is 1/2. If you toss
it a thousand times, your chances of getting "no heads" is 1/2**1000.

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#92286

From	Chris Angelico <rosuav@gmail.com>
Date	2015-06-08 04:40 +1000
Message-ID	<mailman.269.1433702438.13271.python-list@python.org>
In reply to	#92284

On Mon, Jun 8, 2015 at 4:28 AM, Steven D'Aprano <steve@pearwood.info> wrote:
> If my previous post didn't convince you, consider an even simpler random
> distribution: tossing a fair coin. The probability of getting a head is
> exactly 1/2 whether you toss the coin once or a thousand times. But if you
> toss the coin once, your chances of getting "no heads" is 1/2. If you toss
> it a thousand times, your chances of getting "no heads" is 1/2**1000.

Although... purity and reality do sometimes disagree. The concept of
"independent events" says that if I toss a coin 99 times and it comes
up heads every time, the probability that it'll come up heads again
the hundredth time is still 50-50. But the reality of cheats coins
says that the chances of the coin being a fair one and having been
tossed 99 times and come up heads every time is one in 1<<99, and
given that the mass of this planet is roughly 6E27 grams, that would
mean that you'd need about a thousand earths' worth of coins to get
that many... so the chances are much better that you're looking at
either an imbalanced coin or a crafty operator, and the probability is
much higher that it comes up heads :)

ChrisA

[toc] | [prev] | [next] | [standalone]

#92282

From	Steven D'Aprano <steve@pearwood.info>
Date	2015-06-08 04:24 +1000
Message-ID	<55748c4f$0$13012$c3e8da3$5496439d@news.astraweb.com>
In reply to	#92263

On Mon, 8 Jun 2015 01:51 am, Thomas 'PointedEars' Lahn wrote:

> Chris Angelico wrote:
> 
>> On Sun, Jun 7, 2015 at 8:40 PM, Thomas 'PointedEars' Lahn
>> <PointedEars@web.de> wrote:
>>> Cecil Westerhof wrote:
>>>> I wrote a very simple function to test random:
>>>>     def test_random(length, multiplier = 10000):
>>>>         number_list = length * [0]
>>>>         for i in range(length * multiplier):
>>>>             number_list[random.randint(0, length - 1)] += 1
>>>>         minimum = min(number_list)
>>>>         maximum = max(number_list)
>>>>         return (minimum, maximum, minimum / maximum)
>>>
>>> As there is no guarantee that every number will occur randomly, using a
>>> dictionary at first should be more efficient than a list:
>> 
>> Hmm, I'm not sure that's actually so. His code is aiming to get
>> 'multiplier' values in each box; for any serious multiplier (he starts
>> with 10 in the main code), you can be fairly confident that every
>> number will come up at least once.
> 
> The wording shows a common misconception: that random distribution would
> mean that it is guaranteed or more probable that every element of the set
> will occur at least once.  It is another common misconception that
> increasing the number of trials would increase the probability of that
> happening.  But that is not so.

You are badly mistaken.

Of course you are correct that a random distribution does not *guarantee*
that every element of the set is seen at least once. But as soon as you
call it "a common misconception" that increasing the number of trials
increases the probability of seeing every element, you are simply mistaken.
It's not a misconception at all, it is true. See below for calculations.

[...]
> If the distribution is even and the game is fair, every outcome *always*
> has
> the same probability of occurring.  As a result, every sequence of
> outcomes of equal length *always* has the same probability of occurring,
> and the probability for a particular sequence of equal length does _not_
> increase or
> decrease based on the number of occurrences of previous outcomes.  Those
> are *independent* events.

You are confused. Think about it more carefully.

To make it simple, let's randomly generate from the set 1, 2, 3, and do just
three samples. That's small enough to enumerate all the possibilities:

111 112 113 121 122 123 131 132 133
211 212 213 221 222 223 231 232 233
311 312 313 321 322 323 331 332 333

Each event (111 etc) has a probability of 1/27, but that's not important.
We're comparing the probability of all elements being seen, versus at least
one element being missed. I'm going to tag the events that contain all
three elements:

111 112 113 121 122 123* 131 132* 133
211 212 213* 221 222 223 231* 232 233
311 312* 313 321* 322 323 331 332 333

So there are six events out of 27, or 2/9, where all elements are seen in a
trial, versus 7/9 where at least one element will have a count of zero.

Putting it another way:

Pr(at least one element has zero count)
= Pr(1 has zero count) + Pr(2 has zero count) + Pr(3 has zero count)
  - Pr(1 and 2 both have zero counts) 
  - Pr(1 and 3 both have zero counts) 
  - Pr(2 and 3 both have zero counts)

= (2/3)**3 * 3 - (1/3)**3 * 3

which gives us 7/9 as confirmed by direct enumeration of all the
possibilities.

With four samples per trial:
Pr(at least one element has zero count) = 5/9

With five samples:
Pr(at least one element has zero count) = 31/81

Ten samples:
Pr(at least one element has zero count) = 341/6561 
or approximately 0.05

Twenty samples:
Pr(at least one element has zero count) = 349525/387420489
or approximately 0.0009

For 100 samples:
Pr(at least one element has zero count) = 7.4e-18 (approx)

which means that with 100 samples from the set (1, 2, 3), if you could run
one million trials every second, it would on average take you almost 4300
years to see a trial that had a zero count for one or more of the elements.

The probability of any element *not* showing up in a random trial with N
samples decreases rapidly as N increases. Elements with zero count are only
a factor when the number of samples is small, or the number of elements is
large.

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#92227

From	Jonas Wielicki <jonas@wielicki.name>
Date	2015-06-07 12:41 +0200
Message-ID	<mailman.237.1433674295.13271.python-list@python.org>
In reply to	#92207

[Multipart message — attachments visible in raw view] — view raw

On 07.06.2015 08:27, Cecil Westerhof wrote:
> I wrote a very simple function to test random:
>     def test_random(length, multiplier = 10000):
>         number_list = length * [0]
>         for i in range(length * multiplier):
>             number_list[random.randint(0, length - 1)] += 1
>         minimum = min(number_list)
>         maximum = max(number_list)
>         return (minimum, maximum, minimum / maximum)
> 
> When running:
>     for i in range(1, 7):
>         print(test_random(100, 10 ** i))
> I get:
>     (3, 17, 0.17647058823529413)
>     (73, 127, 0.5748031496062992)
>     (915, 1086, 0.8425414364640884)
>     (9723, 10195, 0.9537027954879843)
>     (99348, 100620, 0.9873583780560524)
>     (997198, 1002496, 0.9947151908835546)
> 
> and when running:
>     for i in range(1, 7):
>         print(test_random(1000, 10 ** i))
> I get:
>     (2, 20, 0.1)
>     (69, 138, 0.5)
>     (908, 1098, 0.8269581056466302)
>     (9684, 10268, 0.9431242695753799)
>     (99046, 100979, 0.9808574059953059)
>     (996923, 1003598, 0.9933489305478886)
> 
> It shows that when the first parameter increases the deviation
> increases and when the second parameter increases the deviation
> decreases. Exactly what you would expect. But what are the ranges you
> would expect with a good random function. 

Really depends on the number of samples. Appearantly, a good RNG would
come close to 1.0 for the ratio.

> Then it could be used to test a random function.

This is an interesting test (more interesting to me than it looks at the
first sight, and certainly better than what I had come up with), but
unfortunately, there is more to testing randomness.

The test clearly suggests that random functions should have a min/max
ratio of about 1.0. Think of a "random" function like this:

    def fake_random(minv, maxv, _cache={}):
        try:
            return next(_cache[minv, maxv])
        except (StopIteration, KeyError):
            iterator = iter(range(minv, maxv+1))
            _cache[minv, maxv] = iterator
            return next(iterator)

(if that is hard to read, I agree; it returns the sequence from minv to
maxv (inclusively) over and over again for equal minv and maxv. don’t
pass anything to _cache :), just call it like random.randint)

This gives a "perfect" ratio of 1.0, but is not very random. This kind
of function would probably be ruled out by a compression or correlation
test. If you want to dig deeper into the algorithms for random testing,
the dieharder test suite [1] is probably worth a look.

It all boils down to the use case. For some use cases, the
``fake_random`` might be good enough (unittesting would be such a case:
it is certainly uniformly distributed and allows full coverage of the
tested range), for others it would fail catastrophically (don’t generate
your cryptographic keys with this!).

cheers,
Jonas

   [1]: http://www.phy.duke.edu/~rgb/General/dieharder.php

[toc] | [prev] | [next] | [standalone]

#92242

From	Steven D'Aprano <steve@pearwood.info>
Date	2015-06-07 22:52 +1000
Message-ID	<55743ea4$0$13004$c3e8da3$5496439d@news.astraweb.com>
In reply to	#92207

On Sun, 7 Jun 2015 04:27 pm, Cecil Westerhof wrote:

> I wrote a very simple function to test random:
>     def test_random(length, multiplier = 10000):
>         number_list = length * [0]
>         for i in range(length * multiplier):
>             number_list[random.randint(0, length - 1)] += 1
>         minimum = min(number_list)
>         maximum = max(number_list)
>         return (minimum, maximum, minimum / maximum)

Putting aside the timing aspects, your frequency calculations are not done
in a very Pythonic manner. A better way might be:

from collections import Counter
from random import randint

def test_random(length, multiplier = 10000):
    freq = Counter(
        randint(0, length - 1) for i in range(length * multiplier)
        )
    minimum = min(freq[i] for i in range(length))
    maximum = max(freq.values())
    return (minimum, maximum, minimum / maximum)

Note carefully the way I calculate the minimum value. That way, if by chance
some number has a frequency of zero, the minimum returned will be zero. For
the maximum, that doesn't matter, since any missing numbers will not be the
most frequent number.

Better than returning the min and max, you should return the counter object
itself. Before doing so, I make sure that any missing numbers are given a
value of zero.

def test_random(size, samples=1000000):
    """Return a sample of `samples` random numbers from 0 to size-1."""
    freq = Counter(randint(0, size-1) for i in range(samples))
    # Avoid missing numbers. (Likely for small samples.)
    freq.subtract(dict.fromkeys(range(size), 0))
    return freq

Now once you have the frequencies, you can look at many interesting
statistics, and verify that you have a valid sample.

py> f = test_random(100)
py> sum(f.values()) == 1000000  # Total number of samples is right?
True
py> sorted(f) == list(range(100))  # Any missing numbers?
True
py> max(f) - min(f)  # statistical range
99
py> min(f.values())  # Lowest frequency?
9775
py> max(f.values())  # And the highest?
10313

The ratio of those two frequencies isn't very interesting, but we can
calculate it:

py> min(f.values())/max(f.values())
0.9478328323475226

For a uniformly distributed population, that ratio will be 1.0 exactly, but
for any finite sample, it could be any number between 0 and 1. We expect
that if we have a large random sample, the ratio should approach 1, but
moderate deviations from that are expected. In fact, we should be surprised
if the ratio is exactly 1, and suspect shenanigans if it is.

We can extract the mode, and its frequency:

py> f.most_common(1)  # mode
[(85, 10313)]

In Python 3.4 or better, we can look at some more statistics:

py> import statistics
py> statistics.median(f.elements())
50.0
py> statistics.mean(f.elements())
49.531949
py> statistics.stdev(f.elements())
28.873178122988367

And verify that the mode is correct:

py> statistics.mode(f.elements())
85

We can see what happens if the data is fiddled with:

py> g = f.copy()
py> g[42] += g[23]
py> g[23] = 0
py> statistics.median(g.elements())
50.0
py> statistics.mean(g.elements())
49.7263
py> statistics.stdev(g.elements())
28.757647388343212
py> g.most_common(3)
[(42, 20258), (85, 10313), (56, 10208)]

The median is unchanged, the mean shifts slightly higher, and the standard
deviation increases. But as you can see, these are not particularly
powerful tests of randomness: only the mode shows an obvious deviation from
uniformity.

> But what are the ranges you
> would expect with a good random function. Then it could be used to
> test a random function.

Python's pseudo-random number generator is based on the Mersenne Twister.
This is one of the most high-quality and extensively tested PRNGs
available, with a period of 2**19937-1 before it repeats. Its randomness
properties are very good indeed. (Although of course that isn't to say that
there aren't bugs in the Python implementation.)

Mersenne Twister is is not suitable for cryptographic work, but apart from
that, it is pretty much as indistinguishable from random as any
deterministic computer program can be.

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#92247

From	Steven D'Aprano <steve@pearwood.info>
Date	2015-06-07 23:06 +1000
Message-ID	<557441ed$0$13009$c3e8da3$5496439d@news.astraweb.com>
In reply to	#92242

On Sun, 7 Jun 2015 10:52 pm, Steven D'Aprano wrote:

> The median is unchanged, the mean shifts slightly higher, and the standard
> deviation increases. But as you can see, these are not particularly
> powerful tests of randomness: only the mode shows an obvious deviation
> from uniformity.

Oh, I forgot: we can look at the frequencies themselves as well. If our
sample is absolutely perfectly distributed uniformly, then every element
will have a frequency of exactly 10000. We don't necessarily expect this
from a random sample, but if we're too far from that, we should be
concerned. Here are the results from the fair sample:

py> statistics.mean(f.values())
10000.0
py> statistics.median(f.values())
9995.0
py> statistics.stdev(f.values())
91.8027089673141

If I do the same thing with the biased sample, the standard deviation stands
out like a sore thumb:

py> statistics.mean(g.values())
10000.0
py> statistics.median(g.values())
9993.0
py> statistics.stdev(g.values())
1442.527341617181

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#92252

From	Peter Otten <__peter__@web.de>
Date	2015-06-07 15:35 +0200
Message-ID	<mailman.247.1433684166.13271.python-list@python.org>
In reply to	#92242

Steven D'Aprano wrote:

>> I wrote a very simple function to test random:
>>     def test_random(length, multiplier = 10000):
>>         number_list = length * [0]
>>         for i in range(length * multiplier):
>>             number_list[random.randint(0, length - 1)] += 1
>>         minimum = min(number_list)
>>         maximum = max(number_list)
>>         return (minimum, maximum, minimum / maximum)
> 
> Putting aside the timing aspects, your frequency calculations are not done
> in a very Pythonic manner. 

I would agree if the frequency table were sparse, i. e. many indices with

number_list[index] == 0

but that's not the case with on average 10000 hits per index.

> A better way might be:

I'm too lazy to measure, but this will likely be a tad slower. Did you mean 
to convey this by "Putting aside the timing aspects"?

> from collections import Counter
> from random import randint
> 
> def test_random(length, multiplier = 10000):
>     freq = Counter(
>         randint(0, length - 1) for i in range(length * multiplier)
>         )
>     minimum = min(freq[i] for i in range(length))

How about

      if len(freq) < length:
          minimum = 0
      else:
          minimum = min(freq.values())

Not as an optimisation (replacing randint() with randrange() is probably 
more effective in that department), but because it's closer to being self-
explanatory.

>     maximum = max(freq.values())
>     return (minimum, maximum, minimum / maximum)

[toc] | [prev] | [next] | [standalone]

#92270

From	Thomas 'PointedEars' Lahn <PointedEars@web.de>
Date	2015-06-07 18:36 +0200
Message-ID	<1586765.9ka6YhQpJC@PointedEars.de>
In reply to	#92252

Peter Otten wrote:

> Steven D'Aprano wrote:
>>> I wrote a very simple function to test random:
>>>     def test_random(length, multiplier = 10000):
>>>         number_list = length * [0]
>>>         for i in range(length * multiplier):
>>>             number_list[random.randint(0, length - 1)] += 1
>>>         minimum = min(number_list)
>>>         maximum = max(number_list)
>>>         return (minimum, maximum, minimum / maximum)
>> 
>> Putting aside the timing aspects, your frequency calculations are not
>> done in a very Pythonic manner.
> 
> I would agree if the frequency table were sparse, i. e. many indices with
> 
> number_list[index] == 0
> 
> but that's not the case with on average 10000 hits per index.

A common misconception.

-- 
PointedEars

Twitter: @PointedEars2
Please do not cc me. / Bitte keine Kopien per E-Mail.

[toc] | [prev] | [next] | [standalone]

#92273

From	Peter Otten <__peter__@web.de>
Date	2015-06-07 18:48 +0200
Message-ID	<mailman.260.1433695702.13271.python-list@python.org>
In reply to	#92270

Thomas 'PointedEars' Lahn wrote:

> Peter Otten wrote:
> 
>> Steven D'Aprano wrote:
>>>> I wrote a very simple function to test random:
>>>>     def test_random(length, multiplier = 10000):
>>>>         number_list = length * [0]
>>>>         for i in range(length * multiplier):
>>>>             number_list[random.randint(0, length - 1)] += 1
>>>>         minimum = min(number_list)
>>>>         maximum = max(number_list)
>>>>         return (minimum, maximum, minimum / maximum)
>>> 
>>> Putting aside the timing aspects, your frequency calculations are not
>>> done in a very Pythonic manner.
>> 
>> I would agree if the frequency table were sparse, i. e. many indices with
>> 
>> number_list[index] == 0
>> 
>> but that's not the case with on average 10000 hits per index.
> 
> A common misconception.

A pointless remark. Not counting the ears ;)

[toc] | [prev] | [next] | [standalone]

#92300

From	Thomas 'PointedEars' Lahn <PointedEars@web.de>
Date	2015-06-07 22:15 +0200
Message-ID	<2115433.UTq2kp7hT1@PointedEars.de>
In reply to	#92273

Peter Otten wrote:

> Thomas 'PointedEars' Lahn wrote:
>> Peter Otten wrote: 
>>> Steven D'Aprano wrote:
>>>>> I wrote a very simple function to test random:
>>>>>     def test_random(length, multiplier = 10000):
>>>>>         number_list = length * [0]
>>>>>         for i in range(length * multiplier):
>>>>>             number_list[random.randint(0, length - 1)] += 1
>>>>>         minimum = min(number_list)
>>>>>         maximum = max(number_list)
>>>>>         return (minimum, maximum, minimum / maximum)
>>>> 
>>>> Putting aside the timing aspects, your frequency calculations are not
>>>> done in a very Pythonic manner.
>>> 
>>> I would agree if the frequency table were sparse, i. e. many indices
>>> with
>>> 
>>> number_list[index] == 0
>>> 
>>> but that's not the case with on average 10000 hits per index.
>> 
>> A common misconception.
> 
> A pointless remark. Not counting the ears ;)

The details are in my other follow-ups.  Go read them.

-- 
PointedEars

Twitter: @PointedEars2
Please do not cc me. / Bitte keine Kopien per E-Mail.

[toc] | [prev] | [next] | [standalone]

#92320

From	Steven D'Aprano <steve@pearwood.info>
Date	2015-06-08 11:35 +1000
Message-ID	<5574f14c$0$13008$c3e8da3$5496439d@news.astraweb.com>
In reply to	#92252

On Sun, 7 Jun 2015 11:35 pm, Peter Otten wrote:

> Steven D'Aprano wrote:
> 
> 
>>> I wrote a very simple function to test random:
>>>     def test_random(length, multiplier = 10000):
>>>         number_list = length * [0]
>>>         for i in range(length * multiplier):
>>>             number_list[random.randint(0, length - 1)] += 1
>>>         minimum = min(number_list)
>>>         maximum = max(number_list)
>>>         return (minimum, maximum, minimum / maximum)
>> 
>> Putting aside the timing aspects, your frequency calculations are not
>> done in a very Pythonic manner.
> 
> I would agree if the frequency table were sparse, i. e. many indices with
> 
> number_list[index] == 0
> 
> but that's not the case with on average 10000 hits per index.
> 
>> A better way might be:
> 
> I'm too lazy to measure, but this will likely be a tad slower. Did you
> mean to convey this by "Putting aside the timing aspects"?

No, I meant that using a Counter to count stuff is more Pythonic than using
a list.



>> from collections import Counter
>> from random import randint
>> 
>> def test_random(length, multiplier = 10000):
>>     freq = Counter(
>>         randint(0, length - 1) for i in range(length * multiplier)
>>         )
>>     minimum = min(freq[i] for i in range(length))
> 
> How about
> 
>       if len(freq) < length:
>           minimum = 0
>       else:
>           minimum = min(freq.values())

That's quite nice, and it's extensible to the case where I return the
counter:

if len(freq) != length:
    # At least one element never showed up at all.
    freq.subtract(dict.fromkeys(range(length), 0)


 
> Not as an optimisation (replacing randint() with randrange() is probably
> more effective in that department), but because it's closer to being self-
> explanatory.

I thought about replacing randint with randrange, but they are different
functions with different implementations, and could exhibit different
biases. So one might wish to test them separately for statistical
properties.



-- 
Steven

[toc] | [prev] | [next] | [standalone]

#92243

From	Christian Gollwitzer <auriocus@gmx.de>
Date	2015-06-07 14:53 +0200
Message-ID	<ml1epn$peo$1@dont-email.me>
In reply to	#92207

Am 07.06.15 um 08:27 schrieb Cecil Westerhof:
> I wrote a very simple function to test random:
>      def test_random(length, multiplier = 10000):
>          number_list = length * [0]
>          for i in range(length * multiplier):
>              number_list[random.randint(0, length - 1)] += 1
>          minimum = min(number_list)
>          maximum = max(number_list)
>          return (minimum, maximum, minimum / maximum)
>

> It shows that when the first parameter increases the deviation
> increases and when the second parameter increases the deviation
> decreases. Exactly what you would expect. But what are the ranges you
> would expect with a good random function. Then it could be used to
> test a random function.

Random number generators (RNG or PRNG) are usually tested using similar 
methods, where the outcome can be assigned a probability distribution. 
For example, diehard is a famous sequence of tests for RNGs.

http://en.wikipedia.org/wiki/Diehard_tests

You are only checking for uniform distribution. This couldn't detect 
correlation, for instance if the RNG would just return increasing 
numbers. A better check for uniformness could be done by the chi square 
test or Kolmogorov-Smirnov. Then there are tables which relate the 
deviations to significance.

	Christian

[toc] | [prev] | [next] | [standalone]

#92259

From	Dennis Lee Bieber <wlfraed@ix.netcom.com>
Date	2015-06-07 11:04 -0400
Message-ID	<mailman.252.1433689486.13271.python-list@python.org>
In reply to	#92207

On Sun, 07 Jun 2015 08:27:59 +0200, Cecil Westerhof <Cecil@decebal.nl>
declaimed the following:

>
>It shows that when the first parameter increases the deviation
>increases and when the second parameter increases the deviation
>decreases. Exactly what you would expect. But what are the ranges you
>would expect with a good random function. Then it could be used to
>test a random function.

	From what I could make out, you are testing the distribution of an
equally weighted function (that is, any value in the lower..upper range is
equally likely to appear).

	Unfortunately, a measure of distribution shows nothing for
predictability (non-randomness).

	Here's an example displaying ideal distribution (I should have made the
first for-loop a generator)

>>> def lin(smpl=10000):
... 	samples = {}
... 	for i in range(smpl):
... 		samples[i % 10] = samples.get(i % 10, 0) + 1
... 	for i in range(10):
... 		print i, samples[i]
... 		
>>> lin()
0 1000
1 1000
2 1000
3 1000
4 1000
5 1000
6 1000
7 1000
8 1000
9 1000
>>> lin(9876)
0 988
1 988
2 988
3 988
4 988
5 988
6 987
7 987
8 987
9 987
>>> 

	But is it random? Only in that there is a finite chance of a truly
random sequence being: 0, 1, 2, 3, 4, ..., 8, 9, 0, 1, ...

	If the distribution is widely uneven, the odds are the generator is not
random -- or at least, not random in the equally weighted sense; it may be
perfectly random gaussian or Poisson distribution.

	For the equally weighted case, one may need to also compute, say,
differences between adjacent samples, between alternate samples, between
every 10th sample, whatever... with the intent being to reveal if there is
any cyclic pattern detectable.
-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
    wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/

[toc] | [prev] | [standalone]

Page 4 of 4 — ← Prev page 1 2 3 [4]

csiph-web

Testing random

Contents

#92744

#92718

#92724

#92317

#92274

#92276

#92279

#92284

#92286

#92282

#92227

#92242

#92247

#92252

#92270

#92273

#92300

#92320

#92243

#92259