Path: csiph.com!usenet.pasdenom.info!news.redatomik.org!newsfeed.xs4all.nl!newsfeed3a.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail Return-Path: X-Original-To: python-list@python.org Delivered-To: python-list@mail.python.org X-Spam-Status: OK 0.058 X-Spam-Evidence: '*H*': 0.88; '*S*': 0.00; 'sufficient': 0.05; '[0]': 0.07; 'optimizing': 0.09; 'cc:addr:python-list': 0.10; 'def': 0.14; '100,': 0.16; '8:40': 0.16; '_do_': 0.16; 'bucket': 0.16; 'certainty': 0.16; 'code),': 0.16; 'correctness': 0.16; 'from:addr:rosuav': 0.16; 'from:name:chris angelico': 0.16; 'hits.': 0.16; 'omitted,': 0.16; 'something.': 0.16; 'subject:random': 0.16; 'wording': 0.16; 'wrote:': 0.16; 'element': 0.18; 'instance,': 0.18; '>>>': 0.20; 'cc:2**0': 0.21; 'cc:addr:python.org': 0.21; 'fairly': 0.22; 'so.': 0.22; 'am,': 0.23; '2015': 0.23; 'wrote': 0.23; 'header:In-Reply-To:1': 0.24; 'raise': 0.24; 'mon,': 0.24; 'chris': 0.26; 'least': 0.27; 'message-id:@mail.gmail.com': 0.28; "i'm": 0.29; 'closer': 0.29; 'dictionary': 0.29; 'once.': 0.29; 'sure,': 0.29; 'random': 0.29; 'starts': 0.29; 'function': 0.30; "skip:' 10": 0.30; 'values': 0.30; 'becomes': 0.31; 'posts': 0.31; 'code': 0.31; 'gets': 0.32; 'common': 0.33; 'another': 0.34; '(for': 0.34; 'received:google.com': 0.34; 'minimum': 0.35; 'list:': 0.35; 'something': 0.35; 'list': 0.35; 'but': 0.36; 'there': 0.36; 'should': 0.37; 'subject:: ': 0.37; 'list.': 0.37; 'mean': 0.38; 'pm,': 0.39; 'test': 0.39; 'sure': 0.40; 'where': 0.40; 'some': 0.40; 'maximum': 0.61; "you'll": 0.61; 'simple': 0.61; 'here.': 0.61; 'more': 0.62; 'thomas': 0.63; 'making': 0.64; 'percent': 0.66; 'talking': 0.67; 'guaranteed': 0.67; 'levels': 0.70; 'increase': 0.72; 'increasing': 0.76; '100': 0.79; 'satisfied': 0.83; '10th': 0.84; 'cecil': 0.84; 'chrisa': 0.84; 'confidence,': 0.84; 'probable': 0.84; 'subject:Testing': 0.84; "there'll": 0.84; 'westerhof': 0.84; 'to:none': 0.90; 'power,': 0.91; 'acknowledge': 0.93; 'serious': 0.97 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:cc :content-type; bh=vefS3GfKP2kcRQj7Lzmrk8d0cZk0Wix1mVZBKElK+Z8=; b=mmEPb3dNM+8+fwewHRht57I8P4FRARyADm+r/nG/q/0yWsBdvZgJtWgeYa+zw4f5Hk K8aYygTlWx+jMh0WqKmY2GDC7E+qa34smQ3/sHnKIxDCUalnneyMTiydsYdCbLgkw2Ml Rv1NQnJ2cWXH43E9a9SMeLPIfWFDK5kcViT+GBLLRQnVb9gLAeS7vSFtsCWyrlUPfpJd S0HA4LSzuhdpf4vo4TwV62IF/BZtOO7PyfpVJl4gpPCj1H/d+RZmUIJYEQPh5wbwB9cb q0PxnNh8swl4yRQeYFMW8ThvNcZobpDrY2oMN2LV3OyUl1guzE/iV40vUt5eR9C1hqYx hdTQ== MIME-Version: 1.0 X-Received: by 10.50.141.164 with SMTP id rp4mr8910766igb.2.1433694321096; Sun, 07 Jun 2015 09:25:21 -0700 (PDT) In-Reply-To: <3158703.Lr4HFMbMOd@PointedEars.de> References: <87oaksowwg.fsf@Equus.decebal.nl> <1451048.pW9z17ilMA@PointedEars.de> <3158703.Lr4HFMbMOd@PointedEars.de> Date: Mon, 8 Jun 2015 02:25:21 +1000 Subject: Re: Testing random From: Chris Angelico Cc: "python-list@python.org" Content-Type: text/plain; charset=UTF-8 X-BeenThere: python-list@python.org X-Mailman-Version: 2.1.20+ Precedence: list List-Id: General discussion list for the Python programming language List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Newsgroups: comp.lang.python Message-ID: Lines: 60 NNTP-Posting-Host: 2001:888:2000:d::a6 X-Trace: 1433694323 news.xs4all.nl 2968 [2001:888:2000:d::a6]:40613 X-Complaints-To: abuse@xs4all.nl Xref: csiph.com comp.lang.python:92267 On Mon, Jun 8, 2015 at 1:51 AM, Thomas 'PointedEars' Lahn wrote: > Chris Angelico wrote: > >> On Sun, Jun 7, 2015 at 8:40 PM, Thomas 'PointedEars' Lahn >> wrote: >>> Cecil Westerhof wrote: >>>> I wrote a very simple function to test random: >>>> def test_random(length, multiplier = 10000): >>>> number_list = length * [0] >>>> for i in range(length * multiplier): >>>> number_list[random.randint(0, length - 1)] += 1 >>>> minimum = min(number_list) >>>> maximum = max(number_list) >>>> return (minimum, maximum, minimum / maximum) >>> >>> As there is no guarantee that every number will occur randomly, using a >>> dictionary at first should be more efficient than a list: >> >> Hmm, I'm not sure that's actually so. His code is aiming to get >> 'multiplier' values in each box; for any serious multiplier (he starts >> with 10 in the main code), you can be fairly confident that every >> number will come up at least once. > > The wording shows a common misconception: that random distribution would > mean that it is guaranteed or more probable that every element of the set > will occur at least once. It is another common misconception that > increasing the number of trials would increase the probability of that > happening. But that is not so. The greater the multiplier, the lower the chance that any element will have no hits. With uniform distribution, a length of 10, and a multiplier of 10, there are 100 attempts with a 90% chance each that any given number will be avoided - which works out to 0.9**100 == 2.6561398887587544e-05 probability that (say) there'll be no 4s in the list. Invert that and raise to the 10th power, and you get a probability of 0.9997344177567317 that there'll be at least one in every bucket. Raise the multiplier to 100, and the probability of a single whiff becomes 1.7478712517226947e-46; invert and raise to the tenth power, and it becomes closer to certainty than IEEE double precision can represent. Raise the length to 100 and the numbers get lower again; with multiplier 10, probability 0.9956920878572284 of having one in every bucket (that's a half a percent chance of a zero anywhere), and at multiplier 100, still underflows to certainty. But you'll notice that I wasn't actually talking about certainty here. I was talking about confidence, at levels sufficient to make data-type decisions on. Sure, there's no guarantee that every number will occur; but if there's a 0.4% chance that any number will be omitted, I think the list is going to work out more efficient. You'll notice that some of the other posts have been concerned more about correctness (for instance, using collections.Counter and then making sure there's a zero slot for every element - otherwise empty slots would be omitted), and then they _do_ acknowledge the chance that something will underflow. But with the numbers the OP gave, I would be fully satisfied with optimizing for the case where every bucket gets at least something. ChrisA