Re: random.sample with large weighted sample-sets?

Path	csiph.com!usenet.pasdenom.info!weretis.net!feeder1.news.weretis.net!feeder.erje.net!eu.feeder.erje.net!xlned.com!feeder5.xlned.com!newsfeed.xs4all.nl!newsfeed1a.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
Return-Path	<python-python-list@m.gmane.org>
X-Original-To	python-list@python.org
Delivered-To	python-list@mail.python.org
X-Spam-Status	OK 0.001
X-Spam-Evidence	'H': 1.00; 'S': 0.00; 'string': 0.09; '[0,': 0.09; 'ascii': 0.09; 'integers': 0.09; 'received:80.91': 0.09; 'received:80.91.229': 0.09; 'received:gmane.org': 0.09; 'received:list': 0.09; 'jan': 0.12; 'random': 0.14; '(0,': 0.16; 'absolute,': 0.16; 'array.': 0.16; 'assumptions': 0.16; 'bisect': 0.16; 'calculates': 0.16; 'data)': 0.16; 'frequencies': 0.16; 'objects.': 0.16; 'received:80.91.229.3': 0.16; 'received:plane.gmane.org': 0.16; 'reedy': 0.16; 'subject:sample': 0.16; 'index': 0.16; 'wrote:': 0.18; 'trying': 0.19; 'meant': 0.20; 'seems': 0.21; '>>>': 0.22; 'appears': 0.22; 'select': 0.22; 'import': 0.22; 'separate': 0.22; 'header:User-Agent:1': 0.23; 'right.': 0.26; 'code:': 0.26; 'header:X-Complaints-To:1': 0.27; 'header:In-Reply-To:1': 0.27; 'appear': 0.29; 'testing': 0.29; 'am,': 0.29; 'array': 0.29; 'tim': 0.29; 'compared': 0.30; 'returned': 0.30; 'code': 0.31; 'lines': 0.31; 'chase': 0.31; 'implicit': 0.31; 'class': 0.32; 'selection': 0.32; 'could': 0.34; 'subject:with': 0.35; 'there': 0.35; 'subject:?': 0.36; 'should': 0.36; 'two': 0.37; 'form,': 0.38; 'needed': 0.38; 'to:addr:python- list': 0.38; 'issue': 0.38; 'pm,': 0.38; 'rather': 0.38; 'to:addr:python.org': 0.39; 'received:org': 0.40; 'even': 0.60; 'read': 0.60; 'received:173': 0.61; 'you.': 0.62; 'show': 0.63; 'yes': 0.68; 'below.': 0.71; '100': 0.79; '50),': 0.84; 'data;': 0.84; 'received:fios.verizon.net': 0.84
X-Injected-Via-Gmane	http://gmane.org/
To	python-list@python.org
From	Terry Reedy <tjreedy@udel.edu>
Subject	Re: random.sample with large weighted sample-sets?
Date	Sun, 16 Feb 2014 14:47:56 -0500
References	<20140215224145.68c82eb4@bigbox.christie.dr> <ldpve4$tnk$1@ger.gmane.org> <20140216082250.7aecebb4@bigbox.christie.dr>
Mime-Version	1.0
Content-Type	text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding	7bit
X-Gmane-NNTP-Posting-Host	pool-173-75-254-207.phlapa.fios.verizon.net
User-Agent	Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.3.0
In-Reply-To	<20140216082250.7aecebb4@bigbox.christie.dr>
X-BeenThere	python-list@python.org
X-Mailman-Version	2.1.15
Precedence	list
List-Id	General discussion list for the Python programming language <python-list.python.org>
List-Unsubscribe	<https://mail.python.org/mailman/options/python-list>, <mailto:python-list-request@python.org?subject=unsubscribe>
List-Archive	<http://mail.python.org/pipermail/python-list/>
List-Post	<mailto:python-list@python.org>
List-Help	<mailto:python-list-request@python.org?subject=help>
List-Subscribe	<https://mail.python.org/mailman/listinfo/python-list>, <mailto:python-list-request@python.org?subject=subscribe>
Newsgroups	comp.lang.python
Message-ID	<mailman.7070.1392580125.18130.python-list@python.org> (permalink)
Lines	79
NNTP-Posting-Host	2001:888:2000:d::a6
X-Trace	1392580125 news.xs4all.nl 2959 [2001:888:2000:d::a6]:50083
X-Complaints-To	abuse@xs4all.nl
Xref	csiph.com comp.lang.python:66564

Show key headers only | View raw

On 2/16/2014 9:22 AM, Tim Chase wrote:
> On 2014-02-16 04:12, Terry Reedy wrote:
>> On 2/15/2014 11:41 PM, Tim Chase wrote:
>>>     data = (
>>>       ("apple", 20),
>>>       ("orange", 50),
>>>       ("grape", 30),
>>>       )

>> If you actually start with date in this form, write the few lines
>> needed to produce the form below.
>>
>> import bisect
>> import random
>>
>> data = [
>>     (0, 'apple'),
>>     (20, 'orange'),
>>     (70, 'grape'),
>> ]
>>
>> for i in range(10):
>>       r = random.randrange(0, 100)
>>       i = bisect.bisect(data, (r, 'zzzzz')) - 1
>>       print(data[i][1])
>
> Trying to read what may be implicit assumptions in your code:

0) The frequencies are relative, not absolute, and the selection is with 
replacement (== selection without replacement from an 'infinite' population)

> 1) your code calculates "100" as sum(item[0] for item in data)

yes

> 2) the data has to be sorted for bisect to work

cumulative sums are automatically sorted.

>
> 3) you meant to write "(10, 'apple')" rather than 0.

No (as Ned explained). There are 20 integers in [0, 20), so 20 chances 
out of 100 to select an apple.

> 4) that "zzzzzz" is some arbitrary value that should come after any
> string that could appear in the data;

Right. Use "\U000FFFFF" if not using ascii only.

 > perhaps using some custom "InfinityString" class where everything
 > compared to it is always less than it.

Why bother when a good-enough 'top' string is available?
Up to you.

The issue can be avoided by transposing the n x 2 array into a 2 x n 
array with separate subarrays of cumulative sums and objects.  Do the 
bisect search in the subarray of cusums and use the returned index to 
retrieve the object from the object array.

>
> Some long-running testing on this code seems to show that if two
> items have the same probability, bisect only appears to find the last
> one.  Tested with
>
>    data = [
>      (10, "apple"),
>      (20, "banana"), # I never get any bananas, even after thousands of iterations

because 20 - 20 == 0

>      (20, "grape"),
>      (50, "orange"),
>      ]

-- 
Terry Jan Reedy

Back to comp.lang.python | Previous | Next | Find similar | Unroll thread

Thread

Re: random.sample with large weighted sample-sets? Terry Reedy <tjreedy@udel.edu> - 2014-02-16 14:47 -0500

csiph-web