Groups > comp.lang.python > #66507 > unrolled thread

Re: random.sample with large weighted sample-sets?

Started by	Ben Finney <ben+python@benfinney.id.au>
First post	2014-02-16 16:08 +1100
Last post	2014-02-16 17:38 +0000
Articles	4 — 3 participants

Back to article view | Back to comp.lang.python

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: random.sample with large weighted sample-sets? Ben Finney <ben+python@benfinney.id.au> - 2014-02-16 16:08 +1100
    Re: random.sample with large weighted sample-sets? duncan smith <buzzard@invalid.invalid> - 2014-02-16 16:01 +0000
    Re: random.sample with large weighted sample-sets? Charles Allen <ca137tmp@earthlink.net> - 2014-02-16 10:35 -0600
      Re: random.sample with large weighted sample-sets? duncan smith <buzzard@invalid.invalid> - 2014-02-16 17:38 +0000

#66507 — Re: random.sample with large weighted sample-sets?

From	Ben Finney <ben+python@benfinney.id.au>
Date	2014-02-16 16:08 +1100
Subject	Re: random.sample with large weighted sample-sets?
Message-ID	<mailman.7036.1392527326.18130.python-list@python.org>

Tim Chase <python.list@tim.thechases.com> writes:

> I'm not coming up with the right keywords to find what I'm hunting.
> I'd like to randomly sample a modestly compact list with weighted
> distributions, so I might have
>
>   data = (
>     ("apple", 20),
>     ("orange", 50),
>     ("grape", 30),
>     )

That's not a list, it's a tuple. I think you want a list.

When you want a sequence where each position has a semantic meaning, use
a tuple (such as ‘("apple", 20)’). Each item has a meaning *because of*
the position it's in; if the items were in a different order, they'd
mean different things.

When you want a sequence where the positions don't have a special
meaning – each item means exactly the same no matter if you change the
order – that's sometimes called a “homogeneous” sequence, and you want a
list.

So a “record” should be represented as a tuple, and a “table” of records
should be represented as a list of tuples:

    records = [
            ("apple", 20),
            ("orange", 50),
            ("grape", 30),
            ]

> and I'd like to random.sample() it as if it was a 100-element list.

The implication being, I suppose, that you'd like the number in each
tuple to be a weighting for the probability of choosing that item.

For probability weightings, you should arrange for the weightings to sum
to 1 (instead of 100 in your example). Then each weighting is simply the
desired probability of that item, and those values will work with
various libraries that deal with probability.

> What am I missing? (links to relevant keywords/searches/algorithms
> welcome in lieu of actually answering in-line)

You're looking for a “probability distribution” and “weighted choice”.

Hope that helps!

-- 
 \     “This world in arms is not spending money alone. It is spending |
  `\      the sweat of its laborers, the genius of its scientists, the |
_o__)           hopes of its children.” —Dwight Eisenhower, 1953-04-16 |
Ben Finney

[toc] | [next] | [standalone]

#66549

From	duncan smith <buzzard@invalid.invalid>
Date	2014-02-16 16:01 +0000
Message-ID	<5300e118$0$28923$862e30e2@ngroups.net>
In reply to	#66507

On 16/02/14 05:08, Ben Finney wrote:
> Tim Chase <python.list@tim.thechases.com> writes:
>
>> I'm not coming up with the right keywords to find what I'm hunting.
>> I'd like to randomly sample a modestly compact list with weighted
>> distributions, so I might have
>>
>>    data = (
>>      ("apple", 20),
>>      ("orange", 50),
>>      ("grape", 30),
>>      )
>
> That's not a list, it's a tuple. I think you want a list.
>
> When you want a sequence where each position has a semantic meaning, use
> a tuple (such as ‘("apple", 20)’). Each item has a meaning *because of*
> the position it's in; if the items were in a different order, they'd
> mean different things.
>
> When you want a sequence where the positions don't have a special
> meaning – each item means exactly the same no matter if you change the
> order – that's sometimes called a “homogeneous” sequence, and you want a
> list.
>
> So a “record” should be represented as a tuple, and a “table” of records
> should be represented as a list of tuples:
>
>      records = [
>              ("apple", 20),
>              ("orange", 50),
>              ("grape", 30),
>              ]
>
>> and I'd like to random.sample() it as if it was a 100-element list.
>

[snip]

That's a description of sampling without replacement. The probabilities 
change as items are sampled. e.g. The probability of the first item 
being "apple"is 20/100. But the probability that the second sampled item 
is "apple" is either 19/99 or 20/99, depending on the value of the first 
sampled item. The following (due to Knuth) will generate indices into a 
notional list of items.


def indices(n, pop):
     # generates indices into a
     # population list containing
     # items with frequencies in pop
     # [("apple", 10), ("orange", 50), ...]
     N = sum(tup[1] for tup in pop)
     i = m = 0
     while m < n:
         u = random.random()
         if (N-i)*u >= n-m:
             i += 1
         else:
             yield i
             i += 1
             m += 1


 >>> list(indices(3, [("apple", 20),("orange", 50),("grape", 30)]))
[8, 27, 78]
 >>>


The indices are generated in order, so it could easily be extended to 
generate items or item count pairs.

There might be something more efficient based on the hypergeometric 
distribution (generate a number of apples, then a number of oranges 
given the number of sampled apples, then a number of grapes given the 
number of sampled apples and oranges, etc.).

Duncan

[toc] | [prev] | [next] | [standalone]

#66555

From	Charles Allen <ca137tmp@earthlink.net>
Date	2014-02-16 10:35 -0600
Message-ID	<slrnlg1q7e.3nv.ca137tmp@zorro.local>
In reply to	#66507

How efficient does this thing need to be?

You can always just turn it into a two-dimensional sampling problem by
thinking of the data as a function f(x=item), generating a random x=xr
in [0,x], then generating a random y in [0,max(f(x))].  The xr is
accepted if 0 < y <= max(f(xr)), or rejected (and another attempt made) if
y > max(f(xr)).

[toc] | [prev] | [next] | [standalone]

#66559

From	duncan smith <buzzard@invalid.invalid>
Date	2014-02-16 17:38 +0000
Message-ID	<5300f7c5$0$29259$862e30e2@ngroups.net>
In reply to	#66555

On 16/02/14 16:35, Charles Allen wrote:
> How efficient does this thing need to be?
>
> You can always just turn it into a two-dimensional sampling problem by
> thinking of the data as a function f(x=item), generating a random x=xr
> in [0,x], then generating a random y in [0,max(f(x))].  The xr is
> accepted if 0 < y <= max(f(xr)), or rejected (and another attempt made) if
> y > max(f(xr)).
>

You can avoid rejection by constructing an alias table. A list can be 
constructed such that each list element contains a pair of values and a 
cutoff. e.g.

[("apple", 20), ("orange", 50), ("grape", 30)]

would become (using one particular algorithm)

[(("apple", "orange"), 0.6),
  (("orange", "apple"), 1.0),
  (("grape", "orange"), 0.9)]

Generate a random index, then select one of the values on the basis of 
the cutoff. For short enough lists you can generate a single 0-1 random 
variate, u, and use int(n*u) for the index and compare n*u - int(n*u) to 
the cutoff, where n is the length of the list. It's still sampling with 
replacement though.

Duncan

[toc] | [prev] | [standalone]

csiph-web

Re: random.sample with large weighted sample-sets?

Contents

#66507 — Re: random.sample with large weighted sample-sets?

#66549

#66555

#66559