Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.python > #66507 > unrolled thread

Re: random.sample with large weighted sample-sets?

Started byBen Finney <ben+python@benfinney.id.au>
First post2014-02-16 16:08 +1100
Last post2014-02-16 17:38 +0000
Articles 4 — 3 participants

Back to article view | Back to comp.lang.python

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.


Contents

  Re: random.sample with large weighted sample-sets? Ben Finney <ben+python@benfinney.id.au> - 2014-02-16 16:08 +1100
    Re: random.sample with large weighted sample-sets? duncan smith <buzzard@invalid.invalid> - 2014-02-16 16:01 +0000
    Re: random.sample with large weighted sample-sets? Charles Allen <ca137tmp@earthlink.net> - 2014-02-16 10:35 -0600
      Re: random.sample with large weighted sample-sets? duncan smith <buzzard@invalid.invalid> - 2014-02-16 17:38 +0000

#66507 — Re: random.sample with large weighted sample-sets?

FromBen Finney <ben+python@benfinney.id.au>
Date2014-02-16 16:08 +1100
SubjectRe: random.sample with large weighted sample-sets?
Message-ID<mailman.7036.1392527326.18130.python-list@python.org>
Tim Chase <python.list@tim.thechases.com> writes:

> I'm not coming up with the right keywords to find what I'm hunting.
> I'd like to randomly sample a modestly compact list with weighted
> distributions, so I might have
>
>   data = (
>     ("apple", 20),
>     ("orange", 50),
>     ("grape", 30),
>     )

That's not a list, it's a tuple. I think you want a list.

When you want a sequence where each position has a semantic meaning, use
a tuple (such as ‘("apple", 20)’). Each item has a meaning *because of*
the position it's in; if the items were in a different order, they'd
mean different things.

When you want a sequence where the positions don't have a special
meaning – each item means exactly the same no matter if you change the
order – that's sometimes called a “homogeneous” sequence, and you want a
list.

So a “record” should be represented as a tuple, and a “table” of records
should be represented as a list of tuples:

    records = [
            ("apple", 20),
            ("orange", 50),
            ("grape", 30),
            ]

> and I'd like to random.sample() it as if it was a 100-element list.

The implication being, I suppose, that you'd like the number in each
tuple to be a weighting for the probability of choosing that item.

For probability weightings, you should arrange for the weightings to sum
to 1 (instead of 100 in your example). Then each weighting is simply the
desired probability of that item, and those values will work with
various libraries that deal with probability.

> What am I missing? (links to relevant keywords/searches/algorithms
> welcome in lieu of actually answering in-line)

You're looking for a “probability distribution” and “weighted choice”.

Hope that helps!

-- 
 \     “This world in arms is not spending money alone. It is spending |
  `\      the sweat of its laborers, the genius of its scientists, the |
_o__)           hopes of its children.” —Dwight Eisenhower, 1953-04-16 |
Ben Finney

[toc] | [next] | [standalone]


#66549

Fromduncan smith <buzzard@invalid.invalid>
Date2014-02-16 16:01 +0000
Message-ID<5300e118$0$28923$862e30e2@ngroups.net>
In reply to#66507
On 16/02/14 05:08, Ben Finney wrote:
> Tim Chase <python.list@tim.thechases.com> writes:
>
>> I'm not coming up with the right keywords to find what I'm hunting.
>> I'd like to randomly sample a modestly compact list with weighted
>> distributions, so I might have
>>
>>    data = (
>>      ("apple", 20),
>>      ("orange", 50),
>>      ("grape", 30),
>>      )
>
> That's not a list, it's a tuple. I think you want a list.
>
> When you want a sequence where each position has a semantic meaning, use
> a tuple (such as ‘("apple", 20)’). Each item has a meaning *because of*
> the position it's in; if the items were in a different order, they'd
> mean different things.
>
> When you want a sequence where the positions don't have a special
> meaning – each item means exactly the same no matter if you change the
> order – that's sometimes called a “homogeneous” sequence, and you want a
> list.
>
> So a “record” should be represented as a tuple, and a “table” of records
> should be represented as a list of tuples:
>
>      records = [
>              ("apple", 20),
>              ("orange", 50),
>              ("grape", 30),
>              ]
>
>> and I'd like to random.sample() it as if it was a 100-element list.
>

[snip]

That's a description of sampling without replacement. The probabilities 
change as items are sampled. e.g. The probability of the first item 
being "apple"is 20/100. But the probability that the second sampled item 
is "apple" is either 19/99 or 20/99, depending on the value of the first 
sampled item. The following (due to Knuth) will generate indices into a 
notional list of items.


def indices(n, pop):
     # generates indices into a
     # population list containing
     # items with frequencies in pop
     # [("apple", 10), ("orange", 50), ...]
     N = sum(tup[1] for tup in pop)
     i = m = 0
     while m < n:
         u = random.random()
         if (N-i)*u >= n-m:
             i += 1
         else:
             yield i
             i += 1
             m += 1


 >>> list(indices(3, [("apple", 20),("orange", 50),("grape", 30)]))
[8, 27, 78]
 >>>


The indices are generated in order, so it could easily be extended to 
generate items or item count pairs.

There might be something more efficient based on the hypergeometric 
distribution (generate a number of apples, then a number of oranges 
given the number of sampled apples, then a number of grapes given the 
number of sampled apples and oranges, etc.).

Duncan

[toc] | [prev] | [next] | [standalone]


#66555

FromCharles Allen <ca137tmp@earthlink.net>
Date2014-02-16 10:35 -0600
Message-ID<slrnlg1q7e.3nv.ca137tmp@zorro.local>
In reply to#66507
How efficient does this thing need to be?

You can always just turn it into a two-dimensional sampling problem by
thinking of the data as a function f(x=item), generating a random x=xr
in [0,x], then generating a random y in [0,max(f(x))].  The xr is
accepted if 0 < y <= max(f(xr)), or rejected (and another attempt made) if
y > max(f(xr)).

[toc] | [prev] | [next] | [standalone]


#66559

Fromduncan smith <buzzard@invalid.invalid>
Date2014-02-16 17:38 +0000
Message-ID<5300f7c5$0$29259$862e30e2@ngroups.net>
In reply to#66555
On 16/02/14 16:35, Charles Allen wrote:
> How efficient does this thing need to be?
>
> You can always just turn it into a two-dimensional sampling problem by
> thinking of the data as a function f(x=item), generating a random x=xr
> in [0,x], then generating a random y in [0,max(f(x))].  The xr is
> accepted if 0 < y <= max(f(xr)), or rejected (and another attempt made) if
> y > max(f(xr)).
>

You can avoid rejection by constructing an alias table. A list can be 
constructed such that each list element contains a pair of values and a 
cutoff. e.g.

[("apple", 20), ("orange", 50), ("grape", 30)]

would become (using one particular algorithm)

[(("apple", "orange"), 0.6),
  (("orange", "apple"), 1.0),
  (("grape", "orange"), 0.9)]

Generate a random index, then select one of the values on the basis of 
the cutoff. For short enough lists you can generate a single 0-1 random 
variate, u, and use int(n*u) for the index and compare n*u - int(n*u) to 
the cutoff, where n is the length of the list. It's still sampling with 
replacement though.

Duncan

[toc] | [prev] | [standalone]


Back to top | Article view | comp.lang.python


csiph-web