Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #66507 > unrolled thread
| Started by | Ben Finney <ben+python@benfinney.id.au> |
|---|---|
| First post | 2014-02-16 16:08 +1100 |
| Last post | 2014-02-16 17:38 +0000 |
| Articles | 4 — 3 participants |
Back to article view | Back to comp.lang.python
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: random.sample with large weighted sample-sets? Ben Finney <ben+python@benfinney.id.au> - 2014-02-16 16:08 +1100
Re: random.sample with large weighted sample-sets? duncan smith <buzzard@invalid.invalid> - 2014-02-16 16:01 +0000
Re: random.sample with large weighted sample-sets? Charles Allen <ca137tmp@earthlink.net> - 2014-02-16 10:35 -0600
Re: random.sample with large weighted sample-sets? duncan smith <buzzard@invalid.invalid> - 2014-02-16 17:38 +0000
| From | Ben Finney <ben+python@benfinney.id.au> |
|---|---|
| Date | 2014-02-16 16:08 +1100 |
| Subject | Re: random.sample with large weighted sample-sets? |
| Message-ID | <mailman.7036.1392527326.18130.python-list@python.org> |
Tim Chase <python.list@tim.thechases.com> writes:
> I'm not coming up with the right keywords to find what I'm hunting.
> I'd like to randomly sample a modestly compact list with weighted
> distributions, so I might have
>
> data = (
> ("apple", 20),
> ("orange", 50),
> ("grape", 30),
> )
That's not a list, it's a tuple. I think you want a list.
When you want a sequence where each position has a semantic meaning, use
a tuple (such as ‘("apple", 20)’). Each item has a meaning *because of*
the position it's in; if the items were in a different order, they'd
mean different things.
When you want a sequence where the positions don't have a special
meaning – each item means exactly the same no matter if you change the
order – that's sometimes called a “homogeneous” sequence, and you want a
list.
So a “record” should be represented as a tuple, and a “table” of records
should be represented as a list of tuples:
records = [
("apple", 20),
("orange", 50),
("grape", 30),
]
> and I'd like to random.sample() it as if it was a 100-element list.
The implication being, I suppose, that you'd like the number in each
tuple to be a weighting for the probability of choosing that item.
For probability weightings, you should arrange for the weightings to sum
to 1 (instead of 100 in your example). Then each weighting is simply the
desired probability of that item, and those values will work with
various libraries that deal with probability.
> What am I missing? (links to relevant keywords/searches/algorithms
> welcome in lieu of actually answering in-line)
You're looking for a “probability distribution” and “weighted choice”.
Hope that helps!
--
\ “This world in arms is not spending money alone. It is spending |
`\ the sweat of its laborers, the genius of its scientists, the |
_o__) hopes of its children.” —Dwight Eisenhower, 1953-04-16 |
Ben Finney
[toc] | [next] | [standalone]
| From | duncan smith <buzzard@invalid.invalid> |
|---|---|
| Date | 2014-02-16 16:01 +0000 |
| Message-ID | <5300e118$0$28923$862e30e2@ngroups.net> |
| In reply to | #66507 |
On 16/02/14 05:08, Ben Finney wrote:
> Tim Chase <python.list@tim.thechases.com> writes:
>
>> I'm not coming up with the right keywords to find what I'm hunting.
>> I'd like to randomly sample a modestly compact list with weighted
>> distributions, so I might have
>>
>> data = (
>> ("apple", 20),
>> ("orange", 50),
>> ("grape", 30),
>> )
>
> That's not a list, it's a tuple. I think you want a list.
>
> When you want a sequence where each position has a semantic meaning, use
> a tuple (such as ‘("apple", 20)’). Each item has a meaning *because of*
> the position it's in; if the items were in a different order, they'd
> mean different things.
>
> When you want a sequence where the positions don't have a special
> meaning – each item means exactly the same no matter if you change the
> order – that's sometimes called a “homogeneous” sequence, and you want a
> list.
>
> So a “record” should be represented as a tuple, and a “table” of records
> should be represented as a list of tuples:
>
> records = [
> ("apple", 20),
> ("orange", 50),
> ("grape", 30),
> ]
>
>> and I'd like to random.sample() it as if it was a 100-element list.
>
[snip]
That's a description of sampling without replacement. The probabilities
change as items are sampled. e.g. The probability of the first item
being "apple"is 20/100. But the probability that the second sampled item
is "apple" is either 19/99 or 20/99, depending on the value of the first
sampled item. The following (due to Knuth) will generate indices into a
notional list of items.
def indices(n, pop):
# generates indices into a
# population list containing
# items with frequencies in pop
# [("apple", 10), ("orange", 50), ...]
N = sum(tup[1] for tup in pop)
i = m = 0
while m < n:
u = random.random()
if (N-i)*u >= n-m:
i += 1
else:
yield i
i += 1
m += 1
>>> list(indices(3, [("apple", 20),("orange", 50),("grape", 30)]))
[8, 27, 78]
>>>
The indices are generated in order, so it could easily be extended to
generate items or item count pairs.
There might be something more efficient based on the hypergeometric
distribution (generate a number of apples, then a number of oranges
given the number of sampled apples, then a number of grapes given the
number of sampled apples and oranges, etc.).
Duncan
[toc] | [prev] | [next] | [standalone]
| From | Charles Allen <ca137tmp@earthlink.net> |
|---|---|
| Date | 2014-02-16 10:35 -0600 |
| Message-ID | <slrnlg1q7e.3nv.ca137tmp@zorro.local> |
| In reply to | #66507 |
How efficient does this thing need to be? You can always just turn it into a two-dimensional sampling problem by thinking of the data as a function f(x=item), generating a random x=xr in [0,x], then generating a random y in [0,max(f(x))]. The xr is accepted if 0 < y <= max(f(xr)), or rejected (and another attempt made) if y > max(f(xr)).
[toc] | [prev] | [next] | [standalone]
| From | duncan smith <buzzard@invalid.invalid> |
|---|---|
| Date | 2014-02-16 17:38 +0000 |
| Message-ID | <5300f7c5$0$29259$862e30e2@ngroups.net> |
| In reply to | #66555 |
On 16/02/14 16:35, Charles Allen wrote:
> How efficient does this thing need to be?
>
> You can always just turn it into a two-dimensional sampling problem by
> thinking of the data as a function f(x=item), generating a random x=xr
> in [0,x], then generating a random y in [0,max(f(x))]. The xr is
> accepted if 0 < y <= max(f(xr)), or rejected (and another attempt made) if
> y > max(f(xr)).
>
You can avoid rejection by constructing an alias table. A list can be
constructed such that each list element contains a pair of values and a
cutoff. e.g.
[("apple", 20), ("orange", 50), ("grape", 30)]
would become (using one particular algorithm)
[(("apple", "orange"), 0.6),
(("orange", "apple"), 1.0),
(("grape", "orange"), 0.9)]
Generate a random index, then select one of the values on the basis of
the cutoff. For short enough lists you can generate a single 0-1 random
variate, u, and use int(n*u) for the index and compare n*u - int(n*u) to
the cutoff, where n is the length of the list. It's still sampling with
replacement though.
Duncan
[toc] | [prev] | [standalone]
Back to top | Article view | comp.lang.python
csiph-web