Groups > comp.lang.python > #36623 > unrolled thread

Multiple disjoint sample sets?

Started by	Roy Smith <roy@panix.com>
First post	2013-01-11 09:15 -0500
Last post	2013-01-13 11:16 +0100
Articles	4 — 4 participants

Back to article view | Back to comp.lang.python

  Multiple disjoint sample sets? Roy Smith <roy@panix.com> - 2013-01-11 09:15 -0500
    Re: Multiple disjoint sample sets? MRAB <python@mrabarnett.plus.com> - 2013-01-11 14:36 +0000
    Re: Multiple disjoint sample sets? Dave Angel <d@davea.name> - 2013-01-11 10:14 -0500
    Re: Multiple disjoint sample sets? Peter Otten <__peter__@web.de> - 2013-01-13 11:16 +0100

#36623 — Multiple disjoint sample sets?

From	Roy Smith <roy@panix.com>
Date	2013-01-11 09:15 -0500
Subject	Multiple disjoint sample sets?
Message-ID	<roy-7E69C0.09152911012013@news.panix.com>

I have a list of items.  I need to generate n samples of k unique items 
each.  I not only want each sample set to have no repeats, but I also 
want to make sure the sets are disjoint (i.e. no item repeated between 
sets).

random.sample(items, k) will satisfy the first constraint, but not the 
second.  Should I just do random.sample(items, k*n), and then split the 
resulting big list into n pieces?  Or is there some more efficient way?

Typical values:

len(items) = 5,000,000
n = 10
k = 100,000

[toc] | [next] | [standalone]

#36625

From	MRAB <python@mrabarnett.plus.com>
Date	2013-01-11 14:36 +0000
Message-ID	<mailman.400.1357915022.2939.python-list@python.org>
In reply to	#36623

On 2013-01-11 14:15, Roy Smith wrote:
> I have a list of items.  I need to generate n samples of k unique items
> each.  I not only want each sample set to have no repeats, but I also
> want to make sure the sets are disjoint (i.e. no item repeated between
> sets).
>
> random.sample(items, k) will satisfy the first constraint, but not the
> second.  Should I just do random.sample(items, k*n), and then split the
> resulting big list into n pieces?  Or is there some more efficient way?
>
> Typical values:
>
> len(items) = 5,000,000
> n = 10
> k = 100,000
>
I don't know how efficient it would be, but couldn't you shuffle the
list and then use slicing to get the samples?

[toc] | [prev] | [next] | [standalone]

#36627

From	Dave Angel <d@davea.name>
Date	2013-01-11 10:14 -0500
Message-ID	<mailman.402.1357917280.2939.python-list@python.org>
In reply to	#36623

On 01/11/2013 09:36 AM, MRAB wrote:
> On 2013-01-11 14:15, Roy Smith wrote:
>> I have a list of items.  I need to generate n samples of k unique items
>> each.  I not only want each sample set to have no repeats, but I also
>> want to make sure the sets are disjoint (i.e. no item repeated between
>> sets).
>>
>> random.sample(items, k) will satisfy the first constraint, but not the
>> second.  Should I just do random.sample(items, k*n), and then split the
>> resulting big list into n pieces?  Or is there some more efficient way?
>>
>> Typical values:
>>
>> len(items) = 5,000,000
>> n = 10
>> k = 100,000
>>
> I don't know how efficient it would be, but couldn't you shuffle the
> list and then use slicing to get the samples?

I like that answer best, but just to offer another choice...

You start with a (presumably unique) list of items.  After collecting
your first sample, you could subtract them from the list, and use the
smaller list for the next sample.

One way is to convert list to set, subtract, then convert back to list.



-- 

DaveA

[toc] | [prev] | [next] | [standalone]

#36728

From	Peter Otten <__peter__@web.de>
Date	2013-01-13 11:16 +0100
Message-ID	<mailman.464.1358072165.2939.python-list@python.org>
In reply to	#36623

Roy Smith wrote:

> I have a list of items.  I need to generate n samples of k unique items
> each.  I not only want each sample set to have no repeats, but I also
> want to make sure the sets are disjoint (i.e. no item repeated between
> sets).
> 
> random.sample(items, k) will satisfy the first constraint, but not the
> second.  Should I just do random.sample(items, k*n), and then split the
> resulting big list into n pieces?  Or is there some more efficient way?
> 
> Typical values:
> 
> len(items) = 5,000,000
> n = 10
> k = 100,000

I would expect that your simple approach is more efficient than shuffling 
the whole list. 

Assuming there is a sample_iter(population) that generates unique items from 
the population (which has no repetitions itself) you can create the samples 
with

g = sample_iter(items)
samples = [list(itertools.islice(g, k) for _ in xrange(n)]

My ideas for such a sample_iter():

def sample_iter_mark(items):
    n = len(items)
    while True:
        i = int(random()*n)
        v = items[i]
        if v is not None:
            yield v
            items[i] = None

This is destructive and will degrade badly as the number of None items 
increases. For your typical values it seems to be OK though. You can make 
this non-destructive by adding a bit array or a set (random.Random.sample() 
has code that uses a set) to keep track of the seen items.

Another sample_iter() (which is also part of the random.Random.sample() 
implementation):

def sample_iter_replace(items):
    n = len(items)
    for k in xrange(n):
        i = int(random()*(n-k))
        yield items[i]
        items[i] = items[n-k-1]

You can micro-optimise that a bit to avoid the index calculation. Also, 
instead of overwriting items you could swap them, so that no values would be 
lost, only their initial order.

[toc] | [prev] | [standalone]

csiph-web

Multiple disjoint sample sets?

Contents

#36623 — Multiple disjoint sample sets?

#36625

#36627

#36728