Wednesday 31 December 2014

selection with and without replacement

Selection without replacement
Often, we want to selects K items at random from a pool of N items, without replacement.  That means we select the first item at random, remove it from the pool, put it aside, then take out the second from the remaining N–1 items, and so on.  This is different from selection with replacement, where we do not remove the selected items from the pool, so we are always selecting from N items, and may select the same item more than once.

An example of selection with replacement is forming a random string by selecting from a bag containing several balls, each marked with a unique character of the relevant alphabet.  The first character in the string is that on the first ball drawn.  The ball is replaced, the bag shaken, and a second ball drawn to produce the second character, and so on.  Clearly the same ball may be drawn more than once in this process, so characters can repeat.  If instead we drew without replacement, we would instead get a string where each character appeared at most once (since each ball in the bag has a unique character), and the string length would be limited by the number of balls in the bag.

In Python, we cam implement choice with replacement readily, using the random module.
# choose K from N, with replacement
import random

K = 6
N = 16
bag = range(0,N-1)

draw = []
for i in range(0,K):
    draw += [random.choice(bag)]
    
print draw 
Running this gives results like
[6, 2, 6, 1, 5, 5]
and
[1, 0, 0, 4, 9, 9]
In fact, choosing several items without replacement is so common, the numpy module extends choice() to take an extra argument
# choose K from N, with replacement
import numpy

K = 6
N = 16
bag = range(0,N-1)

draw = numpy.random.choice(bag,K)
print draw 
Running this gives results like
array([12,  1, 11,  4, 10, 10])

How to choose without replacement?

We could keep choosing with replacement, checking that we have no duplicates, until we have enough unique choices.  But this is very inefficient, particularly when K is close to N, as finding that last unique choice is unlikely! It also doesn’t extend to the case where there are duplicate items in the bag, as it is not possible to tell whether a drawn item has been drawn before, or is merely the same as one that has been drawn before.

We could explicitly remove the item from the bag each time. In Python, we can implement choice with replacement, choosing at random as before, but now explicitly removing the item from the bag, too.
# choose K from N, without replacement
import random

K = 6
N = 16
bag = range(0,N-1)

draw = []
for i in range(0,6):
    item = random.choice(bag)
    draw += [item]
    bag.remove(item)
    
print draw, bag
Running this gives results like
[3, 13, 7, 9, 11, 2] [0, 1, 4, 5, 6, 8, 10, 12, 14]
and
[4, 3, 5, 14, 11, 9] [0, 1, 2, 6, 7, 8, 10, 12, 13]
There are no duplicates in the list, and the chosen items are removed from the bag.

Iterations usually feel non-Pythonic, unless over something necessarily temporal.  Here the iterative solution arises from the temporal nature of the description: draw a ball, then another, then another.  But is there another way: can I just draw all K items all together in a single handful? Can we remove the iteration?

To draw a handful, it seems I need to make several random choices from my neatly ordered list in the bag.  But actually, the point of a bag is that all the items are jumbled up.  This is the clue: jumble up the list, then take a handful from that jumble.

So, we can randomly permute (jumble) the items in the bag, and then take the first K elements (handful) of that permutation:
# choose K from N, without replacement
import numpy

K = 6
N = 16
bag = range(0,N-1)

draw = numpy.random.permutation(bag)[0:K].tolist()
print draw 

This gives us what we want. (The tolist() is there to convert the numpy result from an array back to a list, and is needed only if you want a list rather than an array).  Different runs give different permutations, and hence different handfuls (initial slices).  I used this technique to choose the K random input nodes to each of the N nodes in my random boolean network example (although it is hidden deep in the calculation of Con).

The bag doesn’t have to be a simple list of numbers: it can be a list of anything we want to choose from.  Let’s say we have an N×N grid, and want to choose K distinct positions, that is, K coordinates without replacement. We can create a bag of coordinates (here using a list comprehension), permute them, and slice the front off, thus:
# choose K from N, without replacement
import numpy

K = 6
N = 4
bag = [ [i,j] for i in range(0,N)  for j in range(0,N) ]
print bag

draw = numpy.random.permutation(bag)[0:K].tolist()
print draw 
Running this gives results like
[[0, 0], [0, 1], [0, 2], [0, 3], [1, 0], [1, 1], [1, 2], [1, 3], [2, 0], [2, 1], [2, 2], [2, 3], [3, 0], [3, 1], [3, 2], [3, 3]]
[[0, 3], [1, 3], [3, 2], [2, 3], [3, 3], [3, 1]]
and there are our random positions!  That feels more Pythonic to me.

No comments:

Post a Comment