Generating random distributions

By bmerry
TopCoder Member

One of the features that I have found interesting about the Marathon Match format is that the data is hidden, yet you have a pretty good idea of what it is going to look like. This is because all the cases are randomly generated, and the problem always contains a detailed specification of the generation method. While there is sometimes a visualization tool that can generate test cases for you, other times you'll have to do it yourself. In this tutorial, we'll look at a few probability distributions and how to generate values based on them.

I'll be discussing the issues in terms of C/C++ (my language of choice). Accordingly, some of the initial, low-level stuff won't be applicable to other languages that have more user-friendly random number libraries. However, the later information will be applicable to all.

Uniform scalar and discrete distributions
Let's start with something simple: generating a single variable. The basic building block is the uniform discrete distribution, meaning that each of N possibilities is equally likely. The approach that is often used when you want an arbitrary number and aren't worried about things being exactly uniform is `rand() % N`, but there are a few problems with this. Firstly, some random number generators are badly designed, with the low-order bits repeating quite quickly. To be safe, I'd recommend that you use something like:

`(int) (N * (rand() / (RAND_MAX + 1.0)))`

This will first obtain a real number in the range [0, 1), then scale and truncate it. Another problem is that unless N is a factor of `RAND_MAX + 1`, either method will generate a small amount of bias, particularly if N is quite large. Think about trying to make a choice among 5 options using a 6-sided die: there is no way to map the 6 die rolls to the 5 options in an unbiased way. The solution is the metaphorical equivalent of rolling the die again if a 6 comes up: we identify a useful range of random outputs (a multiple of N), and if anything else comes up, we discard it. The simplest solution is to reject any `rand()` that is at least `RAND_MAX - RAND_MAX % N`. This uses as much of the range as possible, and since this is always at least half of it, the average number of calls to `rand()` is at most 2 and usually far less.
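The rejection scheme above can be wrapped in a small helper. This is a sketch, and the function name is mine, not from any library:

```cpp
#include <cstdlib>

// Uniform random integer in [0, n), using rejection to avoid modulo bias.
// rand() results of at least RAND_MAX - RAND_MAX % n are discarded, so the
// remaining range [0, RAND_MAX - RAND_MAX % n) is an exact multiple of n
// and r % n is unbiased. Assumes 0 < n <= RAND_MAX.
int uniform_int(int n) {
    int limit = RAND_MAX - RAND_MAX % n;   // a multiple of n
    int r;
    do {
        r = rand();
    } while (r >= limit);
    return r % n;
}
```

Since the accepted range is always at least half of `rand()`'s output range, the loop terminates quickly in practice.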

If your favourite language doesn't provide you with a way to generate a random real number in the range [0, 1), you can generate one as `rand() / (RAND_MAX + 1.0)`. This is not perfect, because it will only generate one of typically 2^31 values, but for most purposes that is good enough.

If I'm writing any kind of data generator, I start by writing utility functions to implement these algorithms, so that I don't need to worry about the details when I'm doing the interesting parts.

Non-uniform discrete distributions
Let's say that you have a loaded die, with specified probabilities for each outcome. This can be simulated by mapping each outcome to a suitable range of the [0, 1) interval, then generating a uniform random number. For example, suppose a die has probabilities of 0.1, 0.2, 0.1, 0.3, 0.15 and 0.15 of coming up 1, 2, 3, 4, 5 or 6. We can map these events to the ranges [0, 0.1), [0.1, 0.3), [0.3, 0.4), [0.4, 0.7), [0.7, 0.85) and [0.85, 1.0). We then generate a random number in [0, 1) with a uniform (unbiased) distribution, and check which range it lies in. If there are many possible events, a binary search can accelerate this lookup.
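A sketch of this range-mapping technique, using a binary search over the cumulative sums (the function name and interface are my own):

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// Sample an outcome index from a non-uniform discrete distribution,
// given per-outcome probabilities assumed to sum to 1. Each outcome i
// is mapped to the subinterval [cumulative[i-1], cumulative[i]) of
// [0, 1); a uniform random number is generated and its subinterval is
// located by binary search.
int sample_discrete(const std::vector<double> &prob) {
    std::vector<double> cumulative(prob.size());
    double total = 0.0;
    for (std::size_t i = 0; i < prob.size(); i++) {
        total += prob[i];
        cumulative[i] = total;
    }
    double r = rand() / (RAND_MAX + 1.0);
    // First index whose cumulative sum exceeds r.
    int idx = std::upper_bound(cumulative.begin(), cumulative.end(), r)
              - cumulative.begin();
    // Guard against floating-point rounding leaving the sums just under 1.
    if (idx == (int) cumulative.size()) idx--;
    return idx;
}
```

For repeated sampling from the same distribution, the cumulative array would of course be precomputed once rather than rebuilt on every call.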

In some cases, the probabilities will change at run-time. A real-world instance of this is in streaming data compression, where the probability of seeing a piece of text is updated depending on the number of times it has already been seen. A binary indexed tree is ideal for this type of problem.
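As a sketch of why the binary indexed tree fits (the class and its interface below are my own, not a standard library), storing integer counts in a Fenwick tree lets you both update a weight and draw a sample in O(log n):

```cpp
#include <cstdlib>
#include <vector>

// Sampling from a discrete distribution whose weights change at
// run-time. The tree stores cumulative counts; sampling walks down
// the implicit binary structure to find which outcome's subrange
// contains a uniformly chosen target. Both operations are O(log n).
class FenwickSampler {
    std::vector<long long> tree;   // 1-based binary indexed tree of counts
    long long total;

public:
    explicit FenwickSampler(int n) : tree(n + 1, 0), total(0) {}

    // Add delta to the weight of outcome i (0-based).
    void update(int i, long long delta) {
        total += delta;
        for (int j = i + 1; j < (int) tree.size(); j += j & -j)
            tree[j] += delta;
    }

    // Draw an outcome with probability proportional to its weight.
    int sample() const {
        // target is uniform in [0, total)
        long long target = (long long) (total * (rand() / (RAND_MAX + 1.0)));
        int pos = 0;
        int step = 1;
        while (2 * step < (int) tree.size()) step *= 2;
        // Find the largest pos whose prefix sum is <= target; that
        // prefix boundary identifies the sampled outcome.
        for (; step > 0; step /= 2) {
            if (pos + step < (int) tree.size() && tree[pos + step] <= target) {
                pos += step;
                target -= tree[pos];
            }
        }
        return pos;   // 0-based outcome index
    }
};
```

This is exactly the adaptive-frequency structure used in streaming compressors: each time a symbol is seen, `update` bumps its count, and future samples (or probability lookups) reflect the new counts immediately.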

Non-uniform continuous distributions
The idea of the previous section can be generalized to continuous distributions. As an example, let's suppose that some number lies in the range [0, 1), but the probability density at x is directly proportional to x (so values close to 1 are more likely). First, let's find out exactly what the probability density function is: it will have the form p(x) = ax, but we need to choose a so that the integral over [0, 1) is 1. A little calculus shows that a = 2. Next, we want the cumulative distribution function, which is the integral of the density from the bottom of the range up to x; in this case c(x) = x^2. To generate a random number with a given distribution, the procedure is to start with a uniform random number r in the range [0, 1), then pass it through the inverse of the cumulative distribution function. In this case, we just take sqrt(r).
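The worked example above comes out to just a couple of lines (the function name is mine):

```cpp
#include <cmath>
#include <cstdlib>

// Sample from the density p(x) = 2x on [0, 1) by inversion: pass a
// uniform random number through the inverse of the cumulative
// distribution c(x) = x^2, i.e. c^{-1}(r) = sqrt(r).
double sample_linear_density() {
    double r = rand() / (RAND_MAX + 1.0);
    return std::sqrt(r);
}
```

As a sanity check, the mean of this distribution is the integral of x · 2x over [0, 1), which is 2/3, so a large batch of samples should average close to that.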

This works when the cumulative distribution function is both known and reasonably easy to invert (although binary search can always be used for the inversion). Unfortunately, the most important distribution, the Gaussian or normal distribution, has no closed-form formula for its cumulative distribution function. Curiously, it is easier to generate two normally distributed random numbers at once. The following algorithm will do this. I won't provide a proof, but if you want to try to prove it yourself, think about integrating the bivariate Gaussian distribution in polar form.

1. Pick two random numbers x and y uniformly from the interval [-1, 1].
2. Let r^2 = x^2 + y^2.
3. If r^2 > 1 or r^2 = 0, go back to step 1.
4. Let u = sqrt(-2 log(r^2) / r^2) · x.
5. Let v = sqrt(-2 log(r^2) / r^2) · y.
6. u and v both have the standard normal distribution.

Although you get two random numbers out, I usually discard one rather than try to keep it around for the next time I need a random number. The numbers also have the standard normal distribution, which has a mean of 0 and a standard deviation of 1. If you need a different mean and/or standard deviation, it is sufficient to scale and bias the number obtained.
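A sketch of the whole procedure, discarding the second value and applying the scale and bias at the end (the function name is mine; note that r^2 = 0 is rejected so the logarithm stays finite):

```cpp
#include <cmath>
#include <cstdlib>

// Normally distributed random number with the given mean and standard
// deviation, via the rejection-based polar algorithm described above.
// Two standard normal values are produced; one is simply discarded.
double gaussian(double mean, double stddev) {
    double x, y, r2;
    do {
        // x and y uniform in [-1, 1]
        x = 2.0 * (rand() / (RAND_MAX + 1.0)) - 1.0;
        y = 2.0 * (rand() / (RAND_MAX + 1.0)) - 1.0;
        r2 = x * x + y * y;
    } while (r2 > 1.0 || r2 == 0.0);
    double u = std::sqrt(-2.0 * std::log(r2) / r2) * x;   // standard normal
    return mean + stddev * u;   // scale and bias
}
```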

Coming soon
So far we've seen how to generate random distributions with a single variable. In part 2, we'll look at generating random collections of elements, such as permutations, combinations and subsets.