To implement many numeric simulations you need a sophisticated source of instances of random variables. The question is: how do you generate them?
The literature is full of algorithms requiring random samples as inputs or drivers (conditional random fields, Bayesian network models, particle filters and so on). The literature is also full of competing methods (pseudorandom generators, entropy sources, Gibbs samplers, Metropolis–Hastings algorithm, Markov chain Monte Carlo methods, bootstrap methods and so on). Our thesis is: this diversity is supported by only a few fundamental methods. And you are much better off thinking in terms of a few deliberately simple composable mechanisms than you would be in relying on some hugely complicated black box “brand name” technique.
We will discuss the half dozen basic methods that all of these techniques are derived from.To our mind all of the famous random variate generation/sampling techniques are derived from combinations of the following six fundamental methods:
- Physical sources.
- Empirical resampling.
- Pseudo random generators.
- Rejection Sampling.
- Transform methods.
The technical fights (such as: “is Gibbs sampling superior to, or even distinguishable from, Markov chain Monte Carlo?”) are all in the details, history and citation conventions. Each field and particular method accretes its own traditions. We will quickly discuss the fundamental methods we listed. As we will see: complexity goes up as we move through the list (so at some point things are no longer fundamental but instead derived, allowing us to end the list).
This is the most basic way (though not as practical in the computer age) to generate random variables. Observe the flip of a real coin, shuffle actual cards, mix numbered balls or count the number of ticks from an actual radioactive source. In all of these the randomness comes from physical principles (such chaotic dynamics for coin flips or quantum mechanics for radioactive decay).
These sources are “outside of computer science” so we will say the least about them.
This is what used to be called “tables” (which were themselves often generated from physical processes). The observation is: that sometimes
to run a simulation you need access to instances of random variables that are distributed in a very precise way- but you don’t have a usable description of the desired distribution. You would think that in this case you could do nothing. But the principle of empirical resampling is that you can approximately generate new samples by taking samples (with repetition or replacement) from an old sample. This is the cornerstone of Bootstrap methods.
As an example: suppose we were given the sample of numbers 5, 5, 10, 5, 5 which has mean equal to 6. Further suppose we have no
description of how these number were generated but we wanted to know if a mean of at least 8 is likely or unlikely for five more numbers drawn the same way. We can approximate this by drawing many samples of size five from this original sample (allow the same number to be in our new
sample multiple times) and get the bootstrap estimate of the probability of seeing mean of at least 8 as having a probability around 0.6%.
This may seem trivial- but it is very important.
Pseudo random generators
In the computer age, to avoid need for external tables or expensive and slow peripherals we tend to use pseudo random generators. That is the output of deterministic iterative procedures as equivalent to true random sources. The science of pseudo randomness has evolved from cobbled together procedures passing ad-hoc tests (such as in Knuth Volume 2) to more formal pseudo randomness based on important properties (like provably being k-wise independent) or complexity (being computationally indistinguishable from a truly random on a time or space bounded machine). Behind the canned routines of all of the basic “random generators” commonly available is a pseudo random source.
Good references for the modern theory include:
- “Pseudorandomness and Cryptographic Applications” Michael Luby 1996.
- “Modern Cryptography, Probabilistic Proofs and Pseudorandomness” Oded Goldreich, 1999.
The most basic form of a sequential pseudo random generator is a sequence of states s(1), s(2), s(3) … . Where s(i+1) = g(s(i)) where g() is our deterministic function that maps state to state. The observed random variables are then h(s(i)) where h() is some deterministic function maps state to observables. For example for the linear congruential generator found in glibc we have g(x) = (1103515245*x + 12345) modulo 2^32 and h(x) = x modulo 2^30 (x an integer from 0 to 2^32 – 1). An example application: this generator when divided by (2^30 – 1) might return numbers passably uniformly distributed in the interval [0,1]. Two such variates might be uses as a uniform sample from the unit square.
That a simple iterated deterministic system (like the modulo arithmetic or even a physical system like coin flipping) would even superficially appear random (let alone be safe to use as pseudo random source) turns out to be the main consequence of Ergodic theory (which we will touch on in a later article). The point is: it should not be obvious (without bringing in some more theory) why you should trust pseudo-random sources.
Another fundamental method is direct simulation or game play. If we wanted a random variable that was 1 with probability equal to the odds of being dealt a full house from a standard shuffled deck of 52 cards (and zero otherwise). We can generate such a variable by simulating shuffling a deck, drawing a hand and returning 1 if the hand draw is a full house (and returning 0 otherwise). Notice in this case we are combining many random variables to get a single result.
One of the most important simulation techniques is Markov chain Monte Carlo methods (related to Gibbs sampling, simulated annealing and many other variations). These method implement a complex procedure over a stream of random inputs to generate a more difficult to achieve sequence of random outputs.
For example: Let T be the set of pairs of non-negative integers x, y such that x + y ≤ 1000. We could implement a Markov chain on this set from a source of coin flips. Given a point (x,y) in T we take three coin flips and move to new point (x’,y’) (also in T) using the following procedure:
- Let m = 1 if the first flip is heads and m=0 if the first flip is tails.
- Let v = (1,0) if the second flip is heads and v=(0,1) if the second flip is tails.
- Let d = +1 if the third flip is heads and d = -1 if the third flip is tails.
- If (x,y) + m*d*v is in T let (x’,y’) = (x,y) + m*d*v, otherwise let (x’,y’) = (x,y) (stay put).
Repeating this procedure a large number of times produces a sequence of points (x,y) such that (x,y) is distributed uniformly on S (again this follows from ergodic principles). The correctness of this simulation of or game of following a Markov chain is a very fundamental method in generating more complicated random variates and something we will write more about in an article dealing with the ergodic principle (the relation of connectedness to showing averages over time equal averages over space).
For simple shapes (rectangle, triangles) there are more efficient ways to generate points uniformly at random. For squares we exploit independence and just generate the coordinates independently. For triangles we could rejection sample from a bounding rectangle. Or we could use a tranform method: write down a counting function that indexes all the points in the triangle and generate points by index (for example it is easy to work out there are 501501 points in our example S so if we generate a random integer uniformly from 1 to 501501 can just pick the point with given index as our sample).
For general convex shapes (in high dimensions) these methods become intractible and Markov chain methods are one of the few options remaining.
Rejection sampling is another way to convert one sequence of random variables into another. If we assume we can generate a random variable according to the distribution p(x) we can “rejection sample” to a new distribution using an “acceptance function” q(x) which returns a number in the interval [0,1]. Our procedure is to
repeat the following: generate x with probability p(x), generate a random variable y with uniformly in the interval [0,1] if y ≤ q(x) accept x as
our answer and quit (otherwise draw a new x and repeat).
When the distribution that rejection sampling draws with is such that if x and y had a ratio of being drawn of p(x)/p(y) then under the rejection procedure they have relative odds of (p(x)q(x))/(p(y)q(y)). An important special case is when q() is always 0 or 1, in this case we are drawing with relative odds proportional to p(x) from the subset of x with q(x)=1.
As an example: consider the problem of trying to draw a point (x,y) such that x^2 + y^x < 1 (the open unit disk) uniformly at random. The rejection sampling solution is: repeat the following until you have a success: generate x and y independently uniformly in the interval [-1,1], if x^2 + y^2 < then 1 accept them as our sample (otherwise repeat). This procedure is very fast as the unit disk that represents our acceptance region has area pi and the square we are generating trials from has area 4: so we over a 78% chance of success on each trial or expect to only have to run fewer that 1.28 trials (on average) to get a sample.
A transform method is used when we have the ability to generate instances of a random variable according to one distribution and we would like instances according to another distribution.
One method is used when we have access to the inverse of the cumulative distribution function of the distribution we are trying to generate. In this case we can use this function to convert uniform variants from the interval [0,1] into our target distribution. The commutative distribution function is the function cdf() where cdf(x) is the probability a random variate generated according to our distribution is less than or equal to x. The inverse function function icdf() where icdf(y) is such that cdf(icdf(y)) = y. For example the exponential distribution has an inverse cumulative distribution function icdf(y) = -ln(1-y)/lamda . So if y is
generated uniformly in the interval [0,1] then icdf(y) is a random variable generated according to the exponential distribution with parameter lambda.
A great example of transform methods is generating Gaussian random variables. We could directly use the inverse cumulative distribution function method described above- but to do this we would require a special function library to perform the required calculation of the inverse cummulative distribution (or inverse of erf()). Another way is the polar method: generate x,y uniformly from the open unit disk (by, for example rejection sampling as described earlier), set s = x^2 + y^2 and return x*sqrt(-2 ln(s)/s), y*sqrt(-2 ln(s)/s) as two independent Gaussian random variables. The trick being: the distribution function of r = sqrt(s) is of the form r*e^(-r*r/2) which leads to an elementary cumulative distribution function (unlike the original Gaussian density of the form e^(-r*r/2)) that is easy to invert.
Our thesis is: all major methods to generate random variables use aspects of the six methods we have listed here as fundamental. Or you should at least have a fluid understanding of at least these methods. You should be able to break down big “brand name” methods (like Gibbs sampling) roughly into their constituent parts (so you can reason about them). One example: notice how ratios of probabilities enter into Markov chain Monte Carlo methods (they cause step rejections); from this you can reason if your problem has bounded ratios it is a good candidate for direct application of the technique (and if it does not you need to add some more ideas, as was demonstrated in: “Sampling from Log-Concave Distributions,” Alan Frieze , Ravi Kannan , Nick Polson, Ann. Appl. Prob, 1994 ).
The first two methods we discuss (physical sources and empirical re-sampling) are of the class of solutions “already have the right answer.” Pseudo random generators are the primary way to negate the need for physical sources and resampling techniques. Simulation, rejection sampling and transform methods are the main tools for building new distributions out of old.
It is a matter of taste if a given trick fits into this ad-hoc taxonomy or not. You can invent new and better generation methods- but these methods are easily derived using ideas from the fundamental methods we mentioned here.
Categories: Computer Science Mathematics Tutorials
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
Getting some interesting feedback from good friends on this article. I wish I had more time to spend on:
* Pseudo physical systems like low digits from stock prices.
* Complexity and cryptography.
* Bias removal (see: “Fast Simulation of New Coins From Old”, Serban Nacu and Yuval Peres, The Annals of Applied Probability, 2005 vol. 15 (1A) pp. 93-115).
* Information theory issues (how many bits you need to generate a structure, versus how long you need to run to be correct): “How to recycle random bits”, Russell Impagliazzo and David Zuckerman, FOUNDATIONS OF COMPUTER SCIENCE, 1989 vol. 30).
* Expander graphs and quasi random structures.
Instead we are promising only a future write up on just one essential issue: how deterministic systems can even appear to be at all random (e.g. ergodic theory).