With the well-deserved popularity of A/B testing, computer scientists are finally becoming practicing statisticians. One part of experiment design that has always been particularly hard to teach is how to pick the size of your sample. The two points that are hard to communicate are that:
- The required sample size is essentially independent of the total population size.
- The required sample size depends strongly on the strength of the effect you are trying to measure.
These things are only hard to explain because the literature is overly technical (too many buzzwords and too many irrelevant concerns), and these misapprehensions can’t be relieved unless you spend some time addressing the legitimate underlying concerns they are standing in for. As usual, explanation requires common ground (moving to shared assumptions), not mere technical bullying.
We will try to work through these assumptions and then discuss proper sample size.
The problem of population size.
Many technical people (including some good physical scientists) have a hard time understanding that to test the effect of a treatment on a population the sample size you need does not depend on the size of the overall population. You see vestiges of this discomfort when, for a treatment or an opinion poll, you see something like the percent of the population polled listed as an important feature. The math says this is wrong, but there is absolutely no point in trotting out the math until we move to common ground (or a shared set of assumptions). After all, math is just the pushing of axioms to their consequences, so if you don’t share the same axioms the math is irrelevant.
The statistical definition of a good sample from a population usually insists on independence or exchangeability. That is: if we draw a sample of 100 people from a population of 1,000,000, then each person in the population had an equal chance of being in the sample, independent of who else was in the sample or who else was left out of the sample. Or (in a slightly different form): if we had a good sample and then picked one of the sample members uniformly at random and exchanged them with one of the non-sample members (also picked uniformly at random), we end up with a different but equally good sample.
The ideal notion of a sample is what is at odds with experience. People know that their sampling procedure is often unlikely to rise to the standard demanded by statistics. A doctor knows that the patients who walk into the office are not an exchangeable sample of the total population (the patients tend to live near the doctor, may be grouped in age and share many other common traits). Political polls (which are typically done on land-line telephones) have been famously bad samples, from the “Dewey Defeats Truman” mis-prediction (when telephones were more concentrated among the rich) to the current blindness of land-line telephone polls to the younger cell-phone-only sub-population. Another example is small molecule drug discovery, where experiments are inevitably run where each molecule is related to previous experiments and earlier experiments have a strong bias towards cheaper reagent cost and low molecular weight.
These real world situations are non-uniform, non-stationary, not independent and non-exchangeable (to use the statistical terms). Since the usual statistical requirements are not met, it should not be surprising the desired statistical theories do not apply. This isn’t a paradox; it is just the real world failing to meet the desired assumptions. The intuition that if your sample is not drawn in a trustworthy way you must increase sample size to cover the entire population is essentially justified. But that misses the often cheaper alternative of fixing your sampling procedure so you don’t need to push to large sample sizes.
This is why statisticians are so concerned with experimental design and experimental procedure. If you can design your experiment so the standard statistical assumptions are met: you get the enormous benefits of the theory.
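One way to see that benefit in numbers: for a uniform random sample drawn without replacement, the standard error of a measured proportion carries a finite population correction that barely moves once the population is much larger than the sample. The sketch below assumes a true proportion of 0.5 and a sample of 1,000; the function name and the chosen population sizes are illustrative only.

```python
import math


def standard_error(p: float, n: int, population: int) -> float:
    """Standard error of a sample proportion for a simple random sample
    of size n drawn without replacement from a finite population."""
    # Finite population correction: close to 1 when population >> n.
    fpc = (population - n) / (population - 1)
    return math.sqrt(p * (1 - p) / n * fpc)


# Same sample size, very different population sizes.
for population in (10_000, 1_000_000, 100_000_000):
    print(population, round(standard_error(0.5, 1000, population), 5))
```

The printed standard errors are nearly identical for the two larger populations, which is the sense in which the required sample size is essentially independent of population size.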
The problem of effect strength.
The basic problem of effect strength is simple, so in this case part of the confusion is actually from not having thought hard about it. The (incorrect) common intuition is that the size of an experimental sample needed is independent of the type of effect you are trying to measure. You see this implicitly when studies are commissioned that have no hope of success. Typically this would be A/B testing a minor change on a web page on a small population (perhaps too small to even see the change) or even running an expensive clinical trial with too few patients. These recurring errors come from the mistaken intuition that the effect strength isn’t important in designing your experiment.
To try and set your intuition consider the extreme cases. Suppose you are trying to test if an insecticide is “very very very deadly” to fleas. If the insecticide is strong then the only confounding effect would be natural death of fleas. You need an experiment size big enough that it is unlikely most of your fleas died of natural causes. So maybe use 5 fleas instead of 1. The measurement is: dose the fleas, if they all die you can conclude the insecticide is very deadly. If only half die: the experiment fails (the insecticide is deadly, but not very very deadly). The key to good statistical experiment design is committing to what you are trying to test before starting the experiment. This is the important thing. After that you can improve your technique by adding such methods as “control groups” (groups of untreated fleas to try to get an estimate of the base mortality rate) and repetitions (trying to control for systematic error, all the fleas in one cage dying due to some other cause, like spilled cleaning solution).
Now consider the same experiment and suppose you were trying to establish the same insecticide “kills all fleas and their eggs.” The word “all” makes this too strong a statement to be proven by empirical experiment. The only way to prove a categorical absolute is through logic (not empiricism or statistics). This sort of proof looks something like:
- Major Premise: Sulfuric acid completely dissolves organic matter.
- Minor Premise: Fleas are organic matter.
- Conclusion: Sulfuric acid completely dissolves fleas.
That is, you don’t run an experiment to prove an absolute; you combine other things you already accept as absolutes to infer additional absolutes. (We dodge the question of how you acquire these initial absolutes; we just point out you can usually only create new absolutes by combining other absolutes.)
One last variation: suppose we are asked to prove the insecticide “kills at least 99.9% of all fleas.” This can be done by sampling, but it requires a big sample as we are trying to prove that very very few fleas survive (or in technical terms: trying to measure a rare event).
This is deliberately silly, but the point is: easy effects can use small samples, hard effects require big samples and absolutes can’t be measured by sampling.
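To put a rough number on "big sample" for a rare event, here is a minimal sketch. It asks how many independent trials are needed before you would expect to see at least one occurrence of an event with a given rate. The function name and the 95% confidence level are assumptions made for illustration, not something specified in the text.

```python
import math


def trials_to_see_event(rate: float, confidence: float) -> int:
    """Smallest n with P(at least one event in n independent trials) >= confidence,
    i.e. 1 - (1 - rate)**n >= confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - rate))


# A common event (half the fleas survive) versus a rare one (0.1% survive).
print(trials_to_see_event(0.5, 0.95))
print(trials_to_see_event(0.001, 0.95))
```

Merely spotting a single survivor when the survival rate is 0.1% already takes on the order of thousands of fleas; estimating that rate to useful precision takes more.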
There is a second, very subtle problem regarding effect strengths: they cost more than you would reasonably suppose. Roughly, this is because when you try to measure weaker (more subtle) effects you also increase your need for precision. That is, it might make sense to try to measure a disease that affects 10% of the population to +-5% absolute error (that is, accept any number from 5% through 15% as being a good enough estimate). But when you try to measure a disease that affects 1% of the population this wide interval (-4% to 6%) no longer makes sense. You would likely be required to use a +-0.5% interval. You would need 10 times as much data to see the same number of affected patients (1% versus 10% incidence rate) and on top of that you are now requiring a higher precision (+-0.5% instead of +-5%). You end up needing 100 times the data to get a similar confidence on the smaller measurement range (+-0.5%). In my opinion, this is always going to feel wrong: you expect to need 10 times the data but you actually need 100 times the data (due to your change in measurement range).
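As a rough check of this arithmetic, assume (as the rule of thumb in the next section does) that for a fixed confidence level the required sample size grows like 1/e^2, where e is the allowed absolute error. The labels n_1, e_1 (the +-5% study) and n_2, e_2 (the +-0.5% study) are introduced here only for illustration.

```latex
\frac{n_2}{n_1} = \left(\frac{e_1}{e_2}\right)^2 = \left(\frac{0.05}{0.005}\right)^2 = 100
```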
The rule of thumb.
The typical requirements of a random experiment are: you are given a very small number d>0 and a small number e>0 and you want a sample such that if p is the (unknown) true proportion of the world that has the property you are trying to measure then with probability no more than d your measured proportion q is bad (i.e. |p-q| > e). If somebody asks for d or e to be zero then you are pretty much forced to test the whole population (sampling won’t work). The larger d and e get the easier sampling becomes.
The rules to know in designing experiment sizes are:
- d is easy to lower (it doesn’t take much work to increase confidence).
- e is hard to lower (it takes a lot of work to increase precision).
A good formula is: if your sample is picked uniformly at random and is of size at least:
$$n \ge \frac{\ln(2/d)}{2 e^2}$$
then with probability at least 1-d your sample measured proportion q is no further away than e from the unknown true population proportion p (what you are trying to estimate). This formula is a technical consequence of Hoeffding’s inequality (it is a convenient form; you can get a slightly better bound by directly using the Chernoff bounds). The point is: you can plug this into a calculator. Need to estimate with 99% certainty the unknown true population rate to +-10% precision? That is d=0.01, e=0.1 and a sample of size 265 should be enough. A stricter precision of +-0.05 causes the estimated sample size to increase to 1060, but increasing the certainty requirement to 99.5% only increases the estimated sample size to 300. This is what we mean by confidence is cheap and precision is expensive.
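The calculator step can be sketched in a few lines. This assumes the two-sided Hoeffding form n >= ln(2/d) / (2 e^2), which reproduces the three sample sizes quoted above; the function name is made up for this example.

```python
import math


def hoeffding_sample_size(d: float, e: float) -> int:
    """Sample size n such that P(|p - q| > e) <= d under the two-sided
    Hoeffding bound, assuming a uniform independent sample."""
    return math.ceil(math.log(2 / d) / (2 * e**2))


print(hoeffding_sample_size(0.01, 0.1))    # 99% certainty, +-10% precision
print(hoeffding_sample_size(0.01, 0.05))   # halve e: about four times the data
print(hoeffding_sample_size(0.005, 0.1))   # halve d: only a modest increase
```

Because d enters through a logarithm and e through a square, tightening e dominates the cost.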
Notice the effect strength doesn’t directly enter into the formula- it only comes in the increase in precision needed to measure weak effects. This is why you want to avoid attempting to empirically measure weak effects (see Analytics Sabotage).
Also notice the faster than linear dependence on 1/e (a square in this case). This is very expensive (linear would be better) and unintuitive. The undesirable super-linear rate follows from the fact that if you flip a fair coin (50/50 odds of heads or tails and independent outcomes on each flip) n times, the observed number of heads tends to land on the order of sqrt(n) away from the expected value of n/2. To get a linear rate it would have to be usually within a constant of n/2 (independent of n!), which is too perfect and way too much to hope for (also relevant is the law of the iterated logarithm). This is “hand wavy” but it is intended to help you reset your intuition (intuition being a less precise, but less brittle way of reasoning than doing all the math).
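The sqrt(n) behavior can be checked exactly rather than by hand waving. The sketch below computes the expected absolute distance between the number of heads and n/2 for a fair coin, using exact integer arithmetic; the function name and the chosen values of n are illustrative.

```python
import math


def mean_abs_deviation(n: int) -> float:
    """Exact E|heads - n/2| for n independent fair coin flips."""
    # Sum C(n, k) * |2k - n| as an exact integer, then divide by 2^(n+1).
    total = sum(math.comb(n, k) * abs(2 * k - n) for k in range(n + 1))
    return total / 2 ** (n + 1)


for n in (100, 400, 1600):
    dev = mean_abs_deviation(n)
    # The last column stays roughly constant, so the deviation scales like sqrt(n).
    print(n, round(dev, 2), round(dev / math.sqrt(n), 3))
```

Each time n is multiplied by four the typical deviation only doubles, so the deviation as a fraction of n shrinks like 1/sqrt(n). That is why halving e costs about four times the data.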
Warning: this estimate tends to be high.
The above estimate was designed to be a reliable and simple upper bound in the realm of algorithm analysis and O() notation. It turns out it overestimates sample sizes by quite a lot, so beyond the intuition you want to use an actual exact estimate from a binomial distribution (which we will write about at a later date).
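To get a feel for how conservative the bound is, one can compute the exact chance of a bad estimate directly from the binomial distribution. This is only a sketch of that idea, not the exact estimate promised for the later article. The function name is hypothetical, and the true proportion p=0.5 is an assumed worst case.

```python
import math


def miss_probability(n: int, p: float, e: float) -> float:
    """Exact P(|q - p| > e) where q = k/n and k ~ Binomial(n, p).
    Intended for moderate n only: math.comb(n, k) must fit in a float."""
    total = 0.0
    for k in range(n + 1):
        # Small tolerance so values sitting exactly on the boundary count as good.
        if abs(k / n - p) > e + 1e-12:
            total += math.comb(n, k) * p**k * (1 - p) ** (n - k)
    return total


# The bound asked for n=265 to guarantee a miss probability of at most 0.01.
print(miss_probability(265, 0.5, 0.1))
```

The exact miss probability at n=265 comes out well below the d=0.01 that the bound guarantees, which is another way of saying a smaller sample would have met the same requirement.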
A very technical paper on sampling from very large populations.
A long time ago we took some of the ideas here to the extreme. We considered the situation of “stream processing”: that is, when you have so much data flowing by you that you do not even have enough storage to remember a significant fraction of it. This might arise in a networking application, a scientific instrument or a map-reduce node. The question is: can you still estimate rates in this situation? The answer is you can. The trick is that even though you don’t have enough memory to store a significant fraction of what you saw (or even enough memory to store your sample design!) you can store the specification of a pseudorandom sample. A pseudorandom sample is like a sample except it has a very succinct description (a description small enough to store). The idea is to store only the description of the pseudorandom sample (not the sample itself) and the intersection of this sample and the stream (or mark “failure” if this intersection gets too large).
There are some additional technical details in that you have to simultaneously work with many guesses of the unknown true population rate (because each sample design only works properly for a narrow range of this unknown rate as you either miss too much or run out of memory when you use the wrong design). But overall the ideas in this paper are just the type of bounds we discussed here (but in a very compressed form for publication).
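The flavor of the idea can be sketched as follows. This is a loose illustration of the description above and not the construction from the paper. It assumes the stream is a sequence of booleans, that sample membership is decided by a keyed hash of the stream position, and that the "guesses" are a handful of candidate sampling rates. All names here are hypothetical.

```python
import hashlib


def in_sample(position: int, seed: int, rate: float) -> bool:
    """Membership in the pseudorandom sample, computed from the position alone.
    Only (seed, rate) need to be stored, never the sample design itself."""
    digest = hashlib.blake2b(
        position.to_bytes(8, "big"), digest_size=8, key=seed.to_bytes(8, "big")
    ).digest()
    return int.from_bytes(digest, "big") < rate * 2**64


def estimate_rate(stream, seed: int, rates, capacity: int):
    """Run several sample designs at once; return the estimate from the
    densest design that stayed within the memory capacity, or None."""
    kept = {rate: [] for rate in rates}  # intersection of each sample with the stream
    failed = set()
    for position, has_property in enumerate(stream):
        for rate in rates:
            if rate in failed or not in_sample(position, seed, rate):
                continue
            if len(kept[rate]) >= capacity:
                failed.add(rate)   # mark "failure": intersection got too large
                kept[rate] = []    # release the memory
            else:
                kept[rate].append(has_property)
    usable = [rate for rate in rates if rate not in failed and kept[rate]]
    if not usable:
        return None
    best = kept[max(usable)]
    return sum(best) / len(best)


# Assumed toy stream: every tenth item has the property (true rate 0.1).
stream = [i % 10 == 0 for i in range(20000)]
print(estimate_rate(stream, seed=42, rates=[0.5, 0.05, 0.005], capacity=2000))
```

Here the densest design overflows the capacity and is abandoned, while a sparser one survives and supplies the estimate. That mirrors the remark that each design only works for a narrow range of conditions.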
This sort of paper is what we theorists call “fun” (or really just sharpening our knives between real world problems):
Estimating the range of a function in an online setting, J. A. Mount, Information Processing Letters 72, 31–35 (1999).
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.