In our article What is a large enough random sample? we pointed out that if you want to measure a proportion to an accuracy “a” with a chance of being wrong of “d”, then one idea is to guarantee you have a sample size of at least:

$$n \ge \frac{-\ln(d/2)}{2 a^2}$$

This is the central question in designing opinion polls or running A/B tests. This estimate comes from a quick application of Hoeffding’s inequality, and *because* it has a simple form it is possible to see both that accuracy is very expensive (to halve the size of the difference we are trying to measure we have to multiply the sample size by four) and that confidence is cheap (increases in the required confidence or significance of a result cost only moderately in sample size).
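To make the cost of accuracy versus confidence concrete, here is a small sketch using the estimate() function that is defined later in this article. The particular values of a and d are only illustrative choices.

```r
# Hoeffding-style sample size estimate (same definition as later in the article)
estimate <- function(a, d) { -log(d/2) / (2 * a^2) }

base <- estimate(0.1, 0.05)
halfAccuracy <- estimate(0.05, 0.05)   # halve the difference we want to measure
tighterConf <- estimate(0.1, 0.005)    # ten times smaller chance of being wrong

print(halfAccuracy / base)   # 4, up to floating point rounding
print(tighterConf / base)    # roughly 1.6
```

Halving a quadruples the estimate, while cutting d by a factor of ten raises it by well under a factor of two.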

However, for high-accuracy situations (when you are trying to measure two effects that are very close to each other) the estimate suggests a sample size that is larger than is strictly necessary (as we are using a bound, not an exact formula for the required sample size). As theorists or statisticians we like to err on the side of too large a sample (guaranteeing reliability), but somebody who is paying for each entry in a poll would want a smaller size.

This article shows a function that computes the exact size needed (using R).

The bound we gave in What is a large enough random sample? is correct: a sample of the stated size always at least achieves the desired accuracy and significance goals. Sample size and significance seem a bit abstract if you forget the underlying points: with too small a sample size you can’t state a conclusion with any confidence, and with too large a sample size you have spent too much money on your experiments. This is a central issue in measurement when you are measuring something serious (a clinical trial, an opinion poll or an A/B test) or even measuring something silly (the statistics of a game). In addition to knowing how to estimate the significance of an experiment after the fact, you need to know how to design an experiment to achieve significance; that is largely a matter of picking a big enough sample size, the subject of this article.

We could get a better bound on sample size by using a more detailed version of the Chernoff bound that better accounts for small sample sizes. Or, as we will do here, we can say bounds are only useful if they are simple (so they give us usable intuition) and move on to an exact calculation.

The exact sample size needed is determined by a simple use of the binomial distribution (used to calculate how often a sequence of coin flips exhibits a given range of averages). The R code to find the exact sample size is given in the function binomsize(a,d) below:

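    library('gtools')

    # upper bound on the needed sample size (from Hoeffding's inequality)
    estimate = function(a,d) { -log(d/2)/(2*a^2) }

    # probability that n fair coin flips show at most
    # floor(n/2)-floor(n*a) heads
    sig = function(a,n) {
      pbinom(floor(n*0.5)-floor(n*a),size=floor(n),prob=0.5)
    }

    # binary search for the sample size where sig(a,n) crosses d
    binomsize = function(a,d) {
      r=c(1,2*estimate(a,d))
      v=binsearch(function(n) {sig(a,n) - d},range=r,lower=min(r),upper=max(r))
      v$where[[length(v$where)]]
    }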

For example: binomsize(0.1,0.05) = 80 tells us that a sample size of 80 is enough to measure a difference in rates as small as 0.1 with a chance of mis-measurement of no more than 0.05. That is, if you want to measure the popularity of a single candidate to within ±10% with no more than a 0.05 chance of being wrong, you need a sample size of at least 80 respondents. In a poll of 80 people, if your candidate is marked as favorable by more than 60% of respondents, then with 19 chances out of 20 they are in fact the more popular candidate (assuming your sample of 80 was truly representative). On the other hand, estimate(0.1,0.05) is 184.4, which is more than twice the minimum necessary size (though it is safe to use).
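As a quick sanity check of these numbers, the sketch below uses only base R (no gtools). It repeats the definitions of estimate() and sig() so it can be run on its own, and simply evaluates them at the sample size of 80 quoted above.

```r
# Definitions repeated from the article so this snippet is self-contained
estimate <- function(a, d) { -log(d/2) / (2 * a^2) }
sig <- function(a, n) {
  pbinom(floor(n * 0.5) - floor(n * a), size = floor(n), prob = 0.5)
}

a <- 0.1
d <- 0.05
n <- 80

print(sig(a, n))            # tail probability at n = 80
print(sig(a, n) < d)        # TRUE if n = 80 meets the significance goal
print(estimate(a, d))       # the bound, about 184.4
print(estimate(a, d) / n)   # how many times larger the bound is than 80
```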

Our estimate was designed to always be at least the true value (so it is a valid bound), but it is often much larger than the needed value. We illustrate this with the commands below, which yield the plot that follows.

    library('ggplot2')
    library('reshape2')
    d = data.frame(accuracy=10^seq(from=-0.9,to=-7,by=-0.1))
    d$estimateSize = estimate(d$accuracy,0.05)
    d$binomialSize = sapply(d$accuracy,function(x){ binomsize(x,0.05)})
    dmelt = melt(d,id.vars=c('accuracy'))
    ggplot(data=dmelt,aes(x=1/accuracy,y=value,color=variable)) + geom_line()

The difference doesn’t look so bad if we plot on a log-log scale “on which anything looks like a straight line” (F. J. Yndurain, 1996):

    ggplot(data=dmelt,aes(x=1/accuracy,y=value,color=variable)) +
      geom_line() + scale_y_log10() + scale_x_log10()

In fact we can see the ratio of the estimate over the actual needed sample size is approaching e:

    ggplot(data=d,aes(x=1/accuracy,y=estimateSize/binomialSize)) +
      geom_line() + scale_x_log10()

So we can use the following formula as a “good rule of thumb” (but not as a bound, as it is never quite a large enough sample!): you should always have a sample size larger than:

$$\frac{-\ln(d/2)}{2 e a^2}$$

Our advice is: use our bound or the rule of thumb to plan. But when the time comes to run your test, quickly use the exact binomial calculation given above.
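A planning table along these lines might be sketched as follows. Note the assumption: we take the rule of thumb to be the bound divided by e, which is how we read the ratio plot above. The helper name ruleOfThumb() and the accuracies chosen are hypothetical, for illustration only.

```r
# Bound from the article
estimate <- function(a, d) { -log(d/2) / (2 * a^2) }

# ASSUMPTION: rule of thumb taken as the bound divided by e.
# It is a planning figure, not a guarantee.
ruleOfThumb <- function(a, d) { estimate(a, d) / exp(1) }

plan <- data.frame(accuracy = c(0.1, 0.05, 0.01))
plan$bound <- ceiling(estimate(plan$accuracy, 0.05))
plan$ruleOfThumb <- ceiling(ruleOfThumb(plan$accuracy, 0.05))
print(plan)
```

The bound column is the safe figure to budget for. The rule of thumb column is a rough lower marker, and the exact size should still be computed before running the test.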

For less conservative (more data efficient) sample sizes for rare events see our follow-up article: Sample size and power for rare events.


### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

Great post!

In gene expression studies, one is often interested in the ratio of expression of a gene in two groups R1..Rn and S1…Sn. Here, the values R/S are log-transformed and averaged over samples n prior to calculation of the ratio, so R = mean(log(R))/mean(log(S)). Would this have an impact in the calculation of the appropriate sample size by your formula?

Cheers,

Andrej

@anspiess Thanks! Well, the exact answer is to check the industry-agreed distributional assumptions and then search for the right point in the cumulative distribution function. Ratios likely bring you to wanting an F-distribution table. The formula I am using is most closely aligned with Normal and Binomial distributions.

Also, I forgot to mention that a lot of what we did could be done with the qbinom() call instead of the pbinom() call.
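One way this might look is sketched below, using base R only. The helper passesViaQuantile() is a hypothetical name introduced for illustration; it restates the significance check from sig() in terms of the binomial quantile function rather than the cumulative distribution function.

```r
# Tail probability used in the article (repeated so the snippet runs on its own)
sig <- function(a, n) {
  pbinom(floor(n * 0.5) - floor(n * a), size = floor(n), prob = 0.5)
}

# Hypothetical helper: the same check phrased with qbinom().
# qbinom(d, n, 0.5) is the smallest count k with P(X <= k) >= d, so the
# tail probability at our cutoff is below d exactly when the cutoff is
# strictly less than that quantile.
passesViaQuantile <- function(a, n, d) {
  cutoff <- floor(n * 0.5) - floor(n * a)
  cutoff < qbinom(d, size = floor(n), prob = 0.5)
}

print(sig(0.1, 80) < 0.05)                # check via pbinom()
print(passesViaQuantile(0.1, 80, 0.05))   # same check via qbinom()
```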

Very interesting post, and this is a useful topic for me.

Can someone explain, what to do with the sample size in the case of comparative phylogenetic analysis?

Let me explain.

Using Bayesian Inference I reconstruct phylogenetic relationships among 150 species. The result is a set of phylogenetic trees (about 30,000 trees, each with 150 tips). Each tree has its own branching (topology), and every branch has its posterior probability.

Next, I want to analyze the course of evolution (with respect to a particular trait). From the resulting 30k trees I should choose a random sample. This is called “taking phylogenetic uncertainty into account”. In other words, my trees are only hypotheses about evolutionary relationships, and I want to bring that into the analysis.

Is it possible to somehow modify your formula, or does this problem need deep mathematical exploration?

I understand that asking such a question is not quite correct, but the subject is very topical for me.

@Alice Thanks for your interest and comment!

Well, we can treat all of the tree manipulation and inference steps as a black box, and then what is left of the procedure looks like an empirical re-sampling or bootstrapping situation. At this point there are formulas and procedures for computing error bars and significances after the fact. The question related to this article would then be: is there a before-the-fact way to estimate how many re-samples you need to commission to achieve a given significance? The bound I gave is parametric or distribution based, so it depends on your acceptance procedures corresponding to a binomial distribution (or being well approximated by one). In particular you would need an “a” that behaves like a displacement along a binomial distribution to use the formula (the confidence “d” is easier to fit in than the “a”).

@jmount

Thank you very much, your answer is very helpful! I’ll try to discuss your recommendations with colleagues. Usually we choose sample size (number of trees) “taken from the sky”, trying to make it not too big and not too small. From the “biological” point of view it isn’t so awful as can seem at first sight. After all we know posterior probabilities of trees and we know a consensus tree (“compromise” among many possible trees). However, it isn’t right from the mathematical point of view. So, thank you once again.