A common question in analytics, statistics, and data science projects is: how much data do you need?
This question actually has very specific and clear answers!
A first good answer is “it is good to have a lot.”
Let’s dig deeper and get some additional more detailed quantitative answers.
We will phrase the problem in terms of a measurement of error. For example, let’s try to achieve a given square error in a regression problem. That is, how hard it it to estimate a number given a sample from a larger population?
A simplified version of our modeling problems is the following. There is an unobserved population of real values
j is an integer ranging from
p. We have access to a sample
i is an integer ranging from
yi was generated as a copy of
vj for an independent uniform random draw of
j from the integers
p, with re-use or replacement allowed. We want to estimate the average value of the
vs from our
ys. This is the usual formulation of working out an estimate on training data, and asking how well the estimate will work on future data drawn from the same population.
We are interested in how far off the visible sample estimate
sample_estimate := (1/m) sumi=1...m yi is from the unobserved true population mean
true_value := (1/p) sumj=1...p vj. We will quantify this as “square error” or
(true_value - estimate)^2. Square error is an example of a loss or criticism: smaller is better, and in this case zero represents perfection.
We are going to assume we know how much error we can tolerate, and translate this into how much data we need.
m training instances as worth
It is an important and standard fact that in the above problem, the expected square error is proportional to
1/m. The usual way to prove this is to use the linearity of expectation to show square errors tend to be additive for independent samples. But, let’s just take the conclusion as is.
Square error is great to calculate with, but it isn’t in natural business units. For example if the
vj are how many cars park on a given day, then the earlier square error is in “cars squared units.” This, is a bit of a nonsense unit. So it traditional to take the square-root of the square error to form root mean square error. This fixes the units, the error is now in the original units of our problem: cars.
So we can rephrase our above claim as samples of size
m tend to have a expected root mean square error proportional to
sqrt(1/m). Or: you need four times the data to halve the error. I often summarize this as:
mtraining instances as worth
More data is always better, but it has a diminishing return.
Twice the parameters require twice the training data
Usually we are fitting a much more detailed model than just estimating a non-conditional mean. For example, we can try to get detailed per-instance predictions of by using explanatory variables. One such method is linear regression. Our last note was in fact a long treatment to show the expected excess square error of a linear regression model is proportional to
m/(m - n), where
n is the number of variables or parameters in our linear regression model (counting the intercept term as a variable).
m/(m - n) = (2 m)/((2 m) - (2 n)). That is: if we double the number of parameters in our model from
2 n, to have the same expected excess generalization error we need to double the number of training instances
The number of training instances needed tends to be proportional to the number of model parameters.
Bigger, more complicated, models tend to need more data.
Partition models also need data size to grow as fast as the number of parameters
Linear models are not the only type of models. A large family of models, such as decision/regression trees and nearest neighbor models work by partitioning training data. This means they also require data proportional to the number of parameters.
We can make a hand-wavy argument for why this would be. Consider the following cartoon diagram as partition based model dividing the plane (or a world with 2 explanatory variables) into various regions.
In each region we have a number of training instances, shown as dots. The model’s prediction in the future is the average value associated with the training dots in a given region. So each region adds a parameter of our model we need to infer. For a given level of expected error we need at least a given number of dots in each region (just as in our original problem). Doubling the number of regions means we need to double the total number of dots to have most regions maintaining their target per-region count (in addition to using the data to pick the region boundaries).
The point is, many different types of models need training data proportional to the number of model parameters to be inferred.
Exceptions to the rule
There are classes of models that, by design, do not need a number of training instances proportional to the number of model parameters.
This is a bit technical (or “inside baseball”) compared to our earlier points. But we want to at least call out some exceptions.
The first examples are bagged models such as random forest models. These models are the sum of a great number of sub-models. Roughly one only needs an amount of data to satisfy the error arguments per-sub model, not proportional to the number of sub-models. This is because in bagged models the sub models are picked independently and the sum weights are are identical (which is why this sort of exemption doesn’t apply to boosted models). That being said: random forest models tend to need a lot of training data, as the underlying sub-models already need a lot of training data. The how data is needed rule is broken due to the fact the parameters in these models have a lot of redundancy.
The most famous super-large models are of a different structure: deep learning neural nets, such as the current large language models such as GPT4 and others. These models need an obscene amount of training data, which the large companies building these models happen to have access to. Pre-estimating how much data is needed for a given performance level is harder to work out as these models tend to be more complicated and less thoroughly characterized than linear or tree based models. However, one can plot the relation by measuring out of sample performance given different training set sizes.
We should expect a linear relation between the number of effective parameters in the model (parameters actually doing something). This is because if the demands were much worse, statistics would have ways to fix that. And if the demands were much less, this would be loudly celebrated big news. So I presume we would see the usual: loss or criticism in natural business units tends to shrink at rate proportional to
1 / sqrt(m) (which is also the central point of PAC and VC theories).
The common question “how much training data do I need?” actually has some fairly good answers. Some of these answers include:
- More tends to be better.
- Halving error usually requires four times the data. Or: “be willing to pay
- The amount of training data needed is often proportional to the number of modeling parameters. Or: twice as many parameters usually requires twice the data.