I’d like to write a bit about measuring effect sizes and Cohen’s d.
Introduction
For our note let’s settle on a single simple example problem.
We have two samples of real numbers a_1, ..., a_n and b_1, ..., b_n. All the a_i are mutually exchangeable, or generated by an independent and identically distributed random process; all the b_i are also such. We want to know: are the a_i exchangeable with (essentially the same as) the b_i, or not?
For a concrete example: suppose the a_i are the profits observed on one investment strategy and the b_i are the profits observed on an alternate strategy. We would like to know if the strategies are different, and if so which one has been historically better.
Now the above is thoroughly solved in testing the difference of means (such as ANOVA) and also as an A/B testing question. However, let’s look at this directly in very basic terms.
Terms
Some basic concepts that can carry the desired analysis are the following:
- The mean or average. For x = x_1, ..., x_n the mean is defined as mean(x) := (1/n) sum_{i=1 ... n} x_i. This can be thought of as one measure of what a “typical” x_i is. There are other measures of typicality, such as the median, but let’s stick to the mean.
- The standard deviation, or sd. The standard deviation is a measure of how much individuals vary. It is defined as sd(x) := sqrt((1/(n-1)) sum_{i=1 ... n} (x_i - mean(x))^2). The standard deviation is essentially the square root of the expected (or average) squared distance between examples and the mean. This is an estimate of how far individuals usually are from “typical.”
- The standard error, or se. The standard error is a measure of how much uncertainty we have in our estimate of the mean. Usually it is estimated as se(x) = sd(x) / sqrt(n). The idea is: if a = a_1, ..., a_n and b = b_1, ..., b_n were in fact exchangeable (generated by indistinguishable means), then we would expect our observed mean(b) - mean(a) to typically be around the size of se(a) or se(b) (se(a) and se(b) themselves should be similar under the assumptions that a and b are identical, the standard deviation is low, and n is large).
- The statistical significance or p-value. The statistical significance p can be defined for any summary of data. If f() is our summary and t is a given threshold, then the one-sided significance p of f(), t is defined such that p = Pr[f(b) - f(a) >= t] under the assumption that a and b are independently identically distributed. The idea is: observing a low p is unlikely under chance, so data that achieves a low p may in fact be due to a cause. How one estimates or calculates p is, of course, of interest. However, the properties of the value are determined by the definition. (Each of these quantities is computed in the sketch just after this list.)
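To make these definitions concrete, here is a minimal sketch in Python (my own illustration; the note itself is language-agnostic). It computes the mean, standard deviation, and standard error exactly as defined above, and estimates the one-sided significance p by re-splitting the pooled data, which is one common way to approximate the probability in p’s definition under exchangeability. The helper names (such as permutation_p) are mine.

```python
import numpy as np

rng = np.random.default_rng(2023)

def mean(x):
    # mean(x) := (1/n) sum_{i=1...n} x_i
    return np.sum(x) / len(x)

def sd(x):
    # sd(x) := sqrt((1/(n-1)) sum_{i=1...n} (x_i - mean(x))^2)
    return np.sqrt(np.sum((x - mean(x)) ** 2) / (len(x) - 1))

def se(x):
    # se(x) = sd(x) / sqrt(n)
    return sd(x) / np.sqrt(len(x))

def permutation_p(a, b, f=mean, t=0.0, n_rep=10000):
    # Estimate p = Pr[f(b) - f(a) >= t] under the null that a and b
    # are exchangeable, by repeatedly re-splitting the pooled data.
    pooled = np.concatenate([a, b])
    n = len(a)
    hits = 0
    for _ in range(n_rep):
        perm = rng.permutation(pooled)
        if f(perm[n:]) - f(perm[:n]) >= t:
            hits += 1
    return hits / n_rep

a = rng.normal(loc=0.0, scale=1.0, size=100)
b = rng.normal(loc=0.3, scale=1.0, size=100)
print(mean(a), sd(a), se(a))
print(permutation_p(a, b, t=mean(b) - mean(a)))
```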
The Detailed Question
The detailed question is: how do different summaries characterize the observed difference between a and b? We would like to know this in two cases:
- Case 0: a and b were generated the same way; there is no essential difference. This situation is often called a null hypothesis. It is usually something we hope isn’t the case, or a situation we are trying to disprove or eliminate.
- Case 1: a and b are generated in different ways. Usually we are hoping to confirm b is better than a, in the sense that mean(b) > mean(a). Though we can have the case that a and b are generated in different ways, but still have the same expected value or ideal mean.
What we want is a measure of the degree to which the observed mean(b) is greater than the observed mean(a). It is typical to use Case 0 as a stand-in for “the means are not different” (though that isn’t completely precise, as different distributions may have the same mean).
Typical measures include:
- mean(b) - mean(a), the observed difference in means. Let’s call this “d”.
- (mean(b) - mean(a))/se, the observed difference in means scaled by the standard error. For this note it isn’t important exactly how se is estimated, though there are details and variations. Typically a pooled estimate such as se = se(a_1, ..., a_n, b_1, ..., b_n) is used. This summary is usually called a standard score (“z”) or “t”. This is a very common statistic or summary. Let’s call this “d/se”.
- (mean(b) - mean(a))/sd, the observed difference in means scaled by the standard deviation. For this note it isn’t important exactly how sd is estimated, though there are details and variations. Typically a pooled estimate such as sd = sd(a_1, ..., a_n, b_1, ..., b_n) is used. This summary is usually called an effect size, the most famous of which is Cohen’s d. This is among the most useful statistics, and it is often rediscovered, as it is under-taught outside the statistically oriented sciences. Let’s call this “d/sd”. (The first three summaries are computed in the sketch after this list.)
- Any sort of p or significance for a threshold difference in summary statistics. For a difference in any of the above summaries (or many more) we can calculate the significance or p-value, and use this as a measure of interest. We advise against this, but it is worth discussing. Let’s call any of these (they all behave the same due to their shared definition) “p”.
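As a sketch of the three deterministic summaries (again in Python, with the simple pooled estimates described above; the helper name is mine, and other sd/se conventions exist):

```python
import numpy as np

def summaries(a, b):
    # Compute d, d/se, and d/sd using pooled sd/se estimates.
    d = np.mean(b) - np.mean(a)
    pooled = np.concatenate([a, b])
    sd_pooled = np.std(pooled, ddof=1)            # sd(a_1, ..., a_n, b_1, ..., b_n)
    se_pooled = sd_pooled / np.sqrt(len(pooled))  # se(a_1, ..., a_n, b_1, ..., b_n)
    return {
        "d": d,                 # raw difference in means
        "d/se": d / se_pooled,  # standard score ("z" or "t" style)
        "d/sd": d / sd_pooled,  # effect size (Cohen's d style)
    }

rng = np.random.default_rng(2023)
a = rng.normal(loc=0.0, scale=1.0, size=100)
b = rng.normal(loc=0.3, scale=1.0, size=100)
print(summaries(a, b))
```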
Our Summary
Having set up our notation, we can characterize our four summary families (d, d/se, d/sd, p).
Under moderate assumptions (bounded and non-zero standard deviation), as our sample size n gets large we expect the following for each of these summaries.
Summary | No actual difference (null hypothesis) | mean(b) > mean(a) (what we are hoping for) |
---|---|---|
sd(b) - sd(a) (difference of standard deviations) | distributed with mean 0 | distributed with constant mean, constant standard deviation |
se(b) - se(a) (difference of standard errors) | distributed with mean 0 | distributed with mean approaching 0, standard deviation approaching 0 |
d (raw difference in means) | distributed with mean 0 | distributed with constant mean, standard deviation approaching 0 |
d/sd (effect size) | distributed with mean 0 | distributed with mean converging to a constant (the effect size), and standard deviation approaching 0 |
d/se (standard score) | distributed with mean 0 | distributed with mean going to infinity |
p (significance) | distributed uniformly in the interval [0, 1] | distribution converges to zero |
And this brings us to our point. Apart from the effect size row, every limiting behavior in the table carries only a single bit of information at the surface level (that is, extractable without knowledge of the sample size n): “there is a difference”. In contrast, the effect size converges to a quantity that has conventional detailed interpretations.
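A quick simulation illustrates these limits (a sketch; the sample sizes and the normal distributions are arbitrary choices of mine). Here a and b differ in mean by 0.3 standard deviations, so d/se grows roughly like sqrt(n) while d/sd settles near the true effect size of 0.3:

```python
import numpy as np

rng = np.random.default_rng(2023)

for n in [100, 1000, 10000, 100000]:
    a = rng.normal(loc=0.0, scale=1.0, size=n)
    b = rng.normal(loc=0.3, scale=1.0, size=n)
    d = np.mean(b) - np.mean(a)
    pooled = np.concatenate([a, b])
    sd_pooled = np.std(pooled, ddof=1)
    se_pooled = sd_pooled / np.sqrt(len(pooled))
    # d/se keeps growing with n; d/sd stabilizes near 0.3.
    print(f"n={n:>6}  d/se={d / se_pooled:8.2f}  d/sd={d / sd_pooled:6.3f}")
```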
A standard table of effect sizes (source: Wikipedia) is copied below.
Effect size | d | Reference |
---|---|---|
Very small | 0.01 | [9] |
Small | 0.20 | [8] |
Medium | 0.50 | [8] |
Large | 0.80 | [8] |
Very large | 1.20 | [9] |
Huge | 2.0 | [9] |
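If one wants to report these conventional labels programmatically, a small sketch (the function name and the fall-through “below very small” label are mine; the thresholds come from the table above):

```python
def label_effect_size(d):
    # Map |d| to the conventional labels from the table above.
    cuts = [(2.0, "huge"), (1.2, "very large"), (0.8, "large"),
            (0.5, "medium"), (0.2, "small"), (0.01, "very small")]
    for threshold, name in cuts:
        if abs(d) >= threshold:
            return name
    return "below very small"

print(label_effect_size(0.55))  # prints: medium
```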
Also note for p: there is no strong guarantee that p is large in the null hypothesis case, as p does not have a concentrated distribution in this situation.
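This lack of concentration is easy to see by simulation: under the null hypothesis p is (approximately) uniform, so small and large values are equally likely. A sketch (using scipy’s two-sample t-test merely as a convenient source of p-values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2023)

# Draw many null examples (a and b generated identically) and
# record the p-value each time.
p_values = [
    stats.ttest_ind(rng.normal(size=50), rng.normal(size=50)).pvalue
    for _ in range(2000)
]

# Quartiles land near 0.25 / 0.5 / 0.75, as expected for a uniform draw.
print(np.quantile(p_values, [0.25, 0.5, 0.75]))
```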
In my experience, “standard scores” (z, t) and significances (p) are more commonly taught to statistical outsiders than effect sizes. The social sciences, however, are particularly strong in using effect sizes well, as Jacob Cohen was both a psychologist and a statistician (ref). Effect sizes, such as Cohen’s d, are very useful; so they keep getting profitably re-introduced by practitioners in many other fields.
For more on measures of correspondence, please see here.