
Effect Sizes Say More Than Standard Scores

I’d like to write a bit about measuring effect sizes and Cohen’s d.

Introduction

For our note let’s settle on a single simple example problem.

We have two samples of real numbers a_1, ..., a_n and b_1, ..., b_n. All the a_i are mutually exchangeable or generated by an independent and identically distributed random process. All the b_i are also such.

We want to know: are the a_i exchangeable with (essentially the same as) the b_i or not?

For a concrete example: suppose the a_i are the profits observed on one investment strategy and the b_i are the profits observed on an alternate strategy. We would like to know if the strategies are different, and if so which one has been historically better.

Now the above is thoroughly solved by tests of the difference of means (such as ANOVA) and also as an A/B testing question. However, let’s look at this directly in very basic terms.

Terms

Some basic concepts that can carry the desired analysis are the following:

  • The mean or average. For x = x_1, ..., x_n the mean is defined as mean(x) := (1/n) sum_{i=1 ... n} x_i. This can be thought of as one measure of what a “typical” x_i is. There are other measures of typicality, such as the median, but let’s stick to mean.
  • The standard deviation, or sd. The standard deviation is a measure of how much individuals vary. It is defined as sd(x) := sqrt((1/(n-1)) sum_{i=1 ... n} (x_i - mean(x))^2). The standard deviation is essentially the square-root of the expected (or average) distance squared between examples and the mean. This is an estimate of how far individuals usually are from “typical.”
  • The standard error, or se. The standard error is a measure of how much uncertainty we have in our estimate of the mean. Usually it is estimated as se(x) = sd(x) / sqrt(n). The idea is:
    if a = a_1, ..., a_n and b = b_1, ..., b_n were in fact exchangeable (generated by indistinguishable means), then
    we would expect our observed mean(b) - mean(a) to be typically around the size of se(a) or se(b) (se(a) and se(b) themselves should be similar under the assumptions that a and b are identically distributed, the standard deviation is low, and n is large).
  • The statistical significance or p-value. The statistical significance p can be defined for any summary of data. If f() is our summary and t is a given threshold, then the one-sided significance p of (f(), t) is defined as p = PR[f(b) - f(a) >= t] under the assumption a and b are independently identically distributed. The idea is: a low p value is unlikely to be observed under chance alone, so data that achieves a low p may in fact be due to a cause. How one estimates or calculates p is, of course, of interest. However, the properties of the value are determined by the definition.
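
As a minimal sketch of the terms above, here is one way to compute them in plain Python. The function names are my own for illustration. The permutation estimate of p is just one of many ways to estimate the definition, and taking the threshold t to be the observed difference f(b) - f(a) is an assumption of this sketch.

```python
import math
import random


def mean(x):
    """mean(x) := (1/n) sum_{i} x_i"""
    return sum(x) / len(x)


def sd(x):
    """Sample standard deviation, using the 1/(n-1) normalization."""
    m = mean(x)
    return math.sqrt(sum((xi - m) ** 2 for xi in x) / (len(x) - 1))


def se(x):
    """Standard error of the mean, estimated as sd(x) / sqrt(n)."""
    return sd(x) / math.sqrt(len(x))


def permutation_p(a, b, f=mean, n_perm=2000, seed=0):
    """Estimate the one-sided p = PR[f(b') - f(a') >= t].

    Assumption of this sketch: t is the observed f(b) - f(a), and (a', b')
    are random relabelings of the pooled data, which mimics a and b being
    exchangeable.
    """
    rng = random.Random(seed)
    t = f(b) - f(a)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a_perm, b_perm = pooled[:len(a)], pooled[len(a):]
        if f(b_perm) - f(a_perm) >= t:
            hits += 1
    # add-one smoothing, so the estimate is never exactly zero
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0]
    b = [3.0, 4.0, 5.0, 6.0]
    print(mean(a), sd(a), se(a), permutation_p(a, b))
```

Passing a different summary as f (for example a median function) gives the significance for that summary, which is the sense in which p "can be defined for any summary of data."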

The Detailed Question

The detailed question is: how do different summaries characterize the observed difference between a and b? We would like to know this in two cases:

  • Case 0: a and b were generated the same way, there is no essential difference. This situation is often called a null hypothesis. It is usually something we hope isn’t the case, or a situation we are trying to disprove or eliminate.
  • Case 1: a and b are generated in different ways. Usually we are hoping to confirm b is better than a, in the sense that mean(b) > mean(a). Though, we can have the case that a and b are generated in different ways, but still have the same expected value or ideal mean.

What we want is a measure of the degree to which the observed mean(b) is greater than the observed mean(a). It is typical to use Case 0 as a stand-in for “the means are not different” (though that isn’t completely precise, as different distributions may have the same mean).

Typical measures include:

  • mean(b) - mean(a), the observed difference in means. Let’s call this “d”.
  • (mean(b) - mean(a))/se, the observed difference in means scaled by the standard error.
    For this note it isn’t important exactly how se is estimated, though there are details and variations. Typically a pooled estimate such as se = se(a_1, ..., a_n, b_1, ..., b_n) is used. This summary is usually called a standard score (“z”) or t. This is a very common statistic or summary. Let’s call this “d/se”.
  • (mean(b) - mean(a))/sd, the observed difference in means scaled by the standard deviation.
    For this note it isn’t important exactly how sd is estimated, though there are details and variations. Typically a pooled estimate such as sd = sd(a_1, ..., a_n, b_1, ..., b_n) is used. This summary is usually called an effect size, the most famous of which is Cohen’s d. This is among the most useful statistics, and is often rediscovered as it is under-taught outside the statistically oriented sciences. Let’s call this “d/sd”.
  • Any sort of p or significance for a threshold difference in summary statistics. For a difference in any of the above summaries (or many more) we can calculate the significance or p-value, and use this as a measure of interest. We advise against this, but it is worth discussing. Let’s call any of these (they all behave the same due to their shared definition) “p”.
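
The first three measures above can be sketched as follows. This uses the simple pooled estimates written in the list (sd and se of the concatenated samples). The function name and the dictionary keys are hypothetical, chosen only for this illustration. Note that the conventional Cohen’s d pools the within-group variances instead of concatenating the samples, so treat this as a sketch of the idea and not a reference implementation.

```python
import math


def sd(x):
    """Sample standard deviation, using the 1/(n-1) normalization."""
    m = sum(x) / len(x)
    return math.sqrt(sum((xi - m) ** 2 for xi in x) / (len(x) - 1))


def difference_summaries(a, b):
    """Return d, d/se, and d/sd for two samples.

    Assumption: sd and se are estimated from the concatenated data
    a_1, ..., a_n, b_1, ..., b_n, as in the text above.
    """
    pooled = list(a) + list(b)
    d = sum(b) / len(b) - sum(a) / len(a)
    sd_pooled = sd(pooled)
    se_pooled = sd_pooled / math.sqrt(len(pooled))
    return {"d": d, "d/se": d / se_pooled, "d/sd": d / sd_pooled}


if __name__ == "__main__":
    print(difference_summaries([1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]))
```

With these particular estimates, d/se is exactly d/sd multiplied by the square root of the pooled sample size, so the two summaries differ only by a factor that depends on how much data we have.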

Our Summary

Having set up our notation we can characterize our four summary families (d, d/se, d/sd, p).
Under moderate assumptions (bounded and non-zero standard deviation) as our sample size n gets large we expect the following for each of these summaries.

Summary | No actual difference (null hypothesis) | mean(b) > mean(a) (what we are hoping for)
sd(b) - sd(a) (difference of standard deviations) | distributed with mean 0 | distributed with constant mean, constant standard deviation
se(b) - se(a) (difference of standard errors) | distributed with mean 0 | distributed with mean approaching 0, standard deviation approaching 0
d (raw difference in means) | distributed with mean 0 | distributed with constant mean, standard deviation approaching 0
d / sd (effect size) | distributed with mean 0 | distributed with mean converging to a constant (the effect size), and standard deviation approaching 0
d / se (standard score) | distributed with mean 0 | distributed with mean going to infinity
p (significance) | distributed uniformly in the interval [0, 1] | distribution converges to zero
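
To get a feel for the large-n behavior, here is a small simulation sketch. The normal distributions, the shift of 0.5, the seed, and the function names are all assumptions chosen for illustration. It uses the concatenated pooled estimates, for which d/se equals sqrt(2n) times d/sd, so we would expect d/sd to settle near a constant while d/se keeps growing.

```python
import math
import random


def sd(x):
    """Sample standard deviation, using the 1/(n-1) normalization."""
    m = sum(x) / len(x)
    return math.sqrt(sum((xi - m) ** 2 for xi in x) / (len(x) - 1))


def simulate(n, shift, seed=0):
    """Draw a ~ N(0, 1) and b ~ N(shift, 1), each of size n (illustrative
    choices), and return the summaries d, d/sd, and d/se."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    b = [rng.gauss(shift, 1.0) for _ in range(n)]
    pooled = a + b
    d = sum(b) / n - sum(a) / n
    sd_pooled = sd(pooled)
    se_pooled = sd_pooled / math.sqrt(len(pooled))
    return {"d": d, "d/sd": d / sd_pooled, "d/se": d / se_pooled}


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        s = simulate(n, shift=0.5)
        print(n, round(s["d"], 3), round(s["d/sd"], 3), round(s["d/se"], 1))
```

Calling simulate with shift=0.0 gives a rough view of the null hypothesis column instead.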

And this brings us to our point. All of the cells marked in red carry only a single bit of information at the surface level (that is extractable without knowledge of the sample size n): “there is a difference”. In contrast, the effect size (in green) converges to a quantity that has conventional detailed interpretations.

A standard table of effect sizes (source: Wikipedia) is copied below.

Effect size d Reference
Very small 0.01 [9]
Small 0.20 [8]
Medium 0.50 [8]
Large 0.80 [8]
Very large 1.20 [9]
Huge 2.0 [9]

Also note for p there is no strong guarantee p is large in the null hypothesis case, as p is not a concentrated distribution in this situation.
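
A quick way to see this is to simulate many null experiments and look at the resulting p values. The sketch below is only illustrative: the normal data, sample size, trial count, and function names are my assumptions, and it uses a normal approximation with the usual two-sample standard error of the difference in means (not the concatenated se from earlier in the note) so that the p values are approximately calibrated.

```python
import math
import random


def one_sided_p(a, b):
    """Normal-approximation p for seeing mean(b) - mean(a) at least this large."""
    n_a, n_b = len(a), len(b)
    mean_a, mean_b = sum(a) / n_a, sum(b) / n_b
    var_a = sum((x - mean_a) ** 2 for x in a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n_b - 1)
    # standard error of the difference in means (an assumption of this sketch)
    se_diff = math.sqrt(var_a / n_a + var_b / n_b)
    z = (mean_b - mean_a) / se_diff
    # upper tail of the standard normal: P[Z >= z]
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def null_p_values(n=50, trials=2000, seed=0):
    """p values from repeated experiments where a and b share one distribution."""
    rng = random.Random(seed)
    ps = []
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ps.append(one_sided_p(a, b))
    return ps


if __name__ == "__main__":
    ps = null_p_values()
    print("fraction below 0.05:", sum(p < 0.05 for p in ps) / len(ps))
    print("fraction below 0.50:", sum(p < 0.50 for p in ps) / len(ps))
```

Under these assumptions roughly one experiment in twenty should show p below 0.05 even though nothing is going on, which is the sense in which p is not concentrated in the null case.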

In my experience, “standard scores” (z, t) and significances (p) are more commonly taught to statistical outsiders than effect sizes. The social sciences however are particularly strong in using effect sizes well, as Jacob Cohen was both a psychologist and statistician (ref). Effect sizes, such as Cohen’s d, are very useful; so they keep getting profitably re-introduced by practitioners in many other fields.

For more on measures of correspondence, please see here.

Categories: data science Opinion Statistics Tutorials


John Mount
