# Effect Sizes Say More Than Standard Scores

I’d like to write a bit about measuring effect sizes and Cohen’s d.

## Introduction

For our note let’s settle on a single simple example problem.

We have two samples of real numbers `a_1, ..., a_n` and `b_1, ..., b_n`. The `a_i` are mutually exchangeable (or, more strongly, generated by an independent and identically distributed random process), and the same holds for the `b_i`.

We want to know: are the `a_i` exchangeable with (essentially the same as) the `b_i`, or not?

For a concrete example: suppose the `a_i` are the profits observed on one investment strategy and the `b_i` are the profits observed on an alternate strategy. We would like to know if the strategies are different, and if so which one has been historically better.

Now the above is thoroughly solved as testing for a difference of means (for example, ANOVA) and also as an A/B testing question. However, let’s look at it directly in very basic terms.

## Terms

Some basic concepts that can carry the desired analysis are the following:

• The mean or average. For `x = x_1, ..., x_n` the mean is defined as `mean(x) := (1/n) sum_{i=1 ... n} x_i`. This can be thought of as one measure of what a “typical” `x_i` is. There are other measures of typicality, such as the median, but let’s stick to mean.
• The standard deviation, or `sd`. The standard deviation is a measure of how much individual values vary. It is defined as `sd(x) := sqrt((1/(n-1)) sum_{i=1 ... n} (x_i - mean(x))^2)`. The standard deviation is essentially the square root of the expected (or average) squared distance between examples and the mean. This is an estimate of how far individuals usually are from “typical.”
• The standard error, or `se`. The standard error is a measure of how much uncertainty we have in our estimate of the mean. Usually it is estimated as `se(x) = sd(x) / sqrt(n)`. The idea is:
if `a = a_1, ..., a_n` and `b = b_1, ..., b_n` were in fact exchangeable (generated by indistinguishable means), then
we would expect the observed `mean(b) - mean(a)` to typically be around the size of `se(a)` or `se(b)` (which should themselves be similar, under the assumptions that `a` and `b` are identically distributed with bounded standard deviation and `n` is large).
• The statistical significance or `p`-value. The statistical significance `p` can be defined for any summary of the data. If `f()` is our summary and `t` is a given threshold, then the one-sided significance `p` of `f()` at threshold `t` is defined such that `p = P[f(b) - f(a) >= t]` under the assumption that `a` and `b` are independently identically distributed. The idea is: observing low `p` values is unlikely under chance, so data that achieves a low `p` may in fact be due to a cause. How one estimates or calculates `p` is, of course, of interest. However, the properties of the value are determined by the definition.
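The first three definitions above can be sketched directly in Python (a minimal illustration using only the standard library; the function names are my own):

```python
import math

def mean(xs):
    """mean(x) := (1/n) sum_{i=1..n} x_i"""
    return sum(xs) / len(xs)

def sd(xs):
    """Sample standard deviation, with the (n-1) denominator used above."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

def se(xs):
    """Standard error of the mean: sd(x) / sqrt(n)."""
    return sd(xs) / math.sqrt(len(xs))

a = [1.0, 2.0, 3.0, 4.0, 5.0]
print(mean(a))  # 3.0
print(sd(a))    # ~1.581
print(se(a))    # ~0.707
```

Note how `se` shrinks as `n` grows while `sd` does not; that distinction drives the rest of the discussion.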

## The Detailed Question

The detailed question is: how do different summaries characterize the observed difference between `a` and `b`? We would like to know this in two cases:

• Case 0: `a` and `b` were generated the same way, there is no essential difference. This situation is often called a null hypothesis. It is usually something we hope isn’t the case, or a situation we are trying to disprove or eliminate.
• Case 1: `a` and `b` are generated in different ways. Usually we are hoping to confirm `b` is better than `a`, in the sense that `mean(b) > mean(a)`. Though, we can have the case that `a` and `b` are generated in different ways, but still have the same expected value or ideal mean.

What we want is a measure of the degree to which the observed `mean(b)` is greater than the observed `mean(a)`. It is typical to use Case 0 as a stand-in for “the means are not different” (though that isn’t completely precise, as different distributions may have the same mean).

Typical measures include:

• `mean(b) - mean(a)`, the observed difference in means. Let’s call this “`d`“.
• `(mean(b) - mean(a))/se`, the observed difference in means scaled by the standard error.
For this note it isn’t important exactly how `se` is estimated, though there are details and variations. Typically a pooled estimate such as `se = se(a_1, ..., a_n, b_1, ..., b_n)` is used. This summary is usually called a standard score (“`z`“) or `t`, and is a very common statistic. Let’s call this “`d/se`“.
• `(mean(b) - mean(a))/sd`, the observed difference in means scaled by the standard deviation.
For this note it isn’t important exactly how `sd` is estimated, though there are details and variations. Typically a pooled estimate such as `sd = sd(a_1, ..., a_n, b_1, ..., b_n)` is used. This summary is usually called an effect size, the most famous of which is Cohen’s d. It is among the most useful statistics, and is often rediscovered, as it is under-taught outside the statistically oriented sciences. Let’s call this “`d/sd`“.
• Any sort of `p` or significance for a threshold difference in summary statistics. For a difference in any of the above summaries (or many more) we can calculate the significance or `p`-value, and use this as a measure of interest. We advise against this, but it is worth discussing. Let’s call any of these (they all behave the same due to their shared definition) “`p`“.
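As a quick sketch, the three point summaries above can be computed as follows (my own illustration; the pooled estimates shown are one choice among the variations mentioned):

```python
import math
from statistics import mean, stdev

def summaries(a, b):
    d = mean(b) - mean(a)                        # raw difference in means
    pooled = list(a) + list(b)
    sd_pool = stdev(pooled)                      # pooled standard deviation
    se_pool = sd_pool / math.sqrt(len(pooled))   # pooled standard error
    return {
        "d": d,
        "d/se": d / se_pool,   # standard score style summary ("z" or "t")
        "d/sd": d / sd_pool,   # effect size (Cohen's d style)
    }

a = [0.9, 1.1, 1.0, 1.2, 0.8]
b = [1.3, 1.5, 1.4, 1.6, 1.2]
s = summaries(a, b)
print(s)  # d ~0.4, d/se ~4.90, d/sd ~1.55
```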

## Our Summary

Having set up our notation we can characterize our four summary families (`d`, `d/se`, `d/sd`, `p`).
Under moderate assumptions (bounded and non-zero standard deviation) as our sample size `n` gets large we expect the following for each of these summaries.

| summary | no actual difference (null hypothesis) | `mean(b) > mean(a)` (what we are hoping for) |
|---|---|---|
| `sd(b) - sd(a)` (difference of standard deviations) | distributed with mean 0 | distributed with constant mean, constant standard deviation |
| `se(b) - se(a)` (difference of standard errors) | distributed with mean 0 | distributed with mean approaching 0, standard deviation approaching 0 |
| `d` (raw difference in means) | distributed with mean 0 | distributed with constant mean, standard deviation approaching 0 |
| `d/sd` (effect size) | distributed with mean 0 | distributed with mean converging to a constant (the effect size), and standard deviation approaching 0 |
| `d/se` (standard score) | distributed with mean 0 | distributed with mean going to infinity |
| `p` (significance) | distributed uniformly in the interval [0, 1] | distribution converges to zero |

And this brings us to our point. Under an actual difference, all of the summaries except the effect size carry only a single bit of information at the surface level (that is, extractable without knowledge of the sample size `n`): “there is a difference.” In contrast, the effect size converges to a quantity that has conventional detailed interpretations.
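The scaling claims in the table can be checked with a quick simulation (my own illustration, assuming normally distributed samples with a true mean shift of 0.5): the effect size `d/sd` settles near a constant while the standard score `d/se` grows roughly like `sqrt(n)`.

```python
import math
import random
from statistics import mean, stdev

def summaries(a, b):
    d = mean(b) - mean(a)
    sd_pool = stdev(list(a) + list(b))                  # pooled sd
    se_pool = sd_pool / math.sqrt(len(a) + len(b))      # pooled se
    return d, d / se_pool, d / sd_pool

rng = random.Random(2024)
for n in (100, 1000, 10000):
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]  # baseline "strategy"
    b = [rng.gauss(0.5, 1.0) for _ in range(n)]  # true mean shift of 0.5
    d, z, eff = summaries(a, b)
    # d and eff hover near constants; z keeps growing with n
    print(n, round(d, 2), round(z, 1), round(eff, 2))
```

Reading `z` alone, a large value could mean a large effect or merely a large `n`; reading `eff` alone already tells you the magnitude of the difference in `sd` units.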

A standard table of effect sizes (source: Wikipedia) is copied below.

| Effect size | d | Reference |
|---|---|---|
| Very small | 0.01 | [9] |
| Small | 0.20 | [8] |
| Medium | 0.50 | [8] |
| Large | 0.80 | [8] |
| Very large | 1.20 | [9] |
| Huge | 2.0 | [9] |
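For convenience, the table’s thresholds can be wrapped in a small lookup function (my own helper; the “negligible” label for values below the smallest entry is my own choice):

```python
def effect_label(d):
    """Map an observed effect size to the conventional labels above."""
    d = abs(d)
    if d >= 2.0:
        return "huge"
    if d >= 1.2:
        return "very large"
    if d >= 0.8:
        return "large"
    if d >= 0.5:
        return "medium"
    if d >= 0.2:
        return "small"
    if d >= 0.01:
        return "very small"
    return "negligible"

print(effect_label(0.6))  # medium
```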

Also note that for `p` there is no strong guarantee `p` will be large in the null hypothesis case, as `p` is uniformly distributed (not concentrated) in this situation.

In my experience, “standard scores” (`z`, `t`) and significances (`p`) are more commonly taught to statistical outsiders than effect sizes. The social sciences, however, are particularly strong at using effect sizes well, as Jacob Cohen was both a psychologist and a statistician (ref). Effect sizes, such as Cohen’s d, are very useful; so they keep getting profitably re-introduced by practitioners in many other fields.

For more on measures of correspondence, please see here.
