Professor Sourav Chatterjee recently published a new coefficient of correlation called XICOR (refs: JASA, R package, Arxiv, Hacker News, and a Python package (different author)).
The basic formula (in the tie-free case) is:
- Take X and Y as n-vectors of observations of random variable.
- Compute the ranks r(i) of the Y observations.
- Sort the data tuples (X(1), Y(1), r(1)), … (X(n), Y(n), r(n)) such that X(1) ≤ … X(n).
- Compute XICOR := 1 – 3 sum(i=1, n-1) |r(i+1) – r(i)| / (n^2 – 1).
At first glance, the formula may seem odd. The papers spend most of their time studying the large data/asymptotic foundations of the formula, and applying it to data. Why the formula is “surface sensible” is simple math, well known to statisticians. Let’s spend some time expanding this step here.
What XICOR is Doing
Y = f(X) case
First what is XICOR doing? The basic idea is: if Y is a monotone function of X then |r(i+1) – r(i)| ≤ 1. This is because if if Y is a monotone function of X then moving up one rank in X either increases Y’s rank by 1 (if Y is an increasing function of X), or decreases Y’s rank by 1 (if is a decreasing function of X), or (if Y rank ties are handled in a given way) is zero (if Y is not changing).
So if Y is a function of X, such as Y = sin(X), then we expect sum(i=1, n-1) |r(i+1) – r(i)| to be small (no more than n-1). This places the XICOR near 1.
Y independent of X case
If Y is independent of X we instead expect sum(i=1, n-1) |r(i+1) – r(i)| to be large. This follows from the standard geometric fact that two points picked uniformly form the unit interval have an expected absolute distance of 1/3 (worked later in the appendix). So we expect |r(i+1) – r(i)| to be about n/3 for each i.
Summing up (using linearity of expected values) we expect 3 sum(i=1, n-1) |r(i+1) – r(i)| to be about 3 (n-1) / (n/3) and take the XICOR to a number near zero. In fact a more detailed calculation shows the expected value of sum(i=1, n-1) |r(i+1) – r(i)| is (n^3 -n)/3, which exactly cancels the declared terms in the XICOR taking the expected value of the XICOR to identically zero.
An that is how summing rank differences detects a functional relation between X and Y. If there is a relation, then for most places in an X-sorted observation set the Y-rank is changing be no more than 1. So summing such a rank change detects the relation, and this is the XICOR formula.
E[|r(i+1) – r(i)|] in the independent case
In the independent X is a random order with respect to Y. So E[|r(i+1) – r(i)|] is equal to the E_pi n E_i |pi(i+1) – pi(i)| where pi is a permutation of 1 … n chosen uniformly at random. By the linearity of expectation and the symmetry of the permutation, we have all E_i |pi(i+1) – pi(i)| = |pi(2) – pi(1)|, so we only have to estimate E_pi |pi(2) – pi(1)|. This in turn is equal to: (1 / (n * (n-1))) sum(a = 1 … n) sum(b = 1 … n, b !=a) |a – b|.
This sum can be evaluated as follows. sum(a = 1 … n) sum(b = 1 … n, b !=a) |a – b| = sum(a = 1 … n) ((sum(b = 1 … a-1) a – b) + (sum(b = a+1 … n) b – a)). Now we use the fact that sums of polynomials are always themselves just polynomials, possibly of one higher degrees. So to solve for this sum we just need to know its value for around 4 values of n. This is easy to do in Python, and comes out to (n^3 – n)/3, which is exactly the form used in the paper.
E[|a – b|] for a, b independent uniformly at random in [0, 1]
This is calculus similar to the sum solved above.
E[|a – b|] = integral(a = 0 to 1) integral(b = 0 to 1) |a – b| db da.
This can be re-written to eliminate the absolute value as integral(a = 0 to 1) (integral(b = 0 to a) (a – b) db) (integral(b = a to 1) (b – a) db) da. And now we just have nested integrals over polynomials. Some work shows this evaluates to 1/3 as claimed.