Menu Home

How often do the L1 and L2 norms agree?

Turns out that I am still on a recreational mathematics run. Here is one I have been working on, arising from trying to explain norms and data science.

Barry Rowlingson and John Mount asked the following question.

Generate vectors v1 and v2 in Rn with each coordinate generated IID normal mean zero, standard deviation 1. This is a common way to generate vectors with a uniform spherical distribution. Let pn denote the probability that (||v1||1 ≥ ||v2||1) = (||v1||2 ≥ ||v2||2). What is limn → ∞ pn?

It turns out the answer is: 1/2 + arctan(1/sqrt(π - 3)) / π ≅ 0.8854404657887897. I’ve taken to calling this the “L1L2 AUC” or concordance. This is not the first value I guessed.

The rather long (and brutal) argument chain to establish this can be found here. Along the way we had to solve for the expected L1 norm of a vector with unit L2 norm, and also work out P[(X + Y ≥ 0) = (X ≥ 0)] (for X, Y independent mean zero normal random variables with known variances, we call this the sign tilting lemma).

It was great to get the old “conjecture and prove/disprove” engine spinning again.

Categories: Mathematics

Tagged as:

John Mount

3 replies

  1. I’ll use v,w to denote vector names and subscripts for coordinates.

    1. It’s equivalent to ask for the probability that (||v||_1\ge||w||_1)=(||v||_2^2\ge||w||_2^2). As the coordinates are independent of each other, we can calculate their contributions to the difference between the L1 norm and the squared L2 norm (which are additive) one at a time. That is, sample the coordinates in the order (v_1,w_1), (v_2,w_2), (v_3,w_3), … and we ask for the probability that the 2D vector (|v_1|-|w_1|,|v_1|^2-|w_1|^2)+(|v_2|-|w_2|,|v_2|^2-|w_2|^2)+…+(|v_n|-|w_n|,|v_n|^2-|w_n|^2) is in the first or third quadrant (the first coordinate keeps track of which L1 norm is larger; the second keeps track of which L2 norm is larger). These n vectors are iid from the distribution of (X-Y, X^2-Y^2) where X,Y are independent and half-normal distributed, so by rescaling (which doesn’t affect which quadrant you’re in) and the multidimensional central limit theorem, the answer is the same as for the multivariate normal distribution with the same covariance matrix, which (by using the moments of the chi and chi squared distributions) is {{1-2/pi,sqrt(2/pi)},{sqrt(2/pi),2}}.

    Using the probability that the centered multivariant normal with covariance matrix {{a,b},{b,c}} is in the first or third quadrant is arccos(-b/sqrt(ac))/pi, so the final answer is arccos(-1/sqrt(pi-2))/pi.

    (This agrees with your answer: arctan(1/sqrt(pi-3))=pi/2-arctan(sqrt(pi-3)), so 1/2+arctan(1/sqrt(pi-3))/pi=1-arctan(sqrt(pi-3))/pi. Also, cos(arctan(x))=1/sqrt(x^2+1), so this becomes 1-(arccos(1/sqrt(pi-2))/pi)=(pi-arccos(1/sqrt(pi-2)))/pi=arccos(-1/sqrt(pi-2))/pi.)

    2. I also decided to see what happens if I replace the half normal distribution with a normal distribution with the same mean and variance but then also replaced the L1 norm with the sum of the coordinates to see if it gives the same answer. (The point is that part of the reason this problem takes a little work is the absolute values, and we solved that above by conditioning on all coordinates being positive (or equivaletly, reflecting them all to be positive). If we have two random positive nD vectors v,w, then ||v||_1>=||w||_1 iff v-w has positive dot product with (1,1,…,1), and ||v||_2^2>=||w||_2^2 iff ||v||_2^2-||w||_2^2>=0 iff >= 0. If v,w were normal then v-w,v+w would be independent so we could sample them independently and hope for a geometric solution, though I haven’t found one. But for the above problem they have to be half normal, not normal, so the question is whether making the replacement above would give the same answer.)

    For convenience, I rescaled the mean to 1, so the variance becomes pi/2-1, so the standard deviation is about 0.7555. Using the same approach as in the previous solution, for a general normal distribution with mean 1 and standard deviation sigma, the covariance matrix (after some rescaling) becaomes {{2sigma^2,4sigma^2},{4sigma^2,8sigma^2+2sigma^4}} or, after more rescaling, {{1,2},{2,4+sigma^2}}.

    For sigma^2=pi/2-1, the answer becomes arccos(-1/sqrt(1+(pi/2-1)/4))/pi which is about 0.8850, which is close to the same answer but not quite (it’s close enough that I don’t have a way to predict ahead of time whether it would be greater or less).

    It turns out that the standard deviation required to get the same answer is 2sqrt(pi-3) or about 0.7526.

    1. While I think your computations of a_n and v_n can be more standardized by using (To compute v_n, to compute the covariance term, use the polarization identity) (I haven’t checked that you can actually do this, but if you write out the integrals it looks like you end up with more things that look like beta integrals), if all you are about is the limit then you can use the delta method (in particular, the thing you assume converges to normal in your ipynb does indeed converge to normal). This is pretty much the same observation as before, where the coordinates are independent of each other.

      Indeed, apply the 2D delta method to the sequence of random variables given by taking averages of n iid copies of (X,X^2) where X comes from a halfnormal distribution and apply it with the function f(x,y)=x/sqrt(y). To get the variance of the resulting distribution, the covariance matrix is the same as in item 1 of the previous comment (actually, in the previous comment, the covariance matrix should have been doubled, though due to scale invariance I ignored that. But this time the scale matters, so the covariance matrix in this case is {{1-2/pi,sqrt(2/pi)},{sqrt(2/pi),2}}), and we apply this symmetric bilinear form (as a quadratic form) to the vector {{1},{-sqrt(pi/2)/2}}.

      1. flash1234567890, thanks for the great notes. I also really like how you called out good signpost theorems to shoot for (polarization, and delta method). I did have an initial plan of attack, before the proof moved to “calculate as long as I can before I admit I have to linearize.” I was hoping to get to to a place where the McDiarmid’s concentration inequality would do the work. As is, I think one can extract a coherent outline of: write converting normals to L2 ball, to L1 ball as a multiplicative system and go.

%d bloggers like this: