A recent run of too many articles on the same topic (exhibits: A, B and C) puts me in a position where I feel the need to explain my motivation. Which itself becomes yet another article related to the original topic. The explanation I offer is: this is the way mathematicians think. To us mathematicians the tension is that there are far too many observable patterns in the world to be attributed to mere chance. So our dilemma is: for which patterns/regularities should we derive some underlying law and which ones are not worth worrying about. Or which conjectures should try to work all the way to proof or counter-example?Mathematicians are famously abstract and concerned with notations and distinctions that do not exist outside of mathematics. However, I choose “patterns in nature” (instead of mathematics itself) as the guiding dilemma. This is because I do not (and I feel most mathematicians also do not) in fact worry much about the foundations mathematics itself. A lot has been written about reliability of the foundations of mathematics (Whitehead and Russell’s Principia Mathematica, foundational set theory, constructivism and the Association des collaborateurs de Nicolas Bourbaki‘s foundational work). And even though Gödel’s incompleteness theorems indicate there can not be an axiomitization of mathematics itself that is both complete and useful (as hoped for in Whitehead and Russell or in Hilbert’s research program) mathematics is in-fact much more stable and amenable to axiomatic analysis than any other field of study. So most mathematicians are comfortable with the state of mathematics itself.
It is regularities in calculations that disturb us. Are they really there? Is there a counter-example? Is there a proof? How much can we depend on them?
For a specific calculation we return to our recent fixation: common implementations of logistic regression tools such as R‘s glm() package. This is an important and useful statistical implementation. But what can we count on from the algorithm and its implementation?
Does it minimize traditional square error/loss?
No: minimizing loss is different than maximizing likelihood in this case.
Does it always converge from the standard start?
No: there are easy to construct examples that fail to converge from the standard all-zero start.
- Can we prove anything?
This situation is in fact close the theorist/mathematician’s view of mathematical knowledge. The view being mathematics is thin web of certainty with great reach and scope. The filaments of mathematical certainty reach almost everywhere, but are thin and do not cover. You are always a small move away from unsupported empty space (as we saw above in being able to generate easy examples that caused popular software to fail).
In fact this is why you even bother with proofs: you suspect even small changes are not safe. You want to establish what is safe, what is unsafe and how to discern the difference. In trying to prove the standard full-step Newton-Raphson method safe we have already found easy examples that break R’s glm() implementation. That is something you want to examine in your experimental lab, not happening quietly in production. So it is something you want to characterize and axiomatize.
Each of the articles I mentioned was written was trying either to prove or dispute possible regularities noticed in practice or while working on the previous article. In fact we used an ugly annealing technique to find the bad examples for our article. This is compatible with the “calculate and conjecture” style often used in mathematics (and heavily automated by Metropolis and Ulam). But during the search for bad examples we noticed a new regularity: no problem ever got worse on the first optimization step (even those problems that actually diverged). The question is: is this regularity a fact about the world or did we just do a poor job implementing the annealing/genetic-algorithm search we used to find counter-examples? Annealing has its own deep theory and can be implemented poorly or done well. The pure form of the dilemma: prove the conjecture or find a counter-example.
In some cases we use a principle like taste to walk away from a problem (the problem may not be worth solving). But there is always a nagging doubt that somebody else will solve the problem. And that in the work of solving it they will find another regularity to conjecture about. It could be that the next regularity they see can be generalized into something important.
But any time spent on search is time not spent on theory:
“If Edison had a needle to find in a haystack, he would proceed at once with the diligence of the bee to examine straw after straw until he found the object of his search. … I was a sorry witness of such doings, knowing that a little theory and calculation would have saved him ninety per cent of his labor.”
Nikola Tesla, New York Times (19 October 1931)
And ego/taste is also a problem. I have spent a lot of time trying to prove problems are in fact difficult to justify having worked on them. And sometimes the proof of difficulty is itself in now way obvious.
But enough philosophy: the point is the regularity is like a stone in our shoe. We need to prove it or get rid of it. In our case: does the full-step Newton-Raphson method of solving logistic regression always improve on the first step from the origin? It is actually hard to walk away from this question because we have a lot of empirical evidence this is the case- but can’t count this as a guarantee (as we have no measure of the quality and comprehensiveness of our empirical evidence). What if right after we quit working on the problem somebody exhibits a simple counter-example or a simple proof?
So we work a bit more. If we can’t completely work the problem we try to at least eliminate much of the uncertainty and its attendant tension and excitement. Can we find a method to put the question to bed?
Suppose we have m data items that are pairs (x(i),y(i)) where each x(i) is a column vector in R^n with first entry equal to 1 and each y(i) is either 1 or 0. The first step of full step Newton-Raphson method of solving a logistic regression is to compute Equation 1:
w = inverse(sum(i=1...m :(1/4) x(i) transpose(x(i)))) * (sum(i=1...m : (y(i)-1/2) transpose(x(i)))).
This is a simplification of the update step from the simpler derivation of logistic regression. It is specialized to the first step where the initial weight estimate is zero and all probability estimates are 1/2.
What we are trying to optimize is what we (through some abuse of terminology) we will call the unscaled perplexity (or perplexity, if we forget to include the “unscaled” qualifier). If p is our vector of current model estimates the unscaled perplexity is defined as Equation 2:
unscaledPerplexity(p) = sum(i=1...m | y(i)=1 : -log(p(i)) ) + sum(i=1...m | y(i)=0 : -log(1-p(i)) ).
This equation is penalizing the model for claiming events than actually happened are modeled as rare. For example if for the i-th datum we have y(i)=1 then the sum includes a term of -log(p(i))) which is large when p(i) is near zero. Similarly if the correct answer is y(i)=0 the penalty is -log(1-p(i)) which is large with p(i) is near 1. The p estimates themselves are defined as
p(i) = s(w.x(i)) where
s(z) = exp(z)/(1+exp(z)).
The unscaled perplexity of the standard zero start is Equation 2 evaluated the p estimates you would get at w=0. This assigns all p(i)=1/2 (independent of x(i)). So the initial un-scaled perplexity is exactly m*log(2). We want to show the unscaled perplexity after the first step is less than this quantity. We want to show (at least informally) that Equation 2 is no more than m*log(2) when we plug in the “p” derived from Equation 1’s “w”.
The argument is as follows: Equation 1 is recognizable as a standard regression picking w such that w.x(i) is as close as possible to 4(y(i)-1/2). Formally w is such that the following square error is minimized in Equation 3:
squareError(w) = sum(i=1..m: ( w.x(i) - 4(y(i)-1/2) )^2 ).
Because we insisted all x(i) have the first entry 1 (representing the constant traditionally present in these fitting problems) we know our solution is at least as good as picking a w such w.x(i) = 4(q – 1/2) where q = sum(i = 1…m : y(i))/m. So the square error of the best solution should be able to place w.x(i) at least as close to 4(y(i) – 1/2) as 4(q – 1/2) is on average.
To actually prove a solid result we would relate how the vector p in R^m such that p(i) = s(w.x(i)) (s(z) as in Equation 2) being close to the vector 4(y-1/2) in the sense of the square norm
||v - 4(y-1/2)||^2 is small implies what we need to know. But this would take some tedious math, so we will use only a heuristic or hand-waving argument. Suppose additionally that not only is the square norm
||p - 4(y-1/2)||^2 small but for each “i” p(i) is near 4(y(i)-1/2) (at least as close as q is and sometimes closer). This additional condition will not always be true; there can be “i” such that p(i) is far from 4(y(i)-1/2), but they have to be rare since on average the relation is true. The additional (unsupported) assumption lets us think at example data-points (specific i’s) which can be easier than thinking in terms of norms (though when we work in norms we bring in theorems like the Cauchy–Schwarz inequality, so the norm based proofs do tend to be quite short and clear).
To characterize even the single term of our sum, and then our whole problem, we need try a number of techniques. So we must know or invent methods. To experience the flavor of the work let us first look at a single term of our unscaled perplexity sum. Such a term is always: -log(s(z)) or -log(1-s(z)) (depending on y(i)) for some z. One technique to analyze functions is to look at the Maclaurin series series (Taylor series around zero). The Maclaurin series of a function is the approximation of a function as a series of powers, so using it we can often see some structure. The Maxima symbolic algebra system shows the Maclaurin series of s(z) is:
(%i11) taylor(exp(z)/(1+exp(z)),z,0,4); 3 1 z z (%o11)/T/ - + - - -- + . . . 2 4 48
Paydirt! The function
g(z) = 1/2+z/4 is the inverse of the function
f(z) = 4(z - 1/2). (I feel it is a remarkable regularity of the mathematics of functions that I don’t have to say if I mean left or right inverse as they are the same:
g(f(z))=z.) So our strange least-squares estimate given in Equation 1 parks
w.x(i) pretty much at the exact right place to have
q. That is if
w.x(i) is nearly
s(w.x(i)) is nearly
q (which is the best constant estimate of all the y(i), being their average).
In fact the Maclaurin series of our loss term
-log(1-s(w.x(i))) (depending if y(i) is 1 or 0 respectively) looks like this:
(%i14) taylor(-log(exp(z)/(1+exp(z))),z,0,4); 2 4 z z z (%o14)/T/ log(2) - - + -- - --- + . . . 2 8 192 (%i15) taylor(-log(1-exp(z)/(1+exp(z))),z,0,4); 2 4 z z z (%o15)/T/ log(2) + - + -- - --- + . . . 2 8 192
So an estimate that was at least as good as 4(q-1/2) would sum over all m data points for a total unscaled perplexity of:
(%i18) expand(-m*(q*subst(4*(q-1/2),x,taylor(log(exp(x)/(1+exp(x))),x,0,2)) + (1-q)*subst(4*(q-1/2),x,taylor(log(1-exp(x)/(1+exp(x))),x,0,2)))); 2 m (%o18) - 2 m q + 2 m q + log(2) m - - 2
Or: the perplexity after one step is just the
m*log(2) perplexity (as seen at the starting zero point) plus roughly a term of:
m*(-2q^2 + 2q - 1/2). Notice this term is never positive for any q in the interval [0,1] (and is negative for all points except q=1/2). This means unless q is near 1/2 we expect a large drop in perplexity after the first optimization step (which is what we are trying to prove). Thus we could (with some more care) rigorously prove for q sufficiently far away from 1/2 (likely only a small increment) that the first optimization step does improve. So we have a partial result, or a greater understanding of the problem. We can say: the dc term can usefully dominate error in the first step unless q is near 1/2. You are not going to find a counter-example with q very far from 1/2.
At this point we can either strengthen our code that searches for counter-examples to restrict itself to the q=1/2 case (brining to mind Tesla’s quote: we have proven most of our previous counter-example search was wasted as it did not concentrate on q near 1/2) or work more rigorously on the q=1/2 case to try and prove the proposition that the first optimization step decreases perplexity.
We have learned some things. But the dilemma remains: do we believe there is a counter-example or proof just around the corner? Do we want to work through two pages of this kind of argument? Are we willing to slog through tens of pages of this kind of argument? Can we find new techniques to argue? And we are also back to taste. If we are working on this problem (which may not be important) it means we are not working on something else. We can’t really know the value of the solution until we see it. Conceivably the solution method developed to solve an unimportant problem can itself have deep and important consequences. So if we drop the work because it seems unimportant we could miss something. However as Rota wrote:
There is a ratio by which you can measure how good a mathematician is, and that how many crackpot ideas he must have in order to have one good one. If this ratio equals ten to one then he is a genius. For the average mathematician, it may be one hundred to one. This is a feature of creative thinking that the man in the street fails to appreciate.
(Gian-Carlo Rota, “Indiscrete Thoughts”, Birkhauser, 1997; sorry about the single-gendered pronouns representing any mathematician, they are in the source).
On the negative (fixated) side Underwood Dudley goes on to characterize those who can not give up a crackpot idea as in fact being crackpots in “A Budget of Trisections”, Springer, 1987. And there is the extreme example of mathematician Paul Erdös’s amphetamine use to maintain concentration:
… Graham bet Erdös $500 that he couldn’t stop taking amphetamines for a month. Erdös accepted the challenge, and went cold turkey for thirty days. After Graham paid up–and wrote the $500 off as a business expense–Erdös said, “You’ve showed me I’m not an addict. But I didn’t get any work done. I’d get up in the morning and stare at a blank piece of paper. I’d have no ideas, just like an ordinary person. You’ve set mathematics back a month.”
(from “The Man Who Loved Only Numbers”, Paul Hoffman).
The matter of mathematical taste, or when to walk away from a problem, is of central importance to mathematical thought. It is definitely the institutional bias to stay on a problem a bit longer than would seem natural. And there is the issue of ego: moving on from a result you wanted makes means admitting you didn’t win. This is the nature of the motivations and dilemmas we wished to exhibit here.
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.