Logistic Regression is a popular and effective technique for modeling categorical outcomes as a function of both continuous and categorical variables. The question is: how robust is it? Or: how robust are the common implementations? (note: we are using robust in a more standard English sense of performs well for all inputs, not in the technical statistical sense of immune to deviations from assumptions or outliers.)

Even a detailed reference such as “Categorical Data Analysis” (Alan Agresti, Wiley, 1990) leaves off with an empirical observation: “the convergence … for the Newton-Raphson method is usually fast” (chapter 4, section 4.7.3, page 117). This is a book that if there is a known proof that the estimation step is a contraction (one very strong guarantee of convergence) you would expect to see the proof reproduced. I always suspected there was some kind of Brouwer fixed-point theorem based folk-theorem proving absolute convergence of the Newton-Raphson method in for the special case of logistic regression. This can not be the case as the Newton-Raphson method can diverge even on trivial full-rank well-posed logistic regression problems.From a theoretical point of view the logistic generalized linear model is an easy problem to solve. The quantity being optimized (deviance or perplexity) is log-concave. This in turn implies there is a unique global maximum and no local maxima to get trapped in. Gradients always suggest improving directions. However, the standard methods of solving the logistic generalized linear model are the Newton-Raphson method or the closely related iteratively reweighted least squares method. And these methods, while typically very fast, do not guarantee convergence in all conditions.

If you do not like Newton-Raphson techniques, many other optimization techniques can be used:

- stochastic gradient descent
- conjugate gradient
- EM (see “Direct calculation of the information matrix via the EM.” D Oakes, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 1999 vol. 61 (2) pp. 479-482).

Or you can try to solve a different, but related, problem: “Exact logistic regression: theory and examples”, C R CR Mehta and N R NR Patel, Statist Med, 1995 vol. 14 (19) pp. 2143-2160.

A dominating problem with logistic regression comes from a feature of training data: subsets of outcomes that are separated or quasi-separated by subsets of the variables (see, for example: “Handling Quasi-Nonconvergence in Logistic Regression: Technical Details and an Applied Example”, J M Miller and M D Miller; “Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives”, P J Green, Journal of the Royal Statistical Society, Series B (Methodological), 1984 pp. 149-192; and FAQ What is complete or quasi-complete separation in logistic/probit regression and how do we deal with them?). Most practitioners will encounter this situation and the correct fix is some form of regularization or shrinkage (*not* eliminating separating variables- as they tend to be the most influential ones). In fact most practitioners have the intuition that these are the only convergence issues in standard logistic regression or generalized linear model packages. This is not the case.

The Newton-Raphson/Iteratively-Reweighted-Least-Squares solvers can fail for reasons of their own, independent of separation or quasi-separation. Consider the responses to the following request for help: Whassup with glm()?. Professor Andrew Gelman asks why the following R code diverges:

```
```y <- rep (c(1,0),c(10,5))
glm (y ~ 1, family=binomial(link="logit"), start=5)

Clearly some of the respondents are thinking in terms of separation and numeric overflow. But the problem was to merely compute an average (the data as a function only of the constant 1!) and the start point of 5 is so small a number that even exp(5) will not trigger over-flow or under-flow. What went wrong is: the Newton-Raphson style solver merely, for reasons of its own, refused to work. 5 is a numerically fine start estimate- but it is outside of the Newton-Raphson convergence region. This is a surprise to many practitioners- but Newton-Raphson style methods are only guaranteed to converge if you start sufficiently close to the correct answer. It is likely the case that for most logistic regression models the typical start (all coefficients zero: yielding a prediction of 1/2 for all data) is close enough to the correct solution to converge. But without additional theorems and lemmas there is no reason to suppose this is *always* the case.

The “Whassup” example demonstrates the problem is present in R‘s standard optimizer (confirmed in version 2.15.0). The take-away is to be very suspicious if you see *any* of the following messages in R:

- “glm.fit: fitted probabilities numerically 0 or 1 occurred”
- any “failed to converge” message
- astronomical coeffcients
- residual deviance larger than null deviance.

In any of these cases model fitting has at least partially failed and you need to take measures (such as regularized fitting).

R’s optimizer likely has a few helping heuristics, so let us examine a trivial Newton-Raphson method (always takes the full Newton-Raphson step, with no line-search or other fall-back techniques) applied to another problem. In this case (to make prettier graphs) we will consider fitting y as a function of the constant 1 and a single variable x. Our data is given by the following four rows:

x | y | |
---|---|---|

1 | 1.00 | TRUE |

2 | 0.00 | TRUE |

3 | 1.00 | FALSE |

4 | 0.00 | FALSE |

(or in R notation: `p = data.frame(x=c(1,0,1,0),y=c(T,T,F,F))`

).

The unique optimal model is to admit y is independent of x and set all coefficients to zero (R solves this correctly when given the command: `glm(y~x,data=p,family=binomial(link='logit'))`

). This model has a residual deviance of 5.5452 (which is also the null deviance). The following figure plots the perplexity (the un-scaled deviance) of different models as a function of choice of wC (the constant coefficeint) and wX (the coefficient associated with x):

The minimal perplexity is at the origin (the encoding of the optimal model) and perplexity grows as we move away from the origin (yielding the ovular isolines).

For our next figure we plot the behavior of a single full step of a Newton-Raphson method (generated by a deliberately trivial implementation of The Simpler Derivation of Logistic Regression). For each point in the plane we initialize the model with the coefficients represented by the point (wC and wX) and then take a single Newton-Raphson step. Plotting the single step behavior lets us draw some conclusions about the iterated optimizer without getting deep into the theory of iterated systems. If the step does not increase the perplexity (as we would expect during good model fitting) we color the point red, otherwise we color the point blue. The intuition is that most of the blue points represent starts that would cause the fitter to diverge (they increase perplexity and likely move to chains of points that also have this property).

Divergence is easy to show for any point that lies outside of an isoline of the first graph where this isoline is itself completely outside of the red region of the second graph. These points show an increase in perplexity (as they are outside of the red region) and thus stay outside of their original perplexity isoline (and remain outside of the red region) and therefore will never decrease their perplexity no matter how many Newton-Raphson steps you apply.

So, the acceptable optimization starts are only in and near the red region of the second graph. Starts far outside of this region are guaranteed to not converge to the unique optimal point under Newton-Raphson steps. R confirms the problem with the following bad start: `glm(y~x,data=p,family=binomial(link='logit'),start=c(-4,6))`

. Sufficiently sophisticated code can fallback to gradient-alone methods when Newton-Raphson’s method fails. The problem is fixable, because optimizing logistic divergence or perplexity is a very nice optimization problem (log-concave). But most common statistical packages do not invest effort in this situation.

And most practitioners are unfamiliar with this situation because:

- They
*rightly*do not concern themselves with the implementation details, as these are best left to the software implementors. - They are
*very*likely to encounter issues arise from separation, which will mask other issues.

The good news is that Newton-Raphson failures are not silent. You will see a large residual deviance and many of the other diagnostics we called out. The fix for a Newton-Raphson failure is to either use a more robust optimizer or guess a starting point in the converging region. This is not hopeless as coefficients from other models such as linear regression and naive Bayes are likely useable. Also one can group variables and levels to solve simpler models and then use these solutions to build better optimization starting points.

Some comfort can be taken in that: the reason statistical packages can excuse not completely solving the optimization problem is: Newton-Raphson failures are rare in practice (though possible). It would be nice if all packages included robust fallback code (such as not accepting Newton-Raphson steps that degrade solution quality and switching to gradient alone methods in this case) but that is not the current state of the market.

Really what we have done here (and in What does a generalized linear model do?) is treat statistical modeling as a college math exercise. What we have done and what we recommend: is try trivial cases and see if you can simplify the published general math to solve the trivial case directly. Usually nobody *fully* understands the general case (beyond knowing the methods and the proofs of correctness) and any real understanding is going to come from familiarity from working basic exercises and examples. Instead of appealing to big hammer theorems- work some small examples.

Extra credit: find a simple non-separated logistic regression that diverges on the first Newton-Raphson step from the origin, or failing that a proof that no such problem exists. We don’t have such an example (though suspect there is a divergent example) and have some messy Java code for experimenting with single Newton-Raphson steps: ScoreStep.java.

Categories: data science Expository Writing Mathematics Statistics Tutorials

### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

Found a bounded example that diverges from a default all-zero start on the fourth Newton step (not the first). The way it does it is it essentially drives some probabilities to zero and one (which essentially removes them from the Hessian matrix as contributions to the Hessian are weighted by p(1-p)) and causes degeneracy at that point. The problem does have a bounded “better than null deviance” solution (but it also drives probabilities near zero and one). In R code:

The first example (starting at the default 0,0) example runs away. The second (with a near start hint) converges. Both yield warnings from R- but the outcomes are different. The weights are all positive integers- so they are just a shorthand for data replication.

LikeLike

Just a note: we are using the word “robust” in its standard English sense in this article: performs well for arbitrary inputs. We do not mean “robust” in the technical statistical sense (immune to variations in modeling assumptions and outliers).

LikeLike

We can write down the Newton-Raphson update rule explicitly for single parameter models and plug in Professor Gelman’s example (x representing the modeling parameter and start, q representing the ratio of true to false to be modeled):

We see the jump from bad (5) to worse (-49).

A nasty exercise would be asking to prove that iterating f(x,q) from x starting at 0 and any fixed q strictly in-between zero and one converges to x = logit(q) (i.e. show standard logistic regression is capable of computing averages from the standard zero start in all cases). With a huge Taylor series slog we can show starts from zero pick an x such that

`p = exp(x)/(1+exp(x))`

is closer to q than the original 1/2, but it is a lot of work for very little benefit.LikeLike

I would like to comment on the problem discussed about modeling an intercept only logistic model with 10 1s and 5 0 values. The problem is not the fault of the Newton-Raphson algorithm, or of IRLS as implemented in R’s glm function. I am author of a text called Logistic Regression Models (2009, Chapman & Hall/CRC – 656 pages) and of a forthcoming one: Hilbe and Robinson, Methods of Statistical Model Estimation (Chapman & Hall/CRC). For the latter book we developed an R irls() function, among others, that is very similar to glm, but in many respects is more comprehensive and robust. for one thing, It easily estimates the problem data. I show this below, and also model the data using both Stata glm and its MLE logit commands. They give identical results as the irls function. The problem is not the Newton-Naphson or IRLS, but the manner in which R’s glm function was written.

# CREATE DATA

> y y

[1] 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0

# USE R glm FUNCION – FAILS

> glm(y ~ 1, family=binomial(link=”logit”))

Coefficients:

(Intercept)

1.501e+15

Degrees of Freedom: 14 Total (i.e. Null); 14 Residual

Null Deviance: 19.1

Residual Deviance: 360.4 AIC: 362.4

Warning message:

glm.fit: fitted probabilities numerically 0 or 1 occurred

# USE IRLS FUNCTION FROM

# Hilbe & Robison, Methods of Statistical Model Esstimation, Chapman & Hall/CRC

# and in the # MSME package associated with the book on CRAN. Not yet

# published, but will be on CRAN this year.

# y INTO DATA FRAME, LOAD LIBRARY, RUN MODEL

> testy library(msme)

> irls.logit summary(irls.logit)

Call:

irls(formula = y ~ 1, data = testy, family = “binomial”, link = “logit”)

Deviance Residuals:

Min. 1st Qu. Median Mean 3rd Qu. Max.

-1.4820 -1.4820 0.9005 0.1062 0.9005 0.9005

Coefficients:

Estimate SE Z p LCL UCL

(Intercept) 0.6931 0.5477 1.266 0.2057 -0.3804 1.767

Null deviance: NA on 14 d.f.

Residual deviance: 19.09543 on 14 d.f.

AIC: 21.09543

Number of iterations: 1

# NOTE INTERCEPT VALUE IS 0.6931, WITH SE = 0.5477; PLUS z, p AND CONFIDENCE

# INTERVALS. CONVERGE WAS RAPID AND ACCURATE

# ODDS RATIO IS 2.0

> exp(0.6931)

[1] 1.999906

STATA

. tab y

y | Freq. Percent Cum.

————+———————————–

0 | 5 33.33 33.33

1 | 10 66.67 100.00

————+———————————–

Total | 15 100.00

* THERE IS A 10:5 OR 2:1 RATIO OF y==1 TO y==0, MEANING THAT THE ODDS RATIO IS 2.

* SO THE MODEL SHOULD WORK.

* MODEL THE Y DATA USING STATAs IRLS-BASED glm COMMAND

. glm y, fam(bin) nolog

Generalized linear models No. of obs = 15

Optimization : ML Residual df = 14

Scale parameter = 1

Deviance = 19.09542505 (1/df) Deviance = 1.363959

Pearson = 15 (1/df) Pearson = 1.071429

Variance function: V(u) = u*(1-u) [Bernoulli]

Link function : g(u) = ln(u/(1-u)) [Logit]

AIC = 1.406362

Log likelihood = -9.547712524 BIC = -18.81728

——————————————————————————

| OIM

y | Coef. Std. Err. z P>|z| [95% Conf. Interval]

————-+—————————————————————-

_cons | .6931472 .5477226 1.27 0.206 -.3803693 1.766664

——————————————————————————

* STATA MLE COMMAND (FUNCTION) – CONVERGES NICELY AS WELL.

. logit y, nolog

Logistic regression Number of obs = 15

LR chi2(0) = 0.00

Prob > chi2 = .

Log likelihood = -9.5477125 Pseudo R2 = 0.0000

——————————————————————————

y | Coef. Std. Err. z P>|z| [95% Conf. Interval]

————-+—————————————————————-

_cons | .6931472 .5477226 1.27 0.206 -.3803693 1.766664

——————————————————————————

Best, Joseph Hilbe

LikeLike

@Joseph Hilbe Thanks for the comment, code and examples. Probably we are talking past each other a bit. We do agree that careful algorithms that incorporate Newton-Raphson or IRLS ideas can solve the optimization problem. It is after all an easy class of problem (unimodal, log-concave and so on). However, I still feel the case that a rigid full-step Newton-Raphson algorithm (with no line-search fallback) is less than safe. Also probably the blogging software mangled your comment a bit (if so I am sorry about that) but R’s glm() only fails on the 10 1s, 5 0s example when given a non-zero start (like 5 or -5). With a zero start it works (and as I show in my follow-up article we should expect even simple Newton Raphson algorithms to succeed in this zero-start case).

LikeLike

Thank you Professor Andrew Gelman for your mention of this article cycle ( http://www.win-vector.com/blog/2012/08/what-does-a-generalized-linear-model-do/ , http://www.win-vector.com/blog/2012/08/how-robust-is-logistic-regression/ , http://www.win-vector.com/blog/2012/08/newton-raphson-can-compute-an-average/ and http://www.win-vector.com/blog/2012/09/the-mathematicians-dilemma/ ). I especially liked the majorization ideas you and your commenters brought in ( “A Tutorial on MM Algorithms”, David R. Hunter, Kenneth Lange; “Monotonicity of Quadratic-Approximation Algorithms”, Dankmar Bohning, Bruce G. Lindsay, Ann. Inst. Statist. Math, Vol. 40, No. 4, pp 641-664, 1988)

LikeLike