# A pathological glm() problem that doesn’t issue a warning

I know I have already written a lot about technicalities in logistic regression (see for example: How robust is logistic regression? and Newton-Raphson can compute an average). But I just ran into a simple case where R‘s glm() implementation of logistic regression seems to fail without issuing a warning message. Yes the data is a bit pathological, but one would hope for a diagnostic or warning message from the fitter.Consider the following synthetic data set and glm() logistic regression fit (using “R version 3.0.0 (2013-04-03) — “Masked Marvel”” on OSX Mountain Lion):

```> d <- data.frame(x=c(rep(1,200),rep(0,25)),y=c(rep(1,24),rep(0,176),rep(1,25)))
> table(y=d\$y,x=d\$x)
x
y     0   1
0   0 176
1  25  24
> print(summary(m))

Call:
glm(formula = y ~ x, family = binomial(link = "logit"), data = d)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-0.5056  -0.5056  -0.5056  -0.5056   2.0593

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)    18.57    1304.53   0.014    0.989
x             -20.56    1304.53  -0.016    0.987

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 235.84  on 224  degrees of freedom
Residual deviance: 146.77  on 223  degrees of freedom
AIC: 150.77

Number of Fisher Scoring iterations: 17
```

Notice that no coefficient achieved significance and the error bars are in fact gigantic. It is almost like when logistic regression tries to run a coefficient to infinity on linearly separable data. In fact that is what is going on: for x=0 the y data is all in one class (separated or pseudo-separated) and any model of the form dc-term = B and x-term = -B -1.9924 is a good model for large positive B (always is correct on the x=1 distribution of y’s, and gets better at reproducing the x=0 distribution of y’s as B goes to infinity). An ideal optimizer would run B to +infinity. Likely the Newton method in glm() failed because the Hessian became numerically ill-conditioned or the gradient became near zero (but in a region where the loss function was flat, so not a good indication of an optimum). But that is the problem: glm() didn’t inform us of any issue. It should have run to infinity (bad), but instead it just stopped without diagnostic signaling (also bad).

The model is in fact good, it is the error bars that are a problem. It should not be hard to bound coefficients that are running to infinity away from zero. The standard error estimates (even if right) are in this case not able to show the coefficients are nowhere near zero (the usual use of the standard error estimates from a model summary!). It has always been a strange feature of logistic regression that it has problems with (and has to be defended from) data that is “too good.” Many other methods (like decision trees) do not have this issue.

The solution is simple: add a regularization term in the optimizer (or add reasonable prior if you are a Bayesian). Regularizing of course spoils the coefficient error-bar calculations; but you could either work out the math for error bars on a regularized estimate- or empirically estimate error bars by some sort of Bootstrap or empirical re-sampling scheme.

Unfortunately trying to regularize by adding some fuzzy data does not work until we add a fairly significant perturbation to the data:

```> d\$wt <- 1
> fuzz <- data.frame(x=c(1,1,0,0),y=c(1,0,1,0))
> fuzz\$wt <- 0.5
> d2 <- rbind(d,fuzz)
Warning message:
In eval(expr, envir, enclos) : non-integer #successes in a binomial glm!
> summary(m2)

Call:
glm(formula = y ~ x, family = binomial(link = "logit"), data = d2,
weights = wt)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-1.9878  -0.5099  -0.5099  -0.5099   2.0516

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)    3.932      1.428   2.754  0.00589 **
x             -5.906      1.444  -4.090 4.31e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 239.37  on 228  degrees of freedom
Residual deviance: 153.95  on 227  degrees of freedom
AIC: 151.75

Number of Fisher Scoring iterations: 6
```

But to even try these fixes you would have to know you have a problem. Right now the only sign of a problem are the enormous error bars.

Categories: Statistics

### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

### 3 replies ›

1. mark leeds says:

Hi John: I agree that some kind of warning should be issued but the fact that
the standard errors are not only large but identical kind of tells you that the
maximization of the likelihood probably failed.

My guess is that it’s not so much that Y = 0 has no zero values for X but rather that
this is true AND there are approximately the same number of X=0 and X=1 for Y=1.
Therefore, no matter how negative x becomes, the likelihood doesn’t increase.
It’s as if there’s a region where the likelihood is perfectly flat for x is greater > xstar.

But I agree that optim ( or whatever glm uses ) should give some kind of warning
that it didn’t converge. Interesting example.

2. Z says:

I agree that it would be very helpful if R issued a warning in this situation. For some datasets a warning about probabilities that are numerically 0 or 1 is given but this can also occur in many other (possibly harmless) situations. A better warning is given by the glm function in the safeBinaryRegression package. A Bayesian approach is implemented in bayesglm from package arm. Finally, bias-reduced estimation can be carried out via brglm from the package of the same name which is typically what I do in practice.

3. @Z Thanks. Yeah the “fitted values that are 0 or 1” type warning was what I had been expecting from a problem like this.