Model-level fit summaries can be tricky in R: a quick read of the fit summary for factor levels can be misleading. We describe the issue and demonstrate techniques for dealing with it.

When modeling you often encounter what are commonly called categorical variables, which R calls factors. The possible values of a categorical variable or factor are called levels. We will demonstrate some of the care needed in interpreting level-based summaries, using R as an example.

Consider the following data frame:

```
d <- data.frame(x=c('a','a',
                    'b','b','b','b','b','b','b','b','b','b','b','b',
                    'c','c','c','c','c','c','c','c','c','c'),
                y=c(-5,5,
                    1,1,1,1,1,1,1,1,1,1,1,0.8,
                    -1,-1,-1,-1,-1,-1,-1,-1,-1,-1))
```

When we use such a frame in a model (such as an `lm()`) the categorical variable `x` needs to be converted into a numerical representation to be usable. The common way to do this is to introduce a new zero/one indicator variable for each observed value or level. For our input `x` we would expect three new numeric indicator variables: `xa`, `xb` and `xc`. The values of `x` are then encoded in these three variables as given in the following table:

| | xa | xb | xc |
|---|---|---|---|
| x=='a' | 1.00 | 0.00 | 0.00 |
| x=='b' | 0.00 | 1.00 | 0.00 |
| x=='c' | 0.00 | 0.00 | 1.00 |
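In R you can see this full indicator encoding directly with `model.matrix()` (a quick sketch, rebuilding the `x` column of `d` from above and suppressing the intercept so no level is dropped):

```r
# rebuild the x column from the data frame above
x <- factor(c(rep('a', 2), rep('b', 12), rep('c', 10)))
# "0 +" suppresses the intercept, so model.matrix() emits one
# indicator column per level: xa, xb, xc
mm <- model.matrix(~ 0 + x, data=data.frame(x=x))
colnames(mm)           # "xa" "xb" "xc"
# every row sums to 1, exhibiting the dependence xa + xb + xc = 1
all(rowSums(mm) == 1)  # TRUE
```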

This sort of bookkeeping rapidly gets painful, so R has a lot of code deep in its implementation of factors and contrasts to perform this encoding for you. However, no matter how much you mess with `contrasts()`, `C()` or other commands (swapping the default `treatment` contrasts for alternatives like `contr.sum` and so on), you cannot get around the fact that R really, really wants to use 2 variables to encode 3 levels (and it will fight mightily to prevent you from changing this). Largely this is because R is trying to prevent a linear dependence from making it to the fitter. Notice in the table above all encodings obey the linear relation `xa + xb + xc = 1`. This means one of these variables is redundant: a reasonable linear model like `xa + xb + xc` is equivalent to the unreasonable model `1000000 xa + 1000000 xb + 1000000 xc - 999999`. Now my strong preference would be to use modeling code that can deal with this through simple methods like ridge regression. But R, like most traditional stat programs, doesn't trust its linear algebra system to deal with linear dependencies, so it doesn't allow the above encoding and forces us to use variations of the encoding seen in the next table:
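As a rough sketch of the ridge idea (using `MASS::lm.ridge` on hand-built redundant indicators; the penalty value `lambda=0.1` is an arbitrary illustrative choice, not a tuned one):

```r
library(MASS)  # for lm.ridge

d <- data.frame(
  x=c(rep('a', 2), rep('b', 12), rep('c', 10)),
  y=c(-5, 5, rep(1, 11), 0.8, rep(-1, 10)))
# build the full (redundant) indicator encoding by hand
enc <- as.data.frame(model.matrix(~ 0 + x, data=d))
enc$y <- d$y
# a small ridge penalty keeps the fit well-posed despite
# the linear dependence xa + xb + xc = 1
fit <- lm.ridge(y ~ xa + xb + xc, data=enc, lambda=0.1)
coef(fit)
```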

| | xb | xc |
|---|---|---|
| x=='a' | 0.00 | 0.00 |
| x=='b' | 1.00 | 0.00 |
| x=='c' | 0.00 | 1.00 |
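This is exactly R's default `treatment` contrast encoding, which you can inspect directly (a small sketch on a factor with the same levels):

```r
x <- factor(c('a', 'b', 'c'))
# the default treatment contrasts drop the first level ('a'),
# encoding it as the all-zero row
contrasts(x)
#   b c
# a 0 0
# b 1 0
# c 0 1
```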

In this encoding both `b` and `c` are written as differences from the "hidden value" `a`. We no longer have any redundancy, as `a`'s encoding (`[0,0]`) no longer totals to 1. However, this leads to a new subtle problem in fit reporting. Consider fitting a linear model `y ~ a x + b` on the example data:

```
m <- lm(y~x, data=d)
summary(m)

Call:
lm(formula = y ~ x, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.0000  0.0000  0.0083  0.0167  5.0000 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.541e-15  1.091e+00   0.000    1.000
xb           9.833e-01  1.179e+00   0.834    0.414
xc          -1.000e+00  1.196e+00  -0.836    0.412

Residual standard error: 1.544 on 21 degrees of freedom
Multiple R-squared: 0.3002, Adjusted R-squared: 0.2336
F-statistic: 4.505 on 2 and 21 DF, p-value: 0.02355
```

The default factor encoding chose `a` as the reference or hidden level. The coefficients on the remaining two levels (`xb` and `xc`) now represent how much boost you get relative to the hidden level. In this case there are large error bars around the hidden level (it only occurred twice, and with wildly different `y` values). So we have horrid t-values and no reported evidence that the effects of `b` and `c` are significantly different from `a`. But this is misleading. This report does not tell us whether there is a significant difference between `xb` and `xc`; the only reason we can't tell if they are far from the hidden estimate of `xa` is uncertainty in estimating the (hidden) coefficient of `xa` itself. Uncertainties like this are not distances and don't obey the triangle inequality or transitivity: not being able to show either of `xb` or `xc` is far from `xa` does not mean we cannot show `xb` is far from `xc`. It just isn't in this report.

To fix this we apply a function to "re-level" the factor `x` so the hidden level becomes the most popular value (that is, we choose which level is hidden, and choose it in such a way that we expect its error bars to be small):

```
setRefToMostCommonLevel <- function(f) {
  f <- as.factor(f)
  t <- table(f)
  relevel(f, ref=as.integer(which(t>=max(t))[[1]]))
}
d$x <- setRefToMostCommonLevel(d$x)
d$x
m <- lm(y~x, data=d)
summary(m)
```

With the most popular level (`b`) as the reference we get the following result:

```
Call:
lm(formula = y ~ x, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.0000  0.0000  0.0083  0.0167  5.0000 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   0.9833     0.4456   2.207  0.03860 * 
xa           -0.9833     1.1789  -0.834  0.41362   
xc           -1.9833     0.6609  -3.001  0.00681 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.544 on 21 degrees of freedom
Multiple R-squared: 0.3002, Adjusted R-squared: 0.2336
F-statistic: 4.505 on 2 and 21 DF, p-value: 0.02355
```

We now have significance at `p ≤ 0.00681` (which is good: smaller is better for reported significances) that the outcome (`y`) difference between the level `c` and the hidden level `b` is unlikely under the null hypothesis.
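If you want to pull such a reported p-value out programmatically rather than reading the printed table, the coefficient matrix returned by `summary()` can be indexed by name (a small sketch, rebuilding the releveled fit):

```r
d <- data.frame(
  x=c(rep('a', 2), rep('b', 12), rep('c', 10)),
  y=c(-5, 5, rep(1, 11), 0.8, rep(-1, 10)))
d$x <- relevel(as.factor(d$x), ref='b')
m <- lm(y ~ x, data=d)
# summary(m)$coefficients is a numeric matrix with rows per term
# and columns "Estimate", "Std. Error", "t value", "Pr(>|t|)"
summary(m)$coefficients['xc', 'Pr(>|t|)']  # ~0.00681
```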

Of course what was hiding the significance report was large error bars on the reference level. So we could instead try to select the reference level so as to minimize the size of the reference error bars. Here is code that picks the level with minimum sample variance in outcome (a trade-off between the number of observations and the variation of the associated `y` values):

```
setRefToLeastVarLevel <- function(f, y) {
  f <- as.factor(f)
  t <- sapply(levels(f), function(fi) var(y[f==fi]))
  if(all(sapply(t, is.na))) {
    setRefToMostCommonLevel(f)
  } else {
    relevel(f,
            ref=as.integer(which((!sapply(t, is.na))&(t<=min(t, na.rm=T)))[[1]]))
  }
}
d$x <- setRefToLeastVarLevel(d$x, d$y)
d$x
m <- lm(y~x, data=d)
summary(m)
```

This picks `c` as the reference level of minimum variance, resulting in the following:

```
Call:
lm(formula = y ~ x, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.0000  0.0000  0.0083  0.0167  5.0000 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  -1.0000     0.4881  -2.049  0.05320 . 
xb            1.9833     0.6609   3.001  0.00681 **
xa            1.0000     1.1957   0.836  0.41237   
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.544 on 21 degrees of freedom
Multiple R-squared: 0.3002, Adjusted R-squared: 0.2336
F-statistic: 4.505 on 2 and 21 DF, p-value: 0.02355
```

Another approach is to look for summaries that are invariant under the choice of reference level. One such summary, though it is a clinical effect size and not a significance, is the standard deviation of the level coefficients (with the hidden level's zero coefficient added back into the vector). This statistic is the root mean square coefficient size and is invariant over the choice of reference level (precisely because we append a zero to represent the coefficient of the reference level):

```
sd(c(m$coefficients[grep('^x', names(m$coefficients))], 0))
```

```
[1] 0.9916783
```
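As a quick consistency check of this invariance, here is a sketch that rebuilds `d` and computes the statistic under two different reference levels:

```r
d <- data.frame(
  x=c(rep('a', 2), rep('b', 12), rep('c', 10)),
  y=c(-5, 5, rep(1, 11), 0.8, rep(-1, 10)))
statFor <- function(ref) {
  d$x <- relevel(as.factor(d$x), ref=ref)
  m <- lm(y ~ x, data=d)
  co <- m$coefficients[grep('^x', names(m$coefficients))]
  sd(c(co, 0))  # append 0 for the reference level's own coefficient
}
statFor('a')  # 0.9916783
statFor('b')  # 0.9916783: same value, regardless of reference level
```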

Unfortunately there is no easy `relevel()`-invariant significance term: when the reference level estimate is itself uncertain, all reported differences are also uncertain; and when the reference level estimate is certain, the other levels can still come back either significant or insignificant.

The lesson is that you have to be a bit careful in reading the significance summaries on factor levels. Model summary reports only describe how each level relates to the hidden level (not the value of the level relative to zero). So the items in the report could be facts about the levels you see, or could be facts about the hidden level; you can't tell which. This is fairly traditional in statistics.

We feel a better approach would be for the fitter to not use hidden levels (add the redundant level back in) and then enforce a bias toward small coefficients through an explicit regularization term. In fact many forms of regularization don't really make sense until you get more explicit control of the level encoding. Another fix would be to add a linear constraint stating that the expected impact of the fit level coefficients is zero over the training data (which would eat one degree of freedom and make handling missing values easier). This would be written as `ca xa + cb xb + cc xc = 0`, where `ca`, `cb` and `cc` are the number of training rows with each of the given levels. But this thinking in terms of effects is at odds with the traditional design of factors and contrasts in R, so it would take a lot of visible effort to trick R into consistently working in this way (and all of these approaches would complicate the computation of significance levels a bit).

We advise: take just a little care and experiment with `relevel()` when appropriate. Always be wary when reading reported model coefficient significance tables.

Categories: Pragmatic Data Science Tutorials

### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

Really useful article. I have been dealing with this issue recently.

“Another fix would be to add a linear constraint that states the expected impact of the fit level coefficients is zero over the training data (which would eat one degree of freedom and make handing missing values easier, this would be written as ca xa + cb xb + cc xc = 0; where ca,cb and cc are the number of training rows with each of the given levels).”

Can you provide an R code example of this? Thanks.

@James Thanks and good question. No I don’t have any code to force the expected impact of the levels to all be zero. The math is easy (linear constraint in addition to least squares). How to trick R into doing this (and still have something meaningful for significances and so on) doesn’t seem to be straightforward.

@jmount

I had a go at it myself. I’d be interested to hear your thoughts on how this should be interpreted.

```
d$a = as.numeric(d$x == 'a')
d$b = as.numeric(d$x == 'b')
d$c = as.numeric(d$x == 'c')
ka = table(d$x)['a']
kb = table(d$x)['b']
kc = table(d$x)['c']
# initial model:
# y = xa*a + xb*b + xc*c
# constraint:
# ka*xa + kb*xb + kc*xc = 0
# rearrange:
# xc = (-ka/kc)*xa + (-kb/kc)*xb
# substitute for xc in initial model:
# y = xa*a + xb*b + ((-ka/kc)*xa + (-kb/kc)*xb)*c
# rearrange:
# y = xa*(a + (-ka/kc)*c) + xb*(b + (-kb/kc)*c)
# encode new variables:
d$ac = d$a + (-ka/kc)*d$c
d$bc = d$b + (-kb/kc)*d$c
m2 = lm(y~ac+bc, data=d)
summary(m2)
# try with a as reference level rather than c:
d$ba = d$b + (-kb/ka)*d$a
d$ca = d$c + (-kc/ka)*d$a
m3 = lm(y~ba+ca, data=d)
summary(m3)
# try with b as reference level:
d$ab = d$a + (-ka/kb)*d$b
d$cb = d$c + (-kc/kb)*d$b
m4 = lm(y~ab+cb, data=d)
summary(m4)
# note that standard errors and significance levels are invariant
# regardless of the reference level
```

@James

Just awesome. Making the indicator variables explicit and explicitly managing the algebra is a great approach. If you have blog of your own you would like to link to just let me know.