A quick R mini-tip: don’t use `data.matrix`

when you mean `model.matrix`

. If you do so you may lose (without noticing) a lot of your model’s explanatory power (due to poor encoding).

For some modeling tasks you end up having to prepare a special expanded data matrix before calling a given machine learning algorithm. For example the `randomForest`

package advises:

For large data sets, especially those with large number of variables, calling randomForest via the formula interface is not advised: There may be too much overhead in handling the formula.

Which means you may want to prepare a matrix of exactly the values you want to use in your prediction (versus using the more common and convenient formula interface). As R supplies good tools for this, this is not a big problem. Unless you (accidentally) use `data.matrix`

as shown below.

```
```d <- data.frame(x=c('a','b','c'),y=c(1,2,3))
print(data.matrix(d))
## x y
## [1,] 1 1
## [2,] 2 2
## [3,] 3 3

Notice in the above `x`

is converted from its original factor levels `'a'`

, `'b'`

and `'c'`

to the numeric quantities `1`

, `2`

and `'3`

. The problem is this introduces an unwanted order relation in the `x`

-values. For example any linear model is going to be forced to treat the effect of `x='b'`

as being between the modeled effects of `x='a'`

and `x='c'`

even if this is not an actual feature of the data. Now there are cases when you want ordinal constraints and there are ways (like GAMs) to learn non monotone relations on numerics. But really what has happened is you have not used a rich enough encoding of this factor.

Usually what you want is `model.matrix`

which we demonstrate below:

```
```print(model.matrix(~0+x+y,data=d))
## xa xb xc y
## 1 1 0 0 1
## 2 0 1 0 2
## 3 0 0 1 3
## attr(,"assign")
## [1] 1 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$x
## [1] "contr.treatment"

The `0+`

notation is telling R to not add an “intercept” column (a column whose value is always `1`

, a useful addition when doing linear regression). In this case it also had the side-effect of allowing us to see all three derived indicator variables (usually all but one are shown, more on this later).

What `model.matrix`

has done is used the idea of indicator variables (implemented through `contrasts`

) to re-encode the single string-valued variable `x`

as a set of indicators. The three possible values (or levels) of `x`

(`'a'`

, `'b'`

and `'c'`

) are encoded as three new variables: `xa`

, `xb`

and `xc`

. These new variables are related to the original `x`

as follows:

`x` |
`xa` |
`xb` |
`xc` |

`'a'` |
`1` |
`0` |
`0` |

`'b'` |
`0` |
`1` |
`0` |

`'c'` |
`0` |
`0` |
`1` |

It is traditional to suppress one of the derived variables (in this case `xa`

) yielding the following factor as a set of indicators representation:

`x` |
`xb` |
`xc` |

`'a'` |
`0` |
`0` |

`'b'` |
`1` |
`0` |

`'c'` |
`0` |
`1` |

And this is what we see if we don’t add the `0+`

notation to our `model.matrix`

formula:

```
```print(model.matrix(~x+y,data=d))
## (Intercept) xb xc y
## 1 1 0 0 1
## 2 1 1 0 2
## 3 1 0 1 3
## attr(,"assign")
## [1] 0 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$x
## [1] "contr.treatment"

When you use the formula interface R performs these sort of conversions for you under the hood. This is why seemingly pure numeric models (like `lm()`

) can use string-valued variables. R performing these conversions spares the analyst a lot of messy column bookkeeping. Once you get used to this automatic conversion you really come to miss it (such as in `scikit-learn`

‘s random forest implementation).

The traditional reason to suppress the `xa`

variable is there is a redundancy when we use all the indicator variables. Since the indicators always sum to one we can always infer one missing indicator by subtracting the sum of all the others from one. In fact this redundancy is an linear dependence among the indicators- and this can actually cause trouble for some naive implementations of linear regression (though we feel this is much better handled using L2 regularization ideas).

At some point you are forced to “production harden” your code and deal directly with level encodings yourself. The built-in R encoding scheme is not optimal for factor with large numbers of distinct levels, rare levels (whose fit coefficients won’t achieve statistical significance during training), missing values, or the appearance of new levels after training. It is a simple matter of feature engineering to deal with each of these situations.

Some recipes for working with problem factors include:

- Modeling Trick: Impact Coding of Categorical Variables with Many Levels (deals with factors with large numbers of levels).
- Section 4.1.1 of Practical Data Science with R (deals with missing variables by introducing additional “masking variables”).

Novel values (factor levels not seen during training, but that show up during test or application) can be treated as missing values, or you can work out additional feature engineering ideas to improve model performance.

Categories: Pragmatic Data Science Tutorials

### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.