R Tip: Put Your Values in Columns

Today’s R tip is: put your values in columns.

Some R users use different seemingly clever tricks to bring data to an analysis.

Here is an (artificial) example.

chamber_sizes <- mtcars$disp/mtcars$cyl
form <- hp ~ chamber_sizes
model <- lm(form, data = mtcars)
print(model)
# Call:
# lm(formula = form, data = mtcars)
#
# Coefficients:
#   (Intercept)  chamber_sizes  
#         2.937          4.104  

Notice: one of the variables came from a vector in the environment, not from the primary data.frame. chamber_sizes was first looked for in the data.frame, then in the environment where the formula was defined (which happens to be the global environment), and (if that hadn’t worked) in the executing environment (which is again the global environment).
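The lookup described above can be checked directly. The following is a small sketch, not part of the original example: it works on a copy named d (a name introduced here only for illustration) so the built-in mtcars is left alone, and it shows that the model can only be fit while the free-floating vector still exists.

```r
# Work on a copy so the built-in mtcars data set is left untouched.
d <- mtcars
chamber_sizes <- d$disp / d$cyl   # a free-floating vector, not a column of d
form <- hp ~ chamber_sizes

# The formula carries the environment it was created in.
print(environment(form))

# chamber_sizes is not a column, so lm() must find it outside the data.frame.
print("chamber_sizes" %in% colnames(d))

# Once the vector is removed, the very same call can no longer be evaluated.
rm(chamber_sizes)
result <- tryCatch(
  lm(form, data = d),
  error = function(e) conditionMessage(e)
)
print(result)
```

The model call itself never changed; only the surrounding environment did. That is the hidden dependency this tip is about.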

Our advice is: do not do that. Place all of your values in columns. Make it unambiguous that all variables are names of columns in your data.frame of interest. This allows you to write simple code that works over explicit data. The style we recommend looks like the following.

mtcars$chamber_sizes <- mtcars$disp/mtcars$cyl
form <- hp ~ chamber_sizes
model <- lm(form, data = mtcars)
print(model)
# Call:
# lm(formula = form, data = mtcars)
#
# Coefficients:
#   (Intercept)  chamber_sizes  
#         2.937          4.104  

The only difference is we took the time to place the derived vector into the data frame we are working with (assigned to mtcars$chamber_sizes instead of the global environment in the first line). This is a very organized way to work, and as you see it does not take much effort.
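One way to make the "everything is a column" rule checkable is to compare the variables a formula mentions against the column names of the data. This is only a sketch under the assumption that base R's all.vars() is an adequate way to list those variables; the names d and missing_vars are introduced here for illustration.

```r
# Work on a copy so the built-in mtcars data set is left untouched.
d <- mtcars
form <- hp ~ chamber_sizes

# Before the column is added, the formula refers to something outside the data.
missing_before <- setdiff(all.vars(form), colnames(d))
print(missing_before)

# Place the derived values in a column, then check again.
d$chamber_sizes <- d$disp / d$cyl
missing_vars <- setdiff(all.vars(form), colnames(d))
stopifnot(length(missing_vars) == 0)

model <- lm(form, data = d)
print(model)
```

A check like this can sit just before the modeling call, so a variable that would otherwise be picked up silently from an environment is reported instead.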

Or use only existing values, as we show below.

form <- hp ~ I(disp/cyl)
model <- lm(form, data = mtcars)
print(model)
# Call:
# lm(formula = form, data = mtcars)
# 
# Coefficients:
# (Intercept)  I(disp/cyl)  
#       2.937        4.104  

This is something we teach: with some care you can reliably treat variables as strings, and this is in no way inferior to complex systems such as stats::formula or rlang::quosure. The fact that these objects carry around an environment in addition to the names is in fact a barrier to reliable code, not an unmitigated advantage.

I am not alone in this opinion.

If the formula was typed in by the user interactively, then the call came from the global environment, meaning that variables not found in the data frame, or all variables if the data argument was missing, will be looked up in the same way they would in ordinary evaluation. But if the formula object was precomputed somewhere else, then its environment is the environment of the function call that created it. That means that arguments to that call and local assignments in that call will define variables for use in the model parent (that is, enclosing) environment of the call, which may be a package namespace. These rules are standard for R, at least once one knows that an environment attribute has been assigned to the formula. They are similar to the use of closures described in Section 5.4, page 126.

Where clear and trustworthy software is a priority, I would personally avoid such tricks. Ideally, all the variables in the model frame should come from an explicit, verifiable data source, typically a data frame object that is archived for future inspection (or equivalently, some other equally well-defined source of data, either inside or outside R, that is used explicitly to construct the data for the model).

Software for Data Analysis (Springer 2008), John M. Chambers, Chapter 6, section 9, page 221.
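The situation Chambers describes, a formula precomputed inside a function, can be sketched as follows. The helper make_formula() and the constant k are hypothetical names invented for this illustration; the point is only that the local k, not the global one, is what the model sees.

```r
k <- 1000  # a global value a reader of the lm() call might expect to be used

# Hypothetical helper: the formula it returns remembers this function's
# execution environment, including the local k.
make_formula <- function() {
  k <- 2
  hp ~ I(disp / k)
}

form <- make_formula()
model <- lm(form, data = mtcars)

# The same model with the constant written out explicitly.
explicit <- lm(hp ~ I(disp / 2), data = mtcars)

# The fit agrees with k = 2 (the captured local), not with the global k.
same_as_local <- isTRUE(all.equal(unname(coef(model)), unname(coef(explicit))))
print(same_as_local)
```

Nothing at the call site of lm() reveals which k was used, which is the kind of trick the quoted passage recommends avoiding.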

Chambers’ critique applies equally to stats::formula or rlang::quosure, and roughly he is calling over-use of such environment-carrying objects an anti-pattern.

This is why we say that, from the user's point of view, variables can be treated as mere names or strings. With some care you can ensure all your values are coming from a single data.frame. And if that is the case, variables are column names.
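Treating variables as plain strings can look like the sketch below. It assumes base R's reformulate() is an acceptable way to turn column names into a formula; the variable names outcome and inputs are introduced here only for illustration.

```r
# Variables are just column names held as strings.
outcome <- "hp"
inputs <- c("disp", "cyl")

# Confirm every name is a column of the data before modeling.
stopifnot(all(c(outcome, inputs) %in% colnames(mtcars)))

# Build the formula from the names and fit the model.
form <- reformulate(inputs, response = outcome)
model <- lm(form, data = mtcars)
print(form)
print(names(coef(model)))
```

Because the names are ordinary character vectors, they can be stored, checked, and passed between functions without any environment coming along.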

Going to extra effort to carry around bound variables (variable names, plus an environment resolving the name to a value) is silly and a big source of reference leaks. Roughly: if you don’t know the value of a variable then pass it as a name or string (as that is all an unbound variable or symbol is); if you do know the value then use that value (the variable is serving little purpose at that point). Being able to replace variables with values is the hallmark of referential transparency: the property of expressions that are well-behaved in the sense that replacing them with the values they refer to does not change observable program behavior. There is code that breaks when you replace variables with values, but that should be considered a limitation of such code (not a merit).
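The reference leak mentioned above can be made visible with a small sketch. The helpers make_formula_leaky() and make_formula_lean() are hypothetical names for this illustration; the second one assumes that building the formula from a string with an explicit environment is an acceptable workaround.

```r
# Hypothetical helper: the returned formula keeps the whole execution
# environment alive, including a large temporary it never mentions.
make_formula_leaky <- function() {
  big <- numeric(1e6)
  hp ~ disp
}

form <- make_formula_leaky()
captured <- ls(environment(form))
print(captured)

# Hypothetical alternative: build the formula from a string and attach an
# explicit environment, so the temporary is not retained.
make_formula_lean <- function() {
  big <- numeric(1e6)
  as.formula("hp ~ disp", env = globalenv())
}

lean <- make_formula_lean()
lean_is_global <- identical(environment(lean), globalenv())
print(lean_is_global)

# Both formulas describe the same model over the same columns.
print(all.equal(coef(lm(form, data = mtcars)), coef(lm(lean, data = mtcars))))
```

The formula text is identical in both cases; the only difference is what each object silently holds on to.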

jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

4 replies

  1. Thanks for reminding us of these details. You always have interesting and thought-provoking posts. Note that your third example has a copy/paste boo-boo, there is no chamber_size in that model. Numerically, it is the same of course, but the label should be I(disp/cyl).

    1. Thank you so much for the correction. I had in fact messed up the copy/paste and I have now updated the article. Just more evidence I need to minimize the amount of copying and pasting in my life.

  2. For those that are into that sort of thing: R had a complete quasiquotation mechanism prior to rlang. R has bquote() (see help(bquote)). The issue is: this sort of mechanism is oriented to substituting in values, not to substituting in names for further evaluation (at least not on the left-hand sides of assignments). This is also part of why I wrote wrapr::let(), which enforces only name-for-name substitution, and so has very reliable and predictable semantics.

  3. Another perspective:

    Many modelling and graphical functions have a formula argument and a data argument. If variables in the formula were required to be in the data argument life would be a lot simpler, but this requirement was not made when formulas were introduced. Authors of modelling and graphics functions are thus required to implement a limited form of dynamic scope, which they have not done in an entirely consistent way.

    Thomas Lumley, “Standard nonstandard evaluation rules”, March 19, 2003
