Here is simple modeling problem in R
.
We want to fit a linear model where the names of the data columns carrying the outcome to predict (y
), the explanatory variables (x1
, x2
), and per-example row weights (wt
) are given to us as string values in variables.
Lets start with our example data and parameters. The point is: we are assuming the data and parameters come to us as arguments and are not known at the time of writing the script or program.
# our inputs
d <- data.frame(
x1 = c(1, 2, 3, 4),
x2 = c(0, 0, 1, 1),
y = c(3, 3, 4, -5),
wt = c(1, 2, 1, 1))
knitr::kable(d)
x1 | x2 | y | wt |
---|---|---|---|
1 | 0 | 3 | 1 |
2 | 0 | 3 | 2 |
3 | 1 | 4 | 1 |
4 | 1 | -5 | 1 |
outcome_name <- "y"
explanatory_vars <- c("x1", "x2")
name_for_weight_column <- "wt"
For everything except the weights this is easy, as the linear regression function lm()
is willing to take strings in its first argument “formula
” (and also, there are tools for building up formula objects).
# start our generic solution
formula_str <- paste(
outcome_name,
"~",
paste(explanatory_vars, collapse = " + "))
print(formula_str)
## [1] "y ~ x1 + x2"
model <- lm(formula_str,
data = d)
print(model)
##
## Call:
## lm(formula = formula_str, data = d)
##
## Coefficients:
## (Intercept) x1 x2
## 9.75 -4.50 5.50
format(model$terms)
## [1] "y ~ x1 + x2"
However, once we try to add weights we have problems.
lm(formula_str,
data = d,
weights = name_for_weight_column)
## Error in model.frame.default(formula = formula_str, data = d, weights = name_for_weight_column, : variable lengths differ (found for '(weights)')
This is a bit disappointing, as much of the point of working in R
is being able to write parameterized scripts and programs over the R
functions. So we really want to be able to take names of columns from an external source.
The reason is the following (taken from help(lm)
):
All of weights, subset and offset are evaluated in the same way as variables in formula, that is first in data and then in the environment of formula.
This means the weights argument is not treated as a value, but instead the name typed in is captured through “non standard evaluation” (NSE). The data frame environment, and formula environment are searched for a column or value named “name_for_weight_column
”, and not for one named “wt
”.
This is a big hindrance to using lm()
programmatically. The NSE notation style can be convenient when working interactively, but is often a burden when programming.
However, R
has some ways out of the problem.
The first solution is bquote()
which is part of R
itself (in the base
package).
eval(bquote(
lm(
formula_str,
data = d,
weights = .(as.name(name_for_weight_column)))
))
##
## Call:
## lm(formula = formula_str, data = d, weights = wt)
##
## Coefficients:
## (Intercept) x1 x2
## 9.429 -3.857 3.571
In the above, the .()
notation indicates to replace the .()
-expression with its evaluated value before evaluating the rest of the expression. This is a substitution principle based on escaping notation (also called quasiquoting).
Another solution is the let()
function from the wrapr
package (a user extension package, not part of R
itself).
wrapr::let(
c(NAME_FOR_WEIGHT_COLUMN = name_for_weight_column),
lm(
formula_str,
data = d,
weights = NAME_FOR_WEIGHT_COLUMN)
)
##
## Call:
## lm(formula = formula_str, data = d, weights = wt)
##
## Coefficients:
## (Intercept) x1 x2
## 9.429 -3.857 3.571
In the above, the left-hand sides of the named vector are symbols to be replaced and the right had sides refer to values to replace them with. The specification c(NAME_FOR_WEIGHT_COLUMN = name_for_weight_column)
means to replace NAME_FOR_WEIGHT_COLUMN
with wt
(the value referred to by name_for_weight_column
). This is a substitution principle based on named substitution targets.
Another solution is the !!
notation from the rlang
package (a user extension package, not part of R
itself).
rlang::eval_tidy(rlang::quo(
lm(
formula_str,
data = d,
weights = !!as.name(name_for_weight_column))
))
##
## Call:
## lm(formula = formula_str, data = d, weights = wt)
##
## Coefficients:
## (Intercept) x1 x2
## 9.429 -3.857 3.571
In the above, the !!
notation indicates to replace the !!
-expression with its evaluated value before evaluating the rest of the expression. This is a substitution principle based on escaping notation (also called quasiquoting).
Note the argument types and/or the internals of lm()
do not currently appear to allow the use of the newer rlang
double curly brace notation (a notation that can replace !!rlang::enquo()
, and possibly other forms).
rlang::eval_tidy(rlang::quo(
lm(
formula_str,
data = d,
weights = {{name_for_weight_column}})
))
## Error in model.frame.default(formula = formula_str, data = d, weights = ~"wt", : invalid type (language) for variable '(weights)'
We only mention the “{{}}
” notation as rlang
familiar readers are likely to wonder about using it.
All of the above solutions are using meta-programming tools. These are tools that let programs treat programs as data. As such they are very powerful. In fact they are big hammers for such a simple problem as specifying a column name that is already stored as data. However, the above tools should not be blamed for the awkwardness of having a need for them.
The need for these meta-programming tools comes from the over-reliance on NSE interfaces. Or more precisely: the lack of a value oriented interface (possibly in addition to the NSE interface). NSE designs emphasize specifying things in program text instead of in data. This violation of “The Rule of Representation: Fold knowledge into data so program logic can be stupid and robust” is what causes the problems.
jmount
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
Unless I’m mistaken, you can also use the following, I believe:
That is a good point, but it doesn’t actually work.
d$name_for_weight_column
evaluates toNULL
as there is no column named “name_for_weight_column
“.However that reminded me to check
d[[name_for_weight_column]]
which does work the way you probably intended. It turns out that sometimes works, but isn’t reliable. It turns out if thedata.frame
has a column with the same name as the variable name we are using to refer to thedata.frame
things break. The reason being: in thelm()
execution environmentd
refers to the column and not thedata.frame
in this case.The whole issue is:
lm()
is exerting so much control on how things are evaluated that a lot of the normal expectations of working inR
are violated. Mostly we don’t hit them, but they are lurking there.So thanks for the comment! You correctly pointed out I needed to check more.
Below is an example showing right and wrong answers.
Yes, you are right. I actualy had tried:
first, and it worked before “simplifying” it to:
which I did not check Great work, anyway!
Assuming you are creating functions for your programming, why not just pass name_for_weight_column as an argument whose value is a string, and then do weights = d[[substitute(name_for_weight_column)]] inside the function?
I apologize, but I don’t think I understand your proposed solution.
Thank you John for this insightful blog post!
So far, I always wanted to ask “why not simply pass by string?” – but who dares to criticize the holy grail of tidyverse?
I totally agree with your criticism on the over reliance on NSE and the data-over-program text rule substantiates it.
To be clear, this isn’t the
tidyverse
‘s fault/problem. The NSE is inlm()
and we can’t pass a string as it doesn’t seem to accept one.There is a much simpler way to achieve this using call.
As you can see you have a total control programmatically, as all the variables start as strings and then become expressions. Combine this with environment in eval and you have another level of control of the data.
If you look inside the code of lm, or any other base R model functions, you will see that this is pretty standard R practice. I used this approach in my own package midasr and found it works quite well.
That is a good solution, thank you for sharing it.
The issue with the solution in the prior reply is that “wt” cannot be a name otherwise used in the formula. Here is another base R solution that suffers from the same shortcoming, but which I find to be slightly more straightforward. Starting with the print statement:
print(formula_str)
# Explicitly connect the environment of the formula with the formula
e <- new.env()
# Place the values of the variables from d into e with the same names
for (nm in names(d)) assign(nm, d[[nm]], e)
# Using your preferred name for the weights, place their values in e as well
assign("name_for_weight_column", d[[name_for_weight_column]], e)
# Create the formula
fmla <- as.formula(formula_str, env = e)
# This call gives your desired results.
lm(fmla, weights = name_for_weight_column)
What about passing matrices in?
outc <- as.matrix(d[outcome_name])
expv <- as.matrix(d[explanatory_vars])
lm(outc ~ expv, weights = d[[name_for_weight_column]])
There’s probably other complications to this I’m not thinking through, but it seems to work on the surface.