Menu Home

Programming Over lm() in R

Here is simple modeling problem in R.

We want to fit a linear model where the names of the data columns carrying the outcome to predict (y), the explanatory variables (x1, x2), and per-example row weights (wt) are given to us as string values in variables.

Lets start with our example data and parameters. The point is: we are assuming the data and parameters come to us as arguments and are not known at the time of writing the script or program.

# our inputs
d <- data.frame(
  x1 = c(1, 2, 3, 4), 
  x2 = c(0, 0, 1, 1),
  y = c(3, 3, 4, -5), 
  wt = c(1, 2, 1, 1))
knitr::kable(d)
x1 x2 y wt
1 0 3 1
2 0 3 2
3 1 4 1
4 1 -5 1
outcome_name <- "y"
explanatory_vars <- c("x1", "x2")  
name_for_weight_column <- "wt"

For everything except the weights this is easy, as the linear regression function lm() is willing to take strings in its first argument “formula” (and also, there are tools for building up formula objects).

# start our generic solution
formula_str <- paste(
  outcome_name, 
  "~", 
  paste(explanatory_vars, collapse = " + "))
print(formula_str)
## [1] "y ~ x1 + x2"
model <- lm(formula_str, 
            data = d)

print(model)
## 
## Call:
## lm(formula = formula_str, data = d)
## 
## Coefficients:
## (Intercept)           x1           x2  
##        9.75        -4.50         5.50
format(model$terms)
## [1] "y ~ x1 + x2"

However, once we try to add weights we have problems.

lm(formula_str, 
   data = d, 
   weights = name_for_weight_column)
## Error in model.frame.default(formula = formula_str, data = d, weights = name_for_weight_column, : variable lengths differ (found for '(weights)')

This is a bit disappointing, as much of the point of working in R is being able to write parameterized scripts and programs over the R functions. So we really want to be able to take names of columns from an external source.

The reason is the following (taken from help(lm)):

All of weights, subset and offset are evaluated in the same way as variables in formula, that is first in data and then in the environment of formula.

This means the weights argument is not treated as a value, but instead the name typed in is captured through “non standard evaluation” (NSE). The data frame environment, and formula environment are searched for a column or value named “name_for_weight_column”, and not for one named “wt”.

This is a big hindrance to using lm() programmatically. The NSE notation style can be convenient when working interactively, but is often a burden when programming.

However, R has some ways out of the problem.

The first solution is bquote() which is part of R itself (in the base package).

eval(bquote(
  lm(
    formula_str,
    data = d,
    weights = .(as.name(name_for_weight_column)))
))
## 
## Call:
## lm(formula = formula_str, data = d, weights = wt)
## 
## Coefficients:
## (Intercept)           x1           x2  
##       9.429       -3.857        3.571

In the above, the .() notation indicates to replace the .()-expression with its evaluated value before evaluating the rest of the expression. This is a substitution principle based on escaping notation (also called quasiquoting).

Another solution is the let() function from the wrapr package (a user extension package, not part of R itself).

wrapr::let(
  c(NAME_FOR_WEIGHT_COLUMN = name_for_weight_column),
  lm(
    formula_str,
    data = d,
    weights = NAME_FOR_WEIGHT_COLUMN)
)
## 
## Call:
## lm(formula = formula_str, data = d, weights = wt)
## 
## Coefficients:
## (Intercept)           x1           x2  
##       9.429       -3.857        3.571

In the above, the left-hand sides of the named vector are symbols to be replaced and the right had sides refer to values to replace them with. The specification c(NAME_FOR_WEIGHT_COLUMN = name_for_weight_column) means to replace NAME_FOR_WEIGHT_COLUMN with wt (the value referred to by name_for_weight_column). This is a substitution principle based on named substitution targets.

Another solution is the !! notation from the rlang package (a user extension package, not part of R itself).

rlang::eval_tidy(rlang::quo(
  lm(
    formula_str,
    data = d,
    weights = !!as.name(name_for_weight_column))
))
## 
## Call:
## lm(formula = formula_str, data = d, weights = wt)
## 
## Coefficients:
## (Intercept)           x1           x2  
##       9.429       -3.857        3.571

In the above, the !! notation indicates to replace the !!-expression with its evaluated value before evaluating the rest of the expression. This is a substitution principle based on escaping notation (also called quasiquoting).

Note the argument types and/or the internals of lm() do not currently appear to allow the use of the newer rlang double curly brace notation (a notation that can replace !!rlang::enquo(), and possibly other forms).

rlang::eval_tidy(rlang::quo(
  lm(
    formula_str,
    data = d,
    weights = {{name_for_weight_column}})
))
## Error in model.frame.default(formula = formula_str, data = d, weights = ~"wt", : invalid type (language) for variable '(weights)'

We only mention the “{{}}” notation as rlang familiar readers are likely to wonder about using it.

All of the above solutions are using meta-programming tools. These are tools that let programs treat programs as data. As such they are very powerful. In fact they are big hammers for such a simple problem as specifying a column name that is already stored as data. However, the above tools should not be blamed for the awkwardness of having a need for them.

The need for these meta-programming tools comes from the over-reliance on NSE interfaces. Or more precisely: the lack of a value oriented interface (possibly in addition to the NSE interface). NSE designs emphasize specifying things in program text instead of in data. This violation of “The Rule of Representation: Fold knowledge into data so program logic can be stupid and robust” is what causes the problems.

Categories: Opinion Tutorials

Tagged as:

jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

11 replies

  1. Unless I’m mistaken, you can also use the following, I believe:

    lm(formula_str, 
       data = d, 
       weights = d$name_for_weight_column)
    
    1. That is a good point, but it doesn’t actually work. d$name_for_weight_column evaluates to NULL as there is no column named “name_for_weight_column“.

      However that reminded me to check d[[name_for_weight_column]] which does work the way you probably intended. It turns out that sometimes works, but isn’t reliable. It turns out if the data.frame has a column with the same name as the variable name we are using to refer to the data.frame things break. The reason being: in the lm() execution environment d refers to the column and not the data.frame in this case.

      The whole issue is: lm() is exerting so much control on how things are evaluated that a lot of the normal expectations of working in R are violated. Mostly we don’t hit them, but they are lurking there.

      So thanks for the comment! You correctly pointed out I needed to check more.

      Below is an example showing right and wrong answers.

      d <- data.frame(
        x1 = c(1, 2, 3, 4), 
        x2 = c(0, 0, 1, 1),
        y = c(3, 3, 4, -5), 
        wt = c(1, 2, 1, 1))
      outcome_name <- "y"
      explanatory_vars <- c("x1", "x2")  
      name_for_weight_column <- "wt"
      
      formula_str <- paste(
        outcome_name, 
        "~", 
        paste(explanatory_vars, collapse = " + "))
      
      # unweighted answer (so we can see what it is)
      lm(formula_str, 
         data = d)
      #> 
      #> Call:
      #> lm(formula = formula_str, data = d)
      #> 
      #> Coefficients:
      #> (Intercept)           x1           x2  
      #>        9.75        -4.50         5.50
      
      # weighted answer, the correct answer
      eval(bquote(
        lm(
          formula_str,
          data = d,
          weights = .(as.name(name_for_weight_column)))
      ))
      #> 
      #> Call:
      #> lm(formula = formula_str, data = d, weights = wt)
      #> 
      #> Coefficients:
      #> (Intercept)           x1           x2  
      #>       9.429       -3.857        3.571
      
      # this sets weights to NULL, not giving the correct answer
      lm(formula_str, 
         data = d,
         weights = d$name_for_weight_column)
      #> 
      #> Call:
      #> lm(formula = formula_str, data = d, weights = d$name_for_weight_column)
      #> 
      #> Coefficients:
      #> (Intercept)           x1           x2  
      #>        9.75        -4.50         5.50
      d$name_for_weight_column
      #> NULL
      
      # this works, but only because there is no column named d!
      lm(formula_str, 
         data = d,
         weights = d[[name_for_weight_column]])
      #> 
      #> Call:
      #> lm(formula = formula_str, data = d, weights = d[[name_for_weight_column]])
      #> 
      #> Coefficients:
      #> (Intercept)           x1           x2  
      #>       9.429       -3.857        3.571
      d[[name_for_weight_column]]
      #> [1] 1 2 1 1
      
      d$d <- 2
      # now it fails
      lm(formula_str, 
         data = d,
         weights = d[[name_for_weight_column]])
      #> Error in d[[name_for_weight_column]]: subscript out of bounds
      d[[name_for_weight_column]]
      #> [1] 1 2 1 1
      
  2. Yes, you are right. I actualy had tried:

    weights = d[, name_for_weight_column]
    

    first, and it worked before “simplifying” it to:

    d$name_for_weight_column
    

    which I did not check Great work, anyway!

    1. Assuming you are creating functions for your programming, why not just pass name_for_weight_column as an argument whose value is a string, and then do weights = d[[substitute(name_for_weight_column)]] inside the function?

      1. I apologize, but I don’t think I understand your proposed solution.

        d <- data.frame(
          x1 = c(1, 2, 3, 4),
          x2 = c(0, 0, 1, 1),
          y = c(3, 3, 4, -5),
          wt = c(1, 2, 1, 1))
        outcome_name <- "y"
        explanatory_vars <- c("x1", "x2")
        name_for_weight_column <- "wt"
        
        f <- function(d,
                      outcome_name,
                      explanatory_vars,
                      name_for_weight_column) {
          formula_str <- paste(
            outcome_name,
            "~",
            paste(explanatory_vars, collapse = " + "))
          lm(formula_str,
             data = d,
             weights = d[[substitute(name_for_weight_column)]])
        }
        
        # fn answer (turns to be unweighted)
        f(d,
          outcome_name,
          explanatory_vars,
          name_for_weight_column)
        #> 
        #> Call:
        #> lm(formula = formula_str, data = d, weights = d[[substitute(name_for_weight_column)]])
        #> 
        #> Coefficients:
        #> (Intercept)           x1           x2  
        #>        9.75        -4.50         5.50
        
        formula_str <- paste(
          outcome_name,
          "~",
          paste(explanatory_vars, collapse = " + "))
        
        # unweighted solution
        lm(formula_str,
           data = d)
        #> 
        #> Call:
        #> lm(formula = formula_str, data = d)
        #> 
        #> Coefficients:
        #> (Intercept)           x1           x2  
        #>        9.75        -4.50         5.50
        
        # desired weighted solution
        lm(formula_str,
           data = d,
           weights = wt)
        #> 
        #> Call:
        #> lm(formula = formula_str, data = d, weights = wt)
        #> 
        #> Coefficients:
        #> (Intercept)           x1           x2  
        #>       9.429       -3.857        3.571
        
        # and the column coincidence issue
        d$d <- 2
        
        f(d,
          outcome_name,
          explanatory_vars,
          name_for_weight_column)
        #> Error in d[[substitute(name_for_weight_column)]]: subscript out of bounds
        
  3. Thank you John for this insightful blog post!
    So far, I always wanted to ask “why not simply pass by string?” – but who dares to criticize the holy grail of tidyverse?
    I totally agree with your criticism on the over reliance on NSE and the data-over-program text rule substantiates it.

    1. To be clear, this isn’t the tidyverse‘s fault/problem. The NSE is in lm() and we can’t pass a string as it doesn’t seem to accept one.

  4. There is a much simpler way to achieve this using call.

    #create the data
    set.seed(1313)
    dt % mutate(y = x1+x2+ rnorm(10))
    dt % mutate(wt = runif(10)) %>% mutate(wt = wt/sum(wt))
    
    #create the call
    
    exp <-  call("lm", formula = NULL, data = NULL, weight = NULL)
    exp$formula <- parse(text = "y~x1+x2")[[1]]
    exp$data <- as.name("dt")
    exp$weight <- as.name("wt")
    
    #eval the call
    
    eval(exp)
    
    #Call:
    #lm(formula = y ~ x1 + x2, data = dt, weights = wt)
    
    #Coefficients:
    #(Intercept)           x1           x2
    
    0.2191       1.2041      -0.1593
    

    As you can see you have a total control programmatically, as all the variables start as strings and then become expressions. Combine this with environment in eval and you have another level of control of the data.

    If you look inside the code of lm, or any other base R model functions, you will see that this is pretty standard R practice. I used this approach in my own package midasr and found it works quite well.

  5. The issue with the solution in the prior reply is that “wt” cannot be a name otherwise used in the formula. Here is another base R solution that suffers from the same shortcoming, but which I find to be slightly more straightforward. Starting with the print statement:

    print(formula_str)
    # Explicitly connect the environment of the formula with the formula
    e <- new.env()
    # Place the values of the variables from d into e with the same names
    for (nm in names(d)) assign(nm, d[[nm]], e)
    # Using your preferred name for the weights, place their values in e as well
    assign("name_for_weight_column", d[[name_for_weight_column]], e)
    # Create the formula
    fmla <- as.formula(formula_str, env = e)
    # This call gives your desired results.
    lm(fmla, weights = name_for_weight_column)

  6. What about passing matrices in?

    outc <- as.matrix(d[outcome_name])
    expv <- as.matrix(d[explanatory_vars])
    lm(outc ~ expv, weights = d[[name_for_weight_column]])

    There’s probably other complications to this I’m not thinking through, but it seems to work on the surface.

%d bloggers like this: