Very Non-Standard Calling in R

Our group has done a lot of work with non-standard calling conventions in R.

Our tools work hard to eliminate non-standard calling (as is the purpose of wrapr::let()), or at least make it cleaner and more controllable (as is done in the wrapr dot pipe). Even so, we still get surprised by some of the side effects and ill consequences of the overuse of non-standard calling conventions in R.

Please read on for a recent example.

Consider the following calls to stats::lm(), and notice that the third example fails (throws an error).

# works
lm("y ~ x", 
   data = data.frame(
     x=1:5, 
     y = c(1, 1, 2, 2, 2)), 
   weights = numeric(5)+1)
#> 
#> Call:
#> lm(formula = "y ~ x", data = data.frame(x = 1:5, y = c(1, 1, 
#>     2, 2, 2)), weights = numeric(5) + 1)
#> 
#> Coefficients:
#> (Intercept)            x  
#>         0.7          0.3


# works
f1 <- function(w = NULL) {
  lm(as.formula("y ~ x"), 
     data = data.frame(
       x=1:5, 
       y = c(1, 1, 2, 2, 2)), 
     weights = w)
}
f1(numeric(5)+1)
#> 
#> Call:
#> lm(formula = as.formula("y ~ x"), data = data.frame(x = 1:5, 
#>     y = c(1, 1, 2, 2, 2)), weights = w)
#> 
#> Coefficients:
#> (Intercept)            x  
#>         0.7          0.3


# fails
f2 <- function(w = NULL) {
  lm("y ~ x", 
     data = data.frame(
       x=1:5, 
       y = c(1, 1, 2, 2, 2)), 
     weights = w)
}
f2(numeric(5)+1)
#> Error in eval(extras, data, env): object 'w' not found

According to the stats::lm() documentation (help(lm)), the first argument must be:

an object of class “formula” (or one that can be coerced to that class): a symbolic description of the model to be fitted. The details of model specification are given under ‘Details’.

A string appears to be coercible into a formula, so all three examples should work. However, typing “print(lm)” reveals the issue: stats::lm() doesn’t take the “weights” argument in a standard way (as the value of a function parameter). It instead grabs it through a sequence of match.call() and eval() steps. This is a complicated way to get the value, and it works until it does not. Somehow, passing in the formula as a string interferes with how the value of weights is found. I think we can now see the benefits of isolation and independence of concerns in code.
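
One plausible reading of the error message (an assumption based on the eval(extras, data, env) call it names, not something traced line by line here) is that the weights expression is evaluated in the environment attached to the formula. A string carries no environment, so when the coercion happens deep inside stats code, w is out of reach. Under that assumption, a minimal sketch of a workaround is to coerce the string inside the calling function, so the formula's environment is that function's own frame. The names d and f3 are hypothetical, introduced only for illustration.

```r
# Toy data, same values as in the examples above.
d <- data.frame(x = 1:5, y = c(1, 1, 2, 2, 2))

# Hypothetical wrapper: takes the formula as a string, but converts it
# to a formula *inside* the function before handing it to lm().
f3 <- function(formula_string, w = NULL) {
  # as.formula() already defaults to the caller's frame; passing
  # env = environment() just makes that choice explicit.
  fml <- as.formula(formula_string, env = environment())
  lm(fml, data = d, weights = w)
}

fit3 <- f3("y ~ x", w = numeric(5) + 1)
coef(fit3)
```

This is the same idea as f1 above, only with the formula string passed in as a parameter.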

This over-use of direct environment copying and manipulation is what leads to a great many data-leaks in stats::lm() and stats::glm(). This is in addition to their weird habit of keeping a copy of all of the training data (which loses quite a few of the merits of these methods). Our group dealt with these issues a long time ago, so we are somewhat familiar with stats::lm() and stats::glm().
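
A small sketch of what "keeping a copy of the training data" looks like in practice, using the toy data from the examples above. The variable names (d, fit_default, fit_small, env) are ours, and this only inspects the fitted object; it is not a full accounting of everything an lm object retains.

```r
d <- data.frame(x = 1:5, y = c(1, 1, 2, 2, 2))

# By default the fitted object carries the model frame,
# i.e. the training rows that were used in the fit.
fit_default <- lm(y ~ x, data = d)
"model" %in% names(fit_default)
nrow(fit_default$model)

# model = FALSE drops the stored model frame. Other per-row components
# (residuals, fitted values, the QR decomposition) are still kept.
fit_small <- lm(y ~ x, data = d, model = FALSE)
is.null(fit_small[["model"]])

# The terms object also holds a reference to the formula's environment,
# which is one way objects from the calling context can ride along.
env <- attr(terms(fit_default), ".Environment")
is.environment(env)
```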

Of course, one could (as the stats::lm() documentation mentions) call stats::lm.fit(). However, stats::lm.fit() itself does not accept weights (weighted fits go through its sibling stats::lm.wfit(), described on the same help page), and its own documentation (help(lm.fit)) starts ominously:

These are the basic computing engines called by lm used to fit linear models. These should usually not be used directly unless by experienced users.
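
As a rough sketch of what calling the low-level engine looks like with weights, the example below uses stats::lm.wfit() (the weighted routine documented on the help(lm.fit) page) on the same toy data. The variable names (d, w, X, fit_w) are ours. The point is that the design matrix, response, and weights are all passed as ordinary values, so nothing is looked up by name in some environment at fit time.

```r
d <- data.frame(x = 1:5, y = c(1, 1, 2, 2, 2))
w <- numeric(5) + 1

# Build the design matrix explicitly (intercept column plus x).
X <- model.matrix(~ x, data = d)

# Weights are a plain argument here, not an expression to be evaluated later.
fit_w <- lm.wfit(x = X, y = d$y, w = w)
fit_w$coefficients
```

The trade-off is that the result is a plain list rather than an "lm" object, so conveniences such as summary() and predict() are not available on it directly.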

Having just finished teaching a four-day intensive course covering data science in Python, I can’t help but remark that users of sklearn.linear_model.LinearRegression() don’t need to worry about issues such as the above. Some of the notational flair of R comes at the cost of significant opportunities for user confusion.

Categories: Opinion

jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

2 replies

  1. I agree that R is full of these kinds of warts, but IMHO, asking strings to be expressions isn’t the solution. That’s still non-standard evaluation, just with an underpowered quoting operation (i.e. the literal quotation marks that enclose a string).

    The solution would be for R to be more strict in the inputs it accepts. Like JavaScript and HTML, R tries way too hard to guess what the user wants in order to avoid crashing with an error, which leads to overly complex code with many corner cases.

    As for Python, its philosophical stubbornness around non-standard evaluation is a big reason why dplyr still lacks a successful Python equivalent, despite some admirable attempts. (And as an aside: sklearn only became capable of weighted linear regression very recently – a testament to how little sklearn targets traditional statistics practitioners).