Our group has done a *lot* of work with non-standard calling conventions in `R`.

Our tools work hard to *eliminate* non-standard calling (as is the purpose of `wrapr::let()`), or at least make it cleaner and more controllable (as is done in the wrapr dot pipe). And even so, we *still* get surprised by some of the side-effects and mal-consequences of the over-use of non-standard calling conventions in `R`.

Please read on for a recent example.

Consider the following calls to `stats::lm()`, and notice that the third example fails (throws an error).

```r
# works
lm("y ~ x",
   data = data.frame(x = 1:5, y = c(1, 1, 2, 2, 2)),
   weights = numeric(5) + 1)
#>
#> Call:
#> lm(formula = "y ~ x", data = data.frame(x = 1:5, y = c(1, 1,
#>     2, 2, 2)), weights = numeric(5) + 1)
#>
#> Coefficients:
#> (Intercept)            x
#>         0.7          0.3

# works
f1 <- function(w = NULL) {
  lm(as.formula("y ~ x"),
     data = data.frame(x = 1:5, y = c(1, 1, 2, 2, 2)),
     weights = w)
}
f1(numeric(5) + 1)
#>
#> Call:
#> lm(formula = as.formula("y ~ x"), data = data.frame(x = 1:5,
#>     y = c(1, 1, 2, 2, 2)), weights = w)
#>
#> Coefficients:
#> (Intercept)            x
#>         0.7          0.3

# fails
f2 <- function(w = NULL) {
  lm("y ~ x",
     data = data.frame(x = 1:5, y = c(1, 1, 2, 2, 2)),
     weights = w)
}
f2(numeric(5) + 1)
#> Error in eval(extras, data, env): object 'w' not found
```

According to the `stats::lm()` documentation (`help(lm)`), the first argument must be:

an object of class “formula” (or one that can be coerced to that class): a symbolic description of the model to be fitted. The details of model specification are given under ‘Details’.

A string appears to be coercible to a formula, so all three examples should work. However, typing `print(lm)` reveals the issue: `stats::lm()` doesn't take the `weights` argument in a standard way (as the value of a function parameter). It instead grabs it through a sequence of `match.call()` and `eval()` steps. It is a complicated way to get the value, which works until it does not work. Somehow passing in the formula as a string interferes with how the value of `weights` is found. I think we can now see the benefits of isolation and independence of concerns in code.
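A minimal sketch of one way to stay out of trouble: build the formula object inside the calling function, so the formula's environment is the frame where `w` is defined. The name `f3` is hypothetical (it is just a variant of `f2` above). The explanation in the comments is our reading of the error message (an assumption that the formula's environment is where `weights` is looked up), not something the `lm()` documentation states in these words.

```r
# f3 is a hypothetical variant of the failing f2 shown earlier.
f3 <- function(w = NULL) {
  d <- data.frame(x = 1:5, y = c(1, 1, 2, 2, 2))
  # as.formula() called here attaches f3's evaluation frame to the formula.
  frm <- as.formula("y ~ x")
  # Check that the formula's environment is the frame holding `w`.
  stopifnot(identical(environment(frm), environment()))
  # Assumption: lm() looks up `weights` starting from the formula's
  # environment, so `w` is now visible to it.
  lm(frm, data = d, weights = w)
}

fit <- f3(numeric(5) + 1)
coef(fit)
```

This is essentially what `f1` does. The difference from `f2` is only *where* the string is converted to a formula, which is exactly the kind of detail a caller should not have to care about.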

This over-use of direct environment copying and manipulation is what leads to a great many data-leaks in `stats::lm()` and `stats::glm()`. This is in addition to their weird habit of keeping a copy of all of the training data (which loses quite a few of the merits of these methods). Our group dealt with these issues a long time ago, so we are somewhat familiar with `stats::lm()` and `stats::glm()`.

Of course, one could (as the `stats::lm()` documentation mentions) call `stats::lm.fit()`. However, `stats::lm.fit()` does not seem to accept weights, and its own documentation (`help(lm.fit)`) starts ominously:

These are the basic computing engines called by lm used to fit linear models. These should usually not be used directly unless by experienced users.
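For what it is worth, the same help page also documents `stats::lm.wfit()`, which takes the weights as an ordinary argument `w`. The following is a minimal sketch, assuming the example data used above. The design matrix `X` (including the intercept column) is built by hand here, which is part of the work `lm()` normally does for the user.

```r
# Same toy data as in the lm() examples.
d <- data.frame(x = 1:5, y = c(1, 1, 2, 2, 2))
w <- numeric(5) + 1

# lm.wfit() wants an explicit design matrix; add the intercept column ourselves.
X <- cbind("(Intercept)" = 1, x = d$x)

# Weights are passed as a plain value: no match.call()/eval() lookup involved.
fit_w <- lm.wfit(x = X, y = d$y, w = w)
fit_w$coefficients
```

The result is a plain list rather than an `lm` object, so `predict()`, `summary()`, and friends are not available on it. That is likely part of why the documentation steers most users away from these functions.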

Having just finished teaching a four day intensive course covering data science in Python, I can’t help but remark that users of `sklearn.linear_model.LinearRegression()` don’t need to worry about issues such as the above. Some of the notational flair of `R` comes at the cost of significant opportunities for user confusion.


### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

I agree that R is full of these kinds of warts, but IMHO, asking strings to be expressions isn’t the solution. That’s still non-standard evaluation, just with an underpowered quoting operation (i.e. the literal quotation marks that enclose a string).

The solution would be for R to be more strict in the inputs it accepts. Like JavaScript and HTML, R tries way too hard to guess what the user wants in order to avoid crashing with an error, which leads to overly complex code with many corner cases.

As for Python, its philosophical stubbornness around non-standard evaluation is a big reason why dplyr still lacks a successful Python equivalent, despite some admirable attempts. (And as an aside: sklearn only became capable of weighted linear regression very recently – a testament to how little sklearn targets traditional statistics practitioners).

To do a bit more than complain yet again about this, we are now sharing newer cleanup code in the next `wrapr` release.