# What does it mean to write “vectorized” code in R?

One often hears that `R` cannot be fast (false), or more correctly that for fast code in `R` you may have to consider “vectorizing.”

A lot of knowledgeable `R` users are not comfortable with the term “vectorize”, and not really familiar with the method.

“Vectorize” is just a slightly high-handed way of saying:

`R` naturally stores data in columns (in column-major order), so if you are not coding to that pattern you are fighting the language.

In this article we will make the above clear by working through a non-trivial example of writing vectorized code.

For our example problem we will take on the task of computing the PRESS statistic (a statistic we use in our new `RcppDynProg` package). The “predicted residual error sum of squares” or PRESS statistic is an estimate of the out-of-sample quality of a fit. We motivate the PRESS statistic as follows.

Suppose we are fitting a simple linear model. In such a case it is natural to examine the model residuals (differences between the actual values and the matching predictions):

```r
d <- data.frame(
  x = 1:6,
  y = c(1, 1, 2, 2, 3, 3))

lm(y ~ x, data = d)$residuals
#         1          2          3          4          5          6
# 0.1428571 -0.3142857  0.2285714 -0.2285714  0.3142857 -0.1428571
```

However, because the model has seen the data it is being applied to, these residuals are not representative of the residuals we would see on new data (there is a towards-zero bias in this estimate). An improved estimate is the PRESS statistic: for each point the model is fit on all points except the point in question, and then the residuals are estimated. This is a form of cross-validation and is easy to calculate:

```r
xlin_fits_lm <- function(x, y) {
  n <- length(y)
  d <- data.frame(x = x, y = y)
  vapply(
    seq_len(n),
    function(i) {
      # fit on all points except point i, then predict at point i
      m <- lm(y ~ x, data = d[-i, ])
      predict(m, newdata = d[i, ])
    }, numeric(1))
}

d$y - xlin_fits_lm(d$x, d$y)
#   0.3000000 -0.4459459  0.2790698 -0.2790698  0.4459459 -0.3000000
```

Notice these values tend to be further from zero. They also better represent how an overall model might perform on new data.

At this point, from a statistical point of view, we are done. However, re-building an entire linear model for each point is computationally inefficient. Each of the models we are calculating shares many training points, so we should be able to build a much faster hand-rolled calculation.

That faster calculation is given in the rather formidable function `xlin_fits_R()` found here. This function computes the summary statistics needed to solve the linear regression for all of the data. Then, for each point in turn, it subtracts that point’s contribution from the summaries, performs the algebra needed to solve for the model, and then applies it. The advantage of this approach is that taking the point out and solving the model takes only a small (constant) number of steps, independent of how many points are in the summary. So this code is, in principle, very efficient.
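Since the original function is linked rather than inlined, here is a sketch of the summary-update idea (our reconstruction, not necessarily the exact `xlin_fits_R()` source from the package):

```r
# Leave-one-out linear-model predictions via shared summary statistics.
# A sketch of the summary-update idea; not necessarily the exact
# xlin_fits_R() source.
xlin_fits_R <- function(x, y) {
  n <- length(y)
  # summaries of the full data, computed once
  Sx  <- sum(x);     Sy  <- sum(y)
  Sxx <- sum(x * x); Sxy <- sum(x * y)
  preds <- numeric(n)
  for(k in seq_len(n)) {
    # subtract point k's contribution from the summaries
    nk  <- n - 1
    sx  <- Sx  - x[k];        sy  <- Sy  - y[k]
    sxx <- Sxx - x[k] * x[k]; sxy <- Sxy - x[k] * y[k]
    # solve the 2x2 normal equations for slope b and intercept a
    det <- nk * sxx - sx * sx
    b <- (nk * sxy - sx * sy) / det
    a <- (sy - b * sx) / nk
    # predict for the held-out point
    preds[k] <- a + b * x[k]
  }
  preds
}
```

On the example data, `d$y - xlin_fits_R(d$x, d$y)` reproduces the leave-one-out residuals from `xlin_fits_lm()` (up to floating point), while doing only constant work per held-out point.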

And in fact timings on a small problem (300 observations) show that while the simple “call `lm()` a bunch of times” `xlin_fits_lm()` takes 0.28 seconds, the more complicated `xlin_fits_R()` takes 0.00043 seconds: a speedup of over 600 times!

However, this code is performing a separate calculation for each scalar data point. As we mentioned above, this is fighting `R`, which is specialized for performing calculations over large vectors. The exact same algorithm written in `C++`, instead of `R`, takes 0.000055 seconds, nearly another factor of 10 faster!

The timings are summarized below.

| implementation | time (seconds) |
|---|---|
| `xlin_fits_lm()` (R, repeated `lm()`) | 0.28 |
| `xlin_fits_R()` (R, scalar) | 0.00043 |
| `C++` translation (scalar) | 0.000055 |

This sort of difference, scalar-oriented `C++` being so much faster than scalar-oriented `R`, is often distorted into “`R` is slow.”

This is just not the case. If we adapt the algorithm to be vectorized we get an `R` algorithm with performance comparable to the `C++` implementation!

Not all algorithms can be vectorized, but this one can, and in an incredibly simple way. The original algorithm itself (`xlin_fits_R()`) is a bit complicated, but the vectorized version (`xlin_fits_V()`) is literally derived from the earlier one by crossing out the indices. That is: in this case we can move from working over very many scalars (slow in `R`) to working over a small number of vectors (fast in `R`).

Let’s take a look at the code transformation. We are not saying that `xlin_fits_R()` or `xlin_fits_V()` are easy to understand; we felt pretty slick when we derived them and added a lot of tests to confirm they calculate the same thing as `xlin_fits_lm()`. What we are saying is that the transform from `xlin_fits_R()` to `xlin_fits_V()` is simple: just cross out the for-loop and all of the “`[k]`” indices!

Performing the exact same operation on every entry in a structure (but with different values) is the essence of “vectorized code.” When we wrote a `for`-loop in `xlin_fits_R()` to perform the same steps for each index, we were in fact fighting the language. Crossing out the `for`-loop and indices mechanically turned the scalar code into faster vector code.
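Concretely, a sketch of what that transformation yields (our reconstruction of the idea, not necessarily the exact `xlin_fits_V()` source): every per-index scalar becomes a whole vector, and the loop disappears.

```r
# Vectorized leave-one-out predictions: the same summary-update algebra,
# but every "[k]" crossed out so each quantity is a whole vector.
# A sketch of the idea; not necessarily the exact xlin_fits_V() source.
xlin_fits_V <- function(x, y) {
  n <- length(y)
  Sx  <- sum(x);     Sy  <- sum(y)
  Sxx <- sum(x * x); Sxy <- sum(x * y)
  nk  <- n - 1
  # each of these is a vector: entry k holds the summaries with point k removed
  sx  <- Sx  - x;     sy  <- Sy  - y
  sxx <- Sxx - x * x; sxy <- Sxy - x * y
  # solve all n of the 2x2 normal equations at once
  det <- nk * sxx - sx * sx
  b <- (nk * sxy - sx * sy) / det
  a <- (sy - b * sx) / nk
  a + b * x
}
```

The per-element steps are identical to the scalar version; `R`’s vector arithmetic now performs them across all `n` held-out points in a handful of whole-vector operations.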

And that is our example of how and why to vectorize code in `R`.


### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

### 4 replies

1. Blaine Bateman says:

This is a very nice example. I wonder if you have generalized it to multiple regression (i.e. X as a multi-column feature set)?

2. John Mount says:

Thanks!

Interesting idea.

It would be easy to do that with the `Rcpp` code. You would want to call an actual package-based solver per system, and if you thought piecewise linear with no continuity constraint was ugly in 1d, just wait. However, it is not obvious how to do the same in `R` through vectorization. The vectorization depended on the fact that one can solve a 2 by 2 linear system easily with “oblivious code” (code that takes the exact same set of operations no matter what values are in the variables). The obliviousness (kind of a poor man’s Single Instruction Multiple Data (SIMD) mode) is what made things easy in `R`.

3. Blaine Bateman says:

Thanks for the quick reply! I’ll give this some more thought. I realized you had done the math for the 1d x vs y, just “hoped” you might have somehow done it for the bigger case! I’ll ponder if I can modify your code to get it to work, even if not fully vectorized, as that could still be very beneficial. I use multiple regression often in my consulting and getting good uncertainty estimates is of high value.

Thanks again for the post and your feedback.

4. John Mount says:

Please stay in touch. The problem is that piecewise linear in more than 1d needs one more good idea (basis functions like MARS, or barycentric thinking).