There are a number of easy ways to avoid illegible code nesting problems in R.
In this R tip we will expand upon the above statement with a simple example.
At some point it becomes illegible and undesirable to compose operations by nesting them, such as in the following code.
head(mtcars[with(mtcars, cyl == 8), c("mpg", "cyl", "wt")])

 #                     mpg cyl   wt
 # Hornet Sportabout  18.7   8 3.44
 # Duster 360         14.3   8 3.57
 # Merc 450SE         16.4   8 4.07
 # Merc 450SL         17.3   8 3.73
 # Merc 450SLC        15.2   8 3.78
 # Cadillac Fleetwood 10.4   8 5.25
One popular way to break up nesting is to use magrittr‘s “%>%” in combination with dplyr transform verbs, as we show below.
library("dplyr") mtcars %>% filter(cyl == 8) %>% select(mpg, cyl, wt) %>% head # mpg cyl wt # 1 18.7 8 3.44 # 2 14.3 8 3.57 # 3 16.4 8 4.07 # 4 17.3 8 3.73 # 5 15.2 8 3.78 # 6 10.4 8 5.25
Note: the above code lost (without warning) the row names that are part of mtcars. We also pass over the details of how pipe notation works. It is sufficient to say the notational convention is: each stage is treated approximately as an altered function call, with a new first argument inserted and set to the value of the pipeline up to that point.
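For instance, the following two forms are roughly equivalent (a small illustration of the convention only, not a full description of magrittr's evaluation rules):

library("dplyr")

# pipe form: mtcars is inserted as the first argument of filter()
mtcars %>% filter(cyl == 8)

# is approximately the same call as
filter(mtcars, cyl == 8)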
Many R users already routinely avoid nested notation problems through a convention I call “name re-use.” Such code looks like the following.
result <- mtcars
result <- filter(result, cyl == 8)
result <- select(result, mpg, cyl, wt)
head(result)
The above convention is enough to get around all problems of nesting. It also has the great advantage that it is step-debuggable. I recommend introducing and re-using a result name (in this case “result”), and not re-using the starting data name (in this case “mtcars”). This extra care makes the entire block restartable, which is another benefit when developing and debugging.
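For contrast, here is a sketch of the anti-pattern (re-using the starting data name; shown only as an illustration):

library("dplyr")

# anti-pattern: overwriting the starting data name
mtcars <- filter(mtcars, cyl == 8)
mtcars <- select(mtcars, mpg, cyl, wt)
# a second run of this block starts from the already-altered data,
# not the original mtcars, so the block is not safely restartable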
I like a variation I call “dot intermediates”, which looks like the code below (notice we are switching back from dplyr verbs to base R operators).
. <- mtcars
. <- subset(., cyl == 8)
. <- .[, c("mpg", "cyl", "wt")]
result <- .
head(result)

 #                     mpg cyl   wt
 # Hornet Sportabout  18.7   8 3.44
 # Duster 360         14.3   8 3.57
 # Merc 450SE         16.4   8 4.07
 # Merc 450SL         17.3   8 3.73
 # Merc 450SLC        15.2   8 3.78
 # Cadillac Fleetwood 10.4   8 5.25
The dot intermediate convention is very succinct, and we can use it with base R transforms to get a correct (and performant) result. The dot intermediates convention is particularly neat when you don’t intend to take the result further into your calculation (such as when you only want to print it), as it does not require you to think up an evocative result name. Like all conventions: it is just a matter of teaching, learning, and repetition to make this seem natural, familiar, and legible.
Also, contrary to what many repeat, base R is often faster than the dplyr alternative.
library("dplyr") library("microbenchmark") library("ggplot2") timings <- microbenchmark( base = { . <- mtcars . <- subset(., cyl == 8) . <- .[, c("mpg", "cyl", "wt")] nrow(.) }, dplyr = { mtcars %>% filter(cyl == 8) %>% select(mpg, cyl, wt) %>% nrow }) print(timings) ## Unit: microseconds ## expr min lq mean median uq max neval ## base 122.948 136.948 167.2253 159.688 179.924 349.328 100 ## dplyr 1570.188 1654.700 2537.2912 1699.744 1785.611 50759.770 100 autoplot(timings)
Durations for related tasks, smaller is better.
In this case the base R version is about 15 times faster (possibly due to magrittr overhead and the small size of this example). We also see that, with some care, base R can be quite legible. dplyr is a useful tool and convention; however, it is not the only allowed tool or the only allowed convention.
jmount: Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
Thank you for this series, it is highly educational. Proceeding with the trend of base R alternatives to dplyr, could you do joins via merge next? There are several resources online that discuss merge, but I couldn’t find any good comparisons to dplyr’s joins. I’m curious to see your take on it. Thanks!
I think two places where dplyr shines are the joins and the grouped summaries. We may touch on both in the future from a SQL point of view, but nothing other than data.table is as convenient and respectful of R types/classes.
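As a rough illustration of the two join interfaces (a sketch with made-up tables, not a full comparison):

library("dplyr")

orders    <- data.frame(id = c(1, 2, 3), customer = c("a", "b", "a"))
customers <- data.frame(customer = c("a", "b"), region = c("east", "west"))

# base R join
merge(orders, customers, by = "customer")

# dplyr join
inner_join(orders, customers, by = "customer")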
At 320000 rows dplyr seems to be at least 2 times faster. So if the size of your data is bigger, would it still make sense to use base R functions that might not scale well?
First, thank you for your interesting comment and point. Also, I admit: I am the one who first brought up dplyr and timings. But trust me, if I don’t bring them up, they will be brought up for me.

If speed of in-memory operations is the primary concern, I would suggest that is the data range where one might switch over to data.table.

I have no problem with people using dplyr. And I have no problem with testing and discussing interesting alternatives (as you initiated here). What I do have a problem with is the subset of dplyr aficionados who criticize any non-use of dplyr (be it base R itself or other packages). Understand: once an R package promoter has successfully argued that R is inadequate, it is plausible that they are chasing users to Python/Pandas/scikit (also good tools), instead of towards the R package of their choosing. Again, I am not saying that is what is happening here, just something that unfortunately informs my writing at this point.

Just a minor remark. Since the blog post is about comparing speed in chaining operations, one should probably use data.table chaining in the benchmark above. Not that it makes much of a difference: as expected, data.table still outperforms dplyr by a huge margin.
Thanks for your point, petrovski. One of data.table‘s strengths is chaining operations (though in this case the example has no real transformations, so it doesn’t show well).

The relative timings probably depend a bit on architecture and compiler details. For my Mac Mini the multi-stage change seems to completely abrogate data.table‘s speed advantage for this (very trivial operations and trivial scale) example. However, it does give me a chance to point out that one could consider “][” as data.table‘s own pipe operator and further re-write the data.table block as sketched below.
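A minimal sketch of such a chained data.table form (a reconstruction assuming a data.table copy of mtcars named mtcars_dt, not the comment’s original code):

library("data.table")

mtcars_dt <- as.data.table(mtcars, keep.rownames = TRUE)

# chain steps with "][": filter rows, then select columns, then count rows
mtcars_dt[cyl == 8][, c("mpg", "cyl", "wt")][, .N]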
It can really depend on what’s being done. With your example of large data, I wasn’t able to get base R to outperform dplyr, even when using more standard notation. However, just a couple of days ago, I had to rewrite some tests which heavily relied on dplyr for basic subsetting. They took way too long.
Most cases were just one filter call followed by a pull. By refactoring the filters into is_foo <- mydata[["column"]] == x and such, the tests went from taking 4 minutes to 1 second. The dataset being tested is almost 3 million rows.
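A minimal sketch of that kind of refactor (the data and column names here are hypothetical, not the commenter’s actual code):

library("dplyr")

# hypothetical stand-in for the roughly 3 million row table
mydata <- data.frame(column = sample(letters, 3e6, replace = TRUE),
                     value  = rnorm(3e6))
x <- "q"

# dplyr form: one filter followed by a pull
res1 <- mydata %>% filter(column == x) %>% pull(value)

# base R form: land the boolean decisions in a vector, then index with it
is_foo <- mydata[["column"]] == x
res2 <- mydata[["value"]][is_foo]

identical(res1, res2)  # TRUE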
I’ll add that I agree with John: if you want performance, use data.table. I only avoided it in my case because my colleagues understand base R and dplyr, and I want maintainability over performance.

I have also run into issues where a small variation of a dplyr pipeline causes it to take a minute instead of milliseconds. filter() (especially in the presence of an active grouping) is the most common danger step.
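A hypothetical illustration of the kind of variation meant here (a sketch, not a benchmark; actual timings depend on the data and the dplyr version):

library("dplyr")

d <- data.frame(g = sample(1e5, 3e6, replace = TRUE), x = rnorm(3e6))

# filter() while a grouping is active is evaluated per group,
# which can be dramatically slower when there are many groups
d %>% group_by(g) %>% filter(x > 0)

# dropping the grouping first (when it is no longer needed) avoids that cost
d %>% group_by(g) %>% ungroup() %>% filter(x > 0)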
It should be obvious from my stumbling around that I am not a regular data.table user. The reason is that most of my current contracting work has been with data living in Spark/Hadoop or PostgreSQL, so I have been working a lot on database-first methodologies.

You made a very good point on the “land a column of boolean decisions to use later as an index” idea. That is a very powerful and fast R pattern that I feel is still under-used and under-appreciated. For example: I use it here to invert permutations (combining it with the “write some indices on the left of an assignment” trick). Vector notations are incredibly powerful and can do some things that are not convenient to express otherwise.
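A small sketch of that trick (an illustration of the general pattern, not the linked example):

perm <- sample(10)                 # a permutation of 1..10

# "write some indices on the left of an assignment":
# position perm[i] of the inverse receives the value i
inv <- integer(length(perm))
inv[perm] <- seq_along(perm)

all(inv[perm] == seq_along(perm))  # TRUE: inv inverts perm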
Thank you for pointing this out! This actually was the problem. I threw ungroup at the end of certain pipelines, and now the tests are back down to 1.8 seconds.

You are welcome. Here is a reference: dplyr issue 3294. I believe it got “closed no-fix” (but that isn’t how this project keeps records).

Excellent post.
With data.table it should be done like this:

mtcarsd[cyl == 8, .N]

It is much faster, and the selection of c("mpg", "cyl", "wt") is needless.
I know selecting columns does not affect row counts. The same possible change applies to all of the examples (not just data.table). This is just a sequence of operations to simulate a small workflow without bringing in actual concerns, so please consider it a notional example. A slightly more realistic example can be found here. Often I include a row-count just to force execution on remote systems that have lazy eval.