There are a number of easy ways to avoid illegible code nesting problems in R.
In this R tip we will expand upon the above statement with a simple example.
At some point it becomes illegible and undesirable to compose operations by nesting them, such as in the following code.
head(mtcars[with(mtcars, cyl == 8), c("mpg", "cyl", "wt")])
#                     mpg cyl   wt
# Hornet Sportabout  18.7   8 3.44
# Duster 360         14.3   8 3.57
# Merc 450SE         16.4   8 4.07
# Merc 450SL         17.3   8 3.73
# Merc 450SLC        15.2   8 3.78
# Cadillac Fleetwood 10.4   8 5.25
One popular way to break up nesting is to use magrittr's "%>%" pipe in combination with dplyr transform verbs, as we show below.
library("dplyr")
mtcars %>%
filter(cyl == 8) %>%
select(mpg, cyl, wt) %>%
head
#    mpg cyl   wt
# 1 18.7   8 3.44
# 2 14.3   8 3.57
# 3 16.4   8 4.07
# 4 17.3   8 3.73
# 5 15.2   8 3.78
# 6 10.4   8 5.25
Note: the above code lost (without warning) the row names that are part of mtcars. We also pass over the details of how pipe notation works. It is sufficient to say the notational convention is: each stage is approximately treated as an altered function call with a new inserted first argument set to the value of the pipeline up to the current point.
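The first-argument convention can be sketched in runnable form. This illustration uses base R's native `|>` pipe (available in R 4.1 and later), which follows the same insertion rule as magrittr's "%>%" for simple calls:

```r
# The pipe inserts the left-hand value as the first argument of the
# right-hand call, so these two forms compute the same thing.
piped  <- mtcars |> subset(cyl == 8) |> head()
nested <- head(subset(mtcars, cyl == 8))
identical(piped, nested)  # TRUE
```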
Many R users already routinely avoid nested notation problems through a convention I call “name re-use.” Such code looks like the following.
result <- mtcars
result <- filter(result, cyl == 8)
result <- select(result, mpg, cyl, wt)
head(result)
The above convention is enough to get around all problems of nesting. It also has the great advantage that it is step-debuggable. I recommend introducing and re-using a result name (in this case "result"), and not re-using the starting data name (in this case "mtcars"). This extra care makes the entire block restartable, which is another benefit when developing and debugging.
I like a variation I call "dot intermediates", which looks like the code below (notice we are switching back from dplyr verbs to base R operators).
. <- mtcars
. <- subset(., cyl == 8)
. <- .[, c("mpg", "cyl", "wt")]
result <- .
head(result)
#                     mpg cyl   wt
# Hornet Sportabout  18.7   8 3.44
# Duster 360         14.3   8 3.57
# Merc 450SE         16.4   8 4.07
# Merc 450SL         17.3   8 3.73
# Merc 450SLC        15.2   8 3.78
# Cadillac Fleetwood 10.4   8 5.25
The dot intermediate convention is very succinct, and we can use it with base R transforms to get a correct (and performant) result. The dot intermediates convention is particularly neat when you don’t intend to take the result further into your calculation (such as when you only want to print it) as it does not require you to think up an evocative result name. Like all conventions: it is just a matter of teaching, learning, and repetition to make this seem natural, familiar and legible.
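A possible refinement of the convention (a sketch of my own, and whether to bother is a matter of taste): wrapping the steps in local() keeps the working "." from lingering in the calling environment.

```r
# local() evaluates the dot-intermediate steps in their own environment;
# the value of the last expression becomes the result, and the
# temporary "." is discarded afterwards.
result <- local({
  . <- mtcars
  . <- subset(., cyl == 8)
  .[, c("mpg", "cyl", "wt")]
})
head(result)
```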
Also, contrary to what many repeat, base R is often faster than the dplyr alternative.
library("dplyr")
library("microbenchmark")
library("ggplot2")
timings <- microbenchmark(
base = {
. <- mtcars
. <- subset(., cyl == 8)
. <- .[, c("mpg", "cyl", "wt")]
nrow(.)
},
dplyr = {
mtcars %>%
filter(cyl == 8) %>%
select(mpg, cyl, wt) %>%
nrow
})
print(timings)
## Unit: microseconds
##   expr      min       lq      mean   median       uq       max neval
##   base  122.948  136.948  167.2253  159.688  179.924   349.328   100
##  dplyr 1570.188 1654.700 2537.2912 1699.744 1785.611 50759.770   100
autoplot(timings)

Durations for related tasks, smaller is better.
In this case base R is about 15 times faster (possibly due to magrittr overhead and the small size of this example). We also see that, with some care, base R can be quite legible. dplyr is a useful tool and convention; however, it is not the only allowed tool or the only allowed convention.
jmount
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
Thank you for this series, it is highly educational. Proceeding with the trend of base R alternatives to dplyr, could you do joins via merge next? There are several resources online that discuss merge but I couldn't find any good comparisons to dplyr's joins. I'm curious to see your take on it. Thanks!

I think two places where dplyr shines are the joins and the grouped summaries. We may touch on both in the future from a SQL point of view, but nothing other than data.table is as convenient and respectful of R types/classes.

At 320,000 rows dplyr seems to be at least 2 times faster. So if the size of your data is bigger, would it still make sense to use base R functions that might not scale well?
do.call("rbind", replicate(10000, mtcars, simplify = FALSE)) -> mtcars
library("dplyr")
library("microbenchmark")
library("ggplot2")
timings <- microbenchmark(
  base = {
    . <- mtcars
    . <- subset(., cyl == 8)
    . <- .[, c("mpg", "cyl", "wt")]
    nrow(.)
  },
  dplyr = {
    mtcars %>%
      filter(cyl == 8) %>%
      select(mpg, cyl, wt) %>%
      nrow
  })
print(timings)
# Unit: milliseconds
#  expr       min       lq     mean    median       uq       max neval
#  base 19.822568 31.43256 40.37011 33.433080 43.39094 107.66600   100
# dplyr  7.613895  7.96304 18.08439  8.683566 21.03271  86.27648   100

First, thank you for your interesting comment and point. Also, I admit: I am the one who first brought up dplyr and timings. But trust me, if I don't bring them up, they will be brought up for me. If speed of in-memory operations is the primary concern, I would suggest that is the data range where one might switch over to data.table.

library("dplyr")
library("microbenchmark")
library("ggplot2")
library("data.table")
mtcarsb <- mtcars[rep(seq_len(nrow(mtcars)), 10000), , ]
mtcarsd <- as.data.table(mtcarsb)
timings <- microbenchmark(
  base = {
    . <- mtcarsb
    . <- subset(., cyl == 8)
    . <- .[, c("mpg", "cyl", "wt")]
    nrow(.)
  },
  dplyr = {
    mtcarsb %>%
      filter(cyl == 8) %>%
      select(mpg, cyl, wt) %>%
      nrow
  },
  data.table = {
    nrow(mtcarsd[cyl == 8, c("mpg", "cyl", "wt")])
  })
print(timings)
## Unit: milliseconds
##       expr       min       lq     mean    median        uq      max neval
##       base 36.613135 49.23939 57.30778 52.667116 56.821898 165.6374   100
##      dplyr  9.089105 13.33665 19.61850 14.351512 25.811134 105.6308   100
## data.table  3.505766  4.90764 10.43046  5.971567  6.818303 115.2896   100
autoplot(timings)

I have no problem with people using dplyr. And, I have no problem with testing and discussing interesting alternatives (as you initiated here). What I do have a problem with is the subset of dplyr aficionados who criticize any non-use of dplyr (be it base R itself or other packages). Understand: once an R package promoter has successfully argued that R is inadequate, it is plausible that they are chasing users to Python/Pandas/scikit (also good tools), instead of towards the R package of their choosing. Again, I am not saying that is what is happening here, just something that unfortunately informs my writing at this point.

Just a minor remark. Since the blog post is about comparing speed in chaining operations, one should probably use data.table chaining in the benchmark above. Not that it makes much of a difference: as expected, data.table still outperforms dplyr by a huge margin.
library("dplyr")
library("microbenchmark")
library("ggplot2")
library("data.table")
mtcarsb <- mtcars[rep(seq_len(nrow(mtcars)), 10000), , ]
mtcarsd <- as.data.table(mtcarsb)
timings <- microbenchmark(
  base = {
    . <- mtcarsb
    . <- subset(., cyl == 8)
    . <- .[, c("mpg", "cyl", "wt")]
    nrow(.)
  },
  dplyr = {
    mtcarsb %>%
      filter(cyl == 8) %>%
      select(mpg, cyl, wt) %>%
      nrow
  },
  data.table = {
    mtcarsd[cyl == 8, ][, c("mpg", "cyl", "wt")][, .N]
  })
print(timings)
# Unit: milliseconds
#       expr      min       lq     mean   median       uq
#       base 22.39194 34.02887 46.84050 39.47613 43.55339
#      dplyr 18.20177 19.57842 26.37421 20.93748 23.37392
# data.table 10.17765 10.83761 22.33043 12.31881 24.79852
autoplot(timings)

Thanks for your point, petrovski. One of data.table's strengths is chaining operations (though in this case the example has no real transformations, so it doesn't show well). The relative timings probably depend a bit on architecture and compiler details. For my Mac Mini the multi-stage change seems to completely abrogate data.table's speed advantage for this (very trivial operations and trivial scale) example. However, it does give me a chance to point out that one could consider "][" as data.table's own pipe operator and further re-write the data.table block as:

mtcarsd[cyl == 8, ][
  , c("mpg", "cyl", "wt") ][
  , .N ]

It can really depend on what's being done. With your example of large data, I wasn't able to get base R to outperform dplyr, even with using more standard notation. However, just a couple of days ago, I had to rewrite some tests which heavily relied on dplyr for basic subsetting. They took way too long.
Most cases were just one filter call followed by a pull. By refactoring the filters into is_foo <- mydata[["column"]] == x and such, the tests went from taking 4 minutes to 1 second. The dataset being tested is almost 3 million rows.

I'll add that I agree with John: if you want performance, use data.table. I only avoided it in my case because my colleagues understand base R and dplyr, and I want maintainability over performance.

I have also run into issues where a small variation of a dplyr pipeline causes it to take a minute instead of milliseconds. filter() (especially in the presence of an active grouping) is the most common danger step.

It should be obvious from my stumbling around that I am not a regular data.table user. The reason is most of my current contracting work has been with data living in Spark/Hadoop or PostgreSQL. So I have been working a lot on database-first methodologies.
Rpattern that I feel is still under-used and under-appreciated. For example: I use it here to invert permutations (combining it with the “write some indices on the left of an assignment” trick). Vector notations are incredibly powerful and can do some things that are not convenient to express otherwise.Thank you for pointing this out! This actually was the problem. I threw
ungroupat the end of certain pipelines, and now the tests are back down to 1.8 seconds.You are welcome- here is a reference:
dplyrissues 3294. I believe it got “closed no-fix” (but that isn’t how this project keeps records).Excellent post.
with data.table it should be done like this:
mtcarsd[cyl==8, .N],
much faster, the selection of c(“mpg”, “cyl”, “wt”) is needless.
I know selecting columns does not affect row counts. The same possible change applies to all of the examples (not just
data.table). This is just a sequence of operations to simulate a small workflow without bringing in actual concerns. So please consider it a notional example. A slightly more realistic example can be found here. Often I include a row-count just to force execution on remote systems that have lazy eval.