Introduction
Beginning R
users often come to the false impression that the popular packages dplyr
and tidyr
are both all of R
and sui generis inventions (in that they might be unprecedented and there might no other reasonable way to get the same effects in R
). These packages and their conventions are high-value, but they are results of evolution and implement a style of programming that has been available in R
for some time. They evolved in a context, and did not burst on the scene fully armored with spear in hand.
dplyr and tidyr
We will start with a (very) brief outline of the primary capabilities of dplyr and tidyr.
dplyr
dplyr embodies the idea that data manipulation should be broken down into a sequence of transformations.
For example: in R, if one wishes to add a column to a data.frame, it is common to perform an "in-place" calculation as shown below:
d <- data.frame(x=c(-1,0,1))
print(d)
## x
## 1 -1
## 2 0
## 3 1
d$absx <- abs(d$x)
print(d)
## x absx
## 1 -1 1
## 2 0 0
## 3 1 1
This has a couple of disadvantages:
- The original d has been altered, so re-starting calculations (say after we discover a mistake) can be inconvenient.
- We have to keep repeating the name of the data.frame, which is not only verbose (not that important an issue) but is also a chance to write the wrong name and introduce an error.
The "dplyr
-style" is to write the same code as follows:
suppressPackageStartupMessages(library("dplyr"))
d <- data.frame(x=c(-1,0,1))
d %>%
mutate(absx = abs(x))
## x absx
## 1 -1 1
## 2 0 0
## 3 1 1
# confirm our original data frame is unaltered
print(d)
## x
## 1 -1
## 2 0
## 3 1
The idea is to break your task into the sequential application of a small number of "standard verbs" to produce your result. The verbs are "pipelined" or sequenced using the magrittr pipe "%>%", which can be thought of as treating the following four statements as equivalent:
f(x)
x %>% f(.)
x %>% f()
x %>% f
This lets one write a sequence of operations as a left to right pipeline (without explicit nesting of functions or use of numerous intermediate variables). Some discussion can be found here.
Primary dplyr verbs include the "single table verbs" from the dplyr 0.5.0 introduction vignette:
- filter() (and slice())
- arrange()
- select() (and rename())
- distinct()
- mutate() (and transmute())
- summarise()
- sample_n() (and sample_frac())
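For instance, a few of these verbs can be chained together on a toy data frame (a small illustration of our own, not from the vignette; the data frame d2 is made up for this purpose):
suppressPackageStartupMessages(library("dplyr"))
d2 <- data.frame(x = c(-1, 0, 1, 2))
d2 %>%
  mutate(absx = abs(x)) %>%   # add a derived column
  filter(absx > 0) %>%        # drop the zero row
  arrange(desc(absx)) %>%     # order by the derived column, largest first
  select(x)                   # keep only the original column
# result: a three-row data frame with x equal to 2, -1, 1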
These verbs have high-performance implementations (often in C++, thanks to Rcpp) and often have defaults that are safer and better for programming (not changing types on single-column data frames, not promoting strings to factors, and so on). Not really discussed in the dplyr 0.5.0 introduction are the dplyr::*join() operators, which are in fact critical components but are easily explained as standard relational joins (i.e., they are very important implementations, but not novel concepts).
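For example (a minimal sketch of our own, using two made-up tables emp and dept), a dplyr join reads just like a relational join:
suppressPackageStartupMessages(library("dplyr"))
# two made-up tables: employees and departments
emp  <- data.frame(name = c('a', 'b', 'c'), dept_id = c(1, 2, 2))
dept <- data.frame(dept_id = c(1, 2), dept_name = c('sales', 'ops'))
# left_join() keeps every row of emp and attaches the matching dept columns
left_join(emp, dept, by = "dept_id")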
Fairly complex data transforms can be broken down in terms of these verbs (plus some verbs from tidyr). Take, for example, a slightly extended version of one of the complex work-flows from the dplyr 0.5.0 introduction vignette.
The goal is: plot the distribution of average flight arrival delays and average flight departure delays (all averages grouped by date) for dates where either of these averages is at least 30 minutes. The first step is writing down the goal (as we did above). With that clear, someone familiar with dplyr can write a pipeline or work-flow as below (we have added the gather and arrange steps to extend the example a bit):
library("nycflights13")
suppressPackageStartupMessages(library("dplyr"))
library("tidyr")
library("ggplot2")
summary1 <- flights %>%
group_by(year, month, day) %>%
select(arr_delay, dep_delay) %>%
summarise(
arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE)
) %>%
filter(arr > 30 | dep > 30) %>%
gather(key = delayType,
value = delayMinutes,
arr, dep) %>%
arrange(year, month, day, delayType)
## Adding missing grouping variables: `year`, `month`, `day`
dim(summary1)
## [1] 98 5
head(summary1)
## Source: local data frame [6 x 5]
## Groups: year, month [2]
##
## year month day delayType delayMinutes
## <int> <int> <int> <chr> <dbl>
## 1 2013 1 16 arr 34.24736
## 2 2013 1 16 dep 24.61287
## 3 2013 1 31 arr 32.60285
## 4 2013 1 31 dep 28.65836
## 5 2013 2 11 arr 36.29009
## 6 2013 2 11 dep 39.07360
ggplot(data= summary1, mapping=aes(x=delayMinutes, color=delayType)) +
geom_density() +
ggtitle(paste("distribution of mean arrival and departure delays by date",
"when either mean delay is at least 30 minutes", sep='n'),
subtitle = "produced by: dplyr/magrittr/tidyr packages")

Once you get used to the notation (become familiar with "%>%" and the verbs), the above can be read in small pieces and is considered fairly elegant. The warning message indicates it would have been better documentation to have written the initial select() as "select(year, month, day, arr_delay, dep_delay)" (in addition, I feel that group_by() should always be written as close to summarise() as is practical). We have intentionally (beyond the minor extension) kept the example as is.
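For reference, a version with the more explicit initial select() and with group_by() moved next to summarise() could look as follows (a sketch along the lines suggested by the warning; its output is not re-shown here):
summary1b <- flights %>%
  select(year, month, day, arr_delay, dep_delay) %>%
  group_by(year, month, day) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30) %>%
  gather(key = delayType,
         value = delayMinutes,
         arr, dep) %>%
  arrange(year, month, day, delayType)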
But dplyr is not unprecedented. It was preceded by the plyr package, and many of these transformational verbs actually have near equivalents in the base:: name-space:
- dplyr::filter() ~ base::subset()
- dplyr::arrange() ~ base::order()
- dplyr::select() ~ base::[]
- dplyr::mutate() ~ base::transform()
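As a rough illustration of these correspondences (a sketch of our own; the base forms are near, not exact, equivalents):
d <- data.frame(x = c(-1, 0, 1))
# dplyr::mutate() ~ base::transform()
transform(d, absx = abs(x))
# dplyr::filter() ~ base::subset()
subset(d, x >= 0)
# dplyr::arrange() ~ base::order() (used to build a row index)
d[order(d$x, decreasing = TRUE), , drop = FALSE]
# dplyr::select() ~ base::[]
d['x']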
We will get back to these substitutions after we discuss tidyr.
tidyr
tidyr is a smaller package than dplyr, and it mostly supplies the following verbs:
- complete() (a bulk coalesce function)
- gather() (an un-pivot operation, related to stats::reshape())
- spread() (a pivot operation, related to stats::reshape())
- nest() (a hierarchical data operation)
- unnest() (the opposite of nest(); the closest analogy might be base::unlist())
- separate() (split a column into multiple columns)
- extract() (extract one column)
- expand() (complete an experimental design)
The most famous tidyr verbs are nest(), unnest(), gather(), and spread(). We will discuss gather() here, as it and spread() are incremental improvements on stats::reshape().
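As a small illustration of our own (a made-up toy data frame, not the flights data), gather() moves measurement columns into key/value pairs and spread() undoes it:
library("tidyr")
wide <- data.frame(day = c(1, 2),
                   arr = c(34.2, 32.6),
                   dep = c(24.6, 28.7))
# un-pivot: one row per (day, delayType) pair
long <- gather(wide, key = delayType, value = delayMinutes, arr, dep)
print(long)
# pivot back: one column per delayType value
spread(long, key = delayType, value = delayMinutes)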
Note also that the tidyr package was itself preceded by a package called reshape2, which supplied pivot capabilities in terms of verbs called melt() and dcast().
The flights example again
It may come as a shock to some, but one can roughly "line for line" translate the "nycflights13" example from the dplyr 0.5.0 introduction into common methods from base:: and stats:: that reproduce the sequence-of-transforms style. I.e., the transformational style is already available in "base-R".
By "base-R
" we mean R
with only its standard name-spaces (base
, util
, stats
and a few others). Or "R
out of the box" (before loading many packages). "base-R
" is not meant as a pejorative term here. We don’t take "base-R
" to in any way mean "old-R
", but to denote the core of the language we have decided to use for many analytic tasks.
What we are doing is separating the style of programming taught "as dplyr" (itself a significant contribution) from the implementation (also a significant contribution). We will replace the use of the magrittr pipe "%>%" with the Bizarro Pipe (an effect available in base-R) to produce code that works without use of dplyr, tidyr, or magrittr.
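The Bizarro Pipe is simply the convention of ending each statement with an assignment into the dot variable ("->.;"), so the next statement can read its input from ".". A minimal sketch of the idea (our own toy example):
# each statement writes its result into ".";
# the next statement reads its input from "."
4 ->.;
sqrt(.) ->.;
print(.)
## [1] 2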
The translated example:
library("nycflights13")
library("ggplot2")
flights ->.;
# select columns we are working with
.[c('arr_delay', 'dep_delay', 'year', 'month', 'day')] ->.;
# simulate the group_by/summarize by split/lapply/rbind
transform(., key=paste(year, month, day)) ->.;
split(., .$key) ->.;
lapply(., function(.) {
transform(., arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE)
)[1, , drop=FALSE]
}) ->.;
do.call(rbind, .) ->.;
# filter to either delay at least 30 minutes
subset(., arr > 30 | dep > 30) ->.;
# select only columns we wish to present
.[c('year', 'month', 'day', 'arr', 'dep')] ->.;
# get the data into a long form
# can't easily use stack as (from help(stack)):
# "stack produces a data frame with two columns""
reshape(.,
idvar = c('year','month','day'),
direction = 'long',
varying = c('arr', 'dep'),
timevar = 'delayType',
v.names = 'delayMinutes') ->.;
# convert reshape ordinals back to original names
transform(., delayType = c('arr', 'dep')[delayType]) ->.;
# make sure the data is in the order we expect
.[order(.$year, .$month, .$day, .$delayType), , drop=FALSE] -> summary2
# clean out the row names for clarity of presentation
rownames(summary2) <- NULL
dim(summary2)
## [1] 98 5
head(summary2)
## year month day delayType delayMinutes
## 1 2013 1 16 arr 34.24736
## 2 2013 1 16 dep 24.61287
## 3 2013 1 31 arr 32.60285
## 4 2013 1 31 dep 28.65836
## 5 2013 2 11 arr 36.29009
## 6 2013 2 11 dep 39.07360
ggplot(data= summary2, mapping=aes(x=delayMinutes, color=delayType)) +
geom_density() +
ggtitle(paste("distribution of mean arrival and departure delays by date",
"when either mean delay is at least 30 minutes", sep='n'),
subtitle = "produced by: base/stats packages plus Bizarro Pipe")

print(all.equal(as.data.frame(summary1),summary2))
## [1] TRUE
The above work-flow is a bit rough, but the simple introduction of a few light-weight wrapper functions would clean up the code immensely.
The ugliest bit is the by-hand replacement of the group_by()/summarize() pair, so that would be a good candidate to wrap in a function (either full split/apply/combine style or some specialization such as grouped ordered apply).
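For instance, a minimal sketch of such a wrapper (the helper name grouped_summarize is our own, hypothetical choice), built only from base split/lapply/rbind:
grouped_summarize <- function(d, group_cols, f) {
  # split the data.frame into groups keyed by the grouping columns
  pieces <- split(d, d[group_cols], drop = TRUE)
  # apply the user-supplied summary function to each group, then bind rows
  do.call(rbind, lapply(pieces, f))
}
With a helper of this shape, the group_by()/summarize() simulation above collapses to a single, readable call.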
The reshape step is also a bit rough, but I like the explicit specification of the idvar columns (without these, the person reading the code has little idea what the structure of the intended transform is). This is why, even though I prefer the tidyr::gather() implementation to stats::reshape(), I chose to wrap tidyr::gather() into a more teachable "coordinatized data" signature (the idea is: explicit grouping columns were a good idea for summarize(), and they are also a good idea for pivot/un-pivot).
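As an illustration only (a hypothetical helper of our own, not the wrapper just mentioned), such a signature might look like an un-pivot function that forces the caller to name the id columns explicitly:
unpivot <- function(d, id_cols, measure_cols,
                    key_name = 'key', value_name = 'value') {
  res <- stats::reshape(d[c(id_cols, measure_cols)],
                        idvar = id_cols,
                        direction = 'long',
                        varying = measure_cols,
                        timevar = key_name,
                        v.names = value_name)
  # reshape() encodes the measured column as an ordinal; map back to the names
  res[[key_name]] <- measure_cols[res[[key_name]]]
  rownames(res) <- NULL
  res
}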
Also, the use of expressions such as ".$year" is probably not a bad thing; dplyr itself is introducing "data pronouns" to try and reduce ambiguity, and would write some of these expressions as ".data$year". In fact dplyr also allows notations such as "mtcars %>% select(.data["disp"])"; so such notation does have its place.
Conclusion
R itself is very powerful. That is why additional powerful notations and powerful conventions can be built on top of R. R also, for all its warts, has always been a platform for statistics and analytics. So: for common data manipulation tasks you should expect that R in fact has some ready-made tools.
It is often said "R is its packages", but I think that misses how much R packages owe back to design decisions found in "base-R".
Note: I have gotten some appropriate and correct criticism for not having traced important influences. I want to apologize for having presented a myopic view of "context."
My goal was to remind people of the existence of R itself and to point out that, with the proper conventions, you already can write R in transformational style (and you need conventions in any big language). That being said, I was remiss not to mention data.table: a high-performance package that has been around for about 11 years, is very much preferred by some large R groups (Google), already has a high-performance work-alike for data.frame, and already has a transformational query language (group, update, join, and so on; please see here).
Professor Hadley Wickham has taken issue with my writing:
Evidently that is not the case and the notation was to be considered as relevant in a specific context. I have corrected the current copy of the article.
The notation was in fact suggested or recommended to me by him in an issue report that was already closed after a comment of mine that included the text "all my questions are now answered." This is why I in good faith thought I could describe it as "recommended."
(Also: I tend to move code fluidly between scripts, functions, and packages. When I indicated "I was not asking about those things", I meant I already knew how to properly wrap code, not that I was not interested in functions and packages. But as we see, readers do not always take away exactly what writers may have intended.)