seplyr is an R package that provides a thin wrapper around elements of the dplyr package and (as of version 0.5.8) the tidyr package. The intent is to give the part-time R user the ability to easily program over functions from the popular dplyr and tidyr packages. Our assumption is always that a data scientist most often comes to R to work with data, not to tinker with the programming language itself.
Tools such as seplyr, wrapr or rlang are needed when you (the data scientist temporarily working on a programming sub-task) do not know the names of the columns you want your code to be working with. These are situations where you expect the column names to be made available later, in additional variables or parameters.
For example, suppose we have the following data, where each of two rows (identified by the “id” column) carries two measurements (in the columns “measurement1” and “measurement2”).
library("wrapr")

d <- build_frame(
  'id', 'measurement1', 'measurement2' |
  1   , 'a'           , 10             |
  2   , 'b'           , 20             )

print(d)
#   id measurement1 measurement2
# 1  1            a           10
# 2  2            b           20
Further suppose we wished to have each measurement in its own row (which is often required, such as when using the ggplot2 package to produce plots). In this case we need a tool to convert the data format. If we are doing this as part of an ad-hoc analysis (i.e. we can look at the data and find the column names at the time of coding) we can use tidyr to perform the conversion:
library("tidyr")

gather(d,
       key = value_came_from_column,
       value = value_was,
       measurement1, measurement2)
#   id value_came_from_column value_was
# 1  1           measurement1         a
# 2  2           measurement1         b
# 3  1           measurement2        10
# 4  2           measurement2        20
Notice, however, that all column names are specified in gather() without quotes. The names are taken from the unevaluated source code of the arguments to gather(), not from the values of those arguments. This is somewhat convenient for the analyst (they can skip writing a few quote marks), but it is a severe limitation for the script writer or programmer (who has trouble taking the names of columns from other sources).
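The difference between reading an argument's source text and using its value can be illustrated in base R. The following is only a minimal sketch of the idea, not how gather() is actually implemented; the helper names name_from_source() and name_from_value() are hypothetical and exist only for this illustration.

```r
# Hypothetical helper: mimics an interface that reads the column name
# from the unevaluated source text of its argument.
name_from_source <- function(col) {
  deparse(substitute(col))
}

# Hypothetical helper: a value-oriented interface that uses the
# argument's value as the column name.
name_from_value <- function(col) {
  col
}

# A column name that arrives in a variable.
col_var <- "measurement1"

from_source <- name_from_source(col_var)  # "col_var": the variable's own name
from_value <- name_from_value(col_var)    # "measurement1": the variable's contents

print(from_source)
print(from_value)
```

The first helper never sees the string stored in col_var, which is the kind of problem a programmer faces when column names are supplied in variables.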
seplyr now supplies a standard, value-oriented interface for gather(). With seplyr we can write code such as the following:
library("seplyr")

gather_se(d,
          key = "value_came_from_column",
          value = "value_was",
          columns = c("measurement1", "measurement2"))
#   id value_came_from_column value_was
# 1  1           measurement1         a
# 2  2           measurement1         b
# 3  1           measurement2        10
# 4  2           measurement2        20
This sort of interface is handy when the names of the columns are coming from elsewhere, in variables. Here is an example of that situation:
# pretend these assignments are done elsewhere
# by somebody else
key_col_name <- "value_came_from_column"
value_col_name <- "value_was"
value_columns <- c("measurement1", "measurement2")

# we can use the above values with
# code such as this
gather_se(d,
          key = key_col_name,
          value = value_col_name,
          columns = value_columns)
#   id value_came_from_column value_was
# 1  1           measurement1         a
# 2  2           measurement1         b
# 3  1           measurement2        10
# 4  2           measurement2        20
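Because gather_se() takes ordinary values, it can be wrapped in a function like any other R code. Here is a minimal sketch, assuming only the gather_se() arguments already used in this article (key, value, columns); the function name measurements_to_long() and its id_columns parameter are hypothetical, introduced just for this example.

```r
library("seplyr")

# Hypothetical helper: move every non-id column into key/value rows.
measurements_to_long <- function(data, id_columns,
                                 key = "value_came_from_column",
                                 value = "value_was") {
  # The columns to gather are computed at run time,
  # not typed into the source code.
  value_columns <- setdiff(colnames(data), id_columns)
  gather_se(data,
            key = key,
            value = value,
            columns = value_columns)
}

# Small example frame mirroring the data used in this article.
d_example <- data.frame(id = c(1, 2),
                        measurement1 = c("a", "b"),
                        measurement2 = c(10, 20),
                        stringsAsFactors = FALSE)

long_example <- measurements_to_long(d_example, id_columns = "id")
print(long_example)
```

The caller never needs to know which measurement columns are present; they are discovered from the data and passed along as plain character vectors.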
There are ways to use gather() with “to be named later” column names directly, but they are not simple: they needlessly force the user to master a number of internal implementation details of rlang and dplyr. From the documentation and “help(gather)” we can deduce at least three related “pure tidyeval/rlang” solutions for programming over gather():
# possibly the solution hinted at in help(gather)
gather(d,
       key = !!key_col_name,
       value = !!value_col_name,
       dplyr::one_of(value_columns))

# concise rlang solution
gather(d,
       key = !!key_col_name,
       value = !!value_col_name,
       !!!value_columns)

# fully qualified rlang solution
gather(d,
       key = !!rlang::sym(key_col_name),
       value = !!rlang::sym(value_col_name),
       !!!rlang::syms(value_columns))
In all cases the user must prepare and convert values for use. Really this is showing gather() does not conveniently expect parametric columns (column names supplied by variables or parameters), but will accept a work-around if the user re-codes column names in some way (some combination of quoting and de-quoting). With “gather_se()” the tool expects to take values and the user does not have to make special arrangements (or remember special notation) to supply them.
Our advice for analysts is:
- If your goal is to work with data: use a combination of wrapr::let() (a preferred, user-friendly solution that in fact pre-dates rlang) and seplyr (a data-friendly wrapper over dplyr and tidyr functions).
- If your goal is to write an article about rlang: then use rlang.
- If you are interested in more advanced data manipulation, please check out our cdata package (video introduction here). The cdata equivalent of the above transform is:
library("cdata")

control_table <- build_unpivot_control(
  nameForNewKeyColumn = key_col_name,
  nameForNewValueColumn = value_col_name,
  columnsToTakeFrom = value_columns)

rowrecs_to_blocks(d,
                  control_table,
                  columnsToCopy = "id")
#   id value_came_from_column value_was
# 1  1           measurement1         a
# 2  1           measurement2        10
# 3  2           measurement1         b
# 4  2           measurement2        20
- If you need high in-memory performance: try data.table.
In addition to wrapping a number of dplyr functions and tidyr::gather()/tidyr::spread(), seplyr 0.5.8 now also wraps tidyr::complete() (thanks to a contribution from Richard Layton).
We hope you try seplyr out both in your work and in your teaching.
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.