I know I write a lot about coding in
R. But it is in the service of supporting statistics, analysis, predictive analytics, and data science.
(Adapted from Ben Katchor’s Julius Knipl, Real Estate Photographer: Stories, Little, Brown, and Company, 1996, page 72, “Excursionist Drama 2”.)
R without data is like going to the theater to watch the curtain go up and down.
Usually you come to
R to work with data. If you think and plan in terms of data and values (including introducing more data to control processing) you will usually work in much faster, explainable, and maintainable fashion.
A simple example
Let’s start with a typical
dplyr example. Suppose we wish to select two columns (in this case
c("name", "height")) from a
data.frame (in this case
dplyr::starwars). This is accomplished easily as we show below.
library("dplyr")

starwars %>% select(name, height)
# # A tibble: 87 x 2
#    name               height
#    <chr>               <int>
#  1 Luke Skywalker        172
#  2 C-3PO                 167
#  3 R2-D2                  96
#  4 Darth Vader           202
#  5 Leia Organa           150
#  6 Owen Lars             178
#  7 Beru Whitesun lars    165
#  8 R5-D4                  97
#  9 Biggs Darklighter     183
# 10 Obi-Wan Kenobi        182
# # ... with 77 more rows
In practice we recommend coding only after you have decided what you are going to do, and what parameters specify your steps.
Once you get to coding, in our opinion intent is much clearer if you organize your code to make things explicit. For example, if you are working with
magrittr pipes: make the pipe input argument explicit with “.” (please see R Tip: Make Arguments Explicit in magrittr/dplyr Pipelines). And if you are working with
dplyr::select(): make the argument roles explicit. We suggest collecting the column names into a separate group to show their role is different than the role of the incoming
data.frame. At first this explicitness unfortunately reduces legibility, as our code then looks like the following.
starwars %>%
  select(., one_of(c("name", "height")))
Note this is not a criticism of
one_of(), it is a discomfort of needing something like
one_of(). And I fully admit: the popular
dplyr style of not including the first argument in pipelines does not have the legibility problem; I myself introduced that problem by insisting on an explicit data argument. However, I have found that explicit arguments make it much easier for students to learn how to use
dplyr functions simultaneously inside and outside pipelines. I also feel the explicit documentation of arguments has a number of down-stream advantages.
Minimize your reliance on implicit convention. What is obvious to you when writing the code may not be obvious to others, and may be something you don’t remember later. Along these lines we have a mini-style guide for effectively using
dplyr with and without pipelines here.
Our specific legibility issue is just a matter of the nested “
one_of(c("", ...))” construct being a bit clumsy. If we use an adapted version of
select() that expects the list of columns to come in as a vector (as is typical for values in
R) and use a vector constructor that does not need the quotes (such as
qc(), please see R Tip: Use qc() For Fast Legible Quoting) we get a pipeline that is both very explicit (so more self-documenting) and quite convenient and legible:
library("wrapr")
library("seplyr")

starwars %>%
  select_se(., qc(name, height))
select_se() stands for “select standard evaluation”, meaning it is an adaptation of
select() that expects to be supplied the set of columns as a vector value. This function has a two-argument interface (data and vector of columns) and is simple to describe and reason about.
qc() itself is a non-standard (or name capturing) interface. This is all
qc() does, so it documents the user’s intent to capture names. If one does not mind the quotes one can avoid
qc() entirely and write code such as the following.
columns <- c("name", "height")
select_se(starwars, columns)
The above is simple, as it should be.
select_se() is a function that expects two values and we call it supplying two values. This may seem less magical than “
starwars %>% select(name, height)” (which involves piping, hidden function arguments, and name capture), and if so that is a good thing. Selecting a few columns is a basic task, so it should not require a lot of cognitive load.
Even better than more variations on tool interfaces are more tools to capture values that can be used and re-used many ways later.
Value capturing tools
Our group has been developing some simple tools for conveniently capturing values from the user. The idea is with these you get most of the convenience of having non-standard interfaces in many places, without the additional complexity of depending on non-standard interfaces being everywhere.
“The trouble with nonstandard evaluation is that it doesn’t follow standard evaluation rules …”
— Peter Dalgaard (about nonstandard evaluation in the curve() function), R-help (June 2011), as quoted in the fortunes package.
Standard evaluation interfaces (or value oriented interfaces) are generally preferred because their primary property is referential transparency. Referential transparency is when expressions can be replaced by their evaluated values without changing outcomes. Sequentially replacing expressions with values is program evaluation.
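As a concrete illustration of referential transparency (kept in base R so it is self-contained): a name holding a value can be swapped for the value itself without changing the outcome.

```r
# Referential transparency: replacing an expression (the name `cols`)
# with the value it evaluates to leaves the result unchanged.
d <- data.frame(name = c("a", "b"), height = c(1, 2), mass = c(5, 6))
cols <- c("name", "height")

r1 <- d[, cols]                  # select via the name
r2 <- d[, c("name", "height")]   # select via the literal value
identical(r1, r2)
# [1] TRUE
```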
But, away from theory in the large and back to programming in the small. Let’s conclude with a few tools that make constructing useful values easier.
We have already seen
qc() (“quoting concatenate”) in action. It is used as follows.
v <- qc(name, height)

print(v)
# [1] "name"   "height"

dput(v)
# c("name", "height")
qc() can also be used to construct named vectors, which are very useful as maps.
map <- qc(a = A, b = B)

print(map)
#   a   b
# "A" "B"
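To see why named vectors make handy maps, here is a base-R illustration (using c() in place of qc() so it runs without extra packages): indexing a named vector by a character vector performs a lookup, i.e. recodes each key through the map.

```r
# A named vector acts as a lookup table or map.
map <- c(a = "A", b = "B")   # the same value qc(a = A, b = B) builds

keys <- c("b", "a", "a")
unname(map[keys])            # recode each key through the map
# [1] "B" "A" "A"
```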
We also have a “print as paste-able code” function
map_to_char(), which is a bit more convenient (for simple structures) than dput().

dput(map)
# structure(c("A", "B"), .Names = c("a", "b"))

map_to_char(map)
# [1] "c('a' = 'A', 'b' = 'B')"
We also have
build_frame(), which is a convenience for typing in simple small
data.frames directly in row-oriented form (similar in intent to tibble::tribble()).
d <- build_frame(
   "name", "value" |
   "a"   , 1       |
   "b"   , 2       )

print(d)
#   name value
# 1    a     1
# 2    b     2
The end of the first row is indicated by almost any infix operator (we used “|”). More details on working with
build_frame() can be found here.
The draw_frame() function can render small simple
data.frames into paste-able form. This is a great way to capture and share examples (without dates or other complex or annotated types).
cat(draw_frame(d))
# build_frame(
#    "name", "value" |
#    "a"   , 1       |
#    "b"   , 2       )

dput(d)
# structure(list(name = c("a", "b"), value = c(1, 2)), .Names = c("name",
# "value"), row.names = c(NA, -2L), class = "data.frame")
Strip off the comment
#-marks and you can paste the
draw_frame() presentation into other work as legible code.
For data.frames that are purely string valued, we have
qchar_frame(), which is essentially a quoting version of build_frame() (no quote marks needed on the entries).
d <- qchar_frame(
   name, value |
   a   , x     |
   b   , y     )

print(d)
#   name value
# 1    a     x
# 2    b     y

cat(draw_frame(d))
# build_frame(
#    "name", "value" |
#    "a"   , "x"     |
#    "b"   , "y"     )
In conclusion: sometimes when you think you need more code, you actually just need to move more of your intent into data and values. In
R it pays to treat as much as you can as values (data, selections, configuration, and even results).
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.