Evolving R Tools and Practices

One of the distinctive features of the R platform is how explicit and user controllable everything is. This allows the style of use of R to evolve fairly rapidly. I will discuss this and end with some new notations, methods, and tools I am nominating for inclusion into your view of the evolving “current best practice style” of working with R.


Let’s place R (or the S programming language) into context.

Strict Languages

Often computer programming language semantics are effectively described by use of analogy that separates the user-observable behavior from the implementation.

For example, it makes sense to say that in C++ the choice of which implementation runs during a method call is made as if a search were performed at runtime across the C++ object type hierarchy until a match is found. In practice the C++ compiler implements this dynamic dispatch as a reference to a hidden data structure (one that is not visible to the programmer) called a vtable. This leads me to say that languages like C++ and Java implement strong object oriented programming, as these languages work hard to enforce meaningful invariants and hide implementation details from the user.

Tolerant Languages

In the Python programming language we also see object oriented semantics, but the implementation details are somewhat user visible because the programmer has direct access to the structures that implement the object oriented effects (such as: self, __dict__, __doc__, __name__, __module__, __bases__). The object oriented semantics of Python are defined in terms of lookups against these structures, which are user visible (and alterable). So in some sense we can say Python’s object semantics rely on convention (the convention being that users don’t mess with the “__*__” structures too much).

Wild Languages

Then we get to the case of R where everything is user visible. In R almost nothing is implemented “as if” a given lookup is performed; the described lookup is almost always explicit, user visible, and alterable. For example R’s common object oriented system S3 is visibly implemented as pasting method names together with class names (such as the method summary being specialized to models of class lm by declaring a function named “summary.lm”). And to invoke dynamic dispatch there must be an explicit base function that itself calls “UseMethod()” to re-route the method call.
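A minimal sketch of this mechanism follows. The generic describe and the class reading are hypothetical names invented for illustration; only UseMethod(), structure(), and class() are actual R functions.

```r
# Hypothetical generic: its only job is to call UseMethod(), which
# re-routes the call based on class(x).
describe <- function(x, ...) UseMethod("describe")

# Methods are ordinary functions named <generic>.<class>.
describe.reading <- function(x, ...) paste("reading of", x$value)
describe.default <- function(x, ...) "no special method"

# An object is just data plus a user-settable class attribute.
r <- structure(list(value = 42), class = "reading")

via_dispatch <- describe(r)          # routed to describe.reading
direct_call  <- describe.reading(r)  # the method is also callable by name

# The class attribute is alterable, and dispatch follows it.
class(r) <- "something_else"
after_reclass <- describe(r)         # falls through to describe.default
```

Nothing here is hidden from the user: the method is a plainly named function, and changing the class attribute changes which function gets called.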

Further, under R’s “everything is a function” rubric, things you would think are language constructs controlled by the interpreter are actually user visible (and modifiable) functions and operators. For an example see the “evil rebind parenthesis” example found here.
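A small sketch in the same spirit (my own illustration here, not necessarily identical to the linked example). The variable names are made up, and the rebinding is confined to a local() block so it does not leak into the rest of the session.

```r
# Operators and control structures can be called like any other function.
three  <- `+`(1, 2)                        # same as 1 + 2
branch <- `if`(TRUE, "yes", "no")          # same as if (TRUE) "yes" else "no"
second <- sapply(list(1:3, 4:6), `[`, 2)   # `[` passed as a plain function

# Even the parenthesis is a function, so it can be rebound.
shadowed <- local({
  `(` <- function(x) x + 1   # rebind "(" inside this block only
  (2)                        # now calls the rebound version
})

# Outside the local() block the usual definition is still in force.
unshadowed <- (2)
```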

R’s user visible semantics are wholly a matter of convention, as they stand only so long as nothing has been tinkered with yet.

So Why Does R Work?

Language extensions that would require the cooperation of the core development team in most languages can be implemented through user definable functions and packages in R. This means users can re-define and extend the R language pretty much at will. Given this extreme malleability of the R runtime it is legitimate to ask: “why hasn’t R fractured into a million incompatible domain specific languages and died?”

I think R’s survival and success stems from four things:

  1. Most R users have the same goal: analyzing data. So they are mostly working in the same domain.
  2. The open nature of the R ecosystem allows competitive evolution of notations and language extensions. We retain the winning ideas and paradigms, regardless of their original source.
  3. R is probably a lot less constant than we choose to perceive it to be. Package maintainers work hard so “things just work” and continue to do so over time.
  4. The amazing efforts of open-source non-profits such as: CRAN, The R Foundation, and The R Consortium.

Some relevant examples that help illustrate how the R ecosystem works include:

  • Konrad Rudolph has a very clever re-binding of -> as function abstraction or lambda introduction allowing code like the following:
    sapply(1 : 4, x -> 2 * x)
    ## [1] 2 4 6 8

    Unfortunately this is incompatible with any code that uses either of “<-” or “->” for assignment (you lose both as the R parser perversely aliases both symbols together). This incompatibility is why, even though this is a neat effect, we don’t see a large sub-population coding in this style.

  • Stefan Milton Bache and Hadley Wickham’s magrittr pipe rapidly rose to prominence as it didn’t break previous code (using a previously uncommon operator glyph “%>%”). It also made explicit the pre-existing property that most common R analysis functions can already be considered as transforms on their first argument (all other arguments being controls or parameters). Some of this consistency is due to the first-argument dispatch of R’s S3 object system.
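Both points can be checked in base R. The sketch below first shows the parser rewriting “->” into “<-”, then defines a hypothetical infix operator %then% (this is not magrittr’s %>%, only an illustration of how a user-defined %...% operator can be added without disturbing existing code).

```r
# The parser turns right-assignment into left-assignment, so the two
# expressions are the same object once parsed.
same_parse <- identical(quote(x -> y), quote(y <- x))
head_name  <- as.character(quote(x -> y)[[1]])   # the call's function name

# Hypothetical operator: pass the left-hand value to the right-hand function.
# Any name of the form %...% is available for user-defined infix operators.
`%then%` <- function(lhs, rhs) rhs(lhs)

piped <- c(1, 4, 9) %then% sqrt %then% sum   # same as sum(sqrt(c(1, 4, 9)))
```

Because “->” no longer exists after parsing, rebinding it necessarily means rebinding “<-”, whereas a fresh %...% glyph collides with nothing already in use.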

What Am I Advocating?

The uses of R’s plasticity that my group (Win-Vector LLC) distributes, educates on, and advocates include the following:

  • The vtreat package that prepares noisy real-world data for predictive analytics in a statistically sound manner. If your analytics task has a “quantity to predict”, “dependent variable”, or “y” then you tend to get substantial improvements in quality of fit by applying the vtreat methodology (stronger than indicators/dummies, one-hot encoding, hashing, and non-signalling missing value imputation).

    We have a lot of material on vtreat but we suggest you start with our formal article on the package.

  • The replyr::DebugFnW wrapper function for capturing errors and greatly speeding up debugging in R (right now only in the GitHub development branch of the package). replyr::DebugFnW is extremely effective at capturing enough state to make debugging a breeze.

    I have a free video lecture here describing how to use the technique.

  • The “bizarro pipe” (or pseudo pipe) “->.;” for much easier step-debugging of dplyr pipelines.

    I have a free video lecture here illustrating the method.

  • “replyr::let” which makes programming over packages that prefer non-standard evaluation based (or argument capture based) interfaces (such as dplyr) much easier.

    Our group has a lot of writing on this topic, and I presented on this at the February 2017 BARUG meeting (recording of my rehearsal screencast).
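The bizarro pipe needs no package at all: “->.;” is just right-assignment into a variable named “.”, followed by a statement separator. Here is a minimal sketch, with base R functions standing in for dplyr verbs (the data frame and its column names are made up for illustration):

```r
# Illustrative data only.
d <- data.frame(x = 1:5, y = c(2, 4, 6, 8, 10))

# Each line stores its result in ".", and the next line reads "." back.
d ->.;
  subset(., x > 2) ->.;
  transform(., z = x * y) ->.;
  sum(.$z) -> total
```

Since every stage is an ordinary statement, you can run the pipeline one line at a time and inspect “.” after any step, which is what makes step-debugging easy.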


I think these techniques will make your work as an analyst or data scientist much easier. If this is the case I hope you will help teach and promote these methods.

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
