There are substantial differences between ad-hoc analyses (be they: machine learning research, data science contests, or other demonstrations) and production worthy systems. Roughly: ad-hoc analyses have to be correct only at the moment they are run (and often once they are correct, that is the last time they are run; obviously the idea of reproducible research is an attempt to raise this standard). Production systems have to be durable: they have to remain correct as models, data, packages, users, and environments change over time.
Demonstration systems need merely glow in bright light among friends; production systems must be correct, even alone in the dark.
“Character is what you are in the dark.” John Whorfin quoting Dwight L. Moody.
I have found: to deliver production worthy data science and predictive analytic systems, one has to develop per-team and per-project field tested recommendations and best practices. This is necessary even when, or especially when, these procedures differ from official doctrine.
What I want to do is share a single small piece of Win-Vector LLC‘s current guidance on using the
Disclaimer: Win-Vector LLC has no official standing with RStudio, or
“One need not have been Caesar in order to understand Caesar.”Alternately: Georg Simmmel or Max Webber.
Win-Vector LLC, as a consultancy, has experience helping large companies deploy enterprise big data solutions involving
Apache Spark. Win-Vector LLC, as a training organization, has experience in how new users perceive, reason about, and internalize how to use
dplyr. Our group knows how to help deploy production grade systems, and how to help new users master these systems.
From experience we have distilled a lot of best practices. And below we will share one.
From: “R for Data Science; Whickham, Grolemund; O’Reilly, 2017” we have:
Note that you can refer to columns that you’ve just created:mutate(flights_sml, gain = arr_delay - dep_delay, hours = air_time / 60, gain_per_hour = gain / hours )
Let’s try that with database backed data:
suppressPackageStartupMessages(library("dplyr")) packageVersion("dplyr") #  ‘0.7.3’ db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") flights <- copy_to(db, nycflights13::flights, 'flights') mutate(flights, gain = arr_delay - dep_delay, hours = air_time / 60, gain_per_hour = gain / hours ) # # Source: lazy query [?? x 22] # # Database: sqlite 3.19.3 [:memory:] # year month day dep_time sched_dep_time ... # <int> <int> <int> <int> <int> ... # 1 2013 1 1 517 515 ... # ...
That worked. One of the selling points of
dplyr is a lot of
dplyr is source-generic or source-agnostic: meaning it can be run against different data providers (in-memory, databases,
However, if a new user tries to extend such an example (say adding
gain_per_minutes) they run into this:
mutate(flights, gain = arr_delay - dep_delay, hours = air_time / 60, gain_per_hour = gain / hours, gain_per_minute = 60 * gain_per_hour ) # Error in rsqlite_send_query(conn@ptr, statement) : # no such column: gain_per_hour
(Some detail on the failing query are here.)
It is hard for experts to understand how frustrating the above is to a new
R user or to a part time
R user. It feels like any variation on the original code causes it to fail. None of the rules they have been taught anticipate this, or tell them how to get out of this situation.
This quickly leads to strong feelings of learned helplessness and anxiety.
Our rule for
dplyr::mutate() has been for some time:
Each column name used in a single mutate must appear only on the left-hand-side of a single assignment, or otherwise on the right-hand-side of any number of assignments (but never both sides, even if it is different assignments).
Under this rule neither of the above
mutates are allowed. The second should be written as (switching to pipe-notation):
flights %>% mutate(gain = arr_delay - dep_delay, hours = air_time / 60) %>% mutate(gain_per_hour = gain / hours) %>% mutate(gain_per_minute = 60 * gain_per_hour)
And the above works.
If we teach this rule we can train users to be properly cautious, and hopefully avoid them becoming frustrated, scared, anxious, or angry.
dplyr documentation (such as “
help(mutate)“) does not strongly commit to what order mutate expressions are executed in, or visibility and durability of intermediate results (i.e., a full description of intended semantics). Our rule intentionally limits the user to a set of circumstances where none of those questions matter.
Now the error we saw above is a mere bug that one expects will be fixed some day (in fact it is
dplyr issue 3095, we looked a bit at the generate queries here). It can be a bit unfair to criticize a package for having a bug.
However, confusion around re-use of column names has been driving
dplyr issues for quite some time:
It makes sense to work in a reliable and teachable sub-dialect of
dplyr that will serve users well (or barring that, you can use an adapter, such as
seplyr). In production you must code to what systems are historically reliably capable of, not just the specification. “Works for the instructor” is not an acceptable level of dependability.
Categories: Opinion Pragmatic Data Science Tutorials
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
Generally speaking, it feels like magrittr piping is going out of hand as dplyr becomes more popular. I encountered blog posts that included ten or more lines of piped dplyr commands, making reading, debugging, and extending the code difficult. I hope that this and other best-practice posts become more common.
I agree a best practice for debuggable code is to write a sequence you control.
Replacing nesting with sequencing is good for debugging and readability, but there were always ways to do that.
If the reason one is using pipes is: “it is hard to pick intermediate variable names” just use dot:
Or if you want fancy notation use Bizarro Pipe (and maybe pay somebody to add the Bizarro symbol a glyph font now that RStudio claims to support multi-characther glyph fonts).
dplyra lot. My group is in fact betting on it via our
Sparklyrpractice. Barring that, I always use the coding style the client wants. However, I am getting tired of zealots (charitably the outliers) calling perfectly good performant standard (or base)
R” (and criticizing when I happen to not use
dplyr). And I am very pleased with the stability and future-proof nature of projects I chose to write in idiomatic base
R(not having to go back and patch them as
My current advice is: use
seplyr::partition_mutate_se()(especially in light of these remaining `dplyr` `0.7.4` issues: 3222, 3223, and this).
We have a lot of new documentation on the new methods:
Mutate Partitioner package vignette
“Partition Mutate” article
“Partitioning Mutate, Example 2” (includes