There are substantial differences between ad-hoc analyses (be they: machine learning research, data science contests, or other demonstrations) and production worthy systems. Roughly: ad-hoc analyses have to be correct only at the moment they are run (and often once they are correct, that is the last time they are run; obviously the idea of reproducible research is an attempt to raise this standard). Production systems have to be durable: they have to remain correct as models, data, packages, users, and environments change over time.
Demonstration systems need merely glow in bright light among friends; production systems must be correct, even alone in the dark.
“Character is what you are in the dark.” John Whorfin quoting Dwight L. Moody.
I have found: to deliver production worthy data science and predictive analytic systems, one has to develop per-team and per-project field tested recommendations and best practices. This is necessary even when, or especially when, these procedures differ from official doctrine.
Disclaimer: Win-Vector LLC has no official standing with RStudio, or
“One need not have been Caesar in order to understand Caesar.”Alternately: Georg Simmmel or Max Webber.
Win-Vector LLC, as a consultancy, has experience helping large companies deploy enterprise big data solutions involving
Apache Spark. Win-Vector LLC, as a training organization, has experience in how new users perceive, reason about, and internalize how to use
dplyr. Our group knows how to help deploy production grade systems, and how to help new users master these systems.
From experience we have distilled a lot of best practices. And below we will share one.
From: “R for Data Science; Whickham, Grolemund; O’Reilly, 2017” we have:
Note that you can refer to columns that you’ve just created:mutate(flights_sml, gain = arr_delay - dep_delay, hours = air_time / 60, gain_per_hour = gain / hours )
Let’s try that with database backed data:
suppressPackageStartupMessages(library("dplyr")) packageVersion("dplyr") #  ‘0.7.3’ db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") flights <- copy_to(db, nycflights13::flights, 'flights') mutate(flights, gain = arr_delay - dep_delay, hours = air_time / 60, gain_per_hour = gain / hours ) # # Source: lazy query [?? x 22] # # Database: sqlite 3.19.3 [:memory:] # year month day dep_time sched_dep_time ... # <int> <int> <int> <int> <int> ... # 1 2013 1 1 517 515 ... # ...
That worked. One of the selling points of
dplyr is a lot of
dplyr is source-generic or source-agnostic: meaning it can be run against different data providers (in-memory, databases,
However, if a new user tries to extend such an example (say adding
gain_per_minutes) they run into this:
mutate(flights, gain = arr_delay - dep_delay, hours = air_time / 60, gain_per_hour = gain / hours, gain_per_minute = 60 * gain_per_hour ) # Error in rsqlite_send_query(conn@ptr, statement) : # no such column: gain_per_hour
(Some detail on the failing query are here.)
It is hard for experts to understand how frustrating the above is to a new
R user or to a part time
R user. It feels like any variation on the original code causes it to fail. None of the rules they have been taught anticipate this, or tell them how to get out of this situation.
This quickly leads to strong feelings of learned helplessness and anxiety.
Our rule for
dplyr::mutate() has been for some time:
Each column name used in a single mutate must appear only on the left-hand-side of a single assignment, or otherwise on the right-hand-side of any number of assignments (but never both sides, even if it is different assignments).
Under this rule neither of the above
mutates are allowed. The second should be written as (switching to pipe-notation):
flights %>% mutate(gain = arr_delay - dep_delay, hours = air_time / 60) %>% mutate(gain_per_hour = gain / hours) %>% mutate(gain_per_minute = 60 * gain_per_hour)
And the above works.
If we teach this rule we can train users to be properly cautious, and hopefully avoid them becoming frustrated, scared, anxious, or angry.
dplyr documentation (such as “
help(mutate)“) does not strongly commit to what order mutate expressions are executed in, or visibility and durability of intermediate results (i.e., a full description of intended semantics). Our rule intentionally limits the user to a set of circumstances where none of those questions matter.
Now the error we saw above is a mere bug that one expects will be fixed some day (in fact it is
dplyr issue 3095, we looked a bit at the generate queries here). It can be a bit unfair to criticize a package for having a bug.
However, confusion around re-use of column names has been driving
dplyr issues for quite some time:
It makes sense to work in a reliable and teachable sub-dialect of
dplyr that will serve users well (or barring that, you can use an adapter, such as
seplyr). In production you must code to what systems are historically reliably capable of, not just the specification. “Works for the instructor” is not an acceptable level of dependability.
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.