I currently advise
dplyr users to take care when using multiple assignment
dplyr::mutate() commands on databases.
(image: Kingroyos, Creative Commons Attribution-Share Alike 3.0 Unported License)
In this note I exhibit a troublesome example, and a systematic solution.
First let’s set up
dplyr, our database, and some example data.
## ## Attaching package: 'dplyr' ## The following objects are masked from 'package:stats': ## ## filter, lag ## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union
##  '0.7.4'
##  '1.2.0'
db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") d <- dplyr::copy_to( db, data.frame(xorig = 1:5, yorig = sin(1:5)), "d")
Now suppose somewhere in one of your projects somebody (maybe not even you) has written code that looks somewhat like the following.
d %>% mutate( delta = 0, x0 = xorig + delta, y0 = yorig + delta, delta = delta + 1, x1 = xorig + delta, y1 = yorig + delta, delta = delta + 1, x2 = xorig + delta, y2 = yorig + delta ) %>% select(-xorig, -yorig, -delta) %>% knitr::kable()
Notice the above gives an incorrect result: all of the
x_i columns are identical, and all of the
y_i columns are identical. I am not saying the above code is in any way desirable (though something like it does arise naturally in certain test designs). If this is truly “incorrect
dplyr code” we should have seen an error or exception. Unless you can be certain you have no code like that in a database backed
dplyr project: you can not be certain you have not run into the problem producing silent data and result corruption.
The issue is:
dplyr on databases does not seem to have strong enough order of assignment statement execution guarantees. The running counter “
delta” is taking only one value for the entire lifetime of the
dplyr::mutate() statement (which is clearly not what the user would want).
The fix is: break up the
dplyr::mutate() into a series of smaller mutates that don’t exhibit the problem. It is a trade-off breaking up
dplyr::mutate() on a database causes deeper statement nesting, and potential loss of performance. However, correct results should come before speed.
One automated variation of the fix is to use
seplyr‘s statement partitioner.
seplyr can factor the large mutate in a minimal number of very safe sub-mutates (and use
dplyr to execute them).
d %>% seplyr::mutate_se( seplyr::quote_mutate( delta = 0, x0 = xorig + delta, y0 = yorig + delta, delta = delta + 1, x1 = xorig + delta, y1 = yorig + delta, delta = delta + 1, x2 = xorig + delta, y2 = yorig + delta )) %>% select(-xorig, -yorig, -delta) %>% knitr::kable()
The above notation is, however, a bit clunky for everyday use. We did not use the more direct
seplyr::mutate_nse() as we are (to lower maintenance effort) deprecating the direct non-standard evaluation methods in
seplyr in favor of code using
One can instead use
seplyr as a code inspecting and re-writing tool with
cat(seplyr::factor_mutate( delta = 0, x0 = xorig + delta, y0 = yorig + delta, delta = delta + 1, x1 = xorig + delta, y1 = yorig + delta, delta = delta + 1, x2 = xorig + delta, y2 = yorig + delta ))
Warning in seplyr::factor_mutate(delta = 0, x0 = xorig + delta, y0 = yorig + : Mutate should be split into more than one stage. mutate(delta = 0) %>% mutate(x0 = xorig + delta, y0 = yorig + delta) %>% mutate(delta = delta + 1) %>% mutate(x1 = xorig + delta, y1 = yorig + delta) %>% mutate(delta = delta + 1) %>% mutate(x2 = xorig + delta, y2 = yorig + delta)
seplyr::factor_mutate() both issued a warning and produced the factored code snippet seen above. We think this is in fact a different issue than explored in our prior note on dependency driven result corruption, and fixes for the first issue did not fix this issue last time we looked.
And that why to continue to be careful when using multi assignment
dplyr::mutate() statements with database backed data.
Categories: Coding Programming Tutorials
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
Honestly dplyr is solving problems that really do not to exist. Everything done with dplyr can be done as efficiently with standard and basic R code.
There are good reasons dplyr is a very popular package. For example the grouped summarize facility is very useful.
Our new rquery package is built with this sort of dependence in mind.