A note to dplyr with database users: you may benefit from inspecting/re-factoring your code to eliminate value re-use inside dplyr::mutate() statements.
If you are using the R dplyr package with a database or with Apache Spark: I respectfully advise you to inspect your code to ensure you are not using any value created inside a dplyr::mutate() statement within that same dplyr::mutate() statement. This has been my coding advice for some time, and breaking such statements up into safer sequences is a simple and safe re-factoring (simply introduce more dplyr::mutate()s).
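As a minimal sketch of the pattern to look for and of the re-factoring (the data frame `d` and the columns `x`, `y`, and `z` are hypothetical, and a local data frame stands in here for a database-backed table, so this illustrates the shape of the code rather than reproducing the issue):

```r
library(dplyr)

# Hypothetical example data; a local data frame is used only for illustration.
# In practice `d` would be a database- or Spark-backed table.
d <- data.frame(x = c(1, 2, 3))

# Pattern to look for: `y` is created and then re-used
# inside the same mutate() statement.
risky <- d %>%
  mutate(y = x + 1,
         z = y * 2)

# Re-factored: each mutate() only uses columns that
# already existed before that mutate() started.
safer <- d %>%
  mutate(y = x + 1) %>%
  mutate(z = y * 2)
```

On a local data frame both forms return the same values; the point of the re-factoring is that the second form does not depend on how a database back end handles value re-use within a single statement.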
I have since encountered a non-signaling (or silent) result-corruption version of the issue. I am now advising code inspection because we now have confirmation that the absence of a thrown error is not a reliable indication of correct execution and correct results.
To keep things in proportion: if you are not writing multi-assignment mutates on a dplyr database-backed system you can’t run into the problem (though, for performance, multi-statement mutates are preferred over database sources such as Apache Spark).
The issue has been reported to the dplyr team, and I presume a fix is in the works. However, one does not want to be distributing incorrect results in the interim. This is the advice I have been giving private clients. After some thought I have come to feel it would be unfair to withhold such advice from the larger R community. This is not meant to make dplyr look bad, but to try to help prevent both dplyr and dplyr users from unnecessarily looking bad.
To be clear: I am a proponent of dplyr plus database development (which is why I ran into this). Also, I am not affiliated with RStudio or with the dplyr development team.
Categories: Opinion
jmount
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
Good to know and not terribly surprising. dplyr is like a semantically hyper-advanced SQL that lacks a query optimizer. Databases are the opposite. Takes a lot of work to make those two things work under a single syntax.
Could you provide an illustration of the types of errors you encountered? Also a snippet of problematic versus refactored code? I realize that it may not be feasible to post a reproducible example, but having more information would help others determine if they were encountering this issue.
Two examples: here and here. What to look for is any re-use of a value created in a dplyr::mutate() inside the same dplyr::mutate(). The solution is to separate such a dplyr::mutate() into more than one dplyr::mutate().