A note to dplyr with database users: you may benefit from inspecting/re-factoring your code to eliminate value re-use inside dplyr::mutate() statements.
If you are using the R dplyr package with a database or with Apache Spark: I respectfully advise you inspect your code to ensure you are not using any values created inside a dplyr::mutate() statement inside the same dplyr::mutate() statement. This has been my coding advice for some time, and it is a simple and safe re-factoring to break up such statements into safer sequences (simply by introducing more dplyr::mutate()s).
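A minimal sketch of the pattern to look for and the suggested re-factoring may help. The table d and the columns x, y, and z are hypothetical names chosen only for illustration. The example uses an in-memory data frame, where both forms agree, so it shows only the shape of the code and does not reproduce the database-side issue.

```r
library(dplyr)

# Hypothetical example data; in practice this would be a database or Spark table handle.
d <- data.frame(x = c(1, 2, 3))

# Pattern to look for: y is created and then re-used inside the same mutate().
risky <- d %>%
  mutate(y = x + 1,
         z = y * 2)

# Re-factored: each mutate() only uses columns that existed before it started.
safer <- d %>%
  mutate(y = x + 1) %>%
  mutate(z = y * 2)
```

The re-factored pipeline computes the same columns; it only moves each dependent assignment into its own dplyr::mutate() step.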
I have since encountered a non-signaling (or silent) result corruption version of the issue. We are now advising code inspection as we now have confirmation that not seeing a thrown error is not a reliable indication of correct execution and correct results.
To keep things in proportion: if you are not writing multi-assignment mutates on a dplyr database-backed system you can’t run into the problem (though, for performance, multi-statement mutates are preferred over database sources such as Apache Spark).
The issue has been reported to the dplyr team, and I presume a fix is in the works. However, one does not want to be distributing incorrect results in the interim. This is the advice I have been giving private clients. After some thought I have come to feel it would be unfair to withhold such advice from the larger R community. This is not meant to make dplyr look bad, but to try and help prevent both dplyr and dplyr users from unnecessarily looking bad.
To be clear: I am a proponent of dplyr plus database development (which is why I ran into this). Also, I am not affiliated with RStudio or with the dplyr development team.
jmount
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
Good to know and not terribly surprising. dplyr is like a semantically hyper-advanced SQL that lacks a query optimizer. Databases are the opposite. Takes a lot of work to make those two things work under a single syntax.
Could you provide an illustration of the types of errors you encountered? Also a snippet of problematic versus refactored code? I realize that it may not be feasible to post a reproducible example, but having more information would help others determine if they were encountering this issue.
Examples: here, here, and here. What to look for is any re-use of a value created in a dplyr::mutate() inside the same dplyr::mutate(). The solution is to separate such a dplyr::mutate() into more than one dplyr::mutate().