A note to dplyr with database users: you may benefit from inspecting/re-factoring your code to eliminate value re-use inside dplyr::mutate() statements
If you are using the dplyr package with a database or with Apache Spark: I respectfully advise you to inspect your code to ensure you are not using any value created inside a dplyr::mutate() statement within that same dplyr::mutate() statement. This has been my coding advice for some time, and it is a simple and safe re-factoring to break up such statements into safer sequences (simply by introducing more dplyr::mutate() statements).
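As a minimal sketch of the pattern described above: the data frame d and the column names x, y and z below are hypothetical, chosen only for illustration. On a local in-memory data frame both forms are expected to agree; the advice concerns the same code run against a database or Spark table.

```r
library(dplyr)

# Hypothetical example data; the column names are illustrative only.
d <- data.frame(x = c(1, 2, 3))

# Pattern to look for: y is created and then re-used
# inside the same mutate() call.
risky <- d %>%
  mutate(y = x + 1,
         z = y * 2)

# Re-factored form: each mutate() only uses columns that
# already existed before that mutate() ran.
safe <- d %>%
  mutate(y = x + 1) %>%
  mutate(z = y * 2)
```

The re-factored form costs one extra pipeline step and does not change the intended result.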
I have since encountered a non-signaling (or silent) result-corruption version of the issue. We are now advising code inspection because we have confirmation that the absence of a thrown error is not a reliable indication of correct execution and correct results.
To keep things in proportion: if you are not writing multi-assignment mutates on a dplyr database-backed system you can’t run into the problem (though, for performance, multi-statement mutates are preferred over database sources such as
The issue has been reported to the dplyr team, and I presume a fix is in the works. However, one does not want to be distributing incorrect results in the interim. This is the advice I have been giving private clients. After some thought I have come to feel it would be unfair to withhold such advice from the larger R community. This is not meant to make dplyr look bad, but to try to help prevent dplyr users from unnecessarily looking bad.
To be clear: I am a proponent of dplyr plus database development (which is why I ran into this). Also, I am not affiliated with RStudio or with the dplyr development team.
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
Good to know and not terribly surprising. dplyr is like a semantically hyper-advanced SQL that lacks a query optimizer. Databases are the opposite. Takes a lot of work to make those two things work under a single syntax.
Could you provide an illustration of the types of errors you encountered? Also a snippet of problematic versus refactored code? I realize that it may not be feasible to post a reproducible example, but having more information would help others determine if they were encountering this issue.
Two examples: here and here. What to look for is any re-use of a value created in a dplyr::mutate() inside the same dplyr::mutate(). The solution is to separate such a dplyr::mutate() into more than one dplyr::mutate().
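For a database-backed table, a small check along these lines may help. It is only a sketch under stated assumptions: it uses an in-memory SQLite database through the DBI, RSQLite and dbplyr packages as a stand-in for a real database, and the table name "d" and the columns x, y and z are hypothetical. It runs the separated mutates remotely and compares the collected result with the same calculation done locally.

```r
library(dplyr)

# Assumption: an in-memory SQLite database stands in for the real one.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical example data, copied to a remote table named "d".
d_local <- data.frame(x = c(1, 2, 3))
d_db <- copy_to(con, d_local, name = "d")

# Separated mutates on the database-backed table; collect() brings
# the result back into R.
res_db <- d_db %>%
  mutate(y = x + 1) %>%
  mutate(z = y * 2) %>%
  collect()

# The same calculation on the local copy, for comparison.
res_local <- d_local %>%
  mutate(y = x + 1) %>%
  mutate(z = y * 2)

# Compare the remote and local values column by column.
agree <- isTRUE(all.equal(res_db$y, res_local$y)) &&
  isTRUE(all.equal(res_db$z, res_local$z))

DBI::dbDisconnect(con)
```

Comparing against a local calculation on a small sample is one way to catch a silent discrepancy, since no error is thrown in that case.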