
Tidyverse users: gather/spread are on the way out

From https://twitter.com/sharon000/status/1107771331012108288:

[Image: screenshot of the tweet]

From https://tidyr.tidyverse.org/dev/articles/pivot.html (text by Hadley Wickham):

For some time, it’s been obvious that there is something fundamentally wrong with the design of spread() and gather(). Many people don’t find the names intuitive and find it hard to remember which direction corresponds to spreading and which to gathering. It also seems surprisingly hard to remember the arguments to these functions, meaning that many people (including me!) have to consult the documentation every time.

There are two important new features inspired by other R packages that have been advancing reshaping in R:

  • The reshaping operation can be specified with a data frame that describes precisely how metadata stored in column names becomes data variables (and vice versa). This is inspired by the cdata package by John Mount and Nina Zumel. For simple uses of pivot_long() and pivot_wide(), this specification is implicit, but for more complex cases it is useful to make it explicit, and operate on the specification data frame using dplyr and tidyr.
  • pivot_long() can work with multiple value variables that may have different types. This is inspired by the enhanced melt() and dcast() functions provided by the data.table package by Matt Dowle and Arun Srinivasan.
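To make the first point concrete, here is a minimal sketch of a reshaping specification in tidyr. Note the assumptions: the quoted draft uses the development-time names pivot_long() and pivot_wide(), while released tidyr (1.0.0 and later) calls them pivot_longer() and pivot_wider(), and the sketch uses the released names. The data frame `scores` and its columns are hypothetical and exist only for illustration.

```r
library(tidyr)

# Hypothetical example data: one row per student, one column per test.
scores <- data.frame(
  id = c(1, 2),
  test1 = c(90, 80),
  test2 = c(85, 75)
)

# Implicit specification: column names become values of the new "test" column.
long <- pivot_longer(
  scores,
  cols = c(test1, test2),
  names_to = "test",
  values_to = "score"
)

# Explicit specification: the same operation described as a data frame,
# which can be inspected or edited before it is applied.
spec <- build_longer_spec(
  scores,
  cols = c(test1, test2),
  names_to = "test",
  values_to = "score"
)
long_from_spec <- pivot_longer_spec(scores, spec)

# And back again to one row per student.
wide_again <- pivot_wider(long, names_from = test, values_from = score)
```

The specification `spec` is an ordinary data frame, so for more complex reshapes it can be adjusted with the usual data manipulation tools before being passed to pivot_longer_spec().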

If you want to work in the above way, we suggest giving our cdata package a try. We named the functions pivot_to_rowrecs and unpivot_to_blocks. The idea was that by emphasizing the record structure, one might eventually internalize what the transforms are doing. To help with that, we have a lot of documentation and tutorials.
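As a rough sketch of how the two cdata functions pair up, the example below moves a small table from row records (one row per entity) to blocks (several rows per entity) and back. The data frame `row_recs` and its column names are hypothetical, chosen only for illustration.

```r
library(cdata)

# Hypothetical example data: each row is one complete record.
row_recs <- data.frame(
  id = c(1, 2),
  test1 = c(90, 80),
  test2 = c(85, 75),
  stringsAsFactors = FALSE
)

# Row records -> blocks: each record becomes a block of rows,
# one row per measurement column taken.
blocks <- unpivot_to_blocks(
  row_recs,
  nameForNewKeyColumn = "test",
  nameForNewValueColumn = "score",
  columnsToTakeFrom = c("test1", "test2")
)

# Blocks -> row records: rows sharing the same "id" are
# collected back into a single record.
row_recs_again <- pivot_to_rowrecs(
  blocks,
  columnToTakeKeysFrom = "test",
  columnToTakeValuesFrom = "score",
  rowKeyColumns = "id"
)
```

The argument names spell out which column supplies keys, which supplies values, and which columns identify a record, which is the record-structure emphasis described above.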

Categories: Pragmatic Data Science Tutorials


jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

5 replies

  1. Whether all of the above is adoption or appropriation is going to depend on whether credit to the cdata authors is ever mentioned in talks, how long even the reference lives on in the documentation, and whether mis-attribution is corrected. We hope it is adoption.

    Also I want to emphasize the theory was joint work with Dr. Nina Zumel. I do a lot of the coding and blogging, but she does the more serious writing and more of the concept development. So this is her idea as much as it is mine (despite me being noisier and much harder to like).

  2. Thought I would share a clarification (though I personally consider “won’t fix bugs” as step one of “on the way out”).

  3. This is a great case for why we should stick to base R wherever possible.
