With all of the excitement surrounding `cdata`-style control-table-based data transforms (the `cdata` ideas being named as the “replacements” for `tidyr`’s current methodology, by the `tidyr` authors themselves!), I thought I would take a moment to describe how they work.

`cdata` defines two primary data manipulation operators: `rowrecs_to_blocks()` and `blocks_to_rowrecs()`. These are the fundamental transforms that convert between data representations. The two representations they convert between are:

- A world where all facts about an instance or record are in a single row (“rowrecs”).
- A world where all facts about an instance or record are in groups of rows (“blocks”).

It turns out that once you develop the idea of specifying the data transformation as explicit data (an application of Eric S. Raymond’s admonition: “fold knowledge into data, so program logic can be stupid and robust.”), you also have a great tool for reasoning about and teaching data transforms.

For example: `rowrecs_to_blocks()` does the following. For each row record, make a replica of the control table with values filled in. In relational terms, `rowrecs_to_blocks()` is therefore a join of the data to the control table. Conversely, `blocks_to_rowrecs()` combines groups of rows into single rows, so in relational terms it is an aggregation or projection. If each of these operations is faithful (keeps enough information around), they are then inverses of each other.
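The join and aggregation description above can be sketched with a tiny example. The data, the column names (`id`, `AUC`, `R2`, `measure`, `value`), and the control table here are made up for illustration; this is a minimal sketch of one way to call the two operators, not a canonical recipe.

```r
library(cdata)

# Hypothetical row records: all facts about each instance sit in one row.
row_recs <- data.frame(
  id = c(1, 2),
  AUC = c(0.6, 0.5),
  R2 = c(0.2, 0.3),
  stringsAsFactors = FALSE
)

# Control table: the first column holds the keys that label rows within a
# block; the remaining cells name the row-record columns whose values are
# copied into that position.
control_table <- data.frame(
  measure = c("AUC", "R2"),
  value = c("AUC", "R2"),
  stringsAsFactors = FALSE
)

# Each row record becomes one filled-in copy of the control table
# (a join of the data to the control table).
blocks <- rowrecs_to_blocks(
  row_recs,
  controlTable = control_table,
  columnsToCopy = "id"
)
print(blocks)

# Each group of rows sharing an id is combined back into a single row
# (an aggregation or projection).
recovered <- blocks_to_rowrecs(
  blocks,
  keyColumns = "id",
  controlTable = control_table
)
print(recovered)
```

Because the `id` column is carried along, enough information is kept for the second call to undo the first, which is the "faithful" condition described above.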

We share some nifty tutorials on the ideas here:

One can build fairly clever illustrations and animations to teach the above.

The most common special cases of the above have been popularized in `R` as `unpivot`/`pivot` (`pivot` invented by Pito Salas), `stack`/`unstack`, `melt`/`cast`, or `gather`/`spread`. These special cases are handled in `cdata` by convenience functions `unpivot_to_blocks()` and `pivot_to_rowrecs()`. A great example of a “higher order” transform that isn’t one of the common ones is given here.
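As a rough sketch of the convenience functions, the following unpivots a small made-up data frame and pivots it back. The example data and the new column names (`measure`, `value`) are assumptions chosen for illustration.

```r
library(cdata)

# Hypothetical wide data: one row per id.
wide <- data.frame(
  id = c(1, 2),
  AUC = c(0.6, 0.5),
  R2 = c(0.2, 0.3),
  stringsAsFactors = FALSE
)

# Unpivot (gather / melt / stack): move the AUC and R2 columns into
# key/value pairs, one row per (id, measure).
tall <- unpivot_to_blocks(
  wide,
  nameForNewKeyColumn = "measure",
  nameForNewValueColumn = "value",
  columnsToTakeFrom = c("AUC", "R2")
)
print(tall)

# Pivot (spread / cast / unstack): rebuild one row per id.
wide_again <- pivot_to_rowrecs(
  tall,
  columnToTakeKeysFrom = "measure",
  columnToTakeValuesFrom = "value",
  rowKeyColumns = "id"
)
print(wide_again)
```

No control table is written by hand here; for these common cases the function arguments carry the same information a two-column control table would.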

Note: the above theory and implementation are joint work of Nina Zumel and John Mount and can be found here. We would really appreciate any citations or credit you can send our way (or even politely correcting those who don’t attribute the work, or who attribute it to others, as there are already a lot of mentions without credit or citation).

citation("cdata")

To cite package ‘cdata’ in publications use:

John Mount and Nina Zumel (2019). cdata: Fluid Data Transformations. https://github.com/WinVector/cdata/, https://winvector.github.io/cdata/.

A BibTeX entry for LaTeX users is

    @Manual{,
      title = {cdata: Fluid Data Transformations},
      author = {John Mount and Nina Zumel},
      year = {2019},
      note = {https://github.com/WinVector/cdata/, https://winvector.github.io/cdata/},
    }

### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

Forgot to mention the general transform: any shape record to any shape record. This is specified as a before drawing and an after drawing, with no need for any theory.

Notationally, `rowrecs_to_blocks()` is multiplying by the control table and `blocks_to_rowrecs()` is dividing by the control table. We have the operator notation worked out here.