cdata implements an operator notation for data transforms.
The idea is simple, yet powerful.
First, let’s start with some data.
```r
d <- wrapr::build_frame(
   "id", "measure", "value" |
   1   , "AUC"    , 0.7     |
   1   , "R2"     , 0.4     |
   2   , "AUC"    , 0.8     |
   2   , "R2"     , 0.5     )
knitr::kable(d)
```
In the above data we have two measurements each for two individuals (individuals identified by the "id" column). Using the `new_record_spec()` method we can capture a description of this record structure.
```r
library("cdata")

record_spec <- new_record_spec(
  wrapr::build_frame(
     "measure", "value" |
     "AUC"    , "AUC"   |
     "R2"     , "R2"    ),
  recordKeys = "id")
print(record_spec)
```
```
## $controlTable
##   measure value
## 1     AUC   AUC
## 2      R2    R2
##
## $recordKeys
## [1] "id"
##
## $controlTableKeys
## [1] "measure"
##
## attr(,"class")
## [1] "cdata_record_spec"
```
Once we have this specification we can transform the data using operator notation.
We can collect the record blocks into rows by a "factoring"/"division" (or aggregation/projection) step.
```r
d2 <- d %//% record_spec
knitr::kable(d2)
```
We can expand record rows into blocks by a "multiplication" (or join) step.
```r
d3 <- d2 %**% record_spec
knitr::kable(d3)
```
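To make the two steps concrete, here is a base-R sketch of the same transforms using `stats::reshape()`. This is only an illustration of the semantics, not how cdata implements them.

```r
# The example data, rebuilt with base R.
d <- data.frame(id = c(1, 1, 2, 2),
                measure = c("AUC", "R2", "AUC", "R2"),
                value = c(0.7, 0.4, 0.8, 0.5))

# Collect blocks into row records: one row per id, one column per measure.
d2 <- reshape(d, idvar = "id", timevar = "measure",
              v.names = "value", direction = "wide")
names(d2) <- sub("^value\\.", "", names(d2))  # "value.AUC" -> "AUC"
attr(d2, "reshapeWide") <- NULL               # drop reshape() bookkeeping

# Expand row records back into blocks: one row per (id, measure) pair.
d3 <- reshape(d2, idvar = "id", varying = c("AUC", "R2"),
              v.names = "value", timevar = "measure",
              times = c("AUC", "R2"), direction = "long")
```

The round trip recovers the original rows (up to row order and row names), which is the point of the `%//%` / `%**%` pairing.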
And that is truly fluid data manipulation.
This article can be found expanded into a vignette here.
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
Mixing infix operators with pivot/unpivot operations seems genius! I’m way too familiar with the `tidyr` syntax to fully embrace `cdata` for my own work. But I’ve always found your work interesting and valuable, as it brings healthy competition and new thinking to R’s data manipulation space.
By the way, the metaphors of “dividing” and “multiplying” sound a bit confusing to me. Personally, I would find it easier to understand if the infix operator were something like, say, `%T%` as in matrix transpose, and could be used “symmetrically” or quasi-involutorily. For example, `d %T% record_spec %T% record_spec` would give back the original `d`.
Anyway, keep up the good work!
I agree: the “dividing” name isn’t a winner. In relational terms it is definitely join versus aggregate.
I had not thought of the two-sided notation. Interesting point. I will definitely experiment with that. Not sure how to get the quasi-involution property yet. In my current design it feels like the user has to specify the desired operation.
Okay, I have a sort of anti-symmetric version. The operator decides what to do based on which side of the operator the record specification is found on. For the `d` below we have that `record_spec %pivot% d %pivot% record_spec` is an identity (due to the left-association of operators).
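Such a side-sensitive operator can be sketched in plain R. The spec structure and helper names below are illustrative, not cdata's implementation; the transforms are done with `stats::reshape()`.

```r
# Sketch of a side-sensitive pivot operator (names and spec structure are
# illustrative; this is not cdata's implementation).
new_spec <- function(id, measure, value) {
  structure(list(id = id, measure = measure, value = value),
            class = "pivot_spec")
}

`%pivot%` <- function(a, b) {
  if (inherits(b, "pivot_spec")) {
    # Spec on the right: collect blocks into row records (wide form).
    w <- reshape(a, idvar = b$id, timevar = b$measure,
                 v.names = b$value, direction = "wide")
    names(w) <- sub(paste0("^", b$value, "\\."), "", names(w))
    attr(w, "reshapeWide") <- NULL  # drop reshape() bookkeeping
    w
  } else if (inherits(a, "pivot_spec")) {
    # Spec on the left: expand row records back into blocks (long form).
    vals <- setdiff(names(b), a$id)
    long <- reshape(b, idvar = a$id, varying = vals, v.names = a$value,
                    timevar = a$measure, times = vals, direction = "long")
    long <- long[order(long[[a$id]]), c(a$id, a$measure, a$value)]
    rownames(long) <- NULL
    long
  } else {
    stop("%pivot%: one argument must be a pivot_spec")
  }
}

d <- data.frame(id = c(1, 1, 2, 2),
                measure = c("AUC", "R2", "AUC", "R2"),
                value = c(0.7, 0.4, 0.8, 0.5))
spec <- new_spec("id", "measure", "value")

wide <- d %pivot% spec     # collect: blocks -> row records
back <- spec %pivot% wide  # expand: row records -> blocks (recovers d)
```

The direction is chosen purely by which argument carries the spec class, so `spec %pivot% (d %pivot% spec)` recovers `d`.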
I think the control table idea really frees up one’s mind to try ideas and variations (after one has internalized the table). A great article on this can be found here (yes, it praises our work, but it also has some deep analysis and good ideas of its own): From One Table to Another and Back Again. I think the point is: once the semantics are stable (and they are), it is safe to think about variations of notation on top of those semantics.
Notice how much the identity relations look like the Moore–Penrose pseudo-inverse check equations.
Actually this is more than just analogy. `blocks_to_rowrecs()` and `rowrecs_to_blocks()` are in fact (intentionally limited) linear operators over the space of matrices or data.frames. So we would say they have a natural adjoint relation.
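For reference, the Moore–Penrose check equations being alluded to are a standard linear-algebra fact (the mapping onto the cdata operators is my gloss). Writing A⁺ for the pseudo-inverse of A:

```latex
A A^{+} A = A, \qquad A^{+} A A^{+} = A^{+}
```

With the collect and expand transforms playing the roles of A and A⁺ on their respective record spaces, the round-trip identities above (collect then expand recovers the block data, and vice versa) are these checks restricted to valid records.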
And another variation.
I like the ease of `%pivot%`, but can’t you test for which side of the pivot transformation the data being passed to the record spec is on? Could something like that be used, such that you now have a single function to transform data back and forth? (This is untested, just ideas.)
I guess the con here is it’s not clear what is happening to the data just by reading the code: is it converting to row records or to block records? But then again, using `%pivot%` has that same issue if the reader of the code doesn’t know what `record_spec` looks like. Seems like a case where `%pivot%` might benefit from further abstraction, and if you then don’t want that abstraction…
Also, by using a function generator, might you exploit your awesome dot pipe, such that simply piping your data to the record spec function returns the result you hope for?
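The direction test suggested here can be sketched in a few lines of plain R. The function name `infer_direction` and the return labels are mine, purely illustrative:

```r
# Sketch (illustrative, not cdata's implementation): infer the transform
# direction by checking whether the spec's measure column is present.
infer_direction <- function(data, measure_col = "measure") {
  if (measure_col %in% names(data)) {
    "collect_to_rowrecs"   # data is in block (long) form
  } else {
    "expand_to_blocks"     # data is in row-record (wide) form
  }
}

d_blocks  <- data.frame(id = c(1, 2), measure = "AUC", value = c(0.7, 0.8))
d_rowrecs <- data.frame(id = c(1, 2), AUC = c(0.7, 0.8), R2 = c(0.4, 0.5))
```

A single entry point could then dispatch on this test, at the cost the commenter notes: the reader can no longer see the direction in the code.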
Some good points. I am still trying ideas and variations to see what clicks.
Had a lot of thoughts on this and some discussion with Nina Zumel. What we came up with is this: the methodology has to not only be unambiguous on the implementation side, but also easy for a user to see what is intended.
To that end we promoted the record specification object to a transform specification object that has explicit annotation of the intended transform type in the class of the transform object. The `%pivot%` operator is going away, and operators such as `%//%` now only work if the direction they specify matches the direction encoded in the transform. Similarly, the “determine the direction by comparing control table contents to data column names” idea you suggested, while sound, is not a direction we are pursuing.
We did, however, use the explicitness to increase the role of the wrapr dot-pipe. Nina calls this “you pipe into the shape you want.”
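The wrapr dot pipe (`%.>%`) binds the left value to `.` in the right-hand expression, so “piping into the shape you want” can look roughly like this. The `to_rowrecs()` helper is a base-R stand-in I wrote for illustration, not the cdata API:

```r
library("wrapr")

# Stand-in transform (base R), named for the shape it produces.
to_rowrecs <- function(data) {
  w <- reshape(data, idvar = "id", timevar = "measure",
               v.names = "value", direction = "wide")
  names(w) <- sub("^value\\.", "", names(w))
  w
}

d <- data.frame(id = c(1, 1, 2, 2),
                measure = c("AUC", "R2", "AUC", "R2"),
                value = c(0.7, 0.4, 0.8, 0.5))

# "Pipe into the shape you want": the right-hand side names the target shape.
wide <- d %.>% to_rowrecs(.)
```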
Now all of this code is extendable by users. So someone who disagrees with the above design decisions/principles can in fact implement the speculative transform (and even lash it into the dot-pipe, as that is what the dot-pipe is designed for).
Some updates on this are (no surprise) in progress.