Menu Home

Operator Notation for Data Transforms

As of cdata version 1.0.8 cdata implements an operator notation for data transform.

The idea is simple, yet powerful.

First let’s start with some data.

d <- wrapr::build_frame(
  "id", "measure", "value" |
    1   , "AUC"    , 0.7     |
    1   , "R2"     , 0.4     |
    2   , "AUC"    , 0.8     |
    2   , "R2"     , 0.5     )

knitr::kable(d)
id measure value
1 AUC 0.7
1 R2 0.4
2 AUC 0.8
2 R2 0.5

In the above data we have two measurements each for two individuals (individuals identified by the "id" column). Using cdata‘s new_record_spec() method we can capture a description of this record structure.

library("cdata")

record_spec <- new_record_spec(
  wrapr::build_frame(
    "measure", "value" |
    "AUC"    , "AUC" |
    "R2"     , "R2"  ),
  recordKeys = "id")

print(record_spec)
## $controlTable
##   measure value
## 1     AUC   AUC
## 2      R2    R2
## 
## $recordKeys
## [1] "id"
## 
## $controlTableKeys
## [1] "measure"
## 
## attr(,"class")
## [1] "cdata_record_spec"

Once we have this specification we can transform the data using operator notation.

We can collect the record blocks into rows by a "factoring"/"division" (or aggregation/projection) step.

knitr::kable(d)
id measure value
1 AUC 0.7
1 R2 0.4
2 AUC 0.8
2 R2 0.5
d2 <- d %//% record_spec

knitr::kable(d2)
id AUC R2
1 0.7 0.4
2 0.8 0.5

We can expand record rows into blocks by a "multiplication" (or join) step.

knitr::kable(d2)
id AUC R2
1 0.7 0.4
2 0.8 0.5
d3 <- d2 %**% record_spec

knitr::kable(d3)
id measure value
1 AUC 0.7
1 R2 0.4
2 AUC 0.8
2 R2 0.5

And that is truly fluid data manipulation.

This article can be found expanded into a vignette here.

Categories: Coding Exciting Techniques Tutorials

Tagged as:

jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

6 replies

  1. Mixing infix operators with pivot/unpivot operations seems genius! I’m way too familiar with the dplyr/tidyr syntax to fully embrace cdata for my own work. But I’ve always found your work interesting and valuable as it brings healthy competition and new thinking to R’s data manipulation space.

    By the way, the metaphors of “dividing” and “multiplying” sound a bit confusing to me. Personally, I would find it easier to understand if the infix operator was something like, say, %T% as in matrix transpose and can be used “symmetrically” or quasi-involutorily. For example, d %T% record_spec %T% record_spec would give back the original d.

    Anyway, keep up the good work!

    1. Thanks!

      I agree- the dividing name isn’t a winner. In relational terms it is definitely join versus aggregate.

      I had not thought of the two-sided notation. Interesting point. I will definitely experiment with that. Not sure how to get the quasi-involution property yet. In my current design it feels like the user has to specify the desired operation.

      1. Okay I have a sort of anti-symmetric version. The operator decides what to do based on which side of the operator the record specification is found. For the d below we have that record_spec %pivot% d %pivot% record_spec is an identity (due to the left-association of operators).

        I think the control table idea really frees up one’s mind to try ideas and variations (after one has internalized the table). A great article on this can be found here (yes, it praises our work- but it also has some deep analysis and good ideas of its own): From One Table to Another and Back Again. I think the points is: once the semantics are stable (and they are) then it is safe to think about variations of notation on top of those semantics.

        library("cdata")
        
        d <- wrapr::build_frame(
          "id", "measure", "value" |
          1   , "AUC"    , 0.7     |
          1   , "R2"     , 0.4     |
          2   , "AUC"    , 0.8     |
          2   , "R2"     , 0.5     )
        
        record_spec <- new_record_spec(
          wrapr::qchar_frame(
            measure, value |
            AUC    , "AUC" |
            R2     , "R2"  ),
          recordKeys = "id")
        
        (d2 <- record_spec %pivot% d)
        #>   id AUC  R2
        #> 1  1 0.7 0.4
        #> 2  2 0.8 0.5
        
        # identity
        ( record_spec %pivot% d ) %pivot% record_spec
        #>   id measure value
        #> 1  1     AUC   0.7
        #> 2  1      R2   0.4
        #> 3  2     AUC   0.8
        #> 4  2      R2   0.5
        
        # identity
        record_spec %pivot% d %pivot% record_spec
        #>   id measure value
        #> 1  1     AUC   0.7
        #> 2  1      R2   0.4
        #> 3  2     AUC   0.8
        #> 4  2      R2   0.5
        
        # identity
        record_spec %pivot% ( d2 %pivot% record_spec )
        #>   id AUC  R2
        #> 1  1 0.7 0.4
        #> 2  2 0.8 0.5
        

        Notice how much the identity relations look like the Moore–Penrose pseudo-inverse check equations.

        Actually this is more than just analogy. blocks_to_rowrecs() and rowrecs_to_blocks() are in fact (intentionally limited) linear operators over the space of matrices or data.frames. So we would say they have a natural adjoint relation.

      2. And another variation.

        library("cdata")
        
        d <- wrapr::build_frame(
          "id", "measure", "value" |
          1   , "AUC"    , 0.7     |
          1   , "R2"     , 0.4     |
          2   , "AUC"    , 0.8     |
          2   , "R2"     , 0.5     )
        
        print(d)
        #>   id measure value
        #> 1  1     AUC   0.7
        #> 2  1      R2   0.4
        #> 3  2     AUC   0.8
        #> 4  2      R2   0.5
        
        record_spec <- new_record_spec(
          wrapr::qchar_frame(
            measure, value |
            AUC    , "AUC" |
            R2     , "R2"  ),
          recordKeys = "id")
        
        print(record_spec)
        #> $controlTable
        #>   measure value
        #> 1     AUC   AUC
        #> 2      R2    R2
        #> 
        #> $recordKeys
        #> [1] "id"
        #> 
        #> $controlTableKeys
        #> [1] "measure"
        #> 
        #> attr(,"class")
        #> [1] "cdata_record_spec"
        
        t_record_spec <- t(record_spec)
        
        d %.>% t_record_spec
        #>   id AUC  R2
        #> 1  1 0.7 0.4
        #> 2  2 0.8 0.5
        
        # using dot-pipe's bquote style .() execute immediate notation
        d %.>% .(t(record_spec)) 
        #>   id AUC  R2
        #> 1  1 0.7 0.4
        #> 2  2 0.8 0.5
        
        # identity
        d %.>% .(t(record_spec)) %.>% record_spec
        #>   id measure value
        #> 1  1     AUC   0.7
        #> 2  1      R2   0.4
        #> 3  2     AUC   0.8
        #> 4  2      R2   0.5
        
      3. I like the ease of %pivot%, but can’t you test for which side of the pivot transformation the data being passed to the record spec is? Could something like

        rs_fn <- function(record_spec){
            FUN <- function(d){
                # more robust check likely needed...
                if(all(c(record_spec$controlTableKeys %in% colnames(d))) {
                    return(d %pivot% record_spec)
                  } else {
                    return (record_spec %pivot% d)
                    }
            return(FUN)
        }
        

        be used, such that you now have a single function to transform data back and forth? (this is untested, just ideas)

        spec_pivoter <- recordspec_fn(record_spec)
        
        wide <- spec_pivoter(d)
        identity <- spec_pivoter(wide)
        
        

        I guess the con here is it’s not clear what is happening to the data by reading the code–is it converting to row records or block records? But then again using %pivot% has that same issue if the reader of the code doesn’t know what record_spec looks like. Seems like a case where %pivot% might benefit from further abstraction, and if you then don’t want that abstraction, %//% and %**% are fine.

        Also, by using a function generator, might you exploit your awesome dot pipe, such that simply piping your data to the record spec function returns the result you hope for?

        d %.>% spec_pivoter(.)  # wide/blocks format
        d %.>% spec_pivoter(.) %.>% spec_pivoter(.) #identity
        
        

      4. Some good points. I am still trying ideas and variations to see what clicks.

        Edit 2019-03-28.

        Had a lot of thoughts on this and some discussion with Nina Zumel. What we came up with is this: the methodology has to not only be unambiguous to on the implementation side, but also easy for a user to see what is intended.

        To that end we are going we promoted the record specification object to a transform specification object that has explicit annotation of the intended transform type in the class of the transform object. The %pivot% operator is going away, and the %**% and %//% now only work if the direction they specify matches the direction encoded in the transform. Similarly the “determine the direction by comparing control table contents to data column names” idea you suggested is, while sound, not a direction we are approaching.

        We did, however, use the explicitness to increase the role of the wrapr dot-pipe. Nina calls this “you pipe into the shape you want.”

        Now all of this code is extendable by users. So someone who disagrees with the above design decisions/principles can in fact implement the speculative transform (and even lash it into the dot-pipe, as that is what the dot-pipe is designed for).

        Some update on this are (no surprise) in progress.

%d bloggers like this: