Menu Home

cdata Control Table Keys

In our cdata R package and training materials we emphasize the record-oriented thinking and how to design a transform control table. We now have an additional exciting new feature: control table keys.

The user can now control which columns of a cdata control table are the keys, including now using composite keys (that is keys that are spread across more than one column). This is easiest to demonstrate with an example.

Consider the simple data frame (the first couple of rows of the famous iris data set).

library("cdata")

d <- wrapr::build_frame(
  "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species" |
  5.1           , 3.5          , 1.4           , 0.2          , "setosa"  |
  4.9           , 3            , 1.4           , 0.2          , "setosa"  )
d$id <- seq_len(nrow(d))

knitr::kable(d)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species id
5.1 3.5 1.4 0.2 setosa 1
4.9 3.0 1.4 0.2 setosa 2

Suppose we wish to land the dimensions of different irises in rows keyed by two columns: Part and Measure. That is we want the data to look like the following.

expect <- wrapr::build_frame(
  "id", "Species", "Part" , "Measure", "Value" |
  1L  , "setosa" , "Sepal", "Length" , 5.1     |
  1L  , "setosa" , "Sepal", "Width"  , 3.5     |
  1L  , "setosa" , "Petal", "Length" , 1.4     |
  1L  , "setosa" , "Petal", "Width"  , 0.2     |
  2L  , "setosa" , "Sepal", "Length" , 4.9     |
  2L  , "setosa" , "Sepal", "Width"  , 3       |
  2L  , "setosa" , "Petal", "Length" , 1.4     |
  2L  , "setosa" , "Petal", "Width"  , 0.2     )

knitr::kable(expect)
id Species Part Measure Value
1 setosa Sepal Length 5.1
1 setosa Sepal Width 3.5
1 setosa Petal Length 1.4
1 setosa Petal Width 0.2
2 setosa Sepal Length 4.9
2 setosa Sepal Width 3.0
2 setosa Petal Length 1.4
2 setosa Petal Width 0.2

With multiple control table keys this is easy.

First define the control table specifying what a single row-record looks like after transformation to a multi-row block-record.

control_table <- wrapr::qchar_frame(
  Part,  Measure, Value          |
  Sepal, Length,  "Sepal.Length" |
  Sepal, Width,   "Sepal.Width"  |
  Petal, Length,  "Petal.Length" |
  Petal, Width,   "Petal.Width"  )

We have designed our desired transform above (as taught here). We are introducing a new convention of putting the non-key values in quotes (even though wrapr::qcar_frame() does not need this). Notice the quoted fields are column-names of the original row-oriented data, and the quote is indicating that these are names standing in for values.

We can now have cdata::rowrecs_to_blocks() perform the transform. Notice we are specifying that the columns Part and Measure are the control table row-keys.

res <- rowrecs_to_blocks(
  d,
  control_table,
  controlTableKeys = c("Part", "Measure"),
  columnsToCopy = c("id", "Species"))

knitr::kable(res)
id Species Part Measure Value
1 setosa Sepal Length 5.1
1 setosa Sepal Width 3.5
1 setosa Petal Length 1.4
1 setosa Petal Width 0.2
2 setosa Sepal Length 4.9
2 setosa Sepal Width 3.0
2 setosa Petal Length 1.4
2 setosa Petal Width 0.2

And, as one comes to expect with coordinatized/fluid-data transforms, the process is easy to reverse (modulo column and row order).

back <- blocks_to_rowrecs(
  res,
  keyColumns = c("id", "Species"),
  control_table,
  controlTableKeys = c("Part", "Measure"))

knitr::kable(back)
id Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.1 3.5 1.4 0.2
2 setosa 4.9 3.0 1.4 0.2

The cdata unit test include the following variations of the above example:

We think cdata (and the accompanying fluid data methodology, plus extensions) is a very deep and powerful way of wrangling data. Once you take the time to learn the methodology (which is "draw what you want to happen to one record, type that in as your control table, and you are done!") it is very easy to use.


Note: the above capabilities is now part of the development version of cdata, and will likely make their way to the CRAN release in a couple of weeks. We have some substantial reasons for announcing it now. Users who would like to try the feature can either wait for the CRAN release, or install the development version of cdata with a procedure such as devtools::install_github("WinVector/cdata"). A thank you to Recle Etino Vibal for requesting this feature, and moving general keys from my “nice to have” list to “at least one user requested it” list.

Categories: Exciting Techniques Opinion Tutorials

Tagged as:

jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

%d bloggers like this: