# Arbitrary Data Transforms Using cdata

We have been writing a lot on higher-order data transforms lately:

What I want to do now is "write a bit more, so I finally feel I have been concise."

The `cdata` `R` package supplies general data transform operators.

We will end with a quick example, centered on pivoting/un-pivoting values to/from more than one column at the same time.

Suppose we had some sales data supplied as the following table:

SalesPerson Period BookingsWest BookingsEast
a 2017Q1 100 175
a 2017Q2 110 180
b 2017Q1 250 0
b 2017Q2 245 0

Suppose we are interested in adding a derived column: which region the salesperson made most of their bookings in.

``library("cdata")``
``## Loading required package: wrapr``
``library("seplyr")``
``````d <- d  %.>%
dplyr::mutate(., BestRegion = ifelse(BookingsWest > BookingsEast,
"West",
ifelse(BookingsEast > BookingsWest,
"East",
"Both")))``````

Our notional goal is (as part of a larger data processing plan) to reformat the data a thin/tall table or a RDF-triple like form. Further suppose we wanted to copy the derived column into every row of the transformed table (perhaps to make some other step involving this value easy).

We can use `cdata::moveValuesToRowsD()` to do this quickly and easily.

First we design what is called a transform control table.

``````cT1 <- data.frame(Region = c("West", "East"),
Bookings = c("BookingsWest", "BookingsEast"),
BestRegion = c("BestRegion", "BestRegion"),
stringsAsFactors = FALSE)
print(cT1)``````
``````##   Region     Bookings BestRegion
## 1   West BookingsWest BestRegion
## 2   East BookingsEast BestRegion``````

In a control table:

• The column names specify new columns that will be formed by `cdata::moveValuesToRowsD()`.
• The values specify where to take values from.

This control table is called "non trivial" as it does not correspond to a simple pivot/un-pivot (those tables all have two columns). The control table is a picture of of the mapping we want to perform.

An interesting fact is `cdata::moveValuesToColumnsD(cT1, cT1, keyColumns = NULL)` is a picture of the control table as a one-row table (and this one row table can be mapped back to the original control table by `cdata::moveValuesToRowsD()`, these two operators work roughly as inverses of each other; though `cdata::moveValuesToRowsD()` operates on rows and `cdata::moveValuesToColumnsD()` operates on groups of rows specified by the keying columns).

The mnemonic is:

• `cdata::moveValuesToColumnsD()` converts arbitrary grouped blocks of rows that look like the control table into many columns.
• `cdata::moveValuesToRowsD()` converts each row into row blocks that have the same shape as the control table.

Because pivot and un-pivot are fairly common needs `cdata` also supplies functions that pre-populate the controls tables for these operations (`buildPivotControlTableD()` and `buildUnPivotControlTable()`).

To design any transform you draw out the control table and then apply one of these operators (you can pretty much move from any block structure to any block structure by chaining two or more of these steps).

We can now use the control table to supply the same transform for each row.

``````d  %.>%
dplyr::mutate(.,
Quarter = substr(Period,5,6),
Year = as.numeric(substr(Period,1,4)))  %.>%
dplyr::select(., -Period)  %.>%
moveValuesToRowsD(.,
controlTable = cT1,
columnsToCopy = c('SalesPerson',
'Year',
'Quarter')) %.>%
arrange_se(., c('SalesPerson', 'Year', 'Quarter', 'Region'))  %.>%
knitr::kable(.)  ``````
SalesPerson Year Quarter Region Bookings BestRegion
a 2017 Q1 East 175 East
a 2017 Q1 West 100 East
a 2017 Q2 East 180 East
a 2017 Q2 West 110 East
b 2017 Q1 East 0 West
b 2017 Q1 West 250 West
b 2017 Q2 East 0 West
b 2017 Q2 West 245 West

Notice we were able to easily copy the extra `BestRegion` values into all the correct rows.

It can be hard to figure out how to specify such a transformation in terms of pivots and un-pivots. However, as we have said: by drawing control tables one can easily design and manage fairly arbitrary data transform sequences (often stepping through either a denormalized intermediate where all values per-instance are in a single row, or a thin intermediate like the triple-like structure we just moved into).

Tagged as:

### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

### 3 replies ›

1. The need to move multiple columns in a pivot or un-pivot shows up a lot in practice, it just tends to be under-served by packages. That is why I thought it was a good example to show some of the ways cdata stands out.

2. Paul Shearer says:

Could you explain when this approach might be preferred over tidyr’s spread and gather functions? Is the main idea to cover big-data table backends, which spread and gather don’t (yet) cover?

1. Good question, and thanks!

In my opinion `cdata`‘s primary advantages are:

`cdata` provides the ability to carry multiple columns in either direction, meaning it has easy access to many more transforms that just `tidyr::gather()` and `tidyr::spread()`.
`cdata` works on big-data back ends (in particular `PostgreSQL` and `Apache Spark`, in addition to in-memory data. I don’t know if `tidyr` has big data back ends on its schedule, so “yet” may or may not be the right way to view `tidyr`.
The `pivotValuesToColumns` and `unpivotValuesToRows` (currently only available for in-memory processing) check and maintain more invariants than `tidyr::gather()` and `tidyr::spread()`. This allows one to ensure and document a sequence of these operations is safely composable and reversible.
(Controversial one) I think the standard evaluation interfaces are better (easier to program over) than the non-standard column naming schemes used in `tidyr`. Also given my experience in moving from `dplyr 0.5.*` to `dplyr 0.7.*` I expect a lot of breakage in `tidyr` using code when (as I assume it will) `tidyr` switches from its own interface implementation to `rlang`/`tidyeval`/`tidyselect`.
(Controversial one) I have found `pivotValuesToColumns` and `unpivotValuesToRows` easier to teach.
`cdata` disadvantage: `cdata` is newer code with
much less experience in production than `tidyr`.
`cdata` disadvantage: The `cdata` in-memory implementation
is supplied by `RSQLite`, so any problems arising from this bridge (dates, list columns, S4 class columns) are going to be present in `cdata`.

So to as to when `cdata` is better: I think quite often. Our group is not developing any new code on top of `tidyr` unless clients specifically request it.

Obviously the `tidyr` authors will have a “similar opinion with some signs reversed” (i.e., prefer their own work).