
Controlling Data Layout With cdata

Here is an example of how easy it is to use cdata to re-layout your data.

Tim Morris recently tweeted the following problem (corrected).

Please will you take pity on me #rstats folks?
I only want to reshape two variables x & y from wide to long!

Starting with:
    id xa xb ya yb
    1  1  3  6  8
    2  2  4  7  9

How can I get to:
    id t x y
    1  a 1 6
    1  b 3 8
    2  a 2 7
    2  b 4 9
In Stata it's:
 . reshape long x y, i(id) j(t) string
In R, it's:
 . an hour of cursing followed by a desperate tweet 👆

Thanks for any help!

PS – I can make reshape() or gather() work when I have just x or just y.

This is not to make fun of Tim Morris: the above should be easy. Using diagrams and breaking the data transform down into small steps makes the process very easy.

First: (and this is the important part) define our problem using an example. Tim Morris did this really well, but let’s repeat it here. We want to realize the following data layout transform.


Second: identify the record ID and record structure in both the before and after examples.


Third: attach the cdata package, and use build_frame() to type in the example "before" data.


library(cdata)

before <- build_frame(
  "id"  , "xa", "xb", "ya", "yb" |
    1   , 1   , 3   , 6   , 8    |
    2   , 2   , 4   , 7   , 9    )

id xa xb ya yb
1 1 3 6 8
2 2 4 7 9
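If build_frame() is new to you, it may help to see that it is only a row-by-row way of typing in a small data frame. Here is a minimal base R sketch of the same example data (the name before_base is ours, for illustration only):

```r
# Base R version of the "before" example; build_frame() lets us type the
# same values row by row instead of column by column.
before_base <- data.frame(
  id = c(1, 2),
  xa = c(1, 2),
  xb = c(3, 4),
  ya = c(6, 7),
  yb = c(8, 9))

print(before_base)
```

Either form works as the input to the steps that follow; the row-oriented form is just easier to compare against the example by eye.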

Fourth: (this is the "hard" part) copy the marked column names from the "before" example into the matching record positions in the "after" example.


Fifth: copy the annotated "after" record in as your layout transform control table.

ct <- qchar_frame(
  "t"  , "x" , "y" |
    "a", xa  , ya  |
    "b", xb  , yb  )

t x y
a xa ya
b xb yb

In the above we are using a convention that concrete values are written in quotes, and symbols to be taken from the "before" data frame are written without quotes.
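To make the quoting convention concrete: as far as we understand it, qchar_frame() turns both the quoted values and the bare symbols into strings, so the control table is an ordinary data frame of character cells. A small sketch comparing it with the same table typed in base R (ct_base is a name introduced here for illustration):

```r
library(wrapr)  # qchar_frame() comes from wrapr; attaching cdata also makes it available

ct <- qchar_frame(
  "t"  , "x" , "y" |
    "a", xa  , ya  |
    "b", xb  , yb  )

# The same table typed in with base R: every cell is just a string.
ct_base <- data.frame(
  t = c("a", "b"),
  x = c("xa", "xb"),
  y = c("ya", "yb"),
  stringsAsFactors = FALSE)

str(ct)

# Compare cell by cell.
print(all(as.matrix(ct) == as.matrix(ct_base)))
```

The quotes only matter to the human reader: they mark which cells are values to write out and which are column names to look up.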

Now specify the many-record transform.

layout_spec <- rowrecs_to_blocks_spec(
  ct,
  recordKeys = "id")

The layout_spec completely encodes our intent. So we can look at it to double check what transform we have specified.

print(layout_spec)

## {
##  row_record <- wrapr::qchar_frame(
##    "id"  , "xa", "xb", "ya", "yb" |
##      .   , xa  , xb  , ya  , yb   )
##  row_keys <- c('id')
##  # becomes
##  block_record <- wrapr::qchar_frame(
##    "id"  , "t", "x", "y" |
##      .   , "a", xa , ya  |
##      .   , "b", xb , yb  )
##  block_keys <- c('id', 't')
##  # args: c(checkNames = TRUE, checkKeys = TRUE, strict = FALSE)
## }

And we can now apply the layout transform to data.

after <- before %.>% layout_spec
# cdata 1.0.9 adds the non-piped function notation:
# layout_by(layout_spec, before)

id t x y
1 a 1 6
1 b 3 8
2 a 2 7
2 b 4 9
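It can help to see what the control table is asking for, spelled out by hand. The following is a base R sketch of that reading only; it is not how cdata is implemented, and the names blocks and manual_after are ours:

```r
# "before" data and control table typed in as plain base R data frames
before <- data.frame(id = c(1, 2), xa = c(1, 2), xb = c(3, 4),
                     ya = c(6, 7), yb = c(8, 9))
ct <- data.frame(t = c("a", "b"),
                 x = c("xa", "xb"),
                 y = c("ya", "yb"),
                 stringsAsFactors = FALSE)

# Each row of the control table yields one output row per input record:
# the key column t is copied as-is, while the x and y cells name the
# "before" columns to read values from.
blocks <- lapply(seq_len(nrow(ct)), function(i) {
  data.frame(id = before$id,
             t  = ct$t[i],
             x  = before[[ct$x[i]]],
             y  = before[[ct$y[i]]],
             stringsAsFactors = FALSE)
})

manual_after <- do.call(rbind, blocks)
manual_after <- manual_after[order(manual_after$id, manual_after$t), ]
rownames(manual_after) <- NULL
print(manual_after)
```

The hand-rolled version hard-codes the x and y columns and does none of the name or key checking; the point of the layout specification is that the same table drives the transform for any record shape.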

A really fun extra: we can build an inverse layout specification to reverse the transform.

reverse_layout <- t(layout_spec) # invert the spec using t()

print(reverse_layout)

## {
##  block_record <- wrapr::qchar_frame(
##    "id"  , "t", "x", "y" |
##      .   , "a", xa , ya  |
##      .   , "b", xb , yb  )
##  block_keys <- c('id', 't')
##  # becomes
##  row_record <- wrapr::qchar_frame(
##    "id"  , "xa", "xb", "ya", "yb" |
##      .   , xa  , xb  , ya  , yb   )
##  row_keys <- c('id')
##  # args: c(checkNames = TRUE, checkKeys = TRUE, strict = FALSE)
## }

after %.>%
  reverse_layout

id xa xb ya yb
1 1 3 6 8
2 2 4 7 9

Because the layout conversion is invertible, we only really need to learn how to design transforms in one direction, such as row records to block records.
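One way to convince yourself of the invertibility claim is a round trip: go from row records to blocks and back, then compare with the starting data. This sketch assumes cdata 1.0.9 or newer (for layout_by()); the name round_trip is ours, and the sort and column reordering are there only because we do not want to rely on any particular output order.

```r
library(cdata)

before <- build_frame(
  "id"  , "xa", "xb", "ya", "yb" |
    1   , 1   , 3   , 6   , 8    |
    2   , 2   , 4   , 7   , 9    )

ct <- qchar_frame(
  "t"  , "x" , "y" |
    "a", xa  , ya  |
    "b", xb  , yb  )

layout_spec <- rowrecs_to_blocks_spec(ct, recordKeys = "id")

# forward, then back again using the inverted specification
after <- layout_by(layout_spec, before)
round_trip <- layout_by(t(layout_spec), after)

# normalize row and column order before comparing values
round_trip <- round_trip[order(round_trip$id), names(before), drop = FALSE]
rownames(round_trip) <- NULL
print(all.equal(as.list(before), as.list(round_trip)))
```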

And that is it: we have a reusable layout_spec that can transform future data. We have many tutorials on the method here, and the source code for this note can be found here.

Categories: Coding Pragmatic Data Science Tutorials


Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

4 replies

  1. Here is another way using tidyverse


    before <- build_frame(
      "id", "xa", "xb", "ya", "yb" |
      1   , 1   , 3   , 6   , 8    |
      2   , 2   , 4   , 7   , 9    )

    before %>%
      gather(key, value, -id) %>%              # wide to long
      separate(key, c('x', 't'), sep = 1) %>%  # split column names
      spread(x, value)                         # long to wide
      id t x y
    1  1 a 1 6
    2  1 b 3 8
    3  2 a 2 7
    4  2 b 4 9

    1. Thanks for the note.

      That works, but tidyr itself is moving to the cdata control table methodology:

      `pivot_longer()` and `pivot_wider()` can take a data frame that specifies precisely how metadata stored in column names becomes data variables (and vice versa), inspired by the [cdata][cdata] package by John Mount and Nina Zumel.

      From the 2019-04-16 version of `tidyr`’s pivot vignette, author: Hadley Wickham.

      The new ideas are: you can either type in the control table by hand or write code to produce it (as we do here). Once the details of the transform are defined, the transform is performed as specified. From that point on, things are simple. One does the hard work (splitting columns, cbinding things, and so on) to build up the prototype or description of the transform, and then one can save it and use it at will.
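As a small illustration of "write code to produce it": the control table for the example in this note can be generated in base R by pasting measurement names and time labels together. The names measures, times, and ct_generated are ours, and the sketch assumes the source columns are always named measurement followed by time label (xa, xb, ya, yb).

```r
# Generate the control table instead of typing it in.
measures <- c("x", "y")   # measurement columns wanted in the result
times    <- c("a", "b")   # values for the new key column t

ct_generated <- data.frame(t = times, stringsAsFactors = FALSE)
for (m in measures) {
  # e.g. for m = "x" this fills in c("xa", "xb")
  ct_generated[[m]] <- paste0(m, times)
}

print(ct_generated)
##   t  x  y
## 1 a xa ya
## 2 b xb yb
```

With many measurements or time points this scales better than typing, and the generated table can still be printed and checked by eye before it is used.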

      I’ve added how to convert the tidyverse solution into a cdata solution here.

  2. I am a fan of your approach, and have been trying to move towards it over the past few months. It might be a mistake to say “look how easy it is to use cdata”.

    I completely agree that laying out the steps here makes the intentions clear and provides a general procedure for data shape transformation. Contrast this with the Stata example, and the general problem of users bouncing around until they hit magic words that produce the desired structure. That is a real benefit, but it requires some understanding of data structures in order to apply. I’m not sure that is necessarily easy?

    It reminds me a little of the saying about how X makes easy things easy and hard things possible. For me, the current formulation of cdata makes everything possible.

    It strikes me now, saying cdata is easy to use is essentially saying data shape invariance is an easy idea to understand, but I’m not sure it is!

    This isn’t meant to be a criticism in any way. I know you are looking to expand the uptake of these ideas and so just offering a view of someone who is trying the approach.

    1. First, definitely appreciate your note. I think I get the intent, and agree with it.

      Let’s say it is our aspiration to make the methodology easy.

      The theory and design took us a while to work out, so in that sense it isn’t “easy” and takes a while to internalize. Even for use there is a period of “can’t ever use this thing without one of the examples in front of me” (which is part of why we write these). However, we think we can eliminate the “darn I can’t remember the magic words, which way is which function name again?” problem.

      I myself can’t make the transform table without making a drawing. So in some sense it is hard or beyond what I can do in my head. However, if I do take the time to sketch out the transform on paper, I can get it to work. Also being able to print the transform, as we showed in the note, is a major step forward in clarity.

      And you are correct cdata makes a lot of transforms possible. For example a single gather/spread is exactly a cdata transform where the control table (or prototype block record) has exactly two columns. The big scatter plot in the package README is a fairly exotic transform that has complex in-record keying and is creating new cells (so actually can’t be inverted).
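To illustrate the gather/spread remark, here is a sketch of a two-column control table on a cut-down version of the example data. It assumes cdata 1.0.9 or newer for layout_by(); the column names key and value and the variable names are our choices, not anything cdata requires.

```r
library(cdata)

d <- build_frame(
  "id", "xa", "xb" |
    1 , 1   , 3    |
    2 , 2   , 4    )

# Two columns only: one key column and one value column.
ct_gather <- qchar_frame(
  "key" , "value" |
    "xa", xa      |
    "xb", xb      )

gather_spec <- rowrecs_to_blocks_spec(ct_gather, recordKeys = "id")

long <- layout_by(gather_spec, d)      # plays the role of gather()
print(long)

wide <- layout_by(t(gather_spec), long)  # plays the role of spread()
print(wide)
```

Adding further columns to the control table is what takes you from this simple case to the multi-measurement layout in the main example.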
