In between client work, John and I have been busy working on our book, Practical Data Science with R, 2nd Edition. To demonstrate a toy example for the section I’m working on, I needed scatter plots of the petal and sepal dimensions of the iris
data, like so:

I wanted a plot for petal dimensions and sepal dimensions, but I also felt that two plots took up too much space. So, I thought, why not make a faceted graph that shows both:

Except — which columns do I plot and what do I facet on?
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Here’s one way to create the plot I want, using the cdata
package along with ggplot2
.
First, load the packages and data:
library("ggplot2")
library("cdata")
iris <- data.frame(iris)
Now define the data-shaping transform, or control table. The control table is basically a picture that sketches out the final data shape that I want. I want to specify the x
and y
columns of the plot (call these the value columns of the data frame) and the column that I am faceting by (call this the key column of the data frame). And I also need to specify how the key and value columns relate to the existing columns of the original data frame.
Here’s what the control table looks like:
The control table specifies that the new data frame will have the columns flower_part
, Length
and Width
. Every row of iris
will produce two rows in the new data frame: one with a flower_part
value of Petal
, and another with a flower_part
value of Sepal
. The Petal
row will take the Petal.Length
and Petal.Width
values in the Length
and Width
columns respectively. Similarly for the Sepal
row.
Here I create the control table in R, using the convenience function wrapr::build_frame()
to create the controlTable
data frame in a legible way.
(controlTable <- wrapr::build_frame(
"flower_part", "Length" , "Width" |
"Petal" , "Petal.Length", "Petal.Width" |
"Sepal" , "Sepal.Length", "Sepal.Width" ))
## flower_part Length Width
## 1 Petal Petal.Length Petal.Width
## 2 Sepal Sepal.Length Sepal.Width
Now I apply the transform to iris
using the function rowrecs_to_blocks()
. I also want to carry along the Species
column so I can color the scatterplot points by species.
iris_aug <- rowrecs_to_blocks(
iris,
controlTable,
columnsToCopy = c("Species"))
head(iris_aug)
## Species flower_part Length Width
## 1 setosa Petal 1.4 0.2
## 2 setosa Sepal 5.1 3.5
## 3 setosa Petal 1.4 0.2
## 4 setosa Sepal 4.9 3.0
## 5 setosa Petal 1.3 0.2
## 6 setosa Sepal 4.7 3.2
And now I can create the plot!
ggplot(iris_aug, aes(x=Length, y=Width)) +
geom_point(aes(color=Species, shape=Species)) +
facet_wrap(~flower_part, labeller = label_both, scale = "free") +
ggtitle("Iris dimensions") +
scale_color_brewer(palette = "Dark2")

In the next post in this series, I will show how to use cdata
and ggplot2
to create a scatterplot matrix.
Categories: Tutorials
Nina Zumel
Data scientist with Win Vector LLC. I also dance, read ghost stories and folklore, and sometimes blog about it all.
The following is an alternative method using only the tidy verse packages:
This example can be done with one
rbind
and creating one trivial new column. I can also see the value in having a programmatic way of doing it usingcdata
to write a specification. But that mess of percentage sandwiches masquerading as “tidy”? It has pipes within functions within pipes. Where’s the value in that?The
cdata
system actually gives one extra additional ability to reason about these transforms.For example: for this direction (
rowrecs_to_blocks()
) any two row control table is anrbind
in disguise (as this is), and any two column control table is atidyr::gather()
(so this isn’t a singletidyr::gather()
).In fact the control table is literally the column re-namings you would do to solve this with an an rbind; but written in a very orderly and concise format, check out how the colored arguments match in the diagram comparing two solutions, below.
Also in terms of Codd-style data manipulation algebra:
rowrecs_to_blocks()
is analogous to joining with the control table, andblocks_to_rowrecs()
is its inverse (and looks a lot like a table widening followed by a summary/coalesce).Some articles and video developing the theory:
Coordinatized Data: A Fluid Data Specification
Fluid Data
Fluid data reshaping with cdata
Data Shaping with cdata
There is indeed a tidyverse solution, but it is easier than what Justin proposes:
Ahh. I felt there was a simpler method I just was not coming up with it. Thanks for the example.
We now have an article up detailing more on the systematic method to derive the transform control table for a given desired transform here.
Thanks for the post. Also could be accomplished with 3 sqldf statements
septal = sqldf(‘select Species, “Septal” as flower_part, “Sepal.Length” as length, “Sepal.Width” as width from iris ‘)
petal = sqldf(‘select Species, “Petal” as flower_part, “Petal.Length” as length, “Petal.Width” as width from iris ‘)
iris_aug = sqldf(“select * from septal
union all
select * from petal”)
But good to have other tools available and I have not used cdata much. The concept of a control table is nice.
In fact
cdata
itself can translate the control table intoSQL
code realizing the transform (and thus be used with databases).