In my previous post, I showed how to use cdata
package along with ggplot2
‘s faceting facility to compactly plot two related graphs from the same data. This got me thinking: can I use cdata
to produce a ggplot2
version of a scatterplot matrix, or pairs plot?
A pairs plot compactly plots every (numeric) variable in a dataset against every other one. In base plot, you would use the pairs()
function. Here is the base version of the pairs plot of the iris
dataset:
pairs(iris[1:4],
main = "Anderson's Iris Data -- 3 species",
pch = 21,
bg = c("#1b9e77", "#d95f02", "#7570b3")[unclass(iris$Species)])
There are other ways to do this, too:
# not run
library(ggplot2)
library(GGally)
ggpairs(iris, columns=1:4, aes(color=Species)) +
ggtitle("Anderson's Iris Data -- 3 species")
library(lattice)
splom(iris[1:4],
groups=iris$Species,
main="Anderson's Iris Data -- 3 species")
But I wanted to see if cdata
was up to the task. So here we go….
First, load the packages:
library(ggplot2)
library(cdata)
To create the pairs plot in ggplot2
, I need to reshape the data appropriately. For cdata
, I need to specify what shape I want the data to be in, using a control table. See the last post for how the control table works. For this task, creating the control table is slightly more involved.
Here, I specify the variables I want to plot.
meas_vars <- colnames(iris)[1:4]
The function expand_grid()
returns a data frame of all combinations of its arguments; in this case, I want all pairs of variables.
# the data.frame() call strips the attributes from
# the frame returned by expand.grid()
controlTable <- data.frame(expand.grid(meas_vars, meas_vars,
stringsAsFactors = FALSE))
# rename the columns
colnames(controlTable) <- c("x", "y")
# add the key column
controlTable <- cbind(
data.frame(pair_key = paste(controlTable[[1]], controlTable[[2]]),
stringsAsFactors = FALSE),
controlTable)
controlTable
## pair_key x y
## 1 Sepal.Length Sepal.Length Sepal.Length Sepal.Length
## 2 Sepal.Width Sepal.Length Sepal.Width Sepal.Length
## 3 Petal.Length Sepal.Length Petal.Length Sepal.Length
## 4 Petal.Width Sepal.Length Petal.Width Sepal.Length
## 5 Sepal.Length Sepal.Width Sepal.Length Sepal.Width
## 6 Sepal.Width Sepal.Width Sepal.Width Sepal.Width
## 7 Petal.Length Sepal.Width Petal.Length Sepal.Width
## 8 Petal.Width Sepal.Width Petal.Width Sepal.Width
## 9 Sepal.Length Petal.Length Sepal.Length Petal.Length
## 10 Sepal.Width Petal.Length Sepal.Width Petal.Length
## 11 Petal.Length Petal.Length Petal.Length Petal.Length
## 12 Petal.Width Petal.Length Petal.Width Petal.Length
## 13 Sepal.Length Petal.Width Sepal.Length Petal.Width
## 14 Sepal.Width Petal.Width Sepal.Width Petal.Width
## 15 Petal.Length Petal.Width Petal.Length Petal.Width
## 16 Petal.Width Petal.Width Petal.Width Petal.Width
The control table specifies that for every row of iris
, sixteen new rows get produced, one for each possible pair of variables. The column pair_key
will be the key column in the new data frame; there’s one key level for every possible pair of variables. The columns x
and y
will be the value columns in the new data frame — these will be the columns that we plot.
Now I can create the new data frame, using rowrecs_to_blocks()
. I’ll also carry along the Species
column to color the points in the plot.
iris_aug = rowrecs_to_blocks(
iris,
controlTable,
columnsToCopy = "Species")
head(iris_aug)
## Species pair_key x y
## 1 setosa Sepal.Length Sepal.Length 5.1 5.1
## 2 setosa Sepal.Width Sepal.Length 3.5 5.1
## 3 setosa Petal.Length Sepal.Length 1.4 5.1
## 4 setosa Petal.Width Sepal.Length 0.2 5.1
## 5 setosa Sepal.Length Sepal.Width 5.1 3.5
## 6 setosa Sepal.Width Sepal.Width 3.5 3.5
Note that the data is now sixteen times larger, which I admit is perverse.
If I didn’t care about how the individual subplots were arranged, I’d be done: I’d plot y
vs x
, and facet_wrap
on pair_key
. But I want the subplots arranged in a grid. To do this I use facet_grid
, which will require two key columns:
splt <- strsplit(iris_aug$pair_key, split = " ", fixed = TRUE)
iris_aug$xv <- vapply(splt, function(si) si[[1]], character(1))
iris_aug$yv <- vapply(splt, function(si) si[[2]], character(1))
head(iris_aug)
## Species pair_key x y xv yv
## 1 setosa Sepal.Length Sepal.Length 5.1 5.1 Sepal.Length Sepal.Length
## 2 setosa Sepal.Width Sepal.Length 3.5 5.1 Sepal.Width Sepal.Length
## 3 setosa Petal.Length Sepal.Length 1.4 5.1 Petal.Length Sepal.Length
## 4 setosa Petal.Width Sepal.Length 0.2 5.1 Petal.Width Sepal.Length
## 5 setosa Sepal.Length Sepal.Width 5.1 3.5 Sepal.Length Sepal.Width
## 6 setosa Sepal.Width Sepal.Width 3.5 3.5 Sepal.Width Sepal.Width
And now I can produce the graph, using facet_grid
.
# reorder the key columns to be the same order
# as the base version above
iris_aug$xv <- factor(as.character(iris_aug$xv),
meas_vars)
iris_aug$yv <- factor(as.character(iris_aug$yv),
meas_vars)
ggplot(iris_aug, aes(x=x, y=y)) +
geom_point(aes(color=Species, shape=Species)) +
facet_grid(yv~xv, labeller = label_both, scale = "free") +
ggtitle("Anderson's Iris Data -- 3 species") +
scale_color_brewer(palette = "Dark2") +
ylab(NULL) +
xlab(NULL)
This pair plot has x = y
plots on the diagonals instead of the names of the variables, but you can confirm that it is otherwise the same as the pair plot produced by pairs()
.
Of course, calling pairs()
(or ggpairs()
, or splom()
) is a lot easier than all this, but now I’ve proven to myself that cdata
with ggplot2
can do the job. This version does have a few advantages. It comes with a legend by default, which is nice. And it’s not obvious how to change the color palette in ggpairs()
— I prefer the Brewer Dark2 palette, myself.
Luckily, this code is straightforward to wrap as a function, so if you like the cdata
version, I’ve now added the PairPlot()
function to WVPlots
. Now it’s a one-liner, too.
library(WVPlots)
PairPlot(iris,
colnames(iris)[1:4],
"Anderson's Iris Data -- 3 species",
group_var = "Species")
Categories: Tutorials
Nina Zumel
Data scientist with Win Vector LLC. I also dance, read ghost stories and folklore, and sometimes blog about it all.
Thanks for this! Maybe to add, the ggpairs function from the GGally package produces a very nice scatter matrix as well, and it even handles categorical variables (not advertising my own package here, btw).
Glad you are a
GGally
fan. Just so our readers are not confused by the comment: there has always been aGGally
example right after the the firstgraphics::pairs()
example in the article (followed by alattice
example).