We saw this scatterplot with marginal densities the other day, in a blog post by Thomas Wiecki:
The graph was produced in Python, using the seaborn package. Seaborn calls it a “jointplot;” it’s called a “scatterhist” in Matlab, apparently. The seaborn version also shows the strength of the linear relationship between the x and y variables. Nice.
I like this plot a lot, but we’re mostly an R shop here at Win-Vector. So we asked: can we make this plot in ggplot2? Natively, ggplot2 can add rugs to a scatterplot, but doesn’t immediately offer marginals, as above.
However, you can use Dean Attali’s ggExtra package. Here’s an example using the same data as the seaborn jointplot above; you can download the dataset here.
library(ggplot2)
library(ggExtra)

frm = read.csv("tips.csv")

plot_center = ggplot(frm, aes(x=total_bill,y=tip)) +
  geom_point() +
  geom_smooth(method="lm")

# default: type="density"
ggMarginal(plot_center, type="histogram")
I didn’t bother to add the internal annotation for the goodness of the linear fit, though I could.
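For readers who do want that annotation, here is a minimal sketch of one way to add it. It is an illustration, not the code used for the graph above: the data frame is a synthetic stand-in with the same column names as tips.csv, and the label placement is an arbitrary choice. It fits the same linear model that geom_smooth(method="lm") draws and writes its R-squared into a corner of the scatterplot before handing the plot to ggMarginal().

```r
library(ggplot2)
library(ggExtra)

# Synthetic stand-in for tips.csv (assumption: same column names, made-up values)
set.seed(1)
frm <- data.frame(total_bill = runif(100, 5, 50))
frm$tip <- 1 + 0.1 * frm$total_bill + rnorm(100, sd = 1)

# Fit the same straight line that geom_smooth(method = "lm") displays
fit <- lm(tip ~ total_bill, data = frm)
r2 <- summary(fit)$r.squared
label <- sprintf("R-squared = %.2f", r2)

# x = -Inf, y = Inf pins the text to the top-left corner of the panel;
# hjust/vjust nudge it inward so it is not clipped
plot_center <- ggplot(frm, aes(x = total_bill, y = tip)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  annotate("text", x = -Inf, y = Inf, label = label,
           hjust = -0.1, vjust = 1.5)

ggMarginal(plot_center, type = "histogram")
```

To use the real data, replace the synthetic frm with read.csv("tips.csv") as in the example above.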
The ggMarginal() function goes to heroic effort to line up the coordinate axes of all the graphs, and is probably the best way to do a scatterplot-plus-marginals in ggplot (you can also do it in base graphics, of course). Still, we were curious how close we could get to the seaborn version: marginal density and histograms together, along with annotations. Below is our version of the graph; we report the linear fit’s R-squared, rather than the Pearson correlation.
# our own (very beta) plot package: details later
library(WVPlots)

frm = read.csv("tips.csv")

ScatterHist(frm, "total_bill", "tip",
            smoothmethod="lm",
            annot_size=3,
            title="Tips vs. Total Bill")
You can see that (at the moment) we’ve resorted to padding the axis labels with underbars to force the x-coordinates of the top marginal plot and the scatterplot to align; white space gets trimmed. This is profoundly unsatisfying, and less robust than the ggMarginal version. If you’re curious, the code is here. It relies on some functions in the file sharedFunctions.R in the same repository. Our more general version will do either a linear or lowess/spline smooth, and you can also adjust the histogram and density plot parameters.
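To make the alignment problem concrete, here is a rough sketch of the general idea: build the scatterplot and the two marginal plots as separate ggplot objects, then place them in a grid. This is not the WVPlots code. It assumes the gridExtra package, a synthetic stand-in for tips.csv, and a helper named marginal() that is made up for this example.

```r
library(ggplot2)
library(gridExtra)

# Synthetic stand-in for tips.csv (assumption: same column names, made-up values)
set.seed(1)
frm <- data.frame(total_bill = runif(100, 5, 50))
frm$tip <- 1 + 0.1 * frm$total_bill + rnorm(100, sd = 1)

# Center panel: scatterplot with a linear fit
center <- ggplot(frm, aes(x = total_bill, y = tip)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Hypothetical helper: histogram on the density scale with a density curve on top
marginal <- function(data, column) {
  ggplot(data, aes(x = .data[[column]])) +
    geom_histogram(aes(y = after_stat(density)), bins = 20,
                   fill = "grey70", color = "white") +
    geom_density() +
    labs(x = NULL, y = NULL)
}

top <- marginal(frm, "total_bill")
right <- marginal(frm, "tip") + coord_flip()

# 2 x 2 layout: marginal on top, blank corner, scatterplot, marginal on the right
layout <- arrangeGrob(top, grid::nullGrob(), center, right,
                      ncol = 2, widths = c(4, 1), heights = c(1, 4))
grid::grid.newpage()
grid::grid.draw(layout)
```

Nothing in this layout forces the panels to share axes: each plot sizes its own axis labels and margins, so the top marginal and the scatterplot will generally not line up exactly. That mismatch is what the underbar padding works around, and what ggMarginal() handles for you.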
Thanks to Slawa Rokicki’s excellent ggplot2: Cheatsheet for Visualizing Distributions for our basic approach. Check out the graph at the bottom of her post — and while you’re at it, check out the rest of her blog too.
Nina Zumel
Data scientist with Win Vector LLC. I also dance, read ghost stories and folklore, and sometimes blog about it all.
An easy way to try Nina’s plot is to install the package from Github (using devtools):
You might be able to do what you’re trying with ggplot2, gtable and grid. I did something requiring similar manipulations and alignments of multiple plot panels for my plot.qcc rewrite (devtools::install_github("tomhopper/gcc_ggplot")).
Thanks! We’ve been working with grid, but I haven’t tried gtable yet. I will check it out (and your code, too).
A couple of years ago, I was doing something similar using ggplot2 and gtable. See https://github.com/SandyMuspratt/ScatterBoxPlot
Thank you! As I said to the commenter above, we’ve not tried gtable. Thanks for the pointer, and I will check out your version as well.
Though fancy can certainly be good, a more generalized panelplot might be of some interest. Here is a riff with marginals, borrowing on functionalities in asbio::panel.cor.res.
The robust confidence bounds (in green and grey), robust correlation coefficient and robust analogue of the t-test are from Rand Wilcox. Dotted verticals and horizontals are arithmetic means; red is linear fit while blue is loess estimator (when appropriate). Though the x-axis labeling for binary items does need tweaking, this kind of automated panelplot, called with just one line, provides a lot of payback for datasets containing reasonable k’s and n’s.
Thanks for posting this! I’ve seen variations of pair plots like that before. I like them very much when I want to get a quick overview of several variables at once.
I sometimes use a version based on ggplot (package ggally, I think), but your base plot version with the additional annotations (linear fits/loess, means, robust bounds, etc) is quite nice.
I’m glad you found my package easy to use! Your ScatterHist output looks really nice (haven’t looked at the code).
I never really thought of the usecase of having both types of plots on top of each other. You’re welcome to submit a PR or open a github issue if it’s something that you think more people will find useful :)
Thanks for stopping by! ggExtra was a good find for us. And I liked your marginal boxplot variation on ggMarginal, too.