We saw this scatterplot with marginal densities the other day, in a blog post by Thomas Wiecki:
The graph was produced in Python, using the seaborn package. Seaborn calls it a “jointplot”; it’s called a “scatterhist” in Matlab, apparently. The seaborn version also shows the strength of the linear relationship between the x and y variables. Nice.
I like this plot a lot, but we’re mostly an R shop here at Win-Vector. So we asked: can we make this plot in ggplot2? Natively, ggplot2 can add rugs to a scatterplot, but doesn’t immediately offer marginals, as above.
    library(ggplot2)
    library(ggExtra)

    frm = read.csv("tips.csv")

    plot_center = ggplot(frm, aes(x=total_bill,y=tip)) +
      geom_point() +
      geom_smooth(method="lm")

    # default: type="density"
    ggMarginal(plot_center, type="histogram")
I didn’t bother to add the internal annotation for the goodness of the linear fit, though I could.
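For the curious, here is a minimal sketch of how such an annotation could be added with plain ggplot2. It fits the same linear model that geom_smooth(method="lm") draws and writes the R-squared in the corner of the scatterplot. The data frame is made-up stand-in data so the snippet runs without tips.csv, and the label placement and text size are arbitrary choices of this sketch, not anything taken from seaborn or ggExtra.

```r
library(ggplot2)

# Made-up stand-in data so the sketch runs without tips.csv;
# substitute frm = read.csv("tips.csv") to use the real data.
set.seed(2015)
total_bill = runif(100, min = 5, max = 50)
frm = data.frame(total_bill = total_bill,
                 tip = 1 + 0.1 * total_bill + rnorm(100, sd = 1))

# Fit the same linear model that geom_smooth(method="lm") draws
fit = lm(tip ~ total_bill, data = frm)
rsq = summary(fit)$r.squared
label = sprintf("R-squared = %.2f", rsq)

# Pin the label to the upper left corner of the plotting region
plot_center = ggplot(frm, aes(x = total_bill, y = tip)) +
  geom_point() +
  geom_smooth(method = "lm") +
  annotate("text", x = -Inf, y = Inf, label = label,
           hjust = -0.1, vjust = 1.5, size = 3)

print(plot_center)
```

The annotated plot_center should be usable in place of the unannotated one in the ggMarginal() call, though I have not checked how the annotation layer interacts with every ggMarginal option.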
The ggMarginal() function goes to heroic effort to line up the coordinate axes of all the graphs, and is probably the best way to do a scatterplot-plus-marginals in ggplot (you can also do it in base graphics, of course). Still, we were curious how close we could get to the seaborn version: marginal density and histograms together, along with annotations. Below is our version of the graph; we report the linear fit’s R-squared, rather than the Pearson correlation.
    # our own (very beta) plot package: details later
    library(WVPlots)

    frm = read.csv("tips.csv")

    ScatterHist(frm, "total_bill", "tip",
                smoothmethod="lm",
                annot_size=3,
                title="Tips vs. Total Bill")
You can see that (at the moment) we’ve resorted to padding the axis labels with underbars to force the x-coordinates of the top marginal plot and the scatterplot to align; white space gets trimmed. This is profoundly unsatisfying, and less robust than the ggMarginal version. If you’re curious, the code is here. It relies on some functions in the file sharedFunctions.R in the same repository. Our more general version will do either a linear or lowess/spline smooth, and you can also adjust the histogram and density plot parameters.
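To give a feel for where the alignment problem comes from, here is a minimal sketch of the general idea: build the scatterplot and the two marginals as separate ggplot2 plots, then place them in a 2x2 grid with gridExtra. This is an illustration under my own assumptions, not the ScatterHist code itself. The data is a made-up stand-in for tips.csv, and the bin count, colors, and 4:1 panel proportions are arbitrary.

```r
library(ggplot2)
library(gridExtra)

# Made-up stand-in data; substitute frm = read.csv("tips.csv") for the real data.
set.seed(2015)
total_bill = runif(100, min = 5, max = 50)
frm = data.frame(total_bill = total_bill,
                 tip = 1 + 0.1 * total_bill + rnorm(100, sd = 1))

# Center panel: scatterplot with a linear smooth
plot_center = ggplot(frm, aes(x = total_bill, y = tip)) +
  geom_point() +
  geom_smooth(method = "lm")

# Top panel: histogram on the density scale, with a density curve overlaid
# (after_stat() needs ggplot2 3.3.0 or later)
plot_top = ggplot(frm, aes(x = total_bill)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20,
                 fill = "gray", color = "white") +
  geom_density(color = "blue")

# Right panel: the same idea for tip, flipped to run along the y axis
plot_right = ggplot(frm, aes(x = tip)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20,
                 fill = "gray", color = "white") +
  geom_density(color = "blue") +
  coord_flip()

# Blank placeholder for the upper right corner
empty = ggplot() + theme_void()

# 2x2 layout: the marginals get a small share of the width and height
combined = grid.arrange(plot_top, empty, plot_center, plot_right,
                        ncol = 2, widths = c(4, 1), heights = c(1, 4))
```

Nothing in this layout ties the panels' coordinate systems together. The three plots are drawn independently, so the top marginal and the scatterplot line up only if their x ranges match and their y-axis labels happen to take up the same width. That is the gap the underbar padding papers over, and the one ggMarginal() handles properly.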
Thanks to Slawa Rokicki’s excellent ggplot2: Cheatsheet for Visualizing Distributions for our basic approach. Check out the graph at the bottom of her post — and while you’re at it, check out the rest of her blog too.
Data scientist with Win Vector LLC. I also dance, read ghost stories and folklore, and sometimes blog about it all.