We saw this scatterplot with marginal densities the other day, in a blog post by Thomas Wiecki:

The graph was produced in Python, using the seaborn package. Seaborn calls it a “jointplot;” it’s called a “scatterhist” in Matlab, apparently. The seaborn version also shows the strength of the linear relationship between the x and y variables. Nice.

I like this plot a lot, but we’re mostly an R shop here at Win-Vector. So we asked: can we make this plot in ggplot2? Natively, ggplot2 can add rugs to a scatterplot, but doesn’t immediately offer marginals, as above.
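For reference, here is a minimal sketch of the rug option mentioned above. It uses R's built-in `mtcars` data as a stand-in (an assumption, so the snippet runs without downloading anything); rugs mark each observation along the axes but do not summarize the marginal distributions the way a histogram or density does.

```r
library(ggplot2)

# mtcars is only a stand-in dataset for illustration
plot_rug <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # one tick per observation along the bottom and left axes
  geom_rug(sides = "bl", alpha = 0.5)

print(plot_rug)
```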

However, you can use Dean Attali’s ggExtra package. Here’s an example using the same data as the seaborn jointplot above; you can download the dataset here.

```
library(ggplot2)
library(ggExtra)

frm = read.csv("tips.csv")

# center plot: scatterplot with a linear fit
plot_center = ggplot(frm, aes(x=total_bill, y=tip)) +
  geom_point() +
  geom_smooth(method="lm")

# default: type="density"
ggMarginal(plot_center, type="histogram")
```

I didn’t bother to add the internal annotation for the goodness of the linear fit, though I could.
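If you do want that annotation, one way to add it is sketched below. This is an illustration rather than anything ggExtra provides: it fits the line with `lm()`, pulls out the R-squared, and places the text with `annotate()`. The built-in `mtcars` data stands in for `tips.csv` (an assumption, so the snippet runs as-is), and the label position in the top-right corner is an arbitrary choice. For a one-variable linear fit with an intercept, the R-squared is just the square of the Pearson correlation, so either number can be reported.

```r
library(ggplot2)
library(ggExtra)

# mtcars is a stand-in for tips.csv (assumption, for a self-contained sketch)
frm <- mtcars

# goodness of fit for the same model that geom_smooth(method = "lm") draws
fit <- lm(mpg ~ wt, data = frm)
r_squared <- summary(fit)$r.squared
pearson_r <- cor(frm$wt, frm$mpg)
fit_label <- sprintf("R-squared = %.2f (Pearson r = %.2f)", r_squared, pearson_r)

plot_center <- ggplot(frm, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  # label anchored at the top-right corner of the data range
  annotate("text", x = max(frm$wt), y = max(frm$mpg),
           label = fit_label, hjust = 1, vjust = 1)

ggMarginal(plot_center, type = "histogram")
```

With the tips data, you would swap in `total_bill` and `tip` for `wt` and `mpg`.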

The `ggMarginal()` function goes to heroic effort to line up the coordinate axes of all the graphs, and is probably the best way to do a scatterplot-plus-marginals in ggplot (you can also do it in base graphics, of course). Still, we were curious how close we could get to the seaborn version: marginal density and histograms together, along with annotations. Below is our version of the graph; we report the linear fit’s R-squared, rather than the Pearson correlation.

```
# our own (very beta) plot package: details later
library(WVPlots)

frm = read.csv("tips.csv")

ScatterHist(frm, "total_bill", "tip",
            smoothmethod="lm",
            annot_size=3,
            title="Tips vs. Total Bill")
```

You can see that (at the moment) we’ve resorted to padding the axis labels with underbars to force the x-coordinates of the top marginal plot and the scatterplot to align; white space gets trimmed. This is profoundly unsatisfying, and less robust than the `ggMarginal` version. If you’re curious, the code is here. It relies on some functions in the file `sharedFunctions.R` in the same repository. Our more general version will do either a linear or lowess/spline smooth, and you can also adjust the histogram and density plot parameters.
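As a point of comparison with the label-padding trick, here is a sketch of a commonly used alternative for aligning stacked ggplot2 panels: convert each plot to a gtable with `ggplotGrob()` and give both the same column widths. This is not what our `ScatterHist` code does; it assumes the `gridExtra` package is installed, uses `mtcars` as a stand-in for the tips data, and relies on both plots having the same gtable layout (same number of columns), which should hold for two simple plots like these.

```r
library(ggplot2)
library(grid)
library(gridExtra)

# mtcars is a stand-in for the tips data (assumption)
x_range <- range(mtcars$wt)

center <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  coord_cartesian(xlim = x_range)

top <- ggplot(mtcars, aes(x = wt)) +
  geom_density() +
  coord_cartesian(xlim = x_range) +
  theme(axis.title.x = element_blank())

# convert to gtables so the layout widths can be edited directly
g_top <- ggplotGrob(top)
g_center <- ggplotGrob(center)

# take the wider of the two for every layout column (axis labels, panel, margins)
shared_widths <- unit.pmax(g_top$widths, g_center$widths)
g_top$widths <- shared_widths
g_center$widths <- shared_widths

# stack the marginal above the scatterplot and draw
aligned <- arrangeGrob(g_top, g_center, ncol = 1, heights = c(1, 3))
grid.newpage()
grid.draw(aligned)
```

Because both panels share the same x limits and the same column widths, the x axes line up without any changes to the axis labels.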

Thanks to Slawa Rokicki’s excellent *ggplot2: Cheatsheet for Visualizing Distributions* for our basic approach. Check out the graph at the bottom of her post — and while you’re at it, check out the rest of her blog too.

### nzumel

Data scientist with Win Vector LLC. I also dance, read ghost stories and folklore, and sometimes blog about it all.

An easy way to try Nina’s plot is to install the package from Github (using devtools):

You might be able to do what you’re trying with ggplot2, gtable and grid. I did something requiring similar manipulations and alignments of multiple plot panels for my plot.qcc rewrite (`devtools::install_github("tomhopper/gcc_ggplot")`).

Thanks! We’ve been working with grid, but I haven’t tried gtable yet. I will check it out (and your code, too).

A couple of years ago, I was doing something similar using ggplot2 and gtable. See https://github.com/SandyMuspratt/ScatterBoxPlot

Thank you! As I said to the commenter above, we’ve not tried gtable. Thanks for the pointer, and I will check out your version as well.

Though fancy can certainly be good, a more generalized panel plot might be of some interest. Here is a riff with marginals, borrowing functionality from `asbio::panel.cor.res`.

The robust confidence bounds (in green and grey), robust correlation coefficient and robust analogue of the t-test are from Rand Wilcox. Dotted verticals and horizontals are arithmetic means; red is linear fit while blue is loess estimator (when appropriate). Though the x-axis labeling for binary items does need tweaking, this kind of automated panelplot, called with just one line, provides a lot of payback for datasets containing reasonable k’s and n’s.

Thanks for posting this! I’ve seen variations of pair plots like that before. I like them very much when I want to get a quick overview of several variables at once.

I sometimes use a version based on ggplot (package GGally, I think), but your base plot version with the additional annotations (linear fits/loess, means, robust bounds, etc.) is quite nice.

I’m glad you found my package easy to use! Your ScatterHist output looks really nice (haven’t looked at the code).

I never really thought of the use case of having both types of plots on top of each other. You’re welcome to submit a PR or open a GitHub issue if it’s something that you think more people will find useful 🙂

Thanks for stopping by! ggExtra was a good find for us. And I liked your marginal boxplot variation on ggMarginal, too.
