The important criterion for a graph is not simply how fast we can see a result; rather it is whether through the use of the graph we can see something that would have been harder to see otherwise or that could not have been seen at all.
— William Cleveland, The Elements of Graphing Data, Chapter 2
In this article, I will discuss some graphs that I find extremely useful in my day-to-day work as a data scientist. While all of them are helpful (to me) for statistical visualization during the analysis process, not all of them will necessarily be useful for presentation of final results, especially to non-technical audiences.
I tend to follow Cleveland’s philosophy, quoted above; these graphs show me — and hopefully you — aspects of data and models that I might not otherwise see. Some of them, however, are non-standard, and tend to require explanation. My purpose here is to share with our readers some ideas for graphical analysis that are either useful to you directly, or will give you some ideas of your own.
The graphs are all produced in
R, using the
ggplot2 package. While
ggplot2 has a fairly high learning curve, it is the most flexible of the
R graphing packages that I have encountered, and I’ve been able to quickly create rich graphics more easily than I would be able to with the
R base graphics, or with other graphics packages.
Let’s start with some exploratory analysis. We will use the
AdultUCI dataset that is included in the
library(arules) data("AdultUCI") dframe = AdultUCI[, c("education", "hours-per-week")] colnames(dframe) = c("education", "hours_per_week") # get rid of the annoying minus signs in the column names
We want to compare the distribution of work-week length to education, using a box-and-whisker plot that is overlaid on a jittered scatterplot of the data.
library(ggplot2) ggplot(dframe, aes(x=education, y=hours_per_week)) + geom_point(colour="lightblue", alpha=0.1, position="jitter") + geom_boxplot(outlier.size=0, alpha=0.2) + coord_flip()
outlier.size=0 argument to
geom_boxplot turns off the outlier plotting, and
coord_flip switches the coordinate axes (because there are a lot of education levels).
The resulting graph:
Recall that the box of a box-and-whisker plot covers the central 50% of the data distribution; the line in the center marks the median. In this case, the work-week length concentrates so strongly at 40 hours (except for PhDs and those with professional degrees; they are doomed to work longer hours, typically) that most of the boxes appear one-sided; it’s easier to see what is happening with both the scatterplot and box-and-whisker superimposed, than it might be with the box-and-whisker alone. We can also see the relative concentration of the subjects along each educational level.
I’ve found that this superimposed graph is fairly easy to explain in a presentation (easier than a plain box-and-whisker, actually). The primary disadvantage that the scatterplot can get illegible for high volume datasets (this set has about 49 thousand rows). In this case, we have to return to the box-and-whisker plot alone.
Beyond exploratory analysis, we also want plots to evaluate the models that we fit. Win-Vector’s bread-and-butter recently has been logistic regression, so we will start with some visualizations for evaluating binary logistic regression models. We’ll use the heart disease dataset that Hastie, et.al, used in the Elements of Statistical Learning.
path = "http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.data" saheart = read.table(path, sep=",",head=T,row.names=1) fmla = "chd ~ sbp + tobacco + ldl + adiposity + famhist + typea + obesity" model = glm(fmla, data=saheart, family=binomial(link="logit"), na.action=na.exclude)
We will make a data frame of chd (the true response, coronary heart disease), and the score from the model.
dframe = data.frame(chd=as.factor(saheart$chd), prediction=predict(model, type="response"))
The standard diagnostic plot for logistic models is the ROC curve, which is fine, but personally, I don’t get a visceral feel for the model from looking at the ROC. Also, if you are interested in setting a score threshold on the model for classification purposes, the ROC adds an additional level of indirection, since it essentially integrates the score away. I used to plot the distribution of score (prediction) versus true response, like so:
ggplot(dframe, aes(x=prediction, colour=chd)) + geom_density()
This visualization tells me whether or not the model scores actually separate the response — in this case, the model identifies negative cases (no coronary heart disease) better than positive cases. The graph is hard to explain to a non-technical audience, and it has the disadvantage that both distributions are separately normalized to have unit area, so you get no sense of the relative proportion of positive and negative cases (in this case, about 35% of the population have coronary heart disease).
Here’s an alternate graph:
ggplot(dframe, aes(x=prediction, fill=chd)) + geom_histogram(position="identity", binwidth=0.05, alpha=0.5)
This is two semi-transparent histograms; the blue histogram for
chd=1 is “in front” of the the red histogram. Because they are histograms, rather than density plots, we can more clearly see the relative distribution of positive to negative cases, and we have a better sense of how well (or not) the model separates the positive cases from the negative ones. Clearly, for most score thresholds, the model will have a fairly high false positive rate. I use this visualization all the time, but it is also fairly hard to explain, the transparency in particular.
We can also use our friend the box-and-whisker scatterplot.
ggplot(dframe, aes(x=chd, y=prediction)) + geom_point(position="jitter", alpha=0.2) + geom_boxplot(outlier.size=0, alpha=0.5)
The median score for the coronary heart disease cases is pulled away from the median score of the healthy subjects, but the central 50% of the two distributions still overlap.
Finally, let’s look at visualizations for linear regression. We’ll use the prostate cancer data from Elements of Statistical Learning.
fmla = "lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + pgg45" model = lm(fmla, data=prostate.data)
We can just
plot(model) for some diagnostic graphs:
par(mfrow = c(2, 2), oma = c(0, 0, 2, 0)) plot(model)
These diagnostics are useful to determine whether or not a linear model is suitable, and to identify outliers; but again, I personally don't get a visceral feel for the model. I prefer to directly plot prediction against true response:
dframe = data.frame(lpsa=prostate.data$lpsa, prediction=predict(model)) title = sprintf("Prostate Cancer modeln R-squared = %1.3f", summary(model)$r.squared) ggplot(dframe, aes(x=lpsa, y=prediction)) + geom_point(alpha=0.2) + geom_line(aes(y=lpsa), colour="blue") + opts(title=title)
This graph gives you the same information as the Residuals vs. Fitted plot, and the Q-Q plot -- in particular, whether there is systematic over- or under-prediction in specific ranges of the data. It will expose outliers, and it is intuitive to explain when presenting your results. Furthermore, it can be used to evaluate other models that predict a continuous response, such as regression trees or polynomial fits.
Which graphs do you find especially useful for your day-to-day work?
Data scientist with Win Vector LLC. I also dance, read ghost stories and folklore, and sometimes blog about it all.