I was flipping through my copy of William Cleveland’s The Elements of Graphing Data the other day; it’s a book worth revisiting. I’ve always liked Cleveland’s approach to visualization as statistical analysis. His quest to ground visualization principles in the context of human visual cognition (he called it “graphical perception”) generated useful advice for designing effective graphics.
I confess I don’t always follow his advice. Sometimes it’s because I don’t agree with him, but also it’s because I use ggplot for visualization, and I’m lazy. I like ggplot because it excels at layering multiple graphics into a single plot and because it looks good; but deviating from the default presentation is often a bit of work. How much am I losing out on by this? I decided to do the work and find out.
Details of specific plots aside, the key points of Cleveland’s philosophy are:
- A graphic should display as much information as it can, with the lowest possible cognitive strain to the viewer.
- Visualization is an iterative process. Graph the data, learn what you can, and then regraph the data to answer the questions that arise from your previous graphic.
Of course, when you are your own viewer, part of the cognitive strain in visualization comes from difficulty generating the desired graphic. So we’ll start by making the easiest possible ggplot graph, and working our way from there — Cleveland style.
Let’s look at some data on household languages in the United States, according to the U.S. Census Bureau’s 2011 American Community Survey. We took a sample of just over 18,000 households, with information about state of residence and languages spoken in the home.
    > summary(hdata)
         hh_id                  state                       hh.lang
     Min.   :     33   California  : 1987   English only        :14732
     1st Qu.: 376776   Texas       : 1376   Spanish             : 1862
     Median : 751107   Florida     : 1218   Other Indo-European :  843
     Mean   : 751033   New York    : 1156   Asian/Pacific Island:  515
     3rd Qu.:1120868   Illinois    :  738   Other Lang.         :  195
     Max.   :1501962   Pennsylvania:  738
                       (Other)     :10934
The PUMS dataset that I started from can be found here. I restricted my sample to households that were non-vacant and non-institutional (meaning that I eliminated prisons, convalescent homes, etc.).
First: let’s just see the distribution of households by state:
    ggplot(hdata) + geom_bar(aes(x=state), fill="gray") +
      coord_flip() +
      # reduce the font size of the y-axis tick labels
      theme(axis.text.y=element_text(size=rel(0.8)))
Not terribly easy to read. Cleveland would recommend sorting the states by frequency. I seem to recall that this is straightforward in base graphics (it’s been a while since I’ve used them), but ggplot draws a factor in the order of its levels, and those are alphabetical by default. To change the plotting order, you have to reorder the factor levels.
    # Reorder the state column levels to population-sorted order.
    # Every row gets a score of 1; summing the scores by level gives the household count.
    hdata = transform(hdata, state=reorder(state, 1+numeric(nrow(hdata)), FUN=sum))

    ggplot(hdata) + geom_bar(aes(x=state), fill="gray") +
      coord_flip() +
      theme(axis.text.y=element_text(size=rel(0.8)))
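To see what the `reorder` call is doing, here is a minimal sketch on a made-up factor (the values are hypothetical, not taken from the ACS sample). Scoring every row 1 and summing within each level yields the count per level, and `reorder` then sorts the levels by increasing score.

```r
# Toy factor (hypothetical values): "a" appears 3 times, "b" once, "c" twice.
state <- factor(c("a", "a", "a", "b", "c", "c"))
levels(state)                      # default order: alphabetical

# Score every row 1, sum the scores within each level, sort levels by that sum.
by_count <- reorder(state, rep(1, length(state)), FUN = sum)
levels(by_count)                   # increasing count: "b" "c" "a"

# Scoring every row -1 reverses the order.
by_count_desc <- reorder(state, rep(-1, length(state)), FUN = sum)
levels(by_count_desc)              # decreasing count: "a" "c" "b"
```

Because a flipped discrete axis draws the first level at the bottom, the increasing order is the one that puts the largest group at the top of the chart.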
Better. Now you can see the relative population order of the states at a glance. Cleveland would argue that a dotplot is even more informationally dense than this barchart. I didn’t completely buy that — the above barchart seems to tell the whole story. What else is there? But let’s try anyway.
    # A dotplot: pretty close approximation to the style in Cleveland's book.
    # The theme arguments refer to the FINAL x and y axes,
    # not the pre-coord_flip axes.
    # stat="count" tallies the rows per state (ggplot2 2.0 and later;
    # older versions used stat="bin" for this).
    ggplot(hdata) +
      geom_point(aes(x=state), stat="count") +
      coord_flip() +
      theme(
        # remove the vertical grid lines
        panel.grid.major.x = element_blank(),
        # explicitly set the horizontal lines (or they will disappear too)
        panel.grid.major.y = element_line(linetype=3, color="darkgray"),
        axis.text.y=element_text(size=rel(0.8))
      )
I admit it, there is a bit more information in this graph. It’s more obvious how much California outpaces the other states, and how much more populated even the next three states (Texas, Florida and New York) are than the remaining 46 states plus District of Columbia. And it’s a clean visual.
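For comparison, base R has a `dotchart()` function that draws this style of plot directly. The sketch below runs it on made-up counts (the state names and numbers are hypothetical stand-ins, not the ACS sample).

```r
# Hypothetical household records, one entry per household.
state <- factor(rep(c("Ohio", "Texas", "Utah"), times = c(3, 5, 1)))

# Tabulate, then sort in increasing order of count.
counts <- sort(table(state))

# dotchart() draws the first element at the bottom,
# so the largest group ends up at the top.
dotchart(as.numeric(counts), labels = names(counts),
         xlab = "households", pch = 19)
```

This skips the factor reordering step, at the cost of giving up ggplot's layering and faceting.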
Let’s try something more complicated now: household language by state. The easy plot (bar chart) first.
    # This is already sorted by state population.
    ggplot(hdata) + geom_bar(aes(x=state, fill=hh.lang)) +
      coord_flip() +
      theme(axis.text.y=element_text(size=rel(0.8)))
This is hard to read for a number of reasons. First, the “English only” household segment dominates the rest: about 81% of the households in this sample (14,732 of 18,147). Second, you can’t easily compare the absolute counts of households that speak a given language family across states, because except for “Asian/Pacific Islander”, the colored segments don’t have a common ground-line to compare from. Third, you can’t easily compare the relative prevalence of the different language families across states, because the bars are all different lengths.
We can address point one by not plotting the English-only households, and point two by using the position="dodge" argument to geom_bar or, as a better solution, faceting the graph by language family (facet_wrap(~hh.lang)). Let’s try the second approach. Instead of dropping “English only,” I log-scaled the count axes to make all of the graphs mutually legible. In theory, I could use the scales="free_x" argument to facet_wrap to make all the facets fill the plotting area, but the scales argument doesn’t work when you use coord_flip().
    ggplot(hdata) + geom_bar(aes(x=state), fill="gray") +
      facet_wrap(~hh.lang) +
      scale_y_continuous(trans="log2") +
      coord_flip() +
      theme(axis.text.y=element_text(size=rel(0.8)))
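For reference, the first approach (dropping the English-only households and dodging the bars) might look like the sketch below. The `toy` data frame is a hypothetical stand-in for hdata, with made-up rows, and the sketch assumes ggplot2 is installed.

```r
library(ggplot2)

# Hypothetical stand-in for hdata: one row per household.
toy <- data.frame(
  state   = c("CA", "CA", "CA", "TX", "TX", "NY", "NY", "NY"),
  hh.lang = c("English only", "Spanish", "Spanish", "Spanish",
              "English only", "Other Lang.", "Spanish", "English only")
)

# Point one: drop the dominant English-only segment.
non_english <- subset(toy, hh.lang != "English only")

# Point two: dodge the bars so every segment starts from a common baseline.
p <- ggplot(non_english) +
  geom_bar(aes(x = state, fill = hh.lang), position = "dodge") +
  coord_flip()
# print(p) draws the chart.
```

With five language families and 51 states the dodged version gets crowded quickly, which is one reason faceting scales better here.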
Looking at this, I think that dotplots would be preferable to barcharts, at least aesthetically. I also want the language families ordered by prevalence, rather than alphabetically.
    # Order the facets (hh.lang) by population.
    # Use -1 instead of 1 to sort in decreasing order.
    hdata = transform(hdata, hh.lang = reorder(hh.lang, -1+numeric(nrow(hdata)), FUN=sum))

    # now dotplot (stat="count" tallies rows per state; older ggplot2 used stat="bin")
    ggplot(hdata) +
      geom_point(aes(x=state), stat="count") +
      facet_wrap(~hh.lang) +
      scale_y_continuous(trans="log2") +
      coord_flip() +
      theme(
        panel.grid.major.x = element_blank(),
        panel.grid.major.y = element_line(linetype=3, color="darkgray"),
        axis.text.y=element_text(size=rel(0.8))
      )
The resulting graph shows the language families left-to-right/top-to-bottom in decreasing order of their (national) prevalence, and the states top-to-bottom in decreasing order of population. This order matches the way English-language readers scan a page.
If you look hard, you can spot a few states that have unusually high or unusually low language prevalence relative to other states of their size. New Mexico has a slightly lower than expected number of English-only households, and higher than expected Spanish-speaking and “other non-Asian/non-Indo-European” speaking households. Arizona shows the same pattern, though I didn’t mark it. Alaska has a much higher count of “other non-Asian/non-Indo-European” speaking households than you would expect for its population. I wish that I could add the x-axis labels to the right side of the graph as well; it would make matching points to states easier. I don’t think it’s possible in ggplot; I remember this being a deliberate design choice by Hadley Wickham, since dual-axis labels can be used misleadingly.
We’re hitting about the limit of what can be gleaned from this family of graphs (the count of household languages by state). Let’s look at the relative proportions of language families within each state. I’ll also go ahead and reorder the state levels by the fraction of non-English-only households.
    # Get the fraction of non-English-only households by state.
    other.lang = ifelse(hdata$hh.lang=="English only", 0, 1)
    tmp = aggregate(other.lang, by=list(state=hdata$state), FUN=mean)
    other.lang.map = tmp$x; names(other.lang.map) = tmp$state

    # Add the fraction to hdata.
    # Look up by state name (as.character) rather than by factor code,
    # so the result does not depend on the current level order.
    hdata$hh.other.lang = other.lang.map[as.character(hdata$state)]

    # Reorder the state levels by the fraction of non-English-only households, increasing.
    hdata = transform(hdata, state=reorder(state, hh.other.lang))

    # Get the 10 (okay, 11) states with the highest fraction of non-English-only households.
    nl = nlevels(hdata$state)
    top10states = levels(hdata$state)[(nl-10):nl]
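The same per-state fraction can also be computed in one step with `tapply`, since the mean of a logical vector is the fraction of TRUE values. This is a sketch on a hypothetical stand-in for hdata (the rows are made up), not a replacement for the code above.

```r
# Hypothetical stand-in for hdata.
toy <- data.frame(
  state   = factor(c("CA", "CA", "CA", "CA", "TX", "TX", "NY", "NY", "NY")),
  hh.lang = c("English only", "Spanish", "Spanish", "Other Lang.",
              "English only", "Spanish",
              "English only", "English only", "Spanish")
)

# Fraction of households per state that are not English-only.
frac_other <- tapply(toy$hh.lang != "English only", toy$state, mean)

# Attach the fraction to each row, looking it up by state name.
toy$hh.other.lang <- unname(frac_other[as.character(toy$state)])

# Reorder the levels by that fraction (increasing); reorder's default FUN is mean.
toy$state <- reorder(toy$state, toy$hh.other.lang)
levels(toy$state)   # "NY" "TX" "CA"
```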
Let’s start with the easiest graph (the filled barchart). We’ll just graph the ten (actually, eleven) states with the highest fractions of non-English-only households.
    # Plot the fraction of hholds who speak each language family.
    ggplot(subset(hdata, hdata$state %in% top10states)) +
      geom_bar(aes(x=state, fill=hh.lang), position="fill") +
      coord_flip()
As before, the states are ordered top-to-bottom from the highest fraction of non-English-only households to lowest, and the languages are ordered left-to-right by prevalence. It’s easy to compare the fraction of “English only” households and “Other Lang.” households across states. The remaining language families are harder to compare. On the other hand, you get a nice holistic view of language prevalence within each state.
Will a dotplot be any better? It will be a bit more work, because now I have to build the aggregations by hand (or at least, I haven’t figured out how to trick ggplot into doing it). Normally — because I’m lazy — I would make do with the barchart above, but let’s try building the dotplots. First we have to build the table of aggregates.
    # Create the table of aggregates.
    langtab = table(hdata$state, hdata$hh.lang)
    langtotals = rowSums(langtab)

    # Create a data table of the fraction of households in each
    # language family, by language family and state
    langnorm = as.data.frame((1/langtotals)*as.matrix(langtab))
    colnames(langnorm) = c("state", "hh.lang", "fraction")
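As an aside, base R’s `prop.table` does the row normalization in a single call. The sketch below uses a hypothetical stand-in for hdata (made-up rows) and checks that each state’s fractions sum to 1.

```r
# Hypothetical stand-in for hdata.
toy <- data.frame(
  state   = c("CA", "CA", "CA", "CA", "TX", "TX"),
  hh.lang = c("English only", "Spanish", "Spanish", "Other Lang.",
              "English only", "Spanish")
)
langtab <- table(toy$state, toy$hh.lang)

# margin = 1 divides each row (state) by its row total.
langnorm <- as.data.frame(prop.table(langtab, margin = 1))
colnames(langnorm) <- c("state", "hh.lang", "fraction")

# Sanity check: the fractions within each state should sum to 1.
state_totals <- tapply(langnorm$fraction, langnorm$state, sum)
```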
The state and hh.lang variables in langnorm have the same level orderings as in hdata, so we can go straight to plotting. Here’s a ribbon version that’s the most direct analogy to the filled barchart.
    # Dotplot version of the above fill plot.
    ggplot(subset(langnorm, langnorm$state %in% top10states)) +
      geom_point(aes(x=fraction, y=state)) +
      facet_grid(~hh.lang) +
      theme(
        panel.grid.major.x = element_blank(),
        panel.grid.major.y = element_line(linetype=3, color="darkgray")
      )
Now it’s easier to compare language families across states, but the view of languages within each state isn’t as obvious as it was with the filled barchart. You can reverse the emphasis with a faceted version of the dotplot.
    # The faceted dotplot version. You can substitute geom_bar for geom_point.
    # This graph reads right-to-left/bottom-to-top, but I don't
    # feel like flipping the levels anymore
    ggplot(subset(langnorm, langnorm$state %in% top10states)) +
      geom_point(aes(x=hh.lang, y=fraction), stat="identity") +
      facet_wrap(~state) +
      coord_flip() +
      theme(
        panel.grid.major.x = element_blank(),
        panel.grid.major.y = element_line(linetype=3, color="darkgray"),
        axis.text.y=element_text(size=rel(0.8))
      )
I like this one the best. You can make a barchart equivalent of it by substituting geom_bar for geom_point in the code snippet above.
In all the above graphs, we’ve played around with two of Cleveland’s concrete graphing principles:
- Dotplots are preferable to barcharts.
- Order categorical variables by important quantities, rather than alphabetically.
Point number 2 is almost always the right thing to do. Point number 1 is generally true, but, in my opinion, not essential. The main takeaway from this exercise is that it takes a few rounds of experimentation to find the graphs that give you what you need to know. Now you have a few more visualization tools to add to your arsenal.
Cleveland actually did formal experiments to test his theories. Chapter 4 of The Elements of Graphing Data discusses the principles of graphical perception. Nathan Yau at FlowingData gives a brief synopsis of Cleveland’s findings, along with a link to Cleveland and McGill’s 1984 paper describing their work. They don’t seem to have done much beyond what is described in that paper. I would have thought that someone would have followed up on their research in the thirty years since the paper, especially since color was not as important a graphical factor back in their day as it is now. But if anyone has followed up on their research, they haven’t done it under the rubric “graphical perception.”
The fact that a household speaks a language other than English does NOT imply that they don’t speak English at all. 79% of the non-English-only households in my sample have at least one family member over age fourteen who speaks English fluently or “very well.”
I don’t know for sure, because I didn’t track it down, but I’m guessing that the “Other languages” populations in Alaska, New Mexico, and Arizona are primarily households that speak local Native American languages.
Data scientist with Win Vector LLC. I also dance, read ghost stories and folklore, and sometimes blog about it all.
I am a new R user so I really appreciate the detailed write-up and the code. But, more than that I was stumbling around the US Census website just yesterday and it turns out your handy link to the PUMS data was just what I needed to answer a question for my boss. Many thanks!
“I would have thought that someone would have followed up on their research in the thirty years since the paper — especially since color was not as important a graphical factor back in their day as it is now. But if anyone has followed up on their research, they haven’t done it under the rubric “graphical perception.”
The study was replicated as a way to test using Mechanical Turk for studies,
There are a lot of good datasets on the Census site. The one Nina found is extra-awesome because it has everything in only a few files (millions of lines, hundreds of columns, about 1 billion cells). I thought I would share the exact path to the dataset Nina found (as it is a bit hard to navigate to):
PUMS Data set from:
select “2011 ACS 1-year PUMS”
select “2011 ACS 1-year Public Use Microdata Samples (PUMS) – CSV format”
download “United States Population Records” and
“United States Housing Unit Records”