I am going to write about an insidious statistical, data analysis, and presentation fallacy I call “the zero bug” and the habits you need to cultivate to avoid it.
The zero bug
Here is the zero bug in a nutshell: common data aggregation tools often can not “count to zero” from examples, and this causes problems. Please read on for what this means, the consequences, and how to avoid the problem.
For our example problem please consider aggregating significant earthquake events in the United States of America.
To do this we will start with data from:
The National Geophysical Data Center / World Data Service (NGDC/WDS): Significant Earthquake Database, National Geophysical Data Center, NOAA, doi:10.7289/V5TD9V7K.
They database is described thusly:
The Significant Earthquake Database contains information on destructive earthquakes from 2150 B.C. to the present that meet at least one of the following criteria: Moderate damage (approximately $1 million or more), 10 or more deaths, Magnitude 7.5 or greater, Modified Mercalli Intensity X or greater, or the earthquake generated a tsunami.
I queried the form for “North America and Hawaii”:”USA” in tab delimited form. For simplicity and reproducibility I saved a the result in the url given in the
R example below. Starting our example we can use
R to load the data from the url and start our project.
url <- 'http://www.win-vector.com/dfiles/earthquakes.tsv' d <- read.table(url, header=TRUE, stringsAsFactors = FALSE, sep='t') View(d) head(d[,c('I_D','YEAR','INTENSITY','STATE')]) # I_D YEAR INTENSITY STATE # 1 6697 1500 NA HI # 2 6013 1668 4 MA # 3 9954 1700 NA OR # 4 5828 1755 8 MA # 5 5926 1788 7 AK # 6 5927 1788 NA AK
We see the data is organizes such that each row is an event (with
I_D), and contains many informational columns including “YEAR” (the year the event happened) and “STATE” (the state abbreviation where the event happened). Using
R tools and packages we can immediately start to summarize and visualize the data.
For example: we can count modern (say 1950 and later) US earthquakes by year.
library("ggplot2") library("dplyr") dModern <- d[d$YEAR>=1950, , drop=FALSE] # histogram ggplot(data=dModern, mapping=aes(x=YEAR)) + geom_histogram(binwidth=1) + ggtitle('number of modern USA earthquakes by year')
Or we can get use
dplyr to build the count summaries by hand and present the summary in a stem-plot instead of the histogram.
# aggregate the data byYear <- dModern %>% group_by(YEAR) %>% summarize(count = n()) %>% arrange(YEAR) # produce a stem-style plot ggplot(data=byYear, mapping=aes(x=YEAR, y=count)) + geom_point(color='darkgreen') + geom_segment(mapping=aes(xend=YEAR, yend=0), color='darkgreen') + scale_y_continuous(limits=c(0, max(byYear$count))) + ggtitle('number of modern USA earthquakes by year')
We have already snuck in the mistake. The by-hand aggregation “
byYear” is subtly wrong. The histogram is almost correct (given its graphical convention of showing counts as height), but the stem plot is revealing problems.
Here is the problem:
summary(byYear$count) # Min. 1st Qu. Median Mean 3rd Qu. Max. # 1.00 1.00 2.00 2.50 3.75 7.00
Notice the above summary implies the minimum number of significant earthquakes seen in the United States in a modern year is
1. Looking at our graphs we can see it should in fact be
0. This wasn’t so bad for graphing, but can be disastrous in calculation or in directing action.
This is the kind of situation that
anti_join is designed to fix (and is how
replyr::replyr_coalesce deals with the problem).
A simple ad-hoc fix I recommend is to build a second synthetic summary frame that carries explicit zero counts. We then add these counts into our aggregation and get correct summaries.
For example the following code:
# add in zero summaries zeroObs <- data.frame(YEAR=1950:2016, count=0) byYear <- rbind(byYear, zeroObs) %>% group_by(YEAR) %>% summarize(count = sum(count)) %>% arrange(YEAR) # re-plot ggplot(data=byYear, mapping=aes(x=YEAR, y=count)) + geom_point(color='darkgreen') + geom_segment(mapping=aes(xend=YEAR, yend=0), color='darkgreen') + scale_y_continuous(limits=c(0, max(byYear$count))) + ggtitle('number of modern USA earthquakes by year')
gives us the corrected stem plot:
The figure above is the correct stem-plot. Remember: while a histogram denotes counts by filled-heights of bars, a stem-plot denotes counts by positions of visible points. The point being: in a proper stem-plot zero counts are not invisible (and are in fact distinguishable from missing summaries).
A harder example
This issue may seem trivial but that is partly because I deliberately picked a simple example where it is obvious there is missing data. This is not always the case. Remember: hidden errors can be worse than visible errors. In fact even in the original histogram it was not obvious what to think about the missing years 1950 (which apparently had no significant US earthquakes) and 2017 (which is an incomplete year). We really have to explicitly specify the complete universe (also called range or support set) of keys (as in
YEAR=1950:2016, and not use a convenience such as
This is much more obvious if we summarize by state instead of year.
byState <- dModern %>% group_by(STATE) %>% summarize(count = n()) %>% arrange(STATE) ggplot(data=byState, mapping=aes(x=STATE, y=count)) + geom_point(color='darkgreen') + geom_segment(mapping=aes(xend=STATE, yend=0), color='darkgreen') + scale_y_continuous(limits=c(0, max(byState$count))) + ggtitle('number of modern USA earthquakes by state')
We do not have a proper aggregation where each state count is represented with an explicit zero, and not by implicit missingness (which can’t differentiate states with zero-counts from non-states or misspellings). To produce the proper summary we need a list of US state abbreviations to merge in.
A user of the above graph isn’t likely to be too confused. Such a person likely knows there are 50 states and will presume the missing states have zero counts. However downstream software (which can be punishingly literal) won’t inject such domain knowledge and will work on the assumption that there are 17 states and all states have had significant earthquakes in modern times. Or suppose we are aggregating number treatments given to a set of patients; in this case the unacceptable confusion of not-present and zero-counts can really hide huge, problems: such as not noticing some patients never received treatment.
R actually has list of state abbreviations (in “
datasets::state.abb“) and we can use
replyr::replyr_coalesce to quickly fix our issue:
library('replyr') support <- data.frame(STATE= c('', datasets::state.abb), stringsAsFactors = FALSE) byState <- byState %>% replyr_coalesce(support, fills= list(count= 0)) %>% arrange(STATE)
An important feature of
replyr_coalesce is: it checks that the count-carrying rows (the data) are contained in the support rows (the range definition). It would (intentionally) throw an error if we tried to use just
datasets::state.abb as the support (or range) definition as this signals the analyst didn’t expect the blank state in the data set.
“The Dog that Didn’t Bark”
An additional illustrative example is from Sir Arthur Conan Doyle’s story “The Adventure of Silver Blaze”.
In this story Sherlock Holmes deduces that a horse had been absconded not by a stranger, but by somebody well known at its stable. Holmes explains the key clue here:
Gregory (Scotland Yard): "Is there any other point to which you would wish to draw my attention?" Holmes: "To the curious incident of the dog in the night-time." Gregory: "The dog did nothing in the night-time." Holmes: "That was the curious incident."
For this type of reasoning to work: Holmes has to know that there was a dog present, the dog would have barked if a stranger approached the stables, and none of his witnesses reported hearing the dog. Holmes needs an affirmative record that no barks were heard: a zero written in a row, not a lack of rows.
When describing “common pitfalls” the reaction is often: “That isn’t really interesting, as I would never make such an error.” I believe “the zero bug” is in fact common and not noticed as it tends to hide. The bug is often there and invisible, but can produce invalid results.
Any summary that is produced by counting un-weighted rows can never explicitly produce a value of zero (it can only hint at such through missingness).
n() can never form zero, so if zero is important it must be explicitly joined in after counting. In a sense organizing rows for counting with
n() censors out zeros.
I can’t emphasize enough how important it is to only work with explicit representations (records indicating counts, even if they are zero), and not implicit representations (assuming non-matched keys indicate zero-counts). The advantages of explicit representations are one of the reasons
R has a notation for missing values (“
NA“) in the first place.
This is, of course, not a problem due to use of
dplyr. This is a common problem in designing
SQL database queries where you can think of it as “inner-join collapse” (failing to notice rows that disappear due to join conditions).
My advice: you should be suspicious of any use of
summarize(VALUE=n()) until you find the matching zero correction (
replyr_coalesce, or a comment documenting why no such correction is needed). This looking from
n() to a matching range-control should be become as habitual as looking from an opening parenthesis to its matching closing parenthesis. Even in data science, you really have to know you have all the little things right before you rely on the large things.
Edit 3-12-2020, we have mostly retired the replyr package and coalesce has moved into the wrapr package here.
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.