My criticism of R's numeric summary() method is: it is unfaithful to numeric arguments (due to bad default behavior), and frankly it should be considered unreliable. It is likely the way it is for historic and compatibility reasons, but in my opinion it does not currently represent a desirable set of trade-offs.
summary() likely represents good work by high-ability researchers, and the sharp edges are due to historically necessary trade-offs.
[Image: The Big Lebowski, 1998.]
Please read on for some context and my criticism.
Edit 8/25/2016: Martin Maechler generously committed a fix! Assuming this works out in testing, it looks like we could see an improvement to this core function in April 2017. I really want to say "thank you" to Martin Maechler and the rest of the team, not only for this, but for all the things they do, and for putting up with me.
My group has been doing a lot more professional training lately. This is interesting because bright students really put a lot of interesting demands on how you organize and communicate. They want things that make sense (so they can learn them), that are powerful (so it is worth learning them), and that are regular (so they can compose them and move beyond what you are teaching). Students are less sympathetic to implementation history and unstated conventions, as new users tend not to benefit from them. Remember: a new R student is still deciding whether they want to use R. To them it is new, so an instructor needs to defend R's current trade-offs (not its evolutionary path). We find it is best to point out both what is great in R and what isn't great (versus skipping such, or worse, trying to justify such portions).
Please keep this in mind when I demonstrate what goes wrong when one attempts to teach R’s
summary() function to the laity.
Suppose you had a list or vector of numbers in R. It would be useful to be able to produce and view some summaries or statistics about these numbers. The primary way to do this in R is to call the
summary() method. Here is an example below:
```r
numbers <- 1:7
print(numbers)
## [1] 1 2 3 4 5 6 7
summary(numbers)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     1.0     2.5     4.0     4.0     5.5     7.0
```
From the names attached to the results you can get the meanings and move on. But the whole time you are hoping none of your students calls summary() on a single number. Because if they do, they have a very good chance of seeing summary() fail. And now you have broken trust in R.
Let’s tack into the wind and demonstrate the failure:
```r
summary(15555)
##  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## 15560   15560   15560   15560   15560   15560
```
summary() is claiming the minimum value of the set of numbers is 15560. Now this is a deliberately trivial example where we can see what is going on (it sure looks like presentation rounding). To make matters worse, this isn't just confusion generated during presentation: the actual values are wrong.
```r
str(summary(15555))
## Classes 'summaryDefault', 'table' Named num [1:6] 15560 15560 15560 15560 15560 ...
##  ..- attr(*, "names")= chr [1:6] "Min." "1st Qu." "Median" "Mean" ...
summary(15555)[['Min.']] == min(15555)
## [1] FALSE
```
It may seem silly to expect the slots from a summary() call on a vector to be used in calculation (when we have direct functions such as mean() for getting the same results), but using values from summaries of models is standard practice in R. The trivial linear model summary summary(lm(y~0, data.frame(y=15555))) shows rounded results (though it appears to hold accurate results, and only rounds during presentation; use unclass() to inspect the actual values).
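To see the contrast, here is a minimal sketch of inspecting that model summary (using only base R):

```r
# A trivial intercept-free linear model on a single value
fit <- lm(y ~ 0, data = data.frame(y = 15555))
summary(fit)           # the printed display rounds for neatness
unclass(summary(fit))  # the underlying list still holds unrounded values
```

Here the rounding is confined to printing; the stored object remains faithful, which is exactly what the numeric summary() fails to do.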
Why it Matters
This is in fact a problem. You can say this is a consequence of the “default settings of
summary()” and it is my fault for not changing those settings. But frankly it is quite fair to expect the default settings to be safe and sane.
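For example, explicitly widening the digits argument sidesteps the damage, but this is exactly the kind of incantation a safe default should not require (a sketch; 12 is an arbitrary generous value):

```r
# Under the defaults the stored minimum is already rounded:
summary(15555)[['Min.']] == min(15555)               # FALSE
# With digits widened explicitly, the stored value is exact:
summary(15555, digits = 12)[['Min.']] == min(15555)  # TRUE
```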
Let us also appeal to authority:
The many computational steps between original data source and displayed results must all be truthful, or the effect of the analysis may be worthless, if not pernicious. This places an obligation on all creators of software to program in such a way that the computations can be understood and trusted. This obligation I label the Prime Directive.
(John Chambers, Software for Data Analysis: Programming with R, Springer 2008.)
The point is you are delegating work to your system. If it needlessly fails (no matter how trivially) when observed, how can you trust it when unobserved? John Chambers’ point is that trust is very expensive to build up, so you really don’t want to squander it.
I used to try to “lecture this away” as just being “rounding in the presentation for neatness.” But this runs into two objections:
- Why doesn't the presentation hint at the rounding, for example by switching to scientific notation?
- If the result of summary() "is just presentation," wouldn't it be a string?
We are losing substitutability. We would love to be able to say to students that “
summary() is a convenient shorthand and you can treat the following as equivalent”:
```r
summary(x)[['Min.']]    == min(x)
summary(x)[['1st Qu.']] == quantile(x, 0.25)
summary(x)[['Median']]  == median(x)
summary(x)[['Mean']]    == mean(x)
summary(x)[['3rd Qu.']] == quantile(x, 0.75)
summary(x)[['Max.']]    == max(x)
```
But the above isn't always the case. What we would like is for summary() to contain these exact values and to get pretty printing by using the S3 or S4 object system to override the print() method. It is quite likely summary() predates these object systems, and so achieved pretty printing through rounding of values.
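The desired design can be sketched in a few lines of S3 code ("mySummary" is a hypothetical class name of my own, not part of base R):

```r
# Store exact values; round only in the print method.
my_summary <- function(x) {
  qq <- stats::quantile(x)
  res <- c(qq[1:3], mean(x), qq[4:5])
  names(res) <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")
  class(res) <- "mySummary"
  res
}

# S3 print method: rounding happens here, and only here.
print.mySummary <- function(x, digits = max(3, getOption("digits") - 3), ...) {
  print(signif(unclass(x), digits))
  invisible(x)
}

s <- my_summary(15555)
unclass(s)[['Min.']] == min(15555)  # TRUE: the stored value is exact
```

With this design the display is just as neat, but substitutability is preserved.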
What is going on?
We can take a look at the actual code and see what is happening. We are looking for a reason, not an excuse.
Consulting help(summary) we see summary() takes a digits option with default value digits = max(3, getOption("digits") - 3) (let's not even get into why setting digits directly does one thing while the default is shifted by 3). getOption("digits") is 7 on my machine, so we are asking for four-significant-digit rounding, which is consistent with what we saw. Digging through the dispatch rules we can eventually determine that for a numeric vector
summary() eventually calls
summary.default(). By calling
print(summary.default) we can look at the code. The offending snippet is:
```r
qq <- stats::quantile(object)
qq <- signif(c(qq[1L:3L], mean(object), qq[4L:5L]), digits)
```
After computing the quantiles, summary() then calls signif() to round the results.
R isn’t inaccurate, it just went out of its way to round the results.
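We can reproduce the effect by hand (a sketch, assuming the stock digits option of 7):

```r
# The default summary() rounding, applied directly:
digits <- max(3, getOption("digits") - 3)  # 4 under the stock options
signif(15555, digits)                      # rounds to 15560
```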
Why is this whiny rant so long?
One reason this article is long is the behavior we are describing breaks expectations. So we end up having to document what is actually going on (a laborious process) instead of being able to rely on shared educated expectations. The whining is where actualities and expectations diverge.
summary() attempts to achieve neatness and legibility. This is a laudable goal, if achievable. Numeric analysis is not so simple that rounding could safely achieve such a goal.
It is well known that rounding is not a safe or faithful operation (it loses information, and can be catastrophic if naively applied in many stages of a complex calculation). Because it is obvious rounding is dangerous, sophisticated students are surprised that it defaults to “on” in common calculations without indication or warning (such as moving to scientific notation).
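A small illustration of how staged rounding destroys information (values chosen purely for demonstration):

```r
x <- c(10000.4, -10000.0)
sum(x)             # about 0.4
sum(signif(x, 4))  # 0: the fractional part was rounded away before the sum
```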
summary() compounds this error by returning rounded values (instead of rounding only at presentation time). Because summary() is often a first view of data (along with print()), we encounter confusing inconsistent situations where un-rounded values (presentation of original data) and rounded values are compared.
Of course, we can (and should) teach students to call
quantile(x) rather than
summary(x) when they want to reuse the summary statistics. But then we have to explain why. After seeing something like this it becomes an unfortunate additional teaching goal to convince students that more of
R doesn't behave like this.
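A safer pattern to teach is computing the statistics directly (a sketch; the function name is mine):

```r
# Compute the same six statistics with no rounding anywhere.
exact_summary <- function(x) {
  c(Min = min(x),
    Q1 = quantile(x, 0.25, names = FALSE),
    Median = median(x),
    Mean = mean(x),
    Q3 = quantile(x, 0.75, names = FALSE),
    Max = max(x))
}

exact_summary(15555)[['Min']] == min(15555)  # TRUE: values are exact
```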
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.