Here is a really nice feature found in the current 3.4.0 version of R: summary() has become a lot more reasonable.
summary(15555)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   15555   15555   15555   15555   15555   15555
Please read on for some background.
In older versions of R (say R 3.3.1) the above code gave the following undesirable result:
summary(15555)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   15560   15560   15560   15560   15560   15560
This was always very confusing and hard to explain to beginners. To justify this you had to explain that “R, by default, calculates the summary rounded to 4 significant digits, and is simultaneously configured to give absolutely no indication as to how many significant digits are in fact being displayed.” To add insult to injury, summary() picked a different number of sigfigs than the default numeric presentation. One could type “median(15555)” and get the expected presentation “15555”.
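The mismatch can be imitated in any version of R with signif(). This is only a sketch: it assumes the old rounding was to 4 significant digits (as described above) and is not the actual summary() source.

```r
x <- 15555

# Assumption: the old summary() display behaved like rounding to
# 4 significant digits, as described in the text above.
rounded <- signif(x, digits = 4)
print(rounded)    # [1] 15560

# The default numeric presentation uses getOption("digits"),
# which is 7 unless changed, so the same value prints in full.
print(median(x))  # [1] 15555
```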
Frankly, people do not expect significant digits to be 4 when viewing what appears to be an integer presented directly from software. They either expect display significance to be much lower, such as “Earth has about 7,500,000,000 people” (2 sigfig), or higher, as in “Daniel Burnham’s New York Flatiron Building has zip code 10010” (5 sigfig, and not the same as 10012). In my opinion it is a bit of a crime to aggressively round numbers in an analysis (not presentation) system prior to moving into scientific notation (which can, in principle, signal the number of significant figures through the use of trailing zeros).
I take “1.556e+4” as an acceptable textual approximation of 15555 and “15560” as unacceptable.
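For illustration, base R's formatC() can produce this kind of scientific-notation string with an explicit number of digits. The variable names below are just for the example. Note that R writes the exponent with two digits (“e+04”).

```r
x <- 15555

# format = "e" asks for scientific notation; digits = 3 means three digits
# after the decimal point, i.e. 4 significant digits in total.
sci4 <- formatC(x, format = "e", digits = 3)
sci4  # "1.556e+04"

# Trailing zeros can signal precision: the rounded value 15560 shown
# with 5 significant digits makes the final zero explicit.
sci5 <- formatC(15560, format = "e", digits = 4)
sci5  # "1.5560e+04"
```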
To make matters much worse, at the time R was storing rounded numbers in the summary! It wasn’t storing the presentation string “15560” but the floating point or numeric value 15560.0. This very much confused representation and presentation, and made pulling the median off a summary needlessly different from calling median().
Now, thanks to Martin Maechler and the R core team, summary() stores much more reasonable numbers and separates representation from presentation:
summary(1555555555)
#      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
# 1.556e+09 1.556e+09 1.556e+09 1.556e+09 1.556e+09 1.556e+09

format(summary(1555555555), digits=12)
#         Min.      1st Qu.       Median         Mean      3rd Qu.         Max.
# "1555555555" "1555555555" "1555555555" "1555555555" "1555555555" "1555555555"
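A quick way to check the separation yourself is to pull a value off the summary object and compare it with the direct calculation. This sketch assumes R 3.4.0 or later. On older versions the comparison would be expected to fail, since the rounded value was stored.

```r
x <- 1555555555
s <- summary(x)

# Representation: the stored median is the full, unrounded number.
stored_median <- s[["Median"]]
same_as_median <- (stored_median == median(x))
same_as_median  # TRUE in R >= 3.4.0

# Presentation: rounding is applied only when the summary is formatted
# or printed, and the digits argument controls how much is shown.
shown <- format(s, digits = 12)
shown[["Median"]]  # "1555555555"
```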
One of the motivations for the fix (which obviously will change some results) was [loc. cit.]:
The benefit for maintainers and old timers like me will be that we will not need to answer this (non-official) FAQ nor excuse a peculiar behavior in the future …..
The idea is: it is simpler to fix things than to forever explain/defend peculiar behavior. At some point software must adapt to its domain and users, and not always expect the users to retain an arbitrary number of distinctions and caveats.
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.