Here is a really nice feature found in the current 3.4.0 version of R: *summary()* has become a *lot* more reasonable.

    summary(15555)
    #    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    #   15555   15555   15555   15555   15555   15555

Please read on for some background.

In older versions of R (say R 3.3.1) the above code gave the following undesirable result:

    summary(15555)
    #    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    #   15560   15560   15560   15560   15560   15560

This was always very confusing and hard to explain to beginners. To justify this you had to explain that “R, by default, calculates the summary rounded to 4 significant digits, and is simultaneously configured to give absolutely no indication as to how many significant digits are in fact being displayed.” To add insult to injury, `summary()` picked a different number of sigfigs than the default numeric presentation. One could type `median(15555)` and get the expected presentation “`15555`”.
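The rounding described above can be reproduced by hand. The following is a minimal sketch, assuming the old `summary()` behaved like a call to `signif()` with 4 digits; the names `x`, `rounded`, and `shown` are purely illustrative. It contrasts changing a value with merely choosing how many digits to display.

```r
x <- 15555

# Rounding the value itself to 4 significant digits changes the number.
rounded <- signif(x, digits = 4)
print(rounded)  # [1] 15560

# Formatting with R's default of 7 significant digits leaves it intact.
shown <- format(x, digits = 7)
print(shown)    # [1] "15555"
```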

Frankly, people do not expect significant digits to be 4 when viewing what appears to be an integer presented directly from software. They either expect display significance to be much lower, such as “Earth has about `7,500,000,000` people” (2 sigfig), or higher, as in “Daniel Burnham’s New York Flatiron Building has zip code `10010`” (5 sigfig, and not the same as `10012`). In my opinion it is a bit of a crime to aggressively round numbers in an analysis (not presentation) system prior to moving into scientific notation (which can, in principle, signal the number of significant figures through the use of trailing zeros).

I take “`1.556e+4`” as an acceptable textual approximation of `15555` and “`15560`” as unacceptable.
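For comparison, base R's `format()` can produce the scientific form directly. This is only a sketch of the distinction, with illustrative variable names; note that R writes the exponent with two digits (`e+04`).

```r
x <- 15555

# Scientific notation at 4 significant digits: the rounding is visible in the text.
sci_text <- format(x, digits = 4, scientific = TRUE)
print(sci_text)    # [1] "1.556e+04"

# Fixed notation of the rounded value looks like an exact integer.
fixed_text <- format(signif(x, digits = 4))
print(fixed_text)  # [1] "15560"
```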

To make matters much worse, at the time R was storing rounded numbers in the summary! It wasn’t storing the presentation string “`15560`” but the floating point or numeric value `15560.0`. This very much confused representation and presentation and made pulling the median off a summary needlessly different than calling `median()`.

Now, thanks to Martin Maechler and the R core team, `summary()` stores much more reasonable numbers *and* separates representation from presentation:

    summary(1555555555)
    #      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
    # 1.556e+09 1.556e+09 1.556e+09 1.556e+09 1.556e+09 1.556e+09
    format(summary(1555555555), digits=12)
    #         Min.      1st Qu.       Median         Mean      3rd Qu.         Max.
    # "1555555555" "1555555555" "1555555555" "1555555555" "1555555555" "1555555555"
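One practical consequence, sketched here under the assumption of R 3.4.0 or later: because the summary keeps unrounded values, an entry extracted from it should agree with the corresponding direct call. The variable names are illustrative only.

```r
x <- 1555555555

s <- summary(x)

# Extract the stored median; in R >= 3.4.0 this is the unrounded value.
stored_median <- s[["Median"]]

# Compare against the direct computation.
agrees <- stored_median == median(x)
print(agrees)  # [1] TRUE in R >= 3.4.0
```

In older versions the same comparison would be expected to fail for this input, since the stored value had already been rounded to 4 significant digits.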

One of the motivations for the fix (which obviously will change some results) was [loc. cit.]:

The benefit for maintainers and old timers like me will be that we will not need to answer this (non-official) FAQ nor excuse a peculiar behavior in the future …..

The idea is: it is simpler to fix things than to forever explain or defend peculiar behavior. At some point software must adapt to its domain and users, and not always expect the users to retain an arbitrary number of distinctions and caveats.


### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.