When we teach “`R`

for statistics” to groups of scientists (who tend to be quite well informed in statistics, and just need a bit of help with R) we take the time to re-work some tests of model quality with the appropriate significance tests. We organize the lesson in terms of a larger and more detailed version of the following list:

- To test the quality of a numeric model to numeric outcome: F-test (as in linear regression).
- To test the quality of a numeric model to a categorical outcome: χ
^{2}or “Chi-squared” test (as in logistic regression). - To test the association of a categorical predictor to a categorical outcome: many tests including Fisher’s exact test and Barnard’s test.
- To test the quality of a categorical predictor to a numeric outcome: t-Test, ANOVA, and Tukey’s “honest significant difference” test.

The above tests are all in terms of checking model results, so we don’t allow re-scaling of the predictor as part of the test (as we would have in a Pearson correlation test, or an area under the curve test). There are, of course, many alternatives such as Wald’s test- but we try to start with a set of tests that are standard, well known, and well reported by `R`

. An odd exception has always been the χ^{2} test, which we will write a bit about in this note.

The χ^{2} test is a very useful statistical test. In particular, under fairly mild assumptions, it is a usable probability model for the quality of fit of a logistic regression. It is based on a summary statistics called “deviance” (an odd name, but the quantity is strongly related to likelihood and to entropy). And, after a simple transform, it yields a quantity called “pseudo-R^{2}” (see The Simpler Derivation of Logistic Regression) that reads like “fraction of variation explained.” It is great final test for well-tuned models designed to estimate probabilities (just as area under the curve is a good early test as it abstracts out scale and choice of decision threshold).

Yet the χ^{2} test is under-emphasized and under-implemented in R. Consider the following trivial logistic regression model.

```
```d <- data.frame(x=c(1,2,3,4,5,6,7,7),
y=c(TRUE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,FALSE))
model <- glm(y~x,data=d,family=binomial)
summary(model)
## Call:
## glm(formula = y ~ x, family = binomial, data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.37180 -1.09714 -0.00811 1.08024 1.42939
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.7455 1.6672 -0.447 0.655
## x 0.1702 0.3429 0.496 0.620
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11.090 on 7 degrees of freedom
## Residual deviance: 10.837 on 6 degrees of freedom
## AIC: 14.837
##
## Number of Fisher Scoring iterations: 4

Notice that `R`

reported the change in deviance (the measure of model quality, also note the AIC or Akaike information criteria is present), but no significance on the overall model fit is reported (n.b. this is different from coefficient significances). The significance is probably not there because `glm()`

is a fully general generalized linear model fitter (it can fit much more than just logistic regressions) and the error model likely changes as you change the model type (controlled by the `family`

and `link`

controls).

But logistic regression really deserves to be front and center. This is why in Section 7.2 of *Practical Data Science with R*, Nina Zumel, John Mount, Manning 2014 we take the time to define and show how to calculate the pseudo-R^{2} and the significance for the reported deviance.

As we observed in “Proofing statistics in papers” having standard tests and standard reporting of tests is a great advantage. In this spirit we add an “APA-like” report of the χ^{2} significance in our `sigr`

package. The use is quick and decisive:

```
```library('sigr')
formatChiSqTest(model,pLargeCutoff=1,format='html')$formatStr

**Chi-Square Test** summary: *pseudo- R^{2}*=0.023 (

*χ*(1,

^{2}*N*=8)=0.25,

*p*=0.615).

The `sigr`

package isn’t up on CRAN, but can be installed using `devtools::install_github('WinVector/sigr')`

. It includes documentation, and an example vignette. `sigr`

can format directly to HTML, Latex, Markdown, and Word. It designed to be used with a `knitr`

workflow, and when used with `knit`

will automatically select the correct target formatting. Right now it generates reports for only linear and logistic regressions; we will likely fill out with a few more “most ofter used, so should have a nice neat format” tests going forward.

Categories: data science Practical Data Science Pragmatic Data Science Pragmatic Machine Learning Statistics Tutorials

### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

Hi! Nice stuff. I’ve been doing the same things within my apastats package: https://github.com/ralfer/apa_format_and_misc. Here are some example outputs: https://github.com/ralfer/apa_format_and_misc/blob/master/example/example.md and also there are examples in the help files (but not to all functions).

For the GLM model, it’ll give you different formatting depending on the style you want:

> describe.glm(model, ‘x’, 1)

[1] “_Z_ = 0.50, _p_ = .620”

> describe.glm(model, ‘x’, 2)

[1] “_B_ = 0.17 (0.34), _p_ = .620”

> describe.glm(model, ‘x’, 3)

[1] “_B_ = 0.17, _SE_ = 0.34, _Z_ = 0.50, _p_ = .620”

I’m usually not that interested in the overall (pseudo)R2, so I didn’t bother including it. There are also options for latex/markup formatting and a number of misc functions.

That looks really nice much more comprehensive (and standardized) than our project, thanks for sharing it.