# sigr: Simple Significance Reporting

`sigr` is a simple `R` package that conveniently formats a few statistics and their significance tests. This allows the analyst to use the correct test no matter what modeling package or procedure they use. ## Model Example

Let’s take as our example the following linear relation between `x` and `y`:

``````library('sigr')
set.seed(353525)
d <- data.frame(x= rnorm(5))
d\$y <- 2*d\$x + 0.5*rnorm(nrow(d))``````

`stats::lm()` has among the most complete summaries of all models in `R`, so we easily get can see the quality of fit and its significance:

``````model <- lm(y~x, d=d)
summary(model)``````
``````##
## Call:
## lm(formula = y ~ x, data = d)
##
## Residuals:
##       1       2       3       4       5
##  0.3292  0.3421  0.0344 -0.1671 -0.5387
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.4201     0.2933   1.432  0.24747
## x             2.1117     0.2996   7.048  0.00587 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4261 on 3 degrees of freedom
## Multiple R-squared:  0.943,  Adjusted R-squared:  0.9241
## F-statistic: 49.67 on 1 and 3 DF,  p-value: 0.005871``````

`sigr::wrapFTest()` can render the relevant model quality summary.

``````cat(render(wrapFTest(model),
pLargeCutoff=1))``````

F Test summary: (R2=0.94, F(1,3)=50, p=0.0059).

Note: `sigr` reports the un-adjusted R-squared, as this is the one that significance is reported for in `summary::lm()`.

`sigr` also carries around the important summary components for use in code.

``unclass(wrapFTest(model))``
``````## \$test
##  "F Test"
##
## \$numdf
##  1
##
## \$dendf
##  3
##
## \$FValue
##  49.66886
##
## \$R2
##  0.9430403
##
## \$pValue
##  0.005871236``````

In this function `sigr` is much like `broom::glance()` or `modelr::rsquare()`.

``broom::glance(model)``
``````##   r.squared adj.r.squared     sigma statistic     p.value df    logLik
## 1 0.9430403     0.9240538 0.4261232  49.66886 0.005871236  2 -1.552495
##       AIC      BIC deviance df.residual
## 1 9.10499 7.933304 0.544743           3``````
``modelr::rsquare(model, d)``
``##  0.9430403``

This is something like what is reported by `caret::train()` (where `caret` controls the model training process).

``````cr <- caret::train(y~x, d,
method = 'lm')``````
``````## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.``````
``cr\$results``
``````##   intercept      RMSE  Rsquared    RMSESD   RsquaredSD
## 1      TRUE 0.9631662 0.9998162 0.6239989 0.0007577834``````
``summary(cr\$finalModel)``
``````##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
##      X1      X2      X3      X4      X5
##  0.3292  0.3421  0.0344 -0.1671 -0.5387
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.4201     0.2933   1.432  0.24747
## x             2.1117     0.2996   7.048  0.00587 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4261 on 3 degrees of freedom
## Multiple R-squared:  0.943,  Adjusted R-squared:  0.9241
## F-statistic: 49.67 on 1 and 3 DF,  p-value: 0.005871``````

(I presume `cr\$results\$Rsquared` is a model quality report and not a consistency of cross-validation procedure report. If it is a model quality report it is somehow inflated, perhaps by the resampling procedure. So I apologize for using either `caret::train()` or its results incorrectly.)

## Data example

The issues in taking summary statistics (and significances) from models include:

• Working only from models limits you to models that include the statistic you want.
• Working only from models is mostly "in-sample." You need additional procedures for test or hold-out data.

With `sigr` it is also easy to reconstruct quality and significance from the predictions, no matter where they came from (without needing the model data structures).

### In-sample example

``d\$pred <- predict(model, newdata = d)``
``````cat(render(wrapFTest(d, 'pred', 'y'),
pLargeCutoff=1))``````

F Test summary: (R2=0.94, F(1,3)=50, p=0.0059).

Notice we reconstruct the summary statistic and significance, independent of the model data structures. This means the test is generic and can be used on any regression (modulo informing the significance model of the appropriate number of parameters). It also can be used on held-out or test data.

In this mode it is a lot like `ModelMetrics::rmse()` or `caret::postResample()`.

``ModelMetrics::rmse(d\$y, d\$pred)``
``##  0.3300736``
``caret::postResample(d\$y, d\$pred)``
``````##      RMSE  Rsquared
## 0.3300736 0.9430403``````

### Out of sample example

If we had more data we can look at the quality of the prediction on that data (though you have to take care in understanding the number of degrees of freedom is different for held-out data).

``````d2 <- data.frame(x= rnorm(5))
d2\$y <- 2*d2\$x + 0.5*rnorm(nrow(d2))
d2\$pred <-  predict(model, newdata = d2)``````
``````cat(render(wrapFTest(d2, 'pred', 'y'),
pLargeCutoff=1))``````

F Test summary: (R2=0.76, F(1,3)=9.6, p=0.054).

### Exotic model example

We could have used `glmnet` instead of `lm`:

``library("glmnet")``
``````d\$one <- 1 # make sure we have at least 2 columns
xmat <- as.matrix(d[, c('one','x')])
glmnetModel <- cv.glmnet(xmat, d\$y)``````
``````## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations
## per fold``````
``summary(glmnetModel)``
``````##            Length Class  Mode
## lambda     53     -none- numeric
## cvm        53     -none- numeric
## cvsd       53     -none- numeric
## cvup       53     -none- numeric
## cvlo       53     -none- numeric
## nzero      53     -none- numeric
## name        1     -none- character
## glmnet.fit 12     elnet  list
## lambda.min  1     -none- numeric
## lambda.1se  1     -none- numeric``````
``````d\$predg <- as.numeric(predict(glmnetModel,
newx= xmat))``````
``````cat(render(wrapFTest(d, 'predg', 'y'),
pLargeCutoff=1))``````

F Test summary: (R2=0.91, F(1,3)=30, p=0.012).

## Plotting

Because `sigr` can render to "`LaTex`" it can (when used in conjunction with `latex2exp`) also produce formatted titles for plots.

``````library("ggplot2")
library("latex2exp")

f <- paste0(format(model\$coefficients['x'], digits= 3),
'*x + ',
format(model\$coefficients['(Intercept)'], digits= 3))
title <- paste0("linear y ~ ", f, " relation")
subtitle <- latex2exp::TeX(render(wrapFTest(d, 'pred', 'y'),
format= 'latex'))
ggplot(data=d, mapping=aes(x=pred, y=y)) +
geom_point() + geom_abline(color='blue') +
xlab(f) +
ggtitle(title,
subtitle= subtitle)`````` ## Conclusion

`sigr` computes a few statistics or summaries (with standard significance estimates) and returns the calculation in both machine readable and formatted forms. For non-standard summaries `sigr` includes some resampling methods for significance estimation. For formatting `sigr` tries to get near the formats specified by "The American Psychological Association." `sigr` works with models (such as `lm` and `glm(family='binomial')`) or data, and is a small package with few dependencies.

Categories: Programming Statistics

Tagged as: ### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.