## Talking about clinical significance

In statistical work in the age of big data we often get hung up on differences that are statistically significant (reliable enough to show up again and again in repeated measurements), but clinically insignificant (visible in aggregation, but too small to make any real difference to individuals). An example would […]
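A minimal sketch of the distinction, using simulated data: the blood-pressure-style numbers, the group sizes, and the helper `two_sample_z_test` are all hypothetical choices for illustration and do not come from the original post. With a large enough sample, a half-unit difference in means yields a tiny p-value even though it is small relative to the individual-level spread of 15 units.

```python
import math
import random

def two_sample_z_test(a, b):
    """Return (difference in means, two-sided p-value), normal approximation."""
    n_a, n_b = len(a), len(b)
    mean_a = sum(a) / n_a
    mean_b = sum(b) / n_b
    var_a = sum((x - mean_a) ** 2 for x in a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n_b - 1)
    z = (mean_a - mean_b) / math.sqrt(var_a / n_a + var_b / n_b)
    # Two-sided tail probability of a standard normal.
    p = math.erfc(abs(z) / math.sqrt(2))
    return mean_a - mean_b, p

random.seed(2024)
n = 100_000  # hypothetical "big data" group size
control = [random.gauss(120.0, 15.0) for _ in range(n)]
treated = [random.gauss(119.5, 15.0) for _ in range(n)]  # true shift: 0.5 units

diff, p_value = two_sample_z_test(treated, control)
print(f"difference in means: {diff:.2f}")
print(f"p-value: {p_value:.3g}")
print(f"difference as a fraction of one standard deviation: {abs(diff) / 15.0:.3f}")
```

The p-value only says the difference is reliably detectable; whether half a unit matters to any individual is a separate, domain-specific question.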

## Why No Exact Permutation Tests at Scale?

Here at Win-Vector LLC we like permutation tests. Our team has written on them (for example: How Do You Know if Your Data Has Signal?) and they are used to estimate significances in our sigr and WVPlots R packages. For example, permutation methods are used to estimate the significance reported […]
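For readers unfamiliar with the idea, here is a rough sketch of a permutation test for a difference in means. It is a Monte Carlo approximation that samples random relabelings rather than enumerating all of them, and it is not the implementation used in sigr or WVPlots. The function name `permutation_p_value`, its parameters, and the simulated data are assumptions made for illustration.

```python
import random

def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value for the difference in means of a and b."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    hits = 0
    for _ in range(n_permutations):
        # Shuffling the pooled data simulates "group labels carry no information".
        rng.shuffle(pooled)
        perm_a = pooled[:n_a]
        perm_b = pooled[n_a:]
        stat = abs(sum(perm_a) / n_a - sum(perm_b) / len(perm_b))
        if stat >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

data_rng = random.Random(1)
group_a = [data_rng.gauss(0.0, 1.0) for _ in range(50)]
group_b = [data_rng.gauss(1.0, 1.0) for _ in range(50)]  # shifted by one unit
group_c = [data_rng.gauss(0.0, 1.0) for _ in range(50)]  # same distribution as group_a

p_shifted = permutation_p_value(group_a, group_b)
p_same = permutation_p_value(group_a, group_c)
print(f"shifted groups: p ~ {p_shifted:.4f}")
print(f"same-distribution groups: p ~ {p_same:.4f}")
```

Sampling permutations keeps the cost proportional to `n_permutations` times the data size, at the price of a p-value that is itself an estimate.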

## Remember: p-values Are Not Effect Sizes

Authors: John Mount and Nina Zumel. The p-value is a valid frequentist statistical concept that is much abused and misused in practice. In this article I would like to call out a few features of p-values that can cause problems in evaluating summaries. Keep in mind: p-values are useful and […]

## sigr: Simple Significance Reporting

sigr is a simple R package that conveniently formats a few statistics and their significance tests. This allows the analyst to use the correct test no matter what modeling package or procedure they use.

## Be careful evaluating model predictions

One thing I teach is: when evaluating the performance of regression models you should not use correlation as your score. This is because correlation tells you if a re-scaling of your result is useful, but you want to know if the result in your hand is in fact useful. For […]
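A small sketch of the point, with made-up numbers: the helpers `pearson` and `rmse` and all data values below are hypothetical. A prediction that is badly mis-scaled has exactly the same correlation with the outcome as a good prediction, while an error measure such as root mean squared error exposes the problem.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def rmse(pred, actual):
    """Root mean squared error of predictions against actual values."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))

actual = [1.0, 2.0, 3.0, 4.0, 5.0]
good_pred = [1.1, 1.9, 3.2, 3.8, 5.1]
# Same predictions after an arbitrary affine re-scaling: useless as-is.
rescaled_pred = [10.0 * p + 50.0 for p in good_pred]

cor_good = pearson(good_pred, actual)
cor_rescaled = pearson(rescaled_pred, actual)
rmse_good = rmse(good_pred, actual)
rmse_rescaled = rmse(rescaled_pred, actual)

print(f"correlation: good={cor_good:.4f}, rescaled={cor_rescaled:.4f}")
print(f"RMSE:        good={rmse_good:.4f}, rescaled={rmse_rescaled:.4f}")
```

Correlation is invariant to positive affine re-scaling, so it cannot distinguish the two predictions; the RMSE scores the prediction actually in hand.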

## Adding polished significance summaries to papers using R

When we teach “R for statistics” to groups of scientists (who tend to be quite well informed in statistics, and just need a bit of help with R) we take the time to re-work some tests of model quality with the appropriate significance tests. We organize the lesson in terms […]

## Proofing statistics in papers

Recently saw a really fun article making the rounds: “The prevalence of statistical reporting errors in psychology (1985–2013)”, Nuijten, M.B., Hartgerink, C.H.J., van Assen, M.A.L.M. et al., Behav Res (2015), doi:10.3758/s13428-015-0664-2. The authors built an R package to check psychology papers for statistical errors. Please read on for how that […]

## Variables can synergize, even in a linear model

Introduction: Suppose we have the task of predicting an outcome y given a number of variables v1, ..., vk. We often want to “prune variables” or build models with fewer than all the variables. This can be to speed up modeling, decrease the cost of producing future data, improve robustness, improve explain-ability, […]

## Variable pruning is NP hard

I am working on some practical articles on variable selection, especially in the context of step-wise linear regression and logistic regression. One thing I noticed while preparing some examples is that summaries such as model quality (especially out of sample quality) and variable significances are not quite as simple as […]

## How Do You Know if Your Data Has Signal?

Image by Liz Sullivan, Creative Commons. Source: Wikimedia. An all too common approach to modeling in data science is to throw all possible variables at a modeling procedure and “let the algorithm sort it out.” This is tempting when you are not sure what are the true causes or predictors […]