We have an exciting new article to share: Don’t Feel Guilty About Selecting Variables.

If you are at all interested in the probabilistic justification of important data science techniques, such as variable selection or pruning, this should be an informative and fun read.

“Data Science” is often dismissed with the common jibe “if it has science in the name it isn’t a science.” Data science is in fact a science for the following reason: it has empirical content. That is, there are methods that are used because we can confirm they work.

However, data science when done well also has a mathematical basis. We expect to find good mathematical, probabilistic, or statistical justification for reliable procedures.

Variable pruning or selection is one such procedure. It is well known that it can improve data science results. This is an empirical fact: for some datasets and some fitting procedures, explicit prior variable selection improves results. Our new note examines how this is not mere empirical alchemy, but something that is mathematically justified and to be expected (under an appropriate Bayesian formulation of model fitting).
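To make the empirical claim concrete, here is a minimal sketch (not taken from the note itself) of one situation where prior variable selection helps. Everything in it is an assumption chosen for illustration: the synthetic data, the helper names `fit_ols`, `select_by_correlation`, and `run_demo`, the use of ordinary least squares as the fitting procedure, and the simple "keep the columns most correlated with the outcome" screen standing in for a real selection method.

```python
import numpy as np


def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    return coef[0] + X @ coef[1:]


def select_by_correlation(X, y, n_keep):
    """Hypothetical screen: keep the n_keep columns with the largest
    absolute correlation with y, computed on the training data only."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.sort(np.argsort(-np.abs(corr))[:n_keep])


def run_demo(seed=0, n_train=100, n_test=1000, n_signal=3, n_noise=80, n_keep=5):
    rng = np.random.default_rng(seed)
    p = n_signal + n_noise

    # Synthetic data (an assumption): only the first n_signal columns matter.
    beta = np.zeros(p)
    beta[:n_signal] = 2.0
    X_train = rng.normal(size=(n_train, p))
    X_test = rng.normal(size=(n_test, p))
    y_train = X_train @ beta + rng.normal(size=n_train)
    y_test = X_test @ beta + rng.normal(size=n_test)

    # Fit on all variables.
    coef_full = fit_ols(X_train, y_train)
    mse_full = float(np.mean((y_test - predict(coef_full, X_test)) ** 2))

    # Select variables using training data only, then refit on the survivors.
    kept = select_by_correlation(X_train, y_train, n_keep)
    coef_sel = fit_ols(X_train[:, kept], y_train)
    mse_selected = float(np.mean((y_test - predict(coef_sel, X_test[:, kept])) ** 2))

    return {"kept": kept, "mse_full": mse_full, "mse_selected": mse_selected}


if __name__ == "__main__":
    result = run_demo()
    print("kept columns:      ", result["kept"])
    print("test MSE, all vars:", result["mse_full"])
    print("test MSE, selected:", result["mse_selected"])
```

With many irrelevant columns and few training rows, the full fit spends most of its freedom chasing noise, so the selected fit should show the lower held-out error. The effect depends on the dataset and the fitting procedure: with far more rows than columns, or with a regularized fit, the gap can shrink or vanish.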

So please read on and also share: Don’t Feel Guilty About Selecting Variables, or How I Learned to Stop Worrying and Love Variable Selection.


### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

So what is the posterior on the variable `explanatory` that we have been calculating? Roughly, by Bayes law `P[explanatory | data]` is going to be of the form `P[explanatory] P[data | explanatory]/Z`. Under normality assumptions `P[data | explanatory]` is going to be of the form `c(1 - exp(-s residual_variance)/exp(-s initial_variance))`.

The classic F-statistic is very roughly (ignoring degrees of freedom) of the form `(initial_variance - residual_variance)/residual_variance`. Or `initial_variance/residual_variance - 1`.

In both cases (ignoring priors, and ignoring degrees of freedom) for normalized data `initial_variance` is a constant. So for a given data set both the posterior `P[explanatory | data]` and the F-test are monotone functions in `residual_variance`. Any choice on a threshold for one of these values is approximately equivalent to a choice subject to a transformed threshold on the other.

Michele Scandola tweeted some nice references: