
Don’t Feel Guilty About Selecting Variables

We have an exciting new article to share: Don’t Feel Guilty About Selecting Variables.

If you are at all interested in the probabilistic justification of important data science techniques, such as variable selection or pruning, this should be an informative and fun read.

“Data Science” is often dismissed with the common slur “if it has science in the name, it isn’t a science.” Data science is in fact a science for a simple reason: it has empirical content. That is, its methods are used because we can confirm that they work.

However, data science, when done well, also has a mathematical basis. We expect to find good mathematical, probabilistic, or statistical justification for reliable procedures.

Variable pruning or selection is one such procedure. It is an empirical fact that it can improve results: for some datasets and some fitting procedures, explicit prior variable selection yields measurably better models. Our new note examines how this is not mere empirical alchemy, but something that is mathematically justified and to be expected (under an appropriate Bayesian formulation of model fitting).
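
The note works through the Bayesian argument in detail. As a quick illustration of the empirical claim, here is a minimal simulation sketch in Python (not from the article; the data, the univariate screening rule, and all constants are made up for illustration) comparing ordinary least squares fit on all columns versus only on columns that pass a simple correlation screen:

import numpy as np

rng = np.random.default_rng(2019)

n_train, n_test = 100, 10000
n_signal, n_noise = 10, 80          # 10 real variables buried among 80 pure-noise columns
d = n_signal + n_noise

beta = np.zeros(d)
beta[:n_signal] = 0.5               # only the first 10 columns carry signal

def make_data(n):
    X = rng.normal(size=(n, d))
    y = X @ beta + rng.normal(size=n)
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

def rmse_ols(cols):
    """Fit ordinary least squares on the given columns, return held-out RMSE."""
    A_tr = np.column_stack([np.ones(n_train), X_tr[:, cols]])
    A_te = np.column_stack([np.ones(n_test), X_te[:, cols]])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return float(np.sqrt(np.mean((A_te @ coef - y_te) ** 2)))

# Screen variables with a univariate t-statistic on the correlation with y
# (one common, simple form of prior variable selection).
r = np.array([np.corrcoef(X_tr[:, j], y_tr)[0, 1] for j in range(d)])
t = r * np.sqrt((n_train - 2) / (1.0 - r ** 2))
selected = np.where(np.abs(t) > 2.0)[0]

print("columns kept:", len(selected), "out of", d)
print("test RMSE, all variables     :", round(rmse_ols(np.arange(d)), 3))
print("test RMSE, selected variables:", round(rmse_ols(selected), 3))

With many pure-noise columns and few training rows, the screened fit typically shows a noticeably lower held-out RMSE than the fit on all columns.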

So please read on and also share: Don’t Feel Guilty About Selecting Variables, or How I Learned to Stop Worrying and Love Variable Selection.

Categories: Pragmatic Data Science, Tutorials


jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

2 replies

  1. So what is the posterior probability that the variable is explanatory, which is the quantity we have been calculating?

    Roughly, by Bayes' law P[explanatory | data] is going to be of the form P[explanatory] P[data | explanatory]/Z. Under normality assumptions P[data | explanatory] is going to be of the form c(1 - exp(-s residual_variance)/exp(-s initial_variance)).

    The classic F-statistic is, very roughly (ignoring degrees of freedom), of the form (initial_variance - residual_variance)/residual_variance, or equivalently initial_variance/residual_variance - 1.

    In both cases (ignoring priors and degrees of freedom), for normalized data initial_variance is a constant. So, for a given data set, both the posterior P[explanatory | data] and the F-statistic are monotone functions of residual_variance. Any choice of threshold on one of these values is approximately equivalent to a (transformed) threshold on the other.
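
    A tiny numerical sketch of that monotone equivalence (in Python; the constants and the simplified likelihood term exp(-s * residual_variance) are made up for illustration, ignoring priors, degrees of freedom, and normalization):

    import numpy as np

    s = 5.0                                   # made-up precision constant
    initial_variance = 1.0                    # normalized data
    residual_variance = np.linspace(0.05, 1.0, 20)

    # posterior-style score: simplified likelihood term, larger for smaller residual variance
    posterior_score = np.exp(-s * residual_variance)

    # rough F-style score, ignoring degrees of freedom
    f_score = initial_variance / residual_variance - 1.0

    # both scores are strictly decreasing in residual_variance, so any threshold
    # on one corresponds to some threshold on the other
    assert np.all(np.diff(posterior_score) < 0)
    assert np.all(np.diff(f_score) < 0)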

  2. Michele Scandola tweeted some nice references:

    Variable Selection for Regression Models
    Author(s): Lynn Kuo and Bani Mallick
    Source: Sankhyā: The Indian Journal of Statistics, Series B (1960-2002), Vol. 60, No. 1, Bayesian Analysis (Apr., 1998), pp. 65-81
    Published by: Indian Statistical Institute
    Stable URL: https://www.jstor.org/stable/25053023
    
    Bayesian Model Choice via Markov Chain Monte Carlo Methods
    Author(s): Bradley P. Carlin and Siddhartha Chib
    Source: Journal of the Royal Statistical Society. Series B (Methodological), Vol. 57, No. 3 (1995), pp. 473-484
    Published by: Wiley for the Royal Statistical Society
    Stable URL: https://www.jstor.org/stable/2346151
    