We have already written quite a few times about our vtreat open source variable treatment package for R (which implements effects/impact coding, missing value replacement, and novel value replacement, among other important data preparation steps), but we thought we would take some time to describe some of the principles behind the package design.
vtreat is something we really feel you should add to your predictive analytics or data science workflow.
vtreat getting a call-out from Dmitry Larko, photo Erin LeDell
vtreat’s design and implementation follow from a number of reasoned assumptions or principles, a few of which we discuss below.
1: "Not a domain expert" assumption
vtreat avoids any transformation that cannot be reliably performed without domain expertise. For example, vtreat does not perform outlier detection or density estimation to attempt to discover sentinel values hidden in numeric data. We consider reliably detecting such values (which can in fact ruin an analysis when not detected) a domain-specific question. To get special treatment of such values the analyst needs to first convert them to separate indicators and/or a special value such as NA.
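For example, if the analyst knows from domain knowledge that a particular value is a sentinel for "not measured," the conversion to NA is a one-liner in base R. The column name and sentinel value -999 below are hypothetical, purely for illustration:

```r
# Example data: -999 is a (hypothetical) sentinel meaning "not measured"
d <- data.frame(x = c(1.5, -999, 3.2, -999, 7.0))

# Convert the sentinel to NA so downstream treatment sees it as missing
d$x[d$x == -999] <- NA

d$x
# 1.5 NA 3.2 NA 7.0
```

After this conversion vtreat's standard missing-value handling applies to those entries.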
This is also why, as of version 0.5.28, vtreat does not default to collaring or Winsorizing numeric values (restricting numeric values to ranges observed during treatment design). For some variables Winsorizing seems harmless, for others (such as time) it is a catastrophe. This determination can be subjective, which is why we include the feature as a user control.
2: "Not the last step" assumption
One of the design principles of vtreat is the assumption that any use of prepare is followed by a sophisticated modeling technique. That is: a technique that can reason about groups of variables. So vtreat defers reasoning about groups of variables and other post-processing to this technique.
This is one reason vtreat allows both level indicators and complex derived variables (such as effects or impact coded variables) to be taken from the same original categorical variable, even though this can introduce linear dependency among derived variables. vtreat does prohibit constant or non-varying derived variables, as those are traditionally considered anathema in modeling.
R’s base lm and glm(family=binomial) methods are somewhat sophisticated in that they do work properly in the presence of collinear independent variables, as both methods automatically remove a set of redundant variables during analysis. However, in general we would recommend regularized techniques as found in glmnet as a defense against near-dependency among variables.
vtreat variables are intended to be used with regularized statistical methods, which is one reason that for categorical variables no value is picked as a reference level to build contrasts. For L2 or Tikhonov regularization it can be more appropriate to regularize indicator-driven effects towards zero than towards a given reference level.
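The reference-level distinction is easy to see with base R's model.matrix (a sketch of the two encodings, not vtreat's actual implementation): the default treatment contrasts drop one level as the reference, while a one-indicator-per-level encoding, closer in spirit to vtreat's level indicators, keeps all levels so regularization can shrink each effect towards zero symmetrically:

```r
d <- data.frame(x = factor(c("a", "b", "c", "b")))

# Default treatment contrasts: level "a" becomes the reference and gets no column
m_contrasts <- model.matrix(~ x, data = d)
colnames(m_contrasts)   # "(Intercept)" "xb" "xc"

# One indicator per level: no reference level is singled out
m_full <- model.matrix(~ x - 1, data = d)
colnames(m_full)        # "xa" "xb" "xc"
```

With the full encoding an L2 penalty pulls every level's coefficient towards zero, rather than towards whichever level happened to be chosen as the reference.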
This is also one reason the user must supply a variable pruning significance rather than vtreat defaulting to our suggested 1/numberOfVariables heuristic; the variable pruning level is sensitive to the modeling goals, the number of variables, and the number of training examples. Variable pruning is so critical in industrial data science practice that we feel we must supply some tools for it. Any joint dimension reduction technique (other than variable pruning) is again left as a next step (though vtreat’s scaling feature can be a useful preparation for principal components analysis, please see here).
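As a rough base-R sketch of the heuristic (not using vtreat itself, with synthetic data and variable names): score each candidate variable by the significance of a one-variable linear fit, and keep those passing the 1/numberOfVariables threshold:

```r
set.seed(2016)
n <- 200
nvars <- 10

# Synthetic data: one informative variable (x1) plus nine noise variables
d <- as.data.frame(matrix(rnorm(n * nvars), ncol = nvars))
names(d) <- paste0("x", seq_len(nvars))
d$y <- 2 * d$x1 + rnorm(n)

# Significance of each single-variable linear model
sigs <- vapply(paste0("x", seq_len(nvars)), function(v) {
  fit <- summary(lm(paste("y ~", v), data = d))
  fit$coefficients[2, "Pr(>|t|)"]
}, numeric(1))

# Keep variables whose significance beats the 1/numberOfVariables heuristic
kept <- names(sigs)[sigs < 1 / nvars]
kept  # includes "x1"; most noise variables are pruned
```

The heuristic admits each pure-noise variable with probability roughly 1/numberOfVariables, so on average about one noise variable slips through, which is why the appropriate threshold depends on the problem at hand.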
vtreat’s explicit indication of missing values is meant to allow the next step of processing to use missingness as possibly informative, and to work around vtreat’s simple unconditioned point replacement of missing values.
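The idea can be sketched in base R on synthetic data (vtreat’s actual derived variables follow a similar pattern, with an indicator column alongside the repaired numeric column):

```r
d <- data.frame(x = c(1, NA, 3, NA, 5))

# Indicator recording which values were originally missing
d$x_isBAD <- as.numeric(is.na(d$x))

# Simple unconditioned point replacement: fill NAs with the observed mean
d$x[is.na(d$x)] <- mean(d$x, na.rm = TRUE)

d$x        # 1 3 3 3 5
d$x_isBAD  # 0 1 0 1 0
```

A downstream model is then free to discover that the indicator itself carries signal, rather than being forced to accept the point replacement as the whole story.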
3: "Consistent estimators" principle
The estimates vtreat returns should be consistent in the sense that they converge to ideal non-constant values as the amount of data available for calibration or design goes to infinity. This means we can have conditional expectations (such as catB and catN variables), prevalences or frequencies (such as catP), and even conditional deviations (such as catD). But the principle forbids other tempting summaries such as conditional counts (which scale with data) or frequentist significances (which either don’t converge, or converge to the non-informative constant zero).
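For intuition, a catN-style impact code is essentially a per-level conditional mean of the outcome expressed as a difference from the grand mean. The base R sketch below (synthetic data, and deliberately ignoring vtreat's cross-validated calibration) shows the quantity being estimated, which converges to fixed conditional expectations as the data grows:

```r
set.seed(5)
d <- data.frame(
  x = sample(c("a", "b", "c"), 1000, replace = TRUE),
  y = rnorm(1000)
)
d$y <- d$y + ifelse(d$x == "a", 1, 0)  # level "a" shifts the outcome up by 1

# catN-style impact code: per-level mean of y minus the grand mean.
# As n grows these estimates converge to fixed conditional expectations.
grand_mean  <- mean(d$y)
level_means <- tapply(d$y, d$x, mean)
impact      <- level_means - grand_mean
d$x_catN    <- impact[d$x]

impact  # roughly a: +2/3, b: -1/3, c: -1/3 (up to sampling noise)
```

A conditional count (say, the number of "a" rows) would instead grow without bound with the data, and a per-level significance would collapse towards zero, which is exactly why the principle excludes them.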
These are some of the principles that went into the design of vtreat. We hope you can use the advantages of vtreat in your next data science project. A good place to start is reviewing the pre-rendered package vignettes (please start with the one called “vtreat”).
Categories: Pragmatic Data Science
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.