Data Preparation, Long Form and tl;dr Form

Data preparation and cleaning are some of the most important steps of predictive analytics and data science tasks. They are laborious, they are where most of the errors are made, they are your last line of defense against wild data, and they hold the biggest opportunities for outcome improvement. No matter how much time you spend on them, they still seem like a neglected topic. Data preparation isn’t as self-contained or genteel as tweaking machine learning models or tuning hyperparameters, and that is one of the reasons it represents such an important practical opportunity for improvement.

Photo: NY – http://nyphotographic.com/, License: Creative Commons 3 – CC BY-SA 3.0

Our group is distributing a detailed writeup of the theory and operation behind vtreat, our R implementation of a set of sound data preparation and cleaning procedures, here: arXiv:1611.09477 [stat.AP]. This is where you can find out what vtreat does, decide whether it is appropriate for your problem, or even find a specification allowing the use of the techniques in non-R environments (such as Python/Pandas/scikit-learn, Spark, and many others).

We have submitted this article for formal publication, so our intent is that you can cite this article (as it stands) as a pre-print in scientific work, and later cite it from a formally refereed source.

Alternatively, here is the tl;dr (“too long; didn’t read”) form.

Our concrete advice is: when building a supervised model (regression or classification) in R, prepare your training, test, and application data by doing the following.

# load the vtreat package
library("vtreat")

# use your training data to design
# a data treatment plan
ce <- mkCrossFrameCExperiment(trainData, 
                              vars,
                              yName, yTarget)

# look at the variable scores
varScores <- ce$treatments$scoreFrame
print(varScores)

# prune variables based on significance
pruneSig <- 1/nrow(varScores)
modelVars <- varScores$varName[varScores$sig<=pruneSig]

# instead of preparing training data, use 
# "simulated out of sample data" to reduce modeling bias
treatedTrainData <- ce$crossFrame

# prepare any other data (test, future application)
# using the treatment plan
treatedTestData <- prepare(ce$treatments, 
                           testData, 
                           varRestriction= modelVars, 
                           pruneSig= NULL)
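
One way to read the pruning threshold `1/nrow(varScores)` is as a cap on false positives: if the significances of useless variables behave roughly like uniform random numbers, then on average about one pure-noise variable gets past the filter, however many candidates there are. The following is a minimal base-R sketch of that reasoning, not vtreat's own scoring. The simulated data, the use of `t.test()` as a stand-in significance estimate, and names such as `countPassing` are all assumptions made for illustration.

```r
# Illustration only: simulated noise variables, not vtreat's scoring code.
set.seed(2016)

nRows <- 200            # rows in each simulated training set (assumed)
nVars <- 50             # number of pure-noise candidate variables (assumed)
nReps <- 200            # how many times the experiment is repeated
pruneSig <- 1 / nVars   # same rule as 1/nrow(varScores)

# Count how many noise variables pass the significance filter in one
# simulated data set. The outcome y is unrelated to every column of x.
countPassing <- function() {
  y <- rbinom(nRows, 1, 0.5) == 1
  x <- matrix(rnorm(nRows * nVars), nrow = nRows)
  sigs <- apply(x, 2, function(col) t.test(col[y], col[!y])$p.value)
  sum(sigs <= pruneSig)
}

passes <- replicate(nReps, countPassing())
meanPassing <- mean(passes)

# average number of useless variables that survive pruning
print(meanPassing)
```

The average should land near one, which is the sense in which the threshold scales with the number of candidate variables: a stricter cut-off is applied when there are more chances for noise to look significant.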

Then work through our examples to find out what all these steps are doing for you.
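
To get a feel for why the "simulated out of sample data" step matters, here is a small base-R sketch. It is not vtreat's implementation: the three-fold split, the simple mean-based coding, and the helper name `impactCode` are assumptions chosen to keep the example short. It re-encodes a high-cardinality categorical variable that has no relation to the outcome, once using the same rows to design and apply the coding, and once using held-out folds.

```r
# Illustration only: a hand-rolled coding, not vtreat's treatment plan.
set.seed(25325)

n <- 1000
d <- data.frame(
  x = sample(paste0("lev", 1:300), n, replace = TRUE),  # many levels
  y = rnorm(n),                                         # pure noise outcome
  stringsAsFactors = FALSE
)

# Code each level as (mean of y for that level) - (overall mean of y),
# estimated on `train` and applied to `app`. Levels not seen in `train`
# are coded as zero.
impactCode <- function(train, app) {
  grandMean <- mean(train$y)
  levelMeans <- vapply(split(train$y, train$x), mean, numeric(1))
  code <- unname(levelMeans[app$x]) - grandMean
  code[is.na(code)] <- 0
  code
}

# Naive: design and apply the coding on the same rows.
naiveCode <- impactCode(d, d)

# Cross-frame style: each row is coded using only the other folds.
folds <- sample(rep(1:3, length.out = n))
crossCode <- numeric(n)
for (k in 1:3) {
  held <- folds == k
  crossCode[held] <- impactCode(d[!held, ], d[held, ])
}

naiveCor <- cor(naiveCode, d$y)
crossCor <- cor(crossCode, d$y)
print(c(naive = naiveCor, cross = crossCor))
```

The naive coding appears strongly related to `y` even though `x` is noise, because each row's own outcome leaked into its code. The held-out coding shows little or no relation, which is much closer to how the variable would behave on new data.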

jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

4 replies

  1. “or even find a specification allowing the use of the techniques in non-R environments (such as Python/Pandas/scikit-learn, Spark, and many others)”

    Are there already some implementations for Python or Spark?

    1. There are no Python or Spark implementations yet. What I meant is that the document serves as a specification that would allow an easy Python or Spark implementation. Frankly, our group would love a grant to make those versions and a test suite to ensure all 3 are equivalent.

  2. I printed v1 from arXiv some time ago to read it during the holiday. I expect to finish reading it by the end of the year.

    Is there any substantial change in the v2 version?
