Dr. Nina Zumel recently published an excellent tutorial on a modeling technique she called impact coding. It is a pragmatic machine learning technique that has helped with more than one client project. Impact coding is a bridge from Naive Bayes (where each variable’s impact is added without regard to the known effects of any other variable) to logistic regression (where dependencies between variables and levels are completely accounted for). A natural question is: can we pick up more of the positive features of each model? For example: Naive Bayes handles very many level values quite well and is an easy framework to add improvements like Laplace smoothing to. Logistic regression is excellent at handling correlation between modeling variables (it stops over-estimation) but is traditionally at a disadvantage when confronted with an extreme number of levels (and does not incorporate smoothing unless you add explicit shrinkage, regularization or conjugate priors to the formulation). Methods like stochastic gradient descent (see also: Mahout) can deal with a large number of category levels, but at that point you are a bit far afield of the traditional methods for solving a logistic regression problem: iteratively re-weighted least squares (IRLS) or the related Newton’s method. This complete switch in methodology loses some of the algorithmic advantages of those methods.
In both IRLS and Newton’s method you form an item called the “Hessian matrix,” which is the matrix of all second derivatives of the logistic likelihood function (or, equivalently, the derivative of the vector of balance equations). The problem is that this matrix has size proportional to the square of the number of category levels in the model, and in the general case it is not sparse. The matrix can be much larger than your training data, and forming it can be an unreasonable amount of work. The impact coding technique hides the large number of category encodings from the logistic regression problem and usefully moderates the size of this matrix.
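To make the size issue concrete, here is a minimal sketch (not the Win Vector code; all names and sizes are illustrative) of forming the logistic Hessian X^T W X on one-hot encoded categorical variables. With two categorical variables totaling 500 levels, the Hessian already has 500 × 500 entries, regardless of how many training rows you have:

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, levels_a, levels_b = 1000, 300, 200

# one-hot encode two categorical variables and stack the indicator columns
a = rng.integers(0, levels_a, size=n_rows)
b = rng.integers(0, levels_b, size=n_rows)
X = np.zeros((n_rows, levels_a + levels_b))
X[np.arange(n_rows), a] = 1.0
X[np.arange(n_rows), levels_a + b] = 1.0

p = rng.uniform(0.1, 0.9, size=n_rows)  # current model probabilities
w = p * (1.0 - p)                       # per-row logistic Hessian weights

H = X.T @ (w[:, None] * X)              # size: (levels_a+levels_b)^2
print(H.shape)                          # (500, 500)
```

Note that the cross-variable block of H is generally dense: every pair of levels that co-occurs on some row contributes a nonzero entry.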
What we have noticed is that good Bayesian impact-coded variables are optimal for the first pass of the preferred logistic regression optimization procedures (IRLS and Newton’s method). The natural question is: are there simple variations of the impact-coding technique that are more appropriate for later Newton iterations and their (non-uniform) derived data weightings? It turns out that there are such variations, and they are fairly simple to implement.
The trick is to look at the gradient of the likelihood function (or any other goodness-of-fit metric). From this you can quickly generate new impact encodings, such as the balance deficit:
sum( 1 | x_i has the chosen level and outcome is "true" ) - sum( p(x_i) | x_i has the chosen level )
p(x_i) is the probability the current model assigns to the outcome “true” for example i. We get one such summary per level of our categorical variable, and together these values give us a new direction to add to our problem re-encoding.
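A hedged sketch of the balance-deficit encoding above (the data and names here are made up for illustration): for each level, the count of “true” outcomes minus the sum of the current model’s predicted probabilities over rows carrying that level.

```python
from collections import defaultdict

def balance_deficit(levels, outcomes, probs):
    """levels: category level per example; outcomes: True/False;
    probs: current model's P(outcome = True) per example."""
    deficit = defaultdict(float)
    for lev, y, p in zip(levels, outcomes, probs):
        if y:
            deficit[lev] += 1.0  # sum( 1 | level, outcome "true" )
        deficit[lev] -= p        # minus sum( p(x_i) | level )
    return dict(deficit)

levels   = ["a", "a", "b", "b", "b"]
outcomes = [True, False, True, True, False]
probs    = [0.6, 0.4, 0.5, 0.5, 0.5]
print(balance_deficit(levels, outcomes, probs))
# a: 1 - (0.6 + 0.4) = 0.0 ; b: 2 - (0.5 + 0.5 + 0.5) = 0.5
```

A level whose deficit is zero is already “balanced” by the current model; a positive deficit suggests the level’s encoding should be pushed up.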
This is a bit ad hoc, and by playing with the order of summation and estimation we can get different interesting improvement directions. For instance, we can also look at the error in prediction of the correct class c on a single training example. We would like our final model to place a probability estimate of 1 on the correct class, while our current model places p(c,x_i). So we have an error of 1 – p(c,x_i), and by arguments given here we know changing the encoding of a single level present on this example changes the prediction at a rate of p(c,x_i)(1 – p(c,x_i)). So from a linearized point of view this training example would like to see the given level encoding increased by (1 – p(c,x_i))/(p(c,x_i)(1 – p(c,x_i))) = 1/p(c,x_i). We can sum this vote over all training examples marked with the same outcome c and the same level to get a new direction:
sum(1/p(c,x_i) | x_i has the chosen level and outcome is c) / sum(1 | x_i has the chosen level and outcome is c)
p(c,x_i) is the probability the current model assigns to the outcome “c” for example i. Here we get one encoding for each combination of outcome category and variable level.
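The per-(outcome, level) direction above is just the average of the 1/p(c,x_i) votes. A small illustrative sketch (again with made-up data, not the production code):

```python
from collections import defaultdict

def vote_direction(levels, outcomes, probs):
    """probs[i] = current model's P(outcome == outcomes[i]) for example i.
    Returns the average of 1/p over each (outcome, level) pair."""
    tot = defaultdict(float)
    cnt = defaultdict(int)
    for lev, c, p in zip(levels, outcomes, probs):
        tot[(c, lev)] += 1.0 / p  # each example votes 1/p(c, x_i)
        cnt[(c, lev)] += 1
    return {key: tot[key] / cnt[key] for key in tot}

levels   = ["a", "a", "b"]
outcomes = ["yes", "yes", "no"]
probs    = [0.5, 0.25, 0.2]
print(vote_direction(levels, outcomes, probs))
# ("yes","a"): (1/0.5 + 1/0.25)/2 = 3.0 ; ("no","b"): 1/0.2 = 5.0
```

Examples the current model handles badly (small p) vote loudest, which is exactly the non-uniform weighting later Newton iterations care about.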
A practical technique is to use all of these as a succinct re-encoding of categorical variables with a great many levels, and then to intersperse these re-encodings with the traditional Newton or IRLS steps. Unfortunately a lot of these directions become co-linear and are not the best support for the Newton/IRLS process, but the technique is worth a try.
Note that we do not suggest principal components analysis between the indicator variables that would represent levels of a categorical variable. This is because principal components analysis looks for variable-to-variable correlation (“x alone”), which ignores both the fact that competing levels are anti-correlated and the fact that we are most interested in correlations to outcome (picking useful effects, not memorizing input distributions).
We have demonstration code that collects a number of these heuristics and presents them as new possible direction vectors in the model. So in our new impact coding a single categorical variable with a large number of levels is replaced by a small family of numeric variables (reading off different heuristic estimates of the best impact encoding to date and best improving directions). We then run the Newton steps, collect the sum of all impacts into a single summed impact, and introduce new directions using the new model estimates. We retain the effects of all previous encodings and directions with the single additional summed-effect variable. This tends to converge quite quickly and leaves us at a model solution point where the gradient is near zero (some evidence we may be near an optimum). This is implemented in our experimental logistic code (in particular the encoding class BTable) and activated by specifying a “LogisticTrainPlus” control class on the training command line.
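A compressed, self-contained sketch of one round of this scheme (in Python for clarity, not the Java implementation just described; the data, smoothing choice, and loop structure are all illustrative assumptions): encode the categorical variable as an impact column, take a Newton step on that small numeric problem, fold the fitted effect into a single summed-impact column, and append a fresh balance-deficit direction for the next step.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 2000, 50
lev = rng.integers(0, k, size=n)                 # one categorical variable
true_logit = rng.normal(size=k)[lev]
y = (rng.uniform(size=n) < 1/(1 + np.exp(-true_logit))).astype(float)

def impact_code(lev, y, k, smooth=1.0):
    """Laplace-smoothed per-level log-odds (a simple Bayesian impact code)."""
    ones = np.bincount(lev, weights=y, minlength=k)
    cnts = np.bincount(lev, minlength=k)
    p = (ones + smooth) / (cnts + 2*smooth)
    return np.log(p / (1 - p))

def newton_step(X, y, beta):
    """One Newton step for logistic regression on the small numeric problem."""
    p = 1/(1 + np.exp(-X @ beta))
    g = X.T @ (y - p)                            # gradient
    H = X.T @ ((p * (1 - p))[:, None] * X)       # small, dense Hessian
    return beta + np.linalg.solve(H, g)

# round 1: intercept + impact column (categorical hidden from the solver)
X = np.column_stack([np.ones(n), impact_code(lev, y, k)[lev]])
beta = newton_step(X, y, np.zeros(2))
p = 1/(1 + np.exp(-X @ beta))

# fold fitted effects into one summed-impact column, then add the
# balance-deficit direction computed from the new probabilities
summed = X[:, 1:] @ beta[1:]
deficit = np.bincount(lev, weights=y - p, minlength=k)
X2 = np.column_stack([np.ones(n), summed, deficit[lev]])
beta2 = newton_step(X2, y, np.array([beta[0], 1.0, 0.0]))
```

The point of the folding step is that the Hessian the solver sees stays tiny (here 3 × 3) no matter how many levels the categorical variable has.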
As a trivial example, take the MovieLens 10 million dataset and model movie rating as a function of UserID and MovieID (basically a fixed-effect model trying to balance out users with different general biases). We unzip the file ml-10m.zip and use sed to mechanically edit the file ratings.dat into a more machine-friendly format:
echo "UserID,MovieID,Rating,Timestamp" > ratings.csv
cat ratings.dat | sed "s/::/,/g" >> ratings.csv
gzip -9 ratings.csv
We can then use our experimental logistic code to build a model of the movie rating as a function of the movie id alone (essentially memorizing the distribution of ratings):
java -Xmx3G -cp h2-1.2.147.jar:junit-4.4.jar:logistic.jar:commons-cli-1.2.jar:commons-logging-1.1.1.jar:commons-logging-api-1.0.4.jar com.winvector.logistic.LogisticTrain --trainClass com.winvector.logistic.LogisticTrainPlus --trainURI file:ratings.csv.gz --sep , --formula "Rating ~ ^MovieID" --resultTSV model0.tsv
The “^” notation is telling the modeling software to treat MovieID as a categorical variable (even though it is encoded as an integer). We could also build a model estimating the movie rating as an additive combination (modulo the logistic term) of MovieID and UserID.
java -Xmx5G -cp h2-1.2.147.jar:junit-4.4.jar:logistic.jar:commons-cli-1.2.jar:commons-logging-1.1.1.jar:commons-logging-api-1.0.4.jar com.winvector.logistic.LogisticTrain --trainClass com.winvector.logistic.LogisticTrainPlus --trainURI file:ratings.csv.gz --sep , --formula "Rating ~ ^UserID + ^MovieID" --resultTSV model1.tsv
This is a deliberately silly example, but it shows we can train a logistic model on a single machine on a data set with 10,000,000 rows and two large categorical variables (MovieID with 10,677 levels and UserID with 69,878 levels) in a reasonable amount of time. Note that neither model is particularly good (adding UserID as a variable represents only a 3% absolute increase in accuracy and an 11% relative decrease in model deviance). Model performance was poor in both cases, so consider this not a legitimately treated statistical result, but a mere proof of scale for the method.
What is needed overall is a method to hide detail. The method can be domain-expert based roll-ups (using spatial relations, clustering or shrinkage), data based (like partial least squares, bilinear interpolation or splines) or complexity based (complexity or manifold regularization methods). Or, as in our example, it can look more like a model within a model (as in a hierarchical or graphical model).
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.