I have been working through (with some honest appreciation) a recent article comparing many classifiers on many data sets: "Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?" Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim; Journal of Machine Learning Research 15(Oct):3133−3181, 2014 (which we will call "the DWN paper" in this note). The paper applies 179 popular classifiers to around 120 data sets (mostly from the UCI Machine Learning Repository). The work looks good and interesting, but we do have one quibble with the data preparation on 8 of the 123 shared data sets. Given that the paper is already out (not just in pre-print), I think it is appropriate to comment publicly.
The DWN paper is an interesting empirical study that measures the performance of a good number of popular classifiers (179 by their own account) on about 120 data sets (mostly from UCI).
This actually represents a bit of work, as the UCI data sets are not all in exactly the same format. The data sets have varying file names, varying separators, varying missing-value symbols, varying quoting/escaping conventions, and non-machine-readable headers; some data sets have row-ids, the column to be predicted sits in varying positions, some data arrives in zip files, and there are many other painful variations. I have always described UCI as "not quite machine readable." Working with any one data set is easy, but the prospect of building an adapter for each of a large number of such data sets is unappealing. Combined with the fact that the data sets are often small, and often artificial/synthetic (designed to show off one particular inference method), few people work with more than a few of them. The authors of DWN worked with well over 100 and shared their fully machine-readable results (.arff and apparently standardized *_R.dat files) in a convenient single downloadable tar-file (see their paper for the URL).
The stated conclusion of the paper is comforting, and not entirely unexpected: random forest methods are usually among the top 3 classifiers in terms of accuracy.
The problem is: we are always more accepting of an expected outcome. To confirm such a conclusion we will, of course, need more studies (on larger and more industry-typical data sets), better measures than accuracy (see here for some details), and a lot of digging into methodology (including data preparation).
To be clear: I like the paper. The authors (as good scientists) publicly shared their data and a bit of their preparation code. This is something most authors do not do, and it should in fact be our standard for accepting work for evaluation.
But, let us get down to quibbles. Let's unpack the data and look at an example. Suppose we start with "car," a synthetic data set we have often used as an example. The UCI repository supplies 3 files:
car.names: free-form description of the data set and format.
car.data: comma-separated data (without header).
car.c45-names: presumably a machine-readable header for a C4.5 package.
The standard way to deal with this data is to inspect car.names or car.c45-names by hand and hand-build a custom command to load the data. Example R code to do this is given below:
library(RCurl)

# pull the raw comma-separated data (no header row) directly from UCI
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data"
tab <- read.table(text=getURL(url, write=basicTextGatherer()),
                  header=FALSE, sep=',')

# column names transcribed by hand from car.names
colnames(tab) <- c('buying', 'maint', 'doors', 'persons',
                   'lug_boot', 'safety', 'class')
options(width=50)
print(summary(tab))
Which (assuming RCurl
is properly installed) yields:
   buying       maint        doors       persons
 high :432   high :432   2    :432   2   :576
 low  :432   low  :432   3    :432   4   :576
 med  :432   med  :432   4    :432   more:576
 vhigh:432   vhigh:432   5more:432
   lug_boot     safety        class
 big  :576   high:576   acc  : 384
 med  :576   low :576   good :  69
 small:576   med :576   unacc:1210
                        vgood:  65
For any one data set, having to read the documentation and adapt it into custom loading code is not a big deal. However, having to do this for over 100 data sets is a real effort. Let's look into how the DWN paper did this.
The DWN paper car directory has 9 items:

car.data: original file from UCI.
car.names: original file from UCI.
le_datos.m: Matlab custom data loading code.
car.txt: facts about the data set.
car.arff: derived .arff format version of the data set.
car.cost: pricing of classification errors.
car_R.dat: derived standard tab-separated values file with header.
conxuntos.dat: likely a result file.
conxuntos_kfold.dat: likely a result file.
The files I am interested in are car_R.dat and le_datos.m. car_R.dat looks to be a TSV (tab-separated values) file with a header, likely intended to be read into R. The file is in a very regular format with row numbers, feature columns first (named f*), and the category to be predicted last (named clase and re-encoded as an integer). Notice that all the features (which in this case were originally strings or factors) have been re-encoded as floating-point numbers. That is potentially a problem. Let's dig into how this conversion may have been done. We look into le_datos.m and see the following code fragment:
for i_fich=1:n_fich
  f=fopen(fich{i_fich}, 'r');
  if -1==f
    error('erro en fopen abrindo %s\n', fich{i_fich});
  end
  for i=1:n_patrons(i_fich)
    fprintf(2,'%5.1f%%\r', 100*n_iter++/n_patrons_total);
    for j = 1:n_entradas
      t = fscanf(f,'%s',1);
      if j==1 || j==2
        val={'vhigh', 'high', 'med', 'low'};
      elseif j==3
        val={'2', '3', '4', '5-more'};
      elseif j==4
        val={'2', '4', 'more'};
      elseif j==5
        val={'small', 'med', 'big'};
      elseif j==6
        val={'low', 'med', 'high'};
      end
      n=length(val); a=2/(n-1); b=(1+n)/(1-n);
      for k=1:n
        if strcmp(t,val{k})
          x(i_fich,i,j)=a*k+b;
          break
        end
      end
    end
    t = fscanf(f,'%s',1); % lectura da clase (reading of the class label)
    for j=1:n_clases
      if strcmp(t,clase{j})
        cl(i_fich,i)=j;
        break
      end
    end
  end
  fclose(f);
end
It looks like for each categorical variable the researchers have hand-coded an ordered choice of levels. Each level is then replaced by an equally spaced code number from -1 through 1 (using the linear rule x(i_fich,i,j)=a*k+b; a small R reconstruction of this rule appears after the data samples below). Then (in code not shown) possibly more transformations are applied to the numeric variables (such as centering and scaling to unit variance). This changes the original data, which looks like this:
  buying maint doors persons lug_boot safety class
1  vhigh vhigh     2       2    small    low unacc
2  vhigh vhigh     2       2    small    med unacc
3  vhigh vhigh     2       2    small   high unacc
4  vhigh vhigh     2       2      med    low unacc
5  vhigh vhigh     2       2      med    med unacc
6  vhigh vhigh     2       2      med   high unacc
To this:

        f1       f2       f3       f4       f5       f6 clase
1 -1.34125 -1.34125 -1.52084 -1.22439 -1.22439 -1.22439     1
2 -1.34125 -1.34125 -1.52084 -1.22439 -1.22439        0     1
3 -1.34125 -1.34125 -1.52084 -1.22439 -1.22439  1.22439     1
4 -1.34125 -1.34125 -1.52084 -1.22439        0 -1.22439     1
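To make the rule concrete, here is a minimal R reconstruction of the a*k+b encoding (my sketch; encodeLevels is a hypothetical helper name, not the authors' code). It also checks that the extra scale factor seen in car_R.dat is consistent with a later scale-to-unit-variance step:

# Minimal R sketch (my reconstruction, not the DWN authors' code) of the
# a*k+b rule: n ordered levels map to equally spaced points in [-1, 1].
encodeLevels <- function(levels) {
  n <- length(levels)
  a <- 2/(n - 1)
  b <- (1 + n)/(1 - n)
  setNames(a*seq_len(n) + b, levels)
}
codes <- encodeLevels(c('vhigh', 'high', 'med', 'low'))
print(codes)
##      vhigh       high        med        low
## -1.0000000 -0.3333333  0.3333333  1.0000000

# Dividing by the sample standard deviation of the full encoded column
# (each of the 4 codes appears 432 times in car.data) reproduces the
# values seen in car_R.dat, consistent with scaling to unit variance.
print(codes/sd(rep(codes, each=432)))
##     vhigh      high       med       low
## -1.341249 -0.447083  0.447083  1.341249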
It appears as if one of the machine learning libraries the authors are using only accepts numeric features (I think some of the Python scikit-learn methods have this limitation), or the authors believe they are using such a package. Whoever prepared this data seemed to be unaware that the standard way to convert categorical variables to numeric is to introduce multiple indicator variables (see page 33 of chapter 2 of Practical Data Science with R for more details).
(Figure: indicator variables encoding US Census reported levels of education.)
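For concreteness, here is a minimal R sketch (mine, not from the paper or the book figure) of the indicator encoding, using a toy version of the car data's safety variable:

# Minimal sketch of the standard indicator (dummy) variable encoding:
# model.matrix() expands a factor into 0/1 indicator columns, one per
# non-reference level, losing no geometric detail.
d <- data.frame(safety=factor(c('low', 'med', 'high')))
print(model.matrix(~ safety, data=d))
##   (Intercept) safetylow safetymed
## 1           1         1         0
## 2           1         0         1
## 3           1         0         0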
The point is: encoding multiple levels of a categorical variable into a single number may seem reversible to a person (as it is a 1-1 map), but some machine learning methods cannot undo the geometric detail lost in such an encoding. For example: with a linear method (be it regression, logistic regression, a linear SVM, or so on) we lose explanatory power unless the encoding has correctly guessed both the order of the levels and their relative magnitudes. Even tree-based methods (like decision trees, or even random forests) waste part of their explanatory power (roughly, degrees of freedom) trying to invert the encoding, leaving less power remaining to explain the original relation in the data. This sort of ad-hoc encoding may not cause much harm in this one example, but it is exactly what you don't want to do when there are a great number of levels, when the order isn't obvious, or when you are comparing different methods (as different methods are damaged to different degrees by this encoding).
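A tiny illustration of this loss (my toy example, not from the paper): an outcome that singles out the middle level of a three-level factor is invisible to a model that is linear in a single code number, but is trivially recovered with indicators.

# Toy example: y depends only on the middle level 'med'.
lev <- factor(rep(c('low', 'med', 'high'), each=10),
              levels=c('low', 'med', 'high'))
code <- as.numeric(lev)   # single-number encoding: 1, 2, 3
y <- as.numeric(lev=='med')
r2code <- summary(lm(y ~ code))$r.squared  # essentially 0: linear in the code sees nothing
r2lev <- summary(lm(y ~ lev))$r.squared    # exactly 1: indicator variables recover it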
This sort of conversion of categorical features through an arbitrary function is something we have seen a few times. It is one of the reasons we explicitly discuss indicator variables in Practical Data Science with R, despite the common wisdom that "everybody already knows about them." When you are trying to get the best possible results for a client, you don't want to inflict avoidable errors in your data transforms.
If you absolutely don't want to use indicator variables, consider impact coding or a safe automated transform such as vtreat. In both cases the actual training data is used to estimate the order and relative magnitudes of an encoding that would be useful for downstream modeling; a rough sketch of the idea follows.
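Here is a rough sketch of impact coding (my illustration of the general idea; impactCode and the smoothing constant eps are hypothetical choices of mine, not vtreat's actual implementation):

# Sketch of impact coding: replace each level with a smoothed estimate of
# how much that level shifts the outcome logit, computed on training data.
impactCode <- function(x, y, eps=0.5) {
  counts <- tapply(rep(1, length(y)), x, sum)  # examples per level
  hits <- tapply(as.numeric(y), x, sum)        # positive outcomes per level
  rates <- (hits + eps)/(counts + 2*eps)       # smoothed P(y | level)
  logit <- function(p) { log(p/(1-p)) }
  codes <- logit(rates) - logit(mean(y))       # impact relative to base rate
  codes[as.character(x)]
}

# example: code the 'safety' variable against the 'unacc' outcome
safetyImpact <- impactCode(tab$safety, tab$class=='unacc')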
Is there any actual damage in this encoding? Let’s load the processed data set and see.
url2 <- 'http://winvector.github.io/uciCar/car_R.dat'
# the processed file is tab separated with a header row
dTreated <- read.table(url2, sep='\t', header=TRUE)
The original data set supports a pretty good logistic regression model for unacceptable cars:
set.seed(32353)
train <- rbinom(dim(tab)[[1]], 1, 0.5)==1
m1 <- glm(class=='unacc'~buying+maint+doors+persons+lug_boot+safety,
          family=binomial(link='logit'),
          data=tab[train,])
tab$pred <- predict(m1, newdata=tab, type="response")
print(table(class=tab[!train,'class'],
            unnacPred=tab[!train,'pred']>0.5))
##        unnacPred
## class   FALSE TRUE
##   acc     181   18
##   good     30    0
##   unacc    22  577
##   vgood    35    0
The transformed data set does not support as good a logistic regression model:
m2 <- glm(clase==1~f1+f2+f3+f4+f5+f6,
          family=binomial(link='logit'),
          data=dTreated[train,])
dTreated$pred <- predict(m2, newdata=dTreated, type="response")
print(table(class=dTreated[!train,'clase'],
            unnacPred=dTreated[!train,'pred']>0.5))
##        unnacPred
## class   FALSE TRUE
##   0        35    0
##   1        43  556
##   2       118   81
##   3        28    2
Now obviously some modeling methods are more sensitive to this mis-coding than others. In fact, for a moderate number of levels you would expect random forest methods to largely invert the coding. But the fact that some methods are more affected than others is exactly why you don't want to perform this encoding before making comparisons. As to the question of why ever use logistic regression: because when you have a proper encoding of the data and the model structure is in fact somewhat linear, logistic regression can be a very good method.
In the DWN paper, 8 data sets (out of 123) have the a*k+b fragment in their le_datos.m file. So likely the study was largely driven by data sets that natively have only numeric features. Also, we emphasize that the DWN paper shared its data and a bit of its methods, which puts it light-years ahead of most published empirical studies. The only reason we can't critique other authors in the same way is that most don't share their work.
It always surprises statisticians that the indicator-variable trick is not always first in mind. This means we forget to teach and re-teach the method enough. We also need to do more to root out the incorrect alternatives to the method. Indicator encoding is sometimes hard to point out, as it is either not done correctly or done silently.
In R, strings and factors can be treated as single columns or variables and are silently converted during model training and application (or can be explicitly expanded using model.matrix()). Oddly enough, R also goes out of its way to provide a publicly visible "convert to numbers by using interior codes" method (data.matrix()), which in my opinion is almost always the wrong method and lures unsuspecting programmers and engineers into error. I have written on this before, but if anything I failed to fully appreciate how pervasive the incorrect practice is.
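To see the trap concretely (my illustration, reusing the toy safety factor from above):

# data.matrix() silently replaces each factor level with its interior
# integer code (levels ordered alphabetically: high=1, low=2, med=3),
# which is exactly the kind of single-number mis-coding discussed above.
d <- data.frame(safety=factor(c('low', 'med', 'high')))
print(data.matrix(d))
##      safety
## [1,]      2
## [2,]      3
## [3,]      1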
Python's scikit-learn supplies the correct encoding methods in sklearn.feature_extraction.DictVectorizer and sklearn.preprocessing.OneHotEncoder(). I think a lot of Python users get confused because they do not appreciate that Pandas (which deals so well with data representation) and scikit-learn (which really only wants to work with numbers) are two independent packages (coded not to depend on each other), and some work is required to faithfully move data from one package to the other.
Note: as expected, randomForest does a better job of reversing the re-encoding. Also, we accidentally left out the variable f6 in an early version of this post.
library(randomForest)

m1F <- randomForest(as.factor(class=='unacc')~
                      buying+maint+doors+persons+lug_boot+safety,
                    data=tab[train,])
tab$predF <- predict(m1F, newdata=tab, type="response")
print(table(class=tab[!train,'class'],
            unnacPred=tab[!train,'predF']))
##        unnacPred
## class   FALSE TRUE
##   acc     193    6
##   good     30    0
##   unacc     9  590
##   vgood    35    0

m2F <- randomForest(as.factor(clase==1)~f1+f2+f3+f4+f5+f6,
                    data=dTreated[train,])
dTreated$predF <- predict(m2F, newdata=dTreated, type="response")
print(table(class=dTreated[!train,'clase'],
            unnacPred=dTreated[!train,'predF']))
##        unnacPred
## class   FALSE TRUE
##   0        35    0
##   1        10  589
##   2       193    6
##   3        30    0
And we can confirm the encoding is in fact reversible by showing that the variables and outcomes are in bijective correspondence. This means something as simple as changing the type/class declaration from real to string/factor would undo the coding problem (a sketch of this fix follows the correspondence tables below). The machine learning method doesn't need to know the original names of the levels; it just needs to know to treat the data as levels.
print(table(tab$class,dTreated$clase))
##            0    1    2    3
##   acc      0    0  384    0
##   good     0    0    0   69
##   unacc    0 1210    0    0
##   vgood   65    0    0    0
print(table(tab$buying,dTreated$f1))
##         -1.34125 -0.447084 0.447084 1.34125
##   high         0       432        0       0
##   low          0         0        0     432
##   med          0         0      432       0
##   vhigh      432         0        0       0
print(table(tab$maint,dTreated$f2))
##         -1.34125 -0.447084 0.447084 1.34125
##   high         0       432        0       0
##   low          0         0        0     432
##   med          0         0      432       0
##   vhigh      432         0        0       0
print(table(tab$doors,dTreated$f3))
##         -1.52084 -0.168982 0.506946 1.18287
##   2          432         0        0       0
##   3            0       432        0       0
##   4            0         0        0     432
##   5more        0         0      432       0
print(table(tab$persons,dTreated$f4))
##        -1.22439   0 1.22439
##   2         576   0       0
##   4           0 576       0
##   more        0   0     576
print(table(tab$lug_boot,dTreated$f5))
##         -1.22439   0 1.22439
##   big          0   0     576
##   med          0 576       0
##   small      576   0       0
print(table(tab$safety,dTreated$f6))
##        -1.22439   0 1.22439
##   high        0   0     576
##   low       576   0       0
##   med         0 576       0
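As a sketch of that fix (mine, not from the paper): simply re-declare the mis-coded numeric columns as factors before modeling. I omit the re-run results; the point is just the type change.

# Re-declare the mis-coded numeric features as factors; models can then
# treat the distinct values as levels and recover the lost structure.
dFactor <- dTreated
for(col in c('f1', 'f2', 'f3', 'f4', 'f5', 'f6')) {
  dFactor[[col]] <- as.factor(dFactor[[col]])
}
m3 <- glm(clase==1~f1+f2+f3+f4+f5+f6,
          family=binomial(link='logit'),
          data=dFactor[train,])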
Comment: Thank you for this article. I also questioned why dummy variables were not used instead of a numerical transformation of the categorical variables. One other question: given the model, aren't there linear-dependence (contrast) issues with logistic regression if the dummy variables represent all levels of the original categorical variable?
Reply (jmount): There is an issue when different indicator/dummy variables end up being linearly dependent (or even nearly so). Basically you lose a lot of the interpretability of the coefficients (as you at best get bounds on linear subspaces, not on values). But if your only goal is to make predictions (as it often is for data scientists, though not always for statisticians), then simple precautions like regularization give you good models.