Menu Home

Added worked example to logistic regression project

We have added a worked example to the README of our experimental logistic regression code.

The Logistic codebase is designed to support experimentation on variations of logistic regression including:

What we mean by this code being “experimental” is that it has capabilities that many standard implementations do not. In fact most of the items in the above list are not usually made available to the logistic regression user. But our project is also stand-alone and not as well integrated into existing workflows as standard production systems. Before trying our code you may want to try R or Mahout.In principle running the code is easy: all you do is supply a training file as a TSV (tab separated file) or CSV (comma separated file), write down the column you want to predict as a schematic formula of the columns you wish to use in your model. In practice it is a bit harder: you have to have already set up your Java or Hadoop environment to bring in all required dependencies.

Setting up a Hadoop configuration can range from simple (like the single machine tests in our projects JUnit test suite) to complicated. Also, for non-trivial clusters you often do not control the configuration (i.e. somebody else supplies the cluster environment). So we really can’t tell you how to set up your non-trivial Hadoop environment (just too many variables). Some time ago we supplied a complete example of how to run an example on Amazon’s Elastic Map Reduce (but that was in 2010, so the environment may have changed a bit. However the current code runs at least on Hadoop versions 0.20.0, 1.0.0 and 1.0.3 without modification).

Our intent was never to depend on Hadoop except in the case of very large data (and even then there are other options like sub-sampling and Mahout). So we supplied direct Java command line options. It is time to re-work a simple example of using these options (this example is now in the project README file). We repeat this example here. We are assuming a command-line environment (for example the Bash shell on OSX or Linux, but this can be done on Windows using either CMD or Cygwin).

We show what works for us in our environment, you will have to adapt to your environment as it differs.

Example of running at the command line (using some Apache support classes, but not running under Hadoop):

  1. Get a data file. For our example we took the data file from the UCI Iris data example saved it and added a header line of the form “SepalLength,SepalWidth,PetalLength,PetalWidth,TrainingClass” to make the file machine readable. The edited file is available here: .

  2. Get all the supporting code you need and set your Java CLASSPATH. To run this you need all of the classes from:

    This is complicated- but it is one time set up cost. In practice you would not manipulate classes directly at the command line by use an IDE like Eclipse or a build manager like Maven to do all of the work. But not everybody uses the same tools and tools bring in even more dependencies and complications; so we show how to set up the class paths directly.

    In our shell (bash on OSX) we set our class variable as follows:


    We are using where we put the files “/Users/johnmount/project/” and the path separator “:” (separator is “;” on Windows). The path you would use would depend on where you put the files you downloaded.

  3. In the directory you saved the training data file run the logistic training procedure:

    java -cp $CLASSES com.winvector.logistic.LogisticTrain -trainURI -sep , -formula "TrainingClass ~ SepalLength + SepalWidth + PetalLength + PetalWidth" -resultSer iris_model.ser

    This produces iris_model.ser , the trained model. The diagnostic printouts show the confusion matrix (tabulation of training class versus predicted class) and show a high degree of training accuracy.

    INFO: Consfusion matrix:
    prediction	actual	actual	actual
    index:outcome	0:Iris-setosa	1:Iris-versicolor	2:Iris-virginica
    0:Iris-setosa	50	0	0
    1:Iris-versicolor	0	47	1
    2:Iris-virginica	0	3	49

    Notice that there are only 4 training errors (1 Iris-verginica classified as Iris-versicolor and 3 iris-versicolor classified as Irs-virginica).

    The model coefficients are also printed as part of the diagnostics.

    INFO: soln details:
    outcome	outcomegroup	variable	kind	level	value
    Iris-setosa	0		Const		0.774522294889561
    Iris-setosa	0	PetalLength	Numeric		-5.192578560594749
    Iris-setosa	0	PetalWidth	Numeric		-2.357410090972314
    Iris-setosa	0	SepalLength	Numeric		1.69382466234698
    Iris-setosa	0	SepalWidth	Numeric		3.9224697382723903
    Iris-versicolor	1		Const		1.8730846627542541
    Iris-versicolor	1	PetalLength	Numeric		-0.3747220505092903
    Iris-versicolor	1	PetalWidth	Numeric		-2.839314336523609
    Iris-versicolor	1	SepalLength	Numeric		1.1786843497402208
    Iris-versicolor	1	SepalWidth	Numeric		0.2589801610139257
    Iris-virginica	2		Const		-2.6476069576668615
    Iris-virginica	2	PetalLength	Numeric		5.567416774143793
    Iris-virginica	2	PetalWidth	Numeric		5.196786825681218
    Iris-virginica	2	SepalLength	Numeric		-2.8725550015586947
    Iris-virginica	2	SepalWidth	Numeric		-4.1815354814927215
  4. In the same directory run the logistic scoring procedure:

    java -cp $CLASSES com.winvector.logistic.LogisticScore -dataURI -sep , -modelFile iris_model.ser -resultFile iris.scored.tsv

    This produces iris.scored.tsv , the final result. The scored file is essentially the input file (in this case copied over with a few prediction columns (predicted category, confidence in predicted category and probability for each possible outcome category) prepended.

For the Hadoop demonstrations of training and scoring the commands are as follows (though obviously some of the details depend on your Hadoop set-up):

  1. Get or build a jar containing the Logistic code, SQLScrewdriver code and free portions of the COLT library (pre built: WinVectorLogistic.Hadoop0.20.2.jar).
  2. Make sure the training file is tab-separated (instead of comma separated). For example the iris data in such a format is here:
  3. Run the Hadoop version of the trainer:

    /Users/johnmount/project/hadoop-1.0.3/bin/hadoop jar /Users/johnmount/project/Logistic/WinVectorLogistic.Hadoop0.20.2.jar logistictrain "TrainingClass ~ SepalLength + SepalWidth + PetalLength + PetalWidth" iris_model.ser

  4. Run the Hadoop version of the scorring function:

    /Users/johnmount/project/hadoop-1.0.3/bin/hadoop jar /Users/johnmount/project/Logistic/WinVectorLogistic.Hadoop0.20.2.jar logisticscore iris_model.ser scoreDir

Scored output is left in Hadoop format in the user specified scoreDir (slightly different format than the stand-alone programs). The passes take quite a long time due to the overhead of setting up and tearing down a Hadoop environment for such a small problem. Also if you are running in a serious Hadoop environment (like elastic map-reduce) you will have to change certain file names to the type of URI type the system is using. In our elastic map reduce example we used S3 containers which had forms like: “s3n://bigModel.ser” and so on.

The Hadoop code can also be entered using the main()s found in com.winvector.logistic.demo.MapReduceLogisticTrain and com.winvector.logistic.demo.MapReduceScore . This allows interactive debugging through an IDE (like Eclispe) without having go use the “hadoop” command.

Again: this is experimental code. It can do things other code bases can not. If you need one of its features or capabilities it is very much worth the trouble. But if you can make do with a standard package like R you have less trouble and are more able to interact with others work.

Categories: Coding Computer Science Mathematics

Tagged as:


Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

%d bloggers like this: