Friday, October 9, 2015

Evaluation of Optimizers for Logistic Regression in Apache Spark

A simple comparison of the stochastic gradient descent (SGD) and the quasi-Newton limited-memory BFGS (L-BFGS) optimizers for binomial classification using logistic regression in Apache Spark MLlib.


Overview
The Apache Spark 1.5.x MLlib library provides developers and data scientists alike with two well-known optimizers for binomial classification using logistic regression:
  • Stochastic gradient descent (SGD)
  • Limited memory version of the Broyden-Fletcher-Goldfarb-Shanno (L-BFGS)
I thought it would be instructive to compare the two optimization methods on the accuracy of logistic regression as a binomial classifier.
Note: MLlib in Apache Spark 1.5.1 does not support multinomial classification; it may be added in future versions.

Logistic regression
Logistic regression is one of the most commonly used discriminative supervised learning models, mainly because it is intuitive and simple. It relies on the logistic function (refer to An Introduction to Logistic and Probit Regression Models).
In a classification problem, the probability that an observation x belongs to a class C is computed as \[p(C|x)=\frac{1}{1+e^{-w_{0}-w^{T}x}}\] where w are the weights or model coefficients.
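As a quick illustration, here is a minimal sketch of the logistic function and the resulting decision rule, assuming the weights w0 and w have already been estimated:

// Minimal sketch of the logistic (sigmoid) function and the
// decision rule; the weights w0 and w are assumed already trained.
def sigmoid(z: Double): Double = 1.0/(1.0 + Math.exp(-z))

def probability(w0: Double, w: Array[Double], x: Array[Double]): Double =
  sigmoid(w0 + w.zip(x).map{ case (wi, xi) => wi*xi }.sum)

// Predict class C (label 1) when p(C|x) exceeds the 0.5 threshold
def classify(w0: Double, w: Array[Double], x: Array[Double]): Double =
  if (probability(w0, w, x) > 0.5) 1.0 else 0.0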
Apache Spark MLlib has two implementations of logistic regression as a binary classifier:

  • org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS using the L-BFGS optimizer
  • org.apache.spark.mllib.classification.LogisticRegressionWithSGD using the SGD optimizer
Background information on the stochastic gradient descent can be found at Comparing Stochastic Gradient Descent And Batch Gradient Descent
For those interested in the inner workings of the limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm, see Limited Memory BFGS.

Data generation
The scaladocs for the Apache Spark API are available at Apache Spark API.
Let's create a synthetic training set to evaluate and compare the two implementations of logistic regression. The training set for the binomial classification consists of:

  • Two groups of observations with 3 features each, drawn from Gaussian distributions with the same standard deviation and two different means.
  • Two labels (or expected values) {0, 1}, one for each Gaussian distribution
The following diagram illustrates the training set for a single feature.



The margin of separation between the two groups of 3-dimensional observations is computed as mean(first group) - mean(second group). For example, with the first group centered at 1.0 and the second at 3.0, the margin is 2.0. As the margin increases, the accuracy of the binomial classification is expected to increase.

import scala.util.Random
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark.mllib.regression.LabeledPoint

final val SIGMA = 2.0

class DataGenerator(numTasks: Int)(implicit sc: SparkContext) {
  // Random value drawn from a Gaussian distribution centered
  // at 'mean' with standard deviation SIGMA
  private def f(mean: Double): Double = 
      mean + SIGMA*Random.nextGaussian

  def apply(half: Int, mu: Double): Array[LabeledPoint] = {
    // Two groups of 'half' observations with 3 features each,
    // centered at 1.0 and mu respectively
    val trainObs =
      ArrayBuffer.fill(half)(Array[Double](f(1.0),f(1.0),f(1.0))) ++
      ArrayBuffer.fill(half)(Array[Double](f(mu),f(mu),f(mu)))

    // Labels (expected values): 0 for the first group, 1 for the second
    val labels = ArrayBuffer.fill(half)(0.0) ++ 
                  ArrayBuffer.fill(half)(1.0)
    labels.zip(trainObs).map{ case (y, ar) => 
          LabeledPoint(y, new DenseVector(ar)) }.toArray
  }
}

The method apply generates the two groups of half observations each, following Gaussian distributions centered at 1.0 and mu, respectively.
Next, we create the two sets of labels, 0 and 1, which are zipped with the observations to generate the Apache Spark labeled points.
The Apache Spark LogisticRegression classes process LabeledPoint instances, which in this particular case are built from DenseVector wrappers of the observations.


Test
The first step consists of initializing the Apache Spark environment, using the SparkConf and SparkContext classes.
import org.apache.spark.{SparkConf, SparkContext}

val numTasks: Int = 64
val conf = new SparkConf().setAppName("LogisticRegr")
   .setMaster(s"local[$numTasks]")
val sc = new SparkContext(conf)
sc.setLogLevel("ERROR")

  // Execution of Spark driver code here...
sc.stop

The next step is to generate the training and validation sets. The validation set is used at a later stage to compare the accuracy of the respective models.
val halfTrainSet = 32000
val dataGenerator = new DataGenerator(numTasks)(sc)

  // 'mean' is the center of the second group of observations,
  // i.e. the margin under evaluation
val trainSet = dataGenerator(halfTrainSet, mean)
val validationSet = dataGenerator(halfTrainSet, mean)

It is now time to instantiate the two logistic regression classifiers and generate two distinct models. You need to make sure that the optimization parameters (tolerance, number of iterations) are identical for both models.
This implementation uses the logistic regression from MLlib with its built-in stochastic gradient descent. A customized gradient descent can be defined by using the standalone GradientDescent class from MLlib.
For this example, the optimization parameters are purely arbitrary. MLlib uses RDDs as input for the training and validation sets, while the logistic regression in ML uses instances of DataFrame.
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

val logRegrSGD = new LogisticRegressionWithSGD
logRegrSGD.optimizer.setNumIterations(1000)
logRegrSGD.optimizer.setConvergenceTol(0.02)
logRegrSGD.setIntercept(true)

  // Convert the training set into an RDD and train the model
val inputRDD = sc.makeRDD(trainSet, numTasks)
val model = logRegrSGD.run(inputRDD)

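The L-BFGS counterpart is configured the same way. A minimal sketch, assuming the same iteration budget and convergence tolerance as the SGD model:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// L-BFGS-based logistic regression with the same optimization
// parameters as the SGD version, for a fair comparison
val logRegrLBFGS = new LogisticRegressionWithLBFGS
logRegrLBFGS.optimizer.setNumIterations(1000)
logRegrLBFGS.optimizer.setConvergenceTol(0.02)
logRegrLBFGS.setIntercept(true)
val modelLBFGS = logRegrLBFGS.run(inputRDD)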
Validation
Now it is time to use the validation set to compute the mean sum of square errors and the accuracy of each predictor for different values of the margin.
We need to define and implement a validation framework or class, simple but relevant enough for our evaluation. The first step is to specify the quality metrics as follows:
  • metrics: the metrics (F-measure by threshold) produced by the Spark logistic regression
  • msse: the mean sum of square errors
  • accuracy: the accuracy of the classification
The quality metrics are defined in the Quality class as described in the following code snippet.
case class Quality(
   metrics: Array[(Double, Double)], 
   msse: Double, 
   accuracy: Double) {

  // Report the root of the mean sum of square errors
  override def toString: String =
     s"""Metrics: ${metrics.mkString(",")}
        |msse = ${Math.sqrt(msse)} accuracy = $accuracy""".stripMargin
}

Let's implement our validation class, BinomialValidation, for the binomial classification. An instance is created from the Spark context sc, the logistic regression model generated through training, and the number of partitions or tasks used by the data nodes.

import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint

final class BinomialValidation(
   sc: SparkContext, 
   model: LogisticRegressionModel, 
   numTasks: Int) {

  def metrics(validationSet: Array[LabeledPoint]): Quality = {
    // Segregate the expected values (labels) from the observations
    val featuresLabels = validationSet.map( lbPt => 
        (lbPt.label, lbPt.features)).unzip
    val predictedRDD = model.predict(
          sc.makeRDD(featuresLabels._2, numTasks)
    )

    // Pair each prediction with its expected label:
    // BinaryClassificationMetrics expects (score, label) tuples
    val scoreAndLabels = predictedRDD.zip(
            sc.makeRDD(featuresLabels._1, numTasks))

    // Correct classifications: predictions within 0.1 of the label
    val successes = scoreAndLabels.map{ 
         case (p, e) => Math.abs(e - p) }.filter( _ < 0.1)
    // Mean sum of square errors over the validation set
    val msse = scoreAndLabels.map{ 
         case (p, e) => (e - p)*(e - p) }.sum/validationSet.length

    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    Quality(metrics.fMeasureByThreshold().collect, 
         msse, 
         successes.count.toDouble/validationSet.length)
  }
}

The method metrics converts the validation set, validationSet, into RDDs after segregating the expected values from the observations (unzip). The result of the prediction, predictedRDD, is then zipped with the expected labels into the evaluation set, scoreAndLabels, from which the different quality metrics, such as successes and msse, are extracted.
The computation of the metrics themselves is performed by the BinaryClassificationMetrics class from MLlib. Finally, the validation is applied to the logistic model, with the classification threshold set to 0.1.

model.setThreshold(0.1)
val validator = new BinomialValidation(sc, model, numTasks)
val quality = validator.metrics(validationSet)
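Beyond the F-measure collected in Quality, BinaryClassificationMetrics exposes other ranking metrics that could enrich the evaluation. A minimal sketch, reusing the scoreAndLabels RDD built inside BinomialValidation:

// Other metrics available from BinaryClassificationMetrics;
// scoreAndLabels pairs each prediction with its expected label
val binMetrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = binMetrics.areaUnderROC        // area under the ROC curve
val auPR = binMetrics.areaUnderPR          // area under the precision-recall curve
val precisions = binMetrics.precisionByThreshold.collect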

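Putting the pieces together, the comparison can be repeated for several values of the margin. Here is a minimal sketch of the outer evaluation loop; the trainModel helper is hypothetical and is assumed to wrap the SGD or L-BFGS training code shown earlier:

// Hypothetical driver loop: sweep the center of the second group of
// observations (hence the margin) and evaluate both optimizers
Seq(1.5, 2.0, 2.5, 3.0).foreach { mean =>
  val trainSet = dataGenerator(halfTrainSet, mean)
  val validationSet = dataGenerator(halfTrainSet, mean)

  Seq(false, true).foreach { useLBFGS =>
    // trainModel is assumed to wrap the instantiation and run(..) calls
    val model = trainModel(trainSet, useLBFGS)
    model.setThreshold(0.1)
    val validator = new BinomialValidation(sc, model, numTasks)
    println(s"margin=${mean - 1.0} lbfgs=$useLBFGS ${validator.metrics(validationSet)}")
  }
}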
Results
The fact that the L-BFGS optimizer provides a significantly more accurate result (or a lower mean sum of square errors) than the stochastic gradient descent is not a surprise. However, the lack of convergence of the SGD version merits further investigation.
Note: This post is a brief comparison of the two optimizers in terms of accuracy on a simple synthetic data set. It is important to keep in mind that the stochastic gradient descent has a lower computational cost per iteration than L-BFGS, or any quasi-Newton method for that matter, because it does not require the estimation of the Hessian matrix (second-order derivatives).




References
An Introduction to Logistic and Probit Regression Models
"Machine Learning: A Probabilistic Perspective", Chapter 8, Logistic Regression, K. Murphy, MIT Press, 2012
