Target audience: Intermediate

Estimated reading time: 20'

Overview

Simple comparison of the stochastic gradient descent with the quasi-Newton Limited memory BFGS optimizer for the binomial classification using the logistic regression in Apache Spark MLlib

Apache Spark 1.5.x MLlib library provides developers and data scientists alike, two well known optimizers for the binomial classification using the logistic regression. - Stochastic gradient descent (
*SGD*) - Limited memory version of the Broyden-Fletcher-Goldfarb-Shanno (
*L-BFGS*)

__Note__: MLlib in Apache Spark 1.5.1 does not support multinomial classification. It may be added in the future versions.

Logistic regression

The logistic regression is one of the most commonly used discriminative supervised learning model mainly because it is intuitive and simple. It relies on the logistic function (refer to An Introduction to Logistic and Probit Regression Models

In the case of the classification problem, the probability that on observation

Apache Spark MLlib has two implementations of the logistic regression as a binary classifier

For those interested in inner workings of the limited memory Broyden-Fletcher-Goldfarb-Shanno algorithm, Limited Memory BFGS

In the case of the classification problem, the probability that on observation

**x**belong to a class**C**is computed as \[p(C|x)=\frac{1}{1+e^{-w_{0}-w^{T}x}}\] where**w**are the weights or model coefficients.Apache Spark MLlib has two implementations of the logistic regression as a binary classifier

*org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS*using the L-BFGS optimizer*org.apache.spark.mllib.classification.LogisticRegressionWithSGD*using the SGD optimizer

For those interested in inner workings of the limited memory Broyden-Fletcher-Goldfarb-Shanno algorithm, Limited Memory BFGS

Data generation

The scaladocs for Apache Spark API are available at Apache Spark API

Let's create a synthetic training set to evaluate and compare the two implementation of the logistic regression. The training set for the binomial classification consist of

The following diagram illustrates the training set for a single feature.

Let's create a synthetic training set to evaluate and compare the two implementation of the logistic regression. The training set for the binomial classification consist of

- Two data sets of observations with 3 features, each following a data distribution with same standard deviation and two different means.
- Two labels (or expected values) {0, 1}, one for each Gaussian distribution

The margin of separation between the two groups of observations of 3 dimension is computed as mean(first group) - mean (second group). As the margin increases the accuracy of the binomial classification is expected to increase.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | final val SIGMA = 2.0 class DataGenerator(numTasks: Int)(implicit sc: SparkContext) { private def f(mean: Double): Double = mean + SIGMA*(Random.nextDouble - 0.5) def apply(half: Int, mu: Double): Array[LabeledPoint] = { val trainObs = ArrayBuffer.fill(half)(Array[Double](f(1.0),f(1.0),f(1.0))) ++ ArrayBuffer.fill(half)(Array[Double](f(mu),f(mu),f(mu))) val labels = ArrayBuffer.fill(half)(0.0) ++ ArrayBuffer.fill(half)(1.0) labels.zip(trainObs).map{ case (y, ar) => LabeledPoint(y, new DenseVector(ar)) }.toArray } } |

The method

Next we create two sets of labels 0 and 1 (line 13) that are used to generated the Apache Spark labeled points (line 15).

Apache Spark

*apply*generates the two groups of*half*observations following normal distribution of mean 1.0 and 1.0 + mu. (line 9 and 11).Next we create two sets of labels 0 and 1 (line 13) that are used to generated the Apache Spark labeled points (line 15).

Apache Spark

*LogisticRegression*classes process*LabeledPoint*instances which are generated in this particular case from*DenseVector*wrappers of the observations.

Use case

The first step consists of initializing the Apache spark environment, using

*SparkConf*and*SparkContext*classes (line 2 & 4).1 2 3 4 5 6 7 8 | val numTasks: Int = 64 val conf = new SparkConf().setAppName("LogisticRegr") .setMaster(s"local[$numTasks]") val sc = new SparkContext(conf) sc.setLogLevel("ERROR") // Execution of Spark driver code here... sc.stop |

The next step is to generate the training and validation set. The validation set is used at a later stage for comparing the accuracy of the respective model (line 5).

1 2 3 4 5 | val halfTrainSet = 32000 val dataGenerator = new DataGenerator(numTasks)(sc) val trainSet = dataGenerator(halfTrainSet, mean) val validationSet = dataGenerator(halfTrainSet, mean) |

It is now time to instantiate the two logistic regression classifiers and generate two distinct models. You need to make sure that the parameters (tolerance, number of iterations) are identical for both models.

This implementation uses the Logistic regression from

For this example, the optimization parameters (line 2 & 3) are purely arbitrary. MLlib uses RDD as input for training and validation set (line 4) while the logistic regression in ML uses instances of

This implementation uses the Logistic regression from

*MLlib*that uses a pre-canned*stochastic gradient descent*(line 1). A customized gradient descent can be defined by using the standalone*SGD*class from MLlib.For this example, the optimization parameters (line 2 & 3) are purely arbitrary. MLlib uses RDD as input for training and validation set (line 4) while the logistic regression in ML uses instances of

*DataFrame*.1 2 3 4 5 6 | val logRegrSGD = new LogisticRegressionWithSGD logRegrSGD.optimizer.setNumIterations(1000) logRegrSGD.optimizer.setConvergenceTol(0.02) val inputRDD = sc.makeRDD(trainingSet, numTasks) logisticRegression.setIntercept(true) val model = logisticRegression.run(inputRDD) |

Validation

Now it is time to use the validation set to compute the mean sum of square error and the accuracy of each predictor for different values of margin.

We need to define and implement a validation framework or class, simple but relevant enough for our evaluation. The first step is to specify the quality metrics as follows

We need to define and implement a validation framework or class, simple but relevant enough for our evaluation. The first step is to specify the quality metrics as follows

**metrics**produced by the Spark logistic regression**mSSE**Mean sum of square errors**accuracy**accuracy of the classification

*Quality*class as described in the following code snippet.1 2 3 4 5 6 7 8 9 | case class Quality( metrics: Array[(Double, Double)], msse: Double, accuracy: Double) { override def toString: String = s"Metrics: ${metrics.mkString(",")}\n |msse = ${Math.sqrt(msse)} accuracy = $accuracy" } |

Let's implement our validation class,

*BinomialValidation*for the binomial classification. The validation is created using the spark context*sc*, the logistic regression*model*generated through training and the number of partitions or*tasks*used in the data nodes.1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 | final class BinomialValidation( sc: SparkContext, model: LogisticRegressionModel, numTasks: Int) { def metrics(validationSet: Array[LabeledPoint]): Quality = { val featuresLabels = validationSet.map( lbPt => (lbPt.label, lbPt.features)).unzip val predicted_rdd = model.predict( sc.makeRDD(featuresLabels._2, numTasks) ) val scoreAndLabels = sc.makeRDD(featuresLabels._1, numTasks).zip(predicted_rdd) val successes = scoreAndLabels.map{ case(e,p) => Math.abs(e-p) }.filter( _ < 0.1) val msse = scoreAndLabels.map{ case (e,p) => (e-p)*(e-p)}.sum val metrics = new BinaryClassificationMetrics(scoreAndLabels) Quality(metrics.fMeasureByThreshold().collect, msse, successes.count.toDouble/validationSet.length) } } |

The method

The computation of metrics is actually performed by the

*metrics*(line 6) converts the validation set,*validationSet*into a RDD (line 7, 8) after segregating the expected values from the observations (*unzip*). The results of the prediction,*prediction_rdd*(line 9) is then zipped with the labeled values into the evaluation set,*scoreAndLabels*(line 13-14) from which the different quality metrics such as*successes*(line 16) and*muse*(line 18) are extracted.The computation of metrics is actually performed by the

*BinaryClassificationMetrics*MLlib class. Finally, the validation is applied on the logistic model with a convergence tolerance 0.11 2 3 | model.setThreshold(0.1) val validator = new BinomialValidation(sc, model, numTasks) val quality = validator.metrics(validationSet) |

Results

References

The fact that the L-BFGS optimizer provides a significant more accurate result (or lower mean sum of square errors) that the stochastic gradient descent is not a surprise. However, the lack of convergence of the SGD version merit further investigation.

__Note__: This post is a brief comparison of the two optimizer in terms of accuracy on a simple synthetic data set. It is important to keep in mind that the stochastic gradient descent has better performance overall than L-BFGS or any quasi-Newton method for that matter, because it does not require the estimation of the hessian metric (second order derivative).References

- An Introduction to Logistic and Probit Regression Models
*Machine Learning: A probabilistic perspective*Chapter 8 Logistic Regression" K. Murphy - MIT Press 2012