**Overview**
Apache Spark 1.5.x MLlib library provides developers and data scientists alike, two well known optimizers for the binomial classification using the logistic regression.
- Stochastic gradient descent (
*SGD*)
- Limited memory version of the Broyden-Fletcher-Goldfarb-Shanno (
*L-BFGS*)

I thought it would be instructive to compare the two optimization methods on the accuracy of the logistic regression as a binomial classification.

__Note__: MLlib in Apache Spark 1.5.1 does not support multinomial classification. It may be added in the future versions.

__Note__Background information on the stochastic gradient descent can be found at
**Logistic regression**
The logistic regression is one of the most commonly used discriminative supervised learning model mainly because it is intuitive and simple. It relies on the logistic function (refer to An Introduction to Logistic and Probit Regression Models

In the case of the classification problem, the probability that on observation **x** belong to a class **C** is computed as
\[p(C|x)=\frac{1}{1+e^{-w_{0}-w^{T}x}}\]
where **w** are the weights or model coefficients.

Apache Spark MLlib has two implementations of the logistic regression as a binary classifier
*org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS* using the L-BFGS optimizer
*org.apache.spark.mllib.classification.LogisticRegressionWithSGD* using the SGD optimizer

Background information on the stochastic gradient descent can be found at
Comparing Stochastic Gradient Descent And Batch Gradient Descent

For those interested in inner workings of the limited memory Broyden-Fletcher-Goldfarb-Shanno algorithm, Limited Memory BFGS
**Data generation**
The scaladocs for Apache Spark API are available at Apache Spark API

Let's create a synthetic training set to evaluate and compare the two implementation of the logistic regression. The training set for the binomial classification consist of
- Two data sets of observations with 3 features, each following a data distribution with same standard deviation and two different means.
- Two labels (or expected values) {0, 1}, one for each Gaussian distribution

The following diagram illustrates the training set for a single feature.

The margin of separation between the two groups of observations of 3 dimension is computed as mean(first group) - mean (second group). As the margin increases the accuracy of the binomial classification is expected to increase.

final val SIGMA = 2.0
class DataGenerator(numTasks: Int)(implicit sc: SparkContext) {
private def f(mean: Double): Double = mean + SIGMA*(Random.nextDouble - 0.5)
def apply(halfTrainSet: Int, mu: Double): Array[LabeledPoint] = {
val trainObs =
ArrayBuffer.fill(halfTrainSet)(Array[Double](f(1.0), f(1.0), f(1.0))) ++
ArrayBuffer.fill(halfTrainSet)(Array[Double](f(mu), f(mu), f(mu)))
val labels = ArrayBuffer.fill(halfTrainSet)(0.0) ++
ArrayBuffer.fill(halfTrainSet)(1.0)
labels.zip(trainObs).map{ case (y, ar) =>
LabeledPoint(y, new DenseVector(ar)) }.toArray
}
}

The method *apply* generates the two groups of *halfTrainSet* observations following normal distribution of mean 1.0 and 1.0 + mu.

Apache Spark *LogisticRegression* classes process *LabeledPoint* instances which are generated in this particular case from *DenseVector* wrappers of the observations.

**Test**
The first step consists of initializing the Apache spark environment.

val numTasks: Int = 64
val conf = new SparkConf().setAppName("LogisticRegr").setMaster(s"local[$numTasks]")
val sc = new SparkContext(conf)
sc.setLogLevel("ERROR")
// Execution of Spark driver code here...
sc.stop

The next step is to generate the training and validation set. The validation set is used at a later stage for comparing the accuracy of the respective model.

val halfTrainSet = 32000
val dataGenerator = new DataGenerator(numTasks)(sc)
val trainSet = dataGenerator(halfTrainSet, mean)
val validationSet = dataGenerator(halfTrainSet, mean)

It is now time to instantiate the two logistic regression classifiers and generate two distinct models. You need to make sure that the parameters (tolerance, number of iterations) are identical for both models.

val logRegrSGD = new LogisticRegressionWithSGD
logRegrSGD.optimizer.setNumIterations(1000)
logRegrSGD.optimizer.setConvergenceTol(0.02)
val inputRDD = sc.makeRDD(trainingSet, numTasks)
logisticRegression.setIntercept(true)
val model = logisticRegression.run(inputRDD)

**Validation**
Now it is time to use the validation set to compute the mean sum of square error and the accuracy of each predictor for different values of margin.

We need to define and implement a validation framework or class, simple but relevant enough for our evaluation. The first step is to specify the quality metrics as follows
**metrics** produced by the Spark logistic regression
**mSSE** Mean sum of square errors
**accuracy** accuracy of the classification

The quality metrics are defined in the *Quality* class as described in the following code snippet.
case class Quality(
metrics: Array[(Double, Double)],
msse: Double,
accuracy: Double) {
override def toString: String =
s"Metrics: ${metrics.mkString(",")}\nmsse = ${Math.sqrt(msse)} accuracy = $accuracy"
}

Let's implement our validation class, *BinomialValidation* for the binomial classification. The validation is created using the spark context *sc*, the logistic regression *model* generated through training and the number of partitions or *tasks* used in the data nodes.

final class BinomialValidation(
sc: SparkContext,
model: LogisticRegressionModel,
numTasks: Int) {
def metrics(validationSet: Array[LabeledPoint]): Quality = {
val featuresLabels = validationSet.map( lbPt => (lbPt.label, lbPt.features)).unzip
val predicted_rdd = model.predict(sc.makeRDD(featuresLabels._2, numTasks))
val scoreAndLabels = sc.makeRDD(featuresLabels._1, numTasks).zip(predicted_rdd)
val successes = scoreAndLabels.map{ case(e,p) => Math.abs(e-p) }.filter( _ < 0.1)
val msse = scoreAndLabels.map{ case (e,p) => (e-p)*(e-p)}.sum
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
Quality(metrics.fMeasureByThreshold().collect,
msse,
successes.count.toDouble/validationSet.length)
}
}

The method *metrics* converts the validation set, *validationSet* into a RDD after segregating the expected values from the observations (*unzip*). The results of the prediction, *prediction_rdd* is then zipped with the labeled values into the evaluation set, *scoreAndLabels* from which the different quality metrics are extracted.

Finally, the validation is applied on the logistic model with a convergence tolerance 0.1

model.setThreshold(0.1)
val validator = new BinomialValidation(sc, model, numTasks)
val quality = validator.metrics(validationSet)

**Results**
The fact that the L-BFGS optimizer provides a significant more accurate result (or lower mean sum of square errors) that the stochastic gradient descent is not a surprise. However, the lack of convergence of the SGD version merit further investigation.

__Note__: This post is a brief comparison of the two optimizer in terms of accuracy on a simple synthetic data set. It is important to keep in mind that the stochastic gradient descent has better performance overall than L-BFGS or any quasi-Newton method for that matter, because it does not require the estimation of the hessian metric (second order derivative).

**References**