## Wednesday, March 7, 2018

Stochastic gradient descent (SGD) is commonly used to minimize the loss function when training machine learning models such as support vector machines, logistic regression or back-propagation neural networks. In its simplest incarnation, every weight is updated using the same, single learning rate.
However, it is not uncommon for the features of a model to have widely different variances across observations. In this case an adaptive gradient algorithm, which assigns a learning rate to each feature, may be the solution. There are many ways to implement an algorithm that attributes a learning rate to each feature. This post describes the AdaGrad algorithm and its implementation in Apache Spark 2.x.

The stochastic gradient descent optimizer is a randomized approximation of the (batch) gradient descent algorithm used to minimize a continuous, differentiable objective function. In supervised machine learning, the objective function is a loss function such as the logistic loss or the sum of squared errors.
For example, the mean of the squared errors over $n$ observations is

$$L(w)=\frac{1}{n}\sum_{i=1}^{n}(y_{i}-f(x_{i}|w))^{2}$$

The objective function $L$ is expressed as a sum of differentiable functions. In supervised learning, the loss $L_{i}$ related to a specific observation is defined as a continuous, differentiable, convex function:

$$L(w)=\sum_{i=1}^{n}L_{i}(w)$$

The vector $w$ represents the weights (or model parameters). At each iteration, the weights are updated using the formula

$$w_{t+1}=w_{t}-\eta \sum_{i=1}^{n}\nabla L_{i}(w_{t})$$

where $\eta$ is the learning rate and $\nabla L_{i}(w_{t})$ is the vector of partial derivatives of $L_{i}$ with respect to the weights at iteration $t$.

SGD minimizes the loss between the expected values (the labels) and the values predicted by the model. At each iteration, SGD selects a random subset of the observations (known as a mini-batch) and restricts the sum above to that subset. For a convex loss, the iterative process is expected to converge toward the global minimum of the loss function.
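To make the update rule concrete, here is a minimal sketch of mini-batch SGD in plain Scala, without Spark. It assumes a linear model $f(x|w)=w\cdot x$ and the squared-error loss; the object `SgdSketch` and its method names are hypothetical, and the gradient is averaged over the mini-batch rather than summed, which only rescales the learning rate.

```scala
import scala.util.Random

object SgdSketch {
  // An observation is a pair (features x_i, label y_i)
  type Observation = (Array[Double], Double)

  // Linear model f(x | w) = w . x (an assumption made for this sketch)
  def predict(w: Array[Double], x: Array[Double]): Double =
    w.indices.map(j => w(j) * x(j)).sum

  // Mean squared error: L(w) = 1/n * sum_i (y_i - f(x_i | w))^2
  def loss(w: Array[Double], data: Seq[Observation]): Double =
    data.map { case (x, y) => math.pow(y - predict(w, x), 2) }.sum / data.size

  // Gradient of the squared error, averaged over the mini-batch
  def gradient(w: Array[Double], batch: Seq[Observation]): Array[Double] = {
    val g = Array.fill(w.length)(0.0)
    for ((x, y) <- batch) {
      val err = y - predict(w, x)
      for (j <- w.indices) g(j) += -2.0 * err * x(j) / batch.size
    }
    g
  }

  // Repeats w(t+1) = w(t) - eta * gradient on a random mini-batch
  def train(data: Seq[Observation], eta: Double, numIters: Int,
            batchSize: Int, seed: Long): Array[Double] = {
    val rnd = new Random(seed)
    var w = Array.fill(data.head._1.length)(0.0)
    for (_ <- 1 to numIters) {
      val batch = rnd.shuffle(data).take(batchSize)
      val g = gradient(w, batch)
      w = w.indices.map(j => w(j) - eta * g(j)).toArray
    }
    w
  }
}
```

Note that a single learning rate `eta` is shared by all the weights, which is exactly the limitation that a per-feature learning rate is meant to address.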

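The main idea behind AdaGrad is to use a higher learning rate for the sparse features (or model parameters) and a lower learning rate for the denser ones. AdaGrad therefore improves the convergence of the loss minimization for models with sparse features, provided that these sparse features carry information. The update of the weight $w_{i}$ at iteration $t$ is

$$w_{t+1,i}=w_{t,i}-\frac{\eta}{\sqrt{\sum_{\tau=1}^{t}g_{\tau,i}^{2}+\varepsilon}}\,g_{t,i}\quad\text{with}\quad g_{t,i}=\frac{\partial L}{\partial w_{t,i}}$$

where $\eta$ is the base learning rate, the sum accumulates the squared gradients of that weight over the iterations so far, and $\varepsilon$ is a small constant that prevents a division by zero.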
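A small numerical sketch may help to see why sparse features end up with a larger learning rate. The code below computes the per-feature rate, a base rate divided by the square root of the accumulated squared gradients, in plain Scala. The object name `AdaGradRates`, the base rate `eta` and the default value of `eps` are assumptions made for this illustration.

```scala
object AdaGradRates {
  // Per-feature rate: eta / sqrt(sum of squared past gradients + eps)
  def effectiveRates(gradients: Seq[Array[Double]], eta: Double,
                     eps: Double = 1e-8): Array[Double] = {
    val sumSquares = Array.fill(gradients.head.length)(0.0)
    for (g <- gradients; j <- g.indices) sumSquares(j) += g(j) * g(j)
    sumSquares.map(s => eta / math.sqrt(s + eps))
  }
}
```

With the four gradients `(1.0, 0.0)`, `(1.0, 1.0)`, `(1.0, 0.0)`, `(1.0, 0.0)` and `eta = 0.1`, the first (dense) feature accumulates a sum of squares of 4 and gets a rate of roughly 0.05, while the second (sparse) feature accumulates only 1 and keeps a rate of roughly 0.1.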

### SGD in Apache Spark
The Apache Spark MLlib library has two implementations of SGD:
• Generic gradient descent and related classes in the mllib.optimization package
• SGD bundled with classification or regression algorithms such as LogisticRegressionWithSGD, LassoWithSGD, SVMWithSGD or RidgeRegressionWithSGD

We will use the optimization package in order to customize the stochastic gradient descent. The objective is to leverage the mllib.optimization.GradientDescent template class and implement the adaptive gradient with a per-feature learning rate by creating a custom Updater.
The updater "updates the weights of the model" (logistic regression or SVM) with the product of the current learning rate and the partial derivative of the loss with respect to each weight (as described in the previous section). Let's call AdaGradUpdater the updater that implements the update of the model weights using the adaptive gradient. The SGD is then instantiated as follows:

```scala
  .setStepSize(0.01)
  . .....
```
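The updater implements the method compute, declared as:

```scala
Updater.compute(
  oldWeights: Vector,
  gradient: Vector,
  stepSize: Double,
  iter: Int,
  regCoefs: Double): (Vector, Double)
```

The method returns the tuple (vector of new weights, regularization value computed from the new weights).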
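For orientation, here is a plain-Scala analogue of a non-adaptive updater with the same shape as compute, including the gradient argument that MLlib's Updater receives. It is only a sketch: `Array[Double]` stands in for Spark's `Vector` so that the code runs without Spark, the object name `PlainUpdater` is hypothetical, and the decay `stepSize / sqrt(iter)` mirrors, to the best of my knowledge, what MLlib's SimpleUpdater does.

```scala
object PlainUpdater {
  // Same shape as Updater.compute, with Array[Double] standing in for Vector.
  // The second element of the tuple is 0.0 because no regularization is applied.
  def compute(oldWeights: Array[Double], gradient: Array[Double],
              stepSize: Double, iter: Int, regCoefs: Double): (Array[Double], Double) = {
    // A single learning rate, shared by all the weights, decaying with the iteration
    val rate = stepSize / math.sqrt(iter.toDouble)
    val newWeights = oldWeights.indices
      .map(j => oldWeights(j) - rate * gradient(j)).toArray
    (newWeights, 0.0)
  }
}
```

An adaptive updater keeps this signature and replaces the shared `rate` with one rate per weight.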

As mentioned earlier, the implementation of AdaGrad consists of overriding the method Updater.compute.
The computation of the learning rate requires us to record the past values of the squared gradient (previous steps) for each weight in the array gradientHistory. First we define the method += to update the gradient history; the first call to the method creates the gradient history.
The next step consists of converting the existing (old) weights into a Breeze dense vector, brzWeights. The array of new learning rates is computed from the inverseVelocity coefficient.
The learning rates are zipped with the old weights to produce the updated weights, newWeights, as a new dense vector. The linear algebra on the Spark data node is performed under the covers by the Breeze library through calls to brzAxpy and brzNorm.
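The steps above can be sketched in plain Scala as follows. This is not the original listing: it is a hypothetical reconstruction that reuses the names gradientHistory, += and inverseVelocity from the description, replaces Spark and Breeze vectors with `Array[Double]` so that it runs without any dependency, and ignores regularization (the second element of the returned tuple is always 0.0).

```scala
class AdaGradSketch(eps: Double = 1e-8) {
  // Running sum of the squared gradients, one entry per weight
  private var gradientHistory: Array[Double] = Array.empty

  // Accumulates the squared gradient; the first call creates the history
  def +=(gradient: Array[Double]): Unit = {
    if (gradientHistory.isEmpty) gradientHistory = Array.fill(gradient.length)(0.0)
    for (j <- gradient.indices) gradientHistory(j) += gradient(j) * gradient(j)
  }

  // Per-weight coefficient 1 / sqrt(history + eps)
  def inverseVelocity: Array[Double] =
    gradientHistory.map(h => 1.0 / math.sqrt(h + eps))

  // Same shape as Updater.compute; iter and regCoefs are unused in this sketch
  def compute(oldWeights: Array[Double], gradient: Array[Double],
              stepSize: Double, iter: Int, regCoefs: Double): (Array[Double], Double) = {
    this += gradient
    val rates = inverseVelocity
    // Each weight moves by its own rate: stepSize * rates(j)
    val newWeights = oldWeights.indices
      .map(j => oldWeights(j) - stepSize * rates(j) * gradient(j)).toArray
    (newWeights, 0.0)
  }
}
```

Because the history is kept inside the instance, one `AdaGradSketch` must be reused across iterations. On the first call, every weight with a non-zero gradient moves by approximately `stepSize`, whatever the magnitude of its gradient; subsequent steps shrink as the history grows.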
