Monday, March 7, 2016

Weighting logistic loss for imbalanced dataset in Spark

How can a corrective weight be applied when training a logistic regression on an imbalanced dataset with Apache Spark MLlib?
Some applications, such as spam detection or online targeting, deal with imbalanced datasets: the number of observations associated with one label (the minority class) is very small compared to the number of observations associated with the other labels.

Let's consider the case of an intrusion detection system that identifies a cyber attack through a Virtual Private Network (VPN) in an enterprise. The objective is to classify every session, based on one or several TCP connections, as a potentially harmful intrusion or a normal transaction, knowing that a very small fraction of attempts to connect and authenticate are actual deliberate attacks (fewer than 1 per million connections).
There are a few points to consider:
  • The number of reported attacks is very likely to be very small (< 10). Therefore, any estimate derived from this very small set of labeled security breaches is plagued with high variance.
  • A very large and expensive labeling operation is required, if it is even feasible, to generate a fairly stable minority class (security breach).
  • The predictor can appear extremely accurate simply by classifying every new attempt to connect to the VPN as harmless. With 1 reported attack per 1 million VPN sessions, the prediction would be correct 99.9999% of the time, although the predictor never detects a single attack. This is why metrics such as the F1 score or the area under the ROC or PR curve are preferred over plain accuracy.
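The last point can be checked with a few lines of plain Scala. This is only a sketch: the names AlwaysHarmlessBaseline, Counts and alwaysHarmless are hypothetical, the scenario (1 attack among 1,000,000 sessions) is the one quoted above, and precision is set to 0 by convention when no positive prediction is made.

```scala
object AlwaysHarmlessBaseline {
  // Confusion-matrix counts for a binary classifier (attack = positive class)
  case class Counts(tp: Long, fp: Long, tn: Long, fn: Long) {
    def accuracy: Double = (tp + tn).toDouble / (tp + fp + tn + fn)
    // Convention (assumption): undefined ratios are reported as 0.0
    def precision: Double = if (tp + fp == 0) 0.0 else tp.toDouble / (tp + fp)
    def recall: Double = if (tp + fn == 0) 0.0 else tp.toDouble / (tp + fn)
    def f1: Double = {
      val denom = precision + recall
      if (denom == 0.0) 0.0 else 2.0 * precision * recall / denom
    }
  }

  // A predictor that labels every session as harmless never produces a
  // positive prediction: all attacks become false negatives
  def alwaysHarmless(numSessions: Long, numAttacks: Long): Counts =
    Counts(tp = 0L, fp = 0L, tn = numSessions - numAttacks, fn = numAttacks)
}
```

Calling AlwaysHarmlessBaseline.alwaysHarmless(1000000L, 1L) gives an accuracy very close to 1, while recall and F1 on the attack class are both 0.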

There are several approaches to address the problem of imbalanced training and validation sets. The most commonly used techniques include:
  • Sub-sampling the majority class (i.e. harmless VPN sessions): It reduces the number of labeled normal sessions while preserving the labeled reported attacks.
  • Over-sampling the minority class: This technique generates synthetic observations from bootstrapped samples, using the k-nearest neighbors algorithm.
  • Applying differential weights to the logistic loss used in the training of the model.
This post focuses on the third option, the re-balancing (weighting) of the logistic loss. The implementation of this technique is specific to Apache Spark, and more specifically to the ML implementation of the logistic regression (not MLlib), which leverages data frames.
It is assumed the reader is somewhat familiar with Apache Spark, data frames and its machine learning module MLlib.
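For comparison, the first technique in the list (sub-sampling the majority class) can be sketched in plain Scala, without Spark. The object MajoritySubSampler, its parameters and the default seed are hypothetical names chosen for illustration; observations are assumed to be (label, payload) pairs with label = 1.0 for the minority class.

```scala
import scala.util.Random

object MajoritySubSampler {
  // Keeps every minority observation (label = 1.0) and a random subset of
  // at most maxMajority majority observations (label = 0.0)
  def subSample[T](
      data: Seq[(Double, T)],
      maxMajority: Int,
      seed: Long = 42L): Seq[(Double, T)] = {
    val (minority, majority) = data.partition(_._1 == 1.0)
    val rng = new Random(seed) // fixed seed for reproducible sampling
    minority ++ rng.shuffle(majority).take(maxMajority)
  }
}
```

The drawback, compared to weighting the loss, is that the discarded majority observations no longer contribute to the training.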

Weighting the logistic loss
Let's consider the logistic loss commonly used in training a binary model f with features x and label y: \[logLoss = - \sum_{i=0}^{n-1}[y_i \cdot \log(f(x_i)) + (1 - y_i) \cdot \log(1 - f(x_i))]\] The first component of the loss function is related to the minority observations (security breach through a VPN: label = 1), while the second component represents the loss related to the majority observations (harmless VPN sessions: label = 0).
Let's consider the ratio r of the number of observations related to the majority label over the total number of observations n: \[r = \frac{|\{i : y_i = 0\}|}{n}\] The logistic loss can then be rebalanced as \[logLoss = -\sum_{i=0}^{n-1} [r \cdot y_i \cdot \log(f(x_i)) + (1-r) \cdot (1-y_i) \cdot \log(1 - f(x_i))]\] The next step is to implement the weighting of the binomial logistic regression classifier using Apache Spark.
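The rebalanced loss can be illustrated with a small, Spark-free sketch. It assumes that r is the fraction of majority (label = 0) observations, so that the minority term is weighted by r and the majority term by 1 - r. The object WeightedLogLoss, its method names and the eps guard against log(0) are hypothetical additions, not part of Spark.

```scala
object WeightedLogLoss {
  // Fraction of observations carrying the majority label (0.0)
  def majorityRatio(labels: Seq[Double]): Double =
    labels.count(_ == 0.0).toDouble / labels.size

  // Rebalanced logistic loss: the label = 1 term is weighted by r and the
  // label = 0 term by (1 - r). probs holds the predicted f(x_i).
  def loss(
      labels: Seq[Double],
      probs: Seq[Double],
      r: Double,
      eps: Double = 1e-12): Double = {
    require(labels.size == probs.size, "labels and probs differ in size")
    val total = labels.zip(probs).map { case (y, p) =>
      // Clip the prediction to avoid log(0)
      val q = math.min(math.max(p, eps), 1.0 - eps)
      r * y * math.log(q) + (1.0 - r) * (1.0 - y) * math.log(1.0 - q)
    }.sum
    -total
  }
}
```

With one attack among four sessions (r = 0.75) and uninformative predictions of 0.5, the single minority observation contributes 0.75·log 2 and the three majority observations contribute 3 × 0.25·log 2, so both classes weigh equally in the loss.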

Implementing the weighted loss in Apache Spark
Apache Spark is an open source framework for in-memory processing of large datasets (think of it as Hadoop on steroids). The Apache Spark framework contains a machine learning module known as MLlib. The objective is to modify/override the train method of LogisticRegression.
One simple option is to sub-class the LogisticRegression class in the mllib/ml package. However, the logistic loss is actually computed in the private class LogisticAggregator, which cannot be overridden.
Fortunately, if you browse through the ml.classification.LogisticRegression.train Scala code, you will notice that the class Instance, which encapsulates labeled data for the computation of the loss and the gradient of the loss, has three parameters:
  • label: Double
  • features: linalg.Vector
  • weight: Double

The plan is to use this 3rd parameter, weight, as the balancing weight ratio r. This is simply accomplished by adding an extra column weightCol to the input data frame dataset and defining its value as
  • balancingRatio r for label = 1
  • 1 - balancingRatio (1 - r) for label = 0
A simple implementation of the weighted logistic regression, built on the Apache Spark MLlib implementation of the logistic regression:

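import java.util.UUID
import org.apache.spark.Logging
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.lit

final val BalancingWeightColumnName: String = "weightCol"
final val BalanceRatioBounds = (0.001, 0.999)
final val ImportanceWeightNormalization = 2.0

class WeightedLogisticRegression(balanceRatio: Double = 0.5)
   extends LogisticRegression(UUID.randomUUID.toString)
      with Logging {

  // Clamp the ratio to BalanceRatioBounds. The branch bodies were missing
  // from the original listing and are reconstructed from the context.
  private[this] val balancingRatio =
    if (balanceRatio < BalanceRatioBounds._1) BalanceRatioBounds._1
    else if (balanceRatio > BalanceRatioBounds._2) BalanceRatioBounds._2
    else balanceRatio

  override protected def train(dataset: DataFrame):
      LogisticRegressionModel = {
    // Placeholder column, so that the schema includes the weight
    val newDataset = dataset.withColumn(BalancingWeightColumnName, lit(0.0))

    // Assumes column 0 is the label and column 1 the feature vector
    val weightedDataset: RDD[Row] = newDataset.rdd.map(row => {
      val w = if (row.getAs[Double](0) == 1.0) balancingRatio
         else 1.0 - balancingRatio
      Row(row.getAs[Double](0), row.getAs[Vector](1),
           ImportanceWeightNormalization * w)
    })

    val weightedDataFrame = dataset.sqlContext
        .createDataFrame(weightedDataset, newDataset.schema)

    // Reconstructed ending (the original listing is truncated here):
    // register the weight column and delegate to the parent training
    setWeightCol(BalancingWeightColumnName)
    super.train(weightedDataFrame)
  }
}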

  • The balancing ratio has to be normalized by a factor ImportanceWeightNormalization = 2: The factor is required to produce a weight of 1 for both the majority and minority classes from a fully balanced ratio of 0.5.
  • The actual balancing ratio needs to be constrained within an acceptable range BalanceRatioBounds to prevent the minority class from having an outsized influence on the weights of the logistic regression model. In the extreme case, there may not even be a single observation in the minority class (security breach through the VPN). These minimum and maximum values are highly dependent on the type of application.
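The effect of the normalization and of the bounds can be worked through outside of Spark. The sketch below reuses the values of BalanceRatioBounds and ImportanceWeightNormalization from the listing; the object BalancingWeights and its methods are hypothetical helpers written for this illustration.

```scala
object BalancingWeights {
  val Bounds = (0.001, 0.999)   // same values as BalanceRatioBounds
  val Normalization = 2.0       // same value as ImportanceWeightNormalization

  // Constrain the ratio to the acceptable range
  def clamp(ratio: Double): Double =
    math.min(math.max(ratio, Bounds._1), Bounds._2)

  // Returns (weight applied to label = 1, weight applied to label = 0)
  def weights(ratio: Double): (Double, Double) = {
    val r = clamp(ratio)
    (Normalization * r, Normalization * (1.0 - r))
  }
}
```

A balanced ratio of 0.5 yields a weight of 1.0 for both classes. A ratio of 0.999999 (1 attack per million sessions) is clamped to 0.999, which gives weights of roughly 1.998 and 0.002, so the minority class never counts more than about 999 times the majority class.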

Here is an example of the application of WeightedLogisticRegression to a training data frame labeledData. The number of data points (or observations) associated with label = 1 is extracted through a simple filter.

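// Number of observations in the minority class (label = 1)
val numPositives = labeledData.filter("label == 1.0").count
val datasetSize = labeledData.count
// Fraction of observations in the majority class (label = 0)
val balanceRatio = (datasetSize - numPositives).toDouble/datasetSize

val lr = new WeightedLogisticRegression(balanceRatio)

// Reconstructed: the right-hand side was missing from the original listing
val model = lr.fit(labeledData)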

One simple validation of this implementation of the weighted logistic regression on Apache Spark is to verify that the weights of the model generated by WeightedLogisticRegression with a balancing ratio of 0.5 are very similar to the weights generated by the standard logistic regression of Apache Spark MLlib (ml.classification.LogisticRegressionModel).

Note: This implementation relies on Apache Spark 1.6.0. The latest implementation of the logistic regression in the beta release of Apache Spark 2.0 does not allow LogisticRegression to be overridden outside the scope of Apache Spark MLlib.


