Wednesday, October 10, 2012

K-means Clustering in Java I: Elements

Overview
Among the clustering methods have been developed over the years from Spectral clustering, Non-negative Matrix factorization, Canopy to Hierarchical and K-means clustering. The K-means algorithm is by far the easiest to implement. This simplicity comes with a high price in terms of scalability and even reliability. However, as a unsupervised learning technique, K-means is still valuable for reducing the number of model features or detecting anomalies.
The objective is to classify observations or data points by groups that share common attributes or features. The diagram below illustrates the clustering of observations (x,y) for a simple 2-feature model



Each cluster has a mean or centroid, m = ( .. m..). First we need to define a distance between an observation  X = (...x ..) and c. The Manhattan and Euclidean distances are respectively defined as:  \[d_{M} = \sum_{i=0}^{n}\left | x_{i} - m_{i}\right |\,\,\,,\,\,d_{E}= \sum_{i=0}^{n} (x_{i} - m_{i})^{2}\] The loss function for N cluster Cj is defined by \[W(C)=\frac{1}{2}\sum_{k=0}^{N-1}\sum_{c_{i}=k}\sum_{C_{j}} d(x_{i},x_{j})\]  The goal is to find the centroid m, and clusters C, that minimize the loss function as: \[C^{*}\left (i \right ) = arg\min _{k\in [0,N-1]}d (x_{i}, m_{k})\]

Note: For the sake of readability of the implementation of algorithms, all non-essential code such as error checking, comments, exception, validation of class and method arguments, scoping qualifiers or import is omitted
Distances and Observations
First we need to define the distance between each observation and the centroid of a cluster. The class hierarchy related to the distance can be implemented as nested classes as there is no reason to "expose" to client code. The interface, Distance,define the signature of the computation method. For sake of simplicity, the sample code implements only the Manhattan and Euclidean distances.  Exceptions, validation of method arguments, setter and getter methods are omitted for the sake of simplicity.

protected interface Distance {
  public double compute(double[] x, Centroid centroid);
}
    // Defintion of d(x,y) =|x-y|
protected class ManhattanDistance implements Distance {
  public double compute(double[] x, Centroid centroid) {
     double sum = 0.0, xx = 0.0;
     for( int k = 0; k< x.length; k++) {
       xx = x[k] - centroid.get(k);
       if( xx < 0.0) {
         xx = -xx;
       }
       sum += xx;
     }
     return sum;
  }
}

  // Definition d(x,y)= sqr(x-y) 
protected class EuclideanDistance implements Distance {
  public double compute(double[] x, Centroid centroid) {
    double sum = 0.0, xx = 0.0;
    for( int k = 0; k < x.length; k++) {
       xx = x[k] - centroid.get(k);
       sum += xx*xx;
    } 
    return Math.sqrt(sum);
  } 
}

Next, we define an observation (or data point) as a vector or array of floating point values, in our example.  An observation can support heterogeneous types (boolean, integer, float point,..) as long as they are normalized to [0,1]. In our example we simply normalized over the maximum values for all the observations.

public final class Observation {
   // use Euclidean distance that is shared between all the instances

  private static Distance metric = new EuclideanDistance();
  public static void setMetric(final Distance metric) {
    this.metric = metric;
  }
 
  private double[] _x  = null;
  private int  _index  = -1;

  public Observation(double[] x, int index) { 
    _x = x; 
    _index = index; 
  }
   // compute distance between each point and the centroid
  public double computeDistance(final Centroid centroid) {
     return metric.compute(_x, centroid);
  }
    // normalize the value of data points.
  public void normalize(double[] maxValues) {
     for( int k = 0; k < _x.length; k++) {
        _x[k] /= maxValues[k];
     }
  }
}



Clusters
Centroid for each cluster are computed iteratively to reduce the loss function.  The centroid values are computed using the mean of each feature across all the observations. The method Centroid.compute initialize the data points belonging to a cluster with the list of observations and compute the centroid values _x by normalizing with the number of points.
protected class Centroid {
  private double[] _x = null;       
       
  protected Centroid() {}
  protected Centroid(double[] x) {
    Array.systemCopy(_x, x, 0, x.length, sizeOf(double));
  }
    // Compute the centoid values _x by normalizing with the number of points.
  protected void compute(final List<Observation> observations)  {
    double[] x = new double[_x.length];
    Arrays.fill(x, 0.0);
           
    for( Observation point : observations ) {
      for(int k =0; k < x.length; k++) {
        x[k] += point.get(k);
      }
    }
    int numPoints = observations.size();
    for(int k =0; k < x.length; k++) {
      _x[k] = x[k]/numPoints;
    }
  }
}

A cluster is defined by its label (index in this example) a centroid, the list of observations it contains and the current loss associated to the cluster (sum of the distance between all observations and the centroid).
The cluster behavior is defined by the following methods:
  • computeCentroid: compute the sum of the distance between all the point in this cluster and the centroid.
  • attach: Attach or add a new observation to this cluster
  • detach: Remove an existing observations from this cluster.


public final class KmeansCluster {
  private int       _index   = -1;
  private Centroid  _centroid  = null; 
  private double    _sumDistances  = 0.0;
  private List<observation> _observations = new ArrayList<Observation>()

  public void computeCentroid() {
    _centroid.compute( _observations );
    for( Observation point : _observations  ) {
        point.computeDistance(_centroid);
    }
    computeSumDistances();
  }

     // Attach a new observation to this cluster.
  public void attach(final Observation point) { 
    point.computeDistance(_centroid);
    _observations.add(point);
    computeSumDistances();
  }

  public void detach(final Observation point) {
    _observations.remove(point);
    computeSumDistances();
  }
           
  private void computeSumDistances() { 
    _sumDistances = 0.0;     
    for( Observation point : _observations) {
      _sumDistances += point.computeDistance(_centroid);
    }
  }
      //....
}

Finally, the clustering class implements the training and run-time classification. The train method iterates across all the clusters and for all the observations to reassign the observations to each cluster. The iterative computation ends when either the loss value converges or the maximum number of iterations is reached. 
If the algorithm use K clusters with M observations with N variables the execution time for creating the clusters is K*M*N. If the algorithm converges after T iterations then the overall execution is T*K*M*N. For instance, the K-means classification of 20K observations and data with 25 dimension, using 10 clusters, converging after 50 iterations requires  250,000,000 evaluations! The constructor create the clustering algorithm with a predefined number of cluster, K, and a set of observations.
The method getCentroids retrieves the current list of centroids (value of centroid vectors)

public final class KmeansClustering { 
  private KmeansCluster[] _clusters = null;
  private Observation[] _obsList = null; 
  private double _totalDistance  = 0.0;
  private Centroid[] _centroids = null;
   
  public KmeansClustering(int numClusters, final Observation[] obsList) {   
     _clusters = new KmeansCluster[numClusters];
     for (int i = 0; i < numClusters; i++) {
        _clusters[i] = new KmeansCluster(i);
     }
     _obsList = obsList;
  }
 
  public final List<double[]> getCentroids() {
      List<double[]> centroidDataList = null;

      if(_clusters != null &&; _clusters.length < 0) {
         centroidDataList = new LinkedList<double[]>();
         for( KmeansCluster cluster : _clusters) {
            centroidDataList.add(cluster.getCentroid().getX());
         }
      }
      return centroidDataList;
  }
}


The second section (next post) describes the implementation of the training and classification tasks.


References
The Elements of Statistical Learning   - T. Hastie, R.Tibshirani, J. Friedman  - Springer 2001
Machine Learning: A Probabilisitc Perspective 11.4.2.5 K-means algorithm - K. Murphy - MIT Press 2012
Pattern Recognition and Machine Learning: Chap 9 "Mixture Models and EM: K-means Clustering" C.Bishop - Springer Science 2006

73 comments:

  1. As the interest of java programming application continues expanding, there is enormous interest for java experts in programming improvement businesses. Along these lines, taking preparing will help understudies to be talented java engineers in driving MNCs.
    JAVA Training in Chennai | JAVA course in Chennai | JAVA J2EE Training in Chennai

    ReplyDelete
  2. Selenium is most as used automation tool to test web application and browser. This automation tool offers precise and complete information about a software application or environment.
    Selenium Training in Chennai | Selenium Training institute in Chennai | Selenium Testing Training in Chennai

    ReplyDelete
  3. I have read your blog its very attractive and impressive. I like it your blog.

    Java Training in Chennai Core Java Training in Chennai Core Java Training in Chennai

    Java Online Training Java Online Training Core Java 8 Training in Chennai Core java 8 online training JavaEE Training in Chennai Java EE Training in Chennai

    ReplyDelete
  4. Thanks for sharing this program.It is really helpful.Continue sharing more like this.
    Regards,
    Java Training in Chennai | Java Training Institutes in Chennai | Java courses in Chennai

    ReplyDelete
  5. I feel satisfied with your blog, you have been delivering useful & unique information to our vision even you have explained the concept as deep clean without having any uncertainty, keep blogging.
    Hadoop Training in Chennai|Big Data Training in Chennai|Big Data Training

    ReplyDelete
  6. Thanks for sharing informative article on Salesforce technology. Your article helped me a lot to understand the career prospects in cloud computing technology. Salesforce Training in Chennai | Salesforce Training Institutes in Chennai

    ReplyDelete
  7. Excellent post!!! The future of cloud computing is on positive side. With most of the companies integrate Salesforce CRM to power their business; there is massive demand for Salesforce developers and administrators across the world.Salesforce Training in Chennai | Salesforce Training Institutes in Chennai

    ReplyDelete
  8. Amazing blog about the various informative information on the programming languages. Java Training in Chennai

    ReplyDelete
  9. Nowadays, most of the businesses rely on cloud based CRM tool to power their business process. They want to access the business from anywhere and anytime. In such scenarios, salesforce CRM will ensure massive advantage to the business owners.Cloud Computing Training in Chennai

    ReplyDelete
  10. Nice blog. Thanks for your valuable information and time to understand.
    Java Training in chenani

    ReplyDelete
  11. Really very informative post you shared here. Keep sharing this type of informative blog. If anyone wants to become a Java professional learn Java Training in Bangalore. Nowadays Java has tons of job opportunities for all professionals.

    ReplyDelete
  12. Great and informative article... thanks for sharing your views and ideas.. keep rocks...

    Java Training in chennai | Java Training institute in chennai

    ReplyDelete
  13. Needed to compose you a very little word to thank you yet again regarding the nice suggestions you’ve contributed here.

    Java Training in Bangalore

    ReplyDelete
  14. Needed to compose you a very little word to thank you yet again regarding the nice suggestions you’ve contributed here.

    https://www.besanttechnologies.com/training-courses/java-training

    ReplyDelete
  15. thank you very useful information admin, and pardon me permission to share articles here may help
    Best VLSI Project Center in Chennai | VLSI Project Center in Velachery

    ReplyDelete
  16. The strategy you have blogged on this technology helped me to get into the next level and had lot of information in it. Thanks for sharing.. Software Testing Training Institute in Chennai | Selenium Training Institute in Chennai | ISTQB Training Institute in Chennai

    ReplyDelete
  17. Thank you for sharing the valuable information here. This was nice and please keep update like this valuable information.
    Robotics Project Center in Chennai | IEEE Robotics Project Center in Chennai

    ReplyDelete
  18. Nice post on java. Advanced technologies are emerging in java and the scope of learning java is very high. Thanks for sharing the valuable blog and keep updating.
    Final Year Project Center in Chennai |
    IEEE Project Center in Chennai |
    Diploma Project Center in Chennai

    ReplyDelete
  19. Good and informative post.... it is very useful for me to learn and understand as a initial professional.... keep rocks and updating...Diploma Projects Center in Chennai | Diploma Projects Center in Velachery

    ReplyDelete
  20. Thanks a lot very much for the high quality and results-oriented help. I won’t think twice to endorse your blog post to anybody who wants and needs support about this area.

    java training in bangalore

    ReplyDelete
  21. Needed to compose you a very little word to thank you yet again regarding the nice suggestions you’ve contributed here.selenium training in bangalore

    ReplyDelete
  22. Good Post! Thank you so much for sharing this pretty post, it was so good to read and useful to improve my knowledge as updated one, keep blogging...
    Robotics Project Center in Chennai | Best Robotics Projects in Chennai | IEEE Robotics Project Center in Chennai | No.1 Robotics Project Center in Velachery

    ReplyDelete
  23. This comment has been removed by the author.

    ReplyDelete
  24. I really enjoyed while reading your article, the information you have delivered in this post was damn good. Keep sharing your post with efficient news.RPA Training Institute in Chennai | UI Path Training Institute in Chennai | Blue Prism Training Institute in Chennai

    ReplyDelete
  25. Informative blog and it was up to the point describing the information very effectively. Thanks to blog author for wonderful and informative post...
    Final Year Project Center in Chennai | Final Year Project Center in Velachery

    ReplyDelete
  26. Your good knowledge and kindness in playing with all the pieces were very useful. I don’t know what I would have done if I had not encountered such a step like this.

    white label website builder

    mobile website builder

    ReplyDelete
  27. I can only express a word of thanks! Nothing else. Because your topic is nice, you can add knowledge. Thank you very much for sharing this information.

    Avriq India
    avriq
    pest control
    cctv camera

    ReplyDelete
  28. Needed to compose you a very little word to thank you yet again regarding the nice suggestions you’ve contributed here.
    uipath training institute in chennai

    ReplyDelete
  29. Amazing write up..Thanks for sharing this valuable information.
    Citrix Exams in Chennai | Xenapp exam center in Chennai

    ReplyDelete
  30. I was looking for something like this Amazing post thanks for sharing for great information.
    OCJP Exam Center in Chennai | OCJP Exam Center in Velachery

    ReplyDelete
  31. I really enjoy simply reading all of your weblogs. Simply wanted to inform you that you have people like me who appreciate your work. 
    ISTQB Certification Training in Chennai | Java Exam Center in Chennai | Microsoft Dot net Certification in Chennai

    ReplyDelete
  32. Good one.I appreciate you for sharing this knowledge.Thank you so much for the examples.
    No.1 Android project Center in Chennai | No.1 Android project Center in Velachery

    ReplyDelete
  33. I feel satisfied with your blog, you have been delivering useful & unique information to our vision even you have explained the concept as deep clean without having any uncertainty, keep blogging.
    OCJP Exam Center in Chennai | OCJP Exam Center in Velachery

    ReplyDelete
  34. Great information in this blog given by you and very interesting to read this article. Our Oracle fusion financials online training supplier gained the high commonplace name through worldwide for its teaching.Thanks for sharing.
    Citrix Exams in Chennai | Xenapp exam center in Chennai

    ReplyDelete
  35. Hi, am a big follower of your blog. Nice information on clustering concept. I am really happy to found such a helpful and fascinating post that is written in well manner. Thanks for sharing such an informative post. Keep update your blog.
    OCJP Exam Center in Chennai | OCJP Exam Center in Velachery

    ReplyDelete
  36. Appreciation for really being thoughtful and also for deciding on certain marvelous guides most people really want to be aware of.
    VMware Exam Centers in Chennai | VMware Exam Centers in Velachery

    ReplyDelete
  37. Extremely exceptionally educational post you shared here. Continue sharing this sort of useful blog.
    Citrix Exams in Chennai | Xenapp exam center in Chennai

    ReplyDelete
  38. I really enjoy simply reading all of your weblogs. Simply wanted to inform you that you have people like me who appreciate your work.
    VMware Exam Centers in Chennai | VMware Exam Centers in Velachery

    ReplyDelete
  39. Good one.I appreciate you for sharing this knowledge.Thank you so much for the examples.Its very helpful for me and newbies.I learned much .
    Citrix Exams in Chennai | Xenapp exam center in Chennai

    ReplyDelete
  40. Quite interesting post,Thanks for sharing the information.Keep updating good stuff...
    Best RPA Training Institute in Chennai | Best RPA Training Institute in Velachery

    ReplyDelete
  41. Excellent information you made in this blog, very helpful information. Thanks for sharing.
    No.1 UIPath Training Institute in Chennai | No.1 UIPath Training Institute in Velachery

    ReplyDelete
  42. Quite interesting post,Thanks for sharing the information.Keep updating good stuff...
    ISTQB Certifications Exams Course in Chennai | QA Testing in Meenambakkam

    ReplyDelete
  43. I believe that there will be very good opportunities for the people who are coming around this area.
    Robotics Project Center Training in Chennai | Best Robotics Project Course in Tambaram

    ReplyDelete
  44. Appreciation for really being thoughtful and also for deciding on certain marvelous guides most people really want to be aware of.
    No.1 Blue Prism Training Institute in Chennai | No.1 Blue Prism Training Institute in Velachery | No.1 Blue Prism Training Institute in Kanchipuram

    ReplyDelete
  45. It is really a great work and the way in which u r sharing the knowledge is excellent. Thanks for helping me to understand basic concepts. Thanks for your informative article. Java Training in Chennai | Pega Training in Chennai

    ReplyDelete
  46. Thank you for taking the time to provide us with your valuable information. Keep sharing your post regularly..
    Automation Anywhere Training with Placement in Chennai | Automation Anywhere Training with Placement in Tambaram

    ReplyDelete
  47. Your very own commitment to getting the message throughout came to be rather powerful and have consistently enabled employees just like me to arrive at their desired goals.
    JAVA and J2EE Training Institute in Chennai | JAVA and J2EE Training Institute in Velachery

    ReplyDelete
  48. Hi there, I am so thrilled I found your website, I really found you by mistake, while I was browsing on Yahoo for something else, Anyhow I am here now and would just like to say thanks a lot for a tremendous post and an all-round exciting blog. Please do keep up the awesome job.
    No.1 Automation Anywhere Training Institute in Chennai | No.1 Automation Anywhere Training Institute in Velachery

    ReplyDelete
  49. I was curious if you ever considered changing the layout of your site? It’s very well written; I love what you’ve got to say. You’ve got an awful lot of text.
    No.1 Blue Prism Training Institute in Chennai | No.1 Blue Prism Training Institute in Kanchipuram

    ReplyDelete