Wednesday, October 10, 2012

K-means Clustering in Java I: Elements

Overview
Among the clustering methods have been developed over the years from Spectral clustering, Non-negative Matrix factorization, Canopy to Hierarchical and K-means clustering. The K-means algorithm is by far the easiest to implement. This simplicity comes with a high price in terms of scalability and even reliability. However, as a unsupervised learning technique, K-means is still valuable for reducing the number of model features or detecting anomalies.
The objective is to classify observations or data points by groups that share common attributes or features. The diagram below illustrates the clustering of observations (x,y) for a simple 2-feature model



Each cluster has a mean or centroid, m = ( .. m..). First we need to define a distance between an observation  X = (...x ..) and c. The Manhattan and Euclidean distances are respectively defined as:  \[d_{M} = \sum_{i=0}^{n}\left | x_{i} - m_{i}\right |\,\,\,,\,\,d_{E}= \sum_{i=0}^{n} (x_{i} - m_{i})^{2}\] The loss function for N cluster Cj is defined by \[W(C)=\frac{1}{2}\sum_{k=0}^{N-1}\sum_{c_{i}=k}\sum_{C_{j}} d(x_{i},x_{j})\]  The goal is to find the centroid m, and clusters C, that minimize the loss function as: \[C^{*}\left (i \right ) = arg\min _{k\in [0,N-1]}d (x_{i}, m_{k})\]

Note: For the sake of readability of the implementation of algorithms, all non-essential code such as error checking, comments, exception, validation of class and method arguments, scoping qualifiers or import is omitted
Distances and Observations
First we need to define the distance between each observation and the centroid of a cluster. The class hierarchy related to the distance can be implemented as nested classes as there is no reason to "expose" to client code. The interface, Distance,define the signature of the computation method. For sake of simplicity, the sample code implements only the Manhattan and Euclidean distances.  Exceptions, validation of method arguments, setter and getter methods are omitted for the sake of simplicity.

protected interface Distance {
  public double compute(double[] x, Centroid centroid);
}
    // Defintion of d(x,y) =|x-y|
protected class ManhattanDistance implements Distance {
  public double compute(double[] x, Centroid centroid) {
     double sum = 0.0, xx = 0.0;
     for( int k = 0; k< x.length; k++) {
       xx = x[k] - centroid.get(k);
       if( xx < 0.0) {
         xx = -xx;
       }
       sum += xx;
     }
     return sum;
  }
}

  // Definition d(x,y)= sqr(x-y) 
protected class EuclideanDistance implements Distance {
  public double compute(double[] x, Centroid centroid) {
    double sum = 0.0, xx = 0.0;
    for( int k = 0; k < x.length; k++) {
       xx = x[k] - centroid.get(k);
       sum += xx*xx;
    } 
    return Math.sqrt(sum);
  } 
}

Next, we define an observation (or data point) as a vector or array of floating point values, in our example.  An observation can support heterogeneous types (boolean, integer, float point,..) as long as they are normalized to [0,1]. In our example we simply normalized over the maximum values for all the observations.

public final class Observation {
   // use Euclidean distance that is shared between all the instances

  private static Distance metric = new EuclideanDistance();
  public static void setMetric(final Distance metric) {
    this.metric = metric;
  }
 
  private double[] _x  = null;
  private int  _index  = -1;

  public Observation(double[] x, int index) { 
    _x = x; 
    _index = index; 
  }
   // compute distance between each point and the centroid
  public double computeDistance(final Centroid centroid) {
     return metric.compute(_x, centroid);
  }
    // normalize the value of data points.
  public void normalize(double[] maxValues) {
     for( int k = 0; k < _x.length; k++) {
        _x[k] /= maxValues[k];
     }
  }
}



Clusters
Centroid for each cluster are computed iteratively to reduce the loss function.  The centroid values are computed using the mean of each feature across all the observations. The method Centroid.compute initialize the data points belonging to a cluster with the list of observations and compute the centroid values _x by normalizing with the number of points.
protected class Centroid {
  private double[] _x = null;       
       
  protected Centroid() {}
  protected Centroid(double[] x) {
    Array.systemCopy(_x, x, 0, x.length, sizeOf(double));
  }
    // Compute the centoid values _x by normalizing with the number of points.
  protected void compute(final List<Observation> observations)  {
    double[] x = new double[_x.length];
    Arrays.fill(x, 0.0);
           
    for( Observation point : observations ) {
      for(int k =0; k < x.length; k++) {
        x[k] += point.get(k);
      }
    }
    int numPoints = observations.size();
    for(int k =0; k < x.length; k++) {
      _x[k] = x[k]/numPoints;
    }
  }
}

A cluster is defined by its label (index in this example) a centroid, the list of observations it contains and the current loss associated to the cluster (sum of the distance between all observations and the centroid).
The cluster behavior is defined by the following methods:
  • computeCentroid: compute the sum of the distance between all the point in this cluster and the centroid.
  • attach: Attach or add a new observation to this cluster
  • detach: Remove an existing observations from this cluster.


public final class KmeansCluster {
  private int       _index   = -1;
  private Centroid  _centroid  = null; 
  private double    _sumDistances  = 0.0;
  private List<observation> _observations = new ArrayList<Observation>()

  public void computeCentroid() {
    _centroid.compute( _observations );
    for( Observation point : _observations  ) {
        point.computeDistance(_centroid);
    }
    computeSumDistances();
  }

     // Attach a new observation to this cluster.
  public void attach(final Observation point) { 
    point.computeDistance(_centroid);
    _observations.add(point);
    computeSumDistances();
  }

  public void detach(final Observation point) {
    _observations.remove(point);
    computeSumDistances();
  }
           
  private void computeSumDistances() { 
    _sumDistances = 0.0;     
    for( Observation point : _observations) {
      _sumDistances += point.computeDistance(_centroid);
    }
  }
      //....
}

Finally, the clustering class implements the training and run-time classification. The train method iterates across all the clusters and for all the observations to reassign the observations to each cluster. The iterative computation ends when either the loss value converges or the maximum number of iterations is reached. 
If the algorithm use K clusters with M observations with N variables the execution time for creating the clusters is K*M*N. If the algorithm converges after T iterations then the overall execution is T*K*M*N. For instance, the K-means classification of 20K observations and data with 25 dimension, using 10 clusters, converging after 50 iterations requires  250,000,000 evaluations! The constructor create the clustering algorithm with a predefined number of cluster, K, and a set of observations.
The method getCentroids retrieves the current list of centroids (value of centroid vectors)

public final class KmeansClustering { 
  private KmeansCluster[] _clusters = null;
  private Observation[] _obsList = null; 
  private double _totalDistance  = 0.0;
  private Centroid[] _centroids = null;
   
  public KmeansClustering(int numClusters, final Observation[] obsList) {   
     _clusters = new KmeansCluster[numClusters];
     for (int i = 0; i < numClusters; i++) {
        _clusters[i] = new KmeansCluster(i);
     }
     _obsList = obsList;
  }
 
  public final List<double[]> getCentroids() {
      List<double[]> centroidDataList = null;

      if(_clusters != null &&; _clusters.length < 0) {
         centroidDataList = new LinkedList<double[]>();
         for( KmeansCluster cluster : _clusters) {
            centroidDataList.add(cluster.getCentroid().getX());
         }
      }
      return centroidDataList;
  }
}


The second section (next post) describes the implementation of the training and classification tasks.


References
The Elements of Statistical Learning   - T. Hastie, R.Tibshirani, J. Friedman  - Springer 2001
Machine Learning: A Probabilisitc Perspective 11.4.2.5 K-means algorithm - K. Murphy - MIT Press 2012
Pattern Recognition and Machine Learning: Chap 9 "Mixture Models and EM: K-means Clustering" C.Bishop - Springer Science 2006

15 comments:

  1. As the interest of java programming application continues expanding, there is enormous interest for java experts in programming improvement businesses. Along these lines, taking preparing will help understudies to be talented java engineers in driving MNCs.
    JAVA Training in Chennai | JAVA course in Chennai | JAVA J2EE Training in Chennai

    ReplyDelete
  2. Selenium is most as used automation tool to test web application and browser. This automation tool offers precise and complete information about a software application or environment.
    Selenium Training in Chennai | Selenium Training institute in Chennai | Selenium Testing Training in Chennai

    ReplyDelete
  3. I have read your blog its very attractive and impressive. I like it your blog.

    Java Training in Chennai Core Java Training in Chennai Core Java Training in Chennai

    Java Online Training Java Online Training Core Java 8 Training in Chennai Core java 8 online training JavaEE Training in Chennai Java EE Training in Chennai

    ReplyDelete
  4. Thanks for sharing this program.It is really helpful.Continue sharing more like this.
    Regards,
    Java Training in Chennai | Java Training Institutes in Chennai | Java courses in Chennai

    ReplyDelete
  5. I feel satisfied with your blog, you have been delivering useful & unique information to our vision even you have explained the concept as deep clean without having any uncertainty, keep blogging.
    Hadoop Training in Chennai|Big Data Training in Chennai|Big Data Training

    ReplyDelete
  6. Thanks for sharing informative article on Salesforce technology. Your article helped me a lot to understand the career prospects in cloud computing technology. Salesforce Training in Chennai | Salesforce Training Institutes in Chennai

    ReplyDelete
  7. Excellent post!!! The future of cloud computing is on positive side. With most of the companies integrate Salesforce CRM to power their business; there is massive demand for Salesforce developers and administrators across the world.Salesforce Training in Chennai | Salesforce Training Institutes in Chennai

    ReplyDelete
  8. Amazing blog about the various informative information on the programming languages. Java Training in Chennai

    ReplyDelete
  9. Nowadays, most of the businesses rely on cloud based CRM tool to power their business process. They want to access the business from anywhere and anytime. In such scenarios, salesforce CRM will ensure massive advantage to the business owners.Cloud Computing Training in Chennai

    ReplyDelete
  10. Nice blog. Thanks for your valuable information and time to understand.
    Java Training in chenani

    ReplyDelete
  11. Really very informative post you shared here. Keep sharing this type of informative blog. If anyone wants to become a Java professional learn Java Training in Bangalore. Nowadays Java has tons of job opportunities for all professionals.

    ReplyDelete
  12. Great and informative article... thanks for sharing your views and ideas.. keep rocks...

    Java Training in chennai | Java Training institute in chennai

    ReplyDelete
  13. Needed to compose you a very little word to thank you yet again regarding the nice suggestions you’ve contributed here.

    Java Training in Bangalore

    ReplyDelete