Wednesday, October 31, 2012

K-means Clustering in Java II: Classification

Overview
The basic components of a K-means clustering implementation were introduced in the previous post, K-means Clustering in Java I: Elements.

This second part on the implementation of K-means in Java describes the two main tasks of any unsupervised or supervised learning algorithm:
  • training: executed off-line during the analysis of historical data
  • classification: executed at run-time to classify new observations
Note: For the sake of readability of the implementation of algorithms, all non-essential code such as error checking, comments, exceptions, validation of class and method arguments, scoping qualifiers or imports is omitted.

Training
The learning method, train, implements the clustering algorithm. It iterates to minimize the sum of the distances between each cluster's data points and its centroid.
For each iteration (or epoch), the train method:
  1. assigns observations to each cluster
  2. computes the centroid for each cluster, computeCentroid
  3. computes the total distance between all the observations and their respective centroids, computeTotalDistance
  4. estimates the closest cluster for each observation
  5. re-assigns the observation, updateCentroids

public int train() {
  int numIters = _maxIterations, k = 0;
  boolean inProgress = true;

  initialize();
    // Assign the observations to the clusters, round-robin
  while(inProgress) {
     for(KmeansCluster cluster : _clusters ) {
        cluster.attach(_obsList[k]);
        if( ++k >= _obsList.length) {
           inProgress = false;
           break;
        }
     }
  }

  computeTotalDistance();
  for(KmeansCluster cluster : _clusters ) {
     cluster.computeCentroid();
  }
  computeTotalDistance();

  List<Observation> obsList = null;
  KmeansCluster closestCluster = null;

   // Main iterative loop: traverses all the clusters,
   // computes the distance of each observation to each centroid
   // and re-assigns the observation to the closest cluster.
  for(int i = 0; i < _maxIterations; i++) {
    for(KmeansCluster cluster : _clusters ) {
        // Copy the observations so the cluster content can be
        // modified (detach/attach) while iterating over the copy
      obsList = new ArrayList<Observation>(cluster.getDataPointsList());

      for( Observation point : obsList) {
        double minDistance = Double.MAX_VALUE, distance = 0.0;
        closestCluster = null;

        for(KmeansCluster cursor : _clusters ) {
          distance = point.computeDistance(cursor.getCentroid());
          if( minDistance > distance) {
            minDistance = distance;
            closestCluster = cursor;
          }
        }
        updateCentroids(point, cluster, closestCluster);
      }
    }
      // Simple convergence criteria
    if( _convergeCounter >= _minNumConvergeIters ) {
      numIters = i;
      break;
    }
  }
  return numIters;
}
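
Note that the convergence test relies on _convergeCounter, which computeTotalDistance increments each time the change in total distance falls below the convergence criteria; requiring several consecutive quasi-stationary epochs (_minNumConvergeIters) makes the test more robust than a threshold on a single epoch.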

Classification
The classification of a new observation is simple: it consists of evaluating the distance between the new data point and each centroid, and selecting the cluster with the shortest distance. The classify method extracts the index (or label) of the cluster closest to the new observation.

public int classify(double[] obs) {
  double bestScore = Double.MAX_VALUE, distance = 0.0;
  int clusterId = -1;
       
  for(int k = 0; k < _centroids.length; k++) {
     distance = _centroids[k].computeDistance(obs);
     if(distance < bestScore) {
        bestScore = distance;
        clusterId = k;
     }
  }
  return clusterId;
}
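
The snippet below is a minimal, hypothetical driver tying the two tasks together. The synthetic data set, the number of clusters and the assumption that the iteration and convergence settings are initialized with default values are illustrative only, not part of the original code.

  // Hypothetical usage: cluster a small synthetic 2-feature data set,
  // then classify a new (normalized) data point.
double[][] data = {
   {0.9, 0.1}, {0.8, 0.2}, {0.1, 0.9}, {0.2, 0.8}
};
Observation[] obsList = new Observation[data.length];
for( int i = 0; i < data.length; i++) {
   obsList[i] = new Observation(data[i], i);
}

KmeansClustering kmeans = new KmeansClustering(2, obsList);
int numIters = kmeans.train();
int clusterId = kmeans.classify(new double[] {0.85, 0.15});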

The code snippet below implements some of the supporting methods:
  • computeTotalDistance: compute the loss function value (total distance)
  • initialize: initialize the centroid for each cluster
  • updateCentroids: update the values of the centroids


private void computeTotalDistance() {
  double totalDistance = 0.0;

  for(KmeansCluster cluster : _clusters ) {
     totalDistance += cluster.getSumDistances();
  }
    // Update the convergence counter: reset it whenever
    // the loss is still changing significantly
  double error = Math.abs(_totalDistance - totalDistance);
  _convergeCounter = ( error < _convergeCriteria) ? _convergeCounter + 1 : 0;
  _totalDistance = totalDistance;
}

private void initialize() {
     // getParameters returns the maximum value of each variable
     // followed by its minimum value
   double[] params = getParameters();
   int numVariables = params.length >> 1;

   double[] range = new double[numVariables];
   for( int k = 0, j = numVariables; k < numVariables; k++, j++ ) {
      range[k] = params[k] - params[j];
   }

   double[] x = new double[numVariables];
   int sz_1 = _clusters.length + 1, m = 1;

     // Spread the initial centroids evenly across the range of each variable
   for(KmeansCluster cluster : _clusters) {
      for( int k = 0, j = numVariables; k < numVariables; k++, j++ ) {
         x[k] = ((range[k]/sz_1)*m) + params[j];
      }
      cluster.setCentroid(x);
      m++;
   }
}
 
private void updateCentroids(Observation point,
                             KmeansCluster cluster, 
                             KmeansCluster bestCluster) {
  boolean update = bestCluster != null && bestCluster != cluster; 
  if( update ) {
     bestCluster.attach(point);
     cluster.detach(point);
     for(KmeansCluster cursor : _clusters ) {
        cursor.computeCentroid();
     }
     computeTotalDistance();
  }
}
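
Note that updateCentroids re-computes every centroid and the total distance after each individual re-assignment, which keeps the loss estimate current at the cost of extra computation; a common variant of the algorithm defers the centroid updates to the end of each epoch.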



References
The Elements of Statistical Learning - T. Hastie, R. Tibshirani, J. Friedman - Springer 2001
Machine Learning: A Probabilistic Perspective, 11.4.2.5 K-means algorithm - K. Murphy - MIT Press 2012
Pattern Recognition and Machine Learning, Chap 9 "Mixture Models and EM: K-means Clustering" - C. Bishop - Springer Science 2006

Wednesday, October 10, 2012

K-means Clustering in Java I: Elements

Overview
Among the clustering methods developed over the years, from spectral clustering, non-negative matrix factorization and canopy to hierarchical clustering, the K-means algorithm is by far the easiest to implement. This simplicity comes with a high price in terms of scalability and even reliability. However, as an unsupervised learning technique, K-means is still valuable for reducing the number of model features or detecting anomalies.
The objective is to classify observations or data points by groups that share common attributes or features. The diagram below illustrates the clustering of observations (x,y) for a simple 2-feature model.



Each cluster has a mean or centroid, \(m = (m_{0},..,m_{n})\). First we need to define a distance between an observation \(x = (x_{0},..,x_{n})\) and the centroid m. The Manhattan and Euclidean distances are respectively defined as: \[d_{M} = \sum_{i=0}^{n}\left | x_{i} - m_{i}\right |\,\,\,,\,\,d_{E}= \sqrt{\sum_{i=0}^{n} (x_{i} - m_{i})^{2}}\] The loss function for N clusters \(C_{j}\) is defined by \[W(C)=\frac{1}{2}\sum_{k=0}^{N-1}\sum_{c_{i}=k}\sum_{C_{j}} d(x_{i},x_{j})\] The goal is to find the centroids m and the cluster assignment C* that minimize the loss function: \[C^{*}\left (i \right ) = arg\min _{k\in [0,N-1]}d (x_{i}, m_{k})\]
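For instance, with two features, an observation x = (0.2, 0.5) and a centroid m = (0.6, 0.1) yield \[d_{M} = |0.2-0.6| + |0.5-0.1| = 0.8\,\,\,,\,\,d_{E} = \sqrt{0.4^{2} + 0.4^{2}} \approx 0.566\]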

Note: For the sake of readability of the implementation of algorithms, all non-essential code such as error checking, comments, exceptions, validation of class and method arguments, scoping qualifiers or imports is omitted.
Distances and Observations
First we need to define the distance between each observation and the centroid of a cluster. The classes related to the distance can be implemented as nested classes, as there is no reason to expose them to client code. The interface, Distance, defines the signature of the computation method. For the sake of simplicity, the sample code implements only the Manhattan and Euclidean distances.

protected interface Distance {
  public double compute(double[] x, Centroid centroid);
}
    // Definition of the Manhattan distance: d(x,m) = sum of |x(i) - m(i)|
protected class ManhattanDistance implements Distance {
  public double compute(double[] x, Centroid centroid) {
     double sum = 0.0, xx = 0.0;
     for( int k = 0; k< x.length; k++) {
       xx = x[k] - centroid.get(k);
       if( xx < 0.0) {
         xx = -xx;
       }
       sum += xx;
     }
     return sum;
  }
}

  // Definition of the Euclidean distance: d(x,m) = sqrt(sum of (x(i) - m(i))^2)
protected class EuclideanDistance implements Distance {
  public double compute(double[] x, Centroid centroid) {
    double sum = 0.0, xx = 0.0;
    for( int k = 0; k < x.length; k++) {
       xx = x[k] - centroid.get(k);
       sum += xx*xx;
    } 
    return Math.sqrt(sum);
  } 
}

Next, we define an observation (or data point) as a vector or array of floating point values. An observation can support heterogeneous types (boolean, integer, floating point, ..) as long as they are normalized to [0,1]. In our example, we simply normalize each feature by its maximum value across all the observations.

public final class Observation {
   // use Euclidean distance that is shared between all the instances

  private static Distance metric = new EuclideanDistance();
  public static void setMetric(final Distance newMetric) {
    metric = newMetric;
  }
 
  private double[] _x  = null;
  private int  _index  = -1;

  public Observation(double[] x, int index) { 
    _x = x; 
    _index = index; 
  }
   // compute distance between each point and the centroid
  public double computeDistance(final Centroid centroid) {
     return metric.compute(_x, centroid);
  }
    // normalize the value of data points.
  public void normalize(double[] maxValues) {
     for( int k = 0; k < _x.length; k++) {
        _x[k] /= maxValues[k];
     }
  }
}
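
As an illustration of the normalization step, the hypothetical helper below computes the per-feature maxima over a set of observations before invoking normalize. It is a sketch, not part of the original class: it assumes the get(k) accessor used later by Centroid.compute (omitted from the listing above) and positive feature values.

  // Illustrative helper: extract the maximum value of each feature,
  // then normalize every observation to [0, 1].
private static void normalizeAll(Observation[] obsList, int numFeatures) {
   double[] maxValues = new double[numFeatures];
   for( Observation obs : obsList) {
      for( int k = 0; k < numFeatures; k++) {
         if( obs.get(k) > maxValues[k]) {
            maxValues[k] = obs.get(k);
         }
      }
   }
   for( Observation obs : obsList) {
      obs.normalize(maxValues);
   }
}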



Clusters
The centroid for each cluster is computed iteratively to reduce the loss function. The centroid values are computed as the mean of each feature across all the observations in the cluster. The method Centroid.compute aggregates the data points belonging to the cluster and computes the centroid values _x by normalizing with the number of points.
protected class Centroid {
  private double[] _x = null;       
       
  protected Centroid() {}
  protected Centroid(double[] x) {
    _x = Arrays.copyOf(x, x.length);
  }
    // Compute the centroid values _x by normalizing with the number of points.
  protected void compute(final List<Observation> observations)  {
    double[] x = new double[_x.length];
    Arrays.fill(x, 0.0);
           
    for( Observation point : observations ) {
      for(int k =0; k < x.length; k++) {
        x[k] += point.get(k);
      }
    }
    int numPoints = observations.size();
    for(int k =0; k < x.length; k++) {
      _x[k] = x[k]/numPoints;
    }
  }
}

A cluster is defined by its label (an index, in this example), a centroid, the list of observations it contains, and the current loss associated with the cluster (the sum of the distances between all observations and the centroid).
The cluster behavior is defined by the following methods:
  • computeCentroid: re-compute the centroid from the observations in the cluster, then update the sum of the distances to the centroid
  • attach: attach or add a new observation to this cluster
  • detach: remove an existing observation from this cluster


public final class KmeansCluster {
  private int       _index   = -1;
  private Centroid  _centroid  = null; 
  private double    _sumDistances  = 0.0;
  private List<Observation> _observations = new ArrayList<Observation>();

  public void computeCentroid() {
      // computeSumDistances refreshes the distances to the new centroid
    _centroid.compute( _observations );
    computeSumDistances();
  }

     // Attach a new observation to this cluster.
  public void attach(final Observation point) { 
    point.computeDistance(_centroid);
    _observations.add(point);
    computeSumDistances();
  }

  public void detach(final Observation point) {
    _observations.remove(point);
    computeSumDistances();
  }
           
  private void computeSumDistances() { 
    _sumDistances = 0.0;     
    for( Observation point : _observations) {
      _sumDistances += point.computeDistance(_centroid);
    }
  }
      //....
}

Finally, the clustering class implements the training and run-time classification. The train method iterates across all the clusters and all the observations to re-assign the observations to each cluster. The iterative computation ends when either the loss value converges or the maximum number of iterations is reached.
If the algorithm uses K clusters and M observations with N variables, the execution cost of each pass over the data is proportional to K*M*N. If the algorithm converges after T iterations, the overall cost is T*K*M*N. For instance, the K-means classification of 20,000 observations with 25 dimensions, using 10 clusters and converging after 50 iterations, requires 250,000,000 evaluations! The constructor creates the clustering algorithm with a predefined number of clusters, K, and a set of observations.
The method getCentroids retrieves the current list of centroids (the values of the centroid vectors).

public final class KmeansClustering { 
  private KmeansCluster[] _clusters = null;
  private Observation[] _obsList = null; 
  private double _totalDistance  = 0.0;
  private Centroid[] _centroids = null;
   
  public KmeansClustering(int numClusters, final Observation[] obsList) {   
     _clusters = new KmeansCluster[numClusters];
     for (int i = 0; i < numClusters; i++) {
        _clusters[i] = new KmeansCluster(i);
     }
     _obsList = obsList;
  }
 
  public final List<double[]> getCentroids() {
      List<double[]> centroidDataList = null;

      if(_clusters != null && _clusters.length > 0) {
         centroidDataList = new LinkedList<double[]>();
         for( KmeansCluster cluster : _clusters) {
            centroidDataList.add(cluster.getCentroid().getX());
         }
      }
      return centroidDataList;
  }
}
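
A short, hypothetical snippet showing how the centroid vectors can be inspected once training (covered in the next post) has completed:

  // Hypothetical: print the centroid vectors after training
KmeansClustering kmeans = new KmeansClustering(3, obsList);
  // ... train (see the second post) ...
for( double[] centroid : kmeans.getCentroids()) {
   System.out.println(java.util.Arrays.toString(centroid));
}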


The second section (next post) describes the implementation of the training and classification tasks.


References
The Elements of Statistical Learning - T. Hastie, R. Tibshirani, J. Friedman - Springer 2001
Machine Learning: A Probabilistic Perspective, 11.4.2.5 K-means algorithm - K. Murphy - MIT Press 2012
Pattern Recognition and Machine Learning, Chap 9 "Mixture Models and EM: K-means Clustering" - C. Bishop - Springer Science 2006

Friday, October 5, 2012

Type Class Polymorphism and Monads

Overview
Most software developers associate the concept of polymorphism with the dynamic binding and class inheritance available in the most common object-oriented programming languages such as C++ and Java. As part of its functional programming "DNA", Scala provides developers with alternative approaches to polymorphism (besides the ubiquitous method override).

Note: For the sake of readability of the implementation of algorithms, all non-essential code such as error checking, comments, exceptions, validation of class and method arguments, scoping qualifiers or imports is omitted.

Object-Oriented Polymorphism

Java or C++ programmers are familiar with the implementation of polymorphism described in the example below. A trait defines the key behavior of a collection by declaring two methods: += to add a new item to the collection and -= to retrieve the first item from the collection.
trait Collection[T <: Double] {                           
   def += (t:T) : Boolean
   def -= : T 
}

The class OrderedCollection implements an ordered collection of a type inherited from Double by defining the two abstract methods += and -=. This class specializes the behavior of the trait by ordering the elements in the collection.
class OrderedCollection[T <: Double] extends Collection[T] {
   import scala.collection.mutable.TreeSet
 
   var c = TreeSet.empty(Ordering.fromLessThan[T]((x:Double, y:Double) => (x < y)))
   override def += (t:T) : Boolean = c.add(t)
   override def -= : T = { 
     val top = c.head
     c.remove(top)
     top 
   }
}
The following code snippet illustrates a simple use case for the ordered collection. Note that the declared type of the collection instance, thisCollection, is not required; it is added for the sake of clarity.
val thisCollection: Collection[Double] = new OrderedCollection[Double]
thisCollection += 9.556
thisCollection += 45.7
thisCollection += 4.9
println(thisCollection)

Type Class Polymorphism 
Scala supports type class polymorphism, also referred to as ad hoc polymorphism in technical articles: the ability to create and apply generic polymorphic operators or functions to arguments of different types. Ad hoc polymorphism lets you add features to a type on demand.
Let's define a Container type (trait) whose basic purpose is to wrap an element into a single-element collection M[_]. The Container type class declares the operators apply and get, which are to be implemented by the wrapper associated with a predetermined collection.
- apply creates a collection M[T] of elements of type T.
- get retrieves the first and only element of the collection.

Note: The type of the element in the collection is not a type parameter of the Container: T is the type parameter of the collection, and the type M[_] is the parameter of the container. The Container trait is defined as:
trait Container[M[_]] {
  def apply[T](t: T): M[T]
  def get[T](c: M[T]): T
}
We can create a type class instance for ArrayBuffer to "wrap" any object by overriding the apply and get operators. The implicit instance of Container for ArrayBuffer is obviously independent from the type T of the contained element (validation of method arguments is omitted). The compose method below illustrates how the type class is used: it retrieves the implicit Container for M and combines the contents of two single-element collections into a collection of the pair.
implicit val arrayBuffer = new Container[ArrayBuffer] {
  override def apply[T](t: T) : ArrayBuffer[T] =  ArrayBuffer[T](t)
  override def get[T](ab: ArrayBuffer[T]) : T = ab.head 
}

def compose[M[_]: Container, X, Y](c1: M[X], c2: M[Y]): M[(X, Y)] = {
   val c = implicitly[Container[M]]
   c((c.get(c1), c.get(c2)))
}

Monads
One of the best-known applications of type class polymorphism is the monadic representation of collections. The concept and theoretical foundation of monads was discussed in a previous post. Monads define the minimum set of operations for a functional category of objects. The Scala standard library includes quite a few categories and their associated monadic representations: List[T], Option[T], ..
Let's extend our simple Container to define a monad for collections containing a single element. The monad must define a constructor, apply; a functor, map, which converts a single-element collection into another single-element collection by transforming its only element; and a flatMap, which generates a new single-element collection from the element of an existing collection.

trait _Monad[M[_]] {
  def apply[T](t: T): M[T]
  def map[T, U](m: M[T])(f: T => U): M[U]
  def flatMap[T, U](m: M[T])(f: T => M[U]): M[U]
}
As with the Container trait, we can create an implicit monad instance for single-element ArrayBuffer so that the monadic operators can be used transparently by the client code.
implicit val mArrayBuffer = new _Monad[ArrayBuffer] {
  override def apply[T](t:T): ArrayBuffer[T] =  ArrayBuffer[T](t)
  override def map[T, U](m: ArrayBuffer[T])(f: T => U): ArrayBuffer[U] = ArrayBuffer[U](f(m.head))
  override def flatMap[T, U](m: ArrayBuffer[T])(f: T => ArrayBuffer[U]): ArrayBuffer[U] = f(m.head) 
}

Monads will be discussed in length in a future post.

References
* An Overview of the Scala Programming Language, 2nd Edition - M. Odersky, P. Altherr, V. Cremet, I. Dragos, G. Dubochet, B. Emir, S. McDirmid, S. Micheloud, N. Mihaylov, M. Schinz, E. Stenman, L. Spoon, M. Zenger
* Programming in Scala, 2nd Edition - M. Odersky, L. Spoon, B. Venners - Artima 2011
* Fabulous Adventures in Coding: Monads, part one - E. Lippert - 2013
* github.com/prnicolas