Saturday, December 15, 2012

Scala for Java 7 Developers: Overview

The purpose of this post is to introduce the Scala programming language from the perspective of an experienced Java developer. It is assumed that the reader has a good command of design patterns and multi-threaded applications.
In a nutshell, Scala is:
- Object-oriented
- Functional
- Scalable (Actors and immutability)

The elements of the Scala programming language presented in this post are a small subset of the rich set of features developers would benefit from the language.

Note: For the sake of readability of the implementation of algorithms, all non-essential code such as error checking, comments, exception, validation of class and method arguments, scoping qualifiers or import is omitted

Scala relies on immutable object and collections to reduce complexity associated with multi-threaded applications, eliminate race condition, simplify debugging.
The CRU (Create, Read, Update) object lifecycle in Java is reduced to CR (Create, Read) in Scala.
The Scala standard library offers a large variety of immutable collections.

val x = 23.5  // Read only
val mp = new immutable.Map[String, String]
val xs = List[Any]()

Traits vs. interface
Java interfaces are purely declarative. Scala traits allows:
* method implementation
* member value initialization
* abstract and lazy values (to be initialized through sub-class or on demand)
Scala traits cannot be instantiated and do not have constructors. Scala emulates multiple inheritance by stacking traits (Cake pattern).

trait MyObject[T] {
  val tobeDefined: Int
  def add(t: T): Option[T] = {}
  def head: T = { }

Lazy values and methods
Scala supports lazy values that are declared as member of a class but instantiated once and only once when they are accessed

class MyObject[T] {
  lazy val n = (n<<2) +

3rd party libraries and frameworks cannot always be extended through composition or inheritance. Scala implicit class/types and methods allows a "behind the scene" type conversion. The following code snippet converts a double value into a vector/array of double of size 1. The implicit is automatically invoke whenever it is imported into another section of the code base: for instance, the vector v is constructed through the implicit conversion dbl2Vec.

type DblVector = Array[Double]
  // Definition of implicit
implicit def dbl2Vec(x: Double): DblVector = Array.fill(1)(x)
val v: DblVector = 4.5

The following example extends the concept of Exception handling, Try with a concept of Either converting any instance of Try to an instance of Either using toEither method of the implicit class Try2Either. The implicit class conversion is used in handling a rounding error in the normalize function.

implicit class Try2Either[T](_try: Try[T]) {
  def toEither[U](r: ()=>U)(f: T=>T): Either[U, T] = _try match {
    case Success(res) => Right(f(res))
    case Failure(e) => Console.println(e.toString); Left(r())  

 * Method fails and recover if the denominator is too small
def normalize(x: DblVector): Either[Recover, DblVector] = Try {
  val sum = x.sum 
  if(sum > 0.01) _ / sum) 
  else throw new IllegalStateException("prob out of range")  
.toEither(() => new Recover)((v: DblVector) => Math.log(_)))

Tail recursion
Although convenient, traditional recursive implementation of computation can be ineffective, leaving developers to rely on iterative algorithms.
Scala's tail recursion is a very effective way to iterate through a data sequence without incurring a performance overhead because it avoids the creation of new stack frames during the recursion.
The following code snippet implements the Viterbi algorithm used to extract the most likely sequence of latent states in an hidden Markov model given a lambda model and a set of observations.

private def viterbi(t: Int): Double = {

  if( t == 0) initial   
  else { 
    Range(0, lambda.getN).foreach( updateMaxDelta(t, _) )   
    if( t ==  obs.size-1) {
        val idxMaxDelta = Range(0, lambda.getN).map(i => 
                    (i,, i))).maxBy(_._2)
        state.QStar.update(t+1, idxMaxDelta._1)
    else viterbi(t+1)

Actor model
Traditional multi-threaded applications rely on accessing data located in shared memory. The mechanism relies on synchronization monitors such as locks, mutexes or semaphores to avoid deadlocks and inconsistent mutable states. Those applications are difficult to debug because of race condition and incur the cost of a large number of context switches.
The Actor model addresses those issues by using immutable data structures (messages) and asynchronous (non-blocking) communication.
The actors exchange messages and data using immutable messages as described in the example below.

sealed abstract class Message(val id: Int)
case class Activate(i: Int, xt: Array[Double]) extends Message(i)
case class Completed(i: Int, xt: Array[Double]) extends Message(i)
case class Start(i: Int =0) extends Message(i)

An actor processes messages queued in its mailbox asynchronously, FIFO. The method, receive which returns a PartialFunction is the message processing loop, which correspond to the in Java

final class Worker(
     id: Int, 
     fct: DblSeries => DblSeries) 
extends Actor {
    // Actor event/message loop
  override def receive = {
    case msg: Activate => {
      val msgId =
      val output = fct(msg.xt)
      sender ! Completed(msgId, output)

Syntactic similarities

Java developers will find comfort knowing that Scala preserves a lot of constructs and idioms which make Java 7 such a wide spread and supported programming language.
Construct ScalaJava 7
Variable         var x: Float = 1.0F  float x = 1.0F;
Value val n = 4L
Constants final val MAX_TEMP: Float = 232.0F final float MAX_TEMP = 232.0F;
Class class Person { ...} public class Person { }
Constructor class Person(name: String, age: Int) {} public class Person {
....private String name;
....private int age;
....public Person(String name, int age) {}
Setters Not applicable
public class Person {
....public void setName(String name) {}
Getters class Person(val name: String, val age: Int{}
....//Public access to member values.
public class Person {
....public final String getName() {}
Method Arguments validation def add(index: Int): Unit = {
....require(index>=0 && index<size, .... s"$index out of bounds")
public void add(int index) {
....if( index < 0 || index >= size) {
........ throw new IllegalArgumentException(...);
Singleton object Scientist extends Person {} public class scientist extends Person {
.... private static Scientist instance = new Scientist();
.... private Scientist();
.... public static Scientist getInstance() {
........ return instance;
... }
Overriding override def getName(id : Int) : String {.. } @override
public final getName(int id) { ... }
Interfaces trait Searchable {
.....def search(id : Int) : String
interface Searchable {
....String search(int id);
Type Casting n : Int = o.asInstanceOf[Int] Integer n = (Integer)o;
Type Checking n.isInstanceOf[Int] n instanceOf Integer
Generics class Stack[T](val maxSize : Int) {
....def push(t : T) : Unit = { ... }
....def pop : T = { ... }
public class Stack <T> {
........private int maxSize;
........public Stack(int maxSize) { ... }
........public void push(T t) {.... }
........public <T> pop() {... }
Abstract Class abstract class Collection[T](val max : Int) {
....def add(t : T) : Unit
....def get(index : Int) : Int
abstract public class Collection <T> {
........abstract public void add(T t);
........abstract public <T> get(int index);
Exception Try { .... }
match { Success(res) =>{}
... case Failure(e) => {}
try { .. }
catch( NullPointerException ex) { ..}
catch( Exception ex) { ...}
Bounded type
class List[T <: Float] {  ... }
class List[T >: Node] { ...}
public class List<T extend Float> { ... }
public class List <T super Node> { .... }
class List[+T] {}
class List[-Node] { ...}
public class List<? extends T> { ... }
public class List<? super Node> { .... }
For loop
for( i <= 0 until n) { .. } for( int i = 0; i < n; i++) { .. }
for(index <- indices) { ... }for(int index : indices) { .. }

Key differences

There are quite a few important and tricky programmatic difference between Scala 2.11 and Java 7

  • == In Scala == compares instances similarly to equals. In Java == compares the reference to instances
  • Comparable integer java.lang.Integer supports comparable interface while scala.Int does not
  • Static method Java supports explicit static methods. Scala does not, although static methods are implicitly defined as method of objects (Singletons)
  • Inner class In Scala, Inner class are accessed through the reference Self to the outer class. Java accesses Inner class through the Outer class
  • Checked exceptions Scala does not support Checked exceptions, Java does.
  • Primitive types As a pure object oriented, Scala does not support primitive types. Java does
  • Expression as a value As a functional language, Scala defines expressions such as if ... else ... as value, that can be evaluated and assigned to. Java does not
  • Immutable collections Scala supports immutable collections by default. Java does not.
  • Localized imports and type aliases Scala has scoped import and type aliases. Java does not
  • Explicit operator overloading Scala allows developers to overload operator through the definition of new methods. Java does not.
  • Actor Scala provides full parallelism with actor and immutable messages. Java does not.
  • Futures Scala futures are asynchronous and can be updated thorough promises. Java does not support the concept
  • Higher order kind Scala has higher order kind, such as Monad. Java does not
  • Tail recursion (or elimination) Scala supports tail recursion in order to avoid stack overflow. Java does not
  • Monad and functors are core concept in Scala, but not in Java.

Note This comparative analysis relies on version 2.11.x of Scala language and Java 7. Future version of Scala and Java may produce a slightly different result.

So much more....
This post barely scratches the surface of the capabilities of Scala which include
  • Futures
  • Partial functions and partially applied functions
  • Dependency injection
  • Monadic representation
  • Macros
  • View bounded type parameterization
  • F-Bounded polymorphism
  • Closures
  • Delimited continuation
  • Method argument currying

Programming in Scala - Odersky, Spoon, Venners- Artima 2007
Scala for the Impatient - C. Horstman - Addison-Wesley 2012

Monday, November 12, 2012

Lazy Evaluation in C++ & Scala

Lazy evaluation is a not a new concept in software engineering. The concept is rather simple: initialize a variable or an object once it is needed. A typical use case is the instantiating of a large object in memory from the content of database: the lifecycle of the object is shorter than the application so it is convenient to allocate the memory when and if needed.
Both C++ and Scala support lazy evaluation using two very different approach: C++ relies on a class constructor to initialize iteratively its attribute member of a class while Scala supports lazy initialization natively.

Note: For the sake of readability of the implementation of algorithms, all non-essential code such as error checking, comments, exception, validation of class and method arguments, scoping qualifiers or import is omitted
Lazy evaluation or fetching can be implemented in C++ using a design pattern. Let consider a large object which state is stored in a remote database or initialized through a request REST API. In many cases, only some of the attributes of the object need to be initialized from the remote data source (table for instance): load the content of the entire table would not make much sense.
How C++ address the problem?
Let's consider a class that contains all the information regarding the visit of a user in a user . 
class SiteVisit {
  int  id;
  mutable string*  url;
  mutable long  visitDate;
  mutable std::list* targets;
  mutable char[4] sources; 
  void loadFromDB(std::list*)

  SiteVisit(int id) : id(id), url(null), visitDate(-1L), targets(null), sources(null) { }    
  const string& getName() const;
  long          getVisitDate() const;
  std::list&    getTargets() const;
 const char[4]& getSources();

It is conceivable that not all the data describing the visit has to be retrieved for analysis or reporting. For instance, there is no need to load the targeted URLs, for all the visits from the database until a report needs to be created. In this case, the role of the constructor in C++ is to initialize the unique identifier for the object (or table unique primary key) and nothing else. The getTargets method retrieves the list of pages visited at a later stage in the application lifecycle. Therefore, this list is generated only once; the first time it is invoked. The method retrieveAllVisitedPages, iterates through all the pages of the site that have been visited.
std::list<string> SiteVisit::getTargets() {
  if( targets == null) {
     targets = new std::list
  return targets;

int retrieveAllVisitedPages(std::list&amp; visits, std::map* visitsMap) const {                           
  std::list::const_iterator visitsIt;
  std::list::const_iterator urlsIt;

  int counter = 0, totalPages = 0;
  for( visitsIt =visits.beging()l visitsIt !=visits.end(); visitsIt++) {
    std::list* urls = visitsIt.getTargets();   
    for( urlsIt = urls.begin(); urlsIt != urls.end(); urlsIt++) {
       counter = visitsMap[urlsIt];
       visitsMap-&gt;insert(std::make_pair(urlsIt, ++counter));
    totalPages += urls.size();
  return totalPages;

Although feasible, the implementation of the lazy initialization is quite cumbersome as it involves at a minimum, a constructor and a method (2 step initialization

Scala supports lazy values and lazy local functions. Under some circumstances, the user may want to initialize a value only when it is required. A database developer would qualify this pattern as a write once, read many times pattern.

Note: The developer needs to keep in mind that lazy values or expressions are checked whether the value has been indeed already initialized, each time it is accessed. Such validation is to be thread-safe, and consequently adds overhead to the run-time execution.
Let's evaluate or experience the concept by writing a simple example in Scala. The SiteVisits class implements the basic function to retrieve the list of URLs/pages a user identified by its id access on a specific web site.

class SiteVisits(userId: Int, webSite: String) {
  lazy val urls =  {
    val urlsList = new ArrayBuffer[String]
   // Initialize the urls list from the database
    query(userId, webSite)
  def isValidWebsite; Boolean = ....
The list of URLs is not loaded from the database when an Instance of SiteVisits is created but when the value urls is read by the client code. For example, the methods of the class SiteVisits such as isValidWebsite can be executed without having to access the list of URLs the user visited on this particular web site. Let's look how lazy evaluation can be applied.
val siteVisit = new SiteVisits(91841, "")
  if( siteVisit.isValidWebSite )  
     Console.println(siteVisit.urls.mkString(", ") 
In the example above, the list of URLS is not loaded if the site is not valid.

  • Effective C++   - S. Meyers  - Addison Wesley 1991
  • More Effective C++  - S. Meyers  - Addison Wesley 1993
  • Programming in Scala  - M. Odersky, L. Spoon, B. Venners  - Artima 2007

Wednesday, October 31, 2012

K-means Clustering in Java II: Classification

The basic components of the implementation of K-means clustering algorithms has been introduced in the previous post K-means clustering in Java: Elements

This second part on the implementation of K-means in Java describes the two main tasks of any unsupervised or supervised learning algorithms:
  • training: executed off-line during analysis of historical data
  • classification: executed at run-time to classify new obsdervations
Note: For the sake of readability of the implementation of algorithms, all non-essential code such as error checking, comments, exception, validation of class and method arguments, scoping qualifiers or import is omitted

The learning method, train, implements the clustering algorithm. It iterates to minimize the sum of distances between all cluster data points & its centroid.
For each iteration (or epoch) the train method:
  1. assign observations to each cluster
  2. compute the centroid for each cluster, computeCentroid
  3. compute the total distance of all the observations with their respective centroid computeTotalDistance
  4. estimate the closest cluster for each observation
  5. re-assign the observation, updateCentroids

public int train() {
  int numIter = _maxIters, k = 0
  boolean inProgress = true;
  while(inProgress) {
     for(KmeansCluster cluster : _clusters ) {
        if( ++k >= _obsList.length) {
           inProgress = false;
  for(KmeansCluster cluster : _clusters ) {
  List<Observation> obsList = null; 
  KmeansCluster closestCluster = null;

   // main iterative method, that traverses all the clusters
   // computes the distance of observations relative to their centroid
   // and re-assign the observations.
  for(int i = 0; i < _maxIterations; i++) {
    for(KmeansCluster cluster : _clusters ) { 
      obsList = new ArrayList<Observation>();
      for( Observation point : cluster.getDataPointsList()) {
      for( Observation point : obsList) {
        double minDistance = Double.MAX_VALUE, distance = 0.0;
        closestCluster = null;
        for(KmeansCluster cursor : _clusters ) {
          distance =  point.computeDistance(cursor.getCentroid());
         if( minDistance >  distance) {
            minDistance = distance;
            closestCluster = cursor;
       updateCentroids(point, cluster, closestCluster);
     // Simple convergence criteria              
   if( _convergeCounter >= _minNumConvergeIters ) {
     numIters= i;
 return numIters;

The classification of a new observations is simple and consists in evaluating the distance between the new data point and each centroid and selecting the cluster with the shortest distance. The classify method extract the index or label of the cluster that is the most suitable (closest in distance) to the new observation.

public int classify(double[] obs) {
  double bestScore = Double.MAX_VALUE, distance = 0.0;
  int clusterId = -1;
  for(int k = 0; k < _centroids.length; k++) {
     distance = _centroids[k].computeDistance(obs);
     if(distance < bestScore) {
        bestScore = distance;
        clusterId = k;
  return clusterId;

The code snippet below implements some of the supporting method to
- compute the loss function value (total distance) - initialize the centroid for each cluster - update the values of centroids.

private void computeTotalDistance() {
  float totalDistance = 0.0F;
  for(KmeansCluster cluster : _clusters ) {
     totalDistance += cluster.getSumDistances();
  double error = Math.abs(_totalDistance - totalDistance);
  convergeCounter = ( error < _convergeCriteria) ? convergeCounter +1 : 0;      
  _totalDistance = totalDistance;

private void initialize() {
   double[] params = getParameters();
   int numVariables = params.length>>1
   double[] range = new double[numVariables];
   for( int k = 0, j = numVariables; k <numVariables; k++, j++ ) {
      range[k] = params[k] - params[j];
   double[] x = new double[numVariables];
   int sz_1 = _clusters.length+1,  m = 1;
   for(KmeansCluster cluster : _clusters) {
      for( int k = 0, j = numVariables; k <numVariables; k++, j++ ) {
         x[k] = ((range[k]/sz_1)*m) + params[j];
private void updateCentroids(Observation point,
                             KmeansCluster cluster, 
                             KmeansCluster bestCluster) {
  boolean update = bestCluster != null && bestCluster != cluster; 
  if( update ) {
     for(KmeansCluster cursor : _clusters ) {

The Elements of Statistical Learning   - T. Hastie, R.Tibshirani, J. Friedman  - Springer 2001
Machine Learning: A Probabilisitc Perspective K-means algorithm - K. Murphy - MIT Press 2012
Pattern Recognition and Machine Learning: Chap 9 "Mixture Models and EM: K-means Clustering" C.Bishop - Springer Science 2006

Wednesday, October 10, 2012

K-means Clustering in Java I: Elements

Among the clustering methods have been developed over the years from Spectral clustering, Non-negative Matrix factorization, Canopy to Hierarchical and K-means clustering. The K-means algorithm is by far the easiest to implement. This simplicity comes with a high price in terms of scalability and even reliability. However, as a unsupervised learning technique, K-means is still valuable for reducing the number of model features or detecting anomalies.
The objective is to classify observations or data points by groups that share common attributes or features. The diagram below illustrates the clustering of observations (x,y) for a simple 2-feature model

Each cluster has a mean or centroid, m = ( .. m..). First we need to define a distance between an observation  X = (...x ..) and c. The Manhattan and Euclidean distances are respectively defined as:  \[d_{M} = \sum_{i=0}^{n}\left | x_{i} - m_{i}\right |\,\,\,,\,\,d_{E}= \sum_{i=0}^{n} (x_{i} - m_{i})^{2}\] The loss function for N cluster Cj is defined by \[W(C)=\frac{1}{2}\sum_{k=0}^{N-1}\sum_{c_{i}=k}\sum_{C_{j}} d(x_{i},x_{j})\]  The goal is to find the centroid m, and clusters C, that minimize the loss function as: \[C^{*}\left (i \right ) = arg\min _{k\in [0,N-1]}d (x_{i}, m_{k})\]

Note: For the sake of readability of the implementation of algorithms, all non-essential code such as error checking, comments, exception, validation of class and method arguments, scoping qualifiers or import is omitted
Distances and Observations
First we need to define the distance between each observation and the centroid of a cluster. The class hierarchy related to the distance can be implemented as nested classes as there is no reason to "expose" to client code. The interface, Distance,define the signature of the computation method. For sake of simplicity, the sample code implements only the Manhattan and Euclidean distances.  Exceptions, validation of method arguments, setter and getter methods are omitted for the sake of simplicity.

protected interface Distance {
  public double compute(double[] x, Centroid centroid);
    // Defintion of d(x,y) =|x-y|
protected class ManhattanDistance implements Distance {
  public double compute(double[] x, Centroid centroid) {
     double sum = 0.0, xx = 0.0;
     for( int k = 0; k< x.length; k++) {
       xx = x[k] - centroid.get(k);
       if( xx < 0.0) {
         xx = -xx;
       sum += xx;
     return sum;

  // Definition d(x,y)= sqr(x-y) 
protected class EuclideanDistance implements Distance {
  public double compute(double[] x, Centroid centroid) {
    double sum = 0.0, xx = 0.0;
    for( int k = 0; k < x.length; k++) {
       xx = x[k] - centroid.get(k);
       sum += xx*xx;
    return Math.sqrt(sum);

Next, we define an observation (or data point) as a vector or array of floating point values, in our example.  An observation can support heterogeneous types (boolean, integer, float point,..) as long as they are normalized to [0,1]. In our example we simply normalized over the maximum values for all the observations.

public final class Observation {
   // use Euclidean distance that is shared between all the instances

  private static Distance metric = new EuclideanDistance();
  public static void setMetric(final Distance metric) {
    this.metric = metric;
  private double[] _x  = null;
  private int  _index  = -1;

  public Observation(double[] x, int index) { 
    _x = x; 
    _index = index; 
   // compute distance between each point and the centroid
  public double computeDistance(final Centroid centroid) {
     return metric.compute(_x, centroid);
    // normalize the value of data points.
  public void normalize(double[] maxValues) {
     for( int k = 0; k < _x.length; k++) {
        _x[k] /= maxValues[k];

Centroid for each cluster are computed iteratively to reduce the loss function.  The centroid values are computed using the mean of each feature across all the observations. The method Centroid.compute initialize the data points belonging to a cluster with the list of observations and compute the centroid values _x by normalizing with the number of points.
protected class Centroid {
  private double[] _x = null;       
  protected Centroid() {}
  protected Centroid(double[] x) {
    Array.systemCopy(_x, x, 0, x.length, sizeOf(double));
    // Compute the centoid values _x by normalizing with the number of points.
  protected void compute(final List<Observation> observations)  {
    double[] x = new double[_x.length];
    Arrays.fill(x, 0.0);
    for( Observation point : observations ) {
      for(int k =0; k < x.length; k++) {
        x[k] += point.get(k);
    int numPoints = observations.size();
    for(int k =0; k < x.length; k++) {
      _x[k] = x[k]/numPoints;

A cluster is defined by its label (index in this example) a centroid, the list of observations it contains and the current loss associated to the cluster (sum of the distance between all observations and the centroid).
The cluster behavior is defined by the following methods:
  • computeCentroid: compute the sum of the distance between all the point in this cluster and the centroid.
  • attach: Attach or add a new observation to this cluster
  • detach: Remove an existing observations from this cluster.

public final class KmeansCluster {
  private int       _index   = -1;
  private Centroid  _centroid  = null; 
  private double    _sumDistances  = 0.0;
  private List<observation> _observations = new ArrayList<Observation>()

  public void computeCentroid() {
    _centroid.compute( _observations );
    for( Observation point : _observations  ) {

     // Attach a new observation to this cluster.
  public void attach(final Observation point) { 

  public void detach(final Observation point) {
  private void computeSumDistances() { 
    _sumDistances = 0.0;     
    for( Observation point : _observations) {
      _sumDistances += point.computeDistance(_centroid);

Finally, the clustering class implements the training and run-time classification. The train method iterates across all the clusters and for all the observations to reassign the observations to each cluster. The iterative computation ends when either the loss value converges or the maximum number of iterations is reached. 
If the algorithm use K clusters with M observations with N variables the execution time for creating the clusters is K*M*N. If the algorithm converges after T iterations then the overall execution is T*K*M*N. For instance, the K-means classification of 20K observations and data with 25 dimension, using 10 clusters, converging after 50 iterations requires  250,000,000 evaluations! The constructor create the clustering algorithm with a predefined number of cluster, K, and a set of observations.
The method getCentroids retrieves the current list of centroids (value of centroid vectors)

public final class KmeansClustering { 
  private KmeansCluster[] _clusters = null;
  private Observation[] _obsList = null; 
  private double _totalDistance  = 0.0;
  private Centroid[] _centroids = null;
  public KmeansClustering(int numClusters, final Observation[] obsList) {   
     _clusters = new KmeansCluster[numClusters];
     for (int i = 0; i < numClusters; i++) {
        _clusters[i] = new KmeansCluster(i);
     _obsList = obsList;
  public final List<double[]> getCentroids() {
      List<double[]> centroidDataList = null;

      if(_clusters != null &&; _clusters.length < 0) {
         centroidDataList = new LinkedList<double[]>();
         for( KmeansCluster cluster : _clusters) {
      return centroidDataList;

The second section (next post) describes the implementation of the training and classification tasks.

The Elements of Statistical Learning   - T. Hastie, R.Tibshirani, J. Friedman  - Springer 2001
Machine Learning: A Probabilisitc Perspective K-means algorithm - K. Murphy - MIT Press 2012
Pattern Recognition and Machine Learning: Chap 9 "Mixture Models and EM: K-means Clustering" C.Bishop - Springer Science 2006