Monday, August 8, 2016

Spark ML pipelines I - Features encoding

Target audience: Intermediate
Estimated reading time: 10'

Apache spark introduced machine learning (ML) pipeline in version 1.4.0. A pipeline is actually a workflow or sequence of tasks that cleanse, filter, train, classify, predict and validate data set. Those tasks are defined as stage of the pipeline.
Spark 2.0 extends ML pipelines to support persistency, while the MLlib package is slowly deprecated in favor of the data frame and data set based ML library.
ML pipeline allows data scientist to weave transformers and estimators into a single monotonic classification and predictive models.

This first post related to ML pipeline implement a feature encoding pipeline

ML Pipelines 101
This section is a very brief overview of ML pipelines. Further information is available at
The key ingredient of a ML pipeline are

  • Transformers are algorithms which can transform one DataFrame into another DataFrame. Transformers are stateless
  • Estimators are algorithm which can be fit on a DataFrame to produce a Transformer (i.e.
  • Pipelines are estimators that weave or chain multiple Transformers and Estimators together to specify an ML workflow
  • Pipeline stages are the basic element of the ML workflow. Transformers, estimators and pipelines are pipeline stages
  • Parameters encapsulate the tuning and configuration parameters requires for each Transformers and Estimators.
Note: The following implementation has been tested using Apache Spark 2.0

Features encoding pipeline
Categorical features have multiple values or instances which are represented as a string. Numerical value as categorized or bucketed through a conversion to a string.
Once converted to a string, the categorical values are converted into category indices using the StringIndex transformer. The resulting sequence of indices is ranked by decreasing order of frequency of the value in the training or validation set.
Next, the indices associated to particular features are encoded into a vector of binary value (0, 1) through the OneHotEncoder transformer. A feature instance is encoded as 1 if is defined in the data point, 0 otherwise.
Finally, the vector of binary values of all the feature are aggregated or assembled through the transformer VectorAssembler
Let's consider the following simple use case: A sales report list the date an item is sold, its id, the sales region and name of the sales person or agent.
   date         id         region    agent
   07/10/2014   23c9a89d   17        aa4
The encoding pipeline is illustrated in the following diagram:

The data source is a CSV sales report file. Each column or feature is converted to a string index, then encoded as a vector of binary values. Finally, the 4 vectors are assembled into a single features vector.

The first step is to create a Spark 2.x session. Let's wrap the spark session configuration into a single, reusable trait, SparkSessionManager

trait SparkSessionManager {
  protected[this] val sparkSession = SparkSession.builder()
    .config(new SparkConf()
    .set("spark.default.parallelism", "4")
    .set("spark.rdd.compress", "true")
    .set("spark.executor.memory", "8g")
    .set("spark.shuffle.spill", "true")
    .set("spark.shuffle.spill.compress", "true")
    .set("", "lzf")

  protected def csv2DF(dataFile: String): DataFrame ="header", true)
                    .option("inferSchema", true)

  def stop: Unit = sparkSession.stop

The SparkSession is instantiated with the same configuration SparkConf as the SparkContext used in Spark 0.x and 1.x versions.
The method csv2DF loads the content of CSV file and generate a data frame or data set.

The next step consists of creating the encoding workflow or pipeline, wrapped into the DataEncoding

trait DataEncoding {
  protected[this] val colNames: Array[String]
  lazy val vecColNames =
  lazy val pipelineStages: Array[PipelineStage] = => new StringIndexer()
       .setOutputCol(index(colName))) ++ => new OneHotEncoder()
       .setOutputCol(vector(colName))) ++
    Array[PipelineStage](new VectorAssembler()

  def index(colName: String): String = s"${colName}Index"
  def vector(colName: String): String = s"${colName}Vector"

The sequence of names of the columns (features) colNames is an abstract value that needs to be defined for the specific training set (line 2). The first two stages of the data encoding pipeline, String indexing (line 6-8) and encoding (line 9-11) are applied to each of the 4 features. The last stage, assembling the features vector (line 12, 13) is added to the array of the previous stages to complete the pipeline. Each stage is defined by its input column setInputCol and output column/feature setOutputCol

Let's leverage the Spark session and the data encoding to generate a classification model, ModelFactory. Classification and predictive models are built using ML pipelines as described in a future post. For now let's create a pipeline model using the data encoding stages

val dataEncoder = new DataEncoding {
  override protected[this] val colNames: Array[String]= 
    Array[String]("date", "id", "region", "agent")
  def pipelineModel(df: DataFrame): PipelineModel = 
     new Pipeline().setStages(pipelineStages).fit(df)

The pipeline model is generated in two steps:
  • Instantiate the pipeline from the date encoding stages
  • Generate the model from a input or training data frame, df by invoking the fit method.
The next post extends the data encoding pipeline with two estimators: a classifier and a cross validator.