Sunday, March 28, 2021

MLOps for data scientists

Target audience: Beginner
Estimated reading time: 20'



This post is a high-level introduction to the key components of MLOps from the data scientist's perspective. MLOps addresses the lack of reliability and transparency in the development and deployment of machine learning models.

AI Productization overview

MLOps is a collection of tools supporting the lifecycle of data-centric AI: train models, conduct error analysis to identify the types of data the algorithm performs poorly on, acquire more data via data augmentation, address inconsistent definitions of the data labels, and ultimately use production data for continuous refinement of the model.
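The error analysis step can be illustrated with a small sketch that computes the error rate per data slice and reports the slice the model handles worst. The slice names, labels, and the helper `error_rate_by_slice` are hypothetical and only serve the illustration; a real project would read these records from an evaluation run.

```python
from collections import defaultdict


def error_rate_by_slice(records):
    """Return {slice_name: error_rate} for (slice_name, label, prediction) tuples."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for slice_name, label, prediction in records:
        totals[slice_name] += 1
        if label != prediction:
            errors[slice_name] += 1
    return {name: errors[name] / totals[name] for name in totals}


# Hypothetical evaluation records: (data slice, true label, predicted label)
records = [
    ("daylight", "cat", "cat"),
    ("daylight", "dog", "dog"),
    ("daylight", "dog", "dog"),
    ("daylight", "cat", "dog"),
    ("low_light", "cat", "dog"),
    ("low_light", "dog", "dog"),
]

rates = error_rate_by_slice(records)
# The slice with the highest error rate is the first candidate for more or better data
worst_slice = max(rates, key=rates.get)
print(rates)
print(worst_slice)
```

In this made-up data set the low-light slice has the highest error rate, which would suggest collecting or augmenting low-light examples before changing the model.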

MLOps seeks to automate the training and validation of ML models and improve their quality, while also addressing business and regulatory requirements. It integrates the functions of data engineering, data science and DevOps into a single predictable process in the following areas:
  • Deployment and automation
  • Reproducibility of models and predictions
  • Diagnostics
  • Governance and regulatory compliance (SOC 2, HIPAA)
  • Scalability and latency
  • Collaboration
  • Business use cases & metrics
  • Monitoring and management
  • Technical support

Predictable ML lifecycle

MLOps defines ML lifecycle management, including integration with model generation and the software development cycle (Jira, GitHub), continuous testing and delivery, orchestration, deployment, health, diagnostics, performance governance, and business metrics. From the data science perspective, MLOps defines the continuous and iterative collection and pre-processing of data, model training and evaluation, and deployment in production.

Data-centric AI

Andrew Ng introduced the concept of data-centric AI. He proposes shifting the focus of AI practitioners from model and algorithm development to the quality of the data used to train the models. In the traditional, model-centric approach to AI, data is collected to train and validate a given model with limited regard for the quality of the data.
Data-centric AI improves the odds that AI projects and machine learning models succeed when they are deployed in the real world.


Fig 1. Overview of continuous development in data-centric AI - courtesy Andrew Ng

There are several differences between traditional model-centric AI and data-centric AI:

| Model-centric AI | Data-centric AI |
|---|---|
| Goal is to collect all the data you can and develop a model good enough to deal with noise and avoid overfitting. | Goal is to select a subset of the training data with the highest consistency and reliability so that multiple models perform well. |
| Hold the data fixed and iteratively improve the model and code. | Hold the model and code fixed and iteratively improve the data. |
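Iteratively improving the data often starts with finding examples whose labels are inconsistent. The following is a minimal sketch, assuming each example has been labeled by one or more annotators. The function name `find_inconsistent_labels`, the `min_agreement` threshold, and the sample annotations are hypothetical.

```python
from collections import Counter, defaultdict


def find_inconsistent_labels(annotations, min_agreement=1.0):
    """Flag examples whose most frequent label has a share below min_agreement.

    annotations: iterable of (example_id, label) pairs, one pair per annotator.
    Returns {example_id: {label: count}} for the flagged examples.
    """
    labels_by_example = defaultdict(Counter)
    for example_id, label in annotations:
        labels_by_example[example_id][label] += 1

    flagged = {}
    for example_id, counts in labels_by_example.items():
        majority_count = counts.most_common(1)[0][1]
        majority_share = majority_count / sum(counts.values())
        if majority_share < min_agreement:
            flagged[example_id] = dict(counts)
    return flagged


# Hypothetical annotations for a defect-detection data set
annotations = [
    ("img_1", "scratch"),
    ("img_1", "scratch"),
    ("img_2", "scratch"),
    ("img_2", "dent"),
    ("img_3", "dent"),
]

flagged = find_inconsistent_labels(annotations)
print(flagged)
```

Flagged examples can then be relabeled after clarifying the label definition, or excluded from the training subset, while the model and code stay fixed.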


Repeatable processes

Predictable delivery of ML products or services relies on three elements
  • Repeatable process
  • Lifecycle management tools
  • Product management
The goal is to apply known, repeatable software development practices (Scrum, Kanban, ...) and DevOps best practices to the training and validation of ML models. Moving model training, tuning and validation into operations and automation increases the number of tasks that are controllable and predictable.
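One way to make validation repeatable is an automated gate that a pipeline runs after each training job. The sketch below is an assumption about how such a gate might look, not a prescribed method: the function `should_promote`, the metric names, and the `min_gain` margin are hypothetical, and it assumes that higher metric values are better.

```python
def should_promote(candidate_metrics, baseline_metrics,
                   required=("accuracy",), min_gain=0.0):
    """Return True only if the candidate reaches baseline + min_gain on every required metric.

    Assumes higher is better for all metrics listed in `required`.
    """
    for name in required:
        # A missing metric fails the gate instead of being silently ignored
        if name not in candidate_metrics or name not in baseline_metrics:
            return False
        if candidate_metrics[name] < baseline_metrics[name] + min_gain:
            return False
    return True


# Hypothetical metrics produced by an automated validation run
baseline = {"accuracy": 0.90, "recall": 0.80}
candidate = {"accuracy": 0.91, "recall": 0.78}

print(should_promote(candidate, baseline))
print(should_promote(candidate, baseline, required=("accuracy", "recall")))
```

Because the decision is a pure function of recorded metrics, the same inputs always give the same outcome, which is what makes the step controllable and predictable.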


Fig 2. Productization of training and validation of models


As illustrated in Fig. 1, the deployment process in model-centric AI leaves little room for integrating new data into the training and validation of the model. In the data-centric AI approach, the model is deployed very early in the development cycle, allowing for continuous integration and updates of the model(s) with feedback and new data.


AI lifecycle management tools

Quite a few open source tools have been introduced over the last 3 years to support the introduction and implementation of MLOps throughout the entire engineering organization.

Although most of the development tools commonly used by software engineers are applicable to MLOps, some dedicated ML lifecycle tools have been introduced over the last couple of years:

  • DVC manages version control for ML projects
  • Polyaxon provides data scientists with lifecycle automation in a collaborative environment
  • MLflow manages the entire ML lifecycle, from experimentation to deployment. It includes a model registry for managing the various versions of a model
  • Kubeflow provides workflow automation and deployment on Kubernetes
  • Metaflow manages the automation pipeline and deployment
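To make the idea of a model registry concrete, here is a toy in-memory sketch. It is not the API of MLflow or of any tool listed above. The class names, stage names, and artifact paths are all hypothetical and only illustrate what tracking model versions involves.

```python
from dataclasses import dataclass


@dataclass
class ModelVersion:
    version: int
    artifact_uri: str
    metrics: dict
    stage: str = "staging"


class ModelRegistry:
    """Toy registry: numbered versions per model name, at most one in production."""

    def __init__(self):
        self._models = {}

    def register(self, name, artifact_uri, metrics):
        versions = self._models.setdefault(name, [])
        entry = ModelVersion(len(versions) + 1, artifact_uri, dict(metrics))
        versions.append(entry)
        return entry

    def promote(self, name, version):
        """Move one version to production and archive the previous production version."""
        versions = self._models[name]
        if not any(entry.version == version for entry in versions):
            raise ValueError(f"unknown version {version} for model {name!r}")
        for entry in versions:
            if entry.version == version:
                entry.stage = "production"
            elif entry.stage == "production":
                entry.stage = "archived"

    def production_version(self, name):
        for entry in self._models.get(name, []):
            if entry.stage == "production":
                return entry
        return None


registry = ModelRegistry()
first = registry.register("churn", "models/churn/1", {"auc": 0.81})
second = registry.register("churn", "models/churn/2", {"auc": 0.84})
registry.promote("churn", 1)
registry.promote("churn", 2)
print(registry.production_version("churn"))
```

A production registry would persist this state and store the artifacts themselves; the sketch only shows the version and stage bookkeeping.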

AutoML frameworks are increasingly used for rapid ML development, similar to GUI development.


Canary, frictionless release

A robust testing and deployment process is critical to the success of any AI project. A canary release makes the migration of a model from a development or staging environment to production frictionless. The process consists of routing a percentage of requests to a new version or sandbox according to criteria defined by the product manager (modality, customer, metrics, ...). This approach reduces the risk of a failed production deployment because there is no need for a rollback: it is just a matter of stopping the traffic to the new version.
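The routing step can be sketched as follows, assuming requests carry a stable identifier such as a customer id. The function `route_request` and the identifiers are hypothetical. Hashing the identifier, rather than drawing a random number, keeps a given customer on the same version across requests.

```python
import hashlib


def route_request(request_id, canary_percent):
    """Return "canary" or "stable" for a request, deterministically by hashed id."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return "canary" if bucket < canary_percent else "stable"


# Hypothetical customer ids; roughly 10% of them should reach the new version
customer_ids = [f"customer-{i}" for i in range(10_000)]
canary_count = sum(route_request(cid, 10) == "canary" for cid in customer_ids)
print(canary_count / len(customer_ids))

# Stopping the traffic to the new version is a configuration change, not a redeployment
assert all(route_request(cid, 0) == "stable" for cid in customer_ids)
```

Setting `canary_percent` to 0 sends every request back to the stable version, which is the "no rollback needed" property described above. Raising it gradually to 100 completes the release.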


References