Monday, June 21, 2021

Open Source Lambda architecture for deep learning

Target audience: Beginner
Estimated reading time: 15'

The world of data scientists accustomed to Python's scientific libraries has been shaken up by the emergence of 'big data' frameworks such as Apache Hadoop, Spark, and Kafka. This presentation introduces a variant of the Lambda architecture and briefly describes the seamless integration of various open source components. This post is a high-level overview of the key services of a typical architecture.

Core data flow

The concept and architecture are versatile enough to accommodate a variety of open source and commercial solutions and services besides the frameworks prescribed in this presentation. The open source framework PyTorch is used to illustrate the integration of big data frameworks such as Apache Kafka and Spark with a deep learning library to train, validate, and test deep learning models.

Alternative libraries such as Keras or TensorFlow could also be used.

Let's consider the use case of training and validating a deep learning model, using Apache Spark to load, parallelize, and preprocess the data. Apache Spark takes advantage of a large number of servers and CPU cores.

In this simple design, the workflow is broken down into 6 steps:
  1. Apache Spark loads and parallelizes the training data from AWS S3
  2. Spark distributes the data pre-processing, cleansing, and normalization across multiple worker nodes
  3. Spark forwards the processed data to the PyTorch cluster
  4. Flask converts requests into prediction queries to the PyTorch model
  5. The PyTorch model generates a prediction
  6. Run-time metrics are broadcast through Kafka
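To make step 2 concrete, the kind of per-record transformation Spark would distribute across worker nodes can be written as a plain Python function. This min-max normalization and the function name are illustrative, not prescribed by the architecture:

```python
def min_max_normalize(features, mins, maxs):
    """Scale each feature to [0, 1], given per-feature min and max.

    A Spark worker could apply this record by record, e.g.
      rdd.map(lambda row: min_max_normalize(row, mins, maxs))
    """
    return [(x - lo) / (hi - lo) if hi > lo else 0.0
            for x, lo, hi in zip(features, mins, maxs)]
```

Because the function is stateless, Spark can apply it to partitions in parallel without any coordination between workers.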

Key services

PyTorch is an optimized tensor library for deep learning using GPUs and CPUs. It extends the functionality of NumPy and Scikit-learn to support the training, evaluation, and commercialization of complex machine learning models.
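A minimal sketch of a single PyTorch training iteration (steps 3-5 happen around code like this); the layer sizes and random data are arbitrary placeholders, not from the original post:

```python
import torch
import torch.nn as nn

# A tiny two-layer regression model; dimensions are illustrative.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(16, 4)        # mini-batch: 16 samples, 4 features each
y = torch.randn(16, 1)        # target values

optimizer.zero_grad()
loss = loss_fn(model(x), y)   # forward pass
loss.backward()               # back-propagation
optimizer.step()              # one gradient-descent update
```

In the architecture above, the mini-batches would come from Spark's pre-processed partitions rather than `torch.randn`.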

Apache Spark is an open source cluster computing framework for fast real-time processing. 
It supports Scala, Java, Python and R programming languages and includes streaming, graph and machine learning libraries.

Apache Kafka is an open-source distributed event streaming framework for large-scale, real-time data processing and analytics. 
It captures data from various sources in real time as a continuous flow and routes it to the appropriate processor. 
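For step 6, the run-time metrics would be serialized before being published to a Kafka topic. The serialization below is testable on its own; the producer call is shown in a comment because it needs a running broker, and the topic name and broker address are assumptions:

```python
import json

def encode_metrics(metrics: dict) -> bytes:
    """Serialize run-time metrics into the bytes payload of a Kafka message."""
    return json.dumps(metrics).encode("utf-8")

# With a broker running, publishing could look like this (kafka-python client):
#   from kafka import KafkaProducer
#   producer = KafkaProducer(bootstrap_servers="localhost:9092")
#   producer.send("model-metrics", encode_metrics({"latency_ms": 12.5}))
```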

Ray Tune is a distributed hyperparameter tuning framework particularly suitable for deep learning models. It significantly reduces the cost of optimizing the configuration of a model, and acts as a wrapper around other open source optimization libraries. 
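Ray Tune searches over a configuration dictionary by repeatedly evaluating an objective. The toy quadratic objective below stands in for a model's validation loss; the `tune.run` call is sketched in a comment because it requires a Ray installation, and the search space is illustrative:

```python
def objective(config):
    # Toy stand-in for a validation loss, minimized at lr = 0.1.
    return (config["lr"] - 0.1) ** 2

# With Ray installed, a grid search over learning rates could look like (API sketch):
#   from ray import tune
#   analysis = tune.run(
#       lambda cfg: tune.report(loss=objective(cfg)),
#       config={"lr": tune.grid_search([0.01, 0.1, 1.0])})
```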

Apache Hive is an open source data warehouse platform that facilitates reading, writing, and managing large datasets residing in distributed storage such as Hadoop and Apache Spark.

Flask is a Python-based web development platform built as a micro-framework to support the REST protocol. Its minimalist approach to web interfaces makes it a very intuitive tool for building micro-services.
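Steps 4 and 5 of the workflow can be sketched as a single Flask endpoint. The route name, request shape, and the averaging `predict` stub (which stands in for a call into the PyTorch model) are all illustrative assumptions:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict(features):
    # Placeholder for the PyTorch model call (step 5); returns a dummy score.
    return sum(features) / len(features)

@app.route("/predict", methods=["POST"])
def predict_route():
    # Step 4: convert the HTTP request into a prediction query.
    features = request.get_json()["features"]
    return jsonify({"prediction": predict(features)})
```

A client would POST a JSON body such as `{"features": [1.0, 3.0]}` and receive the prediction back as JSON.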

Amazon Simple Storage Service (S3) is a highly available, secure object storage service with very high durability (eleven 9's), scalability, and support for versioning. It is versatile enough to accommodate any kind of data format.


References

Apache Spark
Apache Kafka
Ray Tune
Apache Hive
Flask - Pallets project
Amazon S3

This informational post introduced the high-level components of a Lambda architecture. Such an orchestration of services is the foundation of the iterative machine learning modeling concept known as MLOps, which will be discussed in a future post.