Monday, June 21, 2021

Open Source Lambda architecture for deep learning

Target audience: Beginner
Estimated reading time: 10'

The world of data scientists accustomed to Python scientific libraries has been shaken up by the emergence of 'big data' frameworks such as Apache Hadoop, Spark and Kafka. This presentation introduces a variant of the Lambda architecture and describes the seamless integration of various open source components.

Basic data flow

The concept and architecture are versatile enough to accommodate a variety of open source and commercial solutions and services besides the frameworks prescribed in this presentation. The open source framework PyTorch is used to illustrate how big data frameworks such as Apache Kafka and Spark integrate with a deep learning library to train, validate and test deep learning models.

Alternative libraries such as Keras or TensorFlow could also be used.

Let's consider the use case of training and validating a deep learning model, using Apache Spark to load, parallelize and preprocess the data. Apache Spark takes advantage of a large number of servers and CPU cores.

In this simple design, the workflow is broken down into 6 steps:
  1. Apache Spark loads, then parallelizes, the training data from AWS S3
  2. Spark distributes the data pre-processing, cleansing and normalization across multiple worker nodes
  3. Spark forwards the processed data to the PyTorch cluster
  4. Flask converts requests into prediction queries to the PyTorch model
  5. The PyTorch model generates a prediction
  6. Run-time metrics are broadcast through Kafka
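
To make the data flow concrete, here is a minimal sketch of the six steps using only the Python standard library. Every component is a stand-in and an assumption of this example, not the actual integration: an in-memory list replaces S3, a thread pool replaces Spark workers, a least-squares line replaces the PyTorch model, a plain function call replaces the Flask endpoint, and a list of JSON strings replaces a Kafka topic. All function names are hypothetical.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def load_records():
    """Step 1: stand-in for loading training data from S3."""
    records = [{"x": float(i), "y": 2.0 * i + 1.0} for i in range(8)]
    records.append({"x": None, "y": 0.0})  # a corrupt record to cleanse
    return records


def partition(records, num_partitions):
    """Step 1: split the records into one chunk per worker."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def cleanse(chunk):
    """Step 2: drop records with a missing feature (runs once per chunk)."""
    return [r for r in chunk if r["x"] is not None]


def normalize(records):
    """Step 2: min-max scale the feature x to the range [0, 1]."""
    xs = [r["x"] for r in records]
    lo, hi = min(xs), max(xs)
    span = (hi - lo) or 1.0  # avoid dividing by zero on constant data
    return [{"x": (r["x"] - lo) / span, "y": r["y"]} for r in records]


def fit_line(records):
    """Step 3: stand-in for training; least-squares fit of y = a*x + b."""
    mx = mean(r["x"] for r in records)
    my = mean(r["y"] for r in records)
    var = sum((r["x"] - mx) ** 2 for r in records)
    cov = sum((r["x"] - mx) * (r["y"] - my) for r in records)
    slope = cov / var
    return slope, my - slope * mx


def predict(model, x):
    """Steps 4 and 5: stand-in for a request routed to the model."""
    slope, intercept = model
    return slope * x + intercept


def run_pipeline(num_partitions=3):
    metrics_topic = []  # stand-in for a Kafka topic
    chunks = partition(load_records(), num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        cleaned_chunks = list(pool.map(cleanse, chunks))
    cleaned = [r for chunk in cleaned_chunks for r in chunk]
    # The min and max are global statistics, so scaling happens after
    # the per-chunk results have been gathered.
    training_set = normalize(cleaned)
    model = fit_line(training_set)
    prediction = predict(model, 0.5)
    # Step 6: publish run-time metrics as a JSON message.
    metrics_topic.append(
        json.dumps({"records": len(training_set), "prediction": prediction})
    )
    return model, prediction, metrics_topic


if __name__ == "__main__":
    model, prediction, metrics = run_pipeline()
    print(model, prediction, metrics)
```

In a real deployment each stand-in would be replaced by the corresponding service, but the order of the steps and the hand-off points between them stay the same.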

Key services

PyTorch is an optimized tensor library for deep learning using GPUs and CPUs. Its tensor API will feel familiar to users of NumPy and scikit-learn, and it supports the training, evaluation and commercialization of complex machine learning models.

Apache Spark is an open source cluster computing framework for fast, large-scale data processing, in batch or in near real time through its streaming library.

It supports Scala, Java, Python and R programming languages and includes streaming, graph and machine learning libraries.

Apache Kafka is an open-source distributed event streaming framework for large-scale, real-time data processing and analytics.

It captures data from various sources in real-time as a continuous flow and routes it to the appropriate processor.

Ray Tune is a distributed hyperparameter tuning framework particularly suitable for deep learning models. It significantly reduces the cost of optimizing the configuration of a model. It can also act as a wrapper around other open source optimization libraries.
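
The idea behind hyperparameter tuning can be illustrated without any framework. The sketch below is a plain random search written with the standard library only; it is not Ray Tune's API. The function `validation_loss` is a hypothetical stand-in for training a model and measuring its validation loss, and the search space (learning rate, number of hidden units) is an assumption chosen for illustration. A framework such as Ray Tune runs this kind of trial loop in parallel across a cluster and can stop unpromising trials early.

```python
import random


def validation_loss(learning_rate, hidden_units):
    """Hypothetical stand-in for training a model and scoring it.

    The loss is lowest near learning_rate=0.01 and hidden_units=64.
    """
    return (learning_rate - 0.01) ** 2 * 1e4 + (hidden_units - 64) ** 2 / 1e3


def random_search(num_trials, seed=0):
    """Sample configurations at random and keep the best one."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    best = None
    for _ in range(num_trials):
        config = {
            # Log-uniform sampling between 1e-4 and 1e-1
            "learning_rate": 10 ** rng.uniform(-4, -1),
            "hidden_units": rng.choice([16, 32, 64, 128]),
        }
        loss = validation_loss(**config)
        if best is None or loss < best["loss"]:
            best = {"config": config, "loss": loss}
    return best


if __name__ == "__main__":
    print(random_search(num_trials=50))
```

Each trial here is independent of the others, which is why this workload distributes well across many workers.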


