Monday, June 21, 2021

Open Source Lambda architecture for deep learning

Target audience: Beginner
Estimated reading time: 15'


The world of data scientists accustomed to Python scientific libraries has been shaken up by the emergence of 'big data' frameworks such as Apache Hadoop, Spark and Kafka. This post introduces a variant of the Lambda architecture and describes, very briefly, the integration of various open source components. It is a high-level overview of the key services of a typical architecture.

Core data flow

The concept and architecture are versatile enough to accommodate a variety of open source and commercial solutions and services besides the frameworks described in this post. The open source framework PyTorch is used to illustrate the integration of big data frameworks such as Apache Kafka and Spark with a deep learning library to train, validate and test deep learning models.

Alternative libraries such as Keras or TensorFlow could also be used.

Let's consider the use case of training and validating a deep learning model, using Apache Spark to load, parallelize and preprocess the data. Apache Spark takes advantage of a large number of servers and CPU cores.



In this simple design, the workflow is broken down into 6 steps:
  1. Apache Spark loads, then parallelizes, the training data from AWS S3
  2. Spark distributes the data pre-processing, cleansing and normalization across multiple worker nodes
  3. Spark forwards the processed data to the PyTorch cluster
  4. Flask converts incoming requests into prediction queries to the PyTorch model
  5. The PyTorch model generates a prediction
  6. Run-time metrics are broadcast through Kafka
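
Steps 1 and 2 above could look roughly like the following in PySpark. This is a minimal sketch, not the implementation behind this post: the Parquet format, the s3a:// path, the column name, the partition count and the choice of min-max scaling are all assumptions made for illustration. The plain Python helper states the normalization rule, and the Spark column expression mirrors it so that the work is distributed across the worker nodes.

```python
def min_max_scale(value: float, lo: float, hi: float) -> float:
    """Reference rule: rescale value to [0, 1]; a constant column maps to 0.0."""
    if hi == lo:
        return 0.0
    return (value - lo) / (hi - lo)


def preprocess(path: str, column: str, num_partitions: int = 8):
    """Load, clean and normalize one numeric column with Spark.

    `path` (e.g. "s3a://my-bucket/training/") and `column` are hypothetical
    inputs; the data is assumed to be stored as Parquet.
    """
    # PySpark is imported lazily so min_max_scale can be used without it.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("preprocess-sketch").getOrCreate()

    # Step 1: load the training data and spread it across partitions.
    df = spark.read.parquet(path).repartition(num_partitions)

    # Step 2a: cleansing, here simply dropping rows with a missing value.
    df = df.dropna(subset=[column])

    # Step 2b: normalization, using the same rule as min_max_scale.
    lo, hi = df.agg(F.min(column), F.max(column)).first()
    if hi == lo:
        scaled = F.lit(0.0)
    else:
        scaled = (F.col(column) - lo) / (hi - lo)
    return df.withColumn(column, scaled)
```

Reading from S3 through an s3a:// path also assumes that the Hadoop AWS connector and credentials are configured on the cluster.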


Key services


PyTorch is an optimized tensor library for deep learning using GPUs and CPUs. It offers a NumPy-like tensor API with GPU acceleration and automatic differentiation to support the training, evaluation and deployment of complex machine learning models.



Apache Spark is an open source cluster computing framework for fast, large-scale batch and near real-time data processing.
It supports Scala, Java, Python and R programming languages and includes streaming, graph and machine learning libraries.


Apache Kafka is an open-source distributed event streaming framework for large-scale, real-time data processing and analytics.
It captures data from various sources in real-time as a continuous flow and routes it to the appropriate processor. 
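
As an illustration of the last step of the workflow (broadcasting run-time metrics), here is a minimal sketch using the kafka-python client. The topic name, the broker address and the JSON layout of a metric record are assumptions chosen for this example, not something prescribed by the architecture.

```python
import json
import time
from typing import Dict, Optional


def encode_metric(name: str, value: float, timestamp: Optional[float] = None) -> bytes:
    """Serialize one run-time metric as UTF-8 encoded JSON."""
    record = {
        "name": name,
        "value": float(value),
        "timestamp": time.time() if timestamp is None else timestamp,
    }
    return json.dumps(record, sort_keys=True).encode("utf-8")


def publish_metrics(
    metrics: Dict[str, float],
    topic: str = "model-metrics",                # hypothetical topic name
    bootstrap_servers: str = "localhost:9092",   # hypothetical broker address
) -> None:
    """Send each metric as a separate message to a Kafka topic."""
    # kafka-python is imported lazily so encode_metric works without it.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers=bootstrap_servers)
    for name, value in metrics.items():
        producer.send(topic, value=encode_metric(name, value))
    producer.flush()   # block until all buffered messages are sent
    producer.close()
```

A call such as publish_metrics({"latency_ms": 12.5, "accuracy": 0.93}) would then emit two messages that any downstream consumer can decode with json.loads.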



Ray Tune is a distributed hyper-parameter tuning framework particularly suitable for deep learning models. It significantly reduces the cost of optimizing the configuration of a model. It is a wrapper around other open source libraries.



Apache Hive is an open source data warehouse platform that facilitates reading, writing, and managing large datasets residing in distributed storage such as Hadoop HDFS. Its tables can also be queried from Apache Spark.


Flask is a Python-based web micro-framework well suited to building REST services. Its minimalist approach to web interfaces makes it a very intuitive tool for building micro-services.
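
The sketch below shows how a Flask micro-service might turn an HTTP request into a prediction query for a PyTorch model. The /predict route, the {"features": [...]} request body and the single-output model are assumptions made for this example; they are not taken from the architecture described here.

```python
from typing import Any, Callable, List, Sequence


def parse_features(payload: Any) -> List[float]:
    """Validate the JSON body and return the feature vector as floats."""
    if not isinstance(payload, dict):
        raise ValueError("request body must be a JSON object")
    features = payload.get("features")
    if not isinstance(features, list) or not features:
        raise ValueError("'features' must be a non-empty list of numbers")
    try:
        return [float(x) for x in features]
    except (TypeError, ValueError):
        raise ValueError("'features' must contain only numbers") from None


def torch_predictor(model) -> Callable[[Sequence[float]], float]:
    """Wrap a trained PyTorch model (assumed to return one value per input)."""
    import torch  # lazy import: only needed when a real model is served

    model.eval()

    def predict(features: Sequence[float]) -> float:
        with torch.no_grad():
            x = torch.tensor([list(features)], dtype=torch.float32)
            return model(x).squeeze().item()

    return predict


def create_app(predict: Callable[[Sequence[float]], float]):
    """Build a Flask app exposing POST /predict (hypothetical route)."""
    from flask import Flask, jsonify, request  # lazy import, see above

    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict_route():
        try:
            features = parse_features(request.get_json(silent=True))
        except ValueError as err:
            return jsonify({"error": str(err)}), 400
        return jsonify({"prediction": predict(features)})

    return app
```

Passing the prediction function into create_app keeps the web layer independent of the model, so the same service can be exercised with a stub such as create_app(lambda features: sum(features)) before a trained model is available.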


Amazon Simple Storage Service (S3) is a highly available, secure and scalable object storage service designed for very high durability (11 nines, i.e. 99.999999999%), with support for versioning. It is versatile enough to accommodate any kind of data format.



This informational post introduced the high-level components of a Lambda architecture. Such an orchestration of services is the foundation of the iterative approach to machine learning modeling known as MLOps. MLOps will be discussed in a future post.

