Saturday, November 29, 2014

Apache Spark/MLlib for K-means

This page illustrates the Apache Spark MLlib library with the plain-vanilla K-means clustering (unsupervised) algorithm.

Overview
Apache Spark attempts to address the limitations of Hadoop in terms of performance and real-time processing by implementing in-memory iterative computing, which is critical to most machine learning algorithms that are trained iteratively. Numerous benchmark tests have been performed and published to evaluate the performance improvement of Spark relative to Hadoop. In the case of iterative algorithms, the time per iteration can be reduced by a factor of 10 or more.
The core element of Spark is the Resilient Distributed Dataset (RDD), a collection of elements partitioned across the nodes of a cluster and/or the CPU cores of servers. An RDD can be created from local data structures such as lists, arrays or hash tables, from the local file system or from the Hadoop distributed file system (HDFS).

Note: The code presented in this post uses Apache Spark version 1.3.1. There is no guarantee that the implementation of K-means in this post will be compatible with future versions of Apache Spark.

Apache Spark RDDs
The operations on an RDD in Spark are very similar to the Scala higher-order methods. These operations are performed concurrently over each partition. Operations on an RDD can be classified as:
* Transformation: convert, manipulate and filter the elements of an RDD on each partition
* Action: aggregate, collect or reduce the elements of the RDD from all partitions
An RDD can be persisted, serialized and cached for future computation. Spark provides a large array of pre-built transformations and actions which go well beyond the basic map-reduce paradigm. These methods on RDDs are a natural extension of the Scala collections, making code migration seamless for Scala developers.
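
The distinction between the two kinds of operations can be sketched as follows. This is a minimal illustration, assuming Spark 1.3.x is on the classpath and a local master is acceptable; the object RddOps, the helper sumOfEvenSquares and the sample data are hypothetical names introduced for this example only. filter and map are transformations and are evaluated lazily, while fold is an action that triggers the actual computation.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddOps {
  // Sum of the squares of the even values in xs, computed on an RDD.
  def sumOfEvenSquares(xs: Seq[Int])(implicit sc: SparkContext): Int = {
    val rdd = sc.parallelize(xs, 4)        // distribute over 4 partitions
    val squares = rdd.filter(_ % 2 == 0)   // transformation (lazy)
                     .map(x => x * x)      // transformation (lazy)
    squares.fold(0)(_ + _)                 // action: triggers the job
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("RddOps")
    implicit val sc = new SparkContext(conf)
    try {
      val xs = 1 to 10
      // Same pipeline on a local Scala collection, for comparison
      val local = xs.filter(_ % 2 == 0).map(x => x * x).sum
      println(s"RDD: ${sumOfEvenSquares(xs)}, local collection: $local")
    } finally sc.stop()
  }
}
```

The RDD pipeline and the local collection pipeline use the same method names, which is the similarity with Scala collections mentioned above.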

Apache Spark supports fault-tolerant operations by allowing RDDs to persist both in memory and in the file system. Persistence enables automatic recovery from node failures. The resiliency of Spark relies on the supervisory strategy of the underlying Akka actors, the persistence of their mailboxes and the replication schemes of HDFS.
Spark is initialized through its context. For instance, a local Spark deployment on 8 cores, with 2 Gbytes allocated for data processing (RDDs) using the memory-only storage level and 512 Mbytes for the master process, is defined by creating a Spark configuration instance of type SparkConf:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sparkConf = new SparkConf()
            .setMaster("local[8]")                  // local deployment on 8 cores
            .setAppName("SparkKMeans")
            .set("spark.executor.memory", "2048m")  // 2 Gbytes for data processing
            .set("spark.storageLevel", "MEMORY_ONLY")
            .set("spark.driver.memory", "512M")     // 512 Mbytes for the master process
            .set("spark.default.parallelism", "16")

implicit val sc = new SparkContext(sparkConf)

Apache Spark MLlib
MLlib is a scalable machine learning library built on top of Spark. As of version 1.0, the library is a work in progress. The main components of the library are:
  • Classification algorithms, including logistic regression, Naïve Bayes and support vector machines
  • Clustering, limited to K-means in version 1.0
  • L1 & L2 regularization
  • Optimization techniques such as gradient descent, logistic gradient, stochastic gradient descent and L-BFGS
  • Linear algebra such as Singular Value Decomposition
  • Data generator for K-means, logistic regression and support vector machines.
The machine learning byte code is conveniently included in the Spark assembly jar file built with the Simple Build Tool (sbt).


Let's consider the K-means clustering components bundled with Apache Spark MLlib. The K-means configuration parameters are:
  • K: Number of clusters
  • maxNumIters: Maximum number of iterations for minimizing the reconstruction error
  • numRuns: Number of runs (episodes) used for training the clusters
  • caching: Specifies whether the resulting RDD has to be cached in memory
  • xt: The array of data points, each of type Array[Double]
  • sc: Implicit Spark context
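
To make the role of K, maxNumIters and the reconstruction error concrete, here is a plain-Scala sketch of the basic K-means iteration (Lloyd's algorithm) that runs without Spark. It is not the MLlib implementation: the object LloydSketch and its methods are hypothetical names, the seeds are naively taken as the first K points, and numRuns is not modeled (it would amount to repeating train from different seeds and keeping the centroids with the lowest error).

```scala
object LloydSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the centroid closest to p
  def closest(p: Point, centroids: Array[Point]): Int =
    centroids.indices.minBy(i => sqDist(p, centroids(i)))

  // Reconstruction error: total squared distance to the assigned centroid
  def error(xt: Array[Point], centroids: Array[Point]): Double =
    xt.map(p => sqDist(p, centroids(closest(p, centroids)))).sum

  // Runs at most maxNumIters iterations, using the first K points as seeds
  // (a naive initialization chosen for brevity)
  def train(xt: Array[Point], K: Int, maxNumIters: Int): Array[Point] = {
    require(K > 0 && K <= xt.length, "K must be in [1, number of points]")
    var centroids: Array[Point] = xt.take(K).map(_.clone)
    var iter = 0
    var converged = false
    while (iter < maxNumIters && !converged) {
      // Assignment step: group the points by nearest centroid
      val clusters = xt.groupBy(closest(_, centroids))
      // Update step: each centroid moves to the mean of its cluster;
      // a centroid with no assigned point is left where it is
      val updated = centroids.indices.map { i =>
        clusters.get(i) match {
          case Some(pts) => pts.transpose.map(_.sum / pts.length)
          case None      => centroids(i)
        }
      }.toArray
      converged = updated.zip(centroids).forall { case (u, c) => sqDist(u, c) < 1e-12 }
      centroids = updated
      iter += 1
    }
    centroids
  }
}
```

Each pass of the loop can only lower or maintain the value returned by error, which is why maxNumIters acts as an upper bound rather than a fixed iteration count.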

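import scala.util.Try
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.storage.StorageLevel

class SparkKMeans(
    K: Int,
    maxNumIters: Int,
    numRuns: Int,
    caching: Boolean,
    xt: Array[Array[Double]])(implicit sc: SparkContext) {

  // Trains the model; any failure raised by Spark is captured in the Try
  def train: Try[KMeansModel] = Try {
    val kmeans = new KMeans()
      .setK(K)
      .setMaxIterations(maxNumIters)
      .setRuns(numRuns)  // available in Spark 1.3.1

    // Convert each observation into an MLlib dense vector, then distribute
    val rdd = sc.parallelize(xt.map(x => Vectors.dense(x)))
    if (caching)
      rdd.persist(StorageLevel.MEMORY_ONLY)  // same storage level as rdd.cache
    kmeans.run(rdd)
  }
}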

The clustering model is created by the train method. Once the Spark/MLlib KMeans instance is created and initialized (setK, setMaxIterations, setRuns), each observation of the input data set xt is converted into a dense vector and the resulting array is converted into an RDD (sc.parallelize). Finally, the input RDD is fed to the K-means algorithm (kmeans.run).
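
Once a KMeansModel is available, it can be queried for the centroids, the cluster of a new observation and the clustering cost. The following is a hedged sketch, assuming Spark 1.3.x MLlib on the classpath and a local master; the object KMeansUsage, the helper fit and the toy data set are hypothetical, and the model is trained through the MLlib convenience method KMeans.train rather than the SparkKMeans class of this post, so that the example is self-contained.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

object KMeansUsage {
  // Trains a model with k clusters and at most maxIters iterations
  def fit(xt: Array[Array[Double]], k: Int, maxIters: Int)
         (implicit sc: SparkContext): KMeansModel = {
    val rdd = sc.parallelize(xt.map(x => Vectors.dense(x))).cache()
    KMeans.train(rdd, k, maxIters)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("KMeansUsage")
    implicit val sc = new SparkContext(conf)
    try {
      // Toy data set: two well-separated groups of 2-dimensional points
      val xt = Array(
        Array(0.0, 0.0), Array(0.0, 1.0), Array(1.0, 0.0),
        Array(10.0, 10.0), Array(10.0, 11.0), Array(11.0, 10.0))
      val model = fit(xt, 2, 20)

      model.clusterCenters.foreach(c => println(s"centroid: $c"))
      // Index of the cluster assigned to a new observation
      println(s"cluster of (0.5, 0.5): ${model.predict(Vectors.dense(0.5, 0.5))}")
      // Sum of squared distances of the points to their nearest centroid
      val cost = model.computeCost(sc.parallelize(xt.map(x => Vectors.dense(x))))
      println(s"cost: $cost")
    } finally sc.stop()
  }
}
```

Cluster indices depend on the random initialization, so only the grouping of the points, not the index values, should be expected to be stable across runs.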
