Sunday, December 07, 2014

Udacity course on Apache Storm

Update on 5th April, 2015: After a fair bit of time here, I have moved to a GitHub-hosted Octopress blog. Please find me there henceforth for all new updates.

I am a huge fan of Apache Storm for its simplicity and ease of use, and even more for the uncomplicated way it solves Big Data problems. I also gave a session on it at Fifth Elephant 2013.
On Big Data projects I always try to use Storm whenever we deal with real-time streaming use cases. Storm is a well-designed tool for real-time stream processing, which is why it is dubbed the Hadoop of real-time. I have open sourced many projects on my GitHub account that use Storm as the processing engine.

Udacity is one of the three wonderful MOOC providers we have right now, alongside Coursera and edX. Udacity already has a course on Apache Hadoop titled "Intro to Hadoop and MapReduce", created in collaboration with Cloudera. I took this course last year, though I did not opt for the [paid] verified certificate since I am already a Cloudera Certified Developer for Apache Hadoop [CCDH]. You can find my solutions to all the assignments of this course on my GitHub account.

Now Udacity has started a new course on Storm as part of their Data Science catalog, this time in partnership with Twitter. In case you are not aware, Storm was open sourced at Twitter, and they are one of its power users. Hence it makes great sense to have them teach and talk about Storm. The syllabus looks really interesting too.

Lesson 1
Join instructor Karthik Ramasamy and the first Udacity-Twitter Storm Hackathon to cover the motivation and practice of real-time, distributed, fault-tolerant data processing. Dive into basic Storm Topologies by linking to a real-time d3 Word Cloud Visualization using Redis, Flask, and d3.
Lesson 2
Explore Storm basics by programming Bolts, linking Spouts, and finally connecting to the live Twitter API to process real-time tweets. Explore open source components by connecting a Rolling Count Bolt to your topology to visualize Rolling Top Tweeted Words.
Lesson 3
Go beyond Storm basics by exploring multi-language capabilities to download and parse real-time Tweeted URLs in Python using Beautiful Soup. Integrate complex open source bolts to calculate Top-N words to visualize real-time Top-N Hashtags. Finally, use stream grouping concepts to easily create a streaming join to connect and dynamically process multiple streams.
Lesson 4
Work on your final project while we cover additional questions and topics brought up by Hackathon participants. Explore Vagrant, VirtualBox, Redis, Flask, and d3 further if you are interested!
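
To give a feel for the spout, bolt and rolling Top-N ideas named in Lessons 2 and 3, here is a minimal sketch in plain Python. It does not use the Storm API and is not taken from the course material: `sentence_spout`, `split_bolt`, `RollingTopN` and `run_topology` are hypothetical names made up for this illustration, and the window is counted in words rather than in time, which is a simplifying assumption.

```python
from collections import Counter, deque


def sentence_spout(sentences):
    """Stand-in for a spout: emits one sentence at a time."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    """Stand-in for a bolt: splits each sentence into lowercase words."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


class RollingTopN:
    """Stand-in for a rolling-count bolt over the last `window` words."""

    def __init__(self, window, n):
        self.window = window
        self.n = n
        self.recent = deque()    # words currently inside the window
        self.counts = Counter()  # word -> occurrences inside the window

    def process(self, word):
        self.recent.append(word)
        self.counts[word] += 1
        # Evict the oldest word once the window is exceeded.
        if len(self.recent) > self.window:
            old = self.recent.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def top(self):
        # Highest count first; ties broken alphabetically for stable output.
        ranked = sorted(self.counts.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked[: self.n]


def run_topology(sentences, window=100, n=3):
    """Wire spout -> split bolt -> rolling Top-N, then report the result."""
    counter = RollingTopN(window, n)
    for word in split_bolt(sentence_spout(sentences)):
        counter.process(word)
    return counter.top()


if __name__ == "__main__":
    print(run_topology(["storm is fast", "storm is fun", "hadoop is batch"]))
```

In a real Storm topology the spout and bolts would run as parallel tasks across a cluster, and a stream grouping would decide which bolt task receives each word; the sketch above only chains them in a single process.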

I am really excited about this course. I am already halfway through the first lesson, and it has been a pretty pleasant experience so far.

Unfortunately, this course does not cover a few important Storm topics such as acking, clusters and Trident. That's a dampener on otherwise pretty great content. Had they included these topics, this would have been a really wonderful course and the go-to Storm course on the internet. But alas!

Anyway, I plan to complete the course by the end of this month and, if the course guidelines permit, I will put my solutions to the assignments up on my GitHub account.

If you are interested in solving real-time Big Data use cases, or your work involves them, you might want to check out this course:

Happy learning! And all the very best too.
