Sunday, December 07, 2014

Udacity course on Apache Storm

Update on 05th April, 2015: After a fair bit of time here, I have moved on to GitHub-hosted Octopress blogs. Please find me there henceforth for all new updates.

I am a huge fan of Apache Storm for its simplicity and ease of use, and even more for the uncomplicated way it solves Big Data problems. I also gave a session on it at Fifth Elephant 2013.
On Big Data projects, I try to use Storm whenever we deal with real-time streaming use cases. Storm is a well-designed tool for real-time stream processing, which is why it is dubbed the Hadoop of real-time. I have open-sourced many projects on my GitHub account that use Storm as the processing engine.

Udacity is one of the three wonderful MOOC platforms we have right now, along with Coursera and edX. Udacity already has a course on Apache Hadoop titled "Intro to Hadoop and MapReduce", which it created in collaboration with Cloudera. I took this course last year, though I did not opt for the [paid] verified certificate, since I am already a Cloudera Certified Developer for Apache Hadoop [CCDH]. You can also find the solutions to all the assignments of this course on my GitHub account.

Now Udacity has started a new course on Storm as part of their Data Science catalog. This course is in partnership with Twitter. In case you are not aware, Storm was open-sourced by Twitter, and they are one of its power users. Hence it makes great sense to have them teach and talk about Storm. The syllabus also looks really interesting.

Lesson 1
Join instructor Karthik Ramasamy and the first Udacity-Twitter Storm Hackathon to cover the motivation and practice of real-time, distributed, fault-tolerant data processing. Dive into basic Storm Topologies by linking to a real-time d3 Word Cloud Visualization using Redis, Flask, and d3.
Lesson 2
Explore Storm basics by programming Bolts, linking Spouts, and finally connecting to the live Twitter API to process real-time tweets. Explore open source components by connecting a Rolling Count Bolt to your topology to visualize Rolling Top Tweeted Words.
Lesson 3
Go beyond Storm basics by exploring multi-language capabilities to download and parse real-time Tweeted URLs in Python using Beautiful Soup. Integrate complex open source bolts to calculate Top-N words to visualize real-time Top-N Hashtags. Finally, use stream grouping concepts to easily create streaming join to connect and dynamically process multiple streams.
Lesson 4
Work on your final project and we cover additional questions and topics brought up by Hackathon participants. Explore Vagrant, VirtualBox, Redis, Flask, and d3 further if you are interested!

I am really excited about this course. I am already halfway through the first lesson, and it has been a pretty pleasant experience so far.

Unfortunately, this course does not cover a few important Storm topics such as Ack, Cluster and Trident. That is a dampener on otherwise pretty great content. Needless to say, had they included these topics, this would have been the go-to Storm course on the internet. But alas!

Anyway, I plan to complete the course by the end of this month and, if the course guidelines permit, I will put the solutions to the assignments up on my GitHub account.

If you are interested in, or your work involves, solving real-time Big Data use cases, you might want to check out this course:

Happy learning! And all the very best too.

Saturday, October 18, 2014

Download Java / JDK / JRE from shell / terminal / command prompt

It's been quite some time since I blogged anything. I got caught up in too much work on both the personal and professional fronts. I will try to be regular from now on, hopefully [fingers crossed!!].

Most of my work happens on EC2 and on Linux, as our Hadoop environment is on EC2. I absolutely adore Linux and the shell. The first thing I have to do, being a Java developer, is download the Oracle JDK onto the Linux machines on EC2. Downloading the Oracle JDK from the Oracle website is difficult because of Oracle's mandatory license check, which you need to accept before downloading. Since my Linux machines are server-only [i.e. without a desktop or GUI], I could not use a browser to accept the license and download the JDK from the Oracle website. So I came up with this small shell script [extending an answer from Stack Overflow] to download the JDK from the Oracle website from the command prompt.

Depending on the OS and platform of the JDK version you intend to download, just modify the array on line 20 of the following script and you can trigger the download directly from the shell.
I hope this script will be helpful to all those who live on and love the command prompt.

Wednesday, January 29, 2014

Learning Guava -- Calibrating time using Stopwatch

Many of our day-to-day applications need to measure the time taken between two points. In the Java world we depend on either System.currentTimeMillis() or System.nanoTime(). The pain here is that we have to do the computations ourselves to convert the result into a granularity that makes the time taken easy to understand. Wouldn't it be great to have a utility class that gives the required information in the granularity we need with a minimum of boilerplate code?
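To make that boilerplate concrete, here is a minimal sketch using only the JDK. The class name ManualTiming and the method timeIt are hypothetical names chosen for illustration; the point is that the start timestamp, the subtraction and the unit conversion all have to be written by hand.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, for illustration only: hand-rolled timing with the JDK.
public final class ManualTiming {
    private ManualTiming() {}

    /** Runs the task and returns the elapsed time converted to the requested unit. */
    public static long timeIt(Runnable task, TimeUnit unit) {
        long start = System.nanoTime();               // remember the starting point
        task.run();
        long elapsedNanos = System.nanoTime() - start; // raw duration in nanoseconds
        // Manual conversion step; the result is truncated to whole units.
        return unit.convert(elapsedNanos, TimeUnit.NANOSECONDS);
    }

    public static void main(String[] args) {
        long micros = timeIt(() -> {
            // Stand-in for real work.
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            if (sum < 0) {
                throw new IllegalStateException("unexpected overflow");
            }
        }, TimeUnit.MICROSECONDS);
        System.out.println("Elapsed: " + micros + " microseconds");
    }
}
```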

Stopwatch is one such small and wonderful utility class in Guava. It helps in measuring the elapsed time between any two points in the logic. The advantage of using Guava's Stopwatch is that you can get the elapsed time in any unit, from nanoseconds to days. This is possible because you pass a TimeUnit enum value to get the elapsed time in the desired granularity.

Code snippet for the usage of the Stopwatch class:

One caveat when using Stopwatch: you should not start a Stopwatch that is already started. Check whether the Stopwatch is running by invoking the isRunning() method.
The Stopwatch documentation says the following on this:
Stopwatch methods are not idempotent; it is an error to start or stop a stopwatch that is already in the desired state.
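
Putting the usage and the caveat together, a minimal sketch could look like the following. It assumes Guava 15 or later on the classpath (for the createStarted() factory). The class name StopwatchSketch and the helper startIfStopped are hypothetical names used only for illustration.

```java
import com.google.common.base.Stopwatch;
import java.util.concurrent.TimeUnit;

// Hypothetical example class; assumes Guava 15+ is on the classpath.
public final class StopwatchSketch {
    private StopwatchSketch() {}

    /**
     * Starts the stopwatch only if it is not already running.
     * Returns true if this call started it, false if it was already running.
     */
    public static boolean startIfStopped(Stopwatch stopwatch) {
        if (stopwatch.isRunning()) {
            return false; // calling start() here would throw IllegalStateException
        }
        stopwatch.start();
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        Stopwatch stopwatch = Stopwatch.createStarted();
        Thread.sleep(25); // stand-in for real work
        stopwatch.stop();

        // The same measurement, read in whichever granularity is needed.
        System.out.println("nanos:   " + stopwatch.elapsed(TimeUnit.NANOSECONDS));
        System.out.println("millis:  " + stopwatch.elapsed(TimeUnit.MILLISECONDS));
        // toString() picks a human-readable unit on its own.
        System.out.println("as text: " + stopwatch);
    }
}
```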

Also, I once got burned by the StopWatch class of Apache Commons Lang. I was working in an IDE on a Maven project, and the class got auto-imported into the code, so I did not notice the difference between Guava's Stopwatch and Commons Lang's StopWatch. I then spent some 20 minutes checking my classpath, IDE setup, etc. Yes, how stupid of me, right? So please be careful to choose the correct class.

Sunday, January 26, 2014

Learning Guava -- Load properties file using Guava

Guava code snippet for loading a properties file from classpath.
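
A minimal sketch of one way to do this, assuming Guava 15 or later on the classpath (for ByteSource.openBufferedStream()). The class name PropertiesLoader and the resource name app.properties are hypothetical; substitute the name of a file that actually exists on your classpath.

```java
import com.google.common.io.Resources;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.Properties;

// Hypothetical helper class; assumes Guava 15+ is on the classpath.
public final class PropertiesLoader {
    private PropertiesLoader() {}

    /** Loads a properties file by its classpath resource name, e.g. "app.properties". */
    public static Properties load(String resourceName) throws IOException {
        // getResource throws IllegalArgumentException if the resource is not found.
        URL url = Resources.getResource(resourceName);
        return load(url);
    }

    /** Loads a properties file from an already resolved URL. */
    public static Properties load(URL url) throws IOException {
        Properties properties = new Properties();
        // try-with-resources closes the stream even if load() fails.
        try (InputStream in = Resources.asByteSource(url).openBufferedStream()) {
            properties.load(in);
        }
        return properties;
    }
}
```

A call such as PropertiesLoader.load("app.properties") would then return the populated Properties object.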

For more info, please check the Resources class of Guava.
