Sunday, December 07, 2014

Udacity course on Apache Storm

I am a huge fan of Apache Storm for its simplicity and ease of use, and even more for the uncomplicated way it solves Big Data problems. I also gave a session on it at Fifth Elephant, 2013.
For Big Data projects, I try to use Storm whenever we deal with real-time streaming use cases. Storm is a well-designed tool for real-time stream processing, which is why it is dubbed the Hadoop of real-time. I have open sourced many projects on my GitHub account which use Storm as the processing engine.

Udacity is one of the 3 wonderful MOOC platforms we have right now, the others being Coursera and edX. Udacity already has a course on Apache Hadoop titled "Intro to Hadoop and MapReduce", created in collaboration with Cloudera. I took this course last year, though I did not opt for the [paid] verified certificate, since I am already a Cloudera Certified Developer for Apache Hadoop [CCDH]. You can find the solutions to all the assignments of this course on my GitHub account.

Now Udacity has started a new course on Storm as part of their Data Science catalog, this time in partnership with Twitter. In case you are not aware, Storm was open sourced at Twitter, and they are one of its power users. Hence it makes great sense to have them teach and talk about Storm. The syllabus looks really interesting too.

Lesson 1
Join instructor Karthik Ramasamy and the first Udacity-Twitter Storm Hackathon to cover the motivation and practice of real-time, distributed, fault-tolerant data processing. Dive into basic Storm Topologies by linking to a real-time d3 Word Cloud Visualization using Redis, Flask, and d3.
Lesson 2
Explore Storm basics by programming Bolts, linking Spouts, and finally connecting to the live Twitter API to process real-time tweets. Explore open source components by connecting a Rolling Count Bolt to your topology to visualize Rolling Top Tweeted Words.
Lesson 3
Go beyond Storm basics by exploring multi-language capabilities to download and parse real-time Tweeted URLs in Python using Beautiful Soup. Integrate complex open source bolts to calculate Top-N words to visualize real-time Top-N Hashtags. Finally, use stream grouping concepts to easily create streaming join to connect and dynamically process multiple streams.
Lesson 4
Work on your final project and we cover additional questions and topics brought up by Hackathon participants. Explore Vagrant, VirtualBox, Redis, Flask, and d3 further if you are interested!

I am really excited about this course. I am already halfway through the first lesson, and it has been a pretty pleasant experience so far.


Unfortunately, this course does not cover a few important Storm topics such as Ack, Cluster and Trident. That's a dampener on otherwise pretty great content. Had they included these topics, this would have been a really wonderful course and the go-to Storm course on the internet. But alas!

Anyway, I am planning to complete the course by the end of this month and, if the course guidelines permit, I will put my solutions to the assignments up on my GitHub account.

If you are interested in, or your work involves, solving real-time Big Data use cases, you might want to check out this course: https://www.udacity.com/course/ud381.


Happy learning! And all the very best too.

Saturday, October 18, 2014

Download Oracle JDK from command prompt

It's been quite some time since I blogged anything. I got caught up in too much work on both the personal and professional fronts. I will try to be regular henceforth though, hopefully [fingers crossed!!].

Most of my work happens on EC2 and on Linux, as our Hadoop env is on EC2. I absolutely adore Linux and the shell. Being a Java developer, the first and foremost thing I have to do is download the JDK onto the Linux machines on EC2. Downloading the Oracle JDK from the Oracle website is difficult because of the mandatory license check, which you need to accept before the download starts. With my Linux env being server-only machines [i.e. without a desktop], there is no browser in which to accept the license, so I could not download the JDK directly from the Oracle website. So I came up with this small shell script [extending an answer from Stack Overflow] to download the JDK from the Oracle website from the command prompt.

Depending on the OS and platform of the JDK version you intend to download, just modify the array on line#18 in the following script, and you can trigger the download directly from the shell.
I hope this script will be helpful for all those who live on and love the command prompt.

Wednesday, January 29, 2014

Learning Guava -- Calibrating time using Stopwatch

Many of our day-to-day applications need to measure the time taken between 2 points. In the Java world we depend on either System.currentTimeMillis() or System.nanoTime(). But the pain here is that we have to do the computations ourselves to convert the result to a granularity in which the time taken is easy to understand. Wouldn't it be great to have a utility class which gives the required information in the granularity we need with a minimum amount of boilerplate code?
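
To make that boilerplate concrete, here is a minimal JDK-only sketch of the manual approach. The class name ElapsedTimeDemo and its method names are made up for illustration; only System.nanoTime() and TimeUnit come from the JDK.

```java
import java.util.concurrent.TimeUnit;

class ElapsedTimeDemo {

    // Manual conversion: the caller has to remember the right divisor.
    static long nanosToMillisByHand(long nanos) {
        return nanos / 1_000_000L;
    }

    // Runs the task and returns how long it took, in nanoseconds.
    static long timeNanos(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long elapsedNanos = timeNanos(() -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
        });
        // Two ways of getting to milliseconds: by hand, or via TimeUnit.
        System.out.println("by hand:  " + nanosToMillisByHand(elapsedNanos) + " ms");
        System.out.println("TimeUnit: " + TimeUnit.NANOSECONDS.toMillis(elapsedNanos) + " ms");
    }
}
```

Even in this small case the start/stop bookkeeping and the unit conversion are left to the caller, which is the part a stopwatch-style utility takes over.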

Stopwatch is one such small and wonderful utility class in Guava, which helps in measuring the elapsed time / duration between any 2 points in the logic. The advantage of using Guava's Stopwatch is that you can get the elapsed time in any unit, right from nanoseconds to days. This is possible because you can pass a TimeUnit enum value to get the elapsed time in the desired granularity.

Code snippet for the usage of the Stopwatch class:

A few caveats for using Stopwatch: you should not start an already started Stopwatch. Check whether the Stopwatch is already running by invoking the isRunning() method.
The Stopwatch documentation says the following on this:
Stopwatch methods are not idempotent; it is an error to start or stop a stopwatch that is already in the desired state.
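
The rule is easier to see in a small JDK-only sketch. SimpleStopwatch below is a hypothetical stand-in written for this illustration, not Guava's implementation; it only mimics the behaviour described above, where starting or stopping a stopwatch that is already in that state is an error.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for illustration only; this is not Guava's Stopwatch.
class SimpleStopwatch {
    private boolean running;
    private long startNanos;
    private long elapsedNanos;

    boolean isRunning() {
        return running;
    }

    SimpleStopwatch start() {
        if (running) {
            throw new IllegalStateException("stopwatch is already running");
        }
        running = true;
        startNanos = System.nanoTime();
        return this;
    }

    SimpleStopwatch stop() {
        if (!running) {
            throw new IllegalStateException("stopwatch is already stopped");
        }
        elapsedNanos += System.nanoTime() - startNanos;
        running = false;
        return this;
    }

    // Returns the accumulated time, converted to the requested unit.
    long elapsed(TimeUnit unit) {
        long total = running
                ? elapsedNanos + (System.nanoTime() - startNanos)
                : elapsedNanos;
        return unit.convert(total, TimeUnit.NANOSECONDS);
    }

    public static void main(String[] args) {
        SimpleStopwatch watch = new SimpleStopwatch();
        // Guard with isRunning() so a second start() is never attempted.
        if (!watch.isRunning()) {
            watch.start();
        }
        watch.stop();
        System.out.println(watch.elapsed(TimeUnit.MICROSECONDS) + " us");
    }
}
```

The isRunning() guard in main is the same check you would place before calling start() on the real class.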

Also, I once got burned by the StopWatch class of Apache Commons Lang. I was working in an IDE on a Maven project, and the class got auto-imported into the code, so I did not notice the difference between Stopwatch of Guava and StopWatch of Commons Lang [note the capital W]. I then spent some 20 minutes checking my classpath, IDE setup, etc. Yes, how stupid of me, right? So, please be careful to choose the correct class.

Sunday, January 26, 2014

Learning Guava -- Load properties file using Guava

Guava code snippet for loading a properties file from classpath.

For more info, please check the Resources class of Guava.
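
For comparison, here is a minimal sketch of the same task using only the JDK rather than Guava's Resources class. The class name ClasspathProperties, its method names and the file name app.properties are assumptions made for this illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Properties;

// Illustrative JDK-only helper; the names here are not from any library.
class ClasspathProperties {

    // Reads key=value pairs from an already opened stream.
    static Properties parse(InputStream in) {
        Properties props = new Properties();
        try {
            props.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Looks the resource up on the classpath and loads it.
    static Properties load(String resourceName) {
        ClassLoader loader = ClasspathProperties.class.getClassLoader();
        try (InputStream in = loader.getResourceAsStream(resourceName)) {
            if (in == null) {
                throw new IllegalArgumentException(
                        "Resource not found on classpath: " + resourceName);
            }
            return parse(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "app.properties" is a hypothetical file name.
        try {
            Properties props = load("app.properties");
            System.out.println(props);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The explicit null check matters because getResourceAsStream returns null rather than throwing when the resource is missing.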

Wednesday, November 27, 2013

Learning Guava -- Google Guava blog series

I have been a huge fan of Google Guava from the time I came across it 3 years back.
For starters, Guava is a project which contains many of Google's core libraries for collections, caching, math, primitives, concurrency, networking, common annotations, string processing, I/O, reflection and many others.
It is a very well-designed API. Guava is designed, implemented and maintained by Google engineers such as Kevin Bourrillion and Kurt Alfred Kluever.

Guava follows almost all the excellent patterns and practices mentioned in the Effective Java book written by Joshua Bloch, who designed the impeccable Java Collections API while he was at Sun and later joined Google. Under his mentorship, Google Guava got wings and became a very well-designed and effective API, useful for many situations and scenarios, with an ever-growing feature list. I make sure to add the Guava dependency as the first thing in my Gradle or Maven build script. Guava makes Java code a lot more readable, clean, simple and elegant. It uses Java generics very well.

Consider the following example, which I tweeted a few months back.
[Image: Google Guava sample code]

Which of the above versions looks better? Obviously the second option, isn't it?
There are many such examples where Guava wins by a margin compared to plain Java code and/or other libraries like Commons.

In a way, Guava also helps with functional programming. There are a few options which are really helpful there as well. Having said that, Guava's creators implore developers not to litter code with too much functional programming, which might lead to unreadable code.

I will start writing a few posts on Google Guava with the tag "LearningGuava". I have been using Guava extensively in almost every project of mine for a few years. This will help not only someone else looking for info or starting out with Google Guava, but also me, so that I can find a quick snippet in future when I need something specific about Guava usage. That being the motivation, I hope it will be a good experience for you and for me as well.

This post will list all the posts written on Google Guava. It kinda serves as an index and quick reference for my Google Guava posts.

  1. Load properties file using Guava
  2. Calibrating time using Stopwatch