Tuesday, March 17, 2015

Spark is your friend? - We'll see about that.

Spark likes to pretend that it is your friend. For example, it is a friend of Hadoop and uses its storage, HDFS. But Spark is more flexible and more powerful than Hadoop. So Spark may steal Hadoop's girlfriend, the HDFS, and continue on its way with her.

Spark also does not hesitate to borrow and re-use. And why not? In the open-source world, the old adage that "jealousy of the nerds only increases software quality" holds true. Look, for example, at the spark-shell. Things are easy here. If I want to read the input log lines and keep only those that have the word "error" in them, I can write

val inputRDD = sc.textFile("log.txt") 
val errorsRDD = inputRDD.filter(line => line.contains("error"))

But what does this remind you of? Well, if you have ever seen the ease with which Scalding solves such problems, you will immediately recognize the Scalding constructs here.

Now, what is Scalding, you may ask. Well, easy. Hadoop developers keep doing the same things over and over: reading data, filtering it, and joining it with other data. So it is understandable that someone would want to simplify all this, and this someone is Chris Wensel, who invented Cascading. And since Cascading already sounded so much like Scala, it was only one more step to wrap it in Scala, and this is how Scalding was born.

Only now you don't need it. The ease, elegance, and simplicity of Scalding are built right into the spark-shell, as my example above should show you.
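
For comparison, here is roughly what the same job would look like written as a Scalding job with the Fields API. This is a sketch from memory; the job name and file names are just for illustration.

import com.twitter.scalding._

class ErrorsJob(args: Args) extends Job(args) {
  // TextLine reads the file line by line into the 'line field
  TextLine("log.txt")
    .filter('line) { line: String => line.contains("error") }
    .write(TextLine("errors.txt"))
}

Same idea, but you have to write a job class, build it, and submit it before you see any output. In the spark-shell the two lines above run interactively.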

Let me repeat the Spark lines and explain them.

val inputRDD = sc.textFile("log.txt")

means that we are reading the file "log.txt", which may be a local file or an HDFS file (more on this later). "sc" is the SparkContext object that the spark-shell creates for you. An RDD, by the way, is Spark's Resilient Distributed Dataset, the collection type that all these operations work on. And "val inputRDD" is a way to declare values (really, immutable variables) in Scala. So far all is well.
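
If you want to be explicit about where the file lives, you can spell out the scheme in the path. The host and paths below are made-up examples:

val localRDD = sc.textFile("file:///home/me/log.txt")            // a file on the local filesystem
val hdfsRDD  = sc.textFile("hdfs://namenode:8020/logs/log.txt")  // a file on HDFS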

val errorsRDD = inputRDD.filter(line => line.contains("error"))

This line contains familiar elements. It means: "take the inputRDD you just created and filter it, keeping only the lines that contain the word 'error'." Note how you define the filter function with the => symbol: it takes a line from the input and returns true only when that line contains "error".
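
If the => notation still looks strange, the same filter can be written with an ordinary named function, which shows that we are simply passing a function that answers true or false for each line (hasError is my own name, not anything Spark requires):

def hasError(line: String): Boolean = line.contains("error")
val errorsRDD2 = inputRDD.filter(hasError)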

By now you must have become proficient with the use of spark-shell, so the next line may need no explanation.

errorsRDD.take(10).foreach(println)

I don't have to tell you that this means: "take the first 10 lines and print each one of them."
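
If you want to see what is going on, the same line can be split in two: take(10) brings the first ten matching lines back to the shell as a plain Scala Array, and foreach(println) then prints each element of that array.

val firstTen: Array[String] = errorsRDD.take(10)
firstTen.foreach(println)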

And that's it - you are a master of spark-shell programming, and you have absorbed the best practices that Scalding brought to Hadoop - and bypassed the Scalding learning curve.

Until the next edition of Sparklets, a personal story of learning Spark.
