Tuesday, March 17, 2015

Spark is your friend? - We'll see about that.

Spark likes to pretend that it is your friend. For example, it is a friend of Hadoop and uses Hadoop's storage, HDFS. But Spark is more flexible and more powerful than Hadoop, so Spark may yet steal Hadoop's girlfriend, HDFS, and continue on its way with her.

Spark also does not hesitate to borrow and re-use. And why not? In the open source world the old Talmudic adage that "jealousy of the nerds only increases software quality" holds true. Look, for example, at the spark-shell. Things are easy here. If I want to read the input log lines and take only those that have the word "error" in them, I can write

val inputRDD = sc.textFile("log.txt") 
val errorsRDD = inputRDD.filter(line => line.contains("error"))

But what does this remind you of? Well, if you have ever seen the ease with which Scalding solves such problems, you will immediately recognize the Scalding constructs here.

Now, what is Scalding, you may ask. Well, easy. Since Hadoop developers always do the same things, such as reading data, filtering it and joining it with other data, then it is understandable that someone may want to simplify this, and this someone is Chris Wensel, who invented Cascading. But since Cascading sounded so much like Scala already, it was then one step to re-write it in Scala, and this is how Scalding was born.

Only now you don't need it. The ease, elegance and simplicity of Scalding is built right into the spark-shell, as my example above should show you.
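To make the comparison concrete, here is roughly what the same job might look like in Scalding's typed API. This is a sketch for illustration only; the job class name and file paths are hypothetical, and the exact API details depend on your Scalding version:

```scala
import com.twitter.scalding._

// A hypothetical Scalding job doing the same filtering as the
// spark-shell example: read log.txt, keep lines containing "error".
class ErrorsJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine("log.txt"))         // read the file line by line
    .filter(line => line.contains("error"))   // keep only the "error" lines
    .write(TypedTsv[String]("errors.txt"))    // write the surviving lines out
}
```

Note how the filter line is nearly identical in both: the same Scala function literal, passed to the same-named method.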

Let me repeat the lines and explain them.

val inputRDD = sc.textFile("log.txt")

means that we are reading the file "log.txt", which may be a local file or an HDFS file (more on this later). "sc" is the SparkContext object that the spark-shell creates for you. And "val inputRDD" is how you declare a value (really, an immutable variable) in Scala. So far all is well.

val errorsRDD = inputRDD.filter(line => line.contains("error"))

This line contains familiar elements. It means: "take the inputRDD you just created and filter it, keeping only the lines with the word "error"". Note how you define the filter function with the => symbol: it takes each line from the input and keeps only those that contain "error".
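The => syntax is just Scala's shorthand for an anonymous function, so you can also name the function first and pass it in. A small sketch to type into the spark-shell (the names f and errorsRDD2 are mine, not from the example above):

```scala
// The function literal, written out as a named value:
// it takes a String and returns a Boolean.
val f: String => Boolean = line => line.contains("error")

// Passing the named function is equivalent to writing it inline:
val errorsRDD2 = inputRDD.filter(f)
val errorsRDD3 = inputRDD.filter(line => line.contains("error"))
```

Both filters produce the same result; the inline form is simply more common in shell sessions.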

By now you must have become proficient with the use of the spark-shell, so the next line may need no explanation.

errorsRDD.take(10).foreach(println)

I don't have to tell you that this means, "take the first 10 lines and print each one of them."
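take(10) is what Spark calls an "action": it actually runs the computation and pulls results back to the shell, unlike textFile and filter, which only describe it. A few other standard RDD actions you could try on the same data in the shell:

```scala
// Actions trigger the computation and return results to the driver:
errorsRDD.count()                    // total number of "error" lines
errorsRDD.first()                    // just the first matching line
errorsRDD.take(10).foreach(println)  // print the first ten, as above
```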

And that's it - you are a master of spark-shell programming, and you have absorbed the best practices that Scalding brought to Hadoop - and bypassed the Scalding learning curve.

Until next edition of Sparklets, a personal story of learning Spark.

Why is Spark a winner? - The second reason

In my last installment I described how Microsoft missed its chance to be the leader in the Big Data. Why? Why was Dryad killed, but Spark, Dryad's successor, is all the rage in the community? I suggested a simple enough reason: Dryad was to be placed in Microsoft's proprietary cloud, Azure, whereas Spark is completely open source.

However, there is yet another reason, just as important as the first one. You see, Dryad was an all-out Hadoop killer. It did not play nicely with Hadoop; in fact, it did not play with it at all. (Parenthetically, could the Microsoft of 2008 allow itself to play into its competitors' hands? No, of course not!) Spark, however, pretends to be Hadoop's friend: it uses Hadoop's storage, HDFS, as one of its input sources. So the two play together nicely, unless something else happens. Will it or not? Tune in to the next edition of "Sparklets," a personal story of learning Spark.

Sunday, March 8, 2015

The irony of the story (Microsoft, Hadoop and Dryad)

Microsoft's relationship with Hadoop was for a long time ambiguous: from a rumor about "Hadoop on Azure" (back in 2008) to "never!" to "We will build our own" and finally to "Try Hadoop on Azure today!"

Meanwhile, in the MS labs, some very bright people were working on Dryad, "the Hadoop killer." Dryad is a pretty forest nymph, and one can guess that Dryad the software was intended to be just as pliable and nimble. Microsoft allowed the paper on Dryad to be published, and then killed the project.

And then, literally in the last two years, there appeared Spark, and if Hadoop and Big Data were hot, Spark is many times hotter, if you measure by the number of committers and commits in the project's repository. While people are preparing training courses and debating why Spark is so hot, the supreme irony just cannot escape my eyes: Spark inherits so many ideas from Dryad that it could be called the open source implementation of Dryad, much as Hadoop is the open source implementation of Google's MapReduce.

So what happened here? Microsoft had a Hadoop killer but killed it? The simple logical explanation could be this: to be a "Hadoop killer," Dryad needed to be an open source community project. Then it would fit within the ecosystem of Big Data. And back then Microsoft was not into open source.

Of course, Spark is not really a "killer"; it is rather another great tool in the Big Data universe, as evidenced by the Hadoop distros simply embracing and including it. But being the leading company behind its adoption would have been nice for Microsoft.

Would you have another explanation?