The Limitations of Apache Spark

When Apache Spark was introduced at the Hadoop World Conference in California in July 2014, my boss, Marcin Mejran - then Lead Data Scientist at HookLogic - came back to announce that we would be switching our entire Machine Learning infrastructure from Hadoop over to Spark.

So we got to work. How could we not? Our boss was actually inviting us to use bleeding-edge tools; it was one of those learning opportunities exciting enough to make you wonder why you're even being paid to do your job.

Now, I am not going to delve into the virtues of Spark, which have been listed endlessly all over the web alongside its apparently lone flaw: "needs to mature". No, I am going to spend time on a specific - and seemingly minor - limitation of Spark as a tool for experienced Machine Learning Engineers who want to build large-scale, end-to-end learning systems. Or perhaps this is just a long-overdue rant after many sessions of intense debugging, disguised as a technical review. I hope you will indulge me.

What prompted this post (which I had originally intended as a guide to setting up spark-notebooks) was the design and implementation of a complete anomaly-detection system from scratch, in Scala, using Apache Spark.

The overarching approach I decided on was to calculate various density metrics and distance measures over the relevant parameters and encode the outputs of those models as features for a single-class SVM. Not terribly daunting.
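In rough strokes, the encoding step would look something like the sketch below - the names and record type here are placeholders of my own, not the actual system:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical sketch: each record's density and distance scores become one
// feature vector for the downstream (single-class) SVM.
case class Record(id: Long, values: Array[Double])

def encodeFeatures(records: RDD[Record],
                   densityScore: Record => Double,
                   distanceScore: Record => Double): RDD[LabeledPoint] =
  records.map { r =>
    // The label is a placeholder; a single-class SVM treats every training
    // point as a member of the "normal" class.
    LabeledPoint(1.0, Vectors.dense(densityScore(r), distanceScore(r)))
  }
```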

I am a big believer in taking advantage of libraries (which is not the same thing as relying on them), even if you end up having to make lots of changes to turn one into what you want. Chances are pretty high that other people besides you have used the library and contributed to it, making it better than what you would have carved from scratch. Also, I like to spend my time working on interesting problems that remain unsolved. That being said, I have experience working with the MLlib libraries and expected to have to make my own changes to mold their SVM into something that would best serve my purposes. I also knew that MLlib was lacking in the number of algorithms it offers (Tanimoto distance is a relevant example).
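For reference, Tanimoto (extended Jaccard) similarity is just the dot product normalized by the two squared norms minus that same dot product, with the distance being one minus the similarity - exactly the kind of small utility you would hope a library already has. The snippet below is my own plain-Scala sketch, not anything shipped with MLlib:

```scala
// Tanimoto (extended Jaccard) similarity between two dense vectors,
// written against plain arrays since MLlib offered nothing comparable.
def tanimoto(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "vectors must have the same dimension")
  val dot    = a.zip(b).map { case (x, y) => x * y }.sum
  val normA2 = a.map(x => x * x).sum
  val normB2 = b.map(x => x * x).sum
  dot / (normA2 + normB2 - dot)
}

def tanimotoDistance(a: Array[Double], b: Array[Double]): Double =
  1.0 - tanimoto(a, b)
```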

I also have very extensive experience with Apache Mahout, and while I have had some really fun times with it, I feel that it has become rather too heavyweight. Even when we switched our Machine Learning operations to run on Spark, we kept using Apache Mahout because it was deeply embedded in our system and because Apache Spark's MLlib lacked functional counterparts.

But this is a new job, a new system, a clean slate. So I got down to business coding some distances from scratch. Simple enough. Or so I thought. After all, the first order of business was to do some basic linear algebra between two Vectors... you know, some dot products, some norms, etc. To my horror, I could not find any proper support (i.e. anything that isn't super inefficient and hacky) for vector operations on MLlib's Vector type. What?
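To make the point concrete, here is a minimal sketch of the kind of helper I ended up having to write just to get a dot product and a norm out of two MLlib Vectors - the object and method names are my own, not anything MLlib provides:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical helper: MLlib's Vector exposes no arithmetic of its own,
// so even a dot product means dropping down to the raw arrays.
object VectorOps {
  def dot(a: Vector, b: Vector): Double = {
    require(a.size == b.size, "vectors must have the same dimension")
    val (xs, ys) = (a.toArray, b.toArray)
    var i = 0
    var sum = 0.0
    while (i < xs.length) { sum += xs(i) * ys(i); i += 1 }
    sum
  }

  def norm(a: Vector): Double = math.sqrt(dot(a, a))
}

// Usage:
val u = Vectors.dense(1.0, 2.0, 3.0)
val v = Vectors.dense(4.0, 5.0, 6.0)
VectorOps.dot(u, v)   // 32.0
VectorOps.norm(u)     // ~3.74
```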

So I googled. As it turns out, I am not the first to be shocked by this stunner. Some golden responses:

As I understand it, the Spark people do not want to expose third party APIs (including Breeze) so that it’s easier to change if they decide to move away from them.

You could always put just a simple implicit conversion class in that package and write the rest of your code in your own package. Not much better than just putting everything in there, but it makes it a little more obvious why you’re doing it.
— dlwh
putting code in the mllib.linalg package is not a viable solution for clients of the mllib framework
— javadba
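For concreteness, the "class in their package" trick being debated above looks roughly like the sketch below. This is a hypothetical illustration, assuming the package-private toBreeze / fromBreeze members of the 1.x MLlib Vector; it is emphatically not a supported API:

```scala
// This file must be declared inside Spark's own namespace for the trick to work.
package org.apache.spark.mllib.linalg

import breeze.linalg.{Vector => BV}

object BreezeBridge {
  // Accessible only because we are sitting inside org.apache.spark.mllib.linalg;
  // from any ordinary client package these members are off-limits.
  implicit class RichVector(val v: Vector) extends AnyVal {
    def asBreeze: BV[Double] = v.toBreeze
  }

  def fromBreeze(bv: BV[Double]): Vector = Vectors.fromBreeze(bv)
}
```

The rest of your code imports BreezeBridge from its own package and pretends none of this ever happened - which is exactly the kind of workaround javadba calls out as non-viable for real clients of the framework.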

WHAT?!?! This folly was intentional? Now, you may think, "Zoe, come on, it's not that bad, what's adding another class?" But that is not what I have a problem with; I see this as a matter of principle, and what I found upon digging deeper is what necessitated this post. As it turns out, two Jira tickets (one a follow-up to the other) were created on the topic.

Here's an excerpt from Sean Owen's gem (sarcasm) of a response:

I think the idea was that this is not supposed to become yet another vector/matrix library, and that you can manipulate the underlying breeze vector if needed
— Sean Owen

...Uh, yeah, that's nice, except you can't: the conversion to the underlying Breeze vector is package-private, so client code outside Spark's own packages can't touch it at all.

Now, bear in mind, I have a lot of respect for Sean Owen; he has made a lot of contributions to the Apache Spark project. But his reply on the subject is disappointing, especially in light of the fact that, as a top contributor, he has the power to shape the "culture" of this particular community - one that is becoming increasingly suffocating and exclusive.

I love the Apache Spark project, but I am willing to admit that not all that's Spark shines. So I write this not as a rant, but rather as an avid advocate and user of Apache Spark who wants to have a voice in the community and help make it better.