Wikipedia Data in Apache Spark and Scala (Updated)

More than you possibly ever wanted to know about parsing various Wikipedia data sources in Spark and Scala.


The Flavors of Data Science and Engineering

Data science means something different to everyone and is more of a marketing term than a job description nowadays. That said, certain definitions of it and the related disciplines are starting to emerge from what I've seen, so I wanted to write down my perceptions.


Peapod: A Scala and Spark Data Pipeline and Dependency Manager

Peapod is a new dependency and data pipeline management framework for Spark and Scala. The goal is to provide a framework that is simple to use, automatically saves and loads the output of tasks, and provides support for versioning.


Scala: A Literary Review

Originally, I wanted to write a post reviewing Apache Spark's hyper-parameter optimization and the many ways in which it could be improved, and to unveil a library I have been putting together that integrates seamlessly with Spark's existing ML library.

But over the holiday break, I was approached by many people around the web with questions like "Why is Scala so Popular?" or "Haskell vs Scala, which would you choose?" or "Should I use Scala or Python when using Spark?". So I decided to first write a post about this language I love so much. 

My appreciation goes beyond the technical terms in which this subject is usually covered: it transcends code and touches on the overarching topic of human communication, the modalities of thought and expression at large. When I ponder this, the image that appears in my head is one that has been etched onto my soul: that of Simone Weil.

Simone Weil could pack into a short sentence what many writers would need an entire volume to express, always supposing they had insights of such depth to express in the first place.

I don't think any medium encourages engineers to do just that in their domain better than Scala. Writing short and expressive code (Ruby, Python?) that produces type-safe, high-performance apps (Java, C++?) is a duality that merges most harmoniously in Scala. Its syntax reaches a level of expressiveness rarely seen in other compiled languages, while the code remains strongly typed and supports mixin composition through traits, which gives a disciplined form of multiple inheritance.
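As a minimal sketch of what mixin composition looks like in practice, the snippet below stacks a few traits on one class. All names here (`Named`, `Greeter`, `Shouter`, `Describable`, `User`, `loudUser`) are hypothetical and exist only for illustration.

```scala
object Mixins {
  // A small capability: anything that has a name.
  trait Named { def name: String }

  // Traits can carry concrete behavior, not just signatures.
  trait Greeter extends Named {
    def greet: String = s"Hello, $name"
  }

  // A stackable modification: refines greet by calling up the trait chain.
  trait Shouter extends Greeter {
    override def greet: String = super.greet.toUpperCase
  }

  trait Describable extends Named {
    def describe: String = s"<$name>"
  }

  // One class inherits behavior from several traits at once.
  class User(val name: String) extends Greeter with Describable

  // Traits can also be mixed in at instantiation time.
  def loudUser(name: String): Greeter = new User(name) with Shouter
}
```

Here `new Mixins.User("ada").greet` goes through `Greeter` alone, while `Mixins.loudUser("ada").greet` goes through `Shouter` first and then delegates upward via `super`.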

So one ends up with short, expressive syntax that cuts unnecessary punctuation and condenses map, filter, reduce, and other higher-order operations into simple one-liners.
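To make that concrete, here is a small sketch using only the standard collections library. The object and function names (`OneLiners`, `sumOfEvenSquares`, `wordCounts`, `longest`) are made up for this example.

```scala
object OneLiners {
  // filter, map, then a fold as the reduction step (safe on empty lists).
  def sumOfEvenSquares(xs: List[Int]): Int =
    xs.filter(_ % 2 == 0).map(x => x * x).foldLeft(0)(_ + _)

  // Word frequencies: group equal words, then count each group.
  def wordCounts(words: Seq[String]): Map[String, Int] =
    words.groupBy(identity).map { case (w, ws) => w -> ws.size }

  // reduce keeps the longer of each pair; assumes a non-empty input.
  def longest(words: Seq[String]): String =
    words.reduce((a, b) => if (b.length > a.length) b else a)
}
```

Each function body is a single expression, with no loop counters or temporary mutable variables.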

From the outset, Scala encourages switching from mutable data structures to immutable ones and from regular methods to pure functions. This has immediate effects, starting with a reduction in issues rooted in the unintentional side effects common in large code bases. Your code becomes safer, more stable, and easier to understand. Coding this way also imprints on the writer new ways to view the concepts of data mutability, higher-order functions, sophisticated type systems, etc.
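A rough illustration of that shift, assuming a hypothetical `Account` type invented for this example: the update is a pure function that returns a new value rather than changing the one it was given.

```scala
object Immutability {
  // An immutable record: case class fields are vals and cannot be reassigned.
  final case class Account(owner: String, balanceCents: Long)

  // A pure function: no side effects, the input account is left untouched.
  def deposit(account: Account, amountCents: Long): Account =
    account.copy(balanceCents = account.balanceCents + amountCents)

  // Immutable collections behave the same way: :+ builds a new list.
  def record(history: List[Account], next: Account): List[Account] =
    history :+ next
}
```

Because `deposit` never modifies its argument, any other code holding the original `Account` keeps seeing the same balance, which is what removes the class of bugs caused by unintended side effects.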

And quite frankly, concepts like abstract algebra and type theory, which would otherwise decompose along with the academic papers they were written on, are brought to life by Scala's sophisticated type system and its support for custom type declarations.
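One way to see abstract algebra surface in everyday code is a monoid, sketched below as a type class. This is a hand-rolled illustration: the `Monoid` trait, its instances, and `combineAll` are defined here for the example and are not taken from any particular library.

```scala
object Algebra {
  // A monoid: an associative combine operation plus an identity element.
  trait Monoid[A] {
    def empty: A
    def combine(x: A, y: A): A
  }

  object Monoid {
    // Integers under addition, with 0 as the identity.
    implicit val intAddition: Monoid[Int] = new Monoid[Int] {
      val empty: Int = 0
      def combine(x: Int, y: Int): Int = x + y
    }

    // Strings under concatenation, with "" as the identity.
    implicit val stringConcat: Monoid[String] = new Monoid[String] {
      val empty: String = ""
      def combine(x: String, y: String): String = x + y
    }

    // Generic over any type that has a Monoid instance in implicit scope.
    def combineAll[A](xs: List[A])(implicit m: Monoid[A]): A =
      xs.foldLeft(m.empty)(m.combine)
  }
}
```

The same `combineAll` sums integers and concatenates strings, because the compiler selects the instance from the element type.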

There are some downsides, however: there is no single, universally followed guide to best practices in Scala. Python, for example, aims for a single best practice and even has a standard style guide, PEP 8. Scala has several different guides and slide decks, about which people continue to debate endlessly. This is partly because, while Scala encourages functional practices, it does not enforce them as religiously as a language like, say, Haskell or Erlang would. But I tend to view this "deficiency" as an upside: to decipher Scala code is to first decipher the writer’s style, ideas, thought patterns, and persona.

Spark EC2 Setup and Workflow

So how do we run and deploy code using Scala and Spark in EC2?


The Limitations of Apache Spark

... all that's Spark doesn't shine.
