Apache Spark: DataFrames and RDDs

A more detailed look at how DataFrames compare to RDDs in terms of features and performance, and at whether the claims made about them hold up.

Read More

Wikipedia Data in Apache Spark and Scala (Updated)

More than you possibly ever wanted to know about parsing various Wikipedia data sources in Spark and Scala.

Read More

ONC Patient Matching Challenge: Part 1

This is the first of a two-part series on tackling the ONC Patient Matching Challenge. The first part reviews background and high-level topics, while the second part covers the matching engine that was built (and how to use it).

Read More

Peapod: A Scala and Spark Data Pipeline and Dependency Manager

Peapod is a new dependency and data pipeline management framework for Spark and Scala. The goal is to provide a framework that is simple to use, automatically saves and loads the output of tasks, and supports versioning.

Read More

Data Pipeline and Task Management: The Unsolvable Problem?

There are probably more well-known data pipeline dependency management and scheduling frameworks than you can name in one breath. Is there a reason for that beyond mere not-invented-here syndrome?

Read More

Spark, Turn Off the Light on the CLI

We know that Hadoop can be 235 times slower than the command line for smaller data sets. Does the same apply to Spark?

Read More

Scala: A Literary Review

Originally, I wanted to write a post reviewing Apache Spark's hyper-parameter optimization and the many ways in which it could be improved, and to unveil a library I have been putting together that integrates seamlessly with Spark's existing ML library.

But over the holiday break, I was approached by many people around the web with questions like "Why is Scala so Popular?" or "Haskell vs Scala, which would you choose?" or "Should I use Scala or Python when using Spark?". So I decided to first write a post about this language I love so much. 

My appreciation goes beyond the technical terms in which this subject is usually covered - it transcends code and appeals to the overarching topic of human communication: modalities of thought and expression at large. When I ponder this, the image that appears in my head is one that has been etched onto my soul: that of Simone Weil.

Simone Weil could pack into a short sentence what many writers would need an entire volume to express, always supposing they had insights of such depth to express in the first place.

I don't think any medium encourages engineers to do just that in their domain better than Scala. Writing short and expressive code (Ruby, Python?) to produce type-safe and high-performance apps (Java, C++?) is a duality that merges most harmoniously in Scala. It offers a level of expressive syntax rarely seen in other compiled languages; code is strongly typed, and traits provide mixin composition, which gives a form of multiple inheritance.

So one ends up with short, expressive syntax that cuts unnecessary punctuation and condenses map, filter, reduce, and other higher-order operations into simple one-liners.
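
As a rough illustration of the one-liners described above, here is a minimal sketch using the standard collections. The `OneLiners` object and its sample words are hypothetical and chosen only for this example.

```scala
object OneLiners {
  // Hypothetical sample data, not taken from any real data set.
  val words: List[String] = List("spark", "scala", "rdd", "dataframe")

  // map: turn each word into its length.
  val wordLengths: List[Int] = words.map(_.length)

  // filter, map, and reduce chained into a single expression:
  // keep lengths above 3, double them, then sum.
  val total: Int = wordLengths.filter(_ > 3).map(_ * 2).reduce(_ + _)
}
```

The underscore placeholder stands in for the lambda argument, which is part of what keeps each step short.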

From the outset, Scala really encourages switching from mutable data structures to immutable ones and from regular methods to pure functions. This has some immediate impacts, starting with a reduction in issues rooted in the unintentional side effects common in large code bases. So from the start your code will be safer, more stable, and easier to understand. Coding this way will imprint on the writer new ways to view the concepts of data mutability, higher-order functions, sophisticated type systems, etc.
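
To make the immutable, pure-function style above concrete, here is a small sketch. The `Account` type and `deposit` function are hypothetical names invented for this example.

```scala
object ImmutableExample {
  // An immutable record: its fields cannot be reassigned after construction.
  final case class Account(owner: String, balance: BigDecimal)

  // A pure function: it returns a new Account and leaves its input untouched,
  // so there is no hidden side effect for callers to worry about.
  def deposit(account: Account, amount: BigDecimal): Account =
    account.copy(balance = account.balance + amount)

  val before: Account = Account("alice", BigDecimal(100))
  val after: Account = deposit(before, BigDecimal(25))
}
```

Because `before` is never modified, any other code holding a reference to it keeps seeing the same value.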

And quite frankly, concepts like abstract algebra and type theory, which would otherwise decompose along with the academic papers they were written on, are brought to life by Scala's sophisticated type system and its support for custom type declarations.
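
One way to picture abstract algebra coming to life in the type system is a monoid expressed as a type class. This is only a sketch under my own assumptions: `Monoid`, `combineAll`, and the two instances are defined here for illustration and are not part of the standard library.

```scala
object MonoidExample {
  // A monoid: an associative combine operation plus an identity element.
  trait Monoid[A] {
    def empty: A
    def combine(x: A, y: A): A
  }

  // Integers under addition, with 0 as the identity.
  implicit val intAddition: Monoid[Int] = new Monoid[Int] {
    val empty: Int = 0
    def combine(x: Int, y: Int): Int = x + y
  }

  // Strings under concatenation, with "" as the identity.
  implicit val stringConcat: Monoid[String] = new Monoid[String] {
    val empty: String = ""
    def combine(x: String, y: String): String = x + y
  }

  // Folds any list whose element type has a Monoid instance in scope.
  def combineAll[A](xs: List[A])(implicit m: Monoid[A]): A =
    xs.foldLeft(m.empty)(m.combine)

  val sum: Int = combineAll(List(1, 2, 3))
  val joined: String = combineAll(List("sca", "la"))
}
```

The compiler picks the instance from the element type, so the same `combineAll` works for any type that supplies the algebraic structure.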

There are some downsides, however: there is no single, universally followed guide to best practices in Scala. Python, for example, aims for a single best practice and even has a standard for it: PEP 8. Scala has several different guides, about which people continue to debate endlessly. This is partly because, while Scala encourages functional practices, it does not enforce them as religiously as a language like, say, Haskell or Erlang would. But I tend to view this "deficiency" as an upside: to decipher Scala code is to first decipher the writer’s style, ideas, thought patterns, and persona.

Wikipedia Data in Spark and Scala

More than you possibly ever wanted to know about parsing various Wikipedia data sources in Spark and Scala.

Read More

Setting Up Spark Notebooks on EC2 (No VPN)

Getting up and running with Spark Notebooks (Part 1)

Read More

Spark EC2 Setup and Workflow

So how do we run and deploy code using Scala and Spark in EC2?

Read More