Peapod: A Scala and Spark Data Pipeline and Dependency Manager

Peapod is a new dependency and data pipeline management framework for Spark and Scala. The goal is to provide a framework that is simple to use, automatically saves and loads the output of tasks, and provides support for versioning.

Read More

Confessions Of a Scala Enthusiast

I feel terrible dragging LadyDi into another controversy. But this is not so much a poor reflection on her as it is on me. You see, I have spoken endlessly about the merits of Scala and Functional Programming. For a time, I was even the most-viewed writer on Scala on Quora because I wrote so much on the topic!

Unfortunately I have come to realize I have been a lot of talk and very little action. Whoops.

Recently, I have come to take a good hard look at my code, as well as the libraries I have released into the open source community - and you know what? Most of it is not functional. In most places, I just use (abuse, really) the functional properties somewhat enforced in Scala to make my OOP more stable. But that's really about it. In reality, I have written very little functional code, and no strictly functional program. As is evident from my passionate remarks on the topic, I am obviously really excited about FP, and I even understand how to go about it; but somehow, when push comes to shove, I remain stuck in my un-functional ways.

Upon speaking to fellow engineers at work and elsewhere, I have come to see that there are many strong, experienced, creative, innovative engineers who have trouble picking up Scala as a functional programming language. But I had trouble understanding why that is, considering that many concepts in Functional Programming have very long, deep roots in the history of computer science as well as analytical philosophy.

Let's take a look at one of the most critical concepts in Functional Programming: referential transparency. Its origins date back to Whitehead and Russell's Principia Mathematica. Sure, at first glance you may dismiss this as just another old part of "mathematica", but Willard Van Orman Quine adopted the concept in his Word and Object, published in 1960, where he coined the term "referential transparency".

RT was then thrust into the spotlight of computer programming just seven years later, in Christopher Strachey's monumental piece Fundamental Concepts in Programming Languages (1967).
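To make the term concrete, here is a tiny Scala sketch of my own (the function names are purely illustrative): a referentially transparent expression can be replaced by its value anywhere in a program without changing its behavior, while a side-effecting one cannot.

```scala
// Referentially transparent: same inputs always give the same output
// with no hidden state, so `add(2, 3)` can be replaced by `5`
// anywhere without changing the program's behavior.
def add(x: Int, y: Int): Int = x + y

// Not referentially transparent: the result depends on (and alters)
// mutable state, so the call cannot be swapped for a value.
var counter = 0
def addAndCount(x: Int, y: Int): Int = {
  counter += 1
  x + y
}
```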

And that's just RT. I have not and will not even begin to talk of lambda calculus - doing so would derail this post completely.

So let me reel this post back in. If FP has such footing in both Mathematics and Computer Science, why do most engineers have such a hard time adopting it in how they think about code?

Well, to start, all of the above are mainly theoretical concepts. Software Engineering is by nature an applied field, so most software engineers are employed in industry rather than academia, and most folks don't go beyond a Bachelor's or Master's degree (which is a lot, given that many of the most dominant and influential figures in the tech industry are college dropouts). A core part of almost any undergraduate CS curriculum is a subject typically titled "Data Structures and Algorithms". It is quite the fixture, actually. Many of the most prestigious employers in technology (Google, FB, Amazon, ... all the way to Slack, etc.) outright inform candidates - regardless of seniority - to expect some questions directly from "Data Structures and Algorithms". In fact, Google is notorious for turning away renowned experts in the very fields it was hiring for because they failed the DSA part of its rigid interview process.

So what does a typical DSA course consist of and how is this pertinent to our topic here? 

It's actually mostly made up of searching and sorting algorithms, problem-solving strategies, and space and time complexity problems. Here is the root of our headache: it is not really the content of the course, but how fundamental the course has become to landing a job in industry and having a career.

Let's take just two minor topics that feature prominently in any DSA course:

  • hash tables
  • the QuickSort algorithm (or MergeSort, or any divide-and-conquer algorithm, really)

From a functional outlook, any change to a hash table requires creating a copy of the original hash table. This is obviously very inefficient and fundamentally goes against the underlying premise of a course like DSA: solve problems, and solve them efficiently, even if you trade off reusability and safety. As far as sorting is concerned, we again face the matter of efficiency, since in-place modification is axiomatically impossible from a functional perspective.
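To make the tension concrete, here is a minimal Scala sketch of my own (not taken from any course): an "update" to an immutable map yields a new map, and a purely functional quicksort allocates new lists instead of swapping elements in place.

```scala
// Immutable update: `updated` returns a new Map, leaving the
// original untouched. (Persistent collections actually share
// structure under the hood rather than copying wholesale, but the
// naive DSA framing sees "copy the whole table".)
val table  = Map("a" -> 1, "b" -> 2)
val table2 = table.updated("c", 3) // `table` is unchanged

// A purely functional quicksort: no in-place swaps; each step
// builds new lists instead of mutating the input.
def quicksort(xs: List[Int]): List[Int] = xs match {
  case Nil => Nil
  case pivot :: rest =>
    val (smaller, larger) = rest.partition(_ < pivot)
    quicksort(smaller) ::: pivot :: quicksort(larger)
}
```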

So from a very impressionable point in our time as software engineers, we are repeatedly told to think about programs almost strictly in terms of efficiency gains and trade-offs; so much so that "good code" in some places has become synonymous with code that provides "efficient" solutions. We are not given the chance to think about, let alone meditate on, the other qualities that contribute to writing "good" code. Functional programming is almost never held in the same regard as imperative programming in most CS programs.

So yeah, it's hard to code functionally, and even more importantly, to think functionally, when the dominant culture forbids a rich, multi-faceted discussion of what is important to solving problems.

Data Pipeline and Task Management: The Unsolvable Problem?

There are probably more well-known data pipeline dependency management and scheduling frameworks than you can name in one breath. Is there a reason for that beyond mere not-invented-here syndrome?

Read More

We're Back! Introducing LadyDi!

LadyDi: Code less, build more

Read More

Spark, Turn Off the Light on the CLI

We know that Hadoop can be 235 times slower than the command line for smaller data sets; does the same apply to Spark?

Read More

Scala: A Literary Review

Originally, I wanted to write a post reviewing Apache Spark's hyper-parameter optimization and the many ways in which it could be improved, and to unveil a library I have been putting together which integrates seamlessly with Spark's existing ML library.

But over the holiday break, I was approached by many people around the web with questions like "Why is Scala so Popular?" or "Haskell vs Scala, which would you choose?" or "Should I use Scala or Python when using Spark?". So I decided to first write a post about this language I love so much. 

My appreciation goes beyond the technical norms in which this subject is usually covered - it transcends code and appeals to the overarching topic of human communication: modalities of thought and expression at large. Pondering this, the image that appears in my head is one that has been etched onto my soul: that of Simone Weil.

Simone Weil could pack into a short sentence what many writers would need an entire volume to express, always supposing they had insights of such depth to express in the first place.

I don't think there is a medium that better encourages engineers to do just that in their own domain than Scala. Writing short, expressive code (Ruby, Python?) to produce type-safe, high-performance apps (Java, C++?) is a duality that merges most harmoniously in Scala. It offers a level of expressive syntax rarely seen in other compiled languages; code is strongly typed and supports mixin composition through traits, a disciplined form of multiple inheritance.

So one ends up with short, expressive syntax, cutting unnecessary punctuation and condensing map, filter, reduce, and other higher-order operations into simple one-liners.
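For example (a toy snippet of my own): summing the squares of the even numbers in a range collapses into a single expression.

```scala
// Filter the evens, square them, and sum - one readable line.
val result = (1 to 10).filter(_ % 2 == 0).map(n => n * n).sum
// result: Int = 220
```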

From the outset, Scala really encourages switching from mutable data structures to immutable ones, and from regular methods to pure functions. This has some immediate impacts, starting with a reduction in issues rooted in unintentional side effects, which are common in large code bases. Your code will be safer, more stable, and easier to understand. Coding like this will imprint on the writer new ways to view the concepts of data mutability, higher-order functions, sophisticated type systems, and so on.
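A minimal before-and-after sketch of my own (the names are purely illustrative), contrasting a side-effecting method with a pure function:

```scala
import scala.collection.mutable.ListBuffer

// Imperative style: the method silently mutates shared state, so
// every caller can be surprised by what others have done to `log`.
val log = ListBuffer.empty[String]
def recordMutable(event: String): Unit = log += event

// Functional style: a pure function returns a new list and leaves
// its input untouched, so it is trivially safe to reuse and test.
def record(log: List[String], event: String): List[String] = event :: log

val log1 = record(Nil, "started")
val log2 = record(log1, "finished") // log1 is still just List("started")
```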

And quite frankly, concepts like abstract algebra and type theory, which would otherwise decompose along with the academic papers they were written on, are brought to life by Scala's sophisticated type system and its support for custom type declarations.
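As an illustration (my own, pared down to its bare bones), here is abstract algebra's monoid - an associative combine operation with an identity element - encoded as a Scala trait, with one instance and one generic consumer:

```scala
// A Monoid, straight out of abstract algebra: an identity element
// and an associative binary operation.
trait Monoid[A] {
  def empty: A
  def combine(x: A, y: A): A
}

// An instance: integers under addition (identity 0).
implicit val intAddition: Monoid[Int] = new Monoid[Int] {
  def empty: Int = 0
  def combine(x: Int, y: Int): Int = x + y
}

// Code written once against Monoid works for every lawful instance.
def combineAll[A](xs: List[A])(implicit m: Monoid[A]): A =
  xs.foldLeft(m.empty)(m.combine)

combineAll(List(1, 2, 3)) // 6
```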

There are some downsides, however, as there is no singularly adhered-to guide for best practices in Scala. Python, for example, aims to implement a single best practice - it even has a standard: PEP8. Scala has several competing style guides, about which people continue to debate endlessly. This is partly due to the fact that while Scala encourages functional practices, it does not enforce them as religiously as a language like, say, Haskell or Erlang would. But I tend to view such a "deficiency" as an upside: to decipher Scala code is to first decipher the writer's style, ideas, thought patterns, and persona.

Wikipedia Data in Spark and Scala

More than you possibly ever wanted to know about parsing various Wikipedia data sources in Spark and Scala.

Read More

Setting Up Spark Notebooks on EC2 (No VPN)

Getting up and running with Spark Notebooks (Part 1)

Read More

Spark EC2 Setup and Workflow

So how do we run and deploy code using Scala and Spark in EC2?

Read More

The Limitations of Apache Spark

... all that's Spark doesn't shine.

Read More

Setting up a personal VPN to access AWS instances

The goal of this post is to lay the groundwork for running a Spark driver against a Spark cluster in an AWS VPC, specifically by setting up a VPN to access VPC instances.

Read More

In the Spirit of Thanksgiving

We don't take ourselves too seriously, but we are two curious folks passionate about applying Machine Learning and Deep Learning to industry and sharing that knowledge with the broader engineering community.

Read More