Mindful Machines Original Series, Big Data: Batch Processing

Presto? Spark? Flink? Redshift? MapReduce? How do they and others compare for processing your batch data? Find out in this first part of the Mindful Machines series on Big Data.

Read More
/

Building a Data Science Team

Data Science teams can provide immense value to an organization if built or it can provide no value at all. Sometime the difference in success comes down to the simple fact that you didn’t actually need a Data Science team to begin with. Other times it comes down to how you hire, manage, grow and nurture the team. In this post we’ll cover all these topics and more.

Read More
/

Mindful Machines Original Series, Big Data: Batch Storage

S3? HDFS? Druid? Cassandra? MySQL? How do they and others compare for storing your batch data? Find out in this first part of the Mindful Machines series on Big Data.

Read More
/

Apache Spark: DataFrames and RDDs

Looking in more detail at how DataFrames compare to RDDs in terms of features and performance, and see if the claims about them are true.

Read More
/

Wikipedia Data in Apache Spark and Scala (Updated)

More than you possibly ever wanted to know about parsing various Wikipedia data sources in Spark and Scala.

Read More

The Flavors of Data Science and Engineering

Data Science means something different to everyone and is more of a marketing terms than a job description nowadays. That said, certain definition for it and the related disciplines are starting to emerge from what I've seen so I wanted to write down my perceptions.

Read More

ONC Patient Matching Challenge: Part 2

This is the second of a two part series on tackling the ONC Patient Matching Challenge In the first part we went over the background and high level approach while in this part we cover the matching engine that was built (and how to use it).

Read More

ONC Patient Matching Challenge: Part 1

This is the first of a two part series on tackling the ONC Patient Matching Challenge. The first part reviews background and high level topics while the second part covers the matching engine that was built (and how to use it). 

Read More

Peapod: A Scala and Spark Data Pipeline and Dependency Manager

Peapod is a new dependency and data pipeline management framework for Spark and Scala. The goals is to provide a framework that is simple to use, automatically saves/loads the output of tasks, and provides support for versioning.

Read More

Confessions Of a Scala Enthusiast

I feel terrible dragging LadyDi into another controversy. But this is not so much a poor reflection on her as it is me.  You see, I have spoken endlessly about the merits of Scala and Functional Programming. For a time, I was even the most viewed writer on Scala on Quora because I wrote so much on the topic!

Unfortunately I have come to realize I have been a lot of talk and very little action. Whoops.

Recently, I have come to take a good hard look at my code as well as libraries I have released into the open source community- and you know what? Most of it is not functional. In most places, I just use (abuse really) the functional properties somewhat enforced in Scala to make my OOP more stable. But that's really just about it. In reality I have written very little functional code, and no strictly functional program. As evident from my passionate remarks on the topic, I am obviously really excited about FP, and even understand how to go about it; but somehow, when push comes to shove, I remain stuck in my un-functional ways. 

Upon speaking to other fellow engineers at work and elsewhere, I have come to see that there are many strong, experienced, creative, innovative engineers who have trouble picking up Scala as a functional programming language. But I had trouble understanding why it was so, considering that many concepts in Functional Programming have very long, deep roots in the history of computer science was well as analytical philosophy. 

Let's take a look at one of the most critical concepts in Functional Programming: referential transparency. Its origins date back to Bertrand Russell's Principia Mathematica. Sure, at first glance you may dismiss this as just another old part of "mathematica", but William Van Orman Quine adopted this concept in his Word and Object published in 1960 where he coined the term "referential transparency".

RT was then thrusted into the spotlight of computer programming just 7 years later in Christopher Strachey's monumental piece Fundamental Concepts in Programming Languages (1967). 

And that's just RT. I have not and will not even begin to talk of lambda calculus  - doing so would derail this post completely. 

So let me reel this post back in. If FP has such footing in both Mathematics and Computer Science, why do most engineers have such a hard time adopting it in how they think about code?

Well, to start, all the above are mainly theoretical concepts. Software Engineering is by nature an applied field so most software engineers are actually employed in industry rather than academia. And so most folks usually don't go beyond a Bachelors or Masters degree (which is a lot given that many of the most dominant and influential figures in the Tech Industry are college dropouts). A core part of almost any undergraduate CS curriculum is a subject typically titled "Data Structures and Algorithms". It is quite the fixture actually. Many of the most prestigious employers in technology (Google, FB, Amazon, ... all the way to Slack, etc) outright inform candidates - regardless of seniority - to expect some questions directly from "Data Structures and Algorithms". In fact Google is notorious for turning renown experts in fields it was hiring for because they failed the DSA part of this rigid interview process.

So what does a typical DSA course consist of and how is this pertinent to our topic here? 

It's actually most made up of search and sorting algorithms, problem-solving strategies, and space and time complexity problems. Here is the root of our headache. It is not really the content of the course, but how fundamental the course has become to landing a job in industry and having a career. 

Let's take just two minor topics that feature prominently in any DSA course:

  • hash tables
  • the QuickSort algorithm (or MergeSort or any D&C algorithm really)

From a functional outlook, any change to a hash table would require creating a copy of the original hash table. This is obviously very inefficient and fundamentally goes against the underlying premise of a course like DSA - solve problems and efficiently so, even if you are going to trade-off reusability and safety. As far as sorting is considered, again we are faced with the matter of efficiency - as in place modifications are axiomatically impossible from a functional perspective. 

So from a very impressionable point in our time as software engineers we are repeatedly told to think about programs almost strictly in terms of efficiency gains and trade-offs; so much so that "good code" in some places has become synonymous with one that provides "efficient" solutions. We are not given the chance to think, and even meditate on other avenues that contribute to writing "good" code. Functional programming is almost never held in the same regard as imperative programming in most CS programs.

So yeah, it's hard to code functionally, and even more importantly, to think functionally when the dominant culture is one that forbids a multi-faceted and a rich discussion on what is important to solving problems.