Mindful Machines Original Series, Big Data: Batch Storage

S3? HDFS? Druid? Cassandra? MySQL? How do they and others compare for storing your batch data? Find out in this first part of the Mindful Machines series on Big Data.

Read More
/

Apache Spark: DataFrames and RDDs

Looking in more detail at how DataFrames compare to RDDs in terms of features and performance, and see if the claims about them are true.

Read More
/

Wikipedia Data in Apache Spark and Scala (Updated)

More than you possibly ever wanted to know about parsing various Wikipedia data sources in Spark and Scala.

Read More

The Flavors of Data Science and Engineering

Data Science means something different to everyone and is more of a marketing terms than a job description nowadays. That said, certain definition for it and the related disciplines are starting to emerge from what I've seen so I wanted to write down my perceptions.

Read More

Peapod: A Scala and Spark Data Pipeline and Dependency Manager

Peapod is a new dependency and data pipeline management framework for Spark and Scala. The goals is to provide a framework that is simple to use, automatically saves/loads the output of tasks, and provides support for versioning.

Read More

Spark, Turn Off the Light on the CLI

We know that Hadoop can be 235 times slower than the command line for smaller data sets, does the same apply for Spark?

Read More
/

Scala: A Literary Review

Originally, I wanted to write a post about reviewing Apache Spark's Hyper-Parameter Optimization, the many ways in which it could be improved, and unveil a library I have been putting together which integrates seamlessly with Spark's existing ML library, etc. 

But over the holiday break, I was approached by many people around the web with questions like "Why is Scala so Popular?" or "Haskell vs Scala, which would you choose?" or "Should I use Scala or Python when using Spark?". So I decided to first write a post about this language I love so much. 

My appreciation goes beyond the technical norms in which this subject is usually covered - it transcends code and appeals to the overarching topic of human communication; modalities of thought and expression at large. Pondering on this, the image that appears in my head is one that has been etched onto my soul; that of Simone Weil.

Simone Weil could pack into a short sentence what many writers would need an entire volume to express, always supposing they had insights of such depth to express in the first place.

I don't think there is a medium that best encourages engineers to do just that in their domain other than Scala.  Writing short and expressive code (Ruby, Python?) to produce type-safe and high-performance apps (Java, C++ ?) is a duality that only merges most harmoniously when using Scala. The level of expressive syntax rarely seen in other compiled languages; code is strongly typed and supports both multiple inheritance and mixin features.

So one ends up with short, expressive syntax, cutting unnecessary punctuation, and condensing map, filter, reduce  and other higher order operations to simple one-liners.

On the outset, Scala really encourages switching from mutable data structures to immutable ones and from regular methods to pure functions. This has some immediate impacts starting with a reduction in issues rooted in unintentional side effects common in large code bases. So on the outset your code will be safer, more stable, and easier to understand.  Coding like so will imprint on the writer new ways t o view the concepts of data mutability, higher order functions,  sophisticated type systems, etc. 

And quite frankly, concepts like abstract algebra and type theory which would otherwise decompose along with the academic papers they were written on, are brought to life with Scala's sophisticated type system and its  option for custom typing declarations.

There are some downsides however, as there is no singularly adhered-to guide for best practices in Scala. Python, for example, aims to implement a single best practice - even has standard: PEP8. Scala has several different guides on slides - about which people continue to debate endlessly. This is partly achieved by the fact that while Scala encourages functional practices it does not enforce them as religiously as a language like, say, Haskell or Erlang would. But I tend to view such "deficiency" as an upside: to decipher Scala code is to first decipher the writer’s style, ideas, thought patterns, and persona. 

Wikipedia Data in Spark and Scala

More than you possibly ever wanted to know about parsing various Wikipedia data sources in Spark and Scala.

Read More

Setting Up Spark Notebooks on EC2 (No VPN)

Getting up and running with Spark Notebooks (Part 1)

Read More

Spark EC2 Setup and Workflow

So how do we run and deploy code using Scala and Spark in EC2?

Read More