Apache Spark: Scala vs. Java vs. Python vs. R vs. SQL

One of Apache Spark’s selling points is its cross-language API, which lets you write Spark code in Scala, Java, Python, R or SQL (with others supported unofficially). However, not all language APIs are created equal, and in this post we'll look at the differences in both syntax and performance.

Mindful Machines Original Series, Big Data: Batch Processing

Presto? Spark? Flink? Redshift? MapReduce? How do they and others compare for processing your batch data? Find out in this first part of the Mindful Machines series on Big Data.

Building a Data Science Team

A Data Science team can provide immense value to an organization if built well, or it can provide no value at all. Sometimes the difference comes down to the simple fact that you didn’t actually need a Data Science team to begin with. Other times it comes down to how you hire, manage, grow and nurture the team. In this post we’ll cover all these topics and more.

Mindful Machines Original Series, Big Data: Batch Storage

S3? HDFS? Druid? Cassandra? MySQL? How do they and others compare for storing your batch data? Find out in this first part of the Mindful Machines series on Big Data.

Apache Spark: DataFrames and RDDs

A more detailed look at how DataFrames compare to RDDs in terms of features and performance, and at whether the claims made about them are true.

Wikipedia Data in Apache Spark and Scala (Updated)

More than you possibly ever wanted to know about parsing various Wikipedia data sources in Spark and Scala.

The Flavors of Data Science and Engineering

Data Science means something different to everyone and is more of a marketing term than a job description nowadays. That said, from what I've seen, certain definitions for it and the related disciplines are starting to emerge, so I wanted to write down my perceptions.

ONC Patient Matching Challenge: Part 2

This is the second of a two-part series on tackling the ONC Patient Matching Challenge. In the first part we went over the background and high-level approach, while in this part we cover the matching engine that was built (and how to use it).

ONC Patient Matching Challenge: Part 1

This is the first of a two-part series on tackling the ONC Patient Matching Challenge. The first part reviews the background and high-level topics, while the second part covers the matching engine that was built (and how to use it).

Peapod: A Scala and Spark Data Pipeline and Dependency Manager

Peapod is a new dependency and data pipeline management framework for Spark and Scala. The goal is to provide a framework that is simple to use, automatically saves/loads the output of tasks, and provides support for versioning.
