Apache Spark: DataFrames and RDDs

A more detailed look at how DataFrames compare to RDDs in features and performance, and at whether the claims made about them hold up.

Read More

Wikipedia Data in Apache Spark and Scala (Updated)

More than you possibly ever wanted to know about parsing various Wikipedia data sources in Spark and Scala.

Read More

The Flavors of Data Science and Engineering

Data Science means something different to everyone and is more of a marketing term than a job description nowadays. That said, from what I've seen, certain definitions for it and the related disciplines are starting to emerge, so I wanted to write down my perceptions.

Read More

Peapod: A Scala and Spark Data Pipeline and Dependency Manager

Peapod is a new dependency and data pipeline management framework for Spark and Scala. The goal is to provide a framework that is simple to use, automatically saves and loads the output of tasks, and supports versioning.

Read More

Data Pipeline and Task Management: The Unsolvable Problem?

There are probably more well-known data pipeline dependency management and scheduling frameworks than you can name in one breath. Is there a reason for that beyond mere not-invented-here syndrome?

Read More

Wikipedia Data in Spark and Scala

More than you possibly ever wanted to know about parsing various Wikipedia data sources in Spark and Scala.

Read More

In the Spirit of Thanksgiving

We don't take ourselves seriously but are two curious folks passionate about applying Machine Learning and Deep Learning to industry and sharing that knowledge with the broader engineering community.

Read More