Apache Spark: DataFrames and RDDs

A more detailed look at how DataFrames compare to RDDs in features and performance, and at whether the claims made about them hold up.

Wikipedia Data in Apache Spark and Scala (Updated)

More than you possibly ever wanted to know about parsing various Wikipedia data sources in Spark and Scala.

The Flavors of Data Science and Engineering

Data Science means something different to everyone and is more of a marketing term than a job description nowadays. That said, from what I've seen, certain definitions for it and the related disciplines are starting to emerge, so I wanted to write down my perceptions.

Spark, Turn Off the Light on the CLI

We know that Hadoop can be 235 times slower than the command line for smaller data sets. Does the same apply to Spark?

Setting Up Spark Notebooks on EC2 (No VPN)

Getting up and running with Spark Notebooks (Part 1)

Spark EC2 Setup and Workflow

So how do we run and deploy code using Scala and Spark on EC2?

Setting up a personal VPN to access AWS instances

The goal of this post is to lay the groundwork for running a Spark driver against a Spark cluster in an AWS VPC, specifically by setting up a VPN to access VPC instances.
