Spark EC2 Setup and Workflow

We wanted to talk a bit about the current workflow and setup that we use for developing Spark code and jobs on EC2.

This may seem like a lot of components; however, they're all fairly static and decoupled. We like workflows that aren't fragile and don't have many dependencies that can't be swapped out. We also tend to define simplicity as simplicity in running rather than simplicity in setup, so the setup is a bit more involved than it would otherwise be.

There will be future blog posts going into the details of the various components, similar to the post on deploying a VPN server.

There are different ways to use this infrastructure; one involves running a driver directly on your local machine, either as a Spark Notebook or in your IDE of choice.

Overview

The components in a nutshell are:

  • VPN Server: Running on an EC2 instance. This allows easy access to VPC instances and makes it possible to run the Spark driver locally.
  • Forwarding DNS: This allows Route 53 entries with internal IPs to be resolved even from outside the VPC. The result is that a Spark Notebook or IDE on your personal machine can easily act as the Spark driver.
  • Maven: Dependencies and code deployment are managed using Maven. SBT was considered, but we're more familiar with Maven, and it is also what the Spark project uses internally.
  • Nexus Repository: Running on an EC2 instance and set up using Docker. The Docker deploy seemed the easiest way to run Nexus.
  • Spark Notebook: Running on an AWS instance and set up with the Nexus repository as a remote repository. Optionally, the Spark Notebook can be run locally on the user's machine and connected to a remote cluster for extra debugging and control. A custom version is built from the master branch to support Spark 1.5.2:
    • sbt -D"spark.version"="1.5.2" -D"hadoop.version"="2.4.0" -D"jets3t.version"="0.9.4" -Dmesos.version="0.24.0" -Dwith.hive=true -Dwith.parquet=true run
  • IntelliJ IDEA: Used for writing core code; the good type completion really helps in building code quickly. It can optionally run the Spark driver directly against a remote cluster for improved debugging at scale (see the sketch after this list).
  • Route53: This is used to manage DNS references to the various machines in the cluster. There is a persistent and automatically updated DNS reference to the master node of the cluster based on its name (ie: tm-spark.spark for the cluster named tm-spark). The VPN+DNS setup allows the user's local machine to take advantage of the private Route53 DNS entries.
  • CLI/Spark EC2 Script: We're using a custom version (under ec2 in the directory tree) right now that has a few improvements:
    • Leverages Route53 and the VPN+DNS setup to create easy-to-use host names (ie: spark-test.spark rather than ec2-52-26-35-108.us-west-2.compute.amazonaws.com). These hostnames are persistent, so if you create a new cluster called "spark-test" the DNS name will still be spark-test.spark even if it points to a new master node IP.
    • Allows for spot master nodes
    • Allows for a Spark standalone 2.4 deploy (rather than a YARN one). This follows my belief in incremental improvements, to avoid getting mired in debugging changes if you switch out too many components from a previous workflow.
  • Spark Cluster: The remote Spark cluster running on EC2. Thanks to the modified Spark EC2 script and Route 53, there is a persistent and automatically updated DNS reference to the master node of the cluster based on its name (ie: tm-spark.spark for the cluster named tm-spark). This reference points to the master node's local IP, and the VPN+DNS setup allows the user's local machine to resolve the Route 53 DNS entries.
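
To make the local-driver option concrete, here is a minimal sketch of pointing a driver running on your desktop at the remote standalone master via its Route 53 name. This is our own illustration against Spark 1.5.2's Scala API; the app name and the VPN address below are placeholders.

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: a local driver (IDE or a locally running Spark Notebook)
    // talking to the remote standalone cluster over the VPN, using the
    // persistent Route 53 name instead of an EC2 public hostname.
    object LocalDriverSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("local-driver-sketch")
          // 7077 is the standalone master's default port.
          .setMaster("spark://tm-spark.spark:7077")
          // Executors need to reach back to the driver; over the VPN that means
          // advertising the VPN-assigned address (placeholder value).
          .set("spark.driver.host", "10.0.0.1")

        val sc = new SparkContext(conf)
        // Trivial job to confirm the executors can talk back to the local driver.
        println(sc.parallelize(1 to 1000).count())
        sc.stop()
      }
    }

This only works because the VPN plus forwarding DNS make the private Route 53 entries resolvable from the desktop; without them the master URL above would not resolve.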

Setup workflow

This is run once before developing on Spark, such as in the morning:

  • VPN into the VPC
  • Create a Spark cluster:
    • python spark_ec2.py -s 4 -k tm-spark -i "<>" -m c3.large -t c3.8xlarge --region=us-west-2 --zone=us-west-2b --spark-version=1.5.2 --hadoop-major-version=yarn --vpc-id=<> --subnet-id=<> --authorized-address="10.0.0.1/32" --additional-security-group="<>" --spot-price="0.4" --route53-domain="spark" --route53-record-set="<>" --spot-master --no-ganglia --use-standalone launch tm-spark
  • Connect to the Spark Notebook
    • http://driver.spark:9000/
  • Open up the Spark cluster UI for monitoring
    • http://tm-spark.spark:8080/
  • Open up the Spark driver UI for monitoring
    • http://driver.spark:4040/
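
Once everything is up, a quick sanity check from a notebook cell (a generic example, not one of our jobs) confirms that work actually lands on the EC2 workers and shows up in both UIs:

    // Run in a Spark Notebook cell; `sc` is the notebook's pre-defined SparkContext.
    // The running job should be visible at http://driver.spark:4040/ and the
    // workers at http://tm-spark.spark:8080/.
    val counted = sc.parallelize(1 to 1000000, 16).count()
    println(s"counted $counted elements across the cluster")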

Code workflow

  • Write some core code in IntelliJ on my desktop
  • Compile and deploy to the Nexus repo using Maven
    • mvn deploy
  • Load the latest deployed code in the Spark Notebook. Since we're using persistent and automatically updated DNS references for clusters (ie: tm-spark.spark), the notebook metadata doesn't need to be updated to refer to the new cluster.
    • :remote-repo tmrepo % default % http://repo.spark:8081/content/repositories/snapshots/ % maven
    • :dp io.teachingmachines % wiki-playground % 1.0-SNAPSHOT
  • Run the Spark code in the Spark Notebook (a minimal example follows this list)
  • Rinse and repeat
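
As a concrete, hypothetical example of this round trip: the core code written in IntelliJ is kept in plain objects that take a SparkContext, so the same class can be called from the notebook once the :dp line has loaded the snapshot, as well as from tests or a standalone driver. The package and object names below are made up; only the group and artifact ids match the :dp line above.

    package io.teachingmachines.wikiplayground

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Hypothetical core code living in the wiki-playground artifact. Keeping the
    // logic in a plain object that accepts a SparkContext keeps it independent of
    // where the driver runs (notebook, IDE, or spark-submit).
    object LineCounts {
      // Count non-empty lines per first character; the input path is supplied by
      // the caller (e.g. the notebook).
      def run(sc: SparkContext, path: String): RDD[(Char, Long)] =
        sc.textFile(path)
          .filter(_.nonEmpty)
          .map(line => (line.head, 1L))
          .reduceByKey(_ + _)
    }

In the notebook this becomes a one-liner along the lines of LineCounts.run(sc, somePath).take(10), where somePath is whatever data set is being played with.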

Teardown workflow

  • Terminate the Spark cluster:
    • python spark_ec2.py --region=us-west-2 --zone=us-west-2b --vpc-id=<> --subnet-id=<> --route53-domain="spark" --route53-record-set="<>" destroy tm-spark

Future improvements and experiments

There are quite a few more things we want to test, play with, and write about over time. These include:

  • Zeppelin Notebook: This is a potential replacement for Spark Notebook.
  • Docker for DNS and Spark Notebook: This will make management of the infrastructure easier and consolidate instances so it costs us less to run things.
  • Elastic MapReduce Clusters: The Spark EC2 script is hitting its limits and isn't really being maintained much by the Spark community. We used it because we're familiar with it; however, we definitely want to test other options.
  • YARN: The latest and greatest, or so people say
  • Hadoop 2.7: The latest and greatest, comes with S3A, on which point...
  • S3A: An improvement over the existing S3N file system for accessing S3 data in Spark/Hadoop; however, the setup seems to be non-trivial (see the sketch after this list).
  • Monitoring: The CloudWatch metrics from AWS are very limited, so we want something like Graphite to better monitor the Spark clusters and other instances we're running.
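
As an example of the S3A point above, the experiment will probably start with something like the following. This is a rough, untested sketch: the property names come from Hadoop 2.7's S3A documentation, the bucket and prefix are placeholders, and the credentials are read from the standard AWS environment variables (IAM roles would be preferable anyway).

    // Rough sketch (untested): pointing a job at S3A instead of S3N from the
    // driver, here using the notebook's pre-defined `sc`.
    sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
    sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

    // Placeholder bucket/prefix; s3a:// paths also require the hadoop-aws jar and
    // a matching AWS SDK jar on both the driver and executor classpaths.
    val lines = sc.textFile("s3a://some-bucket/some-prefix/*")
    println(lines.count())

Much of the expected non-trivial setup is that classpath wrangling, which is part of what we want to write up once we've tried it.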