Setting Up Spark Notebooks on EC2 (No VPN)

Some people like to work with new technologies because, well, they're new and by extension pose a particular kind of challenge. I am definitely one of those people, but beyond just trying new things out and learning along the way, I like to make choices with purpose.

Besides just being cool or the new thing, getting up and running with Spark Notebook solves one particular problem: bridging the gap between rapid prototyping and building large-scale systems. This problem presents itself most strongly in a startup environment, where you are constantly making trade-offs between getting a product out the door and limiting your technical debt. At first we used IPython notebooks for rapid prototyping and switched to Scala once a project was ready to go to production - because seriously, Python for production? Whatevs. Just kidding. Please don't burn my house down - I know there are lots of Python lovers out there and you have my respect.

But we wanted to take advantage of Scala's many virtues without switching between programming languages and adding complexity to our technical stack. 

The basic premise is simple: 

The ability to easily write code in IntelliJ using Scala and then deploy it to Spark Notebook. This empowers developers to seamlessly build end-to-end learning systems while still doing rapid prototyping, and it gives them the opportunity to test parts of their larger system in isolation.

Since we are not using a VPN here, the first step is to set up a bastion server on AWS and whitelist your local IP in its security group.
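If you prefer the command line and have the AWS CLI installed, that whitelisting step might look like the following sketch - the security group ID and IP below are placeholders for your own values:

Zoes-MacBook:~ zafshar$ aws ec2 authorize-security-group-ingress --group-id <bastion-security-group-id> --protocol tcp --port 22 --cidr <your-public-ip>/32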

After this preliminary setup, the next step is to create a server from which we can:

  • connect to Spark Notebooks
  • connect to other servers

To achieve this, first go to AWS services and choose EC2. Under the "Instances" section, click "Instances", then click "Launch Instance" at the top-left side of the page:

A wizard will then appear where you are first asked to "Choose an Amazon Machine Image (AMI)". We picked "Ubuntu Server 14.04 LTS (HVM), EBS general purpose (SSD) Volume type." The next step in the wizard asks you to "Choose an Instance Type". We picked a "General Purpose" t2.medium - yes, this is small, but it makes sense as it is intended to serve as the "driver" machine. In step 3, "Configure Instance Details", we make minimal changes, selecting our designated VPC and subnet.

In step 4, "Add Storage", we only changed the storage size from 8 GB to 20 GB. In step 5, "Tag Instance", we named the instance we just created "spark-notebook-driver" to indicate that it is intended to serve as the driver machine. In step 6, "Configure Security Group", we selected an existing security group, "spark-clusters". We then hit "Review and Launch" and in step 7 launched the instance!
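If you'd rather script the launch, here is a rough AWS CLI equivalent of the wizard - every ID below is a placeholder/assumption you would swap for your own VPC resources:

Zoes-MacBook:~ zafshar$ aws ec2 run-instances --image-id <ubuntu-14.04-ami-id> --instance-type t2.medium --key-name myKey --subnet-id <your-subnet-id> --security-group-ids <spark-clusters-sg-id> --block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=20}'
Zoes-MacBook:~ zafshar$ aws ec2 create-tags --resources <new-instance-id> --tags Key=Name,Value=spark-notebook-driver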

Now, while we wait for our "spark-notebook-driver" to finish initializing, we navigate to spark-notebook.io to generate and download a spark notebook tailored to our needs.

The specs we used in this case are as follows:

  • Notebook Version - 0.6.1
  • Scala Version - 2.10
  • Spark Version - 1.4.1 (Note: this has to match the Spark version you are using in the rest of your infrastructure, otherwise you will run into some really weird versioning issues.)
  • Hadoop Version - 2.0.0-cdh4.2.0
  • Added Parquet support - this is good for dealing with hierarchical data
  • Packaging - tar.gz - you may wonder why we didn't choose the Debian package, since the instance we just created runs Ubuntu. Good question. The answer is simple: we tried the Debian package, it didn't work, so we tried tar.gz and it did.
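As a heads-up, the generated package encodes these choices in its filename, so expect the archive you eventually download to be named something along the lines of the following (an assumption based on the options above - your exact name may differ):

spark-notebook-0.6.1-scala-2.10.4-spark-1.4.1-hadoop-2.0.0-cdh4.2.0-with-parquet.tgz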

It is now fair to ask "What is a Spark-Notebook generator?"

Well, it works much like Maven "building" artifacts when you hit "install" on a Maven project in IntelliJ: just as the Maven "install" builds a JAR file (assuming the language is Java or Scala), the spark-notebook generator creates a custom build and makes it available to us in the selected packaging (in this case tar.gz), which we then download onto "spark-notebook-driver".

So... start generating!

Great. So far we have established what a spark-notebook generator is and we have also launched our "spark-notebook-driver". 

By now, a download link for our custom spark-notebook build package should be in our inbox. The goal is to:

  1. Download the package from spark-notebook.io onto the "spark-notebook-driver" instance we launched
  2. Install the package onto the same instance

Keep in mind we don't have a VPN set up yet, so to achieve the above we open up the terminal, ssh into our bastion server (referred to on my machine as "thezoebastion"), and tunnel to "spark-notebook-driver" from there:

Zoes-MacBook:~ zafshar$ cd .ssh/
Zoes-MacBook:.ssh zafshar$ ssh-add myKey.pem
Zoes-MacBook:.ssh zafshar$ ssh -A -L 9000:<private-IP-of-driver>:9000 thezoebastion

Here the -L flag forwards local port 9000 through the bastion to port 9000 (spark-notebook's default) on the driver - we will look up the driver's private IP in a moment - and -A forwards our ssh agent so we can hop from the bastion to other machines without copying keys around.
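As an aside, "thezoebastion" is just an alias for the bastion on my machine; one way to define such an alias is an entry in ~/.ssh/config, sketched below (the HostName and User values are assumptions - substitute your bastion's public DNS and login user):

Host thezoebastion
    HostName <public-DNS-of-bastion-server>
    User ubuntu
    IdentityFile ~/.ssh/myKey.pem
    ForwardAgent yes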

Great. Now that we are on "thezoebastion", we must ssh into "spark-notebook-driver", where we will download/install our custom spark-notebook build package. As the general framework for tunneling dictates, we need the address of the remote host we want to connect to. In this case, the remote host is the "Private IP" of "spark-notebook-driver". To find it, we go back to the AWS console in the browser, click on EC2 Instances and look for "spark-notebook-driver". In the instance description we copy/paste the "Private IP" (which happens to be 10.0.2.167) into the "thezoebastion" terminal:

ubuntu@ip-10-3-0-220:~$ ssh 10.0.2.167

WOOHOO! WE ARE NOW ON THE "spark-notebook-driver" terminal!
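By the way, if you have the AWS CLI configured locally, you can skip hunting for the "Private IP" in the console - a sketch, assuming the instance is tagged "spark-notebook-driver" as in step 5:

Zoes-MacBook:~ zafshar$ aws ec2 describe-instances --filters "Name=tag:Name,Values=spark-notebook-driver" --query "Reservations[].Instances[].PrivateIpAddress" --output text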

Here, on the "spark-notebook-driver" machine, we must first install Java (yes, before downloading/installing our custom spark-notebook build package). To ensure a proper Java install, we begin by updating apt's package index by running the following command:

ubuntu@ip-10-0-2-167:~$ sudo apt-get update

... And now we can install Java hassle-free!

ubuntu@ip-10-0-2-167:~$ sudo apt-get install default-jre

It's a good idea to now check the Java version to make sure it matches the Java version we use in the rest of our infrastructure:

ubuntu@ip-10-0-2-167:~$ java -version

Awesome, we have installed OpenJDK (1.7) and can finally start downloading and installing the custom Spark Notebook build emailed to us! We start by copying the link address in the email and pasting it into the "spark-notebook-driver" terminal:

ubuntu@ip-10-0-2-167:~$ wget https://url/to/buildpackage

WOOT! We have downloaded our build package! We are now ready to install it!

ubuntu@ip-10-0-2-167:~$ tar -xvf filenameOfBuildPackage.tgz

Yay! Now let's run this baby!

ubuntu@ip-10-0-2-167:~$ cd filenameOfBuildPackage/
ubuntu@ip-10-0-2-167:~/filenameOfBuildPackage$ ./bin/spark-notebook
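One caveat: running it in the foreground like this ties the notebook to your ssh session. If you want it to survive a dropped connection, a common trick (an assumption on my part, not part of the original setup) is to launch it with nohup instead:

ubuntu@ip-10-0-2-167:~/filenameOfBuildPackage$ nohup ./bin/spark-notebook > notebook.log 2>&1 &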

WIN!!! Now open up your browser to "localhost:9000" and enjoy!
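Why does localhost work here even though the notebook is running on a remote machine? Because of the -L forward we set up at the start; with the driver's private IP now in hand, that tunnel command fully filled in looks like this:

Zoes-MacBook:.ssh zafshar$ ssh -A -L 9000:10.0.2.167:9000 thezoebastion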

In the follow-ups to this post we will discuss running spark-notebooks using a VPN, as well as how to integrate them with Maven and IntelliJ. Stay tuned!