Set Up a Jupyter Notebook with Spark from Scratch on Google Cloud Platform (DIY in 10 Min) from Your Linux Machine
Hello there guys,
I had my share of trouble when I was starting out too. I found the best solutions by putting together a bunch of online resources, so I figured why not document the whole thing so that no one else runs into the same issues, and if you do, let this one post solve all your problems at once. I will cut to the chase now.
So, given that you have a running GCP instance and are all set to dive into the big data you have been sitting on, this is the guide to get you started with a Spark-enabled Jupyter Notebook. There are a bunch of obvious reasons why we need Spark, of course. I will let you figure those out yourselves.
I am running Ubuntu on my GCP instance and Windows 10 on my local machine. But the whole process of connecting to the cloud instance over SSH becomes a lot easier if you are running a local Linux system. (You can also SSH directly from the Cloud Console page.) We will need an SSH key pair to connect to GCP from bash, and bash can take care of this without an additional installation like PuTTY on Windows.
ssh-keygen -t rsa -f ~/.ssh/<FILE NAME> -C <USER NAME>
The line above generates a key pair in the hidden folder “.ssh”, which sits in your home directory: a private key named <FILE NAME> and a public key named <FILE NAME>.pub. The username can be basically anything you like, since we will be adding it to the Compute Engine instance later on. You can open the .pub file and copy its contents by using:
cd ~/.ssh
cat <filename>.pub
Copy the whole thing and paste it in the SSH Keys section of the instance details page. (You can get there by clicking on the instance and then clicking Edit in the top panel.)
Now, most importantly, we need to create a firewall rule that allows traffic to the instance through a set port. The easiest way to do this is to search for “Firewall Rules” in the top search bar of your instance page.
On the next page, click on “Create Firewall Rule” and set it up to allow incoming TCP traffic on the port you plan to use for the notebook.
Once we allow traffic through this port, we can go ahead and set up Spark and Jupyter on our instance and allow this notebook to be hosted on our cloud machine through this port. We need to shift gears at this point, but I will try to keep it really simple.
Fire up the GCP instance by clicking Start on the instance page. Note down the external IP address and the username you put in the key file. Open up your terminal and type in:
ssh -i ~/.ssh/<private key filename> username@ipaddress
Note that -i takes the private key file (the one without the .pub extension). Type “yes” when prompted and you are in! Once you are inside the instance we have a few things to set up. Run all of these in the same window.
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install python3-pip python-pip
sudo pip install jupyter
sudo pip3 install jupyter
sudo apt install jupyter-core jupyter-notebook
sudo apt-get install default-jre
sudo apt-get install scala
sudo pip install py4j
sudo pip3 install py4j
Note that I am installing Jupyter for both Python 2 and Python 3. For some reason my version of Jupyter Notebook didn't support Python 3 (the kernel kept dying on me 😅!), so I am just being safe here. These commands, all put together, will download all the packages we need except, *drum roll please*, Spark! To download Spark, use wget. (Go to the Apache Spark downloads page to figure out the version and copy the download link.)
cd ~  # preferably download to your home directory
wget <download link>
Once you finish downloading (~180 MB), decompress it and give permissions to the main folder and the folder named “python” inside it.
tar -xvzf <spark file name>.tgz
sudo chmod 777 <spark folder name>
cd <spark folder name>
sudo chmod 777 python
Note down the location of this folder, as this is where our Spark library resides. If you followed my instructions word for word 😛, this location will be “/home/<username>/<spark folder name>”. Normally, whenever we want to call this library from within Python, we would need to start our Python session from inside this location. But I am here with a workaround so that you lazy analysts do not have to worry about a thing. There’s a Python package called “findspark” which we can import inside the notebook to point it to the location of Spark!! Pretty cool, isn’t it?
sudo pip install findspark
sudo pip3 install findspark
Okay time for business. We need to configure our notebook one last time before we fire it up. So go ahead and do all these things:
jupyter notebook --generate-config
cd ~/.jupyter/
vi jupyter_notebook_config.py
This will open the file in the vi editor, where you will see a fancy bunch of lines, basically all commented out. Press “i” on your keyboard to enter insert mode and add the few settings the notebook needs so that it can be reached from outside through the port we opened earlier.
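As a minimal sketch, assuming the classic Notebook server, the added lines could look something like this. The port 8888 is just a placeholder I picked for illustration; use whichever port you opened in your firewall rule.

```python
# ~/.jupyter/jupyter_notebook_config.py (sketch for the classic Notebook server)
c = get_config()  # provided by Jupyter when it loads this file

# Listen on all network interfaces so the server is reachable from outside the VM
c.NotebookApp.ip = '0.0.0.0'
# There is no desktop browser on the instance, so do not try to open one
c.NotebookApp.open_browser = False
# ASSUMPTION: 8888 is a placeholder; it must match the port in your firewall rule
c.NotebookApp.port = 8888
```

When the server starts it prints a URL containing a login token in the terminal; you will need that token the first time you open the notebook in your browser.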
When done editing press “Esc” and type “:wq!” to “Write changes and Quit”, and hit “Enter”. We are all set to launch the notebook now. You know what to do:
jupyter notebook
This will start a notebook server on your GCP instance. To access the notebook, open up a browser on your local system and go to:
http://<external ip>:port_number
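If the page does not load, it helps to check whether the port is reachable at all before blaming Jupyter. Here is a small sketch you can run with Python on your local machine. The helper name port_is_open is my own and the values in the example call are placeholders, so fill in your own IP and port.

```python
import socket

def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts
        return False

# Example (placeholders, replace with your own values):
# port_is_open("<external ip>", 8888)
```

If this returns False while the notebook server is running, the firewall rule or the port in the notebook config is the first place to look.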
So there you go! You have your notebook up and running. Keep in mind that when you click New, the drop-down will show you both Python 2 and Python 3, because we installed Jupyter for both versions of Python. Choose the one that suits you, then use findspark to point to the Spark folder and import the Spark library into your notebook.
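A minimal sketch of what that first notebook cell could look like is below. The helper names spark_home and start_spark are my own, and the path assumes you unpacked Spark in your home directory as described earlier, so adjust it if you put Spark somewhere else.

```python
def spark_home(username, spark_folder):
    """Build the Spark location used in this guide: /home/<username>/<spark folder name>."""
    return f"/home/{username}/{spark_folder}"

def start_spark(home):
    """Point findspark at the Spark folder, then create (or reuse) a SparkContext."""
    import findspark  # imported here so spark_home() works even without Spark installed
    findspark.init(home)  # adds Spark's Python libraries to sys.path
    from pyspark import SparkContext  # importable only after findspark.init()
    return SparkContext.getOrCreate()

# In the notebook (placeholder values, replace with your own):
# sc = start_spark(spark_home("<username>", "<spark folder name>"))
# sc.parallelize(range(100)).sum()
```

Once start_spark returns without errors, the sc object is your entry point to Spark for the rest of the notebook.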
I think that’s it for now. Hope you were able to figure everything out too! For added security we can further encrypt our notebook. I think I should save it for the next one.
Now go do some super fast calculations powered by Google.
Peace! Till next one.