Monday, September 22, 2014

Learn Spark with Python

1. Install Spark
cd ~/tools/
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0.tgz
tar -zxvf spark-1.1.0.tgz

2. Build Spark for Hadoop 2 (the environment variables select the Hadoop client version and enable YARN support)
cd ~/tools/spark-1.1.0
SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly


3. Install py4j
sudo pip install py4j
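
As a quick sanity check (this only confirms pip put the package somewhere importable; the path printed will differ from system to system), you can verify the install from a Python shell:

import py4j
print(py4j.__file__)   # location pip installed the py4j package to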


4. Modify ~/.bash_profile by adding these two lines:
export SPARK_HOME=$HOME/tools/spark-1.1.0
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH

5. Source ~/.bash_profile
source ~/.bash_profile
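
Before moving on, it can help to confirm the new environment is actually visible to Python. A minimal check (assuming you start the Python shell from the same terminal session that sourced ~/.bash_profile):

import os, sys
print(os.environ.get('SPARK_HOME'))                  # should print $HOME/tools/spark-1.1.0
print(any('spark' in p.lower() for p in sys.path))   # True if the PYTHONPATH entry took effect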

6. Test. Start a Python shell and type:
import pyspark
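
If the import succeeds, you can go one step further and run a tiny job to confirm the assembly built in step 2 is picked up. A minimal sketch (the "local[2]" master and the app name are arbitrary choices for a local smoke test):

from pyspark import SparkContext

sc = SparkContext("local[2]", "SmokeTest")   # start a local Spark context with 2 worker threads
rdd = sc.parallelize(range(100))             # distribute a small list as an RDD
print(rdd.count())                           # 100
print(rdd.sum())                             # 4950
sc.stop()                                    # shut the context down cleanly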