Description
Apache Spark™ is a fast and general engine for large-scale data processing.
Home Page
http://spark.apache.org/
Documentation
http://spark.apache.org/docs/2.0.2/
License
Apache License 2.0
Usage
Use
module avail spark
to see which versions of Spark are available. Use
module load spark/version
to get access to Spark.
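For example (the version number below is only an illustration; use one of the versions listed by module avail):
$ module avail spark
$ module load spark/2.0.2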
Examples
Prerequisites
To ensure that Spark runs in a reasonably secure manner, you must enable authentication and set a secret that is known only to you. This is done in a Spark configuration file, by default $HOME/.spark/spark-defaults.conf (the path is stored in the environment variable $SPARK_CONFIG when you load the spark module).
E.g.:
$ cat $HOME/.spark/spark-defaults.conf
spark.authenticate true
spark.authenticate.secret 42
Tip: To use a stronger secret than the one above, generate spark-defaults.conf as follows:
$ mkdir -p $HOME/.spark
$ echo spark.authenticate true > $HOME/.spark/spark-defaults.conf
$ echo spark.authenticate.secret $(openssl rand -base64 32) >> $HOME/.spark/spark-defaults.conf
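Since spark-defaults.conf now contains a secret, it is sensible (an extra precaution, not something the spark module requires) to make the file readable only by you and to verify its contents:
$ chmod 600 $HOME/.spark/spark-defaults.conf
$ cat $HOME/.spark/spark-defaults.conf
spark.authenticate true
spark.authenticate.secret <long random base64 string>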
Initiate Spark
Example batch script to set up a Spark cluster on Colossus:
#!/bin/bash
# spark-init.batch - sbatch script to initialize Spark stand-alone cluster with SLURM
#SBATCH --account=staff
#SBATCH --nodes=3
# ntasks per node MUST be one, because multiple slaves per node don't
# work well with SLURM + Spark in this script (they would need increasing
# port numbers, among other things)
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=500
#SBATCH --time=02:00:00
module purge
module load spark/version
module list
# This section will be run when started by sbatch
if [ "$1" != 'srunning' ]; then
source /cluster/bin/jobsetup
source $SPARK_HOME/bin/sparksetup.sh
#get full path of this script
script=$(scontrol show job $SLURM_JOBID | grep Command | cut -d = -f 2)
srun bash $script 'srunning'
# If run by srun, then use start_spark_slurm.sh to start master and workers
else
source $SPARK_HOME/bin/start_spark_slurm.sh
fi
When Spark is up, a file named JOBID_spark_master will be created in the same directory as the output file (usually the directory from which you submitted spark-init.batch). This file contains the Spark URL of the master host, referred to below as $SPARK_MASTER.
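A complete start-up could look like the following; the job ID and node name shown here are purely illustrative:
$ sbatch spark-init.batch
Submitted batch job 1234567
$ cat 1234567_spark_master
spark://c1-23:7077
$ SPARK_MASTER=$(cat 1234567_spark_master)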
Interactive use
(only applicable on Colossus)
Simple example:
$ module load spark
$ run-example --master $SPARK_MASTER --properties-file $SPARK_CONFIG SparkPi 10
Python interface:
$ module load spark
$ module load python2
$ # if ipython is preferred:
$ # PYSPARK_DRIVER_PYTHON=ipython
$ pyspark --master $SPARK_MASTER --properties-file $SPARK_CONFIG
R interface:
$ module load spark
$ module load R
$ sparkR --master $SPARK_MASTER --properties-file $SPARK_CONFIG
Batch use
Example script to run Spark tasks on a running Spark cluster:
#!/bin/bash
# spark-connect.batch - sbatch script to connect to Spark stand-alone cluster started with SLURM
# takes as input spark URL to master node and (optionally) SLURM jobid of the spark cluster
#SBATCH --account=staff
#SBATCH --nodes=1
# ntasks per node MUST be one, because multiple slaves per node don't
# work well with SLURM + Spark in this script (they would need increasing
# port numbers, among other things)
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4G
#SBATCH --time=01:00:00
module purge
module load spark/version
if [ -z "$1" ]; then
echo "Usage: $0 SPARK_MASTER [SPARK_MASTER_JOBID]"
exit 1
fi
SPARK_MASTER=$1
# Run your Spark application against the cluster here. The line below is only
# an illustrative placeholder, mirroring the SparkPi example above:
run-example --master $SPARK_MASTER --properties-file $SPARK_CONFIG SparkPi 10
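Assuming the cluster was started as above and wrote its master URL to 1234567_spark_master (the job ID is again illustrative), the script can then be submitted like this:
$ sbatch spark-connect.batch $(cat 1234567_spark_master)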