Spark

Description

Apache Spark™ is a fast and general engine for large-scale data processing.

Home Page

http://spark.apache.org

Documentation

http://spark.apache.org/docs/2.0.2/

License

Apache License

Usage

Use

module avail spark

to see which versions of Spark are available. Use

module load spark/version

to get access to Spark.
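For example, assuming version 2.0.2 (the version the documentation link above refers to) is installed:

$ module load spark/2.0.2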

Examples

Prerequisites

To run Spark in a reasonably secure manner, you are required to enable authentication and set a secret known only to you. Set this in a Spark configuration file, by default $HOME/.spark/spark-defaults.conf (the path is stored in the environment variable $SPARK_CONFIG when you load the spark module).

E.g.:

$ cat $HOME/.spark/spark-defaults.conf
spark.authenticate true
spark.authenticate.secret 42

Tip: To create a stronger secret than the one above, generate spark-defaults.conf as follows:

$ mkdir -p $HOME/.spark
$ echo spark.authenticate true > $HOME/.spark/spark-defaults.conf
$ echo spark.authenticate.secret $(openssl rand -base64 32) >> $HOME/.spark/spark-defaults.conf
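Since the file contains your secret, it is good practice to make it readable only by you:

$ chmod 700 $HOME/.spark
$ chmod 600 $HOME/.spark/spark-defaults.conf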

Initiate Spark

Example batch script to set up a Spark cluster on Colossus:

#!/bin/bash
# spark-init.batch - sbatch script to initialize Spark stand-alone cluster with SLURM

#SBATCH --account=staff
#SBATCH --nodes=3
#  ntasks per node MUST be one, because multiple slaves per node don't
#  work well with slurm + spark in this script (each slave would need its
#  own set of ports, among other things)
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=500
#SBATCH --time=02:00:00

module purge
module load spark/version
module list

# This section will be run when started by sbatch
if [ "$1" != 'srunning' ]; then
    source /cluster/bin/jobsetup

    source $SPARK_HOME/bin/sparksetup.sh

    #get full path of this script
    script=$(scontrol show job $SLURM_JOBID | grep Command | cut -d = -f 2)
    srun bash $script 'srunning'

# If run by srun, then use start_spark_slurm.sh to start master and workers
else
    source $SPARK_HOME/bin/start_spark_slurm.sh
fi

When Spark is up, a file JOBID_spark_master will be created in the same directory as the output file (usually the directory from which you submitted spark-init.batch). It contains the Spark URL of the master host, referred to below as $SPARK_MASTER.
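For example, the whole sequence might look like this (the jobid and master URL below are hypothetical; yours will differ):

$ sbatch spark-init.batch
Submitted batch job 123456
$ cat 123456_spark_master
spark://c1-4:7077
$ export SPARK_MASTER=$(cat 123456_spark_master)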

Interactive use

(only applicable on Colossus)

Simple example:

$ module load spark
$ run-example --master $SPARK_MASTER --properties-file $SPARK_CONFIG SparkPi 10
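If all went well, the output should contain a line like "Pi is roughly 3.14..." (the exact value varies from run to run).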

Python interface:

$ module load spark
$ module load python2
$ # if ipython is preferred:
$ # export PYSPARK_DRIVER_PYTHON=ipython
$ pyspark --master $SPARK_MASTER --properties-file $SPARK_CONFIG
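Once the shell is up, a quick sanity check might look like this (sc is the SparkContext the shell creates for you):

>>> sc.parallelize(range(1000)).sum()
499500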

R interface:

$ module load spark
$ module load R
$ sparkR --master $SPARK_MASTER --properties-file $SPARK_CONFIG
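As a quick test, the standard SparkR quick-start snippet can be run here; it turns R's built-in faithful dataset into a distributed data frame:

> df <- as.DataFrame(faithful)
> head(df)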

Batch use

Example script to run Spark tasks on a running Spark cluster:

#!/bin/bash
# spark-connect.batch - sbatch script to connect to Spark stand-alone cluster started with SLURM
# takes as input the Spark URL of the master node and (optionally) the SLURM jobid of the Spark cluster

#SBATCH --account=staff
#SBATCH --nodes=1
#  ntasks per node MUST be one, because multiple slaves per node don't
#  work well with slurm + spark in this script (each slave would need its
#  own set of ports, among other things)
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4G
#SBATCH --time=01:00:00

module purge
module load spark/version

if [ -z "$1" ]; then
    echo "Usage: $0 SPARK_MASTER [SPARK_MASTER_JOBID]"
    exit 1
fi

SPARK_MASTER=$1

# Run a task on the cluster. The SparkPi example below is a placeholder;
# substitute your own run-example or spark-submit command.
run-example --master $SPARK_MASTER --properties-file $SPARK_CONFIG SparkPi 10
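For example, reusing the hypothetical jobid from the sequence above:

$ sbatch spark-connect.batch $(cat 123456_spark_master) 123456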