Spark

Description

Apache Spark™ is a fast and general engine for large-scale data processing.

Home Page

http://spark.apache.org

Documentation

http://spark.apache.org/docs/2.0.2/

License

Apache License

Usage

Use

module avail spark

to see which versions of Spark are available. Use

module load spark/version

to get access to Spark.
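For example, assuming version 2.0.2 (the version the documentation link above refers to) is installed:

$ module load spark/2.0.2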

Examples

Prerequisites

To run Spark in a reasonably secure manner, you are required to enable authentication and set a secret known only to you. Set this in a Spark configuration file, by default $HOME/.spark/spark-defaults.conf (the path is stored in the environment variable $SPARK_CONFIG when you load the spark module).

E.g.:

$ cat $HOME/.spark/spark-defaults.conf
spark.authenticate true
spark.authenticate.secret 42

Tip: To create a stronger secret than the one above, generate spark-defaults.conf as follows:

$ mkdir -p $HOME/.spark
$ echo spark.authenticate true > $HOME/.spark/spark-defaults.conf
$ echo spark.authenticate.secret $(openssl rand -base64 32) >> $HOME/.spark/spark-defaults.conf
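Since the file contains your secret, it is good practice to make it readable only by you:

$ chmod 700 $HOME/.spark
$ chmod 600 $HOME/.spark/spark-defaults.conf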

Initiate Spark

Example batch script to set up a Spark cluster on Colossus:

#!/bin/bash
# spark-init.batch - sbatch script to initialize Spark stand-alone cluster with SLURM

#SBATCH --account=staff
#SBATCH --nodes=3
#  ntasks per node MUST be one, because multiple slaves per node don't
#  work well with slurm + spark in this script (each slave would need its
#  own set of ports, among other things)
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=500
#SBATCH --time=02:00:00

module purge
module load spark/version
module list

# This section will be run when started by sbatch
if [ "$1" != 'srunning' ]; then
    source /cluster/bin/jobsetup

    source $SPARK_HOME/bin/sparksetup.sh

    #get full path of this script
    script=$(scontrol show job $SLURM_JOBID | grep Command | cut -d = -f 2)
    srun bash $script 'srunning'

# If run by srun, then use start_spark_slurm.sh to start master and workers
else
    source $SPARK_HOME/bin/start_spark_slurm.sh
fi

When Spark is up, a file JOBID_spark_master will be created in the same directory as the output file (usually the directory from which you submitted spark-init.batch). It contains the Spark URL of the master host, referred to below as $SPARK_MASTER.
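For example, the whole sequence might look like this (the jobid and master URL below are hypothetical; yours will differ):

$ sbatch spark-init.batch
Submitted batch job 123456
$ cat 123456_spark_master
spark://c1-4:7077
$ export SPARK_MASTER=$(cat 123456_spark_master)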

Interactive use

(only applicable on Colossus)

Simple example:

$ module load spark
$ run-example --master $SPARK_MASTER --properties-file $SPARK_CONFIG SparkPi 10
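If all went well, the output should contain a line like "Pi is roughly 3.14..." (the exact value varies from run to run).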

Python interface:

$ module load spark
$ module load python2
$ # if ipython is preferred:
$ # export PYSPARK_DRIVER_PYTHON=ipython
$ pyspark --master $SPARK_MASTER --properties-file $SPARK_CONFIG
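Once the shell is up, a quick sanity check might look like this (sc is the SparkContext the shell creates for you):

>>> sc.parallelize(range(1000)).sum()
499500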

R interface:

$ module load spark
$ module load R
$ sparkR --master $SPARK_MASTER --properties-file $SPARK_CONFIG
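As a quick test, the standard SparkR quick-start snippet can be run here; it turns R's built-in faithful dataset into a distributed data frame:

> df <- as.DataFrame(faithful)
> head(df)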

Batch use

Example script to run Spark tasks on a running Spark cluster:

#!/bin/bash
# spark-connect.batch - sbatch script to connect to Spark stand-alone cluster started with SLURM
# takes as input the Spark URL of the master node and (optionally) the SLURM jobid of the Spark cluster

#SBATCH --account=staff
#SBATCH --nodes=1
#  ntasks per node MUST be one, because multiple slaves per node don't
#  work well with slurm + spark in this script (each slave would need its
#  own set of ports, among other things)
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4G
#SBATCH --time=01:00:00

module purge
module load spark/version

if [ -z "$1" ]; then
    echo "Usage: $0 SPARK_MASTER [SPARK_MASTER_JOBID]"
    exit 1
fi

SPARK_MASTER=$1

# Run a task on the cluster. The SparkPi example below is a placeholder;
# substitute your own run-example or spark-submit command.
run-example --master $SPARK_MASTER --properties-file $SPARK_CONFIG SparkPi 10
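For example, reusing the hypothetical jobid from the sequence above:

$ sbatch spark-connect.batch $(cat 123456_spark_master) 123456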