Installation Guide - Spark
Copyright © 2021 by Symetry, Inc. 14 Pine Street, Ste 6 Morristown, NJ 07960 All Rights Reserved April 15th, 2021
Introduction
Assumptions
* You have a working installation of SymetryML with Jetty. For information about performing this task, refer to the Installation Guide. Make sure that all the required libraries are in your /opt/symetry/nativelib folder and that your LD_LIBRARY_PATH is set correctly. Additionally, if you need graphics processing unit (GPU) support, refer to the Installation Guide - GPU and to GPU Information in this guide.
* You have a working installation of Spark on the same machine where the Jetty Web server is installed. SymetryML is certified to work with Spark 2.4.5 with Hadoop 2.7, Spark 2.4.6 with Hadoop 2.7, Spark 3.0.1 with Hadoop 2.7, and Spark 3.0.2 with Hadoop 3.2.
System Requirements
GPU Support
Currently certified on CUDA 10.x with NVIDIA GPUs with Compute Capability >= 3.5. For more information, consult GPU Information in this guide.
Spark Master
A computer with 24 to 32 cores and a high-speed Internet connection.
Spark Cluster worker memory
Minimum: 8 GB. Recommended: 16 GB. Base the number of workers you start on each node on the amount of worker memory available. For example, on Amazon EC2:
* c5.8xlarge instance: 8 workers.
* c5.4xlarge instance: 4 workers.
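For a standalone Spark cluster, the worker count and worker memory can be set in spark-env.sh. The values below are a sketch for a c5.4xlarge-class node, not required settings:

```shell
# $SPARK_HOME/conf/spark-env.sh (fragment) -- example values, adjust to your hardware
export SPARK_WORKER_INSTANCES=4
export SPARK_WORKER_MEMORY=16g
```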
Spark Information
SymetryML Spark Files
File: symetry.tar.gz
Contains the files that the SymetryML REST application needs to communicate with a Spark cluster. This archive file should be decompressed in /opt/symetry.
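For example, assuming the archive was downloaded to the current directory:

```shell
# Extract the SymetryML Spark support files into /opt/symetry
tar -xzf symetry.tar.gz -C /opt/symetry
```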
Once the symetry.tar.gz file is decompressed, you should get the following in your /opt/symetry/ folder. The lib and libExt folders contain the files that are needed to communicate with a Spark cluster using the various Spark driver applications - spark-submit, spark-shell, pyspark - or the SML web application inside of Jetty.
├── lib
│ └── sym-spark-assembly.jar
├── libExt
│ ├── commons-pool2-2.0.jar
│ ├── jedis2.8.5.jar
│ ├── sym-core.jar
│ ├── sym-dao.jar
│ ├── sym-spark.jar
│ └── sym-util.jar
├── nativelib
│ ├── libblas.so -> libmkl_rt.so
│ ├── libblas.so.3 -> libmkl_rt.so
│ ├── libiomp5.so
│ ├── liblapack.so -> libmkl_rt.so
│ ├── liblapack.so.3 -> libmkl_rt.so
│ ├── libmkl_avx.so
│ ├── libmkl_core.so
│ ├── libmkl_def.so
│ ├── libmkl_gnu_thread.so
│ ├── libmkl_intel_lp64.so
│ ├── libmkl_intel_thread.so
│ ├── libmkl_rt.so
│ └── libsym-gpu.so
├── plugins
│ ├── csv2.dsplugin
│ ├── ds-plugin-csv2.jar
│ └── mysql-connector-java-5.1.36-bin.jar
├── python
│ ├── SMLDataFrameUtil.py
│ ├── SMLProjectUtil.py
│ └── SMLPy4JGateway.py
├── spark-support
│ ├── spark2.4.5
│ │ └── lib -> /opt/spark-2.4.5-bin-hadoop2.7/jars
│ ├── spark2.4.6
│ │ └── lib -> /opt/spark-2.4.6-bin-hadoop2.7/jars
│ ├── spark3.0.1
│ │ └── lib -> /opt/spark-3.0.1-bin-hadoop2.7/jars
│ ├── spark3.0.2
│ │ └── lib -> /opt/spark-3.0.2-bin-hadoop3.2/jars
│ └── spark-redshift-community
│ └── scala_2.11
│ └── version_4.0.1
│ ├── com.amazonaws_aws-java-sdk-1.7.4.jar
│ ├── com.eclipsesource.minimal-json_minimal-json-0.9.4.jar
│ ├── com.fasterxml.jackson.core_jackson-annotations-2.2.3.jar
│ ├── com.fasterxml.jackson.core_jackson-core-2.2.3.jar
│ ├── com.fasterxml.jackson.core_jackson-databind-2.2.3.jar
│ ├── commons-codec_commons-codec-1.4.jar
│ ├── commons-logging_commons-logging-1.1.3.jar
│ ├── io.github.spark-redshift-community_spark-redshift_2.11-4.0.1.jar
│ ├── joda-time_joda-time-2.10.6.jar
│ ├── org.apache.hadoop_hadoop-aws-2.7.7.jar
│ ├── org.apache.httpcomponents_httpclient-4.2.5.jar
│ ├── org.apache.httpcomponents_httpcore-4.2.5.jar
│ ├── org.apache.spark_spark-avro_2.11-2.4.2.jar
│ ├── org.slf4j_slf4j-api-1.7.5.jar
│ ├── org.spark-project.spark_unused-1.0.0.jar
│ └── RedshiftJDBC42-no-awssdk-1.2.36.1060.jar
Additional SymetryML Configuration for Spark Support
SymetryML relies on the existence of a symbolic link (/opt/symetry/spark-support/spark2.4.5/lib) that points to your $SPARK_HOME/jars folder. Make sure to create the symbolic link so that it points to the jars folder inside your Spark installation.
Example with Spark 2.4.5 using Hadoop 2.7:
(base) johndoe$ pwd
/opt/symetry/spark-support/spark2.4.5
(base) johndoe$ ls -l
total 0
lrwxr-xr-x 1 neil wheel 53 4 Mar 09:00 lib -> /opt/spark-2.4.5-bin-hadoop2.7/jars
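If the link is missing (for example, after installing a new Spark version), it can be created with ln -s. The Spark installation path below is an example; adjust it to your layout:

```shell
# Create the lib symlink for Spark 2.4.5 (paths are examples)
cd /opt/symetry/spark-support/spark2.4.5
ln -sfn /opt/spark-2.4.5-bin-hadoop2.7/jars lib
ls -l lib
```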
Additional Jars Needed for Spark
SymetryML needs additional jars to be able to access files on Amazon S3. These files need to be put in the $SPARK_HOME/jars folder. Depending on the version of Spark you are using, you will need to put different jar files there.
Spark 2.4.x with Hadoop 2.7
Add the following jars to the $SPARK_HOME/jars folder. Make sure their version matches the version of hadoop-common-HADOOP_VERSION.jar in the $SPARK_HOME/jars folder (e.g. hadoop-common-2.7.4.jar):
* hadoop-aws-2.7.4.jar
* hadoop-azure-2.7.4.jar

Also add the following jars to the $SPARK_HOME/jars folder:
* aws-java-sdk-1.7.4.jar
* jets3t-0.9.4.jar
* azure-storage-3.1.0.jar
* xbean-asm6-shaded-4.10.jar: for this one, you will need to replace the pre-existing xbean-asm6-shaded-4.8.jar with xbean-asm6-shaded-4.10.jar. There is a bug in xbean-asm6-shaded-4.8.jar when trying to access S3 files.
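The steps above can be scripted as follows. SPARK_HOME and the download directory are placeholders; adjust them to wherever your Spark installation lives and where you downloaded the jars:

```shell
# Copy the S3/Azure support jars into the Spark jars folder
# (SPARK_HOME and DOWNLOADS are example paths -- adjust to your setup)
SPARK_HOME=/opt/spark-2.4.5-bin-hadoop2.7
DOWNLOADS=$HOME/downloads

cp "$DOWNLOADS"/hadoop-aws-2.7.4.jar \
   "$DOWNLOADS"/hadoop-azure-2.7.4.jar \
   "$DOWNLOADS"/aws-java-sdk-1.7.4.jar \
   "$DOWNLOADS"/jets3t-0.9.4.jar \
   "$DOWNLOADS"/azure-storage-3.1.0.jar \
   "$SPARK_HOME/jars/"

# Swap the buggy xbean jar for the fixed version
rm -f "$SPARK_HOME/jars/xbean-asm6-shaded-4.8.jar"
cp "$DOWNLOADS"/xbean-asm6-shaded-4.10.jar "$SPARK_HOME/jars/"
```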
Spark 3.0.x with Hadoop 2.7
Add the following jars to the $SPARK_HOME/jars folder. Make sure their version matches the version of hadoop-common-HADOOP_VERSION.jar in the $SPARK_HOME/jars folder (e.g. hadoop-common-2.7.4.jar):
* hadoop-aws-2.7.4.jar
* hadoop-azure-2.7.4.jar

Also add the following jars to the $SPARK_HOME/jars folder:
* aws-java-sdk-1.7.4.jar
* jets3t-0.9.4.jar
* azure-storage-3.1.0.jar
Spark 3.0.x with Hadoop 3.2
Add the following jars to the $SPARK_HOME/jars folder. Make sure their version matches the version of hadoop-common-HADOOP_VERSION.jar in the $SPARK_HOME/jars folder (e.g. hadoop-common-3.2.0.jar):
* hadoop-aws-3.2.0.jar
* hadoop-azure-3.2.0.jar

Also add the following jars to the $SPARK_HOME/jars folder:
* aws-java-sdk-bundle-1.11.375.jar
* jets3t-0.9.4.jar
* azure-storage-8.6.6.jar
Spark Cluster Configuration
For information about configuring your Spark cluster, refer to your Spark documentation. SymetryML assumes that you have an up-and-running Spark cluster.
GPU Information
GPU Processing Support on Spark Worker
You can use a GPU on each worker node in your Spark cluster. If you do, be sure to install all required NVIDIA GPU drivers on each worker node in your cluster. This process is described in the next section.
Additional GPU Steps on Spark Worker
Perform the following procedure to configure the nodes that will run your Spark worker for use with an NVIDIA GPU. This procedure applies to Linux.
1. Download CUDA 10.x from https://developer.nvidia.com/.
2. Install CUDA, and then use the nvidia-smi command to verify that CUDA is working.
3. Make sure that your Spark worker's /opt/symetry/nativelib folder contains the same .so files as your SymetryML host server. Consult the SymetryML Installation Guide for more information.
4. Be sure that the Jetty user's LD_LIBRARY_PATH is set correctly, as in the following example:
# download cuda
wget http://developer.download.nvidia.com/compute/cuda/10.2/Prod/local_installers/cuda_10.2.89_440.33.01_linux.run
# run the installer
chmod +x cuda_10.2.89_440.33.01_linux.run
./cuda_10.2.89_440.33.01_linux.run
# verify that CUDA is working
nvidia-smi
# edit /home/jetty/.bashrc
sudo su jetty
cd
emacs .bashrc
# /home/jetty/.bashrc additional entries
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/opt/symetry/nativelib
export LD_LIBRARY_PATH
GPU Support
CUDA library
Currently certified on CUDA 10.x
Intel MKL
Works with MKL version 11.0.0 and higher
Spark FAQs
Question: What does the following error message mean? ERROR 500: INTERNAL_SERVER_ERROR : Cannot assign requested address
Answer: Be sure the SymetryML configuration file (/opt/symetry/symetry-rest.txt) has the rtlm.option.spark.listener.host property set correctly to your host.
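As a sketch, the relevant line in the configuration file looks like the following; the host value here is a placeholder, not a value from this guide:

```
# /opt/symetry/symetry-rest.txt (fragment) -- replace with your host's address
rtlm.option.spark.listener.host=192.168.1.50
```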
Question: What does the following error message mean? java.lang.OutOfMemoryError: GC overhead limit exceeded
Answer: Increase your worker memory using Spark configuration parameters.
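For example, memory can be raised in spark-defaults.conf; the values below are illustrative, not recommendations:

```
# $SPARK_HOME/conf/spark-defaults.conf (fragment) -- example values, tune to your workload
spark.executor.memory 16g
spark.driver.memory 4g
```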
Question: What does the following error message mean? 15/08/17 17:43:47 ERROR WorkerWatcher: Error was: akka.remote.InvalidAssociation: Invalid address: akka.tcp://[email protected]:49991
Answer: This error is most likely caused by a lack of memory. Verify the worker logs and increase your worker memory.
Question: I see a java.net.BindException: Address already in use message in my log.
Answer: You can usually ignore this message.