Installation Guide - Spark
Copyright © 2021 by Symetry, Inc. 14 Pine Street, Ste 6 Morristown, NJ 07960 All Rights Reserved April 15th, 2021
Introduction
Assumptions
You have a working installation of SymetryML with Jetty. For information about performing this task, refer to the Installation Guide.
Make sure that all the required libraries are in your /opt/symetry/nativelib folder and that your LD_LIBRARY_PATH is set correctly. Additionally, if you need graphics processing unit (GPU) support, refer to the Installation Guide - GPU and to GPU Information in this guide.
You have a working installation of Spark on the same machine where the Jetty web server is installed. SymetryML is certified to work with Spark 2.4.5 (Hadoop 2.7), Spark 2.4.6 (Hadoop 2.7), Spark 3.0.1 (Hadoop 2.7), and Spark 3.0.2 (Hadoop 3.2).
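The LD_LIBRARY_PATH assumption above can be checked with a small helper. This is a minimal sketch, not part of SymetryML: the function name path_list_contains is hypothetical, and the only path it relies on is the /opt/symetry/nativelib folder named in this guide.

```shell
# Return 0 if directory $2 is an entry of the colon-separated list $1.
# (path_list_contains is a hypothetical helper name used for illustration.)
path_list_contains() {
  local list="$1" dir="$2"
  case ":${list}:" in
    *":${dir}:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example usage, run as the user that starts Jetty:
#   path_list_contains "$LD_LIBRARY_PATH" /opt/symetry/nativelib \
#     || echo "LD_LIBRARY_PATH does not include /opt/symetry/nativelib"
```

Wrapping the list in colons before matching avoids false positives such as /opt/symetry/nativelib2.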
System Requirements
GPU Support
Currently certified on CUDA 10.x with NVIDIA GPUs of Compute Capability 3.5 or higher. See GPU Information in this guide for details.
Spark Master
A computer with 24 to 32 cores and a high-speed Internet connection.
Spark Cluster worker memory
Minimum: 8 GB
Recommended: 16 GB
Choose the number of workers to start on each node based on the amount of worker memory. For example, on Amazon EC2:
* c5.8xlarge instance: 8 workers.
* c5.4xlarge instance: 4 workers.
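The sizing rule above can be sketched as simple integer division of node memory by memory per worker. This is an illustration only: the helper name workers_for_node is hypothetical, the 8 GB default is the minimum stated in this guide, and the sketch does not reserve memory for the operating system.

```shell
# Print how many workers fit on a node: floor(total_gb / per_worker_gb).
# (workers_for_node is a hypothetical helper; 8 GB is this guide's minimum.)
workers_for_node() {
  local total_gb="$1" per_worker_gb="${2:-8}"
  if [ "$per_worker_gb" -le 0 ]; then
    echo "per-worker memory must be positive" >&2
    return 1
  fi
  echo $(( total_gb / per_worker_gb ))
}

# Example usage:
#   workers_for_node 64 8    # prints 8
#   workers_for_node 32 8    # prints 4
```

If you run Spark in standalone mode, the resulting values would typically go into SPARK_WORKER_INSTANCES and SPARK_WORKER_MEMORY in conf/spark-env.sh; check your Spark documentation for your deployment mode.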
Spark Information
SymetryML Spark Files
File: symetry.tar.gz
Description: Contains the files that the SymetryML REST application needs to communicate with a Spark cluster. Decompress this archive file in /opt/symetry.
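Decompressing the archive can be sketched as follows. This assumes that the top-level entries of symetry.tar.gz are the lib, libExt, and other folders themselves rather than a wrapping directory; you can inspect the archive first with tar -tzf. The function name extract_symetry is hypothetical.

```shell
# Extract archive $1 into directory $2 (default: /opt/symetry),
# creating the directory if it does not exist.
# (extract_symetry is a hypothetical helper name used for illustration.)
extract_symetry() {
  local archive="$1" dest="${2:-/opt/symetry}"
  mkdir -p "$dest" && tar -xzf "$archive" -C "$dest"
}

# Example usage (may require sudo, depending on who owns /opt/symetry):
#   tar -tzf symetry.tar.gz | head      # inspect the archive layout first
#   extract_symetry symetry.tar.gz /opt/symetry
```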
Once the symetry.tar.gz file is decompressed, your /opt/symetry/ folder should contain the following. The lib and libExt folders contain the files needed to communicate with a Spark cluster using the various Spark driver applications: spark-submit, spark-shell, pyspark, or the SML web application inside Jetty.
├── lib
│ └── sym-spark-assembly.jar
├── libExt
│ ├── commons-pool2-2.0.jar
│ ├── jedis2.8.5.jar
│ ├── sym-core.jar
│ ├── sym-dao.jar
│ ├── sym-spark.jar
│ └── sym-util.jar
├── nativelib
│ ├── libblas.so -> libmkl_rt.so
│ ├── libblas.so.3 -> libmkl_rt.so
│ ├── libiomp5.so
│ ├── liblapack.so -> libmkl_rt.so
│ ├── liblapack.so.3 -> libmkl_rt.so
│ ├── libmkl_avx.so
│ ├── libmkl_core.so
│ ├── libmkl_def.so
│ ├── libmkl_gnu_thread.so
│ ├── libmkl_intel_lp64.so
│ ├── libmkl_intel_thread.so
│ ├── libmkl_rt.so
│ └── libsym-gpu.so
├── plugins
│ ├── csv2.dsplugin
│ ├── ds-plugin-csv2.jar
│ └── mysql-connector-java-5.1.36-bin.jar
├── python
│ ├── SMLDataFrameUtil.py
│ ├── SMLProjectUtil.py
│ └── SMLPy4JGateway.py
├── spark-support
│ ├── spark2.4.5
│ │ └── lib -> /opt/spark-2.4.5-bin-hadoop2.7/jars
│ ├── spark2.4.6
│ │ └── lib -> /opt/spark-2.4.6-bin-hadoop2.7/jars
│ ├── spark3.0.1
│ │ └── lib -> /opt/spark-3.0.1-bin-hadoop2.7/jars
│ ├── spark3.0.2
│ │ └── lib -> /opt/spark-3.0.2-bin-hadoop3.2/jars
│ └── spark-redshift-community
│   └── scala_2.11
│     └── version_4.0.1
│       ├── com.amazonaws_aws-java-sdk-1.7.4.jar
│       ├── com.eclipsesource.minimal-json_minimal-json-0.9.4.jar
│       ├── com.fasterxml.jackson.core_jackson-annotations-2.2.3.jar
│       ├── com.fasterxml.jackson.core_jackson-core-2.2.3.jar
│       ├── com.fasterxml.jackson.core_jackson-databind-2.2.3.jar
│       ├── commons-codec_commons-codec-1.4.jar
│       ├── commons-logging_commons-logging-1.1.3.jar
│       ├── io.github.spark-redshift-community_spark-redshift_2.11-4.0.1.jar
│       ├── joda-time_joda-time-2.10.6.jar
│       ├── org.apache.hadoop_hadoop-aws-2.7.7.jar
│       ├── org.apache.httpcomponents_httpclient-4.2.5.jar
│       ├── org.apache.httpcomponents_httpcore-4.2.5.jar
│       ├── org.apache.spark_spark-avro_2.11-2.4.2.jar
│       ├── org.slf4j_slf4j-api-1.7.5.jar
│       ├── org.spark-project.spark_unused-1.0.0.jar
│       └── RedshiftJDBC42-no-awssdk-1.2.36.1060.jar
Additional SymetryML Configuration for Spark Support
SymetryML relies on a symbolic link (for example, /opt/symetry/spark-support/spark2.4.5/lib) that points to your $SPARK_HOME/jars folder. Create the symbolic link for your Spark version so that it points to the jars folder inside your Spark installation.
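Creating the link can be sketched as below. The function name link_spark_jars is hypothetical; the folder layout (spark-support/spark<version>/lib) follows the directory listing in this guide, and the default /opt/symetry location is the one used throughout it.

```shell
# Point <symetry_home>/spark-support/spark<version>/lib at <spark_home>/jars.
# (link_spark_jars is a hypothetical helper name used for illustration.)
link_spark_jars() {
  local spark_home="$1" version="$2" symetry_home="${3:-/opt/symetry}"
  local support_dir="${symetry_home}/spark-support/spark${version}"
  if [ ! -d "${spark_home}/jars" ]; then
    echo "no jars folder in ${spark_home}" >&2
    return 1
  fi
  # -s: symbolic link, -f: replace an existing link, -n: do not follow it
  mkdir -p "$support_dir" && ln -sfn "${spark_home}/jars" "${support_dir}/lib"
}

# Example usage:
#   link_spark_jars /opt/spark-2.4.5-bin-hadoop2.7 2.4.5
```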
Example with Spark 2.4.5 using Hadoop 2.7:
(base) johndoe$ pwd
/opt/symetry/spark-support/spark2.4.5
(base) johndoe$ ls -l
total 0
lrwxr-xr-x 1 neil wheel 53 4 Mar 09:00 lib -> /opt/spark-2.4.5-bin-hadoop2.7/jars
Additional Jars Needed for Spark
SymetryML needs additional jars to access files on Amazon S3. Put these files in the $SPARK_HOME/jars folder. The jars you need depend on the version of Spark you are using.
Spark 2.4.x with Hadoop 2.7
Add the following jars to the $SPARK_HOME/jars folder. Make sure their version matches the version of hadoop-common-HADOOP_VERSION.jar in the $SPARK_HOME/jars folder (for example, hadoop-common-2.7.4.jar):
* hadoop-aws-2.7.4.jar
* hadoop-azure-2.7.4.jar
Add the following jars to the $SPARK_HOME/jars folder:
* aws-java-sdk-1.7.4.jar
* jets3t-0.9.4.jar
* azure-storage-3.1.0.jar
* xbean-asm6-shaded-4.10.jar: For this one, you need to replace the pre-existing xbean-asm6-shaded-4.8.jar with xbean-asm6-shaded-4.10.jar. There is a bug in xbean-asm6-shaded-4.8.jar that appears when trying to access S3 files.
Spark 3.0.x with Hadoop 2.7
Add the following jars to the $SPARK_HOME/jars folder. Make sure their version matches the version of hadoop-common-HADOOP_VERSION.jar in the $SPARK_HOME/jars folder (for example, hadoop-common-2.7.4.jar):
* hadoop-aws-2.7.4.jar
* hadoop-azure-2.7.4.jar
Add the following jars to the $SPARK_HOME/jars folder:
* aws-java-sdk-1.7.4.jar
* jets3t-0.9.4.jar
* azure-storage-3.1.0.jar
Spark 3.0.x with Hadoop 3.2
Add the following jars to the $SPARK_HOME/jars folder. Make sure their version matches the version of hadoop-common-HADOOP_VERSION.jar in the $SPARK_HOME/jars folder (for example, hadoop-common-3.2.0.jar):
* hadoop-aws-3.2.0.jar
* hadoop-azure-3.2.0.jar
Add the following jars to the $SPARK_HOME/jars folder:
* aws-java-sdk-bundle-1.11.375.jar
* jets3t-0.9.4.jar
* azure-storage-8.6.6.jar
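For any of the Spark and Hadoop combinations above, the version-matching rule can be checked with a short script. This is a sketch under stated assumptions: the function names hadoop_version_in and check_hadoop_cloud_jars are hypothetical, and the script assumes the jars folder holds exactly one hadoop-common-<version>.jar.

```shell
# Print the Hadoop version taken from hadoop-common-<version>.jar in folder $1.
# (Both function names here are hypothetical helpers for illustration.)
hadoop_version_in() {
  local jars_dir="$1" jar
  for jar in "$jars_dir"/hadoop-common-*.jar; do
    [ -e "$jar" ] || return 1          # glob did not match any file
    jar="${jar##*/hadoop-common-}"     # strip folder and name prefix
    echo "${jar%.jar}"                 # strip the .jar suffix
    return 0
  done
}

# Report hadoop-aws / hadoop-azure jars that are missing for that version.
# Returns 0 when both are present, 1 otherwise.
check_hadoop_cloud_jars() {
  local jars_dir="$1" version name missing=0
  version="$(hadoop_version_in "$jars_dir")" || {
    echo "hadoop-common jar not found in ${jars_dir}" >&2
    return 1
  }
  for name in hadoop-aws hadoop-azure; do
    if [ ! -f "${jars_dir}/${name}-${version}.jar" ]; then
      echo "missing: ${name}-${version}.jar"
      missing=1
    fi
  done
  return "$missing"
}

# Example usage:
#   check_hadoop_cloud_jars "$SPARK_HOME/jars"
```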
Spark Cluster Configuration
For information about configuring your Spark Cluster, refer to your Spark documentation. SymetryML assumes that you have an up and running Spark cluster.
GPU Information
GPU Processing Support on Spark Worker
You can use a GPU on each worker node in your Spark cluster. If you do, be sure to install all required NVIDIA GPU drivers on each worker node in your cluster. This process is described in the next section.
Additional GPU Steps on Spark Worker
Perform the following procedure to configure the nodes that will run your Spark workers for use with the NVIDIA GPU. This procedure applies to Linux.
Download CUDA 10.x from https://developer.nvidia.com/
Install CUDA, and then use the nvidia-smi command to verify that CUDA is working.
Make sure that /opt/symetry/nativelib on your Spark worker contains the same .so files as your SymetryML host server. Consult the SymetryML Installation Guide for more information.
Make sure that the Jetty user's LD_LIBRARY_PATH is set correctly, as in the following:
# download cuda
wget http://developer.download.nvidia.com/compute/cuda/10.2/Prod/local_installers/cuda_10.2.89_440.33.01_linux.run
# run the installer
chmod +x cuda_10.2.89_440.33.01_linux.run
./cuda_10.2.89_440.33.01_linux.run
# verify that CUDA is working
nvidia-smi
# edit /home/jetty/.bashrc
sudo su jetty
cd
emacs .bashrc
# /home/jetty/.bashrc additional entries
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/opt/symetry/nativelib
export LD_LIBRARY_PATH
GPU Support
CUDA library
Currently certified on CUDA 10.x
Intel MKL
Working with MKL version 11.0.0 and higher
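Because each Spark worker's /opt/symetry/nativelib must contain the same .so files as the SymetryML host, including the MKL libraries, one way to compare the two folders is to build a checksum manifest on each machine and diff the results. This is a sketch, not a SymetryML tool: the function name nativelib_manifest is hypothetical, and it assumes the sha256sum command is available, as it is on most Linux distributions.

```shell
# Print "<sha256>  ./<file name>" for every shared library in folder $1,
# sorted by file name so that two manifests can be compared with diff.
# (nativelib_manifest is a hypothetical helper name used for illustration.)
nativelib_manifest() {
  local dir="$1"
  ( cd "$dir" && find . -maxdepth 1 -name '*.so*' -exec sha256sum {} + | sort -k 2 )
}

# Example usage:
#   on the SymetryML host:  nativelib_manifest /opt/symetry/nativelib > host.txt
#   on a Spark worker:      nativelib_manifest /opt/symetry/nativelib > worker.txt
#   after copying both files to one machine:  diff host.txt worker.txt
```

An empty diff indicates that both folders hold the same library names with identical contents.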
Spark FAQs
Question: What does the following error message mean: ERROR 500: INTERNAL_SERVER_ERROR : Cannot assign requested address.
Answer: Make sure that the SymetryML configuration file (/opt/symetry/symetry-rest.txt) has rtlm.option.spark.listener.host set correctly to your host.
Question: What does the following error message mean: java.lang.OutOfMemoryError: GC overhead limit exceeded.
Answer: Increase your worker memory using Spark configuration parameters, such as spark.executor.memory.
Question: What does the following error message mean: 15/08/17 17:43:47 ERROR WorkerWatcher: Error was: akka.remote.InvalidAssociation: Invalid address: akka.tcp://[email protected]:49991
Answer: This error is most likely caused by a lack of memory. Check the worker logs and increase your worker memory.
Question: I see a java.net.BindException: Address already in use message in my log.
Answer: You can usually ignore this message.
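For the "Cannot assign requested address" question above, the configured listener host can be read back with a short helper. This sketch assumes that /opt/symetry/symetry-rest.txt stores settings as key=value lines, which this guide does not confirm; the function name get_property is hypothetical.

```shell
# Print the value of property $2 from file $1.
# ASSUMPTION: the file uses "key=value" lines; adjust if the format differs.
# If the key appears more than once, the last value is printed.
get_property() {
  local file="$1" key="$2"
  awk -F= -v k="$key" '
    {
      name = $1
      gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim spaces around the key
      if (name == k) {
        sub(/^[^=]*=[ \t]*/, "")          # drop "key =" from the line
        v = $0
      }
    }
    END { if (v != "") print v }
  ' "$file"
}

# Example usage:
#   get_property /opt/symetry/symetry-rest.txt rtlm.option.spark.listener.host
#   hostname    # compare with the value printed above
```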