
Installation Guide - Spark



Copyright © 2021 by Symetry, Inc., 14 Pine Street, Ste 6, Morristown, NJ 07960. All Rights Reserved. April 15th, 2021.

Introduction

Assumptions

  • You have a working installation of SymetryML with Jetty. For information about performing this task, refer to the SymetryML Installation Guide.

  • Make sure that all the required libraries are in your /opt/symetry/nativelib folder and that your LD_LIBRARY_PATH is set correctly (a quick check is sketched after this list). Additionally, if you need graphics processing unit (GPU) support, refer to the Installation Guide - GPU and to GPU Information in this guide.

  • A working installation of Spark on the same machine where the Jetty web server is installed. SymetryML is certified to work with Spark 2.4.5 with Hadoop 2.7, Spark 2.4.6 with Hadoop 2.7, Spark 3.0.1 with Hadoop 2.7, and Spark 3.0.2 with Hadoop 3.2.
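
A quick way to verify the native libraries are in place for the Jetty user (paths assume the default install location):

ls /opt/symetry/nativelib
echo $LD_LIBRARY_PATH   # should include /opt/symetry/nativelib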

System Requirements

  • GPU Support: See GPU Information in this guide.

  • Spark Master: A 24- to 32-core computer with a high-speed Internet connection.

  • Spark Cluster worker memory: Minimum: 8 GB. Recommended: 16 GB. Choose the number of workers on each node based on the amount of worker memory. For example, on Amazon EC2: a c5.8xlarge instance runs 8 workers and a c5.4xlarge instance runs 4 workers (at 8 GB per worker, this matches those instances' 64 GiB and 32 GiB of memory, respectively).

Spark Information

SymetryML Spark Files

  • File symetry.tar.gz: Contains the files that the SymetryML REST application needs to communicate with a Spark cluster. This archive should be decompressed in /opt/symetry.
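
For example, assuming the archive has been copied to /opt/symetry:

cd /opt/symetry
tar -xzf symetry.tar.gz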

Once the symetry.tar.gz file is decompressed, you should have the following in your /opt/symetry/ folder. The lib and libExt folders contain the files needed to communicate with a Spark cluster through the various Spark driver applications: spark-submit, spark-shell, pyspark, or the SML web application inside Jetty.

├── lib
│   └── sym-spark-assembly.jar
├── libExt
│   ├── commons-pool2-2.0.jar
│   ├── jedis2.8.5.jar
│   ├── sym-core.jar
│   ├── sym-dao.jar
│   ├── sym-spark.jar
│   └── sym-util.jar
├── nativelib
│   ├── libblas.so -> libmkl_rt.so
│   ├── libblas.so.3 -> libmkl_rt.so
│   ├── libiomp5.so
│   ├── liblapack.so -> libmkl_rt.so
│   ├── liblapack.so.3 -> libmkl_rt.so
│   ├── libmkl_avx.so
│   ├── libmkl_core.so
│   ├── libmkl_def.so
│   ├── libmkl_gnu_thread.so
│   ├── libmkl_intel_lp64.so
│   ├── libmkl_intel_thread.so
│   ├── libmkl_rt.so
│   └── libsym-gpu.so
├── plugins
│   ├── csv2.dsplugin
│   ├── ds-plugin-csv2.jar
│   └── mysql-connector-java-5.1.36-bin.jar
├── python
│   ├── SMLDataFrameUtil.py
│   ├── SMLProjectUtil.py
│   └── SMLPy4JGateway.py
├── spark-support
│   ├── spark2.4.5
│   │   └── lib -> /opt/spark-2.4.5-bin-hadoop2.7/jars
│   ├── spark2.4.6
│   │   └── lib -> /opt/spark-2.4.6-bin-hadoop2.7/jars
│   ├── spark3.0.1
│   │   └── lib -> /opt/spark-3.0.1-bin-hadoop2.7/jars
│   ├── spark3.0.2
│   │   └── lib -> /opt/spark-3.0.2-bin-hadoop3.2/jars
│   └── spark-redshift-community
│       └── scala_2.11
│           └── version_4.0.1
│               ├── com.amazonaws_aws-java-sdk-1.7.4.jar
│               ├── com.eclipsesource.minimal-json_minimal-json-0.9.4.jar
│               ├── com.fasterxml.jackson.core_jackson-annotations-2.2.3.jar
│               ├── com.fasterxml.jackson.core_jackson-core-2.2.3.jar
│               ├── com.fasterxml.jackson.core_jackson-databind-2.2.3.jar
│               ├── commons-codec_commons-codec-1.4.jar
│               ├── commons-logging_commons-logging-1.1.3.jar
│               ├── io.github.spark-redshift-community_spark-redshift_2.11-4.0.1.jar
│               ├── joda-time_joda-time-2.10.6.jar
│               ├── org.apache.hadoop_hadoop-aws-2.7.7.jar
│               ├── org.apache.httpcomponents_httpclient-4.2.5.jar
│               ├── org.apache.httpcomponents_httpcore-4.2.5.jar
│               ├── org.apache.spark_spark-avro_2.11-2.4.2.jar
│               ├── org.slf4j_slf4j-api-1.7.5.jar
│               ├── org.spark-project.spark_unused-1.0.0.jar
│               └── RedshiftJDBC42-no-awssdk-1.2.36.1060.jar

Additional SymetryML Configuration for Spark Support

SymetryML relies on the existence of a symbolic link (for example, /opt/symetry/spark-support/spark2.4.5/lib) that points to your $SPARK_HOME/jars folder. Make sure to create the symbolic link so that it points to the jars folder inside your Spark installation.

Example with Spark 2.4.5 using Hadoop 2.7:

(base) johndoe$ pwd
/opt/symetry/spark-support/spark2.4.5
(base) johndoe$ ls -l
total 0
lrwxr-xr-x  1 johndoe  wheel  53  4 Mar 09:00 lib -> /opt/spark-2.4.5-bin-hadoop2.7/jars
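
To create the link (paths assume Spark 2.4.5 with Hadoop 2.7 installed under /opt):

mkdir -p /opt/symetry/spark-support/spark2.4.5
ln -s /opt/spark-2.4.5-bin-hadoop2.7/jars /opt/symetry/spark-support/spark2.4.5/lib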

Additional Jars Needed for Spark

SymetryML needs additional jars to access files on Amazon S3. These files need to be placed in the $SPARK_HOME/jars folder. Depending on the version of Spark you are using, different jar files are required.

Spark 2.4.x with Hadoop 2.7

  • Add the following jars to the $SPARK_HOME/jars folder. Make sure their version matches the version of hadoop-common-HADOOP_VERSION.jar in the $SPARK_HOME/jars folder (for example, hadoop-common-2.7.4.jar):

    • hadoop-aws-2.7.4.jar

    • hadoop-azure-2.7.4.jar

  • Add the following jars to the $SPARK_HOME/jars folder:

  • aws-java-sdk-1.7.4.jar

  • jets3t-0.9.4.jar

  • azure-storage-3.1.0.jar

  • xbean-asm6-shaded-4.10.jar: for this one, you will need to replace the pre-existing xbean-asm6-shaded-4.8.jar with xbean-asm6-shaded-4.10.jar. The 4.8 version has a bug that surfaces when accessing S3 files. (The steps are sketched after this list.)
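
A minimal sketch of the steps above, assuming the jars have already been downloaded to the current directory:

# confirm the Hadoop version that ships with your Spark build
ls $SPARK_HOME/jars/hadoop-common-*.jar

# copy the matching jars into place
cp hadoop-aws-2.7.4.jar hadoop-azure-2.7.4.jar aws-java-sdk-1.7.4.jar \
   jets3t-0.9.4.jar azure-storage-3.1.0.jar $SPARK_HOME/jars/

# replace the buggy xbean jar
rm $SPARK_HOME/jars/xbean-asm6-shaded-4.8.jar
cp xbean-asm6-shaded-4.10.jar $SPARK_HOME/jars/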

Spark 3.0.x with Hadoop 2.7

  • Add the following jars to the $SPARK_HOME/jars folder. Make sure their version matches the version of hadoop-common-HADOOP_VERSION.jar in the $SPARK_HOME/jars folder (for example, hadoop-common-2.7.4.jar):

    • hadoop-aws-2.7.4.jar

    • hadoop-azure-2.7.4.jar

  • Add the following jars to the $SPARK_HOME/jars folder:

  • aws-java-sdk-1.7.4.jar

  • jets3t-0.9.4.jar

  • azure-storage-3.1.0.jar

Spark 3.0.x with Hadoop 3.2

  • Add the following jars to the $SPARK_HOME/jars folder. Make sure their version matches the version of hadoop-common-HADOOP_VERSION.jar in the $SPARK_HOME/jars folder (for example, hadoop-common-3.2.0.jar):

    • hadoop-aws-3.2.0.jar

    • hadoop-azure-3.2.0.jar

  • Add the following jars to the $SPARK_HOME/jars folder:

  • aws-java-sdk-bundle-1.11.375.jar

  • jets3t-0.9.4.jar

  • azure-storage-8.6.6.jar

Spark Cluster Configuration

For information about configuring your Spark cluster, refer to the Spark documentation. SymetryML assumes that you have an up-and-running Spark cluster.
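
For reference, a standalone cluster can be brought up with the stock Spark scripts (spark://master-host:7077 is a placeholder for your master URL):

$SPARK_HOME/sbin/start-master.sh
# on each worker node; the script is named start-worker.sh in Spark 3.1 and later
$SPARK_HOME/sbin/start-slave.sh spark://master-host:7077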

GPU Information

GPU Processing Support on Spark Worker

You can use a GPU on each worker node in your Spark cluster. If you do, be sure to install all required NVIDIA GPU drivers on each worker node in your cluster. This process is described in the next section.

Additional GPU Steps on Spark Worker

Perform the following procedure to configure the nodes that will run your Spark workers for use with an NVIDIA GPU. This applies to Linux.

  1. Install CUDA, and then use the nvidia-smi command to verify that CUDA is working.

  2. Be sure that the Jetty user's LD_LIBRARY_PATH is set correctly, as in the following:

# download CUDA
wget http://developer.download.nvidia.com/compute/cuda/10.2/Prod/local_installers/cuda_10.2.89_440.33.01_linux.run

# run the installer
chmod +x cuda_10.2.89_440.33.01_linux.run
./cuda_10.2.89_440.33.01_linux.run

# verify that CUDA is working
nvidia-smi

# edit /home/jetty/.bashrc
sudo su jetty
cd
emacs .bashrc

# /home/jetty/.bashrc additional entries
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/opt/symetry/nativelib
export LD_LIBRARY_PATH

GPU Support

  • CUDA library: Currently certified on CUDA 10.x with NVIDIA GPUs with Compute Capability >= 3.5. Download CUDA 10.x from https://developer.nvidia.com/. Consult the Installation Guide - GPU for more information.

  • Intel MKL: Working with MKL version 11.0.0 and higher. Make sure that each Spark worker's /opt/symetry/nativelib contains the same .so files as your SymetryML host server; consult the Installation Guide for more information.

Spark FAQs

Question: What does the following error message mean: ERROR 500: INTERNAL_SERVER_ERROR : Cannot assign requested address.

Answer: Be sure the SymetryML configuration file (/opt/symetry/symetry-rest.txt) has rtlm.option.spark.listener.host set correctly to your host.
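
For example (the address is illustrative; use the host where Jetty runs):

# /opt/symetry/symetry-rest.txt
rtlm.option.spark.listener.host=10.0.0.12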

Question: What does the following error message mean: java.lang.OutOfMemoryError: GC overhead limit exceeded.

Answer: Increase your worker memory using the Spark configuration parameters.
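
One way to do this in a standalone cluster (the value is illustrative):

# $SPARK_HOME/conf/spark-env.sh on each worker node
export SPARK_WORKER_MEMORY=16g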

Question: What does the following error message mean: 15/08/17 17:43:47 ERROR WorkerWatcher: Error was: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkWorker@boson.local:49991

Answer: This error is most likely caused by a lack of memory. Verify the worker logs and increase your worker memory.

Question: I see a java.net.BindException: Address already in use message in my log.

Answer: You can usually ignore this message.
