Data Source API
Using a Data Source
In SymetryML, a data source is an abstraction of a CSV file that resides somewhere and that can be used by:
SymetryML projects to learn new data
Models to make predictions and assessments
Encoders to update their internal encoding tables
SymetryML supports various types of data sources:
Secure File Transfer Protocol (SFTP)
HTTP/HTTPS URL
Amazon Simple Storage Service (S3)
Microsoft Azure Blob Storage
Google Cloud Storage
Oracle OCI Object Storage
Amazon RedShift
Spark processing: Amazon S3, Google Cloud Storage, Oracle OCI Object Storage, and Microsoft Azure Blob Storage data sources can be processed in parallel by leveraging a Spark cluster.
SymetryML data source plugins
JDBC
Local data source, which allows browsing the local file system of the Jetty web server with the same privileges as the user running the Jetty web server.
Required DSInfo Fields
type
Type of Data Source
- Secure FTP (SFTP) data source = sftp
- HTTP/HTTPS data source = http
- Amazon S3 = s3
- Oracle OCI Object Storage with S3 Compatibility = s3oci
- Google Cloud Storage = gcs
- Amazon Redshift = redshift
- Data source plugin (e.g., JDBC) = jdbc
- Local file = localfile
- Amazon Elastic Map Reduce = emr
- Microsoft Azure Blob Storage = abs
name
Name of the data source.
info
Additional Information Stored in Data Source
The info field of a data source contains specific information based on the type of data source. The following tables describe these fields for each type of data source.
HTTP/HTTPS Data Source
path
http:// or https:// URL.
Secure FTP (SFTP)
path
Path to the file on the server.
sftpuser
User name used to connect to the SFTP server.
sftppasswd
User password used to connect to the SFTP server.
sftphost
Host to which you want to connect.
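For illustration, a minimal sketch of how these DSInfo fields might be assembled for an SFTP data source; the type/name/info layout follows the tables above, and all concrete values are hypothetical placeholders:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the DSInfo fields for an SFTP data source, assuming the
// type/name/info layout described above. All values are hypothetical.
public class SftpDsInfoExample {
    public static void main(String[] args) {
        Map<String, String> info = new LinkedHashMap<>();
        info.put("path", "/data/iris.csv");          // file on the SFTP server
        info.put("sftpuser", "analyst");             // SFTP user name
        info.put("sftppasswd", "********");          // SFTP password
        info.put("sftphost", "sftp.example.com");    // SFTP host

        Map<String, Object> dsInfo = new LinkedHashMap<>();
        dsInfo.put("type", "sftp");                  // data source type code from the table above
        dsInfo.put("name", "my-sftp-source");        // data source name
        dsInfo.put("info", info);                    // type-specific fields
        System.out.println(dsInfo);
    }
}
```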
Amazon S3
path
Path to the file on the server, excluding the Amazon S3 bucket.
s3accessKey
Amazon S3 access key to use to connect to S3.
s3secretKey
Amazon S3 secret key to use to connect to S3.
s3bucket
Amazon S3 bucket to use.
Oracle OCI Object Storage with S3 Compatibility
path
Path to the file on the server, excluding the Oracle OCI Object Storage bucket.
s3accessKey
Oracle OCI Object Storage access key to use to connect to Oracle OCI Object Storage.
s3secretKey
Oracle OCI Object Storage secret key to use to connect to Oracle OCI Object Storage.
s3bucket
Oracle OCI Object Storage bucket to use.
ocinamespace
Oracle OCI Object Storage namespace.
ociregion
Oracle OCI Object Storage region.
Google Cloud Storage
path
Path to the file.
gcsaccessKey
Access key used to connect to Google Cloud Storage.
gcssecretKey
Secret key used to connect to Google Cloud Storage.
gcsbucket
GCS bucket to use.
gcsproject
GCS project to use.
gcsmarker
Optional marker parameter indicating where in the GCS bucket to begin listing. The list will only include keys that occur lexicographically after the marker.
gcsdelimiter
GCS file/folder delimiter. Default: /
Microsoft Azure Blob Storage
path
Path to the file.
azure.credentials.connection.string
Connection string that specifies the credentials used to authorize access to Azure Blob Storage. Provide exactly one of: connection string, account key, or SAS token.
azure.account.name
Name of the Azure account to use.
azure.credentials.sharedkey.account.key
Account key that specifies the credentials used to authorize access to Azure Blob Storage. Provide exactly one of: connection string, account key, or SAS token.
azure.credentials.sharedkey.sas.token
SAS token (account or service) that specifies the credentials used to authorize access to Azure Blob Storage. Provide exactly one of: connection string, account key, or SAS token.
azure.blob.container.name
Name of the Azure Blob Storage container that contains the blob.
azure.blob.inputstream.chunk.size.max.bytes
Maximum size in bytes of each chunk of data when reading the blob contents chunk by chunk. Default: 4194304
azure.blob.path.delimiter
String that separates elements of the path to the blob file. Default: /
azure.blob.list.marker
Marker that specifies the beginning of the next page of a list of Azure Blob Storage items to fetch from Azure. This property is used internally.
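For illustration, a hedged sketch of an Azure Blob Storage info map that authenticates with a connection string (an account key or SAS token could be used instead); all values are hypothetical placeholders:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of an Azure Blob Storage data source info map using a connection
// string for authentication. All values are hypothetical placeholders.
public class AzureBlobInfoExample {
    public static void main(String[] args) {
        Map<String, String> info = new LinkedHashMap<>();
        info.put("path", "datasets/iris.csv");
        info.put("azure.account.name", "mystorageaccount");
        info.put("azure.blob.container.name", "my-container");
        // Provide exactly one of: connection string, account key, or SAS token.
        info.put("azure.credentials.connection.string",
                 "DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=...;EndpointSuffix=core.windows.net");
        info.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```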
Amazon Redshift
path
Name of the table to use.
rsuser
Redshift database user.
rspasswd
Redshift user password.
rsurl
Redshift connection URL.
Spark Map Reduce
sparkmaster
Address of the Spark cluster master.
spark.job.process.jvm.heap.size.min
Minimum JVM heap size used for the Spark driver process launched by the Jetty REST server. Default: 1024m
spark.job.process.jvm.heap.size.max
Maximum JVM heap size used for the Spark driver process launched by the Jetty REST server. Default: 2048m
Any Spark parameter, such as spark.executor.memory or spark.executor.cores, can also be used. To pass such a parameter, prefix it with 'sml.sparkenv.' as in the following examples:
- sml.sparkenv.spark.executor.cores
- sml.sparkenv.spark.cores.max
spark.automl.sample.random.seed
If AutoML is used, sets the random seed used to select a random sample of tuples from the data source to bootstrap the AutoML environment.
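As an illustration, a hedged sketch of how the Spark-related parameters above might be set in a data source's info map; the master address and values are hypothetical, and the storage-specific fields (for example, the S3 keys) would be supplied alongside them:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of Spark-related parameters for a Spark data source's info map.
// All values are hypothetical; plain Spark settings are passed through by
// prefixing them with "sml.sparkenv." as described above.
public class SparkInfoExample {
    public static void main(String[] args) {
        Map<String, String> info = new LinkedHashMap<>();
        info.put("sparkmaster", "spark://spark-master.example.com:7077");
        info.put("spark.job.process.jvm.heap.size.min", "1024m");
        info.put("spark.job.process.jvm.heap.size.max", "2048m");
        // Arbitrary Spark settings are forwarded via the sml.sparkenv. prefix:
        info.put("sml.sparkenv.spark.executor.cores", "4");
        info.put("sml.sparkenv.spark.executor.memory", "8g");
        info.put("spark.automl.sample.random.seed", "42");
        info.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```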
Spark Map Reduce Data Source Type
The following matrix shows which data source type name to use for each combination of data source and Spark version (N indicates that the combination is not supported):
Oracle OCI S3: Spark 2.4.5 = N, Spark 2.4.6 = N, Spark 3.0.1 = N, Spark 3.0.2 = sparkocis3_mr_3_0_2
Amazon S3: Spark 2.4.5 = sparks3_mr_2_4_5, Spark 2.4.6 = sparks3_mr_2_4_6, Spark 3.0.1 = sparks3_mr_3_0_1, Spark 3.0.2 = sparks3_mr_3_0_2
Google Cloud Storage: Spark 2.4.5 = sparkgcs_mr_2_4_5, Spark 2.4.6 = sparkgcs_mr_2_4_6, Spark 3.0.1 = sparkgcs_mr_3_0_1, Spark 3.0.2 = sparkgcs_mr_3_0_2
Microsoft Azure Blob: Spark 2.4.5 = sparkabs_mr_2_4_5, Spark 2.4.6 = sparkabs_mr_2_4_6, Spark 3.0.1 = sparkabs_mr_3_0_1, Spark 3.0.2 = sparkabs_mr_3_0_2
JDBC
driver
JDBC driver to use.
host
Host of the database server.
port
Port of the database server.
database
Name of the database to connect to.
user
User name used to connect to the database.
password
User password used to connect to the database.
Amazon EMR
chunksize
Optional
SymetryML processes the data chunk by chunk. This parameter specifies the chunk size. Default: 5000
emr.client.aws.region
Optional
AWS region of the EMR cluster. Default: us-east-1
emr.cluster.ec2.key.name
Required
EC2 key pair name for the cluster.
emr.cluster.ec2.subnet.id
Optional
EC2 subnet id for the cluster. Default: null
emr.cluster.instance.count
Required
Number of EC2 instances in the EMR cluster.
emr.cluster.instance.master.type
Required
Instance type of the master EC2 instance.
emr.cluster.instance.slave.type
Required
Instance type of the slave EC2 instances.
emr.cluster.log.storage.enable
Optional
Boolean that enables storing the EMR logs. Default: false
emr.cluster.log.storage.uri
Optional
URI of the EMR logs. Default: null
emr.job.flow.role
Optional
EMR role for EC2 that is used by EC2 instances within the cluster. Default: AWS EMR_EC2_DefaultRole
emr.s3.job.bucket.name
Required
S3 bucket that stores the files needed for the Spark cluster job; it can include the directory that stores the EMR logs.
emr.service.role
Optional
Amazon EMR role, which defines the allowable actions for Amazon EMR. Default: AWS EMR_DefaultRole
path
Required
Path of the data source to process; it can be a folder. This is the data path without the bucket part.
s3accessKey
Required
AWS access key
s3bucket
Required
AWS S3 bucket where data resides
s3marker
Optional
Marker parameter indicating where in the S3 bucket to begin listing. The list will only include keys that occur lexicographically after the marker.
s3secretKey
Required
AWS secret key
sml.sparkenv.*
Required
Allows specifying any Apache Spark environment configuration setting, such as spark.cores.max (use sml.sparkenv.spark.cores.max) or spark.executor.memory (use sml.sparkenv.spark.executor.memory).
sparksymproject
Required
Name of the project.
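Below is a hedged sketch of an EMR data source info map covering the required fields above plus a few common optional ones; every value is a hypothetical placeholder:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of an EMR data source info map covering the required fields above.
// All values are hypothetical placeholders.
public class EmrInfoExample {
    public static void main(String[] args) {
        Map<String, String> info = new LinkedHashMap<>();
        info.put("emr.cluster.ec2.key.name", "my-ec2-keypair");
        info.put("emr.cluster.instance.count", "3");
        info.put("emr.cluster.instance.master.type", "m5.xlarge");
        info.put("emr.cluster.instance.slave.type", "m5.xlarge");
        info.put("emr.s3.job.bucket.name", "my-emr-job-bucket");
        info.put("path", "datasets/iris/");            // data path without the bucket part
        info.put("s3accessKey", "AKIA...");            // AWS access key (placeholder)
        info.put("s3secretKey", "********");           // AWS secret key (placeholder)
        info.put("s3bucket", "my-data-bucket");
        info.put("sparksymproject", "iris-project");   // name of the project
        info.put("sml.sparkenv.spark.executor.memory", "8g"); // forwarded Spark setting
        // Optional tuning parameters:
        info.put("chunksize", "5000");
        info.put("emr.client.aws.region", "us-east-1");
        info.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```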
Additional CSV Options
You can specify additional parameters that describe the type of CSV file. Add the following parameters to a data source to change how SymetryML parses your data:
csv_entry_separator
Specifies which character to use as the delimiter between the entries of a given tuple.
csv_quote_character
Specifies the quote character.
csv_strict_quotes
Setting this option to true discards characters outside the quotes. If there are no quotes between delimiters, an empty string is generated.
csv_header_missing
Specifies that this data source does not have any header. SymetryML can then generate a header automatically.
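For example, a hedged sketch of CSV parsing options added to a data source's info map; the chosen separator and quote character are arbitrary illustrations:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of CSV parsing options added to a data source's info map.
// The separator and quote character shown are arbitrary illustrations.
public class CsvOptionsExample {
    public static void main(String[] args) {
        Map<String, String> info = new LinkedHashMap<>();
        info.put("csv_entry_separator", ";");   // entries separated by semicolons
        info.put("csv_quote_character", "\"");  // values quoted with double quotes
        info.put("csv_strict_quotes", "true");  // discard characters outside quotes
        info.put("csv_header_missing", "true"); // no header row; one will be generated
        info.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```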
Additional Information on Spark S3 Data Source
SymetryML can leverage a Spark cluster to significantly speed up the processing of large amounts of data. Currently, your data must reside on Amazon S3. Depending on the size of your data, it may take more or less time for the job to start, as the Spark cluster must compute the partitions of your data before starting the job. Consequently, if your data is very large, this may take a few minutes.
Best practices for Spark S3 Data Source:
Performance may vary depending on Amazon resource utilization when you run your job.
Be sure all executor nodes in your cluster reside in the same Amazon EC2 placement group.
About Data Source Plugins (DSPlugins)
The SymetryML data source API allows you to create a new data source in the form of a Java library (JAR) that can be added to the server. Instead of transforming data into CSV files, for example, you can write a DS plugin that reads the data natively.
Data Source Encryption
Data sources might contain sensitive information that should never be passed in the clear. To avoid having to use HTTPS for these services, the SymetryML REST API forces you to pass such information in encrypted form. This can be done easily, as each SymetryML secret key is also a 128-bit Advanced Encryption Standard (AES) secret key.
Extract the JSON string from the DSInfo data structure.
Encrypt the JSON string representation using your SymetryML secret key:
Initialization vector in Base 64: LzM5QUtXZXWHm7HJ4wAePg==
Block cipher algorithm: AES/CBC/PKCS5Padding
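As a sketch of the encryption step, assuming the SymetryML secret key is available as raw 16-byte AES key material; the key value and JSON payload below are placeholders, while the initialization vector and cipher transformation are the ones listed above:

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch: encrypt a DSInfo JSON string with AES/CBC/PKCS5Padding using the
// documented initialization vector. The secret key and JSON payload below
// are placeholders; use your own SymetryML secret key.
public class DsInfoEncryptionExample {
    public static void main(String[] args) throws Exception {
        String dsInfoJson = "{\"type\":\"sftp\",\"name\":\"my-sftp-source\",\"info\":{}}";
        byte[] keyBytes = "0123456789abcdef".getBytes(StandardCharsets.UTF_8); // 128-bit placeholder key
        byte[] iv = Base64.getDecoder().decode("LzM5QUtXZXWHm7HJ4wAePg==");     // IV from the documentation

        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE,
                    new SecretKeySpec(keyBytes, "AES"),
                    new IvParameterSpec(iv));
        byte[] encrypted = cipher.doFinal(dsInfoJson.getBytes(StandardCharsets.UTF_8));

        // Base64-encode the ciphertext so it can be carried in a REST request.
        System.out.println(Base64.getEncoder().encodeToString(encrypted));
    }
}
```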
Data Source Create
This API function creates a new data source.
URL
HTTP Responses
202
CREATED
Success.
409
CONFLICT
A data source with the specified name already exists.
HTTP Response Entity
None.
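A non-authoritative sketch of invoking this function with Java's built-in HTTP client; the endpoint path is a hypothetical placeholder (use the URL documented above), and the body is assumed to be the encrypted DSInfo produced as described under Data Source Encryption:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch of creating a data source over REST. The endpoint path below is a
// hypothetical placeholder; use the URL documented for this function.
// encryptedDsInfo is assumed to be the Base64 AES ciphertext of the DSInfo JSON.
public class CreateDataSourceExample {
    public static void main(String[] args) throws Exception {
        String encryptedDsInfo = "..."; // produced as in the encryption sketch above
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://localhost:8080/hypothetical/dsrc/create")) // placeholder URL
                .POST(HttpRequest.BodyPublishers.ofString(encryptedDsInfo))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // Check the status code against the documented responses
        // (e.g., 409/CONFLICT means a data source with that name already exists).
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```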
Sample Request/Response
Data Source Update
This API function updates an existing data source.
URL
HTTP Responses
200
OK
Success.
404
NOT FOUND
A data source with the specified name does not exist.
HTTP Response Entity
None.
Sample Request/Response
List Customer Data Sources
This API function returns all the data sources that belong to a user.
URL
HTTP Responses
200
OK
Success.
HTTP Response Entity
{"statusCode":"OK","statusString":"OK","values":{"stringList":{"values":["Iris_SymetryML.csv-predict.csv:s3","h1:http","BigData11g_Test.csv:http","Iris_SymetryML.csv:s3","Smaato_Bids_20130812_CTR.csv:s3"]}}}
Sample Request/Response
Delete Data Source
This API function deletes a data source.
URL
HTTP Responses
200
OK
Success.
409
CONFLICT
Data source cannot be deleted. A SymetryML project might be using the data source. The response contains an error string with further details.
HTTP Response Entity
None
Data Source Information
URL
HTTP Responses
200
OK
Success.
HTTP Response Entity
DSInfo (encrypted)
Sample Request/Response
Data Source Browsing
This API function lists the contents of a remote data source directory.
URL
HTTP Responses
200
OK
Success.
HTTP Response Entity
Contains listing information about the requested directory or folder.
Sample Request/Response
Fetching Sample Data 1
URL
HTTP Responses
200
OK
Success.
HTTP Response Entity
Contains a sample of the data source up to 128 lines.
Sample Request/Response
Fetching Sample Data 2
URL
HTTP Responses
200
OK
Success.
HTTP Response Entity
DataFrame
Sample Request/Response
SymetryML Project Data Source API
Add Data Source to a SymetryML project
This API function lets you add a data source to a project.
URL
HTTP Responses
200
OK
Success.
HTTP Response Entity
None
Remove Data Source from a SymetryML project
This API function lets you remove a data source from a project.
URL
HTTP Responses
200
OK
Success.
HTTP Response Entity
None
Learning Data from a Data Source
This API function lets you learn from a previously created data source.
URL
HTTP Responses
200
OK
Success. Includes an HTTP Location header specifying the location of the job ID that was created to handle the request. Example: {"statusCode":"ACCEPTED","statusString":"Job Created","values":{}}
HTTP Response Entity
None
Sample Request/Response
Forgetting Data from a Data Source
This API function lets you forget data from a previously created data source.
URL
HTTP Responses
200
OK
Success. Includes an HTTP Location header specifying the location of the job ID that was created to handle the request. For example: {"statusCode":"ACCEPTED","statusString":"Job Created","values":{}}
HTTP Response Entity
None
Sample Request/Response
Prediction Based on a Data Source
After a model is built, you can use this API function to make predictions using a data source. This action can be performed on very large files if they reside on Amazon S3. A prediction file is created that contains the original rows, along with additional prediction information based on the type of model used.
URL
Query Parameters
indsname
Required
Data source to use as input file for prediction.
outdsname
Required
Data source to use as output file for prediction.
impute
Optional
Boolean parameter specifying whether to impute missing values.
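As a hedged illustration, the query parameters above might be assembled onto the prediction URL as follows; the base URL and data source names are hypothetical placeholders:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Sketch of assembling the prediction query parameters. The base URL is a
// hypothetical placeholder; indsname/outdsname refer to existing data sources.
public class PredictionQueryExample {
    public static void main(String[] args) {
        String base = "https://localhost:8080/hypothetical/project/predict"; // placeholder URL
        String url = base
                + "?indsname=" + URLEncoder.encode("my-input-ds", StandardCharsets.UTF_8)
                + "&outdsname=" + URLEncoder.encode("my-output-ds", StandardCharsets.UTF_8)
                + "&impute=true";
        System.out.println(url);
    }
}
```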
HTTP Responses
202
ACCEPTED
Success. Includes an HTTP Location header specifying the location of the job ID that was created to handle the request.
500
INTERNAL SERVER ERROR
If the server refuses to accept the new job, it notifies the client with the error "Job execution was refused by server."
HTTP Response Entity
None
Sample Request/Response
Encoder Data Source API
Updating an Encoder with a Data Source
This API function updates an Encoder with data from a data source.
URL
HTTP Responses
202
ACCEPTED
Success. Includes an HTTP Location header specifying the location of the job ID that was created to handle the request.
500
INTERNAL SERVER ERROR
If the server refuses to accept the new job, it notifies the client with the error "Job execution was refused by server."
HTTP Response Entity
None
Sample Request/Response
Listing a Data Source Used by an Encoder
This API function lists the data source(s) that were used to update an Encoder.
URL
HTTP Responses
202
ACCEPTED
Success. Includes an HTTP Location header specifying the location of the job ID that was created to handle the request.
HTTP Response Entity
StringList
Sample Request/Response
Data Source Job Status
When invoking a JobStatus for a job that was initiated for learning, forgetting, or making predictions based on a data source, the response might contain an entity. The following sections describe these cases.
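A hedged sketch of polling such a job, assuming the job URL is taken from the HTTP Location header returned when the job was created; the URL and polling interval are illustrative only:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch of polling a data source job. The job URL is assumed to come from the
// HTTP Location header returned when the job was created; the polling interval
// and completion check are illustrative only.
public class JobStatusPollingExample {
    public static void main(String[] args) throws Exception {
        String jobUrl = "https://localhost:8080/hypothetical/job/1234"; // placeholder from Location header
        HttpClient client = HttpClient.newHttpClient();
        while (true) {
            HttpResponse<String> response = client.send(
                    HttpRequest.newBuilder(URI.create(jobUrl)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() == 200) {   // 200/OK: job finished; entity (if any) is in the body
                System.out.println(response.body());
                break;
            }
            Thread.sleep(2000);                   // 202/ACCEPTED: still running, poll again
        }
    }
}
```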
Learning a Data Source
HTTP Responses
200
OK
No entity.
202
ACCEPTED
Forgetting a Data Source
HTTP Responses
200
OK
No entity.
202
ACCEPTED
Prediction Based on Data Source
HTTP Responses
200
OK
DataFrame that contains a sample of the predictions (up to 128 lines). Because Amazon files can be very large, it is not possible to return the prediction result file in its entirety within a REST call. Use your favorite tool to fetch the prediction results from the data source (S3 or SFTP). The prediction result file contains all the original file columns, plus the additional prediction column for each row. Any additional columns depend on the type of model used to make the predictions.
202
ACCEPTED