Exploration API
The Exploration API lets you request statistical metrics for SymetryML projects using univariate and bivariate analyses, hypothesis tests (ztest, ttest, ftest, ANOVA), and information gain. The body of the request contains an ExploreContext, which in turn is a list of MLContext, one for each feature for which exploration data is needed.
If the project is part of a federation, it may be possible to request that the exploration result be fetched from a given node. This is done by specifying which peer to use with the following extra parameter inside an MLContext. Please consult the section for details.
fed_peer_uuid_for_explore
Peer ID to use for exploration results.
metric
Required
async
Optional
200
OK
Success.
400
BAD REQUEST
Unknown SymetryML project. {"statusCode":"BAD_REQUEST","statusString":" Cannot Find SYMETRYML id[r2] for Customer id [c1]","values":{}}
{"statusCode":"OK","statusString":"OK","values":{"KSVDMap":{"values":[{"count":150.0,"mean":5.843333333333334,"variance":0.6811222222222204,"skewness":0.31808651443435426,"stddev":0.8253012917851398}]}}}
A project info entity is also returned to reflect the state of the project that might have changed.
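As a minimal sketch, the sample response above can be unpacked with Python's standard `json` module. The helper name `extract_metric_maps` is hypothetical and not part of the SymetryML API; the path `values.KSVDMap.values` is taken from the sample response.

```python
import json

# Sample response body copied from the example above.
raw = (
    '{"statusCode":"OK","statusString":"OK","values":{"KSVDMap":{"values":['
    '{"count":150.0,"mean":5.843333333333334,"variance":0.6811222222222204,'
    '"skewness":0.31808651443435426,"stddev":0.8253012917851398}]}}}'
)


def extract_metric_maps(body: str) -> list:
    """Return the list of metric maps stored under values.KSVDMap.values.

    Hypothetical helper: it only relies on the JSON layout of the sample above.
    """
    payload = json.loads(body)
    if payload.get("statusCode") != "OK":
        raise ValueError(f"request failed: {payload.get('statusString')}")
    return payload["values"]["KSVDMap"]["values"]


maps = extract_metric_maps(raw)
for index, metrics in enumerate(maps):
    # Each map holds string keys and double values, one map per requested feature.
    print(index, metrics["count"], metrics["mean"], metrics["stddev"])
```

Raising on a non-OK `statusCode` keeps error payloads, such as the BAD_REQUEST example, from being mistaken for empty results.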
Depending on the metric query parameter, different information is returned in the KSVDMap JSON response entity. The KSVDMap JSON entity is an array of map<key, value>, where key is a string and value is an IEEE 64-bit double floating-point number. The following table enumerates the keys that the response will contain for all valid metric query parameters.
uni
Returns all univariate metrics for the attribute listed in each MLContext in the ExploreContext. If an MLContext has more than one attribute, only the first one is used.
- count = count of the attribute.
- mean = mean of the attribute.
- variance = variance of the attribute.
- stddev = standard deviation of the attribute.
- skewness = skewness of the attribute.
- stderr = the standard deviation divided by the square root of the count for this feature.
- meanMarginErrorC95 = margin of error of the mean at the 95% confidence level.
uni (with histogram)
- min = minimum value for this feature.
- max = maximum value for this feature.
- median = median value for this feature.
bi
Returns all bivariate metrics for the two attributes listed in each MLContext in the ExploreContext. Each MLContext should contain at least two attribute indexes. If there are more than two, only the first two are used.
- covar = covariance between the two attributes.
- linCorr = linear correlation between the two attributes.
- condMean = conditional mean
- condCount = conditional count
- condVariance = conditional variance
ztest
Returns the z-test metrics using all of the MLContext in the ExploreContext request body.
- zn1 = count between attribute and target 1.
- zn2 = count between attribute and target 2.
- zm1 = mean between attribute and target 1.
- zm2 = mean between attribute and target 2.
- zs1 = variance between attribute and target 1.
- zs2 = variance between attribute and target 2.
- z = z test value.
- zp = z test value probability.
ztestp
Returns the z-test-proportion metrics using all of the MLContext in the ExploreContext request body.
- z = z test value.
- zp = z test value probability.
ztestmu
Returns the z-test - against a known mean - metrics using all of the MLContext in the ExploreContext request body.
- z = z test value.
- zp = z test value probability.
ftest
Returns the f-test metrics using all of the MLContext in the ExploreContext request body.
- fn1 = count between attribute and target 1.
- fn2 = count between attribute and target 2.
- fm1 = mean between attribute and target 1.
- fm2 = mean between attribute and target 2.
- fs1 = variance between attribute and target 1.
- fs2 = variance between attribute and target 2.
- fdf1 = degree of freedom between attribute and target 1.
- fdf2 = degree of freedom between attribute and target 2.
- f = F test value.
- fp = F test value probability.
ftestmu
Returns the f-test - against a known variance - metrics using all of the MLContext in the ExploreContext request body.
- f = F test value.
- fp = F test value probability.
ttest
Returns the t-test metrics using all of the MLContext in the ExploreContext request body.
- tn1 = count between attribute and target 1.
- tn2 = count between attribute and target 2.
- tm1 = mean between attribute and target 1.
- tm2 = mean between attribute and target 2.
- ts1 = variance between attribute and target 1.
- ts2 = variance between attribute and target 2.
- t = t test value.
- tdf = t test value degree of freedom.
- tp = t test value probability.
- tu = t test value for unequal variance.
- tdfu = t test value degree of freedom for unequal variance.
- tpu = t test value probability for unequal variance.
ttestmu
Returns the T-test - against a known mean - metrics using all of the MLContext in the ExploreContext request body.
- t = T test value.
- tp = T test value probability.
anova
Returns the ANOVA metrics using all of the MLContext in the ExploreContext request body.
- SSb = sum square between groups.
- SSw = sum square within groups.
- Dfb = degree of freedom between groups.
- Dfw = degree of freedom within groups.
- MSb = mean square between groups.
- MSw = mean square within groups.
- F = F value from F distribution.
- Ssgamma = total sum of square.
- Dfgamma* = total degree of freedom.
- P = probability of F.
chi2
Returns chi-square information using all of the MLContext in the ExploreContext request body.
- chiStat = chi-square value.
- df = degree of freedom.
- pval = probability.
- coef = contingency coefficient.
Additionally, for all the pairwise values, the observed count is returned as a value using the following format for the key. Example: obs$1:4
gain
Returns the information gain for a given attribute (the input) given another attribute (the target).
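The anova keys listed above are tied together by the standard one-way ANOVA identities. The sketch below shows those textbook relations only; it is not a description of how SymetryML computes them internally. The function name and the input numbers are illustrative assumptions.

```python
def anova_from_sums(ss_between, ss_within, df_between, df_within):
    """Derive mean squares, F, and totals from sums of squares and degrees of freedom.

    Textbook one-way ANOVA relations; names mirror the SSb/SSw/Dfb/Dfw keys.
    """
    if df_between <= 0 or df_within <= 0:
        raise ValueError("degrees of freedom must be positive")
    ms_between = ss_between / df_between   # MSb = SSb / Dfb
    ms_within = ss_within / df_within      # MSw = SSw / Dfw
    return {
        "MSb": ms_between,
        "MSw": ms_within,
        "F": ms_between / ms_within,       # F = MSb / MSw
        "SStotal": ss_between + ss_within, # total sum of squares
        "DFtotal": df_between + df_within, # total degrees of freedom
    }


# Illustrative numbers only, not taken from a real project.
result = anova_from_sums(ss_between=20.0, ss_within=60.0, df_between=2, df_within=12)
print(result)
```

The probability P additionally requires the F distribution, which the Python standard library does not provide, so it is left out here.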
Note: ztestmu, ttestmu, and ftestmu require the following parameters to be set in MLContext.extraParameters, respectively: sml_explore_ztest_known_mu, sml_explore_ttest_known_mu, and sml_explore_ftest_known_sigma.
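To illustrate what a test against a known mean involves, here is a sketch of the textbook one-sample z-test built from the univariate metrics (count, mean, stddev). Whether the service reports a one-sided or two-sided probability in zp is not stated in this document, so the two-sided value below is an assumption, as is the known mean of 5.8.

```python
import math


def z_test_known_mean(count, mean, stddev, known_mu):
    """Return (z, two-sided p) for a one-sample z-test against known_mu."""
    if count <= 0 or stddev <= 0:
        raise ValueError("count and stddev must be positive")
    stderr = stddev / math.sqrt(count)   # same definition as the stderr key
    z = (mean - known_mu) / stderr
    # Two-sided tail probability of the standard normal: 2 * (1 - Phi(|z|)).
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_sided


# count, mean, and stddev come from the sample uni response; 5.8 is hypothetical.
z, zp = z_test_known_mean(150.0, 5.843333333333334, 0.8253012917851398, 5.8)
print(z, zp)
```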
Hypothesis Test: MLContext Content
Z test. Compare means of 2 continuous features.
- inputAttributes: contains 2 continuous attribute ids.
Z test. Compare mean of 1 continuous feature against a known mean.
- inputAttributes: contains 1 continuous attribute id.
- extraParameters: contains one key, sml_explore_ztest_known_mu, with the known mean.
Z test. Compare means of a continuous feature conditioned on 2 binary features.
- inputAttributes: contains 3 attribute ids; the first one is continuous, followed by 2 binary ids.
Z test proportion. Compare two proportions originating from two binary features.
- inputAttributes: contains 2 binary attribute ids.
F test. Compare variance of 2 continuous features.
- inputAttributes: contains 2 continuous attribute ids.
F test. Compare variance of a continuous feature against a known variance.
- inputAttributes: contains 1 continuous attribute id.
- extraParameters: contains one key, sml_explore_ftest_known_sigma, with the known variance / sigma.
F test. Compare variances of a continuous feature conditioned on 2 binary features.
- inputAttributes: contains 3 attribute ids; the first one is continuous, followed by 2 binary ids.
T test. Compare means of 2 continuous features.
- inputAttributes: contains 2 continuous attribute ids.
T test. Compare mean of 1 continuous feature against a known mean.
- inputAttributes: contains 1 continuous attribute id.
- extraParameters: contains one key, sml_explore_ttest_known_mu, with the known mean.
T test. Compare means of a continuous feature conditioned on 2 binary features.
- inputAttributes: contains 3 attribute ids; the first one is continuous, followed by 2 binary ids.
Anova. Compare means of 2 or more continuous features.
- inputAttributes: contains 2 or more continuous attribute ids.
Anova. Compare means of a continuous feature conditioned on 2 or more binary features.
- inputAttributes: contains 1 continuous attribute id.
- targets: contains 1 or more binary attribute ids.
Chi square. Determine the association between binary features.
Chi square. Determine the association between binary features based on category names.
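The variations above differ only in how each MLContext is populated. The sketch below builds two of them as plain dictionaries. The field names inputAttributes and extraParameters and the key sml_explore_ztest_known_mu come from this page; the exact JSON shape (lists of integer ids, string-valued extra parameters), the helper names, and the attribute ids are assumptions for illustration.

```python
import json


def ztest_known_mean_context(attribute_id, known_mu):
    """MLContext for a Z test of 1 continuous feature against a known mean."""
    return {
        "inputAttributes": [attribute_id],
        # Assumption: extra parameter values are sent as strings.
        "extraParameters": {"sml_explore_ztest_known_mu": str(known_mu)},
    }


def ttest_conditioned_context(continuous_id, binary_id_a, binary_id_b):
    """MLContext for a T test of a continuous feature conditioned on 2 binary features."""
    # Order matters: the continuous attribute first, then the 2 binary attributes.
    return {"inputAttributes": [continuous_id, binary_id_a, binary_id_b]}


# Attribute ids 0, 4, and 5 are hypothetical.
contexts = [
    ztest_known_mean_context(0, 5.8),
    ttest_conditioned_context(0, 4, 5),
]
print(json.dumps(contexts, indent=2))
```

How these MLContext objects are wrapped inside the ExploreContext is not shown here, because the wrapper's field name does not appear in this section.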
202
ACCEPTED
Job accepted.
400
BAD REQUEST
Unknown SymetryML project. {"statusCode":"BAD_REQUEST","statusString":" + Cannot Find SYMETRYML id[r2] for Customer id [c1]","values":{}}
This REST endpoint allows for the request of the principal component analysis (PCA) of a set of attributes.
explainedCovariance
Optional
This parameter controls how many eigenvectors/eigenvalues are returned, expressed as a fraction of explained covariance. The default is 1, which corresponds to 100% of explained covariance. To avoid an overly large response body, at most 100 eigenvalues/eigenvectors are returned.
The request body is an MLContext that contains information about the inputs to use. It can also contain the extra parameter pcaNumDimension to specify the maximum number of dimensions to return. The default value is 100.
200
OK
Success.
400
BAD REQUEST
Unknown SymetryML project. {"statusCode":"BAD_REQUEST","statusString":" + Cannot Find SYMETRYML id[r2] for Customer id [c1]","values":{}}
The response contains three keys:
pcaVectors
pcaValues
pcaSumValues
pcaVectors
pcaValues
pcaSumValues
A number representing the total sum of the PCA values. You can use this number to compute the percentage of covariance explained by a given eigenvector.
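The percentage mentioned above is each eigenvalue divided by pcaSumValues. A small sketch follows, assuming the eigenvalues have already been pulled out of the pcaValues structure into a flat list; the function name and the numbers are illustrative only.

```python
def explained_covariance(pca_values, pca_sum_values):
    """Return (per-component share, cumulative share) of explained covariance."""
    if pca_sum_values <= 0:
        raise ValueError("pca_sum_values must be positive")
    shares = [value / pca_sum_values for value in pca_values]
    cumulative = []
    running = 0.0
    for share in shares:
        running += share
        cumulative.append(running)
    return shares, cumulative


# Illustrative eigenvalues: only the first two of a decomposition whose total is 10.
shares, cumulative = explained_covariance([6.0, 3.0], 10.0)
for share, total in zip(shares, cumulative):
    print(f"{share:.1%} explained, {total:.1%} cumulative")
```

Dividing by pcaSumValues rather than by the sum of the returned values matters when explainedCovariance or pcaNumDimension truncates the result, because the returned eigenvalues then no longer add up to the total.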
This REST endpoint returns the singular value decomposition (SVD) of a set of attributes.
200
OK
Success.
400
BAD REQUEST
Unknown SymetryML project. {"statusCode":"BAD_REQUEST","statusString":" + Cannot Find SYMETRYML id[r2] for Customer id [c1]","values":{}}
A map<String, String> containing singular values along with their attribute names. Each key/value pair in the result map contains: key = a singular value computed by the SVD; value = the name of the attribute.
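Because the singular values arrive as string keys, they need to be converted to numbers before they can be compared. A minimal sketch, with hypothetical attribute names and values:

```python
def rank_by_singular_value(svd_map):
    """Return (singular value, attribute name) pairs, largest value first."""
    pairs = [(float(key), name) for key, name in svd_map.items()]
    return sorted(pairs, reverse=True)


# Hypothetical result map: key = singular value as a string, value = attribute name.
sample = {"25.3": "sepal_length", "0.002": "petal_width", "7.1": "sepal_width"}
ranked = rank_by_singular_value(sample)
for value, name in ranked:
    print(f"{name}: {value}")
```

Sorting the raw string keys would order them lexicographically ("7.1" after "25.3"), which is why the conversion to float comes first.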
This REST endpoint uses singular value decomposition to perform feature selection. It returns a subset of attributes that are not singular according to the SVD algorithm. The SVD algorithm might be run multiple times internally before the solution is reached.
200
OK
Success.
400
BAD REQUEST
Unknown SymetryML project. {"statusCode":"BAD_REQUEST","statusString":" + Cannot Find SYMETRYML id[r2] for Customer id [c1]","values":{}}
A map<String, String> containing singular values along with their attribute names. Each key/value pair in the result map contains: key = a singular value computed by the SVD; value = the name of the attribute.
This method returns the density estimate / histogram for any continuous or binary attribute in your SymetryML project. Histogram building must have been enabled for the target SymetryML project before invoking this REST endpoint. Histogram building for a project can be enabled in 2 ways:
hist_bins
Required
Number of bins in the histogram
hist_max
Optional
Maximum value of the histogram on the x axis
hist_min
Optional
Minimum value of the histogram on the x axis
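To make the three parameters concrete, the sketch below computes bin boundaries, taking hist_min as the lower bound and hist_max as the upper bound of the x axis. It assumes equal-width bins; how the service actually places its bins is not documented in this section, and the function name is hypothetical.

```python
def bin_edges(hist_bins, hist_min, hist_max):
    """Return the hist_bins + 1 boundaries of equal-width bins on [hist_min, hist_max]."""
    if hist_bins < 1:
        raise ValueError("hist_bins must be at least 1")
    if hist_max <= hist_min:
        raise ValueError("hist_max must be greater than hist_min")
    width = (hist_max - hist_min) / hist_bins
    return [hist_min + i * width for i in range(hist_bins + 1)]


# Illustrative request: 4 bins covering the range 0 to 8.
edges = bin_edges(4, 0.0, 8.0)
print(edges)
```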
200
OK
Success.
400
BAD REQUEST
Unknown SymetryML project. {"statusCode":"BAD_REQUEST","statusString":" + Cannot Find SYMETRYML id[r2] for Customer id [c1]","values":{}}
Metric to use for exploration. For more information, see the section
If set to true then the exploration will be done asynchronously and the result will be fetched using the
The response is a KSVDMap entity. This structure is an array of map<key,value>, where keys are strings and values are IEEE 64-bit double-precision floating-point numbers. One such map is added to the response for each MLContext in the request body. The keys in each map depend on the metric query parameter. For more information, see the section .
For project with histogram enabled, additional information is returned. Please consult to learn how to enable histogram for a project.
It's possible to perform variations on the ztest, ftest, ttest, anova, and chi-squared tests. Depending on which test needs to be performed, the MLContext must be populated differently.
- inputAttributes: contains 2 or more binary attribute ids.
- targets: contains 2 or more binary attribute ids. See for example.
- inputAttributes: list of category names. See for example.
This REST endpoint returns the variance inflation factor for a list of features. This API function is asynchronous. If it succeeds, it returns a 202 response, along with a Location header that specifies the job URL. For more information about SymetryML asynchronous jobs, please refer to the section on .
The response is a map: each key is an input index as specified in the request body, and each value is the variance inflation factor for that index.
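A sketch of how such a map might be screened once the job result has been fetched. It relies on the standard definition VIF = 1 / (1 - R²); the threshold of 10 is a common rule of thumb, not a SymetryML setting, and the sample values and helper names are hypothetical. Keys are strings because JSON object keys always are.

```python
def high_vif(vif_by_index, threshold=10.0):
    """Return the input indexes whose variance inflation factor exceeds threshold."""
    flagged = [index for index, vif in vif_by_index.items() if vif > threshold]
    return sorted(flagged, key=int)


def implied_r_squared(vif):
    """Invert VIF = 1 / (1 - R^2) to recover the R^2 of the auxiliary regression."""
    if vif < 1.0:
        raise ValueError("a variance inflation factor is never below 1")
    return 1.0 - 1.0 / vif


# Hypothetical job result: input index -> variance inflation factor.
sample = {"0": 1.2, "7": 48.0, "3": 12.5}
print(high_vif(sample))
print(implied_r_squared(sample["3"]))
```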
This API function is asynchronous. If it succeeds, it returns a 202 response, along with a Location header that specifies the job URL. For more information about SymetryML asynchronous jobs, please refer to the section on .
The value for the pcaVectors and pcaValues keys is a Matrix stored in a DataFrame JSON data structure. The pcaSumValues key consists of a number that is the sum of all the Eigen values. See for information on the DataFrame JSON structure.
A DataFrame containing the PCA vectors.
A DataFrame containing the PCA values.
This API function is asynchronous. If it succeeds, it returns a 202 response, along with a Location header that specifies the job URL. For more information about SymetryML asynchronous jobs, please refer to the section .
The request body must contain the list of input attribute ids to be used when computing the singular value decomposition. Refer to section for details.
This API function is asynchronous. If it succeeds, it returns a 202 response, along with a Location header that specifies the job URL. For more information about SymetryML asynchronous jobs, please refer to the section .
The request body must contain the list of input attribute ids to be used when computing the singular value decomposition. Refer to section for details.
- When creating the project, with the enableHistogram query parameter. See endpoint.
- By explicitly invoking the rest endpoint on any given project.
For each feature / attribute for which a histogram is requested, a is needed inside the . The following table describes the extra parameters that can be used:
see below