start_data_quality_ruleset_evaluation_run

Glue.Client.start_data_quality_ruleset_evaluation_run(**kwargs)

After you have a ruleset definition (either recommended or your own), call this operation to evaluate the ruleset against a data source (a Glue table). The evaluation computes results that you can retrieve with the GetDataQualityResult API.

See also: AWS API Documentation

Request Syntax

response = client.start_data_quality_ruleset_evaluation_run(
    DataSource={
        'GlueTable': {
            'DatabaseName': 'string',
            'TableName': 'string',
            'CatalogId': 'string',
            'ConnectionName': 'string',
            'AdditionalOptions': {
                'string': 'string'
            }
        }
    },
    Role='string',
    NumberOfWorkers=123,
    Timeout=123,
    ClientToken='string',
    AdditionalRunOptions={
        'CloudWatchMetricsEnabled': True|False,
        'ResultsS3Prefix': 'string',
        'CompositeRuleEvaluationMethod': 'COLUMN'|'ROW'
    },
    RulesetNames=[
        'string',
    ],
    AdditionalDataSources={
        'string': {
            'GlueTable': {
                'DatabaseName': 'string',
                'TableName': 'string',
                'CatalogId': 'string',
                'ConnectionName': 'string',
                'AdditionalOptions': {
                    'string': 'string'
                }
            }
        }
    }
)
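
The request syntax above lists every accepted parameter, but only DataSource, Role, and RulesetNames are marked required. As a minimal sketch, the helper below (start_minimal_run is a hypothetical name, not part of boto3) passes just those three. It takes the client as an argument, so the caller decides how the client is created, for example with boto3.client('glue').

```python
def start_minimal_run(client, database, table, role, ruleset_name):
    """Start an evaluation run using only the required parameters.

    `client` is assumed to be a Glue client, e.g. boto3.client("glue").
    Returns the RunId string from the response.
    """
    response = client.start_data_quality_ruleset_evaluation_run(
        DataSource={
            "GlueTable": {
                "DatabaseName": database,
                "TableName": table,
            }
        },
        Role=role,
        RulesetNames=[ruleset_name],
    )
    return response["RunId"]


# Example call (all values are placeholders, not real resources):
# import boto3
# run_id = start_minimal_run(
#     boto3.client("glue"),
#     database="sales_db",
#     table="orders",
#     role="arn:aws:iam::123456789012:role/ExampleGlueRole",
#     ruleset_name="orders_ruleset",
# )
```
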
Parameters:
  • DataSource (dict) –

    [REQUIRED]

    The data source (Glue table) associated with this run.

    • GlueTable (dict) – [REQUIRED]

      A Glue table.

      • DatabaseName (string) – [REQUIRED]

        A database name in the Glue Data Catalog.

      • TableName (string) – [REQUIRED]

        A table name in the Glue Data Catalog.

      • CatalogId (string) –

        A unique identifier for the Glue Data Catalog.

      • ConnectionName (string) –

        The name of the connection to the Glue Data Catalog.

      • AdditionalOptions (dict) –

        Additional options for the table. Currently there are two keys supported:

        • pushDownPredicate: to filter on partitions without having to list and read all the files in your dataset.

        • catalogPartitionPredicate: to use server-side partition pruning using partition indexes in the Glue Data Catalog.

        • (string) –

          • (string) –

  • Role (string) –

    [REQUIRED]

    An IAM role supplied to encrypt the results of the run.

  • NumberOfWorkers (integer) – The number of G.1X workers to be used in the run. The default is 5.

  • Timeout (integer) – The timeout for a run in minutes. This is the maximum time that a run can consume resources before it is terminated and enters TIMEOUT status. The default is 2,880 minutes (48 hours).

  • ClientToken (string) – Used for idempotency. Set it to a random ID (such as a UUID) to avoid creating or starting multiple instances of the same resource.

  • AdditionalRunOptions (dict) –

    Additional run options you can specify for an evaluation run.

    • CloudWatchMetricsEnabled (boolean) –

      Whether or not to enable CloudWatch metrics.

    • ResultsS3Prefix (string) –

      The Amazon S3 prefix under which results are stored.

    • CompositeRuleEvaluationMethod (string) –

      The evaluation method for composite rules in the ruleset: ROW or COLUMN.

  • RulesetNames (list) –

    [REQUIRED]

    A list of ruleset names.

    • (string) –

  • AdditionalDataSources (dict) –

    A map of reference strings to additional data sources you can specify for an evaluation run.

    • (string) –

      • (dict) –

        A data source (a Glue table) for which you want data quality results.

        • GlueTable (dict) – [REQUIRED]

          A Glue table.

          • DatabaseName (string) – [REQUIRED]

            A database name in the Glue Data Catalog.

          • TableName (string) – [REQUIRED]

            A table name in the Glue Data Catalog.

          • CatalogId (string) –

            A unique identifier for the Glue Data Catalog.

          • ConnectionName (string) –

            The name of the connection to the Glue Data Catalog.

          • AdditionalOptions (dict) –

            Additional options for the table. Currently there are two keys supported:

            • pushDownPredicate: to filter on partitions without having to list and read all the files in your dataset.

            • catalogPartitionPredicate: to use server-side partition pruning using partition indexes in the Glue Data Catalog.

            • (string) –

              • (string) –
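
The optional parameters described above can be combined in one request. The sketch below assembles the keyword arguments as a plain dict so the structure is easy to inspect before sending. build_run_request and its argument names are hypothetical, and the predicate string and S3 prefix in the usage comment are illustrative placeholders only.

```python
import uuid


def build_run_request(database, table, role, ruleset_names,
                      partition_predicate=None, results_prefix=None,
                      reference_tables=None):
    """Assemble keyword arguments for start_data_quality_ruleset_evaluation_run.

    reference_tables maps a reference string to a (database, table) pair and
    becomes the AdditionalDataSources map.
    """
    glue_table = {"DatabaseName": database, "TableName": table}
    if partition_predicate:
        # pushDownPredicate filters on partitions without listing every file.
        glue_table["AdditionalOptions"] = {
            "pushDownPredicate": partition_predicate
        }

    request = {
        "DataSource": {"GlueTable": glue_table},
        "Role": role,
        "RulesetNames": list(ruleset_names),
        # A random UUID serves as the idempotency token.
        "ClientToken": str(uuid.uuid4()),
    }
    if results_prefix:
        request["AdditionalRunOptions"] = {
            "CloudWatchMetricsEnabled": True,
            "ResultsS3Prefix": results_prefix,
        }
    if reference_tables:
        request["AdditionalDataSources"] = {
            alias: {"GlueTable": {"DatabaseName": db, "TableName": tbl}}
            for alias, (db, tbl) in reference_tables.items()
        }
    return request


# Usage, assuming `client` is a Glue client (placeholder values throughout):
# request = build_run_request(
#     "sales_db", "orders",
#     "arn:aws:iam::123456789012:role/ExampleGlueRole",
#     ["orders_ruleset"],
#     partition_predicate="year == '2024'",
#     results_prefix="s3://example-bucket/dq-results/",
#     reference_tables={"customers": ("sales_db", "customers")},
# )
# run_id = client.start_data_quality_ruleset_evaluation_run(**request)["RunId"]
```

Note that a retry only benefits from idempotency if it reuses the same ClientToken, so build the request once and resend that same dict rather than calling the helper again.
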

Return type:

dict

Returns:

Response Syntax

{
    'RunId': 'string'
}

Response Structure

  • (dict) –

    • RunId (string) –

      The unique run identifier associated with this run.
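
The call returns as soon as the run is started, so the RunId is typically used to poll for completion before fetching results. The sketch below assumes the companion operations get_data_quality_ruleset_evaluation_run (returning Status and ResultIds) and get_data_quality_result. The helper names, polling interval, and set of terminal status names are assumptions of this example and should be checked against the reference for those operations.

```python
import time

# Status values treated as final in this sketch (assumed, see lead-in).
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"}


def wait_for_run(client, run_id, poll_seconds=30, max_polls=120,
                 sleep=time.sleep):
    """Poll the run until it reaches a terminal status, then return it.

    `sleep` is injectable so the loop can be exercised without waiting.
    """
    for _ in range(max_polls):
        run = client.get_data_quality_ruleset_evaluation_run(RunId=run_id)
        if run["Status"] in TERMINAL_STATES:
            return run
        sleep(poll_seconds)
    raise TimeoutError(f"run {run_id} did not finish after {max_polls} polls")


def fetch_results(client, run):
    """Retrieve one data quality result per ResultId listed on the run."""
    return [
        client.get_data_quality_result(ResultId=result_id)
        for result_id in run.get("ResultIds", [])
    ]
```
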

Exceptions

  • Glue.Client.exceptions.InvalidInputException

  • Glue.Client.exceptions.EntityNotFoundException

  • Glue.Client.exceptions.OperationTimeoutException

  • Glue.Client.exceptions.InternalServiceException

  • Glue.Client.exceptions.ConflictException
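
The exception classes listed above are available on the client's exceptions attribute and can be caught individually. As a sketch, the helper below retries when the service reports OperationTimeoutException or InternalServiceException and lets the remaining exceptions propagate. Treating those two as retryable is an assumption of this example, and start_run_with_retry is a hypothetical name.

```python
import time


def start_run_with_retry(client, request, attempts=3, backoff_seconds=2.0,
                         sleep=time.sleep):
    """Start a run, retrying errors this sketch assumes are transient.

    `request` is the dict of keyword arguments for the operation. Reusing the
    same dict (and so the same ClientToken, if present) keeps retries idempotent.
    """
    retryable = (
        client.exceptions.OperationTimeoutException,
        client.exceptions.InternalServiceException,
    )
    for attempt in range(1, attempts + 1):
        try:
            response = client.start_data_quality_ruleset_evaluation_run(**request)
            return response["RunId"]
        except retryable:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(backoff_seconds * attempt)  # linear backoff between tries
        # InvalidInputException, EntityNotFoundException, and
        # ConflictException are not caught here and reach the caller.
```
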