Glue / Client / get_crawler

get_crawler#

Glue.Client.get_crawler(**kwargs)#

Retrieves metadata for a specified crawler.

See also: AWS API Documentation

Request Syntax

response = client.get_crawler(
    Name='string'
)
Parameters:

Name (string) –

[REQUIRED]

The name of the crawler to retrieve metadata for.

Return type:

dict

Returns:

Response Syntax

{
    'Crawler': {
        'Name': 'string',
        'Role': 'string',
        'Targets': {
            'S3Targets': [
                {
                    'Path': 'string',
                    'Exclusions': [
                        'string',
                    ],
                    'ConnectionName': 'string',
                    'SampleSize': 123,
                    'EventQueueArn': 'string',
                    'DlqEventQueueArn': 'string'
                },
            ],
            'JdbcTargets': [
                {
                    'ConnectionName': 'string',
                    'Path': 'string',
                    'Exclusions': [
                        'string',
                    ],
                    'EnableAdditionalMetadata': [
                        'COMMENTS'|'RAWTYPES',
                    ]
                },
            ],
            'MongoDBTargets': [
                {
                    'ConnectionName': 'string',
                    'Path': 'string',
                    'ScanAll': True|False
                },
            ],
            'DynamoDBTargets': [
                {
                    'Path': 'string',
                    'scanAll': True|False,
                    'scanRate': 123.0
                },
            ],
            'CatalogTargets': [
                {
                    'DatabaseName': 'string',
                    'Tables': [
                        'string',
                    ],
                    'ConnectionName': 'string',
                    'EventQueueArn': 'string',
                    'DlqEventQueueArn': 'string'
                },
            ],
            'DeltaTargets': [
                {
                    'DeltaTables': [
                        'string',
                    ],
                    'ConnectionName': 'string',
                    'WriteManifest': True|False,
                    'CreateNativeDeltaTable': True|False
                },
            ]
        },
        'DatabaseName': 'string',
        'Description': 'string',
        'Classifiers': [
            'string',
        ],
        'RecrawlPolicy': {
            'RecrawlBehavior': 'CRAWL_EVERYTHING'|'CRAWL_NEW_FOLDERS_ONLY'|'CRAWL_EVENT_MODE'
        },
        'SchemaChangePolicy': {
            'UpdateBehavior': 'LOG'|'UPDATE_IN_DATABASE',
            'DeleteBehavior': 'LOG'|'DELETE_FROM_DATABASE'|'DEPRECATE_IN_DATABASE'
        },
        'LineageConfiguration': {
            'CrawlerLineageSettings': 'ENABLE'|'DISABLE'
        },
        'State': 'READY'|'RUNNING'|'STOPPING',
        'TablePrefix': 'string',
        'Schedule': {
            'ScheduleExpression': 'string',
            'State': 'SCHEDULED'|'NOT_SCHEDULED'|'TRANSITIONING'
        },
        'CrawlElapsedTime': 123,
        'CreationTime': datetime(2015, 1, 1),
        'LastUpdated': datetime(2015, 1, 1),
        'LastCrawl': {
            'Status': 'SUCCEEDED'|'CANCELLED'|'FAILED',
            'ErrorMessage': 'string',
            'LogGroup': 'string',
            'LogStream': 'string',
            'MessagePrefix': 'string',
            'StartTime': datetime(2015, 1, 1)
        },
        'Version': 123,
        'Configuration': 'string',
        'CrawlerSecurityConfiguration': 'string',
        'LakeFormationConfiguration': {
            'UseLakeFormationCredentials': True|False,
            'AccountId': 'string'
        }
    }
}

Response Structure

  • (dict) –

    • Crawler (dict) –

      The metadata for the specified crawler.

      • Name (string) –

        The name of the crawler.

      • Role (string) –

        The Amazon Resource Name (ARN) of an IAM role that’s used to access customer resources, such as Amazon Simple Storage Service (Amazon S3) data.

      • Targets (dict) –

        A collection of targets to crawl.

        • S3Targets (list) –

          Specifies Amazon Simple Storage Service (Amazon S3) targets.

          • (dict) –

            Specifies a data store in Amazon Simple Storage Service (Amazon S3).

            • Path (string) –

              The path to the Amazon S3 target.

            • Exclusions (list) –

              A list of glob patterns used to exclude from the crawl. For more information, see Catalog Tables with a Crawler.

              • (string) –

            • ConnectionName (string) –

              The name of a connection which allows a job or crawler to access data in Amazon S3 within an Amazon Virtual Private Cloud environment (Amazon VPC).

            • SampleSize (integer) –

              Sets the number of files in each leaf folder to be crawled when crawling sample files in a dataset. If not set, all the files are crawled. A valid value is an integer between 1 and 249.

            • EventQueueArn (string) –

              A valid Amazon SQS ARN. For example, arn:aws:sqs:region:account:sqs.

            • DlqEventQueueArn (string) –

              A valid Amazon dead-letter SQS ARN. For example, arn:aws:sqs:region:account:deadLetterQueue.

        • JdbcTargets (list) –

          Specifies JDBC targets.

          • (dict) –

            Specifies a JDBC data store to crawl.

            • ConnectionName (string) –

              The name of the connection to use to connect to the JDBC target.

            • Path (string) –

              The path of the JDBC target.

            • Exclusions (list) –

              A list of glob patterns used to exclude from the crawl. For more information, see Catalog Tables with a Crawler.

              • (string) –

            • EnableAdditionalMetadata (list) –

              Specify a value of RAWTYPES or COMMENTS to enable additional metadata in table responses. RAWTYPES provides the native-level datatype. COMMENTS provides comments associated with a column or table in the database.

              If you do not need additional metadata, keep the field empty.

              • (string) –

        • MongoDBTargets (list) –

          Specifies Amazon DocumentDB or MongoDB targets.

          • (dict) –

            Specifies an Amazon DocumentDB or MongoDB data store to crawl.

            • ConnectionName (string) –

              The name of the connection to use to connect to the Amazon DocumentDB or MongoDB target.

            • Path (string) –

              The path of the Amazon DocumentDB or MongoDB target (database/collection).

            • ScanAll (boolean) –

              Indicates whether to scan all the records, or to sample rows from the table. Scanning all the records can take a long time when the table is not a high throughput table.

              A value of true means to scan all records, while a value of false means to sample the records. If no value is specified, the value defaults to true.

        • DynamoDBTargets (list) –

          Specifies Amazon DynamoDB targets.

          • (dict) –

            Specifies an Amazon DynamoDB table to crawl.

            • Path (string) –

              The name of the DynamoDB table to crawl.

            • scanAll (boolean) –

              Indicates whether to scan all the records, or to sample rows from the table. Scanning all the records can take a long time when the table is not a high throughput table.

              A value of true means to scan all records, while a value of false means to sample the records. If no value is specified, the value defaults to true.

            • scanRate (float) –

              The percentage of the configured read capacity units to use by the Glue crawler. Read capacity units is a term defined by DynamoDB, and is a numeric value that acts as rate limiter for the number of reads that can be performed on that table per second.

              The valid values are null or a value between 0.1 to 1.5. A null value is used when user does not provide a value, and defaults to 0.5 of the configured Read Capacity Unit (for provisioned tables), or 0.25 of the max configured Read Capacity Unit (for tables using on-demand mode).

        • CatalogTargets (list) –

          Specifies Glue Data Catalog targets.

          • (dict) –

            Specifies an Glue Data Catalog target.

            • DatabaseName (string) –

              The name of the database to be synchronized.

            • Tables (list) –

              A list of the tables to be synchronized.

              • (string) –

            • ConnectionName (string) –

              The name of the connection for an Amazon S3-backed Data Catalog table to be a target of the crawl when using a Catalog connection type paired with a NETWORK Connection type.

            • EventQueueArn (string) –

              A valid Amazon SQS ARN. For example, arn:aws:sqs:region:account:sqs.

            • DlqEventQueueArn (string) –

              A valid Amazon dead-letter SQS ARN. For example, arn:aws:sqs:region:account:deadLetterQueue.

        • DeltaTargets (list) –

          Specifies Delta data store targets.

          • (dict) –

            Specifies a Delta data store to crawl one or more Delta tables.

            • DeltaTables (list) –

              A list of the Amazon S3 paths to the Delta tables.

              • (string) –

            • ConnectionName (string) –

              The name of the connection to use to connect to the Delta table target.

            • WriteManifest (boolean) –

              Specifies whether to write the manifest files to the Delta table path.

            • CreateNativeDeltaTable (boolean) –

              Specifies whether the crawler will create native tables, to allow integration with query engines that support querying of the Delta transaction log directly.

      • DatabaseName (string) –

        The name of the database in which the crawler’s output is stored.

      • Description (string) –

        A description of the crawler.

      • Classifiers (list) –

        A list of UTF-8 strings that specify the custom classifiers that are associated with the crawler.

        • (string) –

      • RecrawlPolicy (dict) –

        A policy that specifies whether to crawl the entire dataset again, or to crawl only folders that were added since the last crawler run.

        • RecrawlBehavior (string) –

          Specifies whether to crawl the entire dataset again or to crawl only folders that were added since the last crawler run.

          A value of CRAWL_EVERYTHING specifies crawling the entire dataset again.

          A value of CRAWL_NEW_FOLDERS_ONLY specifies crawling only folders that were added since the last crawler run.

          A value of CRAWL_EVENT_MODE specifies crawling only the changes identified by Amazon S3 events.

      • SchemaChangePolicy (dict) –

        The policy that specifies update and delete behaviors for the crawler.

        • UpdateBehavior (string) –

          The update behavior when the crawler finds a changed schema.

        • DeleteBehavior (string) –

          The deletion behavior when the crawler finds a deleted object.

      • LineageConfiguration (dict) –

        A configuration that specifies whether data lineage is enabled for the crawler.

        • CrawlerLineageSettings (string) –

          Specifies whether data lineage is enabled for the crawler. Valid values are:

          • ENABLE: enables data lineage for the crawler

          • DISABLE: disables data lineage for the crawler

      • State (string) –

        Indicates whether the crawler is running, or whether a run is pending.

      • TablePrefix (string) –

        The prefix added to the names of tables that are created.

      • Schedule (dict) –

        For scheduled crawlers, the schedule when the crawler runs.

        • ScheduleExpression (string) –

          A cron expression used to specify the schedule (see Time-Based Schedules for Jobs and Crawlers. For example, to run something every day at 12:15 UTC, you would specify: cron(15 12 * * ? *).

        • State (string) –

          The state of the schedule.

      • CrawlElapsedTime (integer) –

        If the crawler is running, contains the total time elapsed since the last crawl began.

      • CreationTime (datetime) –

        The time that the crawler was created.

      • LastUpdated (datetime) –

        The time that the crawler was last updated.

      • LastCrawl (dict) –

        The status of the last crawl, and potentially error information if an error occurred.

        • Status (string) –

          Status of the last crawl.

        • ErrorMessage (string) –

          If an error occurred, the error information about the last crawl.

        • LogGroup (string) –

          The log group for the last crawl.

        • LogStream (string) –

          The log stream for the last crawl.

        • MessagePrefix (string) –

          The prefix for a message about this crawl.

        • StartTime (datetime) –

          The time at which the crawl started.

      • Version (integer) –

        The version of the crawler.

      • Configuration (string) –

        Crawler configuration information. This versioned JSON string allows users to specify aspects of a crawler’s behavior. For more information, see Setting crawler configuration options.

      • CrawlerSecurityConfiguration (string) –

        The name of the SecurityConfiguration structure to be used by this crawler.

      • LakeFormationConfiguration (dict) –

        Specifies whether the crawler should use Lake Formation credentials for the crawler instead of the IAM role credentials.

        • UseLakeFormationCredentials (boolean) –

          Specifies whether to use Lake Formation credentials for the crawler instead of the IAM role credentials.

        • AccountId (string) –

          Required for cross account crawls. For same account crawls as the target data, this can be left as null.

Exceptions

  • Glue.Client.exceptions.EntityNotFoundException

  • Glue.Client.exceptions.OperationTimeoutException