get_crawlers

Glue.Client.get_crawlers(**kwargs)

Retrieves metadata for all crawlers defined in the customer account.

Request Syntax

response = client.get_crawlers(
    MaxResults=123,
    NextToken='string'
)

Parameters:

MaxResults (integer) – The number of crawlers to return on each call.
NextToken (string) – A continuation token, if this is a continuation request.
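
The two parameters work together for pagination: MaxResults sizes each page, and the NextToken from one response is passed to the next request until no token comes back. A minimal sketch of that loop follows, assuming `client` is a Glue client (for example, one created with `boto3.client('glue')`). The helper name `iter_crawlers` and the default page size are illustrative and not part of the API.

```python
def iter_crawlers(client, page_size=50):
    """Yield every crawler dict, following NextToken until it runs out.

    `iter_crawlers` and `page_size` are hypothetical names for this sketch.
    """
    kwargs = {"MaxResults": page_size}
    while True:
        response = client.get_crawlers(**kwargs)
        # 'Crawlers' holds one page of crawler metadata.
        yield from response.get("Crawlers", [])
        token = response.get("NextToken")
        if not token:
            # No continuation token: the listing is complete.
            break
        kwargs["NextToken"] = token
```

A caller could then write `for crawler in iter_crawlers(client): print(crawler['Name'])` without handling tokens directly.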

Return type:

dict

Returns:

Response Syntax

{
    'Crawlers': [
        {
            'Name': 'string',
            'Role': 'string',
            'Targets': {
                'S3Targets': [
                    {
                        'Path': 'string',
                        'Exclusions': [
                            'string',
                        ],
                        'ConnectionName': 'string',
                        'SampleSize': 123,
                        'EventQueueArn': 'string',
                        'DlqEventQueueArn': 'string'
                    },
                ],
                'JdbcTargets': [
                    {
                        'ConnectionName': 'string',
                        'Path': 'string',
                        'Exclusions': [
                            'string',
                        ],
                        'EnableAdditionalMetadata': [
                            'COMMENTS'|'RAWTYPES',
                        ]
                    },
                ],
                'MongoDBTargets': [
                    {
                        'ConnectionName': 'string',
                        'Path': 'string',
                        'ScanAll': True|False
                    },
                ],
                'DynamoDBTargets': [
                    {
                        'Path': 'string',
                        'scanAll': True|False,
                        'scanRate': 123.0
                    },
                ],
                'CatalogTargets': [
                    {
                        'DatabaseName': 'string',
                        'Tables': [
                            'string',
                        ],
                        'ConnectionName': 'string',
                        'EventQueueArn': 'string',
                        'DlqEventQueueArn': 'string'
                    },
                ],
                'DeltaTargets': [
                    {
                        'DeltaTables': [
                            'string',
                        ],
                        'ConnectionName': 'string',
                        'WriteManifest': True|False,
                        'CreateNativeDeltaTable': True|False
                    },
                ],
                'IcebergTargets': [
                    {
                        'Paths': [
                            'string',
                        ],
                        'ConnectionName': 'string',
                        'Exclusions': [
                            'string',
                        ],
                        'MaximumTraversalDepth': 123
                    },
                ]
            },
            'DatabaseName': 'string',
            'Description': 'string',
            'Classifiers': [
                'string',
            ],
            'RecrawlPolicy': {
                'RecrawlBehavior': 'CRAWL_EVERYTHING'|'CRAWL_NEW_FOLDERS_ONLY'|'CRAWL_EVENT_MODE'
            },
            'SchemaChangePolicy': {
                'UpdateBehavior': 'LOG'|'UPDATE_IN_DATABASE',
                'DeleteBehavior': 'LOG'|'DELETE_FROM_DATABASE'|'DEPRECATE_IN_DATABASE'
            },
            'LineageConfiguration': {
                'CrawlerLineageSettings': 'ENABLE'|'DISABLE'
            },
            'State': 'READY'|'RUNNING'|'STOPPING',
            'TablePrefix': 'string',
            'Schedule': {
                'ScheduleExpression': 'string',
                'State': 'SCHEDULED'|'NOT_SCHEDULED'|'TRANSITIONING'
            },
            'CrawlElapsedTime': 123,
            'CreationTime': datetime(2015, 1, 1),
            'LastUpdated': datetime(2015, 1, 1),
            'LastCrawl': {
                'Status': 'SUCCEEDED'|'CANCELLED'|'FAILED',
                'ErrorMessage': 'string',
                'LogGroup': 'string',
                'LogStream': 'string',
                'MessagePrefix': 'string',
                'StartTime': datetime(2015, 1, 1)
            },
            'Version': 123,
            'Configuration': 'string',
            'CrawlerSecurityConfiguration': 'string',
            'LakeFormationConfiguration': {
                'UseLakeFormationCredentials': True|False,
                'AccountId': 'string'
            }
        },
    ],
    'NextToken': 'string'
}
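
To show how the nested shape above is navigated, here is a small sketch that collects the Amazon S3 target paths for each crawler in one response. The function name `s3_paths_by_crawler` is hypothetical. The use of `.get()` with defaults reflects a defensive assumption that keys such as 'Targets' or 'S3Targets' may be absent for some crawlers.

```python
def s3_paths_by_crawler(response):
    """Map each crawler name to the list of its S3 target paths.

    `response` is one dict as returned by get_crawlers; missing keys
    are treated as empty (an assumption made for this sketch).
    """
    paths = {}
    for crawler in response.get("Crawlers", []):
        s3_targets = crawler.get("Targets", {}).get("S3Targets", [])
        paths[crawler["Name"]] = [t["Path"] for t in s3_targets if "Path" in t]
    return paths
```

The same pattern applies to the other target lists, such as 'JdbcTargets' or 'DynamoDBTargets', by swapping the key.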

Response Structure

(dict) –
- Crawlers (list) –
  
  A list of crawler metadata.
  - (dict) –
    
    Specifies a crawler program that examines a data source and uses classifiers to try to determine its schema. If successful, the crawler records metadata concerning the data source in the Glue Data Catalog.
    - Name (string) –
      
      The name of the crawler.
    - Role (string) –
      
      The Amazon Resource Name (ARN) of an IAM role that’s used to access customer resources, such as Amazon Simple Storage Service (Amazon S3) data.
    - Targets (dict) –
      
      A collection of targets to crawl.
      - S3Targets (list) –
        
        Specifies Amazon Simple Storage Service (Amazon S3) targets.
        
        - (dict) –

          Specifies a data store in Amazon Simple Storage Service (Amazon S3).
          - Path (string) –

            The path to the Amazon S3 target.
          - Exclusions (list) –

            A list of glob patterns used to exclude from the crawl. For more information, see Catalog Tables with a Crawler.
            - (string) –
          - ConnectionName (string) –

            The name of a connection which allows a job or crawler to access data in Amazon S3 within an Amazon Virtual Private Cloud environment (Amazon VPC).
          - SampleSize (integer) –

            Sets the number of files in each leaf folder to be crawled when crawling sample files in a dataset. If not set, all the files are crawled. A valid value is an integer between 1 and 249.
          - EventQueueArn (string) –

            A valid Amazon SQS ARN. For example, arn:aws:sqs:region:account:sqs.
          - DlqEventQueueArn (string) –

            A valid Amazon dead-letter SQS ARN. For example, arn:aws:sqs:region:account:deadLetterQueue.
      - JdbcTargets (list) –
        
        Specifies JDBC targets.
        
        - (dict) –

          Specifies a JDBC data store to crawl.
          - ConnectionName (string) –

            The name of the connection to use to connect to the JDBC target.
          - Path (string) –

            The path of the JDBC target.
          - Exclusions (list) –

            A list of glob patterns used to exclude from the crawl. For more information, see Catalog Tables with a Crawler.
            - (string) –
          - EnableAdditionalMetadata (list) –

            Specify a value of RAWTYPES or COMMENTS to enable additional metadata in table responses. RAWTYPES provides the native-level datatype. COMMENTS provides comments associated with a column or table in the database.

            If you do not need additional metadata, keep the field empty.
            - (string) –
      - MongoDBTargets (list) –
        
        Specifies Amazon DocumentDB or MongoDB targets.
        
        - (dict) –

          Specifies an Amazon DocumentDB or MongoDB data store to crawl.
          - ConnectionName (string) –

            The name of the connection to use to connect to the Amazon DocumentDB or MongoDB target.
          - Path (string) –

            The path of the Amazon DocumentDB or MongoDB target (database/collection).
          - ScanAll (boolean) –

            Indicates whether to scan all the records, or to sample rows from the table. Scanning all the records can take a long time when the table is not a high throughput table.

            A value of true means to scan all records, while a value of false means to sample the records. If no value is specified, the value defaults to true.
      - DynamoDBTargets (list) –
        
        Specifies Amazon DynamoDB targets.
        
        - (dict) –

          Specifies an Amazon DynamoDB table to crawl.
          - Path (string) –

            The name of the DynamoDB table to crawl.
          - scanAll (boolean) –

            Indicates whether to scan all the records, or to sample rows from the table. Scanning all the records can take a long time when the table is not a high throughput table.

            A value of true means to scan all records, while a value of false means to sample the records. If no value is specified, the value defaults to true.
          - scanRate (float) –

            The percentage of the configured read capacity units to use by the Glue crawler. Read capacity units is a term defined by DynamoDB, and is a numeric value that acts as a rate limiter for the number of reads that can be performed on that table per second.

            The valid values are null or a value between 0.1 and 1.5. A null value is used when the user does not provide a value, and defaults to 0.5 of the configured Read Capacity Unit (for provisioned tables), or 0.25 of the max configured Read Capacity Unit (for tables using on-demand mode).
      - CatalogTargets (list) –
        
        Specifies Glue Data Catalog targets.
        
        - (dict) –

          Specifies a Glue Data Catalog target.
          - DatabaseName (string) –

            The name of the database to be synchronized.
          - Tables (list) –

            A list of the tables to be synchronized.
            - (string) –
          - ConnectionName (string) –

            The name of the connection for an Amazon S3-backed Data Catalog table to be a target of the crawl when using a Catalog connection type paired with a NETWORK Connection type.
          - EventQueueArn (string) –

            A valid Amazon SQS ARN. For example, arn:aws:sqs:region:account:sqs.
          - DlqEventQueueArn (string) –

            A valid Amazon dead-letter SQS ARN. For example, arn:aws:sqs:region:account:deadLetterQueue.
      - DeltaTargets (list) –
        
        Specifies Delta data store targets.
        
        - (dict) –

          Specifies a Delta data store to crawl one or more Delta tables.
          - DeltaTables (list) –

            A list of the Amazon S3 paths to the Delta tables.
            - (string) –
          - ConnectionName (string) –

            The name of the connection to use to connect to the Delta table target.
          - WriteManifest (boolean) –

            Specifies whether to write the manifest files to the Delta table path.
          - CreateNativeDeltaTable (boolean) –

            Specifies whether the crawler will create native tables, to allow integration with query engines that support querying of the Delta transaction log directly.
      - IcebergTargets (list) –
        
        Specifies Apache Iceberg data store targets.
        
        - (dict) –

          Specifies an Apache Iceberg data source where Iceberg tables are stored in Amazon S3.
          - Paths (list) –

            One or more Amazon S3 paths that contain Iceberg metadata folders as s3://bucket/prefix.
            - (string) –
          - ConnectionName (string) –

            The name of the connection to use to connect to the Iceberg target.
          - Exclusions (list) –

            A list of glob patterns used to exclude from the crawl. For more information, see Catalog Tables with a Crawler.
            - (string) –
          - MaximumTraversalDepth (integer) –

            The maximum depth of Amazon S3 paths that the crawler can traverse to discover the Iceberg metadata folder in your Amazon S3 path. Used to limit the crawler run time.
    - DatabaseName (string) –
      
      The name of the database in which the crawler’s output is stored.
    - Description (string) –
      
      A description of the crawler.
    - Classifiers (list) –
      
      A list of UTF-8 strings that specify the custom classifiers that are associated with the crawler.
      - (string) –
    - RecrawlPolicy (dict) –
      
      A policy that specifies whether to crawl the entire dataset again, or to crawl only folders that were added since the last crawler run.
      - RecrawlBehavior (string) –
        
        Specifies whether to crawl the entire dataset again or to crawl only folders that were added since the last crawler run.
        
        A value of CRAWL_EVERYTHING specifies crawling the entire dataset again.
        
        A value of CRAWL_NEW_FOLDERS_ONLY specifies crawling only folders that were added since the last crawler run.
        
        A value of CRAWL_EVENT_MODE specifies crawling only the changes identified by Amazon S3 events.
    - SchemaChangePolicy (dict) –
      
      The policy that specifies update and delete behaviors for the crawler.
      - UpdateBehavior (string) –
        
        The update behavior when the crawler finds a changed schema.
      - DeleteBehavior (string) –
        
        The deletion behavior when the crawler finds a deleted object.
    - LineageConfiguration (dict) –
      
      A configuration that specifies whether data lineage is enabled for the crawler.
      - CrawlerLineageSettings (string) –
        
        Specifies whether data lineage is enabled for the crawler. Valid values are:
        
        ENABLE: enables data lineage for the crawler
        
        DISABLE: disables data lineage for the crawler
    - State (string) –
      
      Indicates whether the crawler is running, or whether a run is pending.
    - TablePrefix (string) –
      
      The prefix added to the names of tables that are created.
    - Schedule (dict) –
      
      For scheduled crawlers, the schedule when the crawler runs.
      - ScheduleExpression (string) –
        
        A cron expression used to specify the schedule (see Time-Based Schedules for Jobs and Crawlers). For example, to run something every day at 12:15 UTC, you would specify: cron(15 12 * * ? *).
      - State (string) –
        
        The state of the schedule.
    - CrawlElapsedTime (integer) –
      
      If the crawler is running, contains the total time elapsed since the last crawl began.
    - CreationTime (datetime) –
      
      The time that the crawler was created.
    - LastUpdated (datetime) –
      
      The time that the crawler was last updated.
    - LastCrawl (dict) –
      
      The status of the last crawl, and potentially error information if an error occurred.
      - Status (string) –
        
        Status of the last crawl.
      - ErrorMessage (string) –
        
        If an error occurred, the error information about the last crawl.
      - LogGroup (string) –
        
        The log group for the last crawl.
      - LogStream (string) –
        
        The log stream for the last crawl.
      - MessagePrefix (string) –
        
        The prefix for a message about this crawl.
      - StartTime (datetime) –
        
        The time at which the crawl started.
    - Version (integer) –
      
      The version of the crawler.
    - Configuration (string) –
      
      Crawler configuration information. This versioned JSON string allows users to specify aspects of a crawler’s behavior. For more information, see Setting crawler configuration options.
    - CrawlerSecurityConfiguration (string) –
      
      The name of the SecurityConfiguration structure to be used by this crawler.
    - LakeFormationConfiguration (dict) –
      
      Specifies whether the crawler should use Lake Formation credentials instead of the IAM role credentials.
      - UseLakeFormationCredentials (boolean) –
        
        Specifies whether to use Lake Formation credentials for the crawler instead of the IAM role credentials.
      - AccountId (string) –
        
        Required for cross-account crawls. For crawls in the same account as the target data, this can be left as null.
- NextToken (string) –
  
  A continuation token, if the returned list has not reached the end of those defined in this customer account.
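
As an illustration of reading the fields described above, the following sketch picks out crawlers whose most recent crawl failed, using the LastCrawl Status and ErrorMessage fields. The function name `failed_crawlers` is hypothetical, and treating a missing 'LastCrawl' entry as "no failure" is an assumption made for this example.

```python
def failed_crawlers(crawlers):
    """Return (name, error message) pairs for crawlers whose last crawl failed.

    `crawlers` is the 'Crawlers' list from a get_crawlers response.
    """
    failures = []
    for crawler in crawlers:
        last_crawl = crawler.get("LastCrawl", {})
        # 'FAILED' is one of the documented LastCrawl Status values.
        if last_crawl.get("Status") == "FAILED":
            failures.append((crawler.get("Name"), last_crawl.get("ErrorMessage", "")))
    return failures
```

The LogGroup and LogStream fields of the same LastCrawl dict could be returned alongside the message to help locate the relevant logs.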

Exceptions

Glue.Client.exceptions.OperationTimeoutException
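
The exception above is reachable from a client instance as `client.exceptions.OperationTimeoutException`. One possible way to handle it is a small retry wrapper, sketched below. The helper name `get_crawlers_with_retry`, the attempt count, and the linear backoff are all assumptions chosen for illustration, not behavior prescribed by the API.

```python
import time


def get_crawlers_with_retry(client, attempts=3, delay_seconds=1.0, **kwargs):
    """Call get_crawlers, retrying when the operation times out.

    Hypothetical helper: retries up to `attempts` times, sleeping
    delay_seconds * attempt between tries, then re-raises.
    """
    for attempt in range(1, attempts + 1):
        try:
            return client.get_crawlers(**kwargs)
        except client.exceptions.OperationTimeoutException:
            if attempt == attempts:
                # Out of attempts: let the caller see the timeout.
                raise
            time.sleep(delay_seconds * attempt)
```

Keyword arguments such as MaxResults and NextToken are passed through unchanged, so the wrapper can stand in for a direct call.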