update_crawler

Glue.Client.update_crawler(**kwargs)

Updates a crawler. If a crawler is running, you must stop it using StopCrawler before updating it.

See also: AWS API Documentation
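The stop-before-update requirement above can be sketched as a small helper. This is a minimal illustration, not part of the library: the function name `stop_and_update_crawler` and its `poll_seconds`, `timeout_seconds`, and `sleep` parameters are hypothetical, and it assumes the `State` field returned by `get_crawler` takes the values `READY`, `RUNNING`, or `STOPPING`. The `client` argument is expected to be a Glue client such as the one returned by `boto3.client('glue')`.

```python
import time


def stop_and_update_crawler(client, name, *, poll_seconds=5.0,
                            timeout_seconds=300.0, sleep=time.sleep, **changes):
    """Stop the crawler if it is running, wait until it is READY, then update it.

    `changes` holds the update_crawler keyword arguments other than Name.
    """
    state = client.get_crawler(Name=name)["Crawler"]["State"]
    if state == "RUNNING":
        client.stop_crawler(Name=name)

    # Poll until the crawler reports READY (assumed to be the idle state).
    waited = 0.0
    while state != "READY":
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"Crawler {name!r} is still {state} after {waited:.0f} seconds"
            )
        sleep(poll_seconds)
        waited += poll_seconds
        state = client.get_crawler(Name=name)["Crawler"]["State"]

    return client.update_crawler(Name=name, **changes)
```

A call might look like `stop_and_update_crawler(client, 'my-crawler', Description='Updated description')`, where `my-crawler` is a placeholder name. The crawler state can still change between the last poll and the update, so the exceptions listed for this method remain possible.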

Request Syntax

response = client.update_crawler(
    Name='string',
    Role='string',
    DatabaseName='string',
    Description='string',
    Targets={
        'S3Targets': [
            {
                'Path': 'string',
                'Exclusions': [
                    'string',
                ],
                'ConnectionName': 'string',
                'SampleSize': 123,
                'EventQueueArn': 'string',
                'DlqEventQueueArn': 'string'
            },
        ],
        'JdbcTargets': [
            {
                'ConnectionName': 'string',
                'Path': 'string',
                'Exclusions': [
                    'string',
                ],
                'EnableAdditionalMetadata': [
                    'COMMENTS'|'RAWTYPES',
                ]
            },
        ],
        'MongoDBTargets': [
            {
                'ConnectionName': 'string',
                'Path': 'string',
                'ScanAll': True|False
            },
        ],
        'DynamoDBTargets': [
            {
                'Path': 'string',
                'scanAll': True|False,
                'scanRate': 123.0
            },
        ],
        'CatalogTargets': [
            {
                'DatabaseName': 'string',
                'Tables': [
                    'string',
                ],
                'ConnectionName': 'string',
                'EventQueueArn': 'string',
                'DlqEventQueueArn': 'string'
            },
        ],
        'DeltaTargets': [
            {
                'DeltaTables': [
                    'string',
                ],
                'ConnectionName': 'string',
                'WriteManifest': True|False,
                'CreateNativeDeltaTable': True|False
            },
        ],
        'IcebergTargets': [
            {
                'Paths': [
                    'string',
                ],
                'ConnectionName': 'string',
                'Exclusions': [
                    'string',
                ],
                'MaximumTraversalDepth': 123
            },
        ]
    },
    Schedule='string',
    Classifiers=[
        'string',
    ],
    TablePrefix='string',
    SchemaChangePolicy={
        'UpdateBehavior': 'LOG'|'UPDATE_IN_DATABASE',
        'DeleteBehavior': 'LOG'|'DELETE_FROM_DATABASE'|'DEPRECATE_IN_DATABASE'
    },
    RecrawlPolicy={
        'RecrawlBehavior': 'CRAWL_EVERYTHING'|'CRAWL_NEW_FOLDERS_ONLY'|'CRAWL_EVENT_MODE'
    },
    LineageConfiguration={
        'CrawlerLineageSettings': 'ENABLE'|'DISABLE'
    },
    LakeFormationConfiguration={
        'UseLakeFormationCredentials': True|False,
        'AccountId': 'string'
    },
    Configuration='string',
    CrawlerSecurityConfiguration='string'
)
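Only `Name` is marked required in the syntax above, so a request usually carries a subset of these arguments. The sketch below shows one way to assemble that subset; `build_update_kwargs` is a hypothetical helper, and `my-crawler` and the other values are placeholders. It drops arguments whose value is `None`, on the assumption that boto3's parameter validation rejects `None` for typed parameters.

```python
def build_update_kwargs(name, **optional):
    """Return keyword arguments for update_crawler, dropping unset (None) values."""
    if not name:
        raise ValueError("Name is required")
    kwargs = {"Name": name}
    kwargs.update(
        {key: value for key, value in optional.items() if value is not None}
    )
    return kwargs


kwargs = build_update_kwargs(
    "my-crawler",  # placeholder crawler name
    Description="Nightly crawl of the sales bucket",
    Schedule=None,  # no value to send, so the key is omitted
    TablePrefix="sales_",
)
print(kwargs)
# {'Name': 'my-crawler', 'Description': 'Nightly crawl of the sales bucket', 'TablePrefix': 'sales_'}
```

The resulting dictionary can then be passed as `client.update_crawler(**kwargs)`.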
Parameters:
  • Name (string) –

    [REQUIRED]

    Name of the crawler to update.

  • Role (string) – The IAM role, or the Amazon Resource Name (ARN) of an IAM role, that the crawler uses to access customer resources.

  • DatabaseName (string) – The Glue database where results are stored, such as: arn:aws:daylight:us-east-1::database/sometable/*.

  • Description (string) – A description of the crawler.

  • Targets (dict) –

    A list of targets to crawl.

    • S3Targets (list) –

      Specifies Amazon Simple Storage Service (Amazon S3) targets.

      • (dict) –

        Specifies a data store in Amazon Simple Storage Service (Amazon S3).

        • Path (string) –

          The path to the Amazon S3 target.

        • Exclusions (list) –

          A list of glob patterns used to exclude from the crawl. For more information, see Catalog Tables with a Crawler.

          • (string) –

        • ConnectionName (string) –

          The name of a connection which allows a job or crawler to access data in Amazon S3 within an Amazon Virtual Private Cloud environment (Amazon VPC).

        • SampleSize (integer) –

          Sets the number of files in each leaf folder to be crawled when crawling sample files in a dataset. If not set, all the files are crawled. A valid value is an integer between 1 and 249.

        • EventQueueArn (string) –

          A valid Amazon SQS ARN. For example, arn:aws:sqs:region:account:sqs.

        • DlqEventQueueArn (string) –

          A valid Amazon dead-letter SQS ARN. For example, arn:aws:sqs:region:account:deadLetterQueue.

    • JdbcTargets (list) –

      Specifies JDBC targets.

      • (dict) –

        Specifies a JDBC data store to crawl.

        • ConnectionName (string) –

          The name of the connection to use to connect to the JDBC target.

        • Path (string) –

          The path of the JDBC target.

        • Exclusions (list) –

          A list of glob patterns used to exclude from the crawl. For more information, see Catalog Tables with a Crawler.

          • (string) –

        • EnableAdditionalMetadata (list) –

          Specify a value of RAWTYPES or COMMENTS to enable additional metadata in table responses. RAWTYPES provides the native-level datatype. COMMENTS provides comments associated with a column or table in the database.

          If you do not need additional metadata, keep the field empty.

          • (string) –

    • MongoDBTargets (list) –

      Specifies Amazon DocumentDB or MongoDB targets.

      • (dict) –

        Specifies an Amazon DocumentDB or MongoDB data store to crawl.

        • ConnectionName (string) –

          The name of the connection to use to connect to the Amazon DocumentDB or MongoDB target.

        • Path (string) –

          The path of the Amazon DocumentDB or MongoDB target (database/collection).

        • ScanAll (boolean) –

          Indicates whether to scan all the records, or to sample rows from the table. Scanning all the records can take a long time when the table is not a high throughput table.

          A value of true means to scan all records, while a value of false means to sample the records. If no value is specified, the value defaults to true.

    • DynamoDBTargets (list) –

      Specifies Amazon DynamoDB targets.

      • (dict) –

        Specifies an Amazon DynamoDB table to crawl.

        • Path (string) –

          The name of the DynamoDB table to crawl.

        • scanAll (boolean) –

          Indicates whether to scan all the records, or to sample rows from the table. Scanning all the records can take a long time when the table is not a high throughput table.

          A value of true means to scan all records, while a value of false means to sample the records. If no value is specified, the value defaults to true.

        • scanRate (float) –

          The percentage of the configured read capacity units to use by the Glue crawler. Read capacity units is a term defined by DynamoDB, and is a numeric value that acts as rate limiter for the number of reads that can be performed on that table per second.

          The valid values are null or a value between 0.1 and 1.5. A null value is used when the user does not provide a value, and defaults to 0.5 of the configured Read Capacity Unit (for provisioned tables), or 0.25 of the max configured Read Capacity Unit (for tables using on-demand mode).

    • CatalogTargets (list) –

      Specifies Glue Data Catalog targets.

      • (dict) –

        Specifies a Glue Data Catalog target.

        • DatabaseName (string) – [REQUIRED]

          The name of the database to be synchronized.

        • Tables (list) – [REQUIRED]

          A list of the tables to be synchronized.

          • (string) –

        • ConnectionName (string) –

          The name of the connection for an Amazon S3-backed Data Catalog table to be a target of the crawl when using a Catalog connection type paired with a NETWORK Connection type.

        • EventQueueArn (string) –

          A valid Amazon SQS ARN. For example, arn:aws:sqs:region:account:sqs.

        • DlqEventQueueArn (string) –

          A valid Amazon dead-letter SQS ARN. For example, arn:aws:sqs:region:account:deadLetterQueue.

    • DeltaTargets (list) –

      Specifies Delta data store targets.

      • (dict) –

        Specifies a Delta data store to crawl one or more Delta tables.

        • DeltaTables (list) –

          A list of the Amazon S3 paths to the Delta tables.

          • (string) –

        • ConnectionName (string) –

          The name of the connection to use to connect to the Delta table target.

        • WriteManifest (boolean) –

          Specifies whether to write the manifest files to the Delta table path.

        • CreateNativeDeltaTable (boolean) –

          Specifies whether the crawler will create native tables, to allow integration with query engines that support querying of the Delta transaction log directly.

    • IcebergTargets (list) –

      Specifies Apache Iceberg data store targets.

      • (dict) –

        Specifies an Apache Iceberg data source where Iceberg tables are stored in Amazon S3.

        • Paths (list) –

          One or more Amazon S3 paths that contain Iceberg metadata folders, in the form s3://bucket/prefix.

          • (string) –

        • ConnectionName (string) –

          The name of the connection to use to connect to the Iceberg target.

        • Exclusions (list) –

          A list of glob patterns used to exclude from the crawl. For more information, see Catalog Tables with a Crawler.

          • (string) –

        • MaximumTraversalDepth (integer) –

          The maximum depth of Amazon S3 paths that the crawler can traverse to discover the Iceberg metadata folder in your Amazon S3 path. Used to limit the crawler run time.

  • Schedule (string) – A cron expression used to specify the schedule (see Time-Based Schedules for Jobs and Crawlers). For example, to run something every day at 12:15 UTC, you would specify: cron(15 12 * * ? *).

  • Classifiers (list) –

    A list of custom classifiers that the user has registered. By default, all built-in classifiers are included in a crawl, but these custom classifiers always override the default classifiers for a given classification.

    • (string) –

  • TablePrefix (string) – The table prefix used for catalog tables that are created.

  • SchemaChangePolicy (dict) –

    The policy for the crawler’s update and deletion behavior.

    • UpdateBehavior (string) –

      The update behavior when the crawler finds a changed schema.

    • DeleteBehavior (string) –

      The deletion behavior when the crawler finds a deleted object.

  • RecrawlPolicy (dict) –

    A policy that specifies whether to crawl the entire dataset again, or to crawl only folders that were added since the last crawler run.

    • RecrawlBehavior (string) –

      Specifies whether to crawl the entire dataset again or to crawl only folders that were added since the last crawler run.

      A value of CRAWL_EVERYTHING specifies crawling the entire dataset again.

      A value of CRAWL_NEW_FOLDERS_ONLY specifies crawling only folders that were added since the last crawler run.

      A value of CRAWL_EVENT_MODE specifies crawling only the changes identified by Amazon S3 events.

  • LineageConfiguration (dict) –

    Specifies data lineage configuration settings for the crawler.

    • CrawlerLineageSettings (string) –

      Specifies whether data lineage is enabled for the crawler. Valid values are:

      • ENABLE: enables data lineage for the crawler

      • DISABLE: disables data lineage for the crawler

  • LakeFormationConfiguration (dict) –

    Specifies Lake Formation configuration settings for the crawler.

    • UseLakeFormationCredentials (boolean) –

      Specifies whether to use Lake Formation credentials for the crawler instead of the IAM role credentials.

    • AccountId (string) –

      Required for cross-account crawls. For crawls in the same account as the target data, this can be left as null.

  • Configuration (string) – Crawler configuration information. This versioned JSON string allows users to specify aspects of a crawler’s behavior. For more information, see Setting crawler configuration options.

  • CrawlerSecurityConfiguration (string) – The name of the SecurityConfiguration structure to be used by this crawler.
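Several of the parameters above carry value constraints that are easy to get wrong: the 1 to 249 range for `SampleSize`, the 0.1 to 1.5 range and lower-case key names for the DynamoDB `scanAll` and `scanRate` fields, and the requirement that `Configuration` be a JSON string rather than a dict. The sketch below builds a request that respects them. The helper names, bucket, table, and crawler names are all hypothetical, and the `Grouping` setting inside `Configuration` is an assumption that should be checked against Setting crawler configuration options.

```python
import json


def s3_target(path, exclusions=None, sample_size=None):
    """Build one S3Targets entry, checking SampleSize against the 1-249 range."""
    target = {"Path": path}
    if exclusions:
        target["Exclusions"] = list(exclusions)
    if sample_size is not None:
        if not 1 <= sample_size <= 249:
            raise ValueError("SampleSize must be an integer between 1 and 249")
        target["SampleSize"] = sample_size
    return target


def dynamodb_target(table_name, scan_all=None, scan_rate=None):
    """Build one DynamoDBTargets entry; scanAll and scanRate are lower-case keys."""
    target = {"Path": table_name}
    if scan_all is not None:
        target["scanAll"] = scan_all
    if scan_rate is not None:
        if not 0.1 <= scan_rate <= 1.5:
            raise ValueError("scanRate must be between 0.1 and 1.5")
        target["scanRate"] = scan_rate
    return target


update_kwargs = {
    "Name": "my-crawler",  # placeholder crawler name
    "Targets": {
        "S3Targets": [
            s3_target("s3://example-bucket/sales/", ["**.tmp"], sample_size=10)
        ],
        "DynamoDBTargets": [
            dynamodb_target("example-table", scan_all=False, scan_rate=0.5)
        ],
    },
    # Every day at 12:15 UTC, as in the Schedule description.
    "Schedule": "cron(15 12 * * ? *)",
    "SchemaChangePolicy": {
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
    "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_EVERYTHING"},
    # Configuration is passed as a JSON string; the Grouping option is an assumption.
    "Configuration": json.dumps(
        {"Version": 1.0, "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"}}
    ),
}
```

With a Glue client in hand, the request would be sent as `client.update_crawler(**update_kwargs)`. Validating ranges locally is a design choice that surfaces mistakes before the request is made; the service remains the authority on what it accepts.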

Return type:

dict

Returns:

Response Syntax

{}

Response Structure

  • (dict) –

Exceptions

  • Glue.Client.exceptions.InvalidInputException

  • Glue.Client.exceptions.VersionMismatchException

  • Glue.Client.exceptions.EntityNotFoundException

  • Glue.Client.exceptions.CrawlerRunningException

  • Glue.Client.exceptions.OperationTimeoutException
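The exceptions listed above are exposed on the client's `exceptions` attribute, so they can be caught without extra imports. The following is a minimal sketch: `try_update_crawler` and the status strings it returns are hypothetical conventions chosen for illustration, not part of boto3.

```python
def try_update_crawler(client, **kwargs):
    """Call update_crawler and map each documented exception to a status string."""
    errors = client.exceptions
    try:
        client.update_crawler(**kwargs)
    except errors.CrawlerRunningException:
        # The crawler must be stopped (StopCrawler) before it can be updated.
        return "running"
    except errors.EntityNotFoundException:
        return "not_found"
    except errors.InvalidInputException:
        return "invalid_input"
    except errors.VersionMismatchException:
        return "version_mismatch"
    except errors.OperationTimeoutException:
        return "timeout"
    return "updated"
```

A caller could branch on the result, for example stopping the crawler and trying again when the status is `running`.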