create_dataset

GlueDataBrew.Client.create_dataset(**kwargs)

Creates a new DataBrew dataset.

Request Syntax

response = client.create_dataset(
    Name='string',
    Format='CSV'|'JSON'|'PARQUET'|'EXCEL'|'ORC',
    FormatOptions={
        'Json': {
            'MultiLine': True|False
        },
        'Excel': {
            'SheetNames': [
                'string',
            ],
            'SheetIndexes': [
                123,
            ],
            'HeaderRow': True|False
        },
        'Csv': {
            'Delimiter': 'string',
            'HeaderRow': True|False
        }
    },
    Input={
        'S3InputDefinition': {
            'Bucket': 'string',
            'Key': 'string',
            'BucketOwner': 'string'
        },
        'DataCatalogInputDefinition': {
            'CatalogId': 'string',
            'DatabaseName': 'string',
            'TableName': 'string',
            'TempDirectory': {
                'Bucket': 'string',
                'Key': 'string',
                'BucketOwner': 'string'
            }
        },
        'DatabaseInputDefinition': {
            'GlueConnectionName': 'string',
            'DatabaseTableName': 'string',
            'TempDirectory': {
                'Bucket': 'string',
                'Key': 'string',
                'BucketOwner': 'string'
            },
            'QueryString': 'string'
        },
        'Metadata': {
            'SourceArn': 'string'
        }
    },
    PathOptions={
        'LastModifiedDateCondition': {
            'Expression': 'string',
            'ValuesMap': {
                'string': 'string'
            }
        },
        'FilesLimit': {
            'MaxFiles': 123,
            'OrderedBy': 'LAST_MODIFIED_DATE',
            'Order': 'DESCENDING'|'ASCENDING'
        },
        'Parameters': {
            'string': {
                'Name': 'string',
                'Type': 'Datetime'|'Number'|'String',
                'DatetimeOptions': {
                    'Format': 'string',
                    'TimezoneOffset': 'string',
                    'LocaleCode': 'string'
                },
                'CreateColumn': True|False,
                'Filter': {
                    'Expression': 'string',
                    'ValuesMap': {
                        'string': 'string'
                    }
                }
            }
        }
    },
    Tags={
        'string': 'string'
    }
)

Parameters:

Name (string) –
[REQUIRED]

The name of the dataset to be created. Valid characters are alphanumeric (A-Z, a-z, 0-9), hyphen (-), period (.), and space.
Format (string) – The file format of a dataset that is created from an Amazon S3 file or folder.
FormatOptions (dict) –
Represents a set of options that define the structure of either comma-separated value (CSV), Excel, or JSON input.
- Json (dict) –
  
  Options that define how JSON input is to be interpreted by DataBrew.
  - MultiLine (boolean) –
    
    A value that specifies whether JSON input contains embedded new line characters.
- Excel (dict) –
  
  Options that define how Excel input is to be interpreted by DataBrew.
  - SheetNames (list) –
    
    One or more named sheets in the Excel file that will be included in the dataset.
    - (string) –
  - SheetIndexes (list) –
    
    One or more sheet numbers in the Excel file that will be included in the dataset.
    - (integer) –
  - HeaderRow (boolean) –
    
    A variable that specifies whether the first row in the file is parsed as the header. If this value is false, column names are auto-generated.
- Csv (dict) –
  
  Options that define how CSV input is to be interpreted by DataBrew.
  - Delimiter (string) –
    
    A single character that specifies the delimiter being used in the CSV file.
  - HeaderRow (boolean) –
    
    A variable that specifies whether the first row in the file is parsed as the header. If this value is false, column names are auto-generated.
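
As a minimal sketch of the FormatOptions structure described above, the helpers below build the dict for CSV and Excel input. The helper names and the example values (the pipe delimiter, the sheet name "Sales") are hypothetical and only illustrate the shape of the structure.

```python
def csv_format_options(delimiter=",", header_row=True):
    """Build a FormatOptions dict for CSV input."""
    # The reference describes Delimiter as a single character.
    if len(delimiter) != 1:
        raise ValueError("Delimiter must be a single character")
    return {"Csv": {"Delimiter": delimiter, "HeaderRow": header_row}}


def excel_format_options(sheet_names, header_row=True):
    """Build a FormatOptions dict for Excel input, selecting sheets by name."""
    return {"Excel": {"SheetNames": list(sheet_names), "HeaderRow": header_row}}


# Hypothetical values, for illustration only.
pipe_csv = csv_format_options(delimiter="|")
sales_sheet = excel_format_options(["Sales"], header_row=False)
print(pipe_csv)
print(sales_sheet)
```

Either dict can be passed as the FormatOptions argument; with HeaderRow set to False, column names are auto-generated as noted above.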
Input (dict) –
[REQUIRED]

Represents information on how DataBrew can find data, in either the Glue Data Catalog or Amazon S3.
- S3InputDefinition (dict) –
  
  The Amazon S3 location where the data is stored.
  - Bucket (string) – [REQUIRED]
    
    The Amazon S3 bucket name.
  - Key (string) –
    
    The unique name of the object in the bucket.
  - BucketOwner (string) –
    
    The Amazon Web Services account ID of the bucket owner.
- DataCatalogInputDefinition (dict) –
  
  The Glue Data Catalog parameters for the data.
  - CatalogId (string) –
    
    The unique identifier of the Amazon Web Services account that holds the Data Catalog that stores the data.
  - DatabaseName (string) – [REQUIRED]
    
    The name of a database in the Data Catalog.
  - TableName (string) – [REQUIRED]
    
    The name of a database table in the Data Catalog. This table corresponds to a DataBrew dataset.
  - TempDirectory (dict) –
    
    Represents an Amazon S3 location where DataBrew can store intermediate results.
    - Bucket (string) – [REQUIRED]
      
      The Amazon S3 bucket name.
    - Key (string) –
      
      The unique name of the object in the bucket.
    - BucketOwner (string) –
      
      The Amazon Web Services account ID of the bucket owner.
- DatabaseInputDefinition (dict) –
  
  Connection information for dataset input files stored in a database.
  - GlueConnectionName (string) – [REQUIRED]
    
    The Glue Connection that stores the connection information for the target database.
  - DatabaseTableName (string) –
    
    The table within the target database.
  - TempDirectory (dict) –
    
    Represents an Amazon S3 location (bucket name, bucket owner, and object key) where DataBrew can read input data, or write output from a job.
    - Bucket (string) – [REQUIRED]
      
      The Amazon S3 bucket name.
    - Key (string) –
      
      The unique name of the object in the bucket.
    - BucketOwner (string) –
      
      The Amazon Web Services account ID of the bucket owner.
  - QueryString (string) –
    
    Custom SQL to run against the provided Glue connection. This SQL will be used as the input for DataBrew projects and jobs.
- Metadata (dict) –
  
  Contains additional resource information needed for specific datasets.
  - SourceArn (string) –
    
    The Amazon Resource Name (ARN) associated with the dataset. Currently, DataBrew only supports ARNs from Amazon AppFlow.
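
The Input structure above can be assembled as in the following sketch. The helper names, bucket names, keys, and database and table names are all hypothetical; only the dict keys come from the structure documented above.

```python
def s3_location(bucket, key=None, bucket_owner=None):
    """Build an S3 location dict; only Bucket is required."""
    location = {"Bucket": bucket}
    if key is not None:
        location["Key"] = key
    if bucket_owner is not None:
        location["BucketOwner"] = bucket_owner
    return location


def s3_input(bucket, key=None, bucket_owner=None):
    """Input that points at an Amazon S3 file or folder."""
    return {"S3InputDefinition": s3_location(bucket, key, bucket_owner)}


def data_catalog_input(database_name, table_name, catalog_id=None, temp_directory=None):
    """Input that points at a Glue Data Catalog table."""
    definition = {"DatabaseName": database_name, "TableName": table_name}
    if catalog_id is not None:
        definition["CatalogId"] = catalog_id
    if temp_directory is not None:
        definition["TempDirectory"] = temp_directory
    return {"DataCatalogInputDefinition": definition}


# Hypothetical resource names, for illustration only.
from_s3 = s3_input("example-bucket", key="raw/orders.csv")
from_catalog = data_catalog_input(
    "example_db",
    "orders",
    temp_directory=s3_location("example-bucket", key="tmp/"),
)
print(from_s3)
print(from_catalog)
```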
PathOptions (dict) –
A set of options that defines how DataBrew interprets an Amazon S3 path of the dataset.
- LastModifiedDateCondition (dict) –
  
  If provided, this structure defines a date range for matching Amazon S3 objects based on their LastModifiedDate attribute in Amazon S3.
  - Expression (string) – [REQUIRED]
    
    An expression made up of condition names followed by substitution variables, optionally grouped and combined with other conditions. For example, “(starts_with :prefix1 or starts_with :prefix2) and (ends_with :suffix1 or ends_with :suffix2)”. Substitution variables should start with the ‘:’ symbol.
  - ValuesMap (dict) – [REQUIRED]
    
    The map of substitution variable names to their values used in this filter expression.
    - (string) –
      - (string) –
- FilesLimit (dict) –
  
  If provided, this structure imposes a limit on the number of files that should be selected.
  - MaxFiles (integer) – [REQUIRED]
    
    The number of Amazon S3 files to select.
  - OrderedBy (string) –
    
    The criterion used to sort Amazon S3 files before they are selected. The default is LAST_MODIFIED_DATE, which is currently the only allowed value.
  - Order (string) –
    
    The order in which Amazon S3 files are sorted before they are selected. The default is DESCENDING, meaning the most recent files are selected first. The other possible value is ASCENDING.
- Parameters (dict) –
  
  A structure that maps names of parameters used in the Amazon S3 path of a dataset to their definitions.
  - (string) –
    - (dict) –
      
      Represents a dataset parameter that defines type and conditions for a parameter in the Amazon S3 path of the dataset.
      - Name (string) – [REQUIRED]
        
        The name of the parameter that is used in the dataset’s Amazon S3 path.
      - Type (string) – [REQUIRED]
        
        The type of the dataset parameter. Can be one of ‘String’, ‘Number’, or ‘Datetime’.
      - DatetimeOptions (dict) –
        
        Additional parameter options such as a format and a timezone. Required for datetime parameters.
        
        - Format (string) – [REQUIRED]
          
          The datetime format used for a date parameter in the Amazon S3 path. Use only supported datetime specifiers and separator characters; all literal a-z or A-Z characters should be escaped with single quotes, for example “MM.dd.yyyy-‘at’-HH:mm”.
        - TimezoneOffset (string) –
          
          Optional value for a timezone offset of the datetime parameter value in the Amazon S3 path. Shouldn’t be used if Format for this parameter includes timezone fields. If no offset is specified, UTC is assumed.
        - LocaleCode (string) –
          
          Optional value for a non-US locale code, needed for correct interpretation of some date formats.
      - CreateColumn (boolean) –
        
        Optional boolean value that defines whether the captured value of this parameter should be used to create a new column in a dataset.
      - Filter (dict) –
        
        The optional filter expression structure to apply additional matching criteria to the parameter.
        
        - Expression (string) – [REQUIRED]
          
          The expression which includes condition names followed by substitution variables, possibly grouped and combined with other conditions. For example, “(starts_with :prefix1 or starts_with :prefix2) and (ends_with :suffix1 or ends_with :suffix2)”. Substitution variables should start with ‘:’ symbol.
        - ValuesMap (dict) – [REQUIRED]
          
          The map of substitution variable names to their values used in this filter expression.
          - (string) –
            - (string) –
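
The PathOptions structure above might be filled in as in the sketch below. Several details are assumptions rather than statements from this reference: that each key of the Parameters map matches the parameter's Name, that ValuesMap keys include the leading ‘:’, and that the S3 key refers to parameters in curly braces (for example `{region}`). The parameter names, the "eu-" prefix, and the helper `missing_variables` are hypothetical.

```python
import re

path_options = {
    # Select at most the 10 most recently modified files.
    "FilesLimit": {
        "MaxFiles": 10,
        "OrderedBy": "LAST_MODIFIED_DATE",
        "Order": "DESCENDING",
    },
    "Parameters": {
        # Assumption: the map key repeats the parameter's Name.
        "region": {
            "Name": "region",
            "Type": "String",
            "CreateColumn": True,
            "Filter": {
                "Expression": "starts_with :prefix1",
                # Assumption: keys carry the leading ':' of the variable.
                "ValuesMap": {":prefix1": "eu-"},
            },
        },
        "day": {
            "Name": "day",
            "Type": "Datetime",
            # TimezoneOffset is omitted, so UTC is assumed.
            "DatetimeOptions": {"Format": "yyyy-MM-dd"},
        },
    },
}


def missing_variables(filter_dict):
    """Return substitution variables used in Expression but absent from ValuesMap."""
    used = set(re.findall(r":\w+", filter_dict["Expression"]))
    return sorted(used - set(filter_dict["ValuesMap"]))


region_filter = path_options["Parameters"]["region"]["Filter"]
print(missing_variables(region_filter))
```

A local check like `missing_variables` is optional; it only catches a mismatch between Expression and ValuesMap before the request is sent.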
Tags (dict) –
Metadata tags to apply to this dataset.
- (string) –
  - (string) –

Return type:

dict

Returns:

Response Syntax

{
    'Name': 'string'
}

Response Structure

(dict) –
- Name (string) –
  
  The name of the dataset that you created.
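
Putting the request and response together, the sketch below creates a CSV dataset from an Amazon S3 object and reads Name from the response. The function name, the local name check (based on the valid characters listed for Name), and the `RecordingClient` stand-in are hypothetical. The stand-in exists only so the example runs without AWS credentials.

```python
import re

# Valid characters for Name per the parameter description:
# alphanumeric, hyphen, period, and space.
_VALID_NAME = re.compile(r"[A-Za-z0-9.\- ]+")


def create_csv_dataset(client, name, bucket, key):
    """Create a CSV dataset from an S3 object and return the created name."""
    if not _VALID_NAME.fullmatch(name):
        raise ValueError(f"invalid dataset name: {name!r}")
    response = client.create_dataset(
        Name=name,
        Format="CSV",
        FormatOptions={"Csv": {"Delimiter": ",", "HeaderRow": True}},
        Input={"S3InputDefinition": {"Bucket": bucket, "Key": key}},
    )
    return response["Name"]


class RecordingClient:
    """Hypothetical offline stand-in that mimics the documented response shape."""

    def __init__(self):
        self.last_request = None

    def create_dataset(self, **kwargs):
        self.last_request = kwargs
        return {"Name": kwargs["Name"]}


stand_in = RecordingClient()
created = create_csv_dataset(stand_in, "orders-2024", "example-bucket", "raw/orders.csv")
print(created)
```

With real credentials, the same function can be called with a client obtained from `boto3.client("databrew")` in place of the stand-in.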

Exceptions

GlueDataBrew.Client.exceptions.AccessDeniedException
GlueDataBrew.Client.exceptions.ConflictException
GlueDataBrew.Client.exceptions.ServiceQuotaExceededException
GlueDataBrew.Client.exceptions.ValidationException
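
The exceptions listed above are available on the client's `exceptions` attribute and can be caught individually, as in this sketch. The status labels, the function name, and the `FakeClient` stand-in are hypothetical, and treating ConflictException as "name already in use" is an assumption made for the demonstration.

```python
from types import SimpleNamespace


def try_create_dataset(client, **request):
    """Call create_dataset and map the documented exceptions to a status tuple."""
    errors = client.exceptions
    try:
        return ("created", client.create_dataset(**request)["Name"])
    except errors.ConflictException as err:
        return ("conflict", str(err))
    except errors.ValidationException as err:
        return ("invalid", str(err))
    except (errors.AccessDeniedException, errors.ServiceQuotaExceededException) as err:
        return ("rejected", str(err))


_EXCEPTION_NAMES = (
    "AccessDeniedException",
    "ConflictException",
    "ServiceQuotaExceededException",
    "ValidationException",
)


class FakeClient:
    """Hypothetical offline stand-in; a real client comes from boto3.client("databrew")."""

    exceptions = SimpleNamespace(
        **{name: type(name, (Exception,), {}) for name in _EXCEPTION_NAMES}
    )

    def __init__(self):
        self._names = set()

    def create_dataset(self, **request):
        # Assumption for the demo: a repeated name raises ConflictException.
        if request["Name"] in self._names:
            raise self.exceptions.ConflictException("dataset already exists")
        self._names.add(request["Name"])
        return {"Name": request["Name"]}


fake = FakeClient()
request = {"Name": "orders", "Input": {"S3InputDefinition": {"Bucket": "example-bucket"}}}
first_result = try_create_dataset(fake, **request)
second_result = try_create_dataset(fake, **request)
print(first_result)
print(second_result)
```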