create_dataset
- GlueDataBrew.Client.create_dataset(**kwargs)
- Creates a new DataBrew dataset.
- See also: AWS API Documentation
- Request Syntax

    response = client.create_dataset(
        Name='string',
        Format='CSV'|'JSON'|'PARQUET'|'EXCEL'|'ORC',
        FormatOptions={
            'Json': {
                'MultiLine': True|False
            },
            'Excel': {
                'SheetNames': [
                    'string',
                ],
                'SheetIndexes': [
                    123,
                ],
                'HeaderRow': True|False
            },
            'Csv': {
                'Delimiter': 'string',
                'HeaderRow': True|False
            }
        },
        Input={
            'S3InputDefinition': {
                'Bucket': 'string',
                'Key': 'string',
                'BucketOwner': 'string'
            },
            'DataCatalogInputDefinition': {
                'CatalogId': 'string',
                'DatabaseName': 'string',
                'TableName': 'string',
                'TempDirectory': {
                    'Bucket': 'string',
                    'Key': 'string',
                    'BucketOwner': 'string'
                }
            },
            'DatabaseInputDefinition': {
                'GlueConnectionName': 'string',
                'DatabaseTableName': 'string',
                'TempDirectory': {
                    'Bucket': 'string',
                    'Key': 'string',
                    'BucketOwner': 'string'
                },
                'QueryString': 'string'
            },
            'Metadata': {
                'SourceArn': 'string'
            }
        },
        PathOptions={
            'LastModifiedDateCondition': {
                'Expression': 'string',
                'ValuesMap': {
                    'string': 'string'
                }
            },
            'FilesLimit': {
                'MaxFiles': 123,
                'OrderedBy': 'LAST_MODIFIED_DATE',
                'Order': 'DESCENDING'|'ASCENDING'
            },
            'Parameters': {
                'string': {
                    'Name': 'string',
                    'Type': 'Datetime'|'Number'|'String',
                    'DatetimeOptions': {
                        'Format': 'string',
                        'TimezoneOffset': 'string',
                        'LocaleCode': 'string'
                    },
                    'CreateColumn': True|False,
                    'Filter': {
                        'Expression': 'string',
                        'ValuesMap': {
                            'string': 'string'
                        }
                    }
                }
            }
        },
        Tags={
            'string': 'string'
        }
    )

- Parameters:
- Name (string) – [REQUIRED] The name of the dataset to be created. Valid characters are alphanumeric (A-Z, a-z, 0-9), hyphen (-), period (.), and space.
- Format (string) – The file format of a dataset that is created from an Amazon S3 file or folder. 
- FormatOptions (dict) – Represents a set of options that define the structure of either comma-separated value (CSV), Excel, or JSON input.
  - Json (dict) – Options that define how JSON input is to be interpreted by DataBrew.
    - MultiLine (boolean) – A value that specifies whether JSON input contains embedded new line characters.
  - Excel (dict) – Options that define how Excel input is to be interpreted by DataBrew.
    - SheetNames (list) – One or more named sheets in the Excel file that will be included in the dataset.
      - (string) –
    - SheetIndexes (list) – One or more sheet numbers in the Excel file that will be included in the dataset.
      - (integer) –
    - HeaderRow (boolean) – A variable that specifies whether the first row in the file is parsed as the header. If this value is false, column names are auto-generated.
  - Csv (dict) – Options that define how CSV input is to be interpreted by DataBrew.
    - Delimiter (string) – A single character that specifies the delimiter being used in the CSV file.
    - HeaderRow (boolean) – A variable that specifies whether the first row in the file is parsed as the header. If this value is false, column names are auto-generated.
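The FormatOptions structure described above can be assembled with small helpers. The following is a minimal sketch, not part of the boto3 API: the function names `csv_format_options`, `json_format_options`, and `excel_format_options` are hypothetical, and the single-character check simply mirrors the Delimiter description.

```python
def csv_format_options(delimiter=",", header_row=True):
    """Return a FormatOptions dict for CSV input (hypothetical helper)."""
    # The reference describes Delimiter as a single character, so reject anything else.
    if len(delimiter) != 1:
        raise ValueError("Delimiter must be a single character")
    return {"Csv": {"Delimiter": delimiter, "HeaderRow": header_row}}


def json_format_options(multi_line=False):
    """Return a FormatOptions dict for JSON input (hypothetical helper)."""
    return {"Json": {"MultiLine": multi_line}}


def excel_format_options(sheet_names, header_row=True):
    """Return a FormatOptions dict that selects Excel sheets by name (hypothetical helper)."""
    return {"Excel": {"SheetNames": list(sheet_names), "HeaderRow": header_row}}
```

The returned dict is meant to be passed as the `FormatOptions` keyword argument of `create_dataset`, alongside a matching `Format` value such as `'CSV'`.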
 
 
- Input (dict) – [REQUIRED] Represents information on how DataBrew can find data, in either the Glue Data Catalog or Amazon S3.
  - S3InputDefinition (dict) – The Amazon S3 location where the data is stored.
    - Bucket (string) – [REQUIRED] The Amazon S3 bucket name.
    - Key (string) – The unique name of the object in the bucket.
    - BucketOwner (string) – The Amazon Web Services account ID of the bucket owner.
  - DataCatalogInputDefinition (dict) – The Glue Data Catalog parameters for the data.
    - CatalogId (string) – The unique identifier of the Amazon Web Services account that holds the Data Catalog that stores the data.
    - DatabaseName (string) – [REQUIRED] The name of a database in the Data Catalog.
    - TableName (string) – [REQUIRED] The name of a database table in the Data Catalog. This table corresponds to a DataBrew dataset.
    - TempDirectory (dict) – Represents an Amazon S3 location where DataBrew can store intermediate results.
      - Bucket (string) – [REQUIRED] The Amazon S3 bucket name.
      - Key (string) – The unique name of the object in the bucket.
      - BucketOwner (string) – The Amazon Web Services account ID of the bucket owner.
  - DatabaseInputDefinition (dict) – Connection information for dataset input files stored in a database.
    - GlueConnectionName (string) – [REQUIRED] The Glue Connection that stores the connection information for the target database.
    - DatabaseTableName (string) – The table within the target database.
    - TempDirectory (dict) – Represents an Amazon S3 location (bucket name, bucket owner, and object key) where DataBrew can read input data, or write output from a job.
      - Bucket (string) – [REQUIRED] The Amazon S3 bucket name.
      - Key (string) – The unique name of the object in the bucket.
      - BucketOwner (string) – The Amazon Web Services account ID of the bucket owner.
    - QueryString (string) – Custom SQL to run against the provided Glue connection. This SQL will be used as the input for DataBrew projects and jobs.
  - Metadata (dict) – Contains additional resource information needed for specific datasets.
    - SourceArn (string) – The Amazon Resource Name (ARN) associated with the dataset. Currently, DataBrew only supports ARNs from Amazon AppFlow.
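To illustrate the Input structure, here is a rough sketch of two builders, one for an Amazon S3 source and one for a Glue Data Catalog table. The helper names `s3_input` and `data_catalog_input` are hypothetical; only the dictionary keys come from the reference above. Optional keys are left out when no value is supplied.

```python
def s3_location(bucket, key=None, bucket_owner=None):
    """Build an S3 location dict; only Bucket is marked required in the reference."""
    location = {"Bucket": bucket}
    if key is not None:
        location["Key"] = key
    if bucket_owner is not None:
        location["BucketOwner"] = bucket_owner
    return location


def s3_input(bucket, key=None, bucket_owner=None):
    """Build an Input dict that points at an Amazon S3 file or folder (hypothetical helper)."""
    return {"S3InputDefinition": s3_location(bucket, key, bucket_owner)}


def data_catalog_input(database_name, table_name, catalog_id=None, temp_bucket=None):
    """Build an Input dict that points at a Glue Data Catalog table (hypothetical helper)."""
    definition = {"DatabaseName": database_name, "TableName": table_name}
    if catalog_id is not None:
        definition["CatalogId"] = catalog_id
    if temp_bucket is not None:
        # TempDirectory is where DataBrew can store intermediate results.
        definition["TempDirectory"] = s3_location(temp_bucket)
    return {"DataCatalogInputDefinition": definition}
```

Either result can be passed as the `Input` keyword argument of `create_dataset`.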
 
 
- PathOptions (dict) – A set of options that defines how DataBrew interprets the Amazon S3 path of the dataset.
  - LastModifiedDateCondition (dict) – If provided, this structure defines a date range for matching Amazon S3 objects based on their LastModifiedDate attribute in Amazon S3.
    - Expression (string) – [REQUIRED] The expression, which includes condition names followed by substitution variables, possibly grouped and combined with other conditions. For example, “(starts_with :prefix1 or starts_with :prefix2) and (ends_with :suffix1 or ends_with :suffix2)”. Substitution variables should start with the ‘:’ symbol.
    - ValuesMap (dict) – [REQUIRED] The map of substitution variable names to their values used in this filter expression.
      - (string) –
        - (string) –
  - FilesLimit (dict) – If provided, this structure limits the number of files that are selected.
    - MaxFiles (integer) – [REQUIRED] The number of Amazon S3 files to select.
    - OrderedBy (string) – The criterion used to sort Amazon S3 files before they are selected. The default, LAST_MODIFIED_DATE, is currently the only allowed value.
    - Order (string) – The order in which Amazon S3 files are sorted before they are selected. The default is DESCENDING, so the most recent files are selected first. The other possible value is ASCENDING.
  - Parameters (dict) – A structure that maps names of parameters used in the Amazon S3 path of a dataset to their definitions.
    - (string) –
      - (dict) – Represents a dataset parameter that defines the type and conditions for a parameter in the Amazon S3 path of the dataset.
        - Name (string) – [REQUIRED] The name of the parameter that is used in the dataset’s Amazon S3 path.
        - Type (string) – [REQUIRED] The type of the dataset parameter. Can be one of ‘String’, ‘Number’, or ‘Datetime’.
        - DatetimeOptions (dict) – Additional parameter options such as a format and a timezone. Required for datetime parameters.
          - Format (string) – [REQUIRED] The datetime format used for a date parameter in the Amazon S3 path. It should use only supported datetime specifiers and separation characters, and all literal a-z or A-Z characters should be escaped with single quotes. For example, “MM.dd.yyyy-‘at’-HH:mm”.
          - TimezoneOffset (string) – Optional value for a timezone offset of the datetime parameter value in the Amazon S3 path. Shouldn’t be used if Format for this parameter includes timezone fields. If no offset is specified, UTC is assumed.
          - LocaleCode (string) – Optional value for a non-US locale code, needed for correct interpretation of some date formats.
        - CreateColumn (boolean) – Optional boolean value that defines whether the captured value of this parameter should be used to create a new column in a dataset.
        - Filter (dict) – The optional filter expression structure to apply additional matching criteria to the parameter.
          - Expression (string) – [REQUIRED] The expression, which includes condition names followed by substitution variables, possibly grouped and combined with other conditions. For example, “(starts_with :prefix1 or starts_with :prefix2) and (ends_with :suffix1 or ends_with :suffix2)”. Substitution variables should start with the ‘:’ symbol.
          - ValuesMap (dict) – [REQUIRED] The map of substitution variable names to their values used in this filter expression.
            - (string) –
              - (string) –
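The PathOptions structure is deeply nested, so a concrete value may help. The sketch below combines a FilesLimit with one String parameter that carries a Filter. Several details are assumptions rather than facts from this page: the helper name `newest_files_path_options`, the parameter name `region`, the use of the same string for the Parameters map key and the `Name` field, and the leading colon on the ValuesMap key (chosen to match the substitution variable as written in the expression).

```python
def newest_files_path_options(max_files, region_prefix):
    """Build a PathOptions dict (hypothetical helper).

    Selects the `max_files` most recently modified files and keeps only paths
    whose `region` parameter value starts with `region_prefix`.
    """
    if max_files < 1:
        raise ValueError("max_files must be a positive integer")
    return {
        "FilesLimit": {
            "MaxFiles": max_files,
            "OrderedBy": "LAST_MODIFIED_DATE",  # the only allowed value per the reference
            "Order": "DESCENDING",              # most recent files first
        },
        "Parameters": {
            "region": {                         # assumed: map key matches Name
                "Name": "region",
                "Type": "String",
                "CreateColumn": True,           # expose the captured value as a column
                "Filter": {
                    "Expression": "starts_with :prefix",
                    "ValuesMap": {":prefix": region_prefix},  # assumed key form
                },
            }
        },
    }
```

The `starts_with` condition name is taken from the example expression in the parameter description; other condition names are not listed on this page and are deliberately not used here.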
 
 
 
 
 
 
 
- Tags (dict) – Metadata tags to apply to this dataset.
  - (string) –
    - (string) –
 
 
 
- Return type:
- dict 
- Returns:
- Response Syntax

    {
        'Name': 'string'
    }

- Response Structure
  - (dict) –
    - Name (string) – The name of the dataset that you created.
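As a minimal end-to-end sketch of the request and response shapes above, the function below builds a CSV dataset request for an Amazon S3 object and returns the `Name` from the response. The function name `create_csv_dataset` and its arguments are hypothetical; `client` is assumed to be a DataBrew client, or any object that exposes a compatible `create_dataset` method.

```python
def create_csv_dataset(client, name, bucket, key, tags=None):
    """Create a CSV dataset from an S3 object and return the created dataset's name."""
    request = {
        "Name": name,
        "Format": "CSV",
        "FormatOptions": {"Csv": {"Delimiter": ",", "HeaderRow": True}},
        "Input": {"S3InputDefinition": {"Bucket": bucket, "Key": key}},
    }
    if tags:
        # Tags is an optional map of string keys to string values.
        request["Tags"] = dict(tags)
    response = client.create_dataset(**request)
    # The response structure contains a single key, Name.
    return response["Name"]
```

With boto3 installed and credentials configured, a real client can be obtained with `boto3.client('databrew')` and passed as `client`.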
 
 
- Exceptions
  - GlueDataBrew.Client.exceptions.AccessDeniedException
  - GlueDataBrew.Client.exceptions.ConflictException
  - GlueDataBrew.Client.exceptions.ServiceQuotaExceededException
  - GlueDataBrew.Client.exceptions.ValidationException
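The exceptions listed above are exposed as attributes of `client.exceptions`, so each can be caught individually. The following sketch maps each one to a short label instead of letting it propagate. The function name `try_create_dataset` and the label strings are hypothetical, and what each exception means for a given request is not stated on this page, so the caller has to decide how to react.

```python
def try_create_dataset(client, **request):
    """Call create_dataset and return (name, None) or (None, error_label)."""
    errors = client.exceptions
    try:
        response = client.create_dataset(**request)
    except errors.ConflictException:
        return None, "conflict"
    except errors.ValidationException:
        return None, "validation"
    except errors.AccessDeniedException:
        return None, "access_denied"
    except errors.ServiceQuotaExceededException:
        return None, "quota_exceeded"
    return response["Name"], None
```

Any other error, such as a network failure, is intentionally not caught and propagates to the caller.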