CREATE TABLE

Last updated: 2024-08-07 17:12:15

    Description

    Supported engines: Presto and SparkSQL
    Applicable table: Native Iceberg tables and external tables
    Purpose: Create a table with the specified attributes. CREATE TABLE AS is also supported.
    Storage path for the created table: The path can be a COS directory but not a file.

    Statement for External Table

    Statement

    CREATE TABLE [ IF NOT EXISTS ] table_identifier
    ( col_name[:] col_type [ COMMENT col_comment ], ... )
    USING data_source
    [ COMMENT table_comment ]
    [ OPTIONS ( 'key1'='value1', 'key2'='value2' )]
    [ PARTITIONED BY ( col_name1, transform(col_name2), ... ) ]
    [ LOCATION path ]
    [ TBLPROPERTIES ( property_name=property_value, ... ) ]

    Parameter

    USING data_source: Data source used during table creation. Currently, data_source can be CSV, ORC, PARQUET, or ICEBERG.
    table_identifier: Table name in the three-part format. Example: catalog.database.table.
    COMMENT: Table description.
    OPTIONS: Additional parameters for USING data_source, used to pass dynamic values during storage.
    PARTITIONED BY: Create partitions based on the specified columns.
    LOCATION path: Storage path of the data table.
    TBLPROPERTIES: A set of key-value pairs used to specify table parameters.
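
    For reference, a minimal sketch of an external CSV table that uses every optional clause above. The catalog, database, table, and COS bucket names are placeholders, not values from this document:

    -- all identifiers and the COS path below are hypothetical placeholders
    CREATE TABLE IF NOT EXISTS my_catalog.my_db.orders_csv(
    id bigint COMMENT 'order id',
    amount double,
    dt string
    )
    USING csv
    COMMENT 'orders exported as CSV'
    OPTIONS ('header'='true', 'sep'=',')
    PARTITIONED BY (dt)
    LOCATION 'cosn://examplebucket-1250000000/warehouse/my_db/orders_csv'
    TBLPROPERTIES ('note'='demo');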

    Detailed Description of USING and OPTIONS

    USING CSV

    Reference: Working with CSV.
    Configurations supported by CSV data tables are as follows (OPTIONS key, default value, description):
    sep or delimiter (default: ","): Delimiter between columns in CSV files. A comma is used by default.
    mode (default: PERMISSIVE): Processing mode when data conversion does not meet expectations.
    PERMISSIVE: The most relaxed mode and the default. If a row has extra columns after conversion, only the required columns are kept.
    DROPMALFORMED: Discard data that does not meet expectations. For example, a row with an extra column is discarded.
    FAILFAST: Strictly follow the CSV format. The conversion fails if the result does not meet expectations, such as a row with extra columns.
    encoding or charset (default: UTF-8): String encoding scheme. Examples: UTF-8, US-ASCII, ISO-8859-1, UTF-16BE, UTF-16LE, and UTF-16.
    quote (default: "): Quotation character used to enclose values. Remember to use an escape character when it appears inside values.
    escape (default: \): Escape character.
    charToEscapeQuoteEscaping (default: none): Escape character required inside quoted values.
    comment (default: \u0000): Character that marks lines to skip as comments.
    header (default: false): Whether a header row exists.
    inferSchema (default: false): Whether to infer the data type of each column. If inference is not performed, every column is read as string.
    ignoreLeadingWhiteSpace (default: false for read, true for write): Ignore leading whitespace.
    ignoreTrailingWhiteSpace (default: false for read, true for write): Ignore trailing whitespace.
    columnNameOfCorruptRecord (default: _corrupt_record): Name of the column that stores records that cannot be converted. If the value is inconsistent with spark.sql.columnNameOfCorruptRecord, the value specified in the CREATE TABLE statement prevails.
    nullValue (default: empty string): Storage format for null values. Null values are stored as empty strings by default, in which case the format defined by emptyValue applies.
    nanValue (default: NaN): Storage format for non-numeric values.
    positiveInf (default: Inf): Storage format for positive infinity values.
    negativeInf (default: -Inf): Storage format for negative infinity values.
    compression or codec (default: none): Name of the compression algorithm. No compression is performed by default. Abbreviations such as bzip2, deflate, gzip, lz4, and snappy can be used.
    timeZone (default: system time zone): Default time zone. If the value is inconsistent with spark.sql.session.timeZone (example: Asia/Shanghai), the value specified in the CREATE TABLE statement prevails.
    locale (default: en-US): Locale.
    dateFormat (default: yyyy-MM-dd): Default date format.
    timestampFormat (default: yyyy-MM-dd'T'HH:mm:ss.SSSXXX): Default timestamp format, which is yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX] in non-LEGACY mode.
    multiLine (default: false): Allow a record to span multiple lines.
    maxColumns (default: 20480): Maximum number of columns.
    maxCharsPerColumn (default: -1): Maximum number of characters per column. -1 indicates no limit.
    escapeQuotes (default: true): Escape quotation marks.
    quoteAll (default: false): Enclose all values in quotation marks.
    samplingRatio (default: 1.0): Sampling ratio.
    enforceSchema (default: true): Forcibly apply the specified schema when reading and ignore the file headers.
    emptyValue (default: empty string for read, "" for write): Read and write formats for empty values.
    lineSep (default: none): Line separator.
    inputBufferSize (default: none): Buffer size for reads. If the value is inconsistent with spark.sql.csv.parser.inputBufferSize, the value specified in the CREATE TABLE statement prevails.
    unescapedQuoteHandling (default: STOP_AT_DELIMITER): Policy for handling unescaped quotation marks.
    STOP_AT_DELIMITER: Stop reading when a delimiter is found.
    BACK_TO_DELIMITER: Go back to the delimiter.
    STOP_AT_CLOSING_QUOTE: Stop reading when a closing quotation mark is found.
    SKIP_VALUE: Skip this column of data.
    RAISE_ERROR: Report an error.
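
    As an illustration of the options above, a hedged sketch of a CSV table that overrides the delimiter, header handling, malformed-record mode, timestamp format, and compression. The table name and COS path are hypothetical:

    -- table name and COS path are hypothetical placeholders
    CREATE TABLE my_catalog.my_db.events_csv(
    event_id bigint,
    event_time timestamp,
    payload string
    )
    USING csv
    OPTIONS (
    'sep'='|',
    'header'='true',
    'mode'='DROPMALFORMED',
    'timestampFormat'='yyyy-MM-dd HH:mm:ss',
    'compression'='gzip'
    )
    LOCATION 'cosn://examplebucket-1250000000/warehouse/my_db/events_csv';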
    
    USING ORC

    Configurations supported by ORC data tables are as follows (OPTIONS key, default value, description):
    compression or orc.compress (default: snappy): Compression algorithm. Abbreviations such as snappy, zlib, lzo, lz4, and zstd are supported. If the value is inconsistent with spark.sql.orc.compression.codec, the value specified in the CREATE TABLE statement prevails.
    mergeSchema (default: false): Whether to merge schemas. If the value is inconsistent with spark.sql.orc.mergeSchema, the value specified in the CREATE TABLE statement prevails.
    If you use the Hive reader and writer (by setting spark.sql.hive.convertMetastoreOrc=false) for read and write, OPTIONS can also accept native ORC configurations. For details, see LanguageManual ORC.
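
    For instance, a hedged sketch of an ORC table that switches the compression codec (the table name and COS path are hypothetical):

    -- table name and COS path are hypothetical placeholders
    CREATE TABLE my_catalog.my_db.events_orc(
    event_id bigint,
    payload string
    )
    USING orc
    OPTIONS ('compression'='zlib', 'mergeSchema'='false')
    LOCATION 'cosn://examplebucket-1250000000/warehouse/my_db/events_orc';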
    USING PARQUET

    Most of the parameters related to PARQUET data tables can be configured using SparkConf (recommended). Configurations supported by OPTIONS are as follows (OPTIONS key, default value, description):
    compression or parquet.compression (default: snappy): Compression algorithm; snappy is used by default. If the value is inconsistent with spark.sql.parquet.compression.codec, the value specified in the CREATE TABLE statement prevails.
    mergeSchema (default: false): Whether to merge schemas. If the value is inconsistent with spark.sql.parquet.mergeSchema, the value specified in the CREATE TABLE statement prevails.
    datetimeRebaseMode (default: EXCEPTION): Policy for date conversion during PARQUET file read. Dates are converted based on the Gregorian calendar in LEGACY mode but not converted in CORRECTED mode. In EXCEPTION mode, an error is reported when dates in different calendars are encountered. If the value is inconsistent with spark.sql.parquet.datetimeRebaseModeInRead, the value specified in the CREATE TABLE statement prevails.
    int96RebaseMode (default: EXCEPTION): Policy for time conversion during PARQUET file read. Time values are converted based on the Gregorian calendar in LEGACY mode but not converted in CORRECTED mode. In EXCEPTION mode, an error is reported when time values in different calendars are encountered. If the value is inconsistent with spark.sql.parquet.int96RebaseModeInRead, the value specified in the CREATE TABLE statement prevails.
    If you use the Hive reader and writer (by setting spark.sql.hive.convertMetastoreParquet to false) for read and write, OPTIONS can also accept native PARQUET configurations. Reference: Hadoop integration.
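
    For instance, a hedged sketch of a PARQUET table that overrides the compression codec and schema merging (the table name and COS path are hypothetical):

    -- table name and COS path are hypothetical placeholders
    CREATE TABLE my_catalog.my_db.events_parquet(
    event_id bigint,
    event_time timestamp,
    payload string
    )
    USING parquet
    OPTIONS ('compression'='gzip', 'mergeSchema'='true')
    LOCATION 'cosn://examplebucket-1250000000/warehouse/my_db/events_parquet';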

    Example

    CREATE TABLE dempts(
    id bigint COMMENT 'id number',
    num int,
    eno float,
    dno double,
    cno decimal(9,3),
    flag boolean,
    data string,
    ts_year timestamp,
    date_month date,
    bno binary,
    point struct<x: double, y: double>,
    points array<struct<x: double, y: double>>,
    pointmaps map<struct<x: int>, struct<a: int>>
    )
    USING iceberg
    COMMENT 'table documentation'
    PARTITIONED BY (bucket(16,id), years(ts_year), months(date_month), identity(bno), bucket(3,num), truncate(10,data))
    LOCATION '/warehouse/db_001/dempts'
    TBLPROPERTIES ('write.format.default'='orc');
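
    As noted in the Description, CREATE TABLE AS is also supported. A minimal hedged sketch that materializes a query result into a new table (the target and source table names are hypothetical; the source is assumed to have the same schema as the example above):

    -- target and source table names are hypothetical placeholders
    CREATE TABLE my_catalog.my_db.dempts_copy
    USING iceberg
    PARTITIONED BY (years(ts_year))
    AS SELECT id, data, ts_year FROM my_catalog.my_db.dempts;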

    Notes

    Spark's USING and Hive's STORED AS behave differently as CREATE TABLE keywords, so the file format and read method of the created table may not meet expectations (see the sketch after these notes). Note:
    USING DATA_SOURCE: Used in Spark statements. This keyword indicates the data source used as the input format for table creation and directly affects the file format and read method of the files under the table's location. The value can be CSV, TXT, Iceberg, Parquet, Orc, etc.
    STORED AS FILE_FORMAT: Used in Hive statements to specify the file format for table storage. The value can be TXT, Parquet, Orc, etc. This keyword is not recommended because the native reader/writer of Spark may not support the specified file format, such as CSV.
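
    To make the contrast concrete, a hedged side-by-side sketch (table names are hypothetical):

    -- Spark data source syntax: files under the table location are written and read as Parquet by Spark's native reader/writer
    CREATE TABLE my_db.t_spark(id bigint, name string)
    USING parquet;

    -- Hive syntax: STORED AS only declares the storage file format; Spark's native reader/writer may not support every format declared this way (such as CSV)
    CREATE TABLE my_db.t_hive(id bigint, name string)
    STORED AS parquet;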

    Statement for Native Iceberg Table

    Caution
    This statement is supported for creating native tables only.

    Statement

    CREATE TABLE [ IF NOT EXISTS ] table_identifier
    ( col_name[:] col_type [ COMMENT col_comment ], ... )
    [ COMMENT table_comment ]
    [ PARTITIONED BY ( col_name1, transform(col_name2), ... ) ]

    Parameter

    table_identifier: Table name in the three-part format. Example: catalog.db.name.

    Schemas and Data Types

    col_type
    : primitive_type
    | nested_type
    
    primitive_type
    : boolean
    | int/integer
    | long/bigint
    | float
    | double
    | decimal(p,s), p=maximum number of digits, s=maximum number of decimal places, s<=p<=38
    | date
    | timestamp, which corresponds to timestamp with time zone. The time type and timestamp without time zone are not supported.
    | string, which corresponds to the UUID type of Iceberg tables.
    | binary, which corresponds to the FIXED type of Iceberg tables.
    
    nested_type
    : struct
    | list
    | map

    Partition Transforms

    transform
    : identity, supports any type. DLC does not support this transform.
    | bucket[N], partition by hashed value mod N buckets, supporting col_type: int,long, decimal, date, timestamp, string, binary
    | truncate[L], partition by value truncated to L, supporting col_type: int,long,decimal,string
    | years, partition by year, supporting col_type: date,timestamp
    | months, partition by month, supporting col_type: date,timestamp
    | days/date, partition by date, supporting col_type: date,timestamp
    | hours/date_hour, partition by hour, supporting col_type: timestamp

    Example

    CREATE TABLE dempts(
    id bigint COMMENT 'id number',
    num int,
    eno float,
    dno double,
    cno decimal(9,3),
    flag boolean,
    data string,
    ts_year timestamp,
    date_month date,
    bno binary,
    point struct<x: double, y: double>,
    points array<struct<x: double, y: double>>,
    pointmaps map<struct<x: int>, struct<a: int>>
    )
    COMMENT 'table documentation'
    PARTITIONED BY (bucket(16,id), years(ts_year), months(date_month), identity(bno), bucket(3,num), truncate(10,data));
    