Data Lake Compute
CREATE TABLE
Last updated: 2024-08-07 17:12:15

Description

Supported engines: Presto and SparkSQL
Applicable table: Native Iceberg tables and external tables
Purpose: Create a table with the specified columns and attributes. CREATE TABLE AS SELECT (CTAS) can also be used.
Storage path for the created table: The path can be a COS directory but not a file.
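
For reference, a minimal CREATE TABLE AS sketch is shown below. The catalog, database, and table names are placeholders and assume an existing source table.

-- Hypothetical names: DataLakeCatalog.demo_db.orders_copy and DataLakeCatalog.demo_db.src_orders.
-- Creates an Iceberg table and populates it with the query result in one statement.
CREATE TABLE DataLakeCatalog.demo_db.orders_copy
USING iceberg
AS SELECT * FROM DataLakeCatalog.demo_db.src_orders;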

Statement for External Table

Statement

CREATE TABLE [ IF NOT EXISTS ] table_identifier
( col_name[:] col_type [ COMMENT col_comment ], ... )
USING data_source
[ COMMENT table_comment ]
[ OPTIONS ( 'key1'='value1', 'key2'='value2' )]
[ PARTITIONED BY ( col_name1, transform(col_name2), ... ) ]
[ LOCATION path ]
[ TBLPROPERTIES ( property_name=property_value, ... ) ]

Parameter

table_identifier: Table name in the three-part format. Example: catalog.database.table.
USING data_source: Data source used during table creation. Currently, data_source can be CSV, ORC, PARQUET, or ICEBERG.
COMMENT: Table description.
OPTIONS: Additional parameters of USING data_source, which are used to pass dynamic values during storage.
PARTITIONED BY: Create partitions based on the specified columns.
LOCATION path: Storage path of the data table.
TBLPROPERTIES: A set of key-value pairs used to specify table parameters.

Detailed Description of USING and OPTIONS

USING CSV
Reference: Working with CSV.
Configurations supported by CSV data tables are as follows:
sep or delimiter (default: ,): Delimiter between columns for storage in CSV files. The comma is used by default.
mode (default: PERMISSIVE): Processing mode when data conversion does not meet expectations.
PERMISSIVE: A relaxed mode, which is the default. If extra columns exist after the data conversion of a certain row, only the required columns are kept automatically.
DROPMALFORMED: Discard data that does not meet expectations. For example, if a row has an extra column, the row is discarded.
FAILFAST: Strictly follow the CSV format. The conversion fails if the result does not meet expectations, for example, when a row has extra columns.
encoding or charset (default: UTF-8): String encoding scheme. Examples: UTF-8, US-ASCII, ISO-8859-1, UTF-16BE, UTF-16LE, and UTF-16.
quote (default: "): Quotation character used to enclose values. Remember to use an escape character when specifying it.
escape (default: \): Escape character.
charToEscapeQuoteEscaping (default: -): Character used to escape the quote-escaping character inside quoted values.
comment (default: \u0000): Character that marks lines to be skipped as comments.
header (default: false): Indicate whether a header row exists.
inferSchema (default: false): Indicate whether to infer the data type of each column. If the inference is not performed, the data type of each column is string.
ignoreLeadingWhiteSpace (default: false for read, true for write): Ignore leading whitespace.
ignoreTrailingWhiteSpace (default: false for read, true for write): Ignore trailing whitespace.
columnNameOfCorruptRecord (default: _corrupt_record): Name of the column that stores records that cannot be converted. If the value is inconsistent with that of spark.sql.columnNameOfCorruptRecord, the value specified in the CREATE TABLE statement prevails.
nullValue (default: empty string): Storage format for null values. When it is an empty string, the format defined by emptyValue is used.
nanValue (default: NaN): Storage format for non-numeric values.
positiveInf (default: Inf): Storage format for positive infinity values.
negativeInf (default: -Inf): Storage format for negative infinity values.
compression or codec (default: none): Name of the compression algorithm. No compression is performed by default. Names such as bzip2, deflate, gzip, lz4, and snappy can be used.
timeZone (default: system time zone): Default time zone. If the value is inconsistent with that of spark.sql.session.timeZone (example: Asia/Shanghai), the value specified in the CREATE TABLE statement prevails.
locale (default: en-US): Language locale.
dateFormat (default: yyyy-MM-dd): Default date format.
timestampFormat (default: yyyy-MM-dd'T'HH:mm:ss.SSSXXX): Default timestamp format, which is yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX] in non-LEGACY mode.
multiLine (default: false): Allow a record to span multiple lines.
maxColumns (default: 20480): Maximum number of columns.
maxCharsPerColumn (default: -1): Maximum number of characters per column. -1 indicates that the number is not limited.
escapeQuotes (default: true): Indicate whether to escape quotation marks within values.
quoteAll (default: false): Enclose all values in quotation marks.
samplingRatio (default: 1.0): Sampling ratio used for schema inference.
enforceSchema (default: true): Forcibly apply the specified schema for reading and ignore the headers in the CSV files.
emptyValue (default: empty string for read, "" for write): Read and write formats for empty values.
lineSep (default: -): Line separator.
inputBufferSize (default: -): Buffer size for reads. If the value is inconsistent with that of spark.sql.csv.parser.inputBufferSize, the value specified in the CREATE TABLE statement prevails.
unescapedQuoteHandling (default: STOP_AT_DELIMITER): Policy for handling unescaped quotation marks.
STOP_AT_DELIMITER: Stop reading when a delimiter is found.
BACK_TO_DELIMITER: Go back to the delimiter and treat the value as unquoted.
STOP_AT_CLOSING_QUOTE: Stop reading when a closing quotation mark is found.
SKIP_VALUE: Skip the value.
RAISE_ERROR: Report an error.
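
For illustration, a minimal external CSV table sketch using some of the options above is shown below. The table name and COS path are placeholders.

-- Hypothetical table and path; sep, header, and encoding are OPTIONS keys described above.
CREATE TABLE IF NOT EXISTS demo_db.csv_events (
id bigint,
name string,
ts timestamp
)
USING CSV
OPTIONS ('sep'=',', 'header'='true', 'encoding'='UTF-8')
LOCATION 'cosn://examplebucket-1250000000/warehouse/csv_events/';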

USING ORC
Configurations supported by ORC data tables are as follows:
compression or orc.compress (default: snappy): Compression algorithm. Values such as snappy, zlib, lzo, lz4, and zstd are supported. If the value is inconsistent with that of spark.sql.orc.compression.codec, the value specified in the CREATE TABLE statement prevails.
mergeSchema (default: false): Indicate whether to merge schemas. If the value is inconsistent with that of spark.sql.orc.mergeSchema, the value specified in the CREATE TABLE statement prevails.
If you use the Hive reader and writer (by setting spark.sql.hive.convertMetastoreOrc=false) for read and write, OPTIONS can also accept native ORC configurations. For details, see LanguageManual ORC.
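
Similarly, a minimal ORC table sketch with a compression option is shown below. Names and the path are placeholders.

-- Hypothetical table and path; compression is the OPTIONS key described above.
CREATE TABLE IF NOT EXISTS demo_db.orc_events (
id bigint,
name string
)
USING ORC
OPTIONS ('compression'='zlib')
LOCATION 'cosn://examplebucket-1250000000/warehouse/orc_events/';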
USING PARQUET
Most parameters related to PARQUET data tables can be configured through SparkConf (recommended). Configurations supported by OPTIONS are as follows:
compression or parquet.compression (default: snappy): Compression algorithm. Snappy is used by default. If the value is inconsistent with that of spark.sql.parquet.compression.codec, the value specified in the CREATE TABLE statement prevails.
mergeSchema (default: false): Indicate whether to merge schemas. If the value is inconsistent with that of spark.sql.parquet.mergeSchema, the value specified in the CREATE TABLE statement prevails.
datetimeRebaseMode (default: EXCEPTION): Policy for rebasing dates when PARQUET files are written. Dates are converted based on the Gregorian calendar in LEGACY mode and are not converted in CORRECTED mode. In EXCEPTION mode, an error is reported when dates are ambiguous between the two calendars. If the value is inconsistent with that of spark.sql.parquet.datetimeRebaseModeInRead, the value specified in the CREATE TABLE statement prevails.
int96RebaseMode (default: EXCEPTION): Policy for rebasing INT96 timestamps when PARQUET files are read. Time values are converted based on the Gregorian calendar in LEGACY mode and are not converted in CORRECTED mode. In EXCEPTION mode, an error is reported when time values are ambiguous between the two calendars. If the value is inconsistent with that of spark.sql.parquet.int96RebaseModeInRead, the value specified in the CREATE TABLE statement prevails.
If you use the Hive reader and writer (by setting spark.sql.hive.convertMetastoreParquet to false) for read and write, OPTIONS can also accept native PARQUET configurations. Reference: Hadoop integration.
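
Likewise, a minimal PARQUET table sketch is shown below. Names and the path are placeholders.

-- Hypothetical table and path; compression and mergeSchema are OPTIONS keys described above.
CREATE TABLE IF NOT EXISTS demo_db.parquet_events (
id bigint,
name string
)
USING PARQUET
OPTIONS ('compression'='gzip', 'mergeSchema'='false')
LOCATION 'cosn://examplebucket-1250000000/warehouse/parquet_events/';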

Example

CREATE TABLE dempts(
id bigint COMMENT 'id number',
num int,
eno float,
dno double,
cno decimal(9,3),
flag boolean,
data string,
ts_year timestamp,
date_month date,
bno binary,
point struct<x: double, y: double>,
points array<struct<x: double, y: double>>,
pointmaps map<struct<x: int>, struct<a: int>>
)
USING iceberg
COMMENT 'table documentation'
PARTITIONED BY (bucket(16,id), years(ts_year), months(date_month), identity(bno), bucket(3,num), truncate(10,data))
LOCATION '/warehouse/db_001/dempts'
TBLPROPERTIES ('write.format.default'='orc');

Notes

Spark's USING and Hive's STORED AS behave differently when used as keywords in CREATE TABLE statements, so the file format and read method of the created table may not meet expectations. Note:
USING data_source: Used in Spark statements. This keyword specifies the data source used as the input format for table creation and directly determines the file format and read method of the files under the table's location. The value can be CSV, TXT, Iceberg, Parquet, Orc, etc.
STORED AS file_format: Used in Hive statements to specify the file format for table storage. The value can be TXT, Parquet, Orc, etc. This keyword is not recommended because the native reader/writer of Spark may not support the specified file format, such as CSV.
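
For contrast, a minimal sketch of the two forms is shown below. The table and column names are hypothetical, and the STORED AS form is included only to illustrate the Hive-style syntax that is not recommended here.

-- Spark form (recommended): USING determines both the file format and the read method.
CREATE TABLE demo_db.events_spark (id bigint, msg string)
USING PARQUET;

-- Hive form (not recommended): STORED AS only declares the storage file format.
CREATE TABLE demo_db.events_hive (id bigint, msg string)
STORED AS PARQUET;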

Statement for Native Iceberg Table

Caution
This statement is supported for creating native tables only.

Statement

CREATE TABLE [ IF NOT EXISTS ] table_identifier
( col_name[:] col_type [ COMMENT col_comment ], ... )
[ COMMENT table_comment ]
[ PARTITIONED BY ( col_name1, transform(col_name2), ... ) ]

Parameter

table_identifier: Table name in the three-part format. Example: catalog.db.name.

Schemas and Data Types

col_type
: primitive_type
| nested_type

primitive_type
: boolean
| int/integer
| long/bigint
| float
| double
| decimal(p,s), p=maximum number of digits, s=maximum number of decimal places, s<=p<=38
| date
| timestamp, which is a timestamp with time zone. The time type and timestamp without time zone are not supported.
| string, which corresponds to the UUID type of Iceberg tables.
| binary, which corresponds to the FIXED type of Iceberg tables.

nested_type
: struct
| list
| map

Partition Transforms

transform
: identity, which supports any type. DLC does not support this transform.
| bucket[N], partition by hashed value mod N buckets, supported col_type: int, long, decimal, date, timestamp, string, binary
| truncate[L], partition by value truncated to L, supported col_type: int, long, decimal, string
| years, partition by year, supported col_type: date, timestamp
| months, partition by month, supported col_type: date, timestamp
| days/date, partition by date, supported col_type: date, timestamp
| hours/date_hour, partition by hour, supported col_type: timestamp

Example

CREATE TABLE dempts(
id bigint COMMENT 'id number',
num int,
eno float,
dno double,
cno decimal(9,3),
flag boolean,
data string,
ts_year timestamp,
date_month date,
bno binary,
point struct<x: double, y: double>,
points array<struct<x: double, y: double>>,
pointmaps map<struct<x: int>, struct<a: int>>
)
COMMENT 'table documentation'
PARTITIONED BY (bucket(16,id), years(ts_year), months(date_month), identity(bno), bucket(3,num), truncate(10,data));
