HDFS Data Source
Last updated: 2024-11-01 17:50:37

Supported Editions

HDFS 2.x and 3.x are supported.

Use Limits

Offline Reading

HDFS Reader supports the following features:
Supports the TextFile, ORCFile, RCFile, SequenceFile, CSV, and Parquet file formats; the file contents must represent a logically meaningful two-dimensional table.
Supports reading various data types (all represented as String), column pruning, and constant columns.
Supports recursive directory reading and the * and ? wildcards (see the sketch at the end of this section).
Supports ORCFile data compression. Currently supports SNAPPY and ZLIB compression methods.
Supports SequenceFile data compression. Currently supports LZO compression method.
Multiple files can be read concurrently.
The CSV type supports the gzip, bz2, zip, lzo, lzo_deflate, and snappy compression formats.
Currently, the Hive version in the plugin is 2.3.7 and the Hadoop version is 3.2.3.
Note:
HDFS Reader does not currently support multi-threaded concurrent reading of a single file, as that would require an algorithm for splitting a single file internally.
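The wildcard behavior is Linux-style globbing rather than full regular expressions. Below is a minimal sketch of how * and ? resolve against candidate HDFS paths, using Python's standard fnmatch module; the paths are made-up examples, and the plugin's own traversal logic may differ in detail.

```python
from fnmatch import fnmatch

# Hypothetical listing of files under an HDFS directory.
paths = [
    "/warehouse/sales/part-00000.orc",
    "/warehouse/sales/part-00001.orc",
    "/warehouse/sales/_SUCCESS",
]

# '*' matches any run of characters, '?' matches exactly one character.
for pattern in ["/warehouse/sales/*.orc", "/warehouse/sales/part-0000?.orc"]:
    matched = [p for p in paths if fnmatch(p, pattern)]
    print(pattern, "->", matched)
```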

Offline Writing

When using HDFS Writer, please note the following:
1. Currently, HDFS Writer only supports the TextFile, ORCFile, and Parquet file formats, and the file contents must represent a logically meaningful two-dimensional table.
2. Since HDFS is a file system with no schema concept, writing only a subset of columns is not supported; every row must carry every column, as in the example below.
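As a concrete illustration of the two points above, the sketch below builds TextFile-style rows in which every column is present in every row; the values and the \u0001 delimiter are made-up examples, not output from HDFS Writer itself.

```python
import csv
import io

# Every row carries every column: HDFS files have no schema,
# so a writer cannot fill in columns that are missing from a row.
rows = [
    ["1001", "Alice", "2024-01-01"],
    ["1002", "Bob",   "2024-01-02"],
]

buf = io.StringIO()
writer = csv.writer(buf, delimiter="\u0001")  # must match the field separator of the target table
writer.writerows(rows)
print(buf.getvalue())
```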

HDFS Offline Single Table Read Node Configuration

Parameters
Description
Data Source
Select available HDFS data sources in the current project.
File Path
File system path information. The path may use '*' as a wildcard; when a wildcard is specified, multiple files are traversed. For example, specifying / reads all files under the / directory, and specifying /bazhen/ reads all files under the bazhen directory. HDFS currently supports only * and ? as file wildcards, with semantics similar to ordinary Linux command-line file wildcards.
File Type
HDFS supports four file types: txt, orc, parquet, and csv.
txt: the TextFile format.
orc: the ORCFile format.
parquet: the standard Parquet format.
csv: an ordinary HDFS file format (a logical two-dimensional table).
Compression Format
When fileType (file type) is csv, the supported file compression methods are none, deflate, gzip, bzip2, lz4, and snappy.
Because snappy does not yet have a unified stream format, Data Integration currently supports only the two most widely used variants: hadoop-snappy (the Snappy stream format used on Hadoop) and framing-snappy (the Snappy stream format recommended by Google).
This field does not need to be filled in for the orc file type.
Field Separator
The field delimiter used when reading. A field delimiter must be specified when reading TextFile data; it defaults to a comma (,) if not specified. No field delimiter needs to be specified when reading ORC files.
Other available delimiters: '\t', '\u0001', '|', 'space', ';', ','.
To have each entire row treated as a single column at the destination, use a delimiter that does not appear in the row content, such as the invisible character \u0001.
Encoding
The encoding used to read files. UTF-8 and GBK are supported.
Null Value Conversion
During reading, the specified strings are converted to null.
Supports dropdown selection or manual input. Dropdown options include: empty string, space, \n, \0, and null.
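Putting the parameters above together, a minimal sketch of how a reader configuration could be expressed follows. The key names and values here are purely illustrative assumptions; the actual node configuration is generated by the WeData console and its schema may differ.

```python
# Hypothetical reader-side settings mirroring the parameter table above.
hdfs_reader_conf = {
    "datasource": "my_hdfs_source",       # an HDFS data source defined in the current project
    "path": "/warehouse/sales/2024-*/*",  # '*' and '?' wildcards traverse multiple files
    "fileType": "orc",                    # one of: txt, orc, parquet, csv
    "compress": None,                     # only meaningful for csv, e.g. "gzip" or "snappy"
    "fieldDelimiter": ",",                # required for txt; not needed for orc
    "encoding": "UTF-8",                  # UTF-8 or GBK
    "nullFormat": "null",                 # strings equal to this value are read as null
}

print(hdfs_reader_conf["path"])
```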

HDFS Offline Single Table Write Node Configuration

Parameters
Description
Data Destination
Select available HDFS data sources in the current project.
Synchronization Method
HDFS supports two synchronization methods:
Data Synchronization: Parses structured data content and maps and synchronizes it according to the configured field mapping.
File Transfer: Transfers the entire file without content parsing. Applicable to unstructured data synchronization.
File Path
Path information of the file system. The path may use '*' as a wildcard; when a wildcard is specified, multiple files are traversed.
Write Mode
HDFS supports three write modes (illustrated in the sketch after this table):
append: Do not perform any processing before writing; files are written directly under the configured file name, and file name conflicts are avoided.
nonConflict: Report an error if a file with the same name already exists.
overwrite: Before writing, clean up all files whose names start with the configured file name. For example, "fileName": "abc" cleans up all files in the target directory that start with abc.
File Type
HDFS supports four file types: txt, orc, parquet, and csv.
txt: the TextFile format.
orc: the ORCFile format.
parquet: the standard Parquet format.
csv: an ordinary HDFS file format (a logical two-dimensional table).
Compression Format
When fileType (file type) is csv, the supported file compression methods are none, deflate, gzip, bzip2, lz4, and snappy.
Because snappy does not yet have a unified stream format, Data Integration currently supports only the two most widely used variants: hadoop-snappy (the Snappy stream format used on Hadoop) and framing-snappy (the Snappy stream format recommended by Google).
This field does not need to be filled in for the orc file type.
Field Separator
When writing to HDFS, the field separator must match the one used by the HDFS table; otherwise, data cannot be queried from the table. Options: '\t', '\u0001', '|', ' ', ',', ';', '\u005E\u0001\u005E'.
Empty Character String Processing
No action taken: Empty strings are left unchanged during writing.
Processed as null: Empty strings are written as null.
Advanced Settings (Optional)
You can configure parameters according to business needs.
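The three write modes differ only in how existing files under the target path are handled before writing. A minimal sketch of that decision logic is shown below; it runs against a local directory purely for illustration, and the function name and details are assumptions rather than the plugin's actual implementation.

```python
import os

def prepare_target(directory: str, file_name: str, write_mode: str) -> None:
    """Illustrative pre-write handling for the three write modes described above."""
    existing = os.listdir(directory)
    if write_mode == "append":
        # No pre-processing: files are written directly under the configured name,
        # and the framework keeps the generated names from conflicting.
        return
    if write_mode == "nonConflict":
        if file_name in existing:
            raise RuntimeError(f"file name conflict: {file_name} already exists in {directory}")
        return
    if write_mode == "overwrite":
        # Clean up every file whose name starts with the configured file name prefix.
        for name in existing:
            if name.startswith(file_name):
                os.remove(os.path.join(directory, name))
        return
    raise ValueError(f"unknown write mode: {write_mode}")

# Example: prepare_target("/tmp/export", "abc", "overwrite") removes /tmp/export/abc*.
```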

Data Type Conversion Support

Read

Supported data types and conversion relationships for HDFS reads (data types from the HDFS data source are mapped to the data types used by the data processing engine):
HDFS (Hive table) data types -> Internal types
TINYINT, SMALLINT, INT, BIGINT -> Long
FLOAT, DOUBLE -> Double
STRING, CHAR, VARCHAR, STRUCT, MAP, ARRAY, UNION, BINARY -> String
BOOLEAN -> Boolean
DATE, TIMESTAMP -> Date
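The read-side mapping above can be pictured as a simple lookup from the Hive column type to the internal type. The sketch below is only an illustration; the helper name and the fallback for unlisted types are assumptions, not part of the plugin.

```python
# Hypothetical lookup mirroring the read-side mapping table above.
HIVE_TO_INTERNAL = {
    **dict.fromkeys(["TINYINT", "SMALLINT", "INT", "BIGINT"], "Long"),
    **dict.fromkeys(["FLOAT", "DOUBLE"], "Double"),
    **dict.fromkeys(
        ["STRING", "CHAR", "VARCHAR", "STRUCT", "MAP", "ARRAY", "UNION", "BINARY"],
        "String",
    ),
    "BOOLEAN": "Boolean",
    **dict.fromkeys(["DATE", "TIMESTAMP"], "Date"),
}

def internal_type(hive_type: str) -> str:
    """Map a Hive column type such as 'varchar(64)' to its internal type."""
    base = hive_type.split("(")[0].strip().upper()  # drop length/precision suffixes
    return HIVE_TO_INTERNAL.get(base, "String")     # assumption: unlisted types fall back to String

print(internal_type("varchar(64)"))  # String
print(internal_type("bigint"))       # Long
```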

Write

Supported data types and conversion relationships for HDFS writes:
Internal types -> HDFS (Hive table) data types
Long -> TINYINT, SMALLINT, INT, BIGINT
Double -> FLOAT, DOUBLE
String -> STRING, CHAR, VARCHAR, STRUCT, MAP, ARRAY, UNION, BINARY
Boolean -> BOOLEAN
Date -> DATE, TIMESTAMP

