HDFS Data Source

Last updated: 2024-11-01 17:50:37

    Supported Editions

    Supports HDFS 2.x and 3.x.

    Use Limits

    Offline Reading

    HDFS Reader supports the following features:
    Supports TextFile, ORCFile, RCFile, SequenceFile, CSV, and Parquet file formats, provided that the file content represents a logically meaningful two-dimensional table.
    Supports reading various data types (represented as String), column pruning, and constant columns.
    Supports recursive directory reading and the file wildcards * and ? (see the sketch at the end of this subsection).
    Supports ORCFile data compression. Currently supports SNAPPY and ZLIB compression methods.
    Supports SequenceFile data compression. Currently supports LZO compression method.
    Multiple files can be read concurrently.
    CSV type supports compression formats of gzip, bz2, zip, lzo, lzo_deflate, and snappy.
    Currently, the Hive version in the plugin is 2.3.7 and the Hadoop version is 3.2.3.
    Note:
    HDFS Reader does not currently support multi-threaded concurrent reading of a single file, as this would require an internal splitting algorithm for single files.
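    As a rough illustration of the wildcard and multi-file behavior described above, the following Python sketch uses pyarrow (not part of this product; the host, port, and paths are placeholder assumptions) to enumerate and read matching Parquet files from HDFS:

        # Minimal sketch: enumerate HDFS files matching * / ? wildcards and read them.
        # "namenode", 8020 and the paths below are hypothetical placeholders.
        import fnmatch
        from pyarrow import fs
        import pyarrow.parquet as pq

        hdfs = fs.HadoopFileSystem(host="namenode", port=8020)

        pattern = "/warehouse/logs/2024-*/part-?????.parquet"
        selector = fs.FileSelector("/warehouse/logs", recursive=True)
        matched = [
            info.path
            for info in hdfs.get_file_info(selector)
            if info.type == fs.FileType.File and fnmatch.fnmatch(info.path, pattern)
        ]

        # Each matched file is read as a whole; concurrency happens across files,
        # not within a single file, which matches the note above.
        tables = [pq.read_table(path, filesystem=hdfs) for path in matched]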

    Offline Writing

    When using HDFS Writer, please note the following:
    1. Currently, HDFS Writer only supports the TextFile, ORCFile, and Parquet file formats, and the file content must represent a logically meaningful two-dimensional table.
    2. Since HDFS is a file system and has no schema concept, partial column writing is not supported (see the sketch below).
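    As a rough illustration of the second point, the sketch below (using pyarrow; the host, port, and path are placeholder assumptions) writes a complete two-dimensional table to HDFS as a Parquet file; every column is written, since partial-column writes are not supported:

        # Minimal sketch: write a complete table (all columns) to HDFS as Parquet.
        # "namenode", 8020 and the target path are hypothetical placeholders.
        import pyarrow as pa
        import pyarrow.parquet as pq
        from pyarrow import fs

        hdfs = fs.HadoopFileSystem(host="namenode", port=8020)

        table = pa.table({
            "id":   pa.array([1, 2, 3], type=pa.int64()),
            "name": pa.array(["a", "b", "c"], type=pa.string()),
        })

        # The file is written as a whole; HDFS itself has no schema to merge into.
        pq.write_table(table, "/warehouse/demo/part-00000.parquet",
                       filesystem=hdfs, compression="snappy")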

    HDFS Offline Single Table Read Node Configuration

    
    
    
    Parameters and descriptions:

    Data Source
    Select an available HDFS data source in the current project.

    File Path
    File system path information. The path supports '*' as a wildcard; after a wildcard is specified, multiple files are traversed. For example, / means reading all files under the / directory, and /bazhen/ means reading all files under the bazhen directory. HDFS currently supports only * and ? as file wildcards, similar to typical Linux command-line file wildcards.

    File Type
    HDFS supports four file types: txt, orc, parquet, and csv.
    txt: the TextFile format.
    orc: the ORCFile format.
    parquet: the standard Parquet format.
    csv: a plain HDFS file format (logical two-dimensional table).

    Compression Format
    When fileType is csv, the supported compression methods are: none, deflate, gzip, bzip2, lz4, and snappy.
    Since snappy does not have a unified stream format, Data Integration currently supports only the most widely used hadoop-snappy (the snappy stream format on Hadoop) and framing-snappy (the snappy stream format recommended by Google).
    Leave this empty for ORC files.

    Field Separator
    The field delimiter used when reading. When reading TextFile data from HDFS, a field delimiter must be specified; it defaults to a comma (,) if omitted. When reading ORCFile data, no field delimiter needs to be specified.
    Other available delimiters: '\t', '\u0001', '|', 'space', ';', ','.
    If you want each row to be treated as a single column at the destination, use a delimiter that does not appear in the row content, such as the invisible character \u0001.

    Encoding
    Encoding of the files to be read. Supports UTF-8 and GBK.

    Null Value Conversion
    During reading, the specified strings are converted to null.
    Supports dropdown selection or manual input. Dropdown options include: empty string, space, \n, \0, and null.
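    Taken together, the read-node parameters roughly correspond to a configuration like the following sketch. The key names here are illustrative assumptions only; the actual node is configured through the console rather than by hand-written settings:

        # Illustrative only: a Python dict mirroring the read-node parameters above.
        hdfs_reader_conf = {
            "dataSource": "my_hdfs",      # an HDFS data source defined in the project (hypothetical name)
            "path": "/bazhen/*",          # supports * and ? wildcards
            "fileType": "csv",            # txt / orc / parquet / csv
            "compress": "gzip",           # only meaningful for csv; leave empty for orc
            "fieldDelimiter": ",",        # required for TextFile, ignored for ORCFile
            "encoding": "UTF-8",          # UTF-8 or GBK
            "nullFormat": "null",         # strings equal to this value are read as null
        }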

    HDFS Offline Single Table Write Node Configuration

    
    
    
    Parameters and descriptions:

    Data Destination
    Select an available HDFS data source in the current project.

    Synchronization Method
    HDFS supports two synchronization methods:
    Data synchronization: parses structured data content and maps and synchronizes it according to the field relationships.
    File transfer: transfers the entire file without parsing its content; applicable to unstructured data synchronization.

    File Path
    File system path information. The path supports '*' as a wildcard; after a wildcard is specified, multiple files are traversed.

    Write Mode
    HDFS supports three write modes:
    append: writes directly with the specified file name, without any pre-processing, and ensures that file names do not conflict.
    nonConflict: reports an error if a file with the same name already exists.
    overwrite: cleans up all files whose names start with the specified file name before writing. For example, "fileName": "abc" cleans up all files in the corresponding directory that start with abc.

    File Type
    HDFS supports four file types: txt, orc, parquet, and csv.
    txt: the TextFile format.
    orc: the ORCFile format.
    parquet: the standard Parquet format.
    csv: a plain HDFS file format (logical two-dimensional table).

    Compression Format
    When fileType is csv, the supported compression methods are: none, deflate, gzip, bzip2, lz4, and snappy.
    Since snappy does not have a unified stream format, Data Integration currently supports only the most widely used hadoop-snappy (the snappy stream format on Hadoop) and framing-snappy (the snappy stream format recommended by Google).
    Leave this empty for ORC files.

    Field Separator
    When writing to HDFS, the field delimiter must match the one used in the target HDFS table; otherwise, the data cannot be queried in that table. Options: '\t', '\u0001', '|', ' ', ',', ';', '\u005E\u0001\u005E'.

    Empty String Processing
    No action taken: empty strings are written as is.
    Processed as null: empty strings are written as null.

    Advanced Settings (Optional)
    Configure additional parameters as needed by your business.
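    Similarly, the write-node parameters roughly correspond to a configuration like the sketch below. Again, the key names are illustrative assumptions; the actual node is configured through the console:

        # Illustrative only: a Python dict mirroring the write-node parameters above.
        hdfs_writer_conf = {
            "dataDestination": "my_hdfs",   # an HDFS data source defined in the project (hypothetical name)
            "syncMethod": "data",           # data synchronization vs. file transfer
            "path": "/warehouse/target/",
            "writeMode": "nonConflict",     # append / nonConflict / overwrite
            "fileType": "orc",              # txt / orc / parquet / csv
            "fieldDelimiter": "\u0001",     # must match the delimiter of the target HDFS (Hive) table
            "emptyAsNull": True,            # whether empty strings are written as null
        }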

    Data Type Conversion Support

    Read

    Supported data types and conversion relationships for HDFS read (data types from the HDFS data source are mapped to the data types used by the data processing engine):
    HDFS (Hive table) data types -> Internal types
    TINYINT, SMALLINT, INT, BIGINT -> Long
    FLOAT, DOUBLE -> Double
    STRING, CHAR, VARCHAR, STRUCT, MAP, ARRAY, UNION, BINARY -> String
    BOOLEAN -> Boolean
    DATE, TIMESTAMP -> Date
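    For reference, the read-side mapping can be expressed as a small lookup; the helper below is an illustrative sketch, not part of the product:

        # Illustrative only: the read-side type mapping from the table above.
        HIVE_TO_INTERNAL = {
            **dict.fromkeys(["TINYINT", "SMALLINT", "INT", "BIGINT"], "Long"),
            **dict.fromkeys(["FLOAT", "DOUBLE"], "Double"),
            **dict.fromkeys(["STRING", "CHAR", "VARCHAR", "STRUCT", "MAP",
                             "ARRAY", "UNION", "BINARY"], "String"),
            "BOOLEAN": "Boolean",
            **dict.fromkeys(["DATE", "TIMESTAMP"], "Date"),
        }

        def internal_type(hive_type: str) -> str:
            # e.g. internal_type("varchar(64)") -> "String"
            base = hive_type.split("(")[0].strip().upper()
            return HIVE_TO_INTERNAL[base]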

    Write

    Supported data types and conversion relationships for HDFS write:
    Internal types -> HDFS (Hive table) data types
    Long -> TINYINT, SMALLINT, INT, BIGINT
    Double -> FLOAT, DOUBLE
    String -> STRING, CHAR, VARCHAR, STRUCT, MAP, ARRAY, UNION, BINARY
    Boolean -> BOOLEAN
    Date -> DATE, TIMESTAMP
    
    