Parameters | Description |
--- | --- |
Data Source | Select available HDFS data sources in the current project. |
File Path | File system path information. The path supports `*` as a wildcard; when a wildcard is specified, multiple files are traversed. For example, `/*` reads all files under the `/` directory, and `/bazhen/*` reads all files under the `/bazhen/` directory. HDFS currently supports only `*` and `?` as file wildcards, which behave like typical Linux command-line file wildcards (see the sketch after this table). |
File Type | HDFS supports four file types: txt, orc, parquet, and csv. txt is the TextFile format; orc is the ORCFile format; parquet is the standard Parquet format; csv is the standard HDFS file format (a logical two-dimensional table). |
Compression Format | When the file type is csv, the supported compression formats are none, deflate, gzip, bzip2, lz4, and snappy. Because snappy does not have a unified stream format, Data Integration currently supports only the most widely used hadoop-snappy (the snappy stream format on Hadoop) and framing-snappy (the snappy stream format recommended by Google). This field is not required for the orc file type. |
Field Separator | The field delimiter used when reading. A field delimiter must be specified when reading TextFile data from HDFS; it defaults to a comma (,) if not specified. No field delimiter needs to be specified when reading ORC files from HDFS. Other available delimiters: '\t', '\u0001', '\|', 'space', ';', ','. If you want each row to be treated as a single column at the destination, use a delimiter that does not appear in the row content, such as the invisible character \u0001. |
Encoding | Encoding of the files being read. Supports UTF-8 and GBK. |
Null Value Conversion | During reading, converts the specified strings to null. Supports dropdown selection or manual input. Dropdown options: empty string, space, \n, \0, null. |
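
A minimal Python sketch of how the reader parameters above (file path wildcard, field separator, encoding, and null value conversion) might combine when reading txt (TextFile) data. It simulates the behaviour locally with `glob` and plain file I/O; a real job runs against an HDFS client, and the function and parameter names here are illustrative assumptions rather than the product's configuration keys.

```python
import glob

def read_text_files(file_path, field_separator=",", encoding="utf-8",
                    null_values=("", "null")):
    """Traverse every file matched by the wildcard and yield parsed rows."""
    for path in glob.glob(file_path):                 # '*' / '?' wildcards
        with open(path, encoding=encoding) as f:      # UTF-8 or GBK
            for line in f:
                fields = line.rstrip("\n").split(field_separator)
                # Null Value Conversion: map the configured strings to None.
                yield [None if v in null_values else v for v in fields]

# Read every file under /bazhen/, with fields separated by the invisible \u0001.
for row in read_text_files("/bazhen/*", field_separator="\u0001"):
    print(row)
```
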
Parameters | Description |
--- | --- |
Data Destination | Select available HDFS data sources in the current project. |
Synchronization Method | HDFS supports two synchronization methods. Data Synchronization: parses structured data content and maps and synchronizes it according to the field mapping. File Transfer: transfers the entire file without parsing its content; suitable for synchronizing unstructured data. |
File Path | Path information of the file system. The path supports `*` as a wildcard; when a wildcard is specified, multiple files are traversed. |
Write Mode | HDFS supports three write modes. append: performs no processing before writing and writes directly with the given file name, ensuring there is no file name conflict. nonConflict: reports an error if the file name already exists. overwrite: before writing, cleans up all files whose names start with the given file name; for example, "fileName": "abc" cleans up all files in the corresponding directory that start with abc (see the sketch after this table). |
File Type | HDFS supports four file types: txt, orc, parquet, and csv. txt is the TextFile format; orc is the ORCFile format; parquet is the standard Parquet format; csv is the standard HDFS file format (a logical two-dimensional table). |
Compression Format | When the file type is csv, the supported compression formats are none, deflate, gzip, bzip2, lz4, and snappy. Because snappy does not have a unified stream format, Data Integration currently supports only the most widely used hadoop-snappy (the snappy stream format on Hadoop) and framing-snappy (the snappy stream format recommended by Google). This field is not required for the orc file type. |
Field Separator | The field separator used when writing to HDFS must match the separator used in the HDFS table; otherwise, the data cannot be queried in the HDFS table. Options: '\t', '\u0001', '\|', ' ', ',', ';', '\u005E\u0001\u005E'. |
Empty Character String Processing | No action taken: empty strings are not processed during writing. Processed as null: empty strings are written as null. |
Advanced Settings (Optional) | You can configure parameters according to business needs. |
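
As referenced in the Write Mode row, the sketch below illustrates the three write modes using the local filesystem and the Python standard library. It is a simplified illustration under the assumption that the write mode is applied as a pre-write conflict check or cleanup; a real writer goes through an HDFS client, and `prepare_target` and its parameters are hypothetical names for illustration only.

```python
import glob
import os

def prepare_target(directory, file_name, write_mode):
    """Apply the write-mode rule before any data is written."""
    target = os.path.join(directory, file_name)
    if write_mode == "append":
        # No pre-processing; the caller must ensure the file name is unique.
        return target
    if write_mode == "nonConflict":
        # Report an error if a file with this name (prefix) already exists.
        if glob.glob(target + "*"):
            raise FileExistsError(f"file name conflict: {file_name}")
        return target
    if write_mode == "overwrite":
        # Clean up every existing file whose name starts with file_name,
        # e.g. "abc" removes abc, abc_1, abc_2, ...
        for old in glob.glob(target + "*"):
            os.remove(old)
        return target
    raise ValueError(f"unknown write mode: {write_mode}")

os.makedirs("/tmp/hdfs_out", exist_ok=True)
path = prepare_target("/tmp/hdfs_out", "abc", "overwrite")
with open(path, "w", encoding="utf-8") as f:
    f.write("1\u0001alice\n2\u0001bob\n")   # fields joined by the separator
```
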
HDFS (Hive table) data types | Internal Types |
--- | --- |
TINYINT, SMALLINT, INT, BIGINT | Long |
FLOAT, DOUBLE | Double |
STRING, CHAR, VARCHAR, STRUCT, MAP, ARRAY, UNION, BINARY | String |
BOOLEAN | Boolean |
DATE, TIMESTAMP | Date |
Internal Types | HDFS (Hive table) data types |
--- | --- |
Long | TINYINT, SMALLINT, INT, BIGINT |
Double | FLOAT, DOUBLE |
String | STRING, CHAR, VARCHAR, STRUCT, MAP, ARRAY, UNION, BINARY |
Boolean | BOOLEAN |
Date | DATE, TIMESTAMP |
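
The two conversion tables above can be restated as lookup tables; the Python sketch below does so, which may be handy when validating column types in a custom pre-check script. The dictionary names are illustrative and not part of the product.

```python
# HDFS (Hive) type -> internal type, per the first table above.
HIVE_TO_INTERNAL = {
    **dict.fromkeys(["TINYINT", "SMALLINT", "INT", "BIGINT"], "Long"),
    **dict.fromkeys(["FLOAT", "DOUBLE"], "Double"),
    **dict.fromkeys(["STRING", "CHAR", "VARCHAR", "STRUCT", "MAP",
                     "ARRAY", "UNION", "BINARY"], "String"),
    "BOOLEAN": "Boolean",
    **dict.fromkeys(["DATE", "TIMESTAMP"], "Date"),
}

# Internal type -> candidate HDFS (Hive) types, per the second table above.
INTERNAL_TO_HIVE = {
    "Long": ["TINYINT", "SMALLINT", "INT", "BIGINT"],
    "Double": ["FLOAT", "DOUBLE"],
    "String": ["STRING", "CHAR", "VARCHAR", "STRUCT", "MAP",
               "ARRAY", "UNION", "BINARY"],
    "Boolean": ["BOOLEAN"],
    "Date": ["DATE", "TIMESTAMP"],
}

print(HIVE_TO_INTERNAL["VARCHAR"])   # -> String
```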