HDFS Data Source

Last updated: 2024-11-01 17:50:37

    Supported Editions

    Supports HDFS 2.x and 3.x.

    Use Limits

    Offline Reading

    HDFS Reader supports the following features:
    Supports TextFile, ORCFile, RCFile, SequenceFile, CSV, and Parquet file formats, provided that the file content represents a logically meaningful two-dimensional table.
    Supports reading various data types (represented as String), column pruning, and constant columns.
    Supports recursive directory reading and the file wildcards * and ? (see the sketch at the end of this subsection).
    Supports ORCFile data compression. Currently supports SNAPPY and ZLIB compression methods.
    Supports SequenceFile data compression. Currently supports LZO compression method.
    Multiple files can be read concurrently.
    CSV type supports compression formats of gzip, bz2, zip, lzo, lzo_deflate, and snappy.
    Currently, the Hive version in the plugin is 2.3.7 and the Hadoop version is 3.2.3.
    Note:
    HDFS Reader does not currently support multi-threaded concurrent reading of a single file, as this would require an internal splitting algorithm for single files.
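    As a rough illustration of the wildcard and multi-file behavior described above, the following Python sketch uses pyarrow (not part of this product; the host, port, and paths are placeholder assumptions) to enumerate and read matching Parquet files from HDFS:

        # Minimal sketch: enumerate HDFS files matching * / ? wildcards and read them.
        # "namenode", 8020 and the paths below are hypothetical placeholders.
        import fnmatch
        from pyarrow import fs
        import pyarrow.parquet as pq

        hdfs = fs.HadoopFileSystem(host="namenode", port=8020)

        pattern = "/warehouse/logs/2024-*/part-?????.parquet"
        selector = fs.FileSelector("/warehouse/logs", recursive=True)
        matched = [
            info.path
            for info in hdfs.get_file_info(selector)
            if info.type == fs.FileType.File and fnmatch.fnmatch(info.path, pattern)
        ]

        # Each matched file is read as a whole; concurrency happens across files,
        # not within a single file, which matches the note above.
        tables = [pq.read_table(path, filesystem=hdfs) for path in matched]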

    Offline Writing

    When using HDFS Writer, please note the following:
    1. Currently, HDFS Writer only supports the TextFile, ORCFile, and Parquet file formats, and the file content must represent a logically meaningful two-dimensional table.
    2. Since HDFS is a file system and has no schema concept, partial column writing is not supported (see the sketch below).
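    As a rough illustration of the second point, the sketch below (using pyarrow; the host, port, and path are placeholder assumptions) writes a complete two-dimensional table to HDFS as a Parquet file; every column is written, since partial-column writes are not supported:

        # Minimal sketch: write a complete table (all columns) to HDFS as Parquet.
        # "namenode", 8020 and the target path are hypothetical placeholders.
        import pyarrow as pa
        import pyarrow.parquet as pq
        from pyarrow import fs

        hdfs = fs.HadoopFileSystem(host="namenode", port=8020)

        table = pa.table({
            "id":   pa.array([1, 2, 3], type=pa.int64()),
            "name": pa.array(["a", "b", "c"], type=pa.string()),
        })

        # The file is written as a whole; HDFS itself has no schema to merge into.
        pq.write_table(table, "/warehouse/demo/part-00000.parquet",
                       filesystem=hdfs, compression="snappy")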

    HDFS Offline Single Table Read Node Configuration

    
    
    
    Parameters and descriptions:

    Data Source
    Select an available HDFS data source in the current project.

    File Path
    File system path information. The path supports '*' as a wildcard; after a wildcard is specified, multiple files are traversed. For example, / means reading all files under the / directory, and /bazhen/ means reading all files under the bazhen directory. HDFS currently supports only * and ? as file wildcards, similar to typical Linux command-line file wildcards.

    File Type
    HDFS supports four file types: txt, orc, parquet, and csv.
    txt: the TextFile format.
    orc: the ORCFile format.
    parquet: the standard Parquet format.
    csv: a plain HDFS file format (logical two-dimensional table).

    Compression Format
    When fileType is csv, the supported compression methods are: none, deflate, gzip, bzip2, lz4, and snappy.
    Since snappy does not have a unified stream format, Data Integration currently supports only the most widely used hadoop-snappy (the snappy stream format on Hadoop) and framing-snappy (the snappy stream format recommended by Google).
    Leave this empty for ORC files.

    Field Separator
    The field delimiter used when reading. When reading TextFile data from HDFS, a field delimiter must be specified; it defaults to a comma (,) if omitted. When reading ORCFile data, no field delimiter needs to be specified.
    Other available delimiters: '\t', '\u0001', '|', 'space', ';', ','.
    If you want each row to be treated as a single column at the destination, use a delimiter that does not appear in the row content, such as the invisible character \u0001.

    Encoding
    Encoding of the files to be read. Supports UTF-8 and GBK.

    Null Value Conversion
    During reading, the specified strings are converted to null.
    Supports dropdown selection or manual input. Dropdown options include: empty string, space, \n, \0, and null.
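    Taken together, the read-node parameters roughly correspond to a configuration like the following sketch. The key names here are illustrative assumptions only; the actual node is configured through the console rather than by hand-written settings:

        # Illustrative only: a Python dict mirroring the read-node parameters above.
        hdfs_reader_conf = {
            "dataSource": "my_hdfs",      # an HDFS data source defined in the project (hypothetical name)
            "path": "/bazhen/*",          # supports * and ? wildcards
            "fileType": "csv",            # txt / orc / parquet / csv
            "compress": "gzip",           # only meaningful for csv; leave empty for orc
            "fieldDelimiter": ",",        # required for TextFile, ignored for ORCFile
            "encoding": "UTF-8",          # UTF-8 or GBK
            "nullFormat": "null",         # strings equal to this value are read as null
        }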

    HDFS Offline Single Table Write Node Configuration

    
    
    
    Parameters and descriptions:

    Data Destination
    Select an available HDFS data source in the current project.

    Synchronization Method
    HDFS supports two synchronization methods:
    Data synchronization: parses structured data content and maps and synchronizes it according to the field relationships.
    File transfer: transfers the entire file without parsing its content; applicable to unstructured data synchronization.

    File Path
    File system path information. The path supports '*' as a wildcard; after a wildcard is specified, multiple files are traversed.

    Write Mode
    HDFS supports three write modes:
    append: writes directly with the specified file name, without any pre-processing, and ensures that file names do not conflict.
    nonConflict: reports an error if a file with the same name already exists.
    overwrite: cleans up all files whose names start with the specified file name before writing. For example, "fileName": "abc" cleans up all files in the corresponding directory that start with abc.

    File Type
    HDFS supports four file types: txt, orc, parquet, and csv.
    txt: the TextFile format.
    orc: the ORCFile format.
    parquet: the standard Parquet format.
    csv: a plain HDFS file format (logical two-dimensional table).

    Compression Format
    When fileType is csv, the supported compression methods are: none, deflate, gzip, bzip2, lz4, and snappy.
    Since snappy does not have a unified stream format, Data Integration currently supports only the most widely used hadoop-snappy (the snappy stream format on Hadoop) and framing-snappy (the snappy stream format recommended by Google).
    Leave this empty for ORC files.

    Field Separator
    When writing to HDFS, the field delimiter must match the one used in the target HDFS table; otherwise, the data cannot be queried in that table. Options: '\t', '\u0001', '|', ' ', ',', ';', '\u005E\u0001\u005E'.

    Empty String Processing
    No action taken: empty strings are written as is.
    Processed as null: empty strings are written as null.

    Advanced Settings (Optional)
    Configure additional parameters as needed by your business.
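    Similarly, the write-node parameters roughly correspond to a configuration like the sketch below. Again, the key names are illustrative assumptions; the actual node is configured through the console:

        # Illustrative only: a Python dict mirroring the write-node parameters above.
        hdfs_writer_conf = {
            "dataDestination": "my_hdfs",   # an HDFS data source defined in the project (hypothetical name)
            "syncMethod": "data",           # data synchronization vs. file transfer
            "path": "/warehouse/target/",
            "writeMode": "nonConflict",     # append / nonConflict / overwrite
            "fileType": "orc",              # txt / orc / parquet / csv
            "fieldDelimiter": "\u0001",     # must match the delimiter of the target HDFS (Hive) table
            "emptyAsNull": True,            # whether empty strings are written as null
        }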

    Data Type Conversion Support

    Read

    Supported data types and conversion relationships for HDFS read (data types from the HDFS data source are mapped to the data types used by the data processing engine):
    HDFS (Hive table) data types -> Internal types
    TINYINT, SMALLINT, INT, BIGINT -> Long
    FLOAT, DOUBLE -> Double
    STRING, CHAR, VARCHAR, STRUCT, MAP, ARRAY, UNION, BINARY -> String
    BOOLEAN -> Boolean
    DATE, TIMESTAMP -> Date
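    For reference, the read-side mapping can be expressed as a small lookup; the helper below is an illustrative sketch, not part of the product:

        # Illustrative only: the read-side type mapping from the table above.
        HIVE_TO_INTERNAL = {
            **dict.fromkeys(["TINYINT", "SMALLINT", "INT", "BIGINT"], "Long"),
            **dict.fromkeys(["FLOAT", "DOUBLE"], "Double"),
            **dict.fromkeys(["STRING", "CHAR", "VARCHAR", "STRUCT", "MAP",
                             "ARRAY", "UNION", "BINARY"], "String"),
            "BOOLEAN": "Boolean",
            **dict.fromkeys(["DATE", "TIMESTAMP"], "Date"),
        }

        def internal_type(hive_type: str) -> str:
            # e.g. internal_type("varchar(64)") -> "String"
            base = hive_type.split("(")[0].strip().upper()
            return HIVE_TO_INTERNAL[base]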

    Write

    Supported data types and conversion relationships for HDFS write:
    Internal types -> HDFS (Hive table) data types
    Long -> TINYINT, SMALLINT, INT, BIGINT
    Double -> FLOAT, DOUBLE
    String -> STRING, CHAR, VARCHAR, STRUCT, MAP, ARRAY, UNION, BINARY
    Boolean -> BOOLEAN
    Date -> DATE, TIMESTAMP
    
    