Tencent Cloud WeData
Hive Data Source
Last updated: 2024-11-01 17:50:37

Supported Editions

Supports Hive 1.x, 2.x, and 3.x versions.

Use Limits

1. Only TextFile, ORCFile, and ParquetFile formats are currently supported for reading. ORC or Parquet is recommended.
2. Hive reads data over a JDBC connection to HiveServer2. Ensure the data source is configured with the correct HiveServer2 JDBC URL.
3. Hive writes require a connection to the Hive Metastore service. Configure the Metastore Thrift IP and port correctly in the data source. For a custom Hive data source, you must also upload hive-site.xml, core-site.xml, and hdfs-site.xml.
4. When adding columns to an existing partitioned Hive table, include the cascade keyword; otherwise the partition metadata is not updated and queries may return incorrect results.
5. During offline synchronization to a Hive cluster with Data Integration, temporary files are generated on the server side and automatically deleted when the task completes. Make sure the file count in the target HDFS directory stays within its limit; if the limit is reached, the HDFS filesystem becomes unavailable.
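As a sketch of points 2 and 3 above, reads go through a HiveServer2 JDBC URL of the form `jdbc:hive2://hiveserver2-host:10000/default`, while writes need the Metastore Thrift address. A minimal hive-site.xml fragment might look like this (the host and port are placeholders, not real endpoints):

```xml
<!-- hive-site.xml fragment: Metastore Thrift address used for Hive writes.
     Replace metastore-host:9083 with your cluster's own address. -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>
```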

Hive Offline Single Table Read Node Configuration




| Parameter | Description |
|---|---|
| Data Source | The available Hive data source to read from. |
| Database | Supports selecting or manually entering the database name to read. By default, the database bound to the data source is used; other databases must be entered manually. If the data source network is not connected and the database list cannot be fetched, you can enter the database name manually; synchronization can still run as long as the Data Integration network is connected. |
| Table | Supports selecting or manually entering the table name to read. |
| Read Method | Only the JDBC read method is supported. |
| Filter Conditions (Optional) | When reading over Hive JDBC, WHERE conditions can be used to filter data. Note that the Hive engine may generate MapReduce jobs for such queries, which lowers efficiency. |
| Advanced Settings (Optional) | Configure additional parameters as business needs require. |
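As an illustration of the optional filter, a condition on a partition column keeps the scan small; the table and column names below are hypothetical. The filter is applied as a WHERE clause in the JDBC query, so it is equivalent to:

```sql
-- Hypothetical example: restrict the read to a single day's partition.
-- Hive may still compile this filter into a MapReduce job.
SELECT * FROM ods_orders WHERE ds = '2024-11-01';
```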

Hive Offline Single Table Write Node Configuration




| Parameter | Description |
|---|---|
| Data Destination | The Hive data source to write to. |
| Database | Supports selecting or manually entering the database name to write to. By default, the database bound to the data source is used; other databases must be entered manually. If the data source network is not connected and the database list cannot be fetched, you can enter the database name manually; synchronization can still run as long as the Data Integration network is connected. |
| Table | Supports selecting or manually entering the table name to write to. If the data source network is not connected and the table list cannot be fetched, you can enter the table name manually; synchronization can still run as long as the Data Integration network is connected. |
| Write Mode | Hive writing supports three modes. Append: retain existing data and append new rows. nonConflict: report an error when data conflicts occur. Overwrite: delete the existing data, then write. writeMode is a high-risk parameter; double-check the output directory and write mode to avoid accidental data deletion. The data-loading behavior also depends on hiveConfig, so verify your configuration. |
| Batch Submission Size | The number of records submitted per batch. A larger value reduces network round trips between the data synchronization system and Hive and improves overall throughput, but setting it too large may cause OOM exceptions during synchronization. |
| Empty Character String Processing | No action taken: write empty strings as-is. Processed as null: write empty strings as null. |
| Pre-Executed SQL (Optional) | A SQL statement executed before the synchronization task, written in the syntax of the data source type, such as clearing old data from the table first (truncate table tablename). |
| Post-Executed SQL (Optional) | A SQL statement executed after the synchronization task, written in the syntax of the data source type, such as adding a timestamp column (alter table tablename add colname timestamp DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP). |
| Advanced Settings (Optional) | Configure additional parameters as business needs require. |
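For instance, pre- and post-executed SQL are often paired like this; the table name and property are hypothetical, and both statements must use syntax the target Hive engine accepts:

```sql
-- Pre-executed SQL: clear the target table before the load
-- (TRUNCATE works on non-transactional managed Hive tables).
TRUNCATE TABLE ods_orders;

-- Post-executed SQL: record the load in a table property (illustrative only).
ALTER TABLE ods_orders SET TBLPROPERTIES ('last_sync' = '2024-11-01');
```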

Data type conversion support

Read

Supported data types and conversion relationships for Hive reading are as follows (the Hive data source types are first mapped to the internal types of the data processing engine):

| Hive Data Types | Internal Types |
|---|---|
| TINYINT, SMALLINT, INT, BIGINT | Long |
| FLOAT, DOUBLE | Double |
| STRING, CHAR, VARCHAR, STRUCT, MAP, ARRAY, UNION, BINARY | String |
| BOOLEAN | Boolean |
| DATE, TIMESTAMP | Date |

Write

Supported data types and conversion relationships for Hive writing are as follows:

| Internal Types | Hive Data Types |
|---|---|
| Long | TINYINT, SMALLINT, INT, BIGINT |
| Double | FLOAT, DOUBLE |
| String | STRING, CHAR, VARCHAR, STRUCT, MAP, ARRAY, UNION, BINARY |
| Boolean | BOOLEAN |
| Date | DATE, TIMESTAMP |
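As a sketch, a Hive table whose columns exercise the mappings above (the table and column names are hypothetical) could look like:

```sql
-- Hypothetical table: each column corresponds to one internal type above.
CREATE TABLE demo_types (
  id        BIGINT,     -- maps to Long
  amount    DOUBLE,     -- maps to Double
  name      STRING,     -- maps to String
  is_active BOOLEAN,    -- maps to Boolean
  created   TIMESTAMP   -- maps to Date
)
STORED AS ORC;          -- ORC is one of the recommended storage formats
```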

FAQs

1. Hive on CHDFS table write error: Permission denied: No access rules matched
Cause:
The Data Integration resource group is a fully managed resource group, and its outbound network segment is not within the customer's VPC intranet, so the security group must be opened for the Data Integration resource group and its CHDFS permission group authorized at the CHDFS mount point.
Solutions:
Open the API 3.0 Explorer page and fill in the parameters: MountPointId is the customer's CHDFS mount point, and AccessGroupIds.N is the CHDFS permission ID of the Data Integration resource group (note that permission IDs vary by region).
CHDFS permission IDs of the Data Integration resource group by region:

| Region | CHDFS Permission ID |
|---|---|
| Beijing | ag-wgbku4no |
| Guangzhou | ag-x1bhppnr |
| Shanghai | ag-xnjfr9d3 |
| Singapore | ag-phxtv0ah |
| USA Silicon Valley | ag-tgwl8bca |
| Virginia | ag-esxpwxjn |

2. Some fields are returned as NULL during synchronization

Cause:
After the partitioned Hive table was created, columns were later added with ALTER TABLE but without the cascade keyword, so the partition metadata was not updated and queries on existing partitions return NULL for the new fields.
Solutions:
Re-add the columns with the cascade keyword:

```sql
alter table table_name add columns (time_interval TIMESTAMP) cascade;
```

3. Hive write error: [fs.defaultFS] is a required parameter

Cause:
The configuration files were not uploaded when the data source was set up; fs.defaultFS is defined in core-site.xml.
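For reference, fs.defaultFS is the default filesystem entry in core-site.xml; a minimal fragment looks like this (the namenode address is a placeholder):

```xml
<!-- core-site.xml fragment: replace namenode-host:8020 with your cluster's
     actual NameNode address before uploading. -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode-host:8020</value>
</property>
```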
Solutions:
Edit the Hive data source and upload the core-site.xml configuration file (together with hive-site.xml and hdfs-site.xml) exported from the cluster, then save and retry the task.

