Hive External Table

Last updated: 2024-07-04 10:13:34
    Note:
    The content in this document applies only to Tencent Cloud TCHouse-D v1.1 and earlier. For later versions, it is recommended to use the Multi-Catalog feature to interface with external data catalogs.
    The Hive External Table feature enables Doris to directly access Hive tables. It eliminates the tedious work of importing data and leverages the OLAP capabilities of Doris to analyze data in Hive tables:
    1. Doris can access Hive data sources.
    2. Doris tables can be queried jointly with Hive tables and Hudi data sources for more complex analysis.
    3. Hive data sources with Kerberos authentication enabled can be accessed.
    This document introduces the usage and considerations of this feature.

    Usage

    Creating a Hive External Table in Doris

    -- Syntax
    CREATE [EXTERNAL] TABLE table_name (
    col_name col_type [NULL | NOT NULL] [COMMENT "comment"]
    ) ENGINE=HIVE
    [COMMENT "comment"]
    PROPERTIES (
    'property_name'='property_value',
    ...
    );
    
    -- Example 1: Create the hive_table in hive_db of the Hive cluster
    CREATE TABLE `t_hive` (
    `k1` int NOT NULL COMMENT "",
    `k2` char(10) NOT NULL COMMENT "",
    `k3` datetime NOT NULL COMMENT "",
    `k5` varchar(20) NOT NULL COMMENT "",
    `k6` double NOT NULL COMMENT ""
    ) ENGINE=HIVE
    COMMENT "HIVE"
    PROPERTIES (
    'hive.metastore.uris' = 'thrift://192.168.0.1:9083',
    'database' = 'hive_db',
    'table' = 'hive_table'
    );
    
    -- Example 2: Create the hive_table in hive_db of the Hive cluster, with HDFS using HA configuration
    CREATE TABLE `t_hive` (
    `k1` int NOT NULL COMMENT "",
    `k2` char(10) NOT NULL COMMENT "",
    `k3` datetime NOT NULL COMMENT "",
    `k5` varchar(20) NOT NULL COMMENT "",
    `k6` double NOT NULL COMMENT ""
    ) ENGINE=HIVE
    COMMENT "HIVE"
    PROPERTIES (
    'hive.metastore.uris' = 'thrift://192.168.0.1:9083',
    'database' = 'hive_db',
    'table' = 'hive_table',
    'dfs.nameservices'='hacluster',
    'dfs.ha.namenodes.hacluster'='n1,n2',
    'dfs.namenode.rpc-address.hacluster.n1'='192.168.0.1:8020',
    'dfs.namenode.rpc-address.hacluster.n2'='192.168.0.2:8020',
    'dfs.client.failover.proxy.provider.hacluster'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
    );
    
    -- Example 3: Create the hive_table in hive_db of the Hive cluster, with HDFS using HA configuration and Kerberos authentication enabled
    CREATE TABLE `t_hive` (
    `k1` int NOT NULL COMMENT "",
    `k2` char(10) NOT NULL COMMENT "",
    `k3` datetime NOT NULL COMMENT "",
    `k5` varchar(20) NOT NULL COMMENT "",
    `k6` double NOT NULL COMMENT ""
    ) ENGINE=HIVE
    COMMENT "HIVE"
    PROPERTIES (
    'hive.metastore.uris' = 'thrift://192.168.0.1:9083',
    'database' = 'hive_db',
    'table' = 'hive_table',
    'dfs.nameservices'='hacluster',
    'dfs.ha.namenodes.hacluster'='n1,n2',
    'dfs.namenode.rpc-address.hacluster.n1'='192.168.0.1:8020',
    'dfs.namenode.rpc-address.hacluster.n2'='192.168.0.2:8020',
    'dfs.client.failover.proxy.provider.hacluster'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider',
    'dfs.namenode.kerberos.principal'='hadoop/_HOST@REALM.COM',
    'hadoop.security.authentication'='kerberos',
    'hadoop.kerberos.principal'='doris_test@REALM.COM',
    'hadoop.kerberos.keytab'='/path/to/doris_test.keytab'
    );
    
    -- Example 4: Create the hive_table in hive_db of the Hive cluster, with Hive data stored on S3
    CREATE TABLE `t_hive` (
    `k1` int NOT NULL COMMENT "",
    `k2` char(10) NOT NULL COMMENT "",
    `k3` datetime NOT NULL COMMENT "",
    `k5` varchar(20) NOT NULL COMMENT "",
    `k6` double NOT NULL COMMENT ""
    ) ENGINE=HIVE
    COMMENT "HIVE"
    PROPERTIES (
    'hive.metastore.uris' = 'thrift://192.168.0.1:9083',
    'database' = 'hive_db',
    'table' = 'hive_table',
    'AWS_ACCESS_KEY' = 'your_access_key',
    'AWS_SECRET_KEY' = 'your_secret_key',
    'AWS_ENDPOINT' = 's3.us-east-1.amazonaws.com',
    'AWS_REGION' = 'us-east-1'
    );

    Parameter description:

    External table columns:
    Column names must correspond one-to-one with those in the Hive table.
    The column order must be consistent with that of the Hive table.
    All columns in the Hive table must be included.
    Hive partition columns do not need to be specified separately; define them like regular columns.
    ENGINE must be specified as HIVE.
    PROPERTIES attributes:
    hive.metastore.uris: Hive Metastore service address.
    database: The name of the Hive database to mount.
    table: The name of the Hive table to mount.
    dfs.nameservices: The name of the name service, consistent with hdfs-site.xml.
    dfs.ha.namenodes.[nameservice ID]: List of namenode IDs, consistent with hdfs-site.xml.
    dfs.namenode.rpc-address.[nameservice ID].[name node ID]: The rpc address of the namenode, the number of which is the same as the number of namenodes, and is consistent with hdfs-site.xml.
    dfs.client.failover.proxy.provider.[nameservice ID]: The Java class used by the HDFS client to connect to the active namenode, usually org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider.
    To access Hive data sources for which Kerberos is enabled, you need to add the following PROPERTIES for the Hive external table:
    hadoop.security.authentication: Set the authentication method to kerberos; the default is simple.
    dfs.namenode.kerberos.principal: The Kerberos principal of the HDFS namenode service.
    hadoop.kerberos.principal: The Kerberos principal that Doris uses when connecting to HDFS.
    hadoop.kerberos.keytab: The local path to the keytab file.
    yarn.resourcemanager.principal: The Kerberos principal of the ResourceManager used when Doris connects to the Hadoop cluster.
    If the data is stored in an S3-compatible system, such as Tencent Cloud COS, the following properties need to be set:
    AWS_ACCESS_KEY: The SecretId of your Tencent Cloud account.
    AWS_SECRET_KEY: The SecretKey of your Tencent Cloud account.
    AWS_ENDPOINT: The endpoint of the COS bucket, for example: cos.ap-nanjing.myqcloud.com.
    AWS_REGION: The region of the COS bucket, for example: ap-nanjing.
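    For reference, below is a minimal sketch of Example 4 adapted to Tencent Cloud COS; the bucket region (ap-nanjing) and the credential values are placeholders to replace with your own:

    CREATE TABLE `t_hive_cos` (
    `k1` int NOT NULL COMMENT ""
    ) ENGINE=HIVE
    COMMENT "HIVE"
    PROPERTIES (
    'hive.metastore.uris' = 'thrift://192.168.0.1:9083',
    'database' = 'hive_db',
    'table' = 'hive_table',
    'AWS_ACCESS_KEY' = 'your_secret_id',
    'AWS_SECRET_KEY' = 'your_secret_key',
    'AWS_ENDPOINT' = 'cos.ap-nanjing.myqcloud.com',
    'AWS_REGION' = 'ap-nanjing'
    );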
    Note
    To have Doris access a Hadoop cluster with Kerberos authentication enabled, you need to deploy the Kerberos client (kinit) on all running nodes of the Doris cluster and configure krb5.conf with the KDC service information (a sketch follows this note).
    The value of the hadoop.kerberos.keytab property must be the absolute path to the keytab file, and the Doris process must be allowed to access this local file.
    The configuration of the HDFS cluster can be written into the hdfs-site.xml file, which is located in the conf directory of both FE and BE. Then, when creating a Hive table, users do not need to fill in the configuration info for the HDFS cluster.
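    A minimal krb5.conf sketch; the realm REALM.COM and the KDC host below are placeholders that must match your actual KDC deployment and the principals used in Example 3:

    [libdefaults]
        default_realm = REALM.COM

    [realms]
        REALM.COM = {
            kdc = kdc.realm.com
            admin_server = kdc.realm.com
        }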

    Type Matching

    The correspondence between supported Hive column types and Doris column types is as follows:

    Hive      | Doris    | Description
    ----------|----------|----------------------------------------------------------------
    BOOLEAN   | BOOLEAN  | -
    CHAR      | CHAR     | Only UTF8 encoding is currently supported.
    VARCHAR   | VARCHAR  | Only UTF8 encoding is currently supported.
    TINYINT   | TINYINT  | -
    SMALLINT  | SMALLINT | -
    INT       | INT      | -
    BIGINT    | BIGINT   | -
    FLOAT     | FLOAT    | -
    DOUBLE    | DOUBLE   | -
    DECIMAL   | DECIMAL  | -
    DATE      | DATE     | -
    TIMESTAMP | DATETIME | Converting TIMESTAMP to DATETIME may cause a loss of precision.
    Note
    The schema of the Hive table is not synchronized automatically. If it changes, you need to rebuild the Hive external table in Doris (a sketch follows this note).
    Only the Text, Parquet, and ORC storage formats of Hive are currently supported.
    Hive versions 2.3.7 and 3.1.2 are supported by default; other versions have not been tested. More versions will be supported in the future.
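    A sketch of rebuilding the external table after the Hive schema changes; the column k7 below stands in for a hypothetical column newly added on the Hive side:

    -- Dropping the external table only removes the mapping in Doris; the Hive data is not affected
    DROP TABLE IF EXISTS `t_hive`;
    -- Recreate the external table with the new schema
    CREATE TABLE `t_hive` (
    `k1` int NOT NULL COMMENT "",
    `k2` char(10) NOT NULL COMMENT "",
    `k7` bigint NOT NULL COMMENT ""
    ) ENGINE=HIVE
    COMMENT "HIVE"
    PROPERTIES (
    'hive.metastore.uris' = 'thrift://192.168.0.1:9083',
    'database' = 'hive_db',
    'table' = 'hive_table'
    );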

    Query Usage

    You can use a Hive external table in Doris in the same way as a Doris OLAP table, except that the Doris data models (Rollup, Pre-Aggregation, Materialized View, etc.) are unavailable.
    select * from t_hive where k1 > 1000 and k2 = 'term' or k5 like '%doris';
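    A sketch of the joint query mentioned at the beginning of this document, assuming a hypothetical native Doris table t_doris that shares the k1 column with t_hive:

    -- Join the Hive external table with a native Doris OLAP table
    SELECT h.k1, h.k5, d.v1
    FROM t_hive h
    JOIN t_doris d ON h.k1 = d.k1
    WHERE h.k1 > 1000;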
    