Hive Catalog

Last updated: 2024-07-04 10:16:48
    Note:
    This feature is applicable to Tencent Cloud TCHouse-D 1.2 and subsequent versions.
    By connecting to Hive Metastore, or a metadata service compatible with Hive Metastore, Doris can automatically obtain Hive database table information and perform data queries.
    Apart from Hive, many other systems also use the Hive Metastore to store metadata. Therefore, through the Hive Catalog, we can access not only Hive but also systems that use Hive Metastore as metadata storage, such as Iceberg, Hudi, etc.

    Usage Notes

    1. Place core-site.xml, hdfs-site.xml, and hive-site.xml in the conf directory of FE and BE. The system first reads the Hadoop configuration files in the conf directory, and then reads the configuration files under the path specified by the environment variable HADOOP_CONF_DIR.
    2. Hive versions 1, 2, and 3 are supported.
    3. Both Managed Tables and External Tables are supported.
    4. Hive, Iceberg, and Hudi metadata stored in Hive Metastore can be identified.
    5. Hive tables with data stored on JuiceFS are supported; to use them, put juicefs-hadoop-x.x.x.jar in fe/lib/ and apache_hdfs_broker/lib/ (see the JuiceFS example under Create Catalog below).
    6. Hive tables with data stored on CHDFS are supported. The environment must be configured as follows:
    6.1 Put chdfs_hadoop_plugin_network-x.x.jar in fe/lib/ and apache_hdfs_broker/lib/.
    6.2 Copy the core-site.xml and hdfs-site.xml of the Hadoop cluster where Hive is located to fe/conf/ and apache_hdfs_broker/conf.
    7. Hive and Iceberg tables with data stored on GooseFS (GFS) are supported. The environment must be configured as follows:
    7.1 Put goosefs-x.x.x-client.jar in fe/lib/ and apache_hdfs_broker/lib/.
    7.2 Add the following properties when creating a catalog: 'fs.AbstractFileSystem.gfs.impl' = 'com.qcloud.cos.goosefs.hadoop.GooseFileSystem', 'fs.gfs.impl' = 'com.qcloud.cos.goosefs.hadoop.FileSystem'. See the sketch after this list.
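    A minimal sketch of a GooseFS-backed Hive Catalog, combining the two GooseFS properties above with the basic catalog parameters; the metastore address is illustrative:
    CREATE CATALOG hive PROPERTIES (
    'type'='hms',
    'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
    'fs.AbstractFileSystem.gfs.impl' = 'com.qcloud.cos.goosefs.hadoop.GooseFileSystem',
    'fs.gfs.impl' = 'com.qcloud.cos.goosefs.hadoop.FileSystem'
    );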

    Feature Limits

    1. Querying views in Hive / Iceberg / Hudi is not supported.
    2. Version 1.2 does not support the SHOW PARTITIONS FROM statement; it is supported in v2.0 and later versions.
    3. For text format tables, Doris only supports uncompressed data; compressed formats such as lzo, bzip2, deflate, lz4, and zstd are not supported.
    4. Doris caches the metadata of Hive / Iceberg / Hudi tables. If the source table is frequently updated and you need the most up-to-date data from Doris, execute the REFRESH TABLE command before querying the table, as shown below.
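    A minimal sketch of refreshing cached metadata before a query; the catalog, database, and table names are illustrative:
    REFRESH TABLE hive.db1.table1;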

    Create Catalog

    CREATE CATALOG hive PROPERTIES (
    'type'='hms',
    'hive.metastore.uris' = 'thrift://172.21.0.1:7004'
    );
    Apart from the two required parameters, type and hive.metastore.uris, more parameters can be passed to provide the information required for the connection.
    If HDFS HA information needs to be provided, here is an example:
    CREATE CATALOG hive PROPERTIES (
    'type'='hms',
    'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
    'hadoop.username' = 'hive',
    'dfs.nameservices'='your-nameservice',
    'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
    'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
    'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
    'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
    );
    To provide HDFS HA information and Kerberos authentication information at the same time, here is an example:
    CREATE CATALOG hive PROPERTIES (
    'type'='hms',
    'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
    'hive.metastore.sasl.enabled' = 'true',
    'hive.metastore.kerberos.principal' = 'your-hms-principal',
    'dfs.nameservices'='your-nameservice',
    'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
    'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
    'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
    'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider',
    'hadoop.security.authentication' = 'kerberos',
    'hadoop.kerberos.keytab' = '/your-keytab-filepath/your.keytab',
    'hadoop.kerberos.principal' = 'your-principal@YOUR.COM',
    'yarn.resourcemanager.principal' = 'your-rm-principal'
    );
    Place the krb5.conf file and the keytab authentication file on all BE and FE nodes, and keep the path of the keytab authentication file consistent with the configuration. The krb5.conf file is placed in the path /etc/krb5.conf by default. The value of hive.metastore.kerberos.principal must be consistent with the property of the same name in the connected Hive Metastore, which can be obtained from hive-site.xml.
    To provide Hadoop KMS encrypted transmission information, here is an example:
    CREATE CATALOG hive PROPERTIES (
    'type'='hms',
    'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
    'dfs.encryption.key.provider.uri' = 'kms://http@kms_host:kms_port/kms'
    );
    If the Hive data is stored on JuiceFS, here is an example:
    CREATE CATALOG hive PROPERTIES (
    'type'='hms',
    'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
    'hadoop.username' = 'root',
    'fs.jfs.impl' = 'io.juicefs.JuiceFileSystem',
    'fs.AbstractFileSystem.jfs.impl' = 'io.juicefs.JuiceFS',
    'juicefs.meta' = 'xxx'
    );
    If Hive metadata is stored in Glue and the data is stored in S3, here is an example:
    CREATE CATALOG hive PROPERTIES (
    "type"="hms",
    "hive.metastore.type" = "glue",
    "aws.region" = "us-east-1",
    "aws.glue.access-key" = "ak",
    "aws.glue.secret-key" = "sk",
    "AWS_ENDPOINT" = "s3.us-east-1.amazonaws.com",
    "AWS_REGION" = "us-east-1",
    "AWS_ACCESS_KEY" = "ak",
    "AWS_SECRET_KEY" = "sk",
    "use_path_style" = "true"
    );
    In version 1.2.1 and later, you can also store this information by creating a Resource and then reference the Resource when creating a Catalog. Here is an example:
    # 1. Create a Resource
    CREATE RESOURCE hms_resource PROPERTIES (
    'type'='hms',
    'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
    'hadoop.username' = 'hive',
    'dfs.nameservices'='your-nameservice',
    'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
    'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
    'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
    'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
    );
    # 2. Create a Catalog using the Resource; key-value information given here overrides the information in the Resource.
    CREATE CATALOG hive WITH RESOURCE hms_resource PROPERTIES(
    'key' = 'value'
    );
    When creating a Catalog, you can use the parameter file.meta.cache.ttl-second to set the automatic expiration time of the File Cache, or set this value to 0 to disable the File Cache. The unit is seconds. Here is an example:
    CREATE CATALOG hive PROPERTIES (
    'type'='hms',
    'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
    'hadoop.username' = 'hive',
    'dfs.nameservices'='your-nameservice',
    'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
    'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
    'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
    'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider',
    'file.meta.cache.ttl-second' = '60'
    );
    We can also place hive-site.xml directly into the conf directory of FE and BE; the system will automatically read its information. The override rules are as follows:
    Information in the Resource overrides the information in hive-site.xml.
    Information in CREATE CATALOG PROPERTIES overrides the information in the Resource.
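    Once the catalog is created, a typical way to use it looks like the following sketch; the database and table names are illustrative:
    -- Switch to the new catalog and browse it
    SWITCH hive;
    SHOW DATABASES;
    -- Or query with a fully qualified name without switching
    SELECT * FROM hive.db1.table1 LIMIT 10;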

    Hive Version

    Doris can correctly access Hive Metastores of different Hive versions. By default, Doris accesses the Hive Metastore with the compatibility interface of Hive 2.3. You can also specify the Hive version when creating a Catalog. For example, to access Hive 1.1.0:
    CREATE CATALOG hive PROPERTIES (
    'type'='hms',
    'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
    'hive.version' = '1.1.0'
    );

    Column Type Mapping

    Applicable to Hive/Iceberg/Hudi.
    HMS Type | Doris Type | Comment
    boolean | boolean | -
    tinyint | tinyint | -
    smallint | smallint | -
    int | int | -
    bigint | bigint | -
    date | date | -
    timestamp | datetime | -
    float | float | -
    double | double | -
    char | char | -
    varchar | varchar | -
    decimal | decimal | -
    array<type> | array<type> | Nested arrays are supported, as in array<array<int>>
    map<KeyType, ValueType> | map<KeyType, ValueType> | Nested structures are not supported; KeyType and ValueType must be basic types
    struct<col1: Type1, col2: Type2, ...> | struct<col1: Type1, col2: Type2, ...> | Nested structures are not supported; Type1, Type2, ... must be basic types
    other | unsupported | -
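    As an illustration of the nested-array rule, a Hive table declared with an array<array<int>> column can be queried from Doris directly; the table and column names below are hypothetical:
    -- Hypothetical Hive DDL: CREATE TABLE db1.t (id int, tags array<array<int>>)
    SELECT id, tags FROM hive.db1.t;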

    Using Ranger for Permission Verification

    Note:
    This feature is applicable to Tencent Cloud TCHouse-D 2.0 and later versions.
    Apache Ranger is a security framework for monitoring, enabling services, and comprehensive data security access management on the Hadoop platform. Currently, Doris supports Ranger database, table, and column permissions, but does not support encryption, row-level permissions, etc.

    Environment Configuration

    To connect to a Hive Metastore with Ranger permission verification enabled, you need to add Catalog configuration and configure the FE environment:
    1. Add the following configuration when creating the Catalog (see the full sketch after this list):
    "access_controller.properties.ranger.service.name" = "hive",
    "access_controller.class" = "org.apache.doris.catalog.authorizer.RangerHiveAccessControllerFactory",
    2. Configure all FE environments:
    2.1 Copy the configuration files ranger-hive-audit.xml, ranger-hive-security.xml, and ranger-policymgr-ssl.xml from the HMS conf directory to the <doris_home>/conf directory.
    2.2 Modify the properties of ranger-hive-security.xml; see the following configuration:
    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
    <!-- The directory for caching permission data; it must be writable -->
    <property>
    <name>ranger.plugin.hive.policy.cache.dir</name>
    <value>/mnt/datadisk0/zhangdong/rangerdata</value>
    </property>
    <!-- The interval in milliseconds for periodically pulling permission data -->
    <property>
    <name>ranger.plugin.hive.policy.pollIntervalMs</name>
    <value>30000</value>
    </property>
    
    <property>
    <name>ranger.plugin.hive.policy.rest.client.connection.timeoutMs</name>
    <value>60000</value>
    </property>
    
    <property>
    <name>ranger.plugin.hive.policy.rest.client.read.timeoutMs</name>
    <value>60000</value>
    </property>
    
    <property>
    <name>ranger.plugin.hive.policy.rest.ssl.config.file</name>
    <value></value>
    </property>
    
    <property>
    <name>ranger.plugin.hive.policy.rest.url</name>
    <value>http://172.21.0.32:6080</value>
    </property>
    
    <property>
    <name>ranger.plugin.hive.policy.source.impl</name>
    <value>org.apache.ranger.admin.client.RangerAdminRESTClient</value>
    </property>
    
    <property>
    <name>ranger.plugin.hive.service.name</name>
    <value>hive</value>
    </property>
    
    <property>
    <name>xasecure.hive.update.xapolicies.on.grant.revoke</name>
    <value>true</value>
    </property>
    
    </configuration>
    2.3 To obtain the logs of Ranger authentication itself, add a log4j.properties configuration file in the <doris_home>/conf directory.
    2.4 Restart FE.
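    Putting step 1 together, a minimal sketch of creating a Hive Catalog with Ranger authorization enabled; the metastore address is illustrative:
    CREATE CATALOG hive PROPERTIES (
    'type'='hms',
    'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
    'access_controller.properties.ranger.service.name' = 'hive',
    'access_controller.class' = 'org.apache.doris.catalog.authorizer.RangerHiveAccessControllerFactory'
    );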

    Best Practice

    1. Create a user named user1 on the Ranger side and grant it the query permission on db1.table1.col1.
    2. Create a role named role1 on the Ranger side and grant it the query permission on db1.table1.col2.
    3. Create a user with the same name user1 in Doris; user1 will directly have the query permission on db1.table1.col1.
    4. Create a role with the same name role1 in Doris and assign role1 to user1; user1 will then have the query permissions on both db1.table1.col1 and col2. See the sketch below.
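    A hedged sketch of the Doris-side statements for steps 3 and 4; the password is a placeholder, and the exact GRANT syntax may vary by version:
    CREATE USER 'user1' IDENTIFIED BY 'your_password';  -- same name as the Ranger user
    CREATE ROLE 'role1';                                -- same name as the Ranger role
    GRANT 'role1' TO 'user1';                           -- user1 now carries role1's permissions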