Tencent Cloud


Accessing Hudi Data with Hive

Last updated: 2024-10-30 11:43:08

    Development Preparation

    Make sure you have signed up for a Tencent Cloud account and created an EMR cluster. For more details, see Creating a Cluster.
    During the creation of an EMR cluster, select the Hive, Spark, and Hudi components in the software configuration interface.

    Reading and Writing Hudi with Spark

    Log in to the master node, switch to the hadoop user, and use SparkSQL with the HoodieSparkSessionExtension extension to read and write data:
    spark-sql --master yarn \
    --num-executors 2 \
    --executor-memory 1g \
    --executor-cores 2 \
    --jars /usr/local/service/hudi/hudi-bundle/hudi-spark3.3-bundle_2.12-0.13.0.jar \
    --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
    --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
    --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
    Note:
    Here, --master specifies the master URL, --num-executors the number of executors, --executor-memory the memory per executor, and --executor-cores the cores per executor. Adjust these parameters to your actual requirements. The version of the dependency package passed to --jars may vary across EMR versions; check the /usr/local/service/hudi/hudi-bundle directory and use the bundle jar that matches your cluster.
    Create a table:
    -- Create a non-partitioned COW table
    
    
    spark-sql> create table hudi_cow_nonpcf_tbl (
    uuid int,
    name string,
    price double
    ) using hudi
    tblproperties (
    primaryKey = 'uuid'
    );
    
    
    -- Create a partitioned COW table
    
    
    spark-sql> create table hudi_cow_pt_tbl (
    id bigint,
    name string,
    ts bigint,
    dt string,
    hh string
    ) using hudi
    tblproperties (
    type = 'cow',
    primaryKey = 'id',
    preCombineField = 'ts'
    )
    partitioned by (dt, hh);
    
    
    -- Create a partitioned MOR table
    
    
    spark-sql> create table hudi_mor_tbl (
    id int,
    name string,
    price double,
    ts bigint,
    dt string
    ) using hudi
    tblproperties (
    type = 'mor',
    primaryKey = 'id',
    preCombineField = 'ts'
    )
    partitioned by (dt);
    Write data:
    -- insert into non-partitioned table
    spark-sql> insert into hudi_cow_nonpcf_tbl select 1, 'a1', 20;
    
    
    -- insert dynamic partition
    spark-sql> insert into hudi_cow_pt_tbl partition (dt, hh) select 1 as id, 'a1' as name, 1000 as ts, '2021-12-09' as dt, '10' as hh;
    
    
    -- insert static partition
    spark-sql> insert into hudi_cow_pt_tbl partition(dt = '2021-12-09', hh='11') select 2, 'a2', 1000;
    spark-sql> insert into hudi_mor_tbl partition(dt = '2021-12-09') select 1, 'a1', 20, 1000;
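    After the initial inserts, you can modify and read the data back from the same spark-sql session. The statements below are a minimal sketch against the tables created above; Hudi's Spark SQL layer supports UPDATE and MERGE INTO on tables that declare a primaryKey, with the preCombineField (ts) used to resolve duplicate records:

```sql
-- Snapshot query on the MOR table
select id, name, price, ts from hudi_mor_tbl where dt = '2021-12-09';

-- Update a row by primary key
update hudi_mor_tbl set price = 25 where id = 1;

-- Upsert via MERGE INTO using an inline source
merge into hudi_mor_tbl as target
using (
  select 2 as id, 'a2' as name, 30.0 as price, 1100 as ts, '2021-12-09' as dt
) as source
on target.id = source.id
when matched then update set *
when not matched then insert *;
```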

    Querying Hudi Tables with Hive

    Log in to the master node, switch to the hadoop user, and run the following command to connect to Hive:
    hive
    Add the Hudi dependency package:
    hive> add jar /usr/local/service/hudi/hudi-bundle/hudi-hadoop-mr-bundle-0.13.0.jar;
    View the tables. Note that for the MOR table, Hudi registers two views in Hive: a read-optimized view (suffix _ro) and a real-time snapshot view (suffix _rt):
    hive> show tables;
    OK
    hudi_cow_nonpcf_tbl
    hudi_cow_pt_tbl
    hudi_mor_tbl
    hudi_mor_tbl_ro
    hudi_mor_tbl_rt
    Time taken: 0.023 seconds, Fetched: 5 row(s)
    Query data:
    hive> select * from hudi_cow_nonpcf_tbl;
    OK
    20230905170525412 20230905170525412_0_0 1 8d32a1cc-11f9-437f-9a7b-8ba9532223d3-0_0-17-15_20230905170525412.parquet 1 a1 20.0
    Time taken: 1.447 seconds, Fetched: 1 row(s)
    
    hive> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
    hive> select * from hudi_mor_tbl_ro;
    OK
    20230808174602565 20230808174602565_0_1 id:1 dt=2021-12-09 af40667d-1dca-4163-89ca-2c48250985b2-0_0-34-1617_20230808174602565.parquet 1 a1 20.0 1000 2021-12-09
    Time taken: 0.159 seconds, Fetched: 1 row(s)
    hive> set hive.vectorized.execution.enabled=false;
    hive> select name, count(*) from hudi_mor_tbl_rt group by name;
    a1 1
    Time taken: 17.618 seconds, Fetched: 1 row(s)
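
    The COW tables created earlier can be queried from Hive in the same way. For example, a partition-pruned query against hudi_cow_pt_tbl (a sketch, assuming the static partition written above exists):

```sql
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
select id, name, ts from hudi_cow_pt_tbl where dt = '2021-12-09' and hh = '11';
```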
    