Guide to Configuring CHDFS for CDH

Last updated: 2022-03-30 09:30:26

    Overview

    Cloudera's Distribution Including Apache Hadoop (CDH) is a popular Hadoop distribution. This document describes how to use CHDFS on CDH to separate big data computing and storage and provide a flexible big data solution at low costs.

    CHDFS supports the following big data components:

    Component              Supported by CHDFS              Service Component to Restart
    YARN                   Yes                             NodeManager
    Hive                   Yes                             HiveServer and HiveMetastore
    Spark                  Yes                             NodeManager
    Sqoop                  Yes                             NodeManager
    Presto                 Yes                             HiveServer, HiveMetastore, and Presto
    Flink                  Yes                             None
    Impala                 Yes                             None
    EMR                    Yes                             None
    Self-built component   To be supported in the future   None
    HBase                  Not recommended                 None

    Version Dependency

    The dependent component versions used in this document are as follows:

    • CDH 5.16.1
    • Hadoop 2.6.0

    Directions

    Storage environment configuration

    1. Log in to Cloudera Manager (CDH management page).
    2. On the system homepage, select Configuration > Service-Wide > Advanced to open the advanced configuration snippet page.
    3. Enter the CHDFS configuration in the code box of Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml.
    <property>
        <name>fs.AbstractFileSystem.ofs.impl</name>
        <value>com.qcloud.chdfs.fs.CHDFSDelegateFSAdapter</value>
    </property>
    <property>
        <name>fs.ofs.impl</name>
        <value>com.qcloud.chdfs.fs.CHDFSHadoopFileSystemAdapter</value>
    </property>
    <!--Temporary directory for the local cache. During reads and writes, data spills to the local disk when the memory cache is insufficient. The path is created automatically if it does not exist-->
    <property>
        <name>fs.ofs.tmp.cache.dir</name>
        <value>/data/emr/hdfs/tmp/chdfs/</value>
    </property>
    <!--appId-->
    <property>
        <name>fs.ofs.user.appid</name>
        <value>1250000000</value>
    </property>
    

    The following are the required CHDFS configuration items (which need to be added to core-site.xml). For other CHDFS configuration items, please see Mounting CHDFS Instance.

    CHDFS Configuration Item         Value                                               Description
    fs.ofs.user.appid                1250000000                                          User appid
    fs.ofs.tmp.cache.dir             /data/emr/hdfs/tmp/chdfs/                           Temporary directory for the local cache
    fs.ofs.impl                      com.qcloud.chdfs.fs.CHDFSHadoopFileSystemAdapter    CHDFS implementation class for FileSystem; fixed at this value
    fs.AbstractFileSystem.ofs.impl   com.qcloud.chdfs.fs.CHDFSDelegateFSAdapter          CHDFS implementation class for AbstractFileSystem; fixed at this value
    4. In the HDFS service, deploy the client configuration so that the core-site.xml configuration above is pushed to every server in the cluster.
    5. Place the latest CHDFS SDK package under the JAR package path of the CDH HDFS service, replacing the path below with the actual value:
    cp chdfs_hadoop_plugin_network-2.0.jar /opt/cloudera/parcels/CDH-5.16.1-1.cdh5.16.1.p0.3/lib/hadoop-hdfs/
    
    Note:

    You need to place the SDK package at the same location on each server in the cluster.
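
    The SDK can be distributed and verified with a short shell loop; a minimal sketch, assuming passwordless SSH and that node01, node02, and node03 are hypothetical hostnames of your cluster servers:

    # Hypothetical node list; replace with the servers in your cluster
    for host in node01 node02 node03; do
        scp chdfs_hadoop_plugin_network-2.0.jar \
            ${host}:/opt/cloudera/parcels/CDH-5.16.1-1.cdh5.16.1.p0.3/lib/hadoop-hdfs/
    done

    # After the client configuration is deployed, confirm a key took effect
    hdfs getconf -confKey fs.ofs.user.appid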

    Data migration

    Migrate the data from CDH HDFS to CHDFS with the Hadoop DistCp tool. For more information, please see Migrating Data from Native HDFS to CHDFS.
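
    As a minimal sketch of such a migration, assuming hdfs://namenode:8020/data/warehouse is a hypothetical source path and the ofs:// target uses the placeholder mount point from this document's other examples:

    hadoop distcp \
        hdfs://namenode:8020/data/warehouse \
        ofs://examplebucket-1250000000/data/warehouse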

    Using CHDFS for big data suites

    1. MapReduce

    Directions

    (1) Configure HDFS as instructed in Storage environment configuration and place the CHDFS SDK JAR package in the corresponding directory of HDFS.
    (2) On the CDH system homepage, find YARN and restart NodeManager (due to its internal business logic, TeraSort requires the restart while TeraGen does not; we recommend restarting NodeManager in both cases).

    Sample

    The following uses TeraGen and TeraSort in a standard Hadoop test as an example:

    hadoop jar ./hadoop-mapreduce-examples-2.7.3.jar teragen -Dmapred.map.tasks=4 1099 ofs://examplebucket-1250000000/teragen_5/
    hadoop jar ./hadoop-mapreduce-examples-2.7.3.jar terasort -Dmapred.map.tasks=4 ofs://examplebucket-1250000000/teragen_5/ ofs://examplebucket-1250000000/result14
    
    Note:

    Replace the path after the ofs:// scheme with the mount point path of your CHDFS instance.
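
    Before running the jobs, you can also smoke-test CHDFS access from the shell; a minimal sketch using the same placeholder mount point:

    hadoop fs -mkdir -p ofs://examplebucket-1250000000/chdfs_smoke_test
    hadoop fs -ls ofs://examplebucket-1250000000/
    hadoop fs -rm -r ofs://examplebucket-1250000000/chdfs_smoke_test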

    2. Hive

    2.1 MR engine

    Directions

    (1) Configure HDFS as instructed in Storage environment configuration and place the CHDFS SDK JAR package in the corresponding directory of HDFS.
    (2) On the CDH homepage, find Hive and restart the HiveServer2 and HiveMetastore roles.

    Sample

    The following simulates a user's real business query: run the Hive command line to create a partitioned table whose location is on CHDFS:

    CREATE TABLE `report.report_o2o_pid_credit_detail_grant_daily`(
      `cal_dt` string,
      `change_time` string,
      `merchant_id` bigint,
      `store_id` bigint,
      `store_name` string,
      `wid` string,
      `member_id` bigint,
      `meber_card` string,
      `nickname` string,
      `name` string,
      `gender` string,
      `birthday` string,
      `city` string,
      `mobile` string,
      `credit_grant` bigint,
      `change_reason` string,
      `available_point` bigint,
      `date_time` string,
      `channel_type` bigint,
      `point_flow_id` bigint)
    PARTITIONED BY (
      `topicdate` string)
    ROW FORMAT SERDE
      'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
    STORED AS INPUTFORMAT
      'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
    OUTPUTFORMAT
      'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
    LOCATION
      'ofs://examplebucket-1250000000/user/hive/warehouse/report.db/report_o2o_pid_credit_detail_grant_daily'
    TBLPROPERTIES (
      'last_modified_by'='work',
      'last_modified_time'='1589310646',
      'transient_lastDdlTime'='1589310646')
    

    Run the following SQL query:

    select count(1) from report.report_o2o_pid_credit_detail_grant_daily;
    

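    To confirm that the table actually resolves to CHDFS, you can inspect its metadata from the shell; a minimal sketch:

    # Prints the LOCATION line, which should show the ofs:// path from the DDL above
    hive -e "DESCRIBE FORMATTED report.report_o2o_pid_credit_detail_grant_daily;" | grep -i location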

    2.2 Tez engine

    For the Tez engine, you need to add the CHDFS JAR package to the Tez compressed package. The following takes apache-tez-0.8.5 as an example:

    Directions

    (1) Find the Tez package installed on the CDH cluster and decompress it, such as /usr/local/service/tez/tez-0.8.5.tar.gz.
    (2) Place the CHDFS JAR package in the decompressed directory, recompress it, and output a compressed package.
    (3) Upload the new compressed package to the path specified by tez.lib.uris (if it already exists, directly replace it).
    (4) On the CDH homepage, find Hive and restart HiveServer and HiveMetastore.
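
    Steps (1)–(3) can be scripted; a minimal sketch, where the working directory, the JAR location, and the tez.lib.uris path are placeholders for your environment:

    # Unpack the Tez release shipped with the cluster
    mkdir -p /tmp/tez-repack && cd /tmp/tez-repack
    tar -zxf /usr/local/service/tez/tez-0.8.5.tar.gz

    # Add the CHDFS SDK JAR and repack
    cp /opt/cloudera/parcels/CDH-5.16.1-1.cdh5.16.1.p0.3/lib/hadoop-hdfs/chdfs_hadoop_plugin_network-2.0.jar .
    tar -zcf tez-0.8.5-chdfs.tar.gz *

    # Upload to the path configured in tez.lib.uris (placeholder path shown)
    hadoop fs -put -f tez-0.8.5-chdfs.tar.gz hdfs:///apps/tez/tez-0.8.5.tar.gz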

    3. Spark

    Directions

    (1) Configure HDFS as instructed in Storage environment configuration and place the CHDFS SDK JAR package in the corresponding directory of HDFS.
    (2) Restart NodeManager.

    Sample

    The following runs the standard Spark word count example against CHDFS:

    spark-submit --class org.apache.spark.examples.JavaWordCount --executor-memory 4g --executor-cores 4 ./spark-examples-1.6.0-cdh5.16.1-hadoop2.6.0-cdh5.16.1.jar ofs://examplebucket-1250000000/wordcount
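
    The job reads its input from the ofs:// path given on the command line; a minimal sketch for staging a small test file there first, using the same placeholder mount point:

    echo "hello chdfs hello cdh" > /tmp/wordcount.txt
    hadoop fs -put -f /tmp/wordcount.txt ofs://examplebucket-1250000000/wordcount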
    


    4. Sqoop

    Directions

    (1) Configure HDFS as instructed in Storage environment configuration and place the CHDFS SDK JAR package in the corresponding directory of HDFS.

    (2) Place the CHDFS SDK JAR package in the sqoop directory (such as /opt/cloudera/parcels/CDH-5.16.1-1.cdh5.16.1.p0.3/lib/sqoop/).

    (3) Restart NodeManager.

    Sample

    The following test exports a MySQL table to CHDFS as instructed in Import/Export of Relational Database and HDFS:

    sqoop import --connect "jdbc:mysql://IP:PORT/mysql" --table sqoop_test --username root --password 123 --target-dir ofs://examplebucket-1250000000/sqoop_test
    

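    Once the job finishes, the imported files can be inspected directly; a minimal sketch, where the part-m-* file name follows Sqoop's usual output convention and the mount point is a placeholder:

    hadoop fs -ls ofs://examplebucket-1250000000/sqoop_test/
    hadoop fs -cat ofs://examplebucket-1250000000/sqoop_test/part-m-00000 | head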

    5. Presto

    Directions

    (1) Configure HDFS as instructed in Storage environment configuration and place the CHDFS SDK JAR package in the corresponding directory of HDFS.
    (2) Place the CHDFS SDK JAR package in the Presto directory (such as /usr/local/services/cos_presto/plugin/hive-hadoop2).
    (3) Because Presto doesn't load gson-2.*.*.jar from hadoop common, you also need to place gson-2.*.*.jar in the same Presto directory (only CHDFS depends on Gson).
    (4) Restart HiveServer, HiveMetaStore, and Presto.
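
    Steps (2) and (3) amount to copying two JARs into the Presto plugin directory on every Presto node; a minimal sketch, with placeholder paths and a hypothetical Gson version:

    PRESTO_PLUGIN=/usr/local/services/cos_presto/plugin/hive-hadoop2   # example path
    cp chdfs_hadoop_plugin_network-2.0.jar ${PRESTO_PLUGIN}/
    # The gson path and version are hypothetical; use the copy shipped with your Hadoop distribution
    cp /opt/cloudera/parcels/CDH-5.16.1-1.cdh5.16.1.p0.3/lib/hadoop/lib/gson-2.2.4.jar ${PRESTO_PLUGIN}/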

    Sample

    The following takes querying a Hive table whose location is on CHDFS as an example:

    select * from chdfs_test_table where bucket is not null limit 1;
    
    Note:

    chdfs_test_table is a table whose location uses the ofs scheme.

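    The same query can also be issued from the Presto CLI; a minimal sketch, where the server address, catalog, and schema are placeholders:

    presto --server localhost:8080 --catalog hive --schema default \
        --execute "select * from chdfs_test_table where bucket is not null limit 1"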
