Cloudera's Distribution Including Apache Hadoop (CDH) is a popular Hadoop distribution. This document describes how to use CHDFS on CDH to separate big data computing from storage and provide a flexible, low-cost big data solution.
CHDFS supports the following big data components:
Component | Supported by CHDFS | Service Component to Restart |
---|---|---|
YARN | Yes | NodeManager |
Hive | Yes | HiveServer and HiveMetastore |
Spark | Yes | NodeManager |
Sqoop | Yes | NodeManager |
Presto | Yes | HiveServer, HiveMetastore, and Presto |
Flink | Yes | None |
Impala | Yes | None |
EMR | Yes | None |
Self-built component | To be supported in the future | None |
HBase | Not recommended | None |
The dependent component versions used in this document are CDH 5.16.1 and the Hadoop 2.6.0 release bundled with it.
In the CDH HDFS service configuration, add the following CHDFS configuration items to Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml:
<property>
    <name>fs.AbstractFileSystem.ofs.impl</name>
    <value>com.qcloud.chdfs.fs.CHDFSDelegateFSAdapter</value>
</property>
<property>
    <name>fs.ofs.impl</name>
    <value>com.qcloud.chdfs.fs.CHDFSHadoopFileSystemAdapter</value>
</property>
<!-- Temporary directory of the local cache. For data reads/writes, data is written to the local disk when the memory cache is insufficient. This path is created automatically if it does not exist. -->
<property>
    <name>fs.ofs.tmp.cache.dir</name>
    <value>/data/emr/hdfs/tmp/chdfs/</value>
</property>
<!-- appid -->
<property>
    <name>fs.ofs.user.appid</name>
    <value>1250000000</value>
</property>
The following are the required CHDFS configuration items (which need to be added to core-site.xml). For other CHDFS configuration items, please see Mounting CHDFS Instance.
CHDFS Configuration Item | Value | Description |
---|---|---|
fs.ofs.user.appid | 1250000000 | User appid |
fs.ofs.tmp.cache.dir | /data/emr/hdfs/tmp/chdfs/ | Temporary directory of the local cache |
fs.ofs.impl | com.qcloud.chdfs.fs.CHDFSHadoopFileSystemAdapter | CHDFS implementation class for FileSystem, which is fixed at com.qcloud.chdfs.fs.CHDFSHadoopFileSystemAdapter |
fs.AbstractFileSystem.ofs.impl | com.qcloud.chdfs.fs.CHDFSDelegateFSAdapter | CHDFS implementation class for AbstractFileSystem, which is fixed at com.qcloud.chdfs.fs.CHDFSDelegateFSAdapter |
The core-site.xml configuration will then be updated to the servers in the cluster. Next, place the CHDFS SDK JAR package in the corresponding HDFS directory, for example:
cp chdfs_hadoop_plugin_network-2.0.jar /opt/cloudera/parcels/CDH-5.16.1-1.cdh5.16.1.p0.3/lib/hadoop-hdfs/
Note: You need to place the SDK package in the same location on each server in the cluster.
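If the cluster has many nodes, you can distribute the JAR package with a simple script. The following is only a minimal sketch, assuming passwordless SSH and a hypothetical hosts.txt file listing all cluster hostnames:
# Hypothetical sketch: copy the CHDFS SDK JAR to every server in the cluster
for host in $(cat hosts.txt); do
  scp chdfs_hadoop_plugin_network-2.0.jar ${host}:/opt/cloudera/parcels/CDH-5.16.1-1.cdh5.16.1.p0.3/lib/hadoop-hdfs/
done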
Data migration
Migrate the data from CDH HDFS to CHDFS with the Hadoop DistCp tool. For more information, please see Migrating Data from Native HDFS to CHDFS.
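For reference, a DistCp invocation generally looks like the following. This is a sketch only: the NameNode address, HDFS source path, and CHDFS mount point are placeholders that you need to replace with your own values.
# Hypothetical example: copy /data/warehouse from the native HDFS to the CHDFS instance
hadoop distcp hdfs://${namenode_ip}:8020/data/warehouse ofs://examplebucket-1250000000/data/warehouse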
MapReduce
Directions
(1) Configure HDFS as instructed in Data migration and place the CHDFS SDK JAR package in the corresponding directory of HDFS.
(2) On the CDH system homepage, find YARN and restart NodeManager (you don't need to restart NodeManager for TeraGen but need to do so for TeraSort due to internal business logic; therefore, we recommend you restart NodeManager in both cases).
Sample
The following uses TeraGen and TeraSort in a standard Hadoop test as an example:
hadoop jar ./hadoop-mapreduce-examples-2.7.3.jar teragen -Dmapred.map.tasks=4 1099 ofs://examplebucket-1250000000/teragen_5/
hadoop jar ./hadoop-mapreduce-examples-2.7.3.jar terasort -Dmapred.map.tasks=4 ofs://examplebucket-1250000000/teragen_5/ ofs://examplebucket-1250000000/result14
Note: Please replace the path after the ofs:// scheme with the mount point path of your CHDFS instance.
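To confirm that the job output was written to CHDFS, you can list the output directory. This is an optional check using the sample path above:
hadoop fs -ls ofs://examplebucket-1250000000/teragen_5/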
Hive
Directions
(1) Configure HDFS as instructed in Data migration and place the CHDFS SDK JAR package in the corresponding directory of HDFS.
(2) On the CDH homepage, find Hive and restart the HiveServer2 and HiveMetastore roles.
Sample
The following uses a real business query as an example. Run the Hive command line to create a partitioned table whose location is on CHDFS:
CREATE TABLE `report.report_o2o_pid_credit_detail_grant_daily`(
`cal_dt` string,
`change_time` string,
`merchant_id` bigint,
`store_id` bigint,
`store_name` string,
`wid` string,
`member_id` bigint,
`meber_card` string,
`nickname` string,
`name` string,
`gender` string,
`birthday` string,
`city` string,
`mobile` string,
`credit_grant` bigint,
`change_reason` string,
`available_point` bigint,
`date_time` string,
`channel_type` bigint,
`point_flow_id` bigint)
PARTITIONED BY (
`topicdate` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'ofs://examplebucket-1250000000/user/hive/warehouse/report.db/report_o2o_pid_credit_detail_grant_daily'
TBLPROPERTIES (
'last_modified_by'='work',
'last_modified_time'='1589310646',
'transient_lastDdlTime'='1589310646')
Run the following SQL query:
select count(1) from report.report_o2o_pid_credit_detail_grant_daily;
You will see the following results:
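If you need to write data under the CHDFS location, partitions are added in the same way as with HDFS paths. The following is only an illustrative sketch; the partition value and path are hypothetical:
ALTER TABLE report.report_o2o_pid_credit_detail_grant_daily
ADD PARTITION (topicdate='20200512')
LOCATION 'ofs://examplebucket-1250000000/user/hive/warehouse/report.db/report_o2o_pid_credit_detail_grant_daily/topicdate=20200512';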
For the Tez engine, you need to import the CHDFS JAR package into the compressed package of Tez. The following takes apache-tez.0.8.5 as an example:
Directions
(1) Find the Tez package installed on the CDH cluster and decompress it, such as /usr/local/service/tez/tez-0.8.5.tar.gz.
(2) Place the CHDFS JAR package in the decompressed directory and recompress it to output a new compressed package (see the sketch after this list).
(3) Upload the new compressed package to the path specified by tez.lib.uris (if a package already exists there, directly replace it).
(4) On the CDH homepage, find Hive and restart HiveServer and HiveMetastore.
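The following is a minimal sketch of steps (1) to (3), assuming the paths shown above and that tez.lib.uris points to an HDFS path such as /apps/tez/tez-0.8.5.tar.gz (both paths are assumptions; adjust them to your cluster):
# Hypothetical sketch: repackage Tez with the CHDFS SDK JAR
mkdir -p /tmp/tez-0.8.5
tar -zxvf /usr/local/service/tez/tez-0.8.5.tar.gz -C /tmp/tez-0.8.5
cp chdfs_hadoop_plugin_network-2.0.jar /tmp/tez-0.8.5/
cd /tmp/tez-0.8.5 && tar -zcvf ../tez-0.8.5.tar.gz ./*
hadoop fs -put -f /tmp/tez-0.8.5.tar.gz /apps/tez/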
Spark
Directions
(1) Configure HDFS as instructed in Data migration and place the CHDFS SDK JAR package in the corresponding directory of HDFS.
(2) Restart NodeManager.
Sample
The following takes the Spark WordCount example test conducted with CHDFS as an example:
spark-submit --class org.apache.spark.examples.JavaWordCount --executor-memory 4g --executor-cores 4 ./spark-examples-1.6.0-cdh5.16.1-hadoop2.6.0-cdh5.16.1.jar ofs://examplebucket-1250000000/wordcount
The execution result is as follows:
Sqoop
Directions
(1) Configure HDFS as instructed in Data migration and place the CHDFS SDK JAR package in the corresponding directory of HDFS.
(2) Place the CHDFS SDK JAR package in the sqoop directory (such as /opt/cloudera/parcels/CDH-5.16.1-1.cdh5.16.1.p0.3/lib/sqoop/).
(3) Restart NodeManager.
Sample
The following test exports a MySQL table to CHDFS as instructed in Import/Export of Relational Database and HDFS:
sqoop import --connect "jdbc:mysql://IP:PORT/mysql" --table sqoop_test --username root --password 123 --target-dir ofs://examplebucket-1250000000/sqoop_test
The execution result is as follows:
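Conversely, data on CHDFS can be exported back to MySQL with sqoop export. The following is a sketch only, reusing the hypothetical connection parameters from the import example above:
sqoop export --connect "jdbc:mysql://IP:PORT/mysql" --table sqoop_test --username root --password 123 --export-dir ofs://examplebucket-1250000000/sqoop_test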
Presto
Directions
(1) Configure HDFS as instructed in Data migration and place the CHDFS SDK JAR package in the corresponding directory of HDFS.
(2) Place the CHDFS SDK JAR package in the presto directory (such as /usr/local/services/cos_presto/plugin/hive-hadoop2).
(3) As Presto doesn't load gson-2.*.*.jar under hadoop common, you also need to place gson-2.*.*.jar in the presto directory (such as /usr/local/services/cos_presto/plugin/hive-hadoop2; only CHDFS depends on Gson). See the sketch after this list.
(4) Restart HiveServer, HiveMetaStore, and Presto.
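The following is a minimal sketch of steps (2) and (3), assuming the plugin directory above and that the Gson JAR is taken from the Hadoop installation directory of the CDH parcel (both paths are assumptions):
cp chdfs_hadoop_plugin_network-2.0.jar /usr/local/services/cos_presto/plugin/hive-hadoop2/
cp /opt/cloudera/parcels/CDH-5.16.1-1.cdh5.16.1.p0.3/lib/hadoop/lib/gson-2.*.jar /usr/local/services/cos_presto/plugin/hive-hadoop2/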
Sample
The following takes querying a Hive table whose location is on CHDFS as an example:
select * from chdfs_test_table where bucket is not null limit 1;
Note: chdfs_test_table is a table whose location uses the ofs scheme.
The query result is as follows: