Migrating HDFS Data to Metadata Acceleration-Enabled Bucket
Last updated: 2024-03-25 16:04:01

Overview

COS offers the metadata acceleration feature to provide high-performance file system capabilities. Metadata acceleration leverages the powerful metadata management feature of Cloud HDFS (CHDFS) at the underlying layer to allow using file system semantics for COS access. The system is designed to deliver a bandwidth of up to 100 GB/s, over 100,000 queries per second (QPS), and millisecond-level latency. Buckets with metadata acceleration enabled can be widely used in scenarios such as big data, high-performance computing, machine learning, and AI. For more information on metadata acceleration, see Metadata Acceleration Overview.
COS provides the Hadoop semantics through the metadata acceleration service. Therefore, you can use COSDistCp to easily implement two-way data migration between COS and other Hadoop file systems. This document describes how to use COSDistCp to migrate files in the local HDFS to a metadata acceleration bucket in COS.

Environment Preparations Before Migration

Migration tools

1. Download the JAR packages of the tools listed below and place them in a local directory on the node that runs the migration task, such as /data01/jars.
EMR environment
Installation notes: install the following plugins in the Hadoop environment.
cos-distcp-1.12-3.1.0.jar: COSDistCp package, used to copy data to COSN. For the download address, see COSDistCp.
chdfs_hadoop_plugin_network-2.8.jar: OFS plugin.

Self-built environment such as Hadoop or CDH
Software dependency: Hadoop 2.6.0 or later and Hadoop-COS 8.1.5 or later are required. The cos_api-bundle plugin version must match the Hadoop-COS version as described in Releases.
Installation notes: install the following plugins in the Hadoop environment.
cos-distcp-1.12-3.1.0.jar: COSDistCp package, used to copy data to COSN. For the download address, see COSDistCp.
chdfs_hadoop_plugin_network-2.8.jar: OFS plugin.
Hadoop-COS 8.1.5 or later: for the download address, see Hadoop.
cos_api-bundle: the version must match the Hadoop-COS version.

Note:
Hadoop-COS supports access to metadata acceleration buckets in the format of cosn://bucketname-appid/ starting from v8.1.5.
The metadata acceleration feature can only be enabled during bucket creation and cannot be disabled once enabled. Therefore, carefully consider whether to enable it based on your business conditions. You should also note that legacy Hadoop-COS packages cannot access metadata acceleration buckets.
2. Create a metadata acceleration bucket and configure the HDFS protocol for it as instructed in "Creating Bucket and Configuring the HDFS Protocol" in Using HDFS to Access Metadata Acceleration-Enabled Bucket.
3. Modify the migration cluster's core-site.xml and distribute the configuration to all nodes. If only data needs to be migrated, you don't need to restart the big data component.
All of the following keys go in core-site.xml and are required:
fs.cosn.trsf.fs.ofs.impl: com.qcloud.chdfs.fs.CHDFSHadoopFileSystemAdapter. The COSN implementation class.
fs.cosn.trsf.fs.AbstractFileSystem.ofs.impl: com.qcloud.chdfs.fs.CHDFSDelegateFSAdapter. The COSN implementation class.
fs.cosn.trsf.fs.ofs.tmp.cache.dir: A temporary directory, in the format of `/data/emr/hdfs/tmp/`. It will be created on all cluster nodes, so make sure there are sufficient space and permissions.
fs.cosn.trsf.fs.ofs.user.appid: The `appid` of your COS bucket.
fs.cosn.trsf.fs.ofs.ranger.enable.flag: false. Check that the value is `false`.
fs.cosn.trsf.fs.ofs.bucket.region: The bucket region. Valid values include eu-frankfurt (Frankfurt), ap-chengdu (Chengdu), and ap-singapore (Singapore).
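Put together, the keys above can be sketched in core-site.xml as follows. Note that the appid, region, and temporary directory values below are placeholders for illustration; substitute your own.

```xml
<!-- Sketch of the core-site.xml entries described above.
     The appid, region, and tmp dir values are placeholders. -->
<property>
  <name>fs.cosn.trsf.fs.ofs.impl</name>
  <value>com.qcloud.chdfs.fs.CHDFSHadoopFileSystemAdapter</value>
</property>
<property>
  <name>fs.cosn.trsf.fs.AbstractFileSystem.ofs.impl</name>
  <value>com.qcloud.chdfs.fs.CHDFSDelegateFSAdapter</value>
</property>
<property>
  <name>fs.cosn.trsf.fs.ofs.tmp.cache.dir</name>
  <value>/data/emr/hdfs/tmp/</value>
</property>
<property>
  <name>fs.cosn.trsf.fs.ofs.user.appid</name>
  <value>1250000000</value>
</property>
<property>
  <name>fs.cosn.trsf.fs.ofs.ranger.enable.flag</name>
  <value>false</value>
</property>
<property>
  <name>fs.cosn.trsf.fs.ofs.bucket.region</name>
  <value>ap-singapore</value>
</property>
```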
4. You can verify the migration by accessing the metadata acceleration bucket over the private network as instructed in "Configuring Computing Cluster to Access COS" in Using HDFS to Access Metadata Acceleration-Enabled Bucket. Use the migration cluster submitter to verify whether COS can be accessed successfully.

Existing Data Migration

1. Determine the directories to be migrated

Generally, the data stored in HDFS is migrated first. Select the directory to be migrated in the source HDFS cluster; the target path in COS should mirror the source path.
Suppose you need to migrate the HDFS directory hdfs:///data/user/target to cosn://{bucketname-appid}/data/user/target.
To ensure that the files in the source directory remain unchanged during migration, use the HDFS snapshot feature to create a snapshot of the source directory, named after the current date.
hdfs dfsadmin -disallowSnapshot hdfs:///data/user/
hdfs dfsadmin -allowSnapshot hdfs:///data/user/target
hdfs dfs -deleteSnapshot hdfs:///data/user/target {current date}
hdfs dfs -createSnapshot hdfs:///data/user/target {current date}
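As a sketch, the {current date} placeholder above can be filled in from the shell. The YYYYMMDD format is an assumption; any unique snapshot name works.

```shell
# Assumption: the snapshot is named after today's date in YYYYMMDD format
SNAPSHOT_NAME=$(date +%Y%m%d)
SRC_DIR=hdfs:///data/user/target

# Print the snapshot command that would be run against the source directory
echo "hdfs dfs -createSnapshot $SRC_DIR $SNAPSHOT_NAME"
```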
If you don't want to create a snapshot, you can directly migrate the target files in the source directory.

2. Use COSDistCp for migration

Start a COSDistCp task to copy files from the source HDFS to the target COS bucket.
A COSDistCp task is essentially a MapReduce task. The printed MapReduce task log will show whether the task is executed successfully. If the task fails, you can view the YARN page and submit the log or exception information to the COS team for troubleshooting. You can use COSDistCp to execute a migration task in the following steps: (1) Create a temporary directory. (2) Run a COSDistCp task. (3) Migrate failed files again.

(1) Create a temporary directory

hadoop fs -libjars /data01/jars/chdfs_hadoop_plugin_network-2.8.jar -mkdir cosn://bucket-appid/distcp-tmp

(2) Run a COSDistCp task

nohup hadoop jar /data01/jars/cos-distcp-1.12-3.1.0.jar -libjars /data01/jars/chdfs_hadoop_plugin_network-2.8.jar --src=hdfs:///data/user/target/.snapshot/{current date} --dest=cosn://{bucket-appid}/data/user/target --temp=cosn://{bucket-appid}/distcp-tmp/ --preserveStatus=ugpt --skipMode=length-checksum --checkMode=length-checksum --cosChecksumType=CRC32C --taskNumber 6 --workerNumber 32 --bandWidth 200 >> ./distcp.log &
The parameters are as detailed below. You can adjust their values as needed.
--taskNumber=VALUE: Number of copy processes. Example: --taskNumber=10.
--workerNumber=VALUE: Number of copy threads. COSDistCp creates a thread pool of this size for each copy process. Example: --workerNumber=4.
--bandWidth: Maximum bandwidth for reading each migrated file (in MB/s). Default value: -1, which indicates no limit on the read bandwidth. Example: --bandWidth=10.
--cosChecksumType=CRC32C: CRC32C is used by default, which requires the HDFS cluster to support the COMPOSITE_CRC checksum algorithm (available in Hadoop 3.1.1 and later); otherwise, change this parameter to --cosChecksumType=CRC64.
Note:
The formula for calculating the total bandwidth limit of COSDistCp migration is: taskNumber * workerNumber * bandWidth. You can set workerNumber to 1, use the taskNumber parameter to control the number of concurrent migrations, and use the bandWidth parameter to control the bandwidth of a single concurrent migration.
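The formula can be checked with a quick calculation, using the parameter values from the sample command above (taskNumber=6, workerNumber=32, bandWidth=200):

```shell
# Total bandwidth limit = taskNumber * workerNumber * bandWidth (in MB/s).
# Values taken from the sample COSDistCp command in this document.
TASK_NUMBER=6
WORKER_NUMBER=32
BAND_WIDTH=200
TOTAL_MBPS=$((TASK_NUMBER * WORKER_NUMBER * BAND_WIDTH))
echo "$TOTAL_MBPS"      # 38400

# Recommended pattern: workerNumber=1, control concurrency with taskNumber alone
echo $((10 * 1 * 10))   # 100
```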
When the copy operation ends, the task log outputs statistics on the copy, as shown below. FILES_FAILED indicates the number of failed files; if the FILES_FAILED counter is absent, all files were migrated successfully.
CosDistCp Counters
BYTES_EXPECTED=10198247
BYTES_SKIPPED=10196880
FILES_COPIED=1
FILES_EXPECTED=7
FILES_FAILED=1
FILES_SKIPPED=5

The specific statistics items in the output result are as detailed below:
The specific statistics items in the output are as follows:
BYTES_EXPECTED: Total size (in bytes) to copy according to the source directory.
FILES_EXPECTED: Number of files to copy according to the source directory, including the directory itself.
BYTES_SKIPPED: Total size (in bytes) of files that can be skipped (same length or checksum value).
FILES_SKIPPED: Number of source files that can be skipped (same length or checksum value).
FILES_COPIED: Number of source files that are successfully copied.
FILES_FAILED: Number of source files that failed to be copied.
FOLDERS_COPIED: Number of directories that are successfully copied.
FOLDERS_SKIPPED: Number of directories that are skipped.
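To check a run programmatically, the counters can be extracted from the task log. A minimal sketch, assuming counters are printed as NAME=value lines as in the sample output above:

```shell
# Sketch: extract the FILES_FAILED counter from a COSDistCp task log.
# The log text below is the sample counter output from this document.
LOG='CosDistCp Counters
BYTES_EXPECTED=10198247
BYTES_SKIPPED=10196880
FILES_COPIED=1
FILES_EXPECTED=7
FILES_FAILED=1
FILES_SKIPPED=5'

# Default to 0: a missing FILES_FAILED counter means all files migrated successfully
FILES_FAILED=$(printf '%s\n' "$LOG" | awk -F= '/^FILES_FAILED=/{print $2}')
FILES_FAILED=${FILES_FAILED:-0}
echo "$FILES_FAILED"   # 1, so this sample run needs a retry
```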

3. Migrate failed files again

Besides addressing most causes of inefficient file migration, COSDistCp allows you to use the --delete parameter to guarantee complete consistency between the HDFS and COS data.
When using the --delete parameter, you also need to specify the --deleteOutput=/xxx (custom) parameter, but not the --diffMode parameter.
nohup hadoop jar /data01/jars/cos-distcp-1.12-3.1.0.jar -libjars /data01/jars/chdfs_hadoop_plugin_network-2.8.jar --src=hdfs:///data/user/target/.snapshot/{current date} --dest=cosn://{bucket-appid}/data/user/target --temp=cosn://{bucket-appid}/distcp-tmp/ --preserveStatus=ugpt --skipMode=length-checksum --checkMode=length-checksum --cosChecksumType=CRC32C --taskNumber 6 --workerNumber 32 --bandWidth 200 --delete --deleteOutput=/dele-xx >> ./distcp.log &
After execution, the data that differs between HDFS and COS is moved to the trash directory, and the list of moved files is generated in the /xxx/failed directory. You can run hadoop fs -rm URL or hadoop fs -rmr URL to delete the data in the trash directory.

Incremental Migration

If any incremental data needs to be migrated afterwards, you only need to repeat the steps of full migration until all data has been migrated.