Migrating HDFS Data to Metadata Acceleration-Enabled Bucket
Last updated: 2024-03-25 16:04:01

Overview

COS offers the metadata acceleration feature to provide high-performance file system capabilities. Metadata acceleration leverages the powerful metadata management feature of Cloud HDFS (CHDFS) at the underlying layer to allow using file system semantics for COS access. The system is designed to deliver a bandwidth of up to 100 GB/s, over 100,000 queries per second (QPS), and millisecond-level latency. Buckets with metadata acceleration enabled can be widely used in scenarios such as big data, high-performance computing, machine learning, and AI. For more information on metadata acceleration, see Metadata Acceleration Overview.
COS provides the Hadoop semantics through the metadata acceleration service. Therefore, you can use COSDistCp to easily implement two-way data migration between COS and other Hadoop file systems. This document describes how to use COSDistCp to migrate files in the local HDFS to a metadata acceleration bucket in COS.

Environment Preparations Before Migration

Migration tools

1. Download the JAR packages of the tools listed below and place them in a local directory on the node that runs the migration task, such as /data01/jars.
EMR environment
Installation notes: install the following plugins in the Hadoop environment.
cos-distcp-1.12-3.1.0.jar: COSDistCp package, used to copy data to COSN. For the download address, see COSDistCp.
chdfs_hadoop_plugin_network-2.8.jar: OFS plugin.

Self-built environment such as Hadoop or CDH
Software dependency: Hadoop 2.6.0 or later and Hadoop-COS 8.1.5 or later are required. The cos_api-bundle plugin version must match the Hadoop-COS version as described in Releases.
Installation notes: install the following plugins in the Hadoop environment.
cos-distcp-1.12-3.1.0.jar: COSDistCp package, used to copy data to COSN. For the download address, see COSDistCp.
chdfs_hadoop_plugin_network-2.8.jar: OFS plugin.
Hadoop-COS 8.1.5 or later: for the download address, see Hadoop.
cos_api-bundle: the version must match the Hadoop-COS version.

Note:
Hadoop-COS supports access to metadata acceleration buckets in the format of cosn://bucketname-appid/ starting from v8.1.5.
The metadata acceleration feature can only be enabled during bucket creation and cannot be disabled once enabled. Therefore, carefully consider whether to enable it based on your business conditions. You should also note that legacy Hadoop-COS packages cannot access metadata acceleration buckets.
2. Create a metadata acceleration bucket and configure the HDFS protocol for it as instructed in "Creating Bucket and Configuring the HDFS Protocol" in Using HDFS to Access Metadata Acceleration-Enabled Bucket.
3. Modify the migration cluster's core-site.xml and distribute the configuration to all nodes. If only data needs to be migrated, you don't need to restart the big data component.
All of the following keys go in core-site.xml and are required:
fs.cosn.trsf.fs.ofs.impl: com.qcloud.chdfs.fs.CHDFSHadoopFileSystemAdapter. The COSN implementation class.
fs.cosn.trsf.fs.AbstractFileSystem.ofs.impl: com.qcloud.chdfs.fs.CHDFSDelegateFSAdapter. The COSN implementation class.
fs.cosn.trsf.fs.ofs.tmp.cache.dir: A temporary directory, in the format of `/data/emr/hdfs/tmp/`. It will be created on all cluster nodes, so make sure there are sufficient space and permissions.
fs.cosn.trsf.fs.ofs.user.appid: The `appid` of your COS bucket.
fs.cosn.trsf.fs.ofs.ranger.enable.flag: false. Check that the value is `false`.
fs.cosn.trsf.fs.ofs.bucket.region: The bucket region. Valid values include eu-frankfurt (Frankfurt), ap-chengdu (Chengdu), and ap-singapore (Singapore).
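Put together, the keys above can be sketched in core-site.xml as follows. Note that the appid, region, and temporary directory values below are placeholders for illustration; substitute your own.

```xml
<!-- Sketch of the core-site.xml entries described above.
     The appid, region, and tmp dir values are placeholders. -->
<property>
  <name>fs.cosn.trsf.fs.ofs.impl</name>
  <value>com.qcloud.chdfs.fs.CHDFSHadoopFileSystemAdapter</value>
</property>
<property>
  <name>fs.cosn.trsf.fs.AbstractFileSystem.ofs.impl</name>
  <value>com.qcloud.chdfs.fs.CHDFSDelegateFSAdapter</value>
</property>
<property>
  <name>fs.cosn.trsf.fs.ofs.tmp.cache.dir</name>
  <value>/data/emr/hdfs/tmp/</value>
</property>
<property>
  <name>fs.cosn.trsf.fs.ofs.user.appid</name>
  <value>1250000000</value>
</property>
<property>
  <name>fs.cosn.trsf.fs.ofs.ranger.enable.flag</name>
  <value>false</value>
</property>
<property>
  <name>fs.cosn.trsf.fs.ofs.bucket.region</name>
  <value>ap-singapore</value>
</property>
```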
4. You can verify the migration by accessing the metadata acceleration bucket over the private network as instructed in "Configuring Computing Cluster to Access COS" in Using HDFS to Access Metadata Acceleration-Enabled Bucket. Use the migration cluster submitter to verify whether COS can be accessed successfully.

Existing Data Migration

1. Determine the directories to be migrated

Generally, the data stored in HDFS is migrated first. Select the directory to be migrated in the source HDFS cluster; the target path in COS should mirror the source path.
Suppose you need to migrate the HDFS directory hdfs:///data/user/target to cosn://{bucketname-appid}/data/user/target.
To ensure that the files in the source directory remain unchanged during migration, use the HDFS snapshot feature to create a snapshot of the source directory, named after the current date.
hdfs dfsadmin -disallowSnapshot hdfs:///data/user/
hdfs dfsadmin -allowSnapshot hdfs:///data/user/target
hdfs dfs -deleteSnapshot hdfs:///data/user/target {current date}
hdfs dfs -createSnapshot hdfs:///data/user/target {current date}
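As a sketch, the {current date} placeholder above can be filled in from the shell. The YYYYMMDD format is an assumption; any unique snapshot name works.

```shell
# Assumption: the snapshot is named after today's date in YYYYMMDD format
SNAPSHOT_NAME=$(date +%Y%m%d)
SRC_DIR=hdfs:///data/user/target

# Print the snapshot command that would be run against the source directory
echo "hdfs dfs -createSnapshot $SRC_DIR $SNAPSHOT_NAME"
```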
If you don't want to create a snapshot, you can directly migrate the target files in the source directory.

2. Use COSDistCp for migration

Start a COSDistCp task to copy files from the source HDFS to the target COS bucket.
A COSDistCp task is essentially a MapReduce task. The printed MapReduce task log will show whether the task is executed successfully. If the task fails, you can view the YARN page and submit the log or exception information to the COS team for troubleshooting. You can use COSDistCp to execute a migration task in the following steps: (1) Create a temporary directory. (2) Run a COSDistCp task. (3) Migrate failed files again.

(1) Create a temporary directory

hadoop fs -libjars /data01/jars/chdfs_hadoop_plugin_network-2.8.jar -mkdir cosn://bucket-appid/distcp-tmp

(2) Run a COSDistCp task

nohup hadoop jar /data01/jars/cos-distcp-1.12-3.1.0.jar -libjars /data01/jars/chdfs_hadoop_plugin_network-2.8.jar --src=hdfs:///data/user/target/.snapshot/{current date} --dest=cosn://{bucket-appid}/data/user/target --temp=cosn://{bucket-appid}/distcp-tmp/ --preserveStatus=ugpt --skipMode=length-checksum --checkMode=length-checksum --cosChecksumType=CRC32C --taskNumber 6 --workerNumber 32 --bandWidth 200 >> ./distcp.log &
The parameters are as detailed below. You can adjust their values as needed.
--taskNumber=VALUE: Number of copy processes. Example: --taskNumber=10.
--workerNumber=VALUE: Number of copy threads. COSDistCp creates a thread pool of this size for each copy process. Example: --workerNumber=4.
--bandWidth: Maximum bandwidth for reading each migrated file (in MB/s). Default value: -1, which indicates no limit on the read bandwidth. Example: --bandWidth=10.
--cosChecksumType=CRC32C: CRC32C is used by default, which requires the HDFS cluster to support the COMPOSITE_CRC checksum algorithm (available in Hadoop 3.1.1 and later); otherwise, change this parameter to --cosChecksumType=CRC64.
Note:
The formula for calculating the total bandwidth limit of COSDistCp migration is: taskNumber * workerNumber * bandWidth. You can set workerNumber to 1, use the taskNumber parameter to control the number of concurrent migrations, and use the bandWidth parameter to control the bandwidth of a single concurrent migration.
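The formula can be checked with a quick calculation, using the parameter values from the sample command above (taskNumber=6, workerNumber=32, bandWidth=200):

```shell
# Total bandwidth limit = taskNumber * workerNumber * bandWidth (in MB/s).
# Values taken from the sample COSDistCp command in this document.
TASK_NUMBER=6
WORKER_NUMBER=32
BAND_WIDTH=200
TOTAL_MBPS=$((TASK_NUMBER * WORKER_NUMBER * BAND_WIDTH))
echo "$TOTAL_MBPS"      # 38400

# Recommended pattern: workerNumber=1, control concurrency with taskNumber alone
echo $((10 * 1 * 10))   # 100
```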
When the copy operation ends, the task log outputs statistics on the copy, as shown below. FILES_FAILED indicates the number of failed files; if the FILES_FAILED counter is absent, all files were migrated successfully.
CosDistCp Counters
BYTES_EXPECTED=10198247
BYTES_SKIPPED=10196880
FILES_COPIED=1
FILES_EXPECTED=7
FILES_FAILED=1
FILES_SKIPPED=5

The specific statistics items in the output result are as detailed below:
The specific statistics items in the output are as follows:
BYTES_EXPECTED: Total size (in bytes) to copy according to the source directory.
FILES_EXPECTED: Number of files to copy according to the source directory, including the directory itself.
BYTES_SKIPPED: Total size (in bytes) of files that can be skipped (same length or checksum value).
FILES_SKIPPED: Number of source files that can be skipped (same length or checksum value).
FILES_COPIED: Number of source files that are successfully copied.
FILES_FAILED: Number of source files that failed to be copied.
FOLDERS_COPIED: Number of directories that are successfully copied.
FOLDERS_SKIPPED: Number of directories that are skipped.
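To check a run programmatically, the counters can be extracted from the task log. A minimal sketch, assuming counters are printed as NAME=value lines as in the sample output above:

```shell
# Sketch: extract the FILES_FAILED counter from a COSDistCp task log.
# The log text below is the sample counter output from this document.
LOG='CosDistCp Counters
BYTES_EXPECTED=10198247
BYTES_SKIPPED=10196880
FILES_COPIED=1
FILES_EXPECTED=7
FILES_FAILED=1
FILES_SKIPPED=5'

# Default to 0: a missing FILES_FAILED counter means all files migrated successfully
FILES_FAILED=$(printf '%s\n' "$LOG" | awk -F= '/^FILES_FAILED=/{print $2}')
FILES_FAILED=${FILES_FAILED:-0}
echo "$FILES_FAILED"   # 1, so this sample run needs a retry
```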

3. Migrate failed files again

Besides addressing most causes of inefficient file migration, COSDistCp allows you to use the --delete parameter to guarantee complete consistency between the HDFS and COS data.
When using the --delete parameter, you also need to specify the --deleteOutput=/xxx (custom) parameter, but not the --diffMode parameter.
nohup hadoop jar /data01/jars/cos-distcp-1.12-3.1.0.jar -libjars /data01/jars/chdfs_hadoop_plugin_network-2.8.jar --src=hdfs:///data/user/target/.snapshot/{current date} --dest=cosn://{bucket-appid}/data/user/target --temp=cosn://{bucket-appid}/distcp-tmp/ --preserveStatus=ugpt --skipMode=length-checksum --checkMode=length-checksum --cosChecksumType=CRC32C --taskNumber 6 --workerNumber 32 --bandWidth 200 --delete --deleteOutput=/dele-xx >> ./distcp.log &
After execution, the data that differs between HDFS and COS is moved to the trash directory, and the list of moved files is generated in the /xxx/failed directory. You can run hadoop fs -rm URL or hadoop fs -rmr URL to delete the data in the trash directory.

Incremental Migration

If any incremental data needs to be migrated afterwards, you only need to repeat the steps of full migration until all data has been migrated.