```shell
hadoop jar cos-distcp-${version}.jar \
-libjars cos_api-bundle-${version}.jar,hadoop-cos-${version}.jar \
-Dfs.cosn.credentials.provider=org.apache.hadoop.fs.auth.SimpleCredentialProvider \
-Dfs.cosn.userinfo.secretId=COS_SECRETID \
-Dfs.cosn.userinfo.secretKey=COS_SECRETKEY \
-Dfs.cosn.bucket.region=ap-guangzhou \
-Dfs.cosn.impl=org.apache.hadoop.fs.CosFileSystem \
-Dfs.AbstractFileSystem.cosn.impl=org.apache.hadoop.fs.CosN \
--src /data/warehouse \
--dest cosn://examplebucket-1250000000/warehouse
```
You can use the `--skipMode` or `--diffMode` parameter to compare the length or CRC checksum of files to implement data verification and incremental file migration.

The parameters supported by COSDistCp are as follows:

| Attribute Key | Description | Default Value | Required |
| --- | --- | --- | --- |
| --help | Outputs the parameters supported by COSDistCp.<br>Example: `--help` | None | No |
| --src=LOCATION | Location of the data to copy. This can be either an HDFS or COS location.<br>Example: `--src=hdfs://user/logs/` | None | Yes |
| --dest=LOCATION | Destination for the data. This can be either an HDFS or COS location.<br>Example: `--dest=cosn://examplebucket-1250000000/user/logs` | None | Yes |
| --srcPattern=PATTERN | A regular expression that filters files in the source location.<br>Example: `--srcPattern='.*\.log$'`<br>Note: enclose the parameter in single quotation marks (') so that asterisks (*) are not expanded by the shell. | None | No |
| --taskNumber=VALUE | Number of copy processes.<br>Example: `--taskNumber=10` | 10 | No |
| --workerNumber=VALUE | Number of copy threads. COSDistCp creates a thread pool of this size in each copy process.<br>Example: `--workerNumber=4` | 4 | No |
| --filesPerMapper=VALUE | Number of files input to each mapper.<br>Example: `--filesPerMapper=10000` | 500000 | No |
| --groupBy=PATTERN | A regular expression used to concatenate the text files it matches.<br>Example: `--groupBy='.*group-input/(\d+)-(\d+).*'` | None | No |
| --targetSize=VALUE | Size (in MB) of the files to create. This parameter is used together with `--groupBy`.<br>Example: `--targetSize=10` | None | No |
| --outputCodec=VALUE | Compression method of the output files. Valid values: gzip, lzop, snappy, none, and keep. Here:<br>1. keep keeps the compression method of the original file.<br>2. none decompresses the file based on its extension.<br>Example: `--outputCodec=gzip`<br>Note: if the files /dir/test.gzip and /dir/test.gz both exist and you specify the output format lzop, only /dir/test.lzo will be retained. | keep | No |
| --deleteOnSuccess | Deletes a source file immediately after it is successfully copied to the destination.<br>Example: `--deleteOnSuccess`<br>Note: v1.7 and later no longer provide this parameter. We recommend deleting the data in the source file system only after the migration has succeeded and been verified with `--diffMode`. | false | No |
| --multipartUploadChunkSize=VALUE | Size (in MB) of the multipart upload parts transferred to COS by the Hadoop-COS plugin. COS supports up to 10,000 parts, so set this value based on your file sizes.<br>Example: `--multipartUploadChunkSize=20` | 8 MB | No |
| --cosServerSideEncryption | Uses SSE-COS encryption on the COS server side.<br>Example: `--cosServerSideEncryption` | false | No |
| --outputManifest=VALUE | Creates a Gzip-compressed file that contains the list of all files copied to the destination.<br>Example: `--outputManifest=manifest.gz` | None | No |
| --requirePreviousManifest | If this parameter is set, `--previousManifest=VALUE` must be specified for incremental copy.<br>Example: `--requirePreviousManifest` | false | No |
| --previousManifest=LOCATION | Manifest file created during the previous copy operation.<br>Example: `--previousManifest=cosn://examplebucket-1250000000/big-data/manifest.gz` | None | No |
| --copyFromManifest | Copies the files listed in `--previousManifest` to the destination file system. This is used together with `--previousManifest=LOCATION`.<br>Example: `--copyFromManifest` | false | No |
| --storageClass=VALUE | Storage class to use. Valid values: STANDARD, STANDARD_IA, ARCHIVE, DEEP_ARCHIVE, and INTELLIGENT_TIERING. For more information, see Storage Class Overview. | None | No |
| --srcPrefixesFile=LOCATION | Local file that contains a list of source directories, one directory per line.<br>Example: `--srcPrefixesFile=file:///data/migrate-folders.txt` | None | No |
| --skipMode=MODE | Verifies whether the source and destination files are the same before the copy; identical files are skipped. Valid values: none (no verification), length, checksum, length-mtime, and length-checksum.<br>Example: `--skipMode=length` | length-checksum | No |
| --checkMode=MODE | Verifies whether the source and destination files are the same after the copy is completed. Valid values: none (no verification), length, checksum, length-mtime, and length-checksum.<br>Example: `--checkMode=length-checksum` | length-checksum | No |
| --diffMode=MODE | Rule for obtaining the list of files that differ between the source and destination directories. Valid values: length, checksum, length-mtime, and length-checksum.<br>Example: `--diffMode=length-checksum` | None | No |
| --diffOutput=LOCATION | HDFS output directory for diffMode. This directory must be empty.<br>Example: `--diffOutput=/diff-output` | None | No |
| --cosChecksumType=TYPE | CRC algorithm used by the Hadoop-COS plugin. Valid values: CRC32C and CRC64.<br>Example: `--cosChecksumType=CRC32C` | CRC32C | No |
| --preserveStatus=VALUE | Copies the user, group, permission, xattr, and timestamps metadata of the source file to the destination file. Valid values: any combination of the letters u, g, p, x, and t (the initials of user, group, permission, xattr, and timestamps, respectively).<br>Example: `--preserveStatus=ugpt` | None | No |
| --ignoreSrcMiss | Ignores files that exist in the manifest file but cannot be found during the copy. | false | No |
| --promGatewayAddress=VALUE | Prometheus PushGateway address and port to which the counter data of MapReduce jobs is pushed. | None | No |
| --promGatewayDeleteOnFinish=VALUE | Deletes the metrics of the specified JobName from Prometheus PushGateway when the job is completed.<br>Example: `--promGatewayDeleteOnFinish=true` | true | No |
| --promGatewayJobName=VALUE | JobName to report to Prometheus PushGateway.<br>Example: `--promGatewayJobName=cos-distcp-hive-backup` | None | No |
| --promCollectInterval=VALUE | Interval (in ms) at which MapReduce job counters are collected.<br>Example: `--promCollectInterval=5000` | 5000 | No |
| --promPort=VALUE | Server port on which Prometheus metrics are exposed.<br>Example: `--promPort=9028` | None | No |
| --enableDynamicStrategy | Enables the dynamic task assignment policy so that faster tasks migrate more files.<br>Note: this mode has certain limits; for example, the task counters may be inaccurate if a process exits abnormally, so use `--diffMode` to verify the data after migration.<br>Example: `--enableDynamicStrategy` | false | No |
| --splitRatio=VALUE | Split ratio of the dynamic strategy. A higher splitRatio means a smaller job granularity.<br>Example: `--splitRatio=8` | 8 | No |
| --localTemp=VALUE | Local folder that stores the job files generated by the dynamic strategy.<br>Example: `--localTemp=/tmp` | /tmp | No |
| --taskFilesCopyThreadNum=VALUE | Number of threads that concurrently copy the job files generated by the dynamic strategy to HDFS.<br>Example: `--taskFilesCopyThreadNum=32` | 32 | No |
| --statsRange=VALUE | Statistics ranges for the file size distribution.<br>Example: `--statsRange=0,1mb,10mb,100mb,1gb,10gb,inf` | 0,1mb,10mb,100mb,1gb,10gb,inf | No |
| --printStatsOnly | Collects only statistics on the file size distribution without copying the data.<br>Example: `--printStatsOnly` | None | No |
| --bandWidth=VALUE | Maximum bandwidth (in MB/s) for reading each migrated file. The default value -1 means the read bandwidth is not limited.<br>Example: `--bandWidth=10` | -1 | No |
| --jobName=VALUE | Migration task name.<br>Example: `--jobName=cosdistcp-to-warehouse` | None | No |
| --compareWithCompatibleSuffix | Treats the source file extension gzip as gz, and lzop as lzo, when the `--skipMode` and `--diffMode` parameters are used.<br>Example: `--compareWithCompatibleSuffix` | None | No |
| --delete | Moves files that exist in the source directory but not in the destination directory to a separate trash directory and generates a list of them, in order to ensure consistency between the source and destination directories.<br>Note: this parameter cannot be used together with `--diffMode`. | None | No |
| --deleteOutput=LOCATION | HDFS output directory for the delete operation. This directory must be empty.<br>Example: `--deleteOutput=/delete-output` | None | No |
Run the following command with the `--help` parameter to view the parameters supported by COSDistCp:

```shell
hadoop jar cos-distcp-${version}.jar --help
```

Here, `${version}` is the COSDistCp version number. For example, the JAR package of COSDistCp v1.0 is named cos-distcp-1.0.jar.

Use the `--printStatsOnly` and `--statsRange=VALUE` parameters to output the file size distribution of the files to copy:

```shell
hadoop jar cos-distcp-${version}.jar --src /wookie/data --dest cosn://examplebucket-1250000000/wookie/data --printStatsOnly --statsRange=0,1mb,10mb,100mb,1gb,10gb,inf
```

Sample output:

```
Copy File Distribution Statistics:
Total File Count: 4
Total File Size: 1190133760
| SizeRange         | TotalCount | TotalSize          |
| 0MB ~ 1MB         | 0(0.00%)   | 0(0.00%)           |
| 1MB ~ 10MB        | 1(25.00%)  | 1048576(0.09%)     |
| 10MB ~ 100MB      | 1(25.00%)  | 10485760(0.88%)    |
| 100MB ~ 1024MB    | 1(25.00%)  | 104857600(8.81%)   |
| 1024MB ~ 10240MB  | 1(25.00%)  | 1073741824(90.22%) |
| 10240MB ~ LONG_MAX| 0(0.00%)   | 0(0.00%)           |
```
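The bucketing in the sample statistics can be reproduced locally. The sketch below (plain awk, not part of COSDistCp) sorts the four sample file sizes into the same `--statsRange` boundaries:

```shell
# Bucket sample file sizes (bytes) into the 0,1mb,10mb,100mb,1gb,10gb,inf ranges.
for s in 1048576 10485760 104857600 1073741824; do
  awk -v s="$s" 'BEGIN {
    n = split("1048576 10485760 104857600 1073741824 10737418240", b, " ")
    split("0MB~1MB 1MB~10MB 10MB~100MB 100MB~1024MB 1024MB~10240MB 10240MB~inf", lab, " ")
    i = 1
    while (i <= n && s+0 >= b[i]+0) i++   # find the first boundary the size is below
    print s, lab[i]
  }'
done
```

Each size is printed with its range, matching the sample distribution above (a range includes its lower bound and excludes its upper bound).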
Copy files by specifying the `--src` and `--dest` parameters:

```shell
hadoop jar cos-distcp-${version}.jar --src /data/warehouse --dest cosn://examplebucket-1250000000/data/warehouse
```
COSDistCp records the information of files that fail to be migrated in the `/tmp/${randomUUID}/output/failed/` directory, where `${randomUUID}` is a random string. After recording the failed file information, COSDistCp continues to migrate the remaining files, so the migration task does not fail just because some files fail to migrate. When the migration task is completed, COSDistCp outputs counter information (make sure the machine that submits the task is configured with INFO-level log output for MapReduce jobs) and checks whether any files failed to migrate; if so, it throws an exception on the client that submitted the task.

In the following command, `application_1610615435237_0021` is the application ID:

```shell
yarn logs -applicationId application_1610615435237_0021 > application_1610615435237_0021.log
```

Sample counter output:

```
CosDistCp Counters
BYTES_EXPECTED=10198247
BYTES_SKIPPED=10196880
FILES_COPIED=1
FILES_EXPECTED=7
FILES_FAILED=1
FILES_SKIPPED=5
```
| Statistics Item | Description |
| --- | --- |
| BYTES_EXPECTED | Total size (in bytes) to copy according to the source directory |
| FILES_EXPECTED | Number of files to copy according to the source directory, including the directory itself |
| BYTES_SKIPPED | Total size (in bytes) of the files that can be skipped (same length or checksum value) |
| FILES_SKIPPED | Number of source files that can be skipped (same length or checksum value) |
| FILES_COPIED | Number of source files that are successfully copied |
| FILES_FAILED | Number of source files that failed to be copied |
| FOLDERS_COPIED | Number of directories that are successfully copied |
| FOLDERS_SKIPPED | Number of directories that are skipped |
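In automated pipelines you may want to act on these counters yourself. A minimal sketch (the log file name is a placeholder), assuming the counter lines shown above appear in the pulled application log:

```shell
# Create a sample log with the counter format shown above.
printf 'CosDistCp Counters\nFILES_COPIED=1\nFILES_FAILED=1\nFILES_SKIPPED=5\n' > application.log

# Extract FILES_FAILED and flag incomplete runs.
failed=$(grep -o 'FILES_FAILED=[0-9]*' application.log | cut -d= -f2)
if [ "${failed:-0}" -gt 0 ]; then
  echo "migration incomplete: $failed file(s) failed"
fi
```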
Set the copy concurrency with the `--taskNumber` and `--workerNumber` parameters. COSDistCp adopts a multi-process, multi-thread framework for the copy operation. You can use:
- `--taskNumber` to specify the number of copy processes.
- `--workerNumber` to specify the number of threads in each copy process.

```shell
hadoop jar cos-distcp-${version}.jar --src /data/warehouse/ --dest cosn://examplebucket-1250000000/data/warehouse --taskNumber=10 --workerNumber=5
```
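The two parameters multiply: the command above runs 10 processes with 5 threads each. A quick way to reason about total concurrency (the sizing is illustrative; tune it to your cluster and bucket limits):

```shell
# Effective copy concurrency = copy processes x threads per process.
taskNumber=10
workerNumber=5
echo "$((taskNumber * workerNumber)) concurrent copy workers"
```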
Use the `--skipMode` parameter to skip copying source files whose length and checksum are the same as those of the destination files. The default value is `length-checksum`:

```shell
hadoop jar cos-distcp-${version}.jar --src /data/warehouse --dest cosn://examplebucket-1250000000/data/warehouse --skipMode=length-checksum
```

`--skipMode` verifies whether the source and destination files are the same before the copy; identical files are skipped. Valid values are `none` (no verification), `length`, `checksum`, and `length-checksum` (length + CRC checksum). You can query the CRC checksum of a file as follows:

```shell
hadoop fs -Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum /data/test.txt
```

Sample output:

```
/data/test.txt COMPOSITE-CRC32C 6a732798
```
Obtain the list of files that differ between the source and destination directories with the `--diffMode` and `--diffOutput` parameters:
- `--diffMode` can be set to `length` or `length-checksum`.
  - `--diffMode=length` obtains the list of differing files based on whether the file sizes are the same.
  - `--diffMode=length-checksum` obtains the list of differing files based on whether the file size and CRC checksum are the same.
- `--diffOutput` specifies the output directory for the diff operation.

If the destination file system is COS and the CRC algorithm of the source file system is different from that of COS, COSDistCp pulls the source file, computes the CRC checksum used by the destination file system, and compares the two checksums. In the following example, the `--diffMode` parameter checks whether the source and destination files have the same size and CRC checksum after migration:

```shell
hadoop jar cos-distcp-${version}.jar --src /data/warehouse --dest cosn://examplebucket-1250000000/data/warehouse/ --diffMode=length-checksum --diffOutput=/tmp/diff-output
```
If files differ, or the diff operation fails due to insufficient permissions or other reasons, the records are written to the `/tmp/diff-output/failed` directory in HDFS (or `/tmp/diff-output` for v1.0.5 and earlier versions). You can run the following commands to obtain the list of differing files, excluding those recorded as SRC_MISS:

```shell
hadoop fs -getmerge /tmp/diff-output/failed diff-manifest
grep -v '"comment":"SRC_MISS"' diff-manifest | gzip > diff-manifest.gz
```

Then copy the differing files by passing the generated manifest:

```shell
hadoop jar cos-distcp-${version}.jar --taskNumber=20 --src /data/warehouse --dest cosn://examplebucket-1250000000/data/warehouse/ --previousManifest=file:///usr/local/service/hadoop/diff-manifest.gz --copyFromManifest
```
After the copy, you can run the command with the `--diffMode` parameter again to check whether the files are completely identical.

Use the `--checkMode` parameter to check whether the source and destination files have the same length and checksum after the copy is completed. The default value is `length-checksum`:

```shell
hadoop jar cos-distcp-${version}.jar --src /data/warehouse --dest cosn://examplebucket-1250000000/data/warehouse --checkMode=length-checksum
```
This applies only when `--groupBy` is not specified and `--outputCodec` is left at its default value.

To limit the read bandwidth of each migrated file, use the `--bandWidth` parameter (in MB/s). The following command restricts the read bandwidth of each copied file to 10 MB/s:

```shell
hadoop jar cos-distcp-${version}.jar --src /data/warehouse --dest cosn://examplebucket-1250000000/data/warehouse --bandWidth=10
```
Prepare a file that lists the source directories, one directory per line, and view it with the `cat` command:

```shell
cat srcPrefixes.txt
/data/warehouse/20181121/
/data/warehouse/20181122/
```

Use `--srcPrefixesFile` to specify this file. The command is as follows:

```shell
hadoop jar cos-distcp-${version}.jar --src /data/warehouse --srcPrefixesFile file:///usr/local/service/hadoop/srcPrefixes.txt --dest cosn://examplebucket-1250000000/data/warehouse/ --taskNumber=20
```
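For date-partitioned directories like the ones above, the prefix file can be generated rather than written by hand. A small sketch (the directory layout is from the example; the loop is ours):

```shell
# Generate one line per daily partition for the given dates.
for d in 20181121 20181122; do
  echo "/data/warehouse/$d/"
done > srcPrefixes.txt
cat srcPrefixes.txt
```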
Filter the files to migrate with the `--srcPattern` parameter. The following command copies only files whose extension is ".log" in the `/data/warehouse/` directory:

```shell
hadoop jar cos-distcp-${version}.jar --src /data/warehouse/ --dest cosn://examplebucket-1250000000/data/warehouse --srcPattern='.*\.log$'
```

The following command copies only files whose names do not end with ".temp" or ".tmp":

```shell
hadoop jar cos-distcp-${version}.jar --src /data/warehouse/ --dest cosn://examplebucket-1250000000/data/warehouse/ --srcPattern='.*(?<!\.temp|\.tmp)$'
```
Specify the CRC algorithm used by the Hadoop-COS plugin with the `--cosChecksumType` parameter. Valid values are `CRC32C` (default) and `CRC64`:

```shell
hadoop jar cos-distcp-${version}.jar --src /data/warehouse --dest cosn://examplebucket-1250000000/data/warehouse --cosChecksumType=CRC32C
```
Specify the storage class of the destination objects with the `--storageClass` parameter:

```shell
hadoop jar cos-distcp-${version}.jar --src /data/warehouse --dest cosn://examplebucket-1250000000/data/warehouse/ --outputManifest=manifest-2020-01-10.gz --storageClass=STANDARD_IA
```
Specify the compression method of the output files with the `--outputCodec` parameter, which allows you to compress HDFS data to COS in real time to reduce storage costs. Valid values are `keep`, `none`, `gzip`, `lzop`, and `snappy`. If the parameter is set to `none`, the files are copied uncompressed; if it is set to `keep`, the files are copied with their compression unchanged. The following is an example:

```shell
hadoop jar cos-distcp-${version}.jar --src /data/warehouse/logs --dest cosn://examplebucket-1250000000/data/warehouse/logs-gzip --outputCodec=gzip
```
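The note below about compression parameters can be seen locally: two gzip streams of the same data may differ byte-for-byte yet decompress to identical content. A standalone illustration with plain gzip (unrelated to COSDistCp itself):

```shell
# Same input, two compression levels: the .gz bytes differ, the content doesn't.
seq 1 5000 > original.txt
gzip -1 -c original.txt > fast.gz
gzip -9 -c original.txt > best.gz
cmp -s fast.gz best.gz || echo "compressed bytes differ"
gunzip -c best.gz | cmp -s - original.txt && echo "decompressed content identical"
```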
If the parameter is set to a value other than `keep`, the files are decompressed and converted to the target compression format. Because of differences in compression parameters, the content of a destination file might differ from that of the source file, but the two will be identical after decompression. If `--groupBy` is not specified and `--outputCodec` is left at its default value, you can use `--skipMode` for incremental migration and `--checkMode` for data verification.

Delete source files after a successful copy with the `--deleteOnSuccess` parameter. The following example deletes the corresponding source files in the `/data/warehouse` directory immediately after they are copied from HDFS to COS:

```shell
hadoop jar cos-distcp-${version}.jar --src /data/warehouse --dest cosn://examplebucket-1250000000/data/warehouse --deleteOnSuccess
```
When `--deleteOnSuccess` is specified, each source file is deleted immediately after it is copied, not after all source files have been copied. The parameter is not provided in v1.7 or later.

Perform incremental copies with the `--outputManifest` and `--previousManifest` parameters:
- `--outputManifest` generates a local Gzip-compressed manifest file (for example, `manifest.gz`). When the copy operation succeeds, the file is moved to the directory specified by `--dest`.
- `--previousManifest` specifies the manifest generated (with `--outputManifest`) by the previous copy operation. COSDistCp skips files of the same size:

```shell
hadoop jar cos-distcp-${version}.jar --src /data/warehouse --dest cosn://examplebucket-1250000000/data/warehouse/ --outputManifest=manifest.gz --previousManifest=cosn://examplebucket-1250000000/data/warehouse/manifest-2020-01-10.gz
```
To also detect files whose size is unchanged but whose content differs, use `--diffMode` and determine the changed files based on the CRC checksum.

Use `--enableDynamicStrategy` to enable the dynamic task assignment strategy, which lets faster jobs copy more files and speeds up the overall copy:

```shell
hadoop jar cos-distcp-${version}.jar --src /data/warehouse --dest cosn://examplebucket-1250000000/data/warehouse --enableDynamicStrategy
```

After the task completes, verify the data:

```shell
hadoop jar cos-distcp-${version}.jar --src /data/warehouse --dest cosn://examplebucket-1250000000/data/warehouse/ --diffMode=length-checksum --diffOutput=/tmp/diff-output
```
Note: the dynamic strategy has certain limits; for example, its task counters may be inaccurate if a process exits abnormally. Use `--diffMode` to verify the data after migration.

Use the `--preserveStatus` parameter to copy the `user`, `group`, `permission`, and `timestamps` (modification time and access time) metadata of source files and directories to the destination. This parameter takes effect when files are copied from HDFS to CHDFS. Sample:

```shell
hadoop jar cos-distcp-${version}.jar --src /data/warehouse --dest cosn://examplebucket-1250000000/data/warehouse/ --preserveStatus=ugpt
```
Edit `prometheus.yml` to add the job to scrape:

```yaml
- job_name: 'cos-distcp-hive-backup'
  static_configs:
    - targets: ['172.16.16.139:9028']
```

Use the `--promPort=VALUE` parameter to expose the counters of the current MapReduce job:

```shell
hadoop jar cos-distcp-${version}.jar --src /data/warehouse --dest cosn://examplebucket-1250000000/data/warehouse --promPort=9028
```
Use the `--completionCallbackClass` parameter to specify the path of a callback class. When the task is completed, COSDistCp executes the callback with the collected task information as its parameters. A user-defined callback must implement the following interface; you can download the callback sample code:

```java
package com.qcloud.cos.distcp;

import java.util.Map;

public interface TaskCompletionCallback {
  /**
   * @description: when the task is completed, this callback function is executed
   * @param jobType Copy or Diff
   * @param jobStartTime the job start time
   * @param errorMsg the exception error message
   * @param applicationId the MapReduce application ID
   * @param cosDistCpCounters the job counters
   */
  void doTaskCompletionCallback(String jobType, long jobStartTime, String errorMsg, String applicationId, Map<String, Long> cosDistCpCounters);

  /**
   * @description: init callback config before execution
   */
  void init() throws Exception;
}
```
```shell
export alarmSecretId=SECRET-ID
export alarmSecretKey=SECRET-KEY
export alarmRegion=ap-guangzhou
export alarmModule=module
export alarmPolicyId=cm-xxx

hadoop jar cos-distcp-1.4-2.8.5.jar \
-Dfs.cosn.credentials.provider=org.apache.hadoop.fs.auth.SimpleCredentialProvider \
-Dfs.cosn.userinfo.secretId=SECRET-ID \
-Dfs.cosn.userinfo.secretKey=SECRET-KEY \
-Dfs.cosn.bucket.region=ap-guangzhou \
-Dfs.cosn.impl=org.apache.hadoop.fs.CosFileSystem \
-Dfs.AbstractFileSystem.cosn.impl=org.apache.hadoop.fs.CosN \
--src /data/warehouse \
--dest cosn://examplebucket-1250000000/data/warehouse/ \
--checkMode=checksum \
--completionCallbackClass=com.qcloud.cos.distcp.DefaultTaskCompletionCallback
```
`alarmPolicyId` in the command above is an alarm policy created in Cloud Monitor. You can go to the Cloud Monitor console (Alarm Management > Alarm Configuration > Custom Messages) to create and configure one.

Run the migration command:

```shell
hadoop jar cos-distcp-${version}.jar --src /data/warehouse --dest cosn://examplebucket-1250000000/data/warehouse --taskNumber=20
```

After the migration is completed, verify the data with `--diffMode`:

```shell
hadoop jar cos-distcp-${version}.jar --src /data/warehouse --dest cosn://examplebucket-1250000000/data/warehouse/ --diffMode=length-checksum --diffOutput=/tmp/diff-output
```
```shell
hadoop jar cos-distcp-${version}.jar \
-Dfs.cosn.credentials.provider=org.apache.hadoop.fs.auth.SimpleCredentialProvider \
-Dfs.cosn.userinfo.secretId=COS_SECRETID \
-Dfs.cosn.userinfo.secretKey=COS_SECRETKEY \
-Dfs.cosn.bucket.region=ap-guangzhou \
-Dfs.cosn.impl=org.apache.hadoop.fs.CosFileSystem \
-Dfs.AbstractFileSystem.cosn.impl=org.apache.hadoop.fs.CosN \
--src /data/warehouse \
--dest cosn://examplebucket-1250000000/warehouse
```
COSDistCp records the information of files that fail to be copied in the `/tmp/${randomUUID}/output/failed/` directory, where `${randomUUID}` is a random string. Run the following commands to collect the failure records, excluding those recorded as SRC_MISS:

```shell
hadoop fs -getmerge /tmp/${randomUUID}/output/failed/ failed-manifest
grep -v '"comment":"SRC_MISS"' failed-manifest | gzip > failed-manifest.gz
```

You can find the cause of each failure by checking the `/tmp/${randomUUID}/output/logs/` directory and by pulling the application logs. The following command pulls the logs of the YARN application, where `application_1610615435237_0021` is the application ID:

```shell
yarn logs -applicationId application_1610615435237_0021 > application_1610615435237_0021.log
```

The following command adjusts the MapReduce job configuration for large files:

```shell
hadoop jar cos-distcp-${version}.jar -Dmapreduce.task.timeout=18000 -Dmapreduce.reduce.memory.mb=8192 --src /data/warehouse --dest cosn://examplebucket-1250000000/data/warehouse
```
Here, `mapreduce.task.timeout` is raised to 18,000 seconds to avoid job timeouts when large files are copied, and `mapreduce.reduce.memory.mb` (the memory size of the Reduce process) is raised to 8 GB to avoid memory overflow.

To control the migration bandwidth precisely, set `workerNumber` to 1, use the `taskNumber` parameter to control the number of concurrent migrations, and use the `bandWidth` parameter to control the bandwidth of each single concurrent migration.
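Under this configuration the aggregate read rate is easy to predict. A rough estimate (the numbers are illustrative):

```shell
# With workerNumber=1, aggregate read bandwidth ~= taskNumber x bandWidth MB/s.
taskNumber=10
bandWidth=10   # MB/s cap per concurrent copy
echo "~$((taskNumber * bandWidth)) MB/s aggregate read limit"
```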