Installation Instructions of TCCL on GPU Instances
Last updated: 2024-08-20 17:06:38

Introduction of TCCL

Tencent Collective Communication Library (TCCL) is a high-performance communication acceleration library customized for Tencent Cloud's StarPulse network architecture. It leverages the StarPulse network hardware to deliver more efficient network communication for large-scale AI model training, along with intelligent operations and maintenance capabilities for rapid detection and self-healing of network failures. TCCL is extended and optimized from the open source NCCL code and remains fully compatible with NCCL's features and usage. TCCL currently supports the following main features:
Dynamic aggregation optimization of dual network interfaces maximizes the performance of bonding devices.
Global Hash Routing enables load balancing to avoid congestion.
Topology affinity traffic scheduling minimizes traffic detours.

Overview

This document describes how to configure the TCCL communication acceleration library to improve multi-machine multi-GPU communication performance in the Tencent Cloud RDMA environment. In large-scale model training scenarios, TCCL is expected to improve bandwidth utilization by approximately 50% compared with the open source NCCL scheme.

Directions

Environment Preparations

1. Create GPU Hyper Computing Cluster PNV4sne or GPU Hyper Computing Cluster PNV4sn instances, which support 1.6 Tbps and 800 Gbps RDMA networks respectively.
2. Install the GPU driver and nvidia-fabricmanager service for GPU instances.
Note:
The TCCL runtime environment requires glibc 2.17 or later and CUDA 10.0 or later.
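These prerequisites can be checked quickly on each node. The following is a minimal sketch; nvcc may not be on PATH in every image, in which case check the CUDA version reported by nvidia-smi instead.
# Check the GPU driver and the nvidia-fabricmanager service.
nvidia-smi
systemctl status nvidia-fabricmanager

# Check glibc (needs 2.17 or later) and CUDA (needs 10.0 or later).
ldd --version | head -n 1
nvcc --version | grep release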

Selecting Installation Methods

TCCL currently supports three installation methods. Select the one that suits your business scenario:
TCCL communication library + compile and install pytorch
TCCL communication library + pytorch communication plugins
NCCL plugins + sorted IP list
Note:
Since most large-scale model training is based on the Pytorch framework, we will take Pytorch as an example.
The comparison of the three access schemes for TCCL is shown below:

Method 1: compile and install Pytorch
Usage steps: Install TCCL; recompile and install Pytorch.
Advantage: No intrusion into the business code.
Disadvantages: Requires recompiling and installing Pytorch; has requirements for the software environment.
Software dependencies: Corresponds to NCCL 2.12; requires glibc 2.17 or later and CUDA 10.0 or later.

Method 2: install Pytorch communication plugins
Usage steps: Install the Pytorch communication plugin; modify the distributed communication backend.
Advantage: Easy to install.
Disadvantages: Requires modifying the business code; has requirements for the software environment.
Software dependencies: The current installation package only supports Pytorch 1.12; requires glibc 2.17 or later and CUDA 10.0 or later.

(Recommended) Method 3: install NCCL plugins
Usage steps: Install the NCCL plugins; modify the startup script.
Advantage: Easy to install.
Disadvantage: Requires updating the sorted IP list after scaling out cluster nodes.
Software dependencies: Requires NCCL to be installed.
If your machine resources and model training scenarios are relatively fixed, it is recommended to use Method 3, which is compatible with different NCCL and CUDA versions, and easy to install and use without modifying business code or recompiling pytorch.
If your resources need to be provided to different business teams or frequently require scale-out, it is recommended to use the first two methods, which do not require algorithm engineers or the scheduling framework to be aware of the machines' network topology.
If you do not want to adapt the business code, you can use Method 1, which only requires recompiling the pytorch framework.

Configuring the TCCL Environment and Verifying

Method 1: compile and install Pytorch
Since the community Pytorch links the NCCL communication library statically by default, TCCL cannot be enabled simply by replacing the shared libraries; Pytorch needs to be recompiled against TCCL.
1. Install TCCL.
Taking Ubuntu 20.04 as an example, you can install TCCL with the following commands. After installation, TCCL will be located in the /opt/tencent/tccl directory.
# Uninstall the existing tccl versions and nccl plugins.
dpkg -r tccl && dpkg -r nccl-rdma-sharp-plugins

# Download and install tccl v1.5 version.
wget https://taco-1251783334.cos.ap-shanghai.myqcloud.com/tccl/TCCL_1.5-ubuntu.20.04.5_amd64.deb && dpkg -i TCCL_1.5-ubuntu.20.04.5_amd64.deb && rm -f TCCL_1.5-ubuntu.20.04.5_amd64.deb
If you use CentOS or TencentOS, see the following steps for installation:
# Uninstall the existing tccl versions and nccl plugins.
rpm -e tccl && rpm -e nccl-rdma-sharp-plugins-1.0-1.x86_64

# Download tccl v1.5 version.
wget https://taco-1251783334.cos.ap-shanghai.myqcloud.com/tccl/tccl-1.5-1.tl2.x86_64.rpm && rpm -ivh --nodeps --force tccl-1.5-1.tl2.x86_64.rpm && rm -f tccl-1.5-1.tl2.x86_64.rpm
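Either way, you can confirm that the installation landed in the expected location. This is an optional quick check, assuming the rpm package uses the same /opt/tencent/tccl prefix as the deb package.
# The TCCL headers and libraries should be present after installation.
ls /opt/tencent/tccl/include /opt/tencent/tccl/lib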
2. Recompile and install Pytorch.
The following is an example of Pytorch source code installation. See the Official Pytorch Installation Description for details.
#!/bin/bash

# Uninstall the current version.
pip uninstall -y torch

# Download pytorch source code.
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch

# <!Important> Configure the installation path of TCCL.
export USE_SYSTEM_NCCL=1
export NCCL_INCLUDE_DIR="/opt/tencent/tccl/include"
export NCCL_LIB_DIR="/opt/tencent/tccl/lib"

# See the official website for other compilation options.

# Install the development environment.
python setup.py develop
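As a quick sanity check that the build picked up the TCCL headers and libraries, you can print the NCCL version the rebuilt torch reports. This is a sketch; since TCCL corresponds to NCCL 2.12 (see the comparison above), a 2.12.x version is the expected output.
# Print the NCCL version the rebuilt torch was compiled against.
python -c "import torch; print(torch.cuda.nccl.version())"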
3. Configure TCCL environment variables.
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_GID_INDEX=3
export NCCL_IB_DISABLE=0
export NCCL_IB_HCA=mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3,mlx5_bond_4,mlx5_bond_5,mlx5_bond_6,mlx5_bond_7
export NCCL_NET_GDR_LEVEL=2
export NCCL_IB_QPS_PER_CONNECTION=4
export NCCL_IB_TC=160
export NCCL_IB_TIMEOUT=22
export NCCL_PXN_DISABLE=0
export TCCL_TOPO_AFFINITY=4
Note:
You need to enable network topology awareness through TCCL_TOPO_AFFINITY=4.
4. Verify Pytorch.
While running single-machine multi-GPU or multi-machine multi-GPU training with export NCCL_DEBUG=INFO set, initialization logs are printed that confirm TCCL has been loaded.
5. Verify nccl-tests.
Before running nccl-tests, you need to export the corresponding TCCL path:
export LD_LIBRARY_PATH=/opt/tencent/tccl/lib:$LD_LIBRARY_PATH
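For reference, a typical single-node nccl-tests build and run looks like the following. This is a sketch based on the open source NVIDIA/nccl-tests repository; CUDA_HOME and the GPU count are assumptions about your environment, and the LD_LIBRARY_PATH export above is assumed to be in effect.
# Build nccl-tests against the TCCL headers/libraries and run an all_reduce benchmark on 8 local GPUs.
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make CUDA_HOME=/usr/local/cuda NCCL_HOME=/opt/tencent/tccl
./build/all_reduce_perf -b 8 -e 1G -f 2 -g 8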
6. Supported software versions
At present, TCCL corresponds to NCCL version 2.12, requiring glibc version 2.17 or later and CUDA version 10.0 or later. For other supported CUDA versions, please contact your pre-sales manager for support.
Method 2: install Pytorch communication plugins
Pytorch supports integrating third-party communication backends through plugins, so you can use the TCCL communication backend without recompiling Pytorch. The API is fully compatible with NCCL. See Introduction to Existing Communication Backends in Pytorch for details.
1. Install Pytorch communication plugins.
# Uninstall the existing tccl and NCCL plugins.
dpkg -r tccl && dpkg -r nccl-rdma-sharp-plugins

# Uninstall torch_tccl.
pip uninstall -y torch-tccl

# Install torch_tccl version 0.0.2.
wget https://taco-1251783334.cos.ap-shanghai.myqcloud.com/tccl/torch_tccl-0.0.2_pt1.12-py3-none-any.whl && pip install torch_tccl-0.0.2_pt1.12-py3-none-any.whl && rm -f torch_tccl-0.0.2_pt1.12-py3-none-any.whl
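An optional quick check that the plugin package installed correctly:
# The import should succeed without errors.
python -c "import torch_tccl; print('torch_tccl imported')"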
2. Modify the business code.
import torch_tccl

# args.dist_backend = "nccl"
args.dist_backend = "tccl"
torch.distributed.init_process_group(
    backend=args.dist_backend,
    init_method=args.dist_url,
    world_size=args.world_size,
    rank=args.rank
)
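For an end-to-end check, a minimal all_reduce script along these lines can be launched with torchrun. This is a sketch; the file name verify_tccl.py, the GPU count, and the reliance on torchrun-provided environment variables are assumptions, not part of the official package.
# verify_tccl.py (hypothetical file name): minimal all_reduce over the tccl backend.
import os

import torch
import torch.distributed as dist
import torch_tccl  # registers the tccl backend

# torchrun sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE/LOCAL_RANK.
dist.init_process_group(backend="tccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

x = torch.ones(1, device="cuda")
dist.all_reduce(x)  # expected result: the world size on every rank
print(f"rank {dist.get_rank()}: {x.item()}")
dist.destroy_process_group()
Launched, for example, with torchrun --nnodes=1 --nproc_per_node=8 verify_tccl.py.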
3. Configure TCCL environment variables.
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_GID_INDEX=3
export NCCL_IB_DISABLE=0
export NCCL_IB_HCA=mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3,mlx5_bond_4,mlx5_bond_5,mlx5_bond_6,mlx5_bond_7
export NCCL_NET_GDR_LEVEL=2
export NCCL_IB_QPS_PER_CONNECTION=4
export NCCL_IB_TC=160
export NCCL_IB_TIMEOUT=22
export NCCL_PXN_DISABLE=0
export TCCL_TOPO_AFFINITY=4
Note:
You need to enable the network topology awareness feature through TCCL_TOPO_AFFINITY=4.
4. Verify Pytorch.
When executing distributed training, a log prompt showing that the tccl communication backend has been loaded indicates that the plugin is working correctly.
5. Software version limit
The current installation package only supports Pytorch 1.12. For other supported Pytorch and CUDA versions, please contact your pre-sales manager for support.
Note:
If running nccl-tests or other scenarios that require dynamically linked communication libraries, use Method 1 to install TCCL.

(Recommended) Method 3: install NCCL plugins
If you have already installed NCCL, you can also use TCCL's acceleration capability through the NCCL plugins.
1. Install NCCL plugins.
Taking Ubuntu 20.04 as an example, you can install the plugins with the following commands.
# Uninstall the existing tccl and nccl plugins.
dpkg -r tccl && dpkg -r nccl-rdma-sharp-plugins

# Download and install nccl 1.2 plugins.
wget https://taco-1251783334.cos.ap-shanghai.myqcloud.com/nccl/nccl-rdma-sharp-plugins_1.2_amd64.deb && dpkg -i nccl-rdma-sharp-plugins_1.2_amd64.deb

# Please ensure that the version of nccl plugins used within the cluster is consistent. The following are the download and installation commands for nccl 1.0 version. It is recommended to use the more stable nccl 1.2 version.
# wget https://taco-1251783334.cos.ap-shanghai.myqcloud.com/nccl/nccl-rdma-sharp-plugins_1.0_amd64.deb && dpkg -i nccl-rdma-sharp-plugins_1.0_amd64.deb && rm -f nccl-rdma-sharp-plugins_1.0_amd64.deb
If you use CentOS or TencentOS, see the following steps for installation:
# Uninstall the existing nccl plugins.
rpm -e nccl-rdma-sharp-plugins-1.0-1.x86_64

# Download and install nccl 1.2 plugins.
wget https://taco-1251783334.cos.ap-shanghai.myqcloud.com/nccl/nccl-rdma-sharp-plugins-1.2-1.x86_64.rpm && rpm -ivh --nodeps --force nccl-rdma-sharp-plugins-1.2-1.x86_64.rpm

# Ensure that the version of the nccl plugins used within the cluster is consistent. The following are the download and installation commands for the nccl 1.0 version. It is recommended to use the more stable nccl 1.2 version.
# wget https://taco-1251783334.cos.ap-shanghai.myqcloud.com/nccl/nccl-rdma-sharp-plugins-1.0-1.x86_64.rpm && rpm -ivh --nodeps --force nccl-rdma-sharp-plugins-1.0-1.x86_64.rpm && rm -f nccl-rdma-sharp-plugins-1.0-1.x86_64.rpm
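An optional check that the plugin package is installed (use the command matching your distribution):
# Ubuntu
dpkg -s nccl-rdma-sharp-plugins
# CentOS / TencentOS
rpm -q nccl-rdma-sharp-plugins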
2. Obtain the topologically sorted IP list.
The NCCL plugins provide two of the optimizations, dynamic aggregation of bonded network interfaces and global hash routing, without requiring any dependency files. If topology-affinity awareness is also needed, you can enable it through the sorted IP list.
IP sorting can be completed as follows:
Prepare the IP list file.
Each node's VPC IP address can be obtained via ifconfig eth0; list one node IP per line. The format is as follows:
root@VM-125-10-tencentos:/workspace# cat ip_eth0.txt
172.16.177.28
172.16.176.11
172.16.177.25
172.16.177.12
Execute the sorting script:
wget https://taco-1251783334.cos.ap-shanghai.myqcloud.com/tccl/get_rdma_order_by_ip.sh && bash get_rdma_order_by_ip.sh ip_eth0.txt
Note:
Ensure the curl tool is installed on all nodes (on Ubuntu, it can be installed via apt install curl).
Ensure the node executing the script has password-free SSH access to all other nodes.
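Both preconditions can be verified from the executing node with a loop like the following (a sketch; assumes the ip_eth0.txt file above):
# Each line should print the remote hostname followed by "ok"; a password prompt or error indicates a node to fix.
for ip in $(cat ip_eth0.txt); do
    ssh -o BatchMode=yes "$ip" 'command -v curl >/dev/null && echo "$(hostname) ok"'
done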
View the sorted IP list file.
root@VM-125-10-tencentos:/workspace# cat hostfile.txt
172.16.176.11
172.16.177.12
172.16.177.25
172.16.177.28
3. Configure TCCL environment variables.
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_GID_INDEX=3
export NCCL_IB_DISABLE=0
export NCCL_IB_HCA=mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3,mlx5_bond_4,mlx5_bond_5,mlx5_bond_6,mlx5_bond_7
export NCCL_NET_GDR_LEVEL=2
export NCCL_IB_QPS_PER_CONNECTION=4
export NCCL_IB_TC=160
export NCCL_IB_TIMEOUT=22
export NCCL_PXN_DISABLE=0

# Since the machine IPs are manually sorted, there is no need to set the following variable.
# export TCCL_TOPO_AFFINITY=4
4. Modify the startup script.
You need to modify the startup script when starting distributed training. For example, if you use the deepspeed launcher to start the training process, you need to obtain the sorted IP list, write the corresponding IP list into the hostfile, and then start the training process.
root@vm-3-17-centos:/workspace/ptm/gpt# cat hostfile
172.16.176.11 slots=8
172.16.177.12 slots=8
172.16.177.25 slots=8
172.16.177.28 slots=8

deepspeed --hostfile ./hostfile --master_addr 172.16.176.11 train.py
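If every node exposes the same number of GPUs, the sorted hostfile.txt generated earlier can be converted into this hostfile with a one-liner (a sketch; slots=8 assumes 8 GPUs per node):
awk '{print $1, "slots=8"}' hostfile.txt > hostfile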
If torchrun is used to start the training process, specify the corresponding node sequence through --node_rank:
# on 172.16.176.11
torchrun --nnodes=4 --nproc_per_node=8 --node_rank=0 --master_addr=172.16.176.11 train.py ...
# on 172.16.177.12
torchrun --nnodes=4 --nproc_per_node=8 --node_rank=1 --master_addr=172.16.176.11 train.py ...
# on 172.16.177.25
torchrun --nnodes=4 --nproc_per_node=8 --node_rank=2 --master_addr=172.16.176.11 train.py ...
# on 172.16.177.28
torchrun --nnodes=4 --nproc_per_node=8 --node_rank=3 --master_addr=172.16.176.11 train.py ...
If mpirun is used to start the training process, simply list the IP addresses in the sorted order.
mpirun \
-np 64 \
-H 172.16.176.11:8,172.16.177.12:8,172.16.177.25:8,172.16.177.28:8 \
--allow-run-as-root \
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO \
-x NCCL_IB_GID_INDEX=3 \
-x NCCL_IB_DISABLE=0 \
-x NCCL_SOCKET_IFNAME=eth0 \
-x NCCL_IB_HCA=mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3,mlx5_bond_4,mlx5_bond_5,mlx5_bond_6,mlx5_bond_7 \
-x NCCL_NET_GDR_LEVEL=2 \
-x NCCL_IB_QPS_PER_CONNECTION=4 \
-x NCCL_IB_TC=160 \
-x NCCL_IB_TIMEOUT=22 \
-x NCCL_PXN_DISABLE=0 \
-x LD_LIBRARY_PATH -x PATH \
-mca coll_hcoll_enable 0 \
-mca pml ob1 \
-mca btl_tcp_if_include eth0 \
-mca btl ^openib \
all_reduce_perf -b 1G -e 1G -n 1000 -g 1
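The all_reduce_perf binary above comes from the open source nccl-tests suite; for a multi-node mpirun launch it needs to be built with MPI support. The following is a sketch; MPI_HOME and CUDA_HOME are assumptions about the local installation paths.
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 MPI_HOME=/usr/local/openmpi CUDA_HOME=/usr/local/cuda NCCL_HOME=/opt/tencent/tccl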
