Installation Instructions of TCCL on GPU Instances
Last updated: 2024-08-20 17:06:38

Introduction of TCCL

Tencent Collective Communication Library (TCCL) is a high-performance communication acceleration library customized for Tencent Cloud's StarPulse network architecture. It leverages the StarPulse network hardware to deliver more efficient network communication for large-scale AI model training, along with intelligent operations and maintenance capabilities for rapid detection and self-healing of network failures. TCCL is extended and optimized from the open source NCCL code and remains fully compatible with NCCL's features and usage. TCCL currently supports the following main features:
Dynamic aggregation optimization of dual network interfaces maximizes the performance of bonding devices.
Global Hash Routing enables load balancing to avoid congestion.
Topology affinity traffic scheduling minimizes traffic detours.

Overview

This document describes how to configure the TCCL communication acceleration library to improve multi-machine multi-GPU communication performance in the Tencent Cloud RDMA environment. In large-scale model training scenarios, TCCL is expected to improve bandwidth utilization by approximately 50% compared with the open source NCCL scheme.

Directions

Environment Preparations

1. Create GPU Hyper Computing Cluster PNV4sne or GPU Hyper Computing Cluster PNV4sn instances, which support 1.6 Tbps and 800 Gbps RDMA networks respectively.
2. Install the GPU driver and nvidia-fabricmanager service for GPU instances.
Note:
The TCCL runtime environment requires glibc 2.17 or later and CUDA 10.0 or later.
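These prerequisites can be checked quickly on each node. The following is a minimal sketch; nvcc may not be on PATH in every image, in which case check the CUDA version reported by nvidia-smi instead.
# Check the GPU driver and the nvidia-fabricmanager service.
nvidia-smi
systemctl status nvidia-fabricmanager

# Check glibc (needs 2.17 or later) and CUDA (needs 10.0 or later).
ldd --version | head -n 1
nvcc --version | grep release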

Selecting Installation Methods

TCCL currently supports three installation methods. Select the one that suits your business scenario:
TCCL communication library + compile and install pytorch
TCCL communication library + pytorch communication plugins
NCCL plugins + sorted IP list
Note:
Since most large-scale model training is based on the Pytorch framework, we will take Pytorch as an example.
The comparison of the three access schemes for TCCL is shown below:

Method 1: compile and install Pytorch
Usage steps: Install TCCL; recompile and install Pytorch.
Advantage: No intrusion into the business code.
Disadvantages: Requires recompiling and installing Pytorch; has requirements for the software environment.
Software dependencies: Corresponds to NCCL 2.12; requires glibc 2.17 or later and CUDA 10.0 or later.

Method 2: install Pytorch communication plugins
Usage steps: Install the Pytorch communication plugin; modify the distributed communication backend.
Advantage: Easy to install.
Disadvantages: Requires modifying the business code; has requirements for the software environment.
Software dependencies: The current installation package only supports Pytorch 1.12; requires glibc 2.17 or later and CUDA 10.0 or later.

(Recommended) Method 3: install NCCL plugins
Usage steps: Install the NCCL plugins; modify the startup script.
Advantage: Easy to install.
Disadvantage: Requires updating the sorted IP list after scaling out cluster nodes.
Software dependencies: Requires NCCL to be installed.
If your machine resources and model training scenarios are relatively fixed, it is recommended to use Method 3, which is compatible with different NCCL and CUDA versions, and easy to install and use without modifying business code or recompiling pytorch.
If your resources need to be provided to different business teams or frequently require scale-out, it is recommended to use the first two methods, which do not require algorithm engineers or the scheduling framework to be aware of the machines' network topology.
If you do not want to adapt the business code, you can use Method 1, which only requires recompiling the pytorch framework.

Configuring the TCCL Environment and Verifying

Method 1: compile and install Pytorch
Since the community Pytorch links the NCCL communication library statically by default, TCCL cannot be enabled simply by replacing the shared libraries; Pytorch needs to be recompiled against TCCL.
1. Install TCCL.
Taking Ubuntu 20.04 as an example, you can install TCCL with the following commands. After installation, TCCL will be located in the /opt/tencent/tccl directory.
# Uninstall the existing tccl versions and nccl plugins.
dpkg -r tccl && dpkg -r nccl-rdma-sharp-plugins

# Download and install tccl v1.5 version.
wget https://taco-1251783334.cos.ap-shanghai.myqcloud.com/tccl/TCCL_1.5-ubuntu.20.04.5_amd64.deb && dpkg -i TCCL_1.5-ubuntu.20.04.5_amd64.deb && rm -f TCCL_1.5-ubuntu.20.04.5_amd64.deb
If you use CentOS or TencentOS, see the following steps for installation:
# Uninstall the existing tccl versions and nccl plugins.
rpm -e tccl && rpm -e nccl-rdma-sharp-plugins-1.0-1.x86_64

# Download tccl v1.5 version.
wget https://taco-1251783334.cos.ap-shanghai.myqcloud.com/tccl/tccl-1.5-1.tl2.x86_64.rpm && rpm -ivh --nodeps --force tccl-1.5-1.tl2.x86_64.rpm && rm -f tccl-1.5-1.tl2.x86_64.rpm
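Either way, you can confirm that the installation landed in the expected location. This is an optional quick check, assuming the rpm package uses the same /opt/tencent/tccl prefix as the deb package.
# The TCCL headers and libraries should be present after installation.
ls /opt/tencent/tccl/include /opt/tencent/tccl/lib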
2. Recompile and install Pytorch.
The following is an example of Pytorch source code installation. See the Official Pytorch Installation Description for details.
#!/bin/bash

# Uninstall the current version.
pip uninstall -y torch

# Download pytorch source code.
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch

# <!Important> Configure the installation path of TCCL.
export USE_SYSTEM_NCCL=1
export NCCL_INCLUDE_DIR="/opt/tencent/tccl/include"
export NCCL_LIB_DIR="/opt/tencent/tccl/lib"

# See the official website for other compilation options.

# Install the development environment.
python setup.py develop
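As a quick sanity check that the build picked up the TCCL headers and libraries, you can print the NCCL version the rebuilt torch reports. This is a sketch; since TCCL corresponds to NCCL 2.12 (see the comparison above), a 2.12.x version is the expected output.
# Print the NCCL version the rebuilt torch was compiled against.
python -c "import torch; print(torch.cuda.nccl.version())"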
3. Configure TCCL environment variables.
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_GID_INDEX=3
export NCCL_IB_DISABLE=0
export NCCL_IB_HCA=mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3,mlx5_bond_4,mlx5_bond_5,mlx5_bond_6,mlx5_bond_7
export NCCL_NET_GDR_LEVEL=2
export NCCL_IB_QPS_PER_CONNECTION=4
export NCCL_IB_TC=160
export NCCL_IB_TIMEOUT=22
export NCCL_PXN_DISABLE=0
export TCCL_TOPO_AFFINITY=4
Note:
You need to enable network topology awareness through TCCL_TOPO_AFFINITY=4.
4. Verify Pytorch.
While running single-machine multi-GPU or multi-machine multi-GPU training with export NCCL_DEBUG=INFO set, initialization logs are printed that confirm TCCL has been loaded.
5. Verify nccl-tests.
Before running nccl-tests, you need to export the corresponding TCCL path:
export LD_LIBRARY_PATH=/opt/tencent/tccl/lib:$LD_LIBRARY_PATH
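For reference, a typical single-node nccl-tests build and run looks like the following. This is a sketch based on the open source NVIDIA/nccl-tests repository; CUDA_HOME and the GPU count are assumptions about your environment, and the LD_LIBRARY_PATH export above is assumed to be in effect.
# Build nccl-tests against the TCCL headers/libraries and run an all_reduce benchmark on 8 local GPUs.
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make CUDA_HOME=/usr/local/cuda NCCL_HOME=/opt/tencent/tccl
./build/all_reduce_perf -b 8 -e 1G -f 2 -g 8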
6. Supported software versions
At present, TCCL corresponds to NCCL version 2.12, requiring glibc version 2.17 or later and CUDA version 10.0 or later. For other supported CUDA versions, please contact your pre-sales manager for support.
Method 2: install Pytorch communication plugins
Pytorch supports integrating third-party communication backends through plugins, so you can use the TCCL communication backend without recompiling Pytorch. The API is fully compatible with NCCL. See Introduction to Existing Communication Backends in Pytorch for details.
1. Install Pytorch communication plugins.
# Uninstall the existing tccl and NCCL plugins.
dpkg -r tccl && dpkg -r nccl-rdma-sharp-plugins

# Uninstall torch_tccl.
pip uninstall -y torch-tccl

# Install torch_tccl version 0.0.2.
wget https://taco-1251783334.cos.ap-shanghai.myqcloud.com/tccl/torch_tccl-0.0.2_pt1.12-py3-none-any.whl && pip install torch_tccl-0.0.2_pt1.12-py3-none-any.whl && rm -f torch_tccl-0.0.2_pt1.12-py3-none-any.whl
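An optional quick check that the plugin package installed correctly:
# The import should succeed without errors.
python -c "import torch_tccl; print('torch_tccl imported')"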
2. Modify the business code.
import torch_tccl

# args.dist_backend = "nccl"
args.dist_backend = "tccl"
torch.distributed.init_process_group(
    backend=args.dist_backend,
    init_method=args.dist_url,
    world_size=args.world_size,
    rank=args.rank
)
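For an end-to-end check, a minimal all_reduce script along these lines can be launched with torchrun. This is a sketch; the file name verify_tccl.py, the GPU count, and the reliance on torchrun-provided environment variables are assumptions, not part of the official package.
# verify_tccl.py (hypothetical file name): minimal all_reduce over the tccl backend.
import os

import torch
import torch.distributed as dist
import torch_tccl  # registers the tccl backend

# torchrun sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE/LOCAL_RANK.
dist.init_process_group(backend="tccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

x = torch.ones(1, device="cuda")
dist.all_reduce(x)  # expected result: the world size on every rank
print(f"rank {dist.get_rank()}: {x.item()}")
dist.destroy_process_group()
Launched, for example, with torchrun --nnodes=1 --nproc_per_node=8 verify_tccl.py.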
3. Configure TCCL environment variables.
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_GID_INDEX=3
export NCCL_IB_DISABLE=0
export NCCL_IB_HCA=mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3,mlx5_bond_4,mlx5_bond_5,mlx5_bond_6,mlx5_bond_7
export NCCL_NET_GDR_LEVEL=2
export NCCL_IB_QPS_PER_CONNECTION=4
export NCCL_IB_TC=160
export NCCL_IB_TIMEOUT=22
export NCCL_PXN_DISABLE=0
export TCCL_TOPO_AFFINITY=4
Note:
You need to enable the network topology awareness feature through TCCL_TOPO_AFFINITY=4.
4. Verify Pytorch.
When executing distributed training, a log prompt showing that the tccl communication backend has been loaded indicates that the plugin is working correctly.
5. Software version limit
The current installation package only supports Pytorch 1.12. For other supported Pytorch and CUDA versions, please contact your pre-sales manager for support.
Note:
If running nccl-tests or other scenarios that require dynamically linked communication libraries, use Method 1 to install TCCL.

(Recommended) Method 3: install NCCL plugins
If you have already installed NCCL, you can also use TCCL's acceleration capability through the NCCL plugins.
1. Install NCCL plugins.
Taking Ubuntu 20.04 as an example, you can install the plugins with the following commands.
# Uninstall the existing tccl and nccl plugins.
dpkg -r tccl && dpkg -r nccl-rdma-sharp-plugins

# Download and install nccl 1.2 plugins.
wget https://taco-1251783334.cos.ap-shanghai.myqcloud.com/nccl/nccl-rdma-sharp-plugins_1.2_amd64.deb && dpkg -i nccl-rdma-sharp-plugins_1.2_amd64.deb

# Please ensure that the version of nccl plugins used within the cluster is consistent. The following are the download and installation commands for nccl 1.0 version. It is recommended to use the more stable nccl 1.2 version.
# wget https://taco-1251783334.cos.ap-shanghai.myqcloud.com/nccl/nccl-rdma-sharp-plugins_1.0_amd64.deb && dpkg -i nccl-rdma-sharp-plugins_1.0_amd64.deb && rm -f nccl-rdma-sharp-plugins_1.0_amd64.deb
If you use CentOS or TencentOS, see the following steps for installation:
# Uninstall the existing nccl plugins.
rpm -e nccl-rdma-sharp-plugins-1.0-1.x86_64

# Download and install nccl 1.2 plugins.
wget https://taco-1251783334.cos.ap-shanghai.myqcloud.com/nccl/nccl-rdma-sharp-plugins-1.2-1.x86_64.rpm && rpm -ivh --nodeps --force nccl-rdma-sharp-plugins-1.2-1.x86_64.rpm

# Ensure that the version of the nccl plugins used within the cluster is consistent. The following are the download and installation commands for the nccl 1.0 version. It is recommended to use the more stable nccl 1.2 version.
# wget https://taco-1251783334.cos.ap-shanghai.myqcloud.com/nccl/nccl-rdma-sharp-plugins-1.0-1.x86_64.rpm && rpm -ivh --nodeps --force nccl-rdma-sharp-plugins-1.0-1.x86_64.rpm && rm -f nccl-rdma-sharp-plugins-1.0-1.x86_64.rpm
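An optional check that the plugin package is installed (use the command matching your distribution):
# Ubuntu
dpkg -s nccl-rdma-sharp-plugins
# CentOS / TencentOS
rpm -q nccl-rdma-sharp-plugins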
2. Obtain the topologically sorted IP list.
The NCCL plugins provide two of the optimizations, dynamic aggregation of bonded network interfaces and global hash routing, without requiring any dependency files. If topology-affinity awareness is also needed, you can enable it through the sorted IP list.
IP sorting can be completed as follows:
Prepare the IP list file.
Each node's VPC IP address can be obtained via ifconfig eth0; list one node IP per line. The format is as follows:
root@VM-125-10-tencentos:/workspace# cat ip_eth0.txt
172.16.177.28
172.16.176.11
172.16.177.25
172.16.177.12
Execute the sorting script:
wget https://taco-1251783334.cos.ap-shanghai.myqcloud.com/tccl/get_rdma_order_by_ip.sh && bash get_rdma_order_by_ip.sh ip_eth0.txt
Note:
Ensure the curl tool is installed on all nodes (on Ubuntu, it can be installed via apt install curl).
Ensure the node executing the script has password-free SSH access to all other nodes.
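Both preconditions can be verified from the executing node with a loop like the following (a sketch; assumes the ip_eth0.txt file above):
# Each line should print the remote hostname followed by "ok"; a password prompt or error indicates a node to fix.
for ip in $(cat ip_eth0.txt); do
    ssh -o BatchMode=yes "$ip" 'command -v curl >/dev/null && echo "$(hostname) ok"'
done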
View the sorted IP list file.
root@VM-125-10-tencentos:/workspace# cat hostfile.txt
172.16.176.11
172.16.177.12
172.16.177.25
172.16.177.28
3. Configure TCCL environment variables.
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_GID_INDEX=3
export NCCL_IB_DISABLE=0
export NCCL_IB_HCA=mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3,mlx5_bond_4,mlx5_bond_5,mlx5_bond_6,mlx5_bond_7
export NCCL_NET_GDR_LEVEL=2
export NCCL_IB_QPS_PER_CONNECTION=4
export NCCL_IB_TC=160
export NCCL_IB_TIMEOUT=22
export NCCL_PXN_DISABLE=0

# Since the machine IPs are manually sorted, there is no need to set the following variable.
# export TCCL_TOPO_AFFINITY=4
4. Modify the startup script.
You need to modify the startup script when starting distributed training. For example, if you use the deepspeed launcher to start the training process, you need to obtain the sorted IP list, write the corresponding IP list into the hostfile, and then start the training process.
root@vm-3-17-centos:/workspace/ptm/gpt# cat hostfile
172.16.176.11 slots=8
172.16.177.12 slots=8
172.16.177.25 slots=8
172.16.177.28 slots=8

deepspeed --hostfile ./hostfile --master_addr 172.16.176.11 train.py
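If every node exposes the same number of GPUs, the sorted hostfile.txt generated earlier can be converted into this hostfile with a one-liner (a sketch; slots=8 assumes 8 GPUs per node):
awk '{print $1, "slots=8"}' hostfile.txt > hostfile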
If torchrun is used to start the training process, specify the corresponding node sequence through --node_rank:
# on 172.16.176.11
torchrun --nnodes=4 --nproc_per_node=8 --node_rank=0 --master_addr=172.16.176.11 train.py ...
# on 172.16.177.12
torchrun --nnodes=4 --nproc_per_node=8 --node_rank=1 --master_addr=172.16.176.11 train.py ...
# on 172.16.177.25
torchrun --nnodes=4 --nproc_per_node=8 --node_rank=2 --master_addr=172.16.176.11 train.py ...
# on 172.16.177.28
torchrun --nnodes=4 --nproc_per_node=8 --node_rank=3 --master_addr=172.16.176.11 train.py ...
If mpirun is used to start the training process, simply list the IP addresses in the sorted order.
mpirun \
-np 64 \
-H 172.16.176.11:8,172.16.177.12:8,172.16.177.25:8,172.16.177.28:8 \
--allow-run-as-root \
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO \
-x NCCL_IB_GID_INDEX=3 \
-x NCCL_IB_DISABLE=0 \
-x NCCL_SOCKET_IFNAME=eth0 \
-x NCCL_IB_HCA=mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3,mlx5_bond_4,mlx5_bond_5,mlx5_bond_6,mlx5_bond_7 \
-x NCCL_NET_GDR_LEVEL=2 \
-x NCCL_IB_QPS_PER_CONNECTION=4 \
-x NCCL_IB_TC=160 \
-x NCCL_IB_TIMEOUT=22 \
-x NCCL_PXN_DISABLE=0 \
-x LD_LIBRARY_PATH -x PATH \
-mca coll_hcoll_enable 0 \
-mca pml ob1 \
-mca btl_tcp_if_include eth0 \
-mca btl ^openib \
all_reduce_perf -b 1G -e 1G -n 1000 -g 1
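The all_reduce_perf binary above comes from the open source nccl-tests suite; for a multi-node mpirun launch it needs to be built with MPI support. The following is a sketch; MPI_HOME and CUDA_HOME are assumptions about the local installation paths.
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 MPI_HOME=/usr/local/openmpi CUDA_HOME=/usr/local/cuda NCCL_HOME=/opt/tencent/tccl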
