Installation Instructions of TCCL on GPU Instances

Last updated: 2024-08-20 17:06:38

    Introduction to TCCL

    Tencent Collective Communication Library (TCCL) is a high-performance communication acceleration library customized for Tencent Cloud's StarPulse network architecture. It leverages the StarPulse network hardware to deliver more efficient network communication for large-scale AI model training, and provides intelligent operations and maintenance capabilities for rapid detection and self-healing of network failures. TCCL is extended and optimized from the open-source NCCL code, so it is fully compatible with NCCL's features and usage. TCCL currently supports the following main features:
    Dynamic aggregation optimization of dual network interfaces maximizes the performance of bonding devices.
    Global Hash Routing enables load balancing to avoid congestion.
    Topology affinity traffic scheduling minimizes traffic detours.

    Overview

    This document describes how to configure the TCCL acceleration communication library to improve multi-machine, multi-card communication performance in the Tencent Cloud RDMA environment. In large-scale model training scenarios, TCCL is expected to increase bandwidth utilization by approximately 50% compared with the open-source NCCL scheme.

    Directions

    Environment Preparations

    1. Create GPU Hyper Computing Cluster PNV4sne or GPU Hyper Computing Cluster PNV4sn instances, which support 1.6 Tbps and 800 Gbps RDMA networks respectively.
    2. Install the GPU driver and nvidia-fabricmanager service for GPU instances.
    Note:
    The TCCL operating environment requires glibc 2.17 or later and CUDA 10.0 or later.
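    Before installing, it can help to confirm the prerequisites on each instance. The commands below are a quick sanity-check sketch (they assume a systemd-based image such as Ubuntu 20.04 or TencentOS) and are not part of the official procedure:
    # Check the glibc version (2.17 or later is required).
    ldd --version | head -n 1
    
    # Check the GPU driver and the CUDA version it supports (10.0 or later is required).
    nvidia-smi
    
    # Confirm that the nvidia-fabricmanager service is active.
    systemctl is-active nvidia-fabricmanager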

    Selecting Installation Methods

    TCCL currently supports three installation methods. Select the one that best suits your business scenario.
    TCCL communication library + compile and install PyTorch
    TCCL communication library + PyTorch communication plugins
    NCCL plugins + sorted IP list
    Note:
    Since most large-scale model training is based on the PyTorch framework, PyTorch is used as the example throughout this document.
    The three access schemes for TCCL are compared below:
    Method 1: compile and install PyTorch
    Usage steps: Install TCCL, then recompile and install PyTorch.
    Advantage: No intrusion into business code.
    Disadvantage: Requires recompiling and installing PyTorch; has software environment requirements.
    Software dependency: Corresponds to NCCL 2.12; requires glibc 2.17 or later and CUDA 10.0 or later.
    Method 2: install PyTorch communication plugins
    Usage steps: Install the PyTorch communication plugins, then modify the distributed communication backend.
    Advantage: Easy to install.
    Disadvantage: Requires modifying the business code; has software environment requirements.
    Software dependency: The current installation package only supports PyTorch 1.12; requires glibc 2.17 or later and CUDA 10.0 or later.
    (Recommended) Method 3: install NCCL plugins
    Usage steps: Install the NCCL plugins, then modify the startup script.
    Advantage: Easy to install.
    Disadvantage: Requires updating the sorted IP list after scaling out cluster nodes.
    Software dependency: Requires NCCL to be installed.
    If your machine resources and model training scenarios are relatively fixed, Method 3 is recommended: it is compatible with different NCCL and CUDA versions, and is easy to install and use without modifying business code or recompiling PyTorch.
    If your resources are provided to different business teams or frequently require scale-out, the first two methods are recommended: they do not require algorithm engineers or the scheduling framework to be aware of the machines' network topology.
    If you do not want to modify the business code, use Method 1, which only requires recompiling the PyTorch framework.

    Configuring the TCCL Environment and Verifying

    Method 1: compile and install PyTorch
    Since the community PyTorch links the NCCL communication library statically by default, TCCL cannot be enabled simply by replacing the shared libraries; PyTorch must be recompiled against TCCL.
    1. Install TCCL.
    Taking Ubuntu 20.04 as an example, you can install TCCL with the following commands. After installation, TCCL is located in the /opt/tencent/tccl directory.
    # Uninstall the existing tccl versions and nccl plugins.
    dpkg -r tccl && dpkg -r nccl-rdma-sharp-plugins
    
    # Download and install tccl v1.5 version.
    wget https://taco-1251783334.cos.ap-shanghai.myqcloud.com/tccl/TCCL_1.5-ubuntu.20.04.5_amd64.deb && dpkg -i TCCL_1.5-ubuntu.20.04.5_amd64.deb && rm -f TCCL_1.5-ubuntu.20.04.5_amd64.deb
    If you use CentOS or TencentOS, see the following steps for installation:
    # Uninstall the existing tccl versions and nccl plugins.
    rpm -e tccl && rpm -e nccl-rdma-sharp-plugins-1.0-1.x86_64
    
    # Download tccl v1.5 version.
    wget https://taco-1251783334.cos.ap-shanghai.myqcloud.com/tccl/tccl-1.5-1.tl2.x86_64.rpm && rpm -ivh --nodeps --force tccl-1.5-1.tl2.x86_64.rpm && rm -f tccl-1.5-1.tl2.x86_64.rpm
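    As an optional check (not part of the original steps), you can confirm that the package registered correctly and that the paths referenced later in this document exist:
    # On Ubuntu:
    dpkg -l | grep tccl
    # On CentOS/TencentOS:
    rpm -qa | grep tccl
    # The headers and libraries used below should be present:
    ls /opt/tencent/tccl/include /opt/tencent/tccl/lib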
    2. Recompile and install Pytorch.
    The following is an example of installing PyTorch from source. See the Official PyTorch Installation Description for details.
    #!/bin/bash
    
    # Uninstall the current version.
    pip uninstall -y torch
    
    # Download pytorch source code.
    git clone --recursive https://github.com/pytorch/pytorch
    cd pytorch
    
    # <!Important> Configure the installation path of TCCL.
    export USE_SYSTEM_NCCL=1
    export NCCL_INCLUDE_DIR="/opt/tencent/tccl/include"
    export NCCL_LIB_DIR="/opt/tencent/tccl/lib"
    
    # See the official website for other compilation options.
    
    # Install the development environment.
    python setup.py develop
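    After the build completes, a quick sanity check (a sketch, not part of the official steps) is to confirm that the NCCL backend is available and that the reported version matches the NCCL 2.12 base that TCCL corresponds to:
    python -c "import torch; print(torch.distributed.is_nccl_available()); print(torch.cuda.nccl.version())"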
    3. Configure TCCL environment variables.
    export NCCL_DEBUG=INFO
    export NCCL_SOCKET_IFNAME=eth0
    export NCCL_IB_GID_INDEX=3
    export NCCL_IB_DISABLE=0
    export NCCL_IB_HCA=mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3,mlx5_bond_4,mlx5_bond_5,mlx5_bond_6,mlx5_bond_7
    export NCCL_NET_GDR_LEVEL=2
    export NCCL_IB_QPS_PER_CONNECTION=4
    export NCCL_IB_TC=160
    export NCCL_IB_TIMEOUT=22
    export NCCL_PXN_DISABLE=0
    export TCCL_TOPO_AFFINITY=4
    Note:
    You need to enable network topology awareness through TCCL_TOPO_AFFINITY=4.
    4. Verify Pytorch.
    During single-machine multi-GPU or multi-machine multi-GPU training, with export NCCL_DEBUG=INFO set, the NCCL INFO log printed at startup indicates whether TCCL has been loaded.
    5. Verify nccl-tests.
    Before running nccl-tests, you need to export the corresponding TCCL path:
    export LD_LIBRARY_PATH=/opt/tencent/tccl/lib:$LD_LIBRARY_PATH
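    The guide assumes an nccl-tests binary is already available. A minimal build-and-run sketch is shown below; the repository URL, make variables, and test flags are the standard nccl-tests ones rather than anything TCCL-specific, and CUDA_HOME=/usr/local/cuda is an assumption about your CUDA install path:
    # Build nccl-tests against the TCCL headers and libraries under /opt/tencent/tccl.
    git clone https://github.com/NVIDIA/nccl-tests.git && cd nccl-tests
    make -j CUDA_HOME=/usr/local/cuda NCCL_HOME=/opt/tencent/tccl
    
    # Single-node all-reduce benchmark on 8 GPUs with the TCCL library on the search path.
    export LD_LIBRARY_PATH=/opt/tencent/tccl/lib:$LD_LIBRARY_PATH
    ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 8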
    6. Supported software versions
    At present, TCCL corresponds to NCCL 2.12 and requires glibc 2.17 or later and CUDA 10.0 or later. For other CUDA versions, contact your pre-sales manager for support.
    Method 2: install PyTorch communication plugins
    PyTorch supports integrating third-party communication backends through plugins, so you can use the TCCL communication backend without recompiling PyTorch. The API is fully compatible with NCCL. See Introduction to Existing Communication Backends in PyTorch for details.
    1. Install Pytorch communication plugins.
    # Uninstall the existing tccl and NCCL plugins.
    dpkg -r tccl && dpkg -r nccl-rdma-sharp-plugins
    
    # Uninstall torch_tccl.
    pip uninstall -y torch-tccl
    
    # Install torch_tccl version 0.0.2.
    wget https://taco-1251783334.cos.ap-shanghai.myqcloud.com/tccl/torch_tccl-0.0.2_pt1.12-py3-none-any.whl && pip install torch_tccl-0.0.2_pt1.12-py3-none-any.whl && rm -f torch_tccl-0.0.2_pt1.12-py3-none-any.whl
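    As an optional check (not part of the original steps), confirm that the plugin is installed and importable before changing any business code:
    pip show torch-tccl
    python -c "import torch_tccl; print('torch_tccl imported successfully')"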
    2. Modify the business code.
    import torch_tccl
    
    # args.dist_backend = "nccl"
    args.dist_backend = "tccl"
    torch.distributed.init_process_group(
        backend=args.dist_backend,
        init_method=args.dist_url,
        world_size=args.world_size,
        rank=args.rank
    )
    3. Configure TCCL environment variables.
    export NCCL_DEBUG=INFO
    export NCCL_SOCKET_IFNAME=eth0
    export NCCL_IB_GID_INDEX=3
    export NCCL_IB_DISABLE=0
    export NCCL_IB_HCA=mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3,mlx5_bond_4,mlx5_bond_5,mlx5_bond_6,mlx5_bond_7
    export NCCL_NET_GDR_LEVEL=2
    export NCCL_IB_QPS_PER_CONNECTION=4
    export NCCL_IB_TC=160
    export NCCL_IB_TIMEOUT=22
    export NCCL_PXN_DISABLE=0
    export TCCL_TOPO_AFFINITY=4
    Note:
    You need to enable the network topology awareness feature through TCCL_TOPO_AFFINITY=4.
    4. Verify Pytorch.
    When executing distributed training, the startup log indicates whether the TCCL communication backend has been loaded correctly.
    5. Software version limitations
    The current installation package only supports PyTorch 1.12. For other PyTorch and CUDA versions, contact your pre-sales manager for support.
    Note:
    If you run nccl-tests or other scenarios that require a dynamically linked communication library, install TCCL as described in Method 1.
    
    (Recommended) Method 3: install NCCL plugins
    If you have installed NCCL, you can also use the TCCL acceleration capability through the NCCL plugins.
    1. Install NCCL plugins.
    Taking Ubuntu 20.04 as an example, you can install the plugins with the following commands.
    # Uninstall the existing tccl and nccl plugins.
    dpkg -r tccl && dpkg -r nccl-rdma-sharp-plugins
    
    # Download and install nccl 1.2 plugins.
    wget https://taco-1251783334.cos.ap-shanghai.myqcloud.com/nccl/nccl-rdma-sharp-plugins_1.2_amd64.deb && dpkg -i nccl-rdma-sharp-plugins_1.2_amd64.deb
    
    # Please ensure that the version of nccl plugins used within the cluster is consistent. The following are the download and installation commands for nccl 1.0 version. It is recommended to use the more stable nccl 1.2 version.
    # wget https://taco-1251783334.cos.ap-shanghai.myqcloud.com/nccl/nccl-rdma-sharp-plugins_1.0_amd64.deb && dpkg -i nccl-rdma-sharp-plugins_1.0_amd64.deb && rm -f nccl-rdma-sharp-plugins_1.0_amd64.deb
    If you use CentOS or TencentOS, see the following steps for installation:
    # Uninstall the existing nccl plugins.
    rpm -e nccl-rdma-sharp-plugins-1.0-1.x86_64
    
    # Download and install nccl 1.2 plugins.
    wget https://taco-1251783334.cos.ap-shanghai.myqcloud.com/nccl/nccl-rdma-sharp-plugins-1.2-1.x86_64.rpm && rpm -ivh --nodeps --force nccl-rdma-sharp-plugins-1.2-1.x86_64.rpm
    
    # Ensure that the version of the nccl plugins used within the cluster is consistent. The following are the download and installation commands for the nccl 1.0 version. It is recommended to use the more stable nccl 1.2 version.
    # wget https://taco-1251783334.cos.ap-shanghai.myqcloud.com/nccl/nccl-rdma-sharp-plugins-1.0-1.x86_64.rpm && rpm -ivh --nodeps --force nccl-rdma-sharp-plugins-1.0-1.x86_64.rpm && rm -f nccl-rdma-sharp-plugins-1.0-1.x86_64.rpm
    2. Obtain the topologically sorted IP list.
    The NCCL plugins provide two of the optimizations, dynamic aggregation of bonding interfaces and global hash routing, without any dependency files. If topology-affinity awareness of the network is also needed, you can provide it through a sorted IP list.
    IP sorting can be completed as follows:
    Prepare the IP list file.
    The VPC IP address of each node can be obtained with ifconfig eth0; put one node IP per line. The format is as follows:
    root@VM-125-10-tencentos:/workspace# cat ip_eth0.txt
    172.16.177.28
    172.16.176.11
    172.16.177.25
    172.16.177.12
    Execute the sorting script:
    wget https://taco-1251783334.cos.ap-shanghai.myqcloud.com/tccl/get_rdma_order_by_ip.sh && bash get_rdma_order_by_ip.sh ip_eth0.txt
    Note:
    Ensure that the curl tool is installed on all nodes (for Ubuntu, it can be installed via apt install curl).
    The node executing the script must be able to reach all other nodes over passwordless SSH; a quick check is sketched below.
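    A minimal sketch for that check (assuming the IP list is in ip_eth0.txt):
    while read -r ip; do
        ssh -o BatchMode=yes -o ConnectTimeout=5 "$ip" hostname || echo "cannot reach $ip"
    done < ip_eth0.txt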
    View the sorted IP list file.
    root@VM-125-10-tencentos:/workspace# cat hostfile.txt
    172.16.176.11
    172.16.177.12
    172.16.177.25
    172.16.177.28
    3. Configure TCCL environment variables.
    export NCCL_DEBUG=INFO
    export NCCL_SOCKET_IFNAME=eth0
    export NCCL_IB_GID_INDEX=3
    export NCCL_IB_DISABLE=0
    export NCCL_IB_HCA=mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3,mlx5_bond_4,mlx5_bond_5,mlx5_bond_6,mlx5_bond_7
    export NCCL_NET_GDR_LEVEL=2
    export NCCL_IB_QPS_PER_CONNECTION=4
    export NCCL_IB_TC=160
    export NCCL_IB_TIMEOUT=22
    export NCCL_PXN_DISABLE=0
    
    # Since the machine IPs have been manually sorted, there is no need to set the following variable.
    # export TCCL_TOPO_AFFINITY=4
    4. Modify the startup script.
    Modify the startup script when launching distributed training. For example, if you use the deepspeed launcher to start the training process, obtain the sorted IP list, write it into the hostfile (see the sketch after the example below), and then start the training process.
    root@vm-3-17-centos:/workspace/ptm/gpt# cat hostfile
    172.16.176.11 slots=8
    172.16.177.12 slots=8
    172.16.177.25 slots=8
    172.16.177.28 slots=8
    
    deepspeed --hostfile ./hostfile --master_addr 172.16.176.11 train.py
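    As referenced above, the hostfile is just the sorted IP list with a slot count appended. A minimal sketch for generating it (assuming the sorted list is in hostfile.txt and each node has 8 GPUs):
    awk '{print $1" slots=8"}' hostfile.txt > hostfile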
    If torchrun is used to start the training process, specify the corresponding node sequence through --node_rank:
    # on 172.16.176.11
    torchrun --nnodes=4 --nproc_per_node=8 --node_rank=0 --master_addr=172.16.176.11 train.py ...
    # on 172.16.177.12
    torchrun --nnodes=4 --nproc_per_node=8 --node_rank=1 --master_addr=172.16.176.11 train.py ...
    # on 172.16.177.25
    torchrun --nnodes=4 --nproc_per_node=8 --node_rank=2 --master_addr=172.16.176.11 train.py ...
    # on 172.16.177.28
    torchrun --nnodes=4 --nproc_per_node=8 --node_rank=3 --master_addr=172.16.176.11 train.py ...
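    To avoid hard-coding --node_rank on every node, the rank can be derived from the node's position in the sorted list. This is a sketch under the assumptions above (sorted list in hostfile.txt, VPC interface eth0), not part of the original guide:
    # Resolve this node's eth0 IP and look up its (0-based) position in the sorted list.
    MY_IP=$(ip -4 -o addr show eth0 | awk '{print $4}' | cut -d/ -f1)
    NODE_RANK=$(($(grep -n "^${MY_IP}$" hostfile.txt | cut -d: -f1) - 1))
    MASTER_ADDR=$(head -n 1 hostfile.txt)
    torchrun --nnodes=4 --nproc_per_node=8 --node_rank=${NODE_RANK} --master_addr=${MASTER_ADDR} train.py ...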
    If mpirun is used to start the training process, just list the IP addresses in the sorted order.
    mpirun \
    -np 64 \
    -H 172.16.176.11:8,172.16.177.12:8,172.16.177.25:8,172.16.177.28:8 \
    --allow-run-as-root \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO \
    -x NCCL_IB_GID_INDEX=3 \
    -x NCCL_IB_DISABLE=0 \
    -x NCCL_SOCKET_IFNAME=eth0 \
    -x NCCL_IB_HCA=mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3,mlx5_bond_4,mlx5_bond_5,mlx5_bond_6,mlx5_bond_7 \
    -x NCCL_NET_GDR_LEVEL=2 \
    -x NCCL_IB_QPS_PER_CONNECTION=4 \
    -x NCCL_IB_TC=160 \
    -x NCCL_IB_TIMEOUT=22 \
    -x NCCL_PXN_DISABLE=0 \
    -x LD_LIBRARY_PATH -x PATH \
    -mca coll_hcoll_enable 0 \
    -mca pml ob1 \
    -mca btl_tcp_if_include eth0 \
    -mca btl ^openib \
    all_reduce_perf -b 1G -e 1G -n 1000 -g 1
    