Installing RDMA Millisecond-Level Monitoring Component on GPU Instances

Last updated: 2024-08-20 17:08:22

    Feature Introduction

    Hyper Computing Cluster supports millisecond-level monitoring in RDMA network environments. You can monitor instantaneous network data in real time, analyze network traffic patterns in depth, and use the results to optimize the network and improve performance.

    Overview

    This document describes how to install the millisecond-level monitoring component in a Tencent Cloud Hyper Computing Cluster environment to enable millisecond-level performance monitoring of the RDMA network. Tencent Cloud provides two ways to view the monitoring data: view the millisecond-level statistics in Cloud Product Monitoring, or view the monitoring logs saved locally on the instance.
    Note:
    Running RDMA millisecond-level monitoring occupies less than 0.05 CPU resources. Decide whether to enable it based on your business needs.

    Directions

    Environment Preparations

    1. Create a GPU Hyper Computing Cluster PNV4sne, PNV4sn, or PNV5v instance. The TencentOS Server 2.4 (TK4) image is recommended.
    2. Install the GPU driver and the nvidia-fabricmanager service on the GPU instance.

    Installation and Verification

    1. In the TencentOS Server 2.4 (TK4) environment, run the following commands to install the component:
    # Uninstall the existing enhanced monitoring software package, if one is installed.
    rpm -e rdma_monitor-1.0-1.tl2.x86_64
    # Download and install the millisecond-level monitoring component.
    # After the package is installed, a system service is registered automatically to start the enhanced monitoring and keep it alive. No manual startup is required.
    wget http://mirrors.tencentyun.com/install/GPU/rdma_monitor-1.0-1.tl2.x86_64.rpm && rpm -ivh rdma_monitor-1.0-1.tl2.x86_64.rpm
    2. Run the following command to verify that the installation succeeded:
    ps -aux | grep monitor_server
    If the output lists a monitor_server process in addition to the grep command itself (grep typically highlights the matched text in red), the enhanced monitoring has been installed and started.
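The check above can also be scripted. The following is a minimal sketch, assuming the process name is exactly monitor_server (taken from the ps command above) and that pgrep from procps is available on the image. The helper name is_process_running is hypothetical and not part of the monitoring component.

```shell
#!/usr/bin/env bash
# Return 0 if a process with exactly this name is running, non-zero otherwise.
# pgrep -x matches the process name exactly and never reports itself, so the
# result is not affected by the extra "grep monitor_server" line that
# "ps | grep" prints.
is_process_running() {
    local name="$1"
    pgrep -x -- "$name" > /dev/null 2>&1
}

# Example (process name assumed from the verification step):
# if is_process_running monitor_server; then
#     echo "enhanced monitoring is running"
# else
#     echo "enhanced monitoring is NOT running" >&2
# fi
```

An exit status is easier to use in provisioning scripts than visually inspecting highlighted output.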
    

    Viewing Cloud Product Monitoring Data

    The RDMA millisecond-level monitoring data can be viewed in Cloud Product Monitoring. Configure the required metrics in the Cloud Product Monitoring dashboard as follows:
    1. Create a dashboard and select CVM - RDMA Monitoring as the metric.
    
    2. Select the RDMA millisecond-level statistical metrics you need to monitor.
    
    Cloud Product Monitoring supports the following statistics. Configure them in the Cloud Product Monitoring dashboard as needed.
    Metric Name | Display Name | Metric Description | Unit | Dimension | Statistical Granularity
    ------------|--------------|--------------------|------|-----------|------------------------
    RxHpbwAvg | Millisecond-level average of RDMA network interface received bandwidth | Average of the millisecond-granularity received bandwidth samples of the RDMA network interface within 10 seconds | Mbps | InstanceId | 10s, 60s, 300s, 3,600s
    RxHpbwMax | Millisecond-level maximum of RDMA network interface received bandwidth | Maximum of the millisecond-granularity received bandwidth samples of the RDMA network interface within 10 seconds | Mbps | InstanceId | 10s, 60s, 300s, 3,600s
    RxHpbwMin | Millisecond-level minimum of RDMA network interface received bandwidth | Minimum of the millisecond-granularity received bandwidth samples of the RDMA network interface within 10 seconds | Mbps | InstanceId | 10s, 60s, 300s, 3,600s
    RxHpbwP50 | Millisecond-level 50th percentile of RDMA network interface received bandwidth | 50th percentile of the millisecond-granularity received bandwidth samples of the RDMA network interface within 10 seconds, sorted from lowest to highest | Mbps | InstanceId | 10s, 60s, 300s, 3,600s, 86,400s
    RxHpbwP90 | Millisecond-level 90th percentile of RDMA network interface received bandwidth | 90th percentile of the millisecond-granularity received bandwidth samples of the RDMA network interface within 10 seconds, sorted from lowest to highest | Mbps | InstanceId | 10s, 60s, 300s, 3,600s
    TxHpbwAvg | Millisecond-level average of RDMA network interface transmitted bandwidth | Average of the millisecond-granularity transmitted bandwidth samples of the RDMA network interface within 10 seconds | Mbps | InstanceId | 10s, 60s, 300s, 3,600s
    TxHpbwMax | Millisecond-level maximum of RDMA network interface transmitted bandwidth | Maximum of the millisecond-granularity transmitted bandwidth samples of the RDMA network interface within 10 seconds | Mbps | InstanceId | 10s, 60s, 300s, 3,600s
    TxHpbwMin | Millisecond-level minimum of RDMA network interface transmitted bandwidth | Minimum of the millisecond-granularity transmitted bandwidth samples of the RDMA network interface within 10 seconds | Mbps | InstanceId | 10s, 60s, 300s, 3,600s
    TxHpbwP50 | Millisecond-level 50th percentile of RDMA network interface transmitted bandwidth | 50th percentile of the millisecond-granularity transmitted bandwidth samples of the RDMA network interface within 10 seconds, sorted from lowest to highest | Mbps | InstanceId | 10s, 60s, 300s, 3,600s
    TxHpbwP90 | Millisecond-level 90th percentile of RDMA network interface transmitted bandwidth | 90th percentile of the millisecond-granularity transmitted bandwidth samples of the RDMA network interface within 10 seconds, sorted from lowest to highest | Mbps | InstanceId | 10s, 60s, 300s, 3,600s
    3. Select the Hyper Computing Cluster instance ID you need to monitor.
    
    4. Click OK to create the dashboard.
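The P50 and P90 metrics listed above are percentiles over bandwidth samples sorted from lowest to highest. The sketch below illustrates that idea with the nearest-rank method. The exact percentile method used by Cloud Product Monitoring is not stated in this document, so nearest-rank is an assumption, and the helper name percentile is hypothetical.

```shell
#!/usr/bin/env bash
# percentile P
# Reads one numeric sample per line on stdin and prints the P-th percentile
# using the nearest-rank method (an assumption, see the paragraph above).
percentile() {
    local p="$1"
    sort -n | awk -v p="$p" '
        { v[NR] = $1 }                 # samples arrive sorted, lowest first
        END {
            if (NR == 0) exit 1        # no samples, nothing to print
            r = p * NR / 100           # fractional rank
            rank = int(r)
            if (rank < r) rank++       # round up to the next whole rank
            if (rank < 1) rank = 1
            print v[rank]
        }'
}

# Example with ten hypothetical bandwidth samples in Mbps:
# printf '%s\n' 10 20 30 40 50 60 70 80 90 100 | percentile 90
```

With the ten samples in the example, the 90th percentile is the 9th value in sorted order.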
    

    Viewing Local Monitoring Data

    RDMA millisecond-level monitoring collects bandwidth data at a minimum granularity of 10 ms, but Cloud Product Monitoring reports data at a minimum granularity of 10 s. To obtain finer-grained network interface monitoring data, use the following command to save the millisecond-level data for local viewing.
    # monitor_client is installed automatically with the enhanced monitoring component. /tmp/monitor.log is a user-defined data storage path. The file keeps growing, so manage storage space accordingly.
    monitor_client -r -p raw > /tmp/monitor.log
    # -r: continuously obtain the data from the last 10s
    # -p: select what to print
    # -p summary: print statistical information (default)
    # -p raw: print the original data points
    # -p all: print both the statistical information and the original data points
    # Run monitor_client -h to view more parameter descriptions.
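Because the log file keeps growing, one option is to bound each capture by time and by size. This is a minimal sketch, assuming GNU coreutils timeout and head are available and that monitor_client exits cleanly when it is terminated. The helper name capture_bounded and the limits in the example are hypothetical.

```shell
#!/usr/bin/env bash
# capture_bounded SECONDS MAX_BYTES OUTFILE CMD [ARGS...]
# Runs CMD, stops it after SECONDS, and keeps at most MAX_BYTES of its output.
capture_bounded() {
    local seconds="$1" max_bytes="$2" outfile="$3"
    shift 3
    # timeout ends the command after the time limit; head -c caps the file size.
    # A non-zero status from timeout (for example 124 when the limit is hit)
    # is expected here, so it is ignored.
    { timeout "$seconds" "$@" || true; } | head -c "$max_bytes" > "$outfile"
}

# Example: capture raw data points for at most 60 seconds or 100 MiB,
# whichever limit is reached first (limits are illustrative):
# capture_bounded 60 104857600 /tmp/monitor.log monitor_client -r -p raw
```

The example uses only the monitor_client options described above. The time and size limits are arbitrary and should be chosen to fit the available disk space.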
    You can then analyze the recorded monitoring data as needed. The format of the monitoring records is as follows:
    
    Note:
    The meanings of some parameters in the figure are explained as follows:
    Device: RDMA network interface name.
    Received data points: the number of data points collected on the receiving side within 10 s. A value of 1,000 means that one data point was collected every 10 ms, and each data point represents the received bandwidth for the corresponding 10 ms interval.
    Timestamp: the time at which collection took place.
    Data Point n: the received bandwidth collected n*10 ms after the timestamp. Adjacent data points are 10 ms apart.
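The timing relationships in the note can be worked through with a little arithmetic. The sketch below assumes that data points are expressed in Mbps (the unit listed for the cloud metrics; the unit in the local log is not stated here) and that 1 Mbps equals 1,000,000 bits per second. The helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Offset in milliseconds of Data Point n from the record timestamp (n * 10 ms).
point_offset_ms() {
    echo $(( $1 * 10 ))
}

# Number of data points in one 10 s window at one point per 10 ms.
points_per_window() {
    echo $(( 10 * 1000 / 10 ))
}

# Approximate bytes transferred during one 10 ms interval at MBPS megabits/s:
# MBPS * 1,000,000 bits/s * 0.01 s / 8 bits per byte = MBPS * 1250 bytes.
bytes_in_interval() {
    echo $(( $1 * 1250 ))
}

# Example:
# point_offset_ms 3          # offset of Data Point 3 in milliseconds
# bytes_in_interval 100000   # bytes moved in 10 ms at 100,000 Mbps
```

For example, Data Point 3 lies 30 ms after the timestamp, and a 10 ms interval at 100,000 Mbps corresponds to roughly 125,000,000 bytes.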
    