High Utilization of CVM Resources (CPU, Memory, and Disk)

Last updated: 2024-09-26 15:47:38

    Background

    Cloud Virtual Machine (CVM) is one of the most basic and widely used cloud resources. While a CVM is running, program errors, improper configuration, and other factors can cause faults such as high CPU utilization, high memory utilization, and high disk partition utilization. These faults degrade CVM performance and can even make services unavailable, causing losses for users.
    To improve CVM reliability and stability, run fault simulation experiments to verify that the system keeps operating normally when CPU, memory, or disk utilization is excessively high, so that contingency plans can be prepared in advance.

    Experiment Implementation

    Step 1: Experiment Preparation

    Prepare a CVM instance available for the experiment.
    Go to the agent management page and install an agent on the CVM node. For installation steps, see Agent Management.

    Step 2: Experiment Orchestration

    1. Log in to Tencent Smart Advisor > Chaotic Fault Generator, go to the Experiment Management page, click Create a New Experiment, and then click Skip and create a blank experiment.
    2. Fill in the basic information of the experiment.
    3. Fill in the experiment action group information, and select Compute-CVM.
    4. Add experiment instances.
    5. To add an experiment action, click Add Now, and configure the fault action parameters.
    Configure High CPU utilization fault action parameters.
    Note:
    CPU Utilization: Target CPU load as a percentage, from 0 to 100.
    Duration: How long the fault action lasts. When it elapses, the agent automatically recovers from the fault.
    Scheduling Priority: Affects the process's priority in CPU scheduling. A lower nice value makes the process more likely to receive CPU time slices, which raises its execution priority. This parameter takes effect only when utilization is 100%.
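    For intuition about what a CPU load percentage means, the sketch below keeps one core near a target utilization by busy-looping for a fraction of each short period and sleeping for the rest. It only illustrates the duty-cycle idea and is not the agent's implementation; `burn_cpu`, `period_s`, and the other names are hypothetical.

```python
import time


def burn_cpu(target_percent: float, duration_s: float, period_s: float = 0.1) -> float:
    """Load one core at roughly target_percent for duration_s seconds.

    Returns the total number of seconds spent busy-looping.
    Assumption: a single-threaded duty cycle; a real tool would also need
    one worker per core to load the whole machine.
    """
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")

    busy_per_period = period_s * target_percent / 100.0
    deadline = time.monotonic() + duration_s
    busy_total = 0.0

    while time.monotonic() < deadline:
        start = time.monotonic()
        # Spin until this period's busy budget is used up.
        while time.monotonic() - start < busy_per_period:
            pass
        busy_total += time.monotonic() - start
        # Sleep for the rest of the period so the average approximates the target.
        rest = period_s - (time.monotonic() - start)
        if rest > 0:
            time.sleep(rest)

    return busy_total


if __name__ == "__main__":
    # Hold one core at about 50% for two seconds.
    print(f"busy seconds: {burn_cpu(50, 2.0):.2f}")
```

    At 100% the loop never sleeps, which is the case where the scheduling priority of the process matters most, because it competes with other processes for every time slice.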
    Configure High memory utilization fault action parameters.
    Note:
    Memory Usage Rate: Target memory load as a percentage, from 0 to 100.
    Duration: How long the fault action lasts. When it elapses, the agent automatically recovers from the fault.
    Enable OOM Protection: If enabled, the fault process is less likely to be killed by the OOM killer, so business processes will be killed first.
    Memory Occupation Rate: How much memory usage increases per second.
    Configure High disk usage fault action parameters.
    Note:
    Disk Directory: The directory to fill, i.e., the directory where files are written.
    File Size: Size of the file to write.
    Disk Usage Rate: Target disk utilization. The current disk usage is obtained through staf commands, and the file size required to reach the specified utilization is calculated from it.
    Reserved Space: Amount of free space to leave on the disk.
    Duration: How long the fault action lasts. When it elapses, the agent automatically recovers from the fault.
    If more than one of the file size, disk utilization, and reserved space parameters is set, they take precedence in the order disk utilization > reserved space > file size.
    Configure Disk IO load fault action parameters.
    Note:
    Disk Directory: The directory in which to generate disk IO load. The load applies to the disk on which the directory resides.
    Mode: Read and write modes are both available for generating high load.
    Block Size: Size of the block used for each read or write.
    Number of Blocks: Number of blocks to copy.
    Duration: How long the fault action lasts. When it elapses, the agent automatically recovers from the fault.
    6. After configuring the action parameters, click Next. Configure Guardrail Policy and Monitoring Metrics according to your actual situation. After all configurations are completed, click Submit to create the experiment.
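    The precedence among the disk fill parameters in step 5 (disk utilization > reserved space > file size) can be sketched as follows. This is a minimal illustration under stated assumptions, not the agent's actual code: the function name `resolve_fill_bytes`, its arguments, and the clamping to available free space are all hypothetical.

```python
import shutil


def resolve_fill_bytes(total_bytes, used_bytes, *, usage_percent=None,
                       reserved_bytes=None, file_size_bytes=None):
    """Return how many bytes to write, honoring the documented precedence.

    Precedence: disk utilization > reserved space > file size.
    Assumption: free space is total - used; filesystem-reserved blocks
    are ignored for simplicity.
    """
    free_bytes = total_bytes - used_bytes

    if usage_percent is not None:
        # Highest precedence: write enough to reach the target utilization.
        wanted = int(total_bytes * usage_percent / 100) - used_bytes
    elif reserved_bytes is not None:
        # Next: fill everything except the space that must remain free.
        wanted = free_bytes - reserved_bytes
    elif file_size_bytes is not None:
        # Lowest precedence: write a file of exactly the requested size.
        wanted = file_size_bytes
    else:
        raise ValueError("set usage_percent, reserved_bytes, or file_size_bytes")

    # Assumption: never return a negative size or more than the free space.
    return max(0, min(wanted, free_bytes))


if __name__ == "__main__":
    usage = shutil.disk_usage("/")  # named tuple: total, used, free (bytes)
    print(resolve_fill_bytes(usage.total, usage.used, usage_percent=80))
```

    For example, on a 100 GiB disk with 40 GiB used, a target utilization of 80% yields a 40 GiB file even if a reserved space or file size is also supplied, because disk utilization ranks first.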

    Step 3: Experiment Execution

    1. Click Execute on the high CPU utilization action to start the experiment.
    2. Observe Monitoring Metrics. The CPU load rises to the specified utilization. Execute the rollback action, and the load recovers.
    3. Execute the high memory utilization action. Memory usage grows at the configured occupation rate until it reaches the specified memory utilization. Execute the rollback action, and memory returns to a steady state.
    Note:
    The injection tool collects the memory utilization metric from /proc/meminfo, using the formula Percent = (MemTotal - MemAvailable) / MemTotal.
    Tencent Cloud Observability Platform, the metric observation system provided by the cloud platform, also collects its data from /proc/meminfo. However, its algorithm excludes buffer and system cache usage, so its result differs from the injection tool's. Its formula is Percent = (MemTotal - MemFree - Buffers - Cached - SReclaimable + Shmem) / MemTotal.
    The memory information of this experiment instance is shown below, followed by the results of substituting these values into the two formulas:
    [root@VM-22-12-tencentos ~]# cat /proc/meminfo
    MemTotal: 1721620 kB //Total system memory (RAM) size
    MemFree: 111260 kB //Unused memory size
    MemAvailable: 349964 kB //Memory size available for starting a new process. Usage of system cache and buffer is considered for this value.
    Buffers: 59624 kB //Memory size for file system buffer
    Cached: 570612 kB //Memory size for file system cache
    ......
    Shmem: 269980 kB //Shared memory size
    ......
    SReclaimable: 46308 kB //Reclaimable cache size of kernel memory
    Utilization reported by the injection tool: (1721620 - 349964) / 1721620 ≈ 79.7%
    Utilization reported by Tencent Cloud Observability Platform: (1721620 - 111260 - 59624 - 570612 - 46308 + 269980) / 1721620 ≈ 69.9%
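    To compare the two formulas on another machine, the calculation can be scripted. The sketch below parses /proc/meminfo-style text and applies both formulas from this note; the function names are hypothetical, and `SAMPLE` holds the values from this experiment instance.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Name: value" shape
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def injection_tool_percent(m: dict[str, int]) -> float:
    """Percent = (MemTotal - MemAvailable) / MemTotal."""
    return 100.0 * (m["MemTotal"] - m["MemAvailable"]) / m["MemTotal"]


def observability_percent(m: dict[str, int]) -> float:
    """Percent = (MemTotal - MemFree - Buffers - Cached - SReclaimable + Shmem) / MemTotal."""
    used = (m["MemTotal"] - m["MemFree"] - m["Buffers"] - m["Cached"]
            - m["SReclaimable"] + m["Shmem"])
    return 100.0 * used / m["MemTotal"]


# Values copied from the experiment instance in this document.
SAMPLE = """\
MemTotal:        1721620 kB
MemFree:          111260 kB
MemAvailable:     349964 kB
Buffers:           59624 kB
Cached:           570612 kB
Shmem:            269980 kB
SReclaimable:      46308 kB
"""

if __name__ == "__main__":
    m = parse_meminfo(SAMPLE)
    print(f"injection tool: {injection_tool_percent(m):.1f}%")
    print(f"observability platform: {observability_percent(m):.1f}%")
```

    On a live Linux host, passing the contents of /proc/meminfo to `parse_meminfo` instead of `SAMPLE` gives the current values, which makes the gap between the two metrics easy to check before and after fault injection.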
    4. Execute the high disk utilization action. Log in to the machine and use the df command to check that the disk has reached the specified utilization. Execute the rollback action to restore the normal state.
    [Figure: df output in fault]
    [Figure: df output after rollback]
    5. Execute the high disk IO load action, open a terminal on the machine, and use the iostat command to observe the load.
    [Figure: iostat output in fault]
    [Figure: iostat output after rollback]