Cluster Node Resource (CPU, Memory, Disk) Stress Test Faults

Last updated: 2024-09-26 15:47:38

    Background

    Container nodes (such as worker nodes in a Kubernetes cluster) host container resources and are responsible for running and managing container instances. When the QPS of a service on a node spikes, or a service leaks memory, node resource utilization can rise sharply, degrading the business or even causing business processes to be killed.
    To enhance the reliability and stability of container services, node fault experiments are needed. Through these experiments, you can verify whether the system can operate normally in the event of container node faults, and uncover potential issues in advance for system architecture optimization and emergency planning.

    Experiment Execution

    Step 1: Experiment Preparation

    Create a container node, add instances, and deploy the test service. If a container node is already available for the experiment, skip ahead to creating the experiment.
    Go to the agent management page and install an agent on the CVM node. For installation instructions, see Agent Management.

    Step 2: Create an Experiment

    1. Log in to Tencent Smart Advisor > Chaotic Fault Generator, go to the Experiment Management page, and click Create a New Experiment.
    2. On the Template Selection page, click Skip and create a blank experiment.
    3. Fill in the experiment name and description, then click Next.
    4. Fill in the action group information. Set the resource type to Container and the resource object to Standard Cluster Node.
    5. Click Add Instance, and select the instances to be included in the experiment from the instance list.
    6. In the experiment actions section, click Add Now to add experiment actions. Select actions for CPU Resources, Memory Resources, and Disk Resources.
    7. Modify the action parameters as needed.
    High CPU utilization
    Note:
    CPU Utilization: The target CPU load as a percentage, from 0 to 100.
    Duration: How long the fault action runs. When the duration elapses, the agent automatically reverts the fault.
    Scheduling Priority: The nice value of the load process, which affects its priority in CPU scheduling. A lower nice value gives the process a higher priority, so it is more likely to be granted CPU time slices. This parameter takes effect only when CPU utilization is set to 100%.
    High memory utilization
    Note:
    Memory Usage Rate: The target memory load as a percentage, from 0 to 100.
    Duration: How long the fault action runs. When the duration elapses, the agent automatically reverts the fault.
    Enable OOM Protection: When enabled, the fault process is less likely to be OOM-killed, so business processes are killed first.
    Memory Occupancy Rate: The amount by which memory usage increases per second.
    High disk usage
    Note:
    Action Parameters Description:
    Disk Directory: The directory to fill, i.e., the directory where files are written.
    File Size: The size of the file to write.
    Disk Usage Rate: The target disk utilization. The agent obtains the current disk usage through staf commands and calculates the file size required to reach the specified utilization.
    Reserved Space: The amount of free space to leave on the disk.
    Duration: How long the fault action runs. When the duration elapses, the agent automatically reverts the fault.
    If file size, disk usage rate, and reserved space are all specified, they take priority in this order: disk usage rate > reserved space > file size.
    Disk IO load
    Note:
    Disk Directory: The directory in which to generate disk I/O load. The load applies to the disk the directory resides on.
    Mode: Read or write. Both modes are available for generating high load.
    Block Size: The block size used for each read or write.
    Number of Blocks: The number of blocks to copy.
    Duration: How long the fault action runs. When the duration elapses, the agent automatically reverts the fault.
    8. Complete the action group edit and click Next.
    9. Click Add Monitoring Metrics.
    10. Click Submit to complete the creation of the experiment.
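
The priority rule for the high disk usage parameters in step 7 (disk usage rate > reserved space > file size) can be sketched as follows. This is a minimal illustration, not the agent's actual implementation: the function name `fill_size_bytes`, its parameter names, and the clamping of the result to the available free space are assumptions made for the example.

```python
from typing import Optional


def fill_size_bytes(
    total: int,
    used: int,
    target_usage_pct: Optional[float] = None,
    reserved_bytes: Optional[int] = None,
    file_size_bytes: Optional[int] = None,
) -> int:
    """Return how many bytes to write, applying the documented priority.

    Hypothetical helper: only the priority order comes from the document.
    """
    free = total - used
    if target_usage_pct is not None:
        # Highest priority: write enough to reach the target utilization.
        wanted = int(total * target_usage_pct / 100) - used
    elif reserved_bytes is not None:
        # Next: write until only the reserved space remains free.
        wanted = free - reserved_bytes
    elif file_size_bytes is not None:
        # Lowest priority: write a file of the requested size.
        wanted = file_size_bytes
    else:
        raise ValueError("at least one of the three parameters is required")
    # Assumption: never return a negative size or more than the free space.
    return max(0, min(wanted, free))
```

On a real node, `total` and `used` could be taken from `shutil.disk_usage(path)`. For a 100 GiB disk with 40 GiB used, a target of 80% yields a 40 GiB fill even if a reserved space or file size is also given, because the disk usage rate takes precedence.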

    Step 3: Execute the Experiment

    1. Run the pre-check to view the agent status of each node, and install the agent on any node where it is missing, following the prompts in the pre-check result.
    2. Click Execute to start injecting the high CPU load fault.
    3. Observe the node monitoring metrics to ensure CPU utilization reaches the preset value and that recovery is completed by the specified time.
    4. Execute the high memory utilization action and observe the monitoring metrics.
    5. Execute the high disk usage action and use the df command in the terminal to view utilization.
    6. Execute the high disk I/O load action and use the iostat command in the terminal to observe the load.
    Note:
    Use iostat -x to view detailed statistics for block devices. For this action, the key column is %util, which indicates the percentage of time the device spends processing I/O requests. When this value approaches 100%, the device is close to saturation, which degrades overall disk performance and severely affects the processing of other read/write requests.
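
To check %util from a script rather than by reading iostat output, the same figure can be approximated from two samples of /proc/diskstats on Linux. The sketch below assumes the standard diskstats layout, in which the tenth statistics field is the number of milliseconds the device spent doing I/O. The function names are hypothetical, and the result may differ slightly from what iostat reports.

```python
import time


def read_io_ticks(device: str, path: str = "/proc/diskstats") -> int:
    """Return the milliseconds the device has spent doing I/O so far."""
    with open(path) as f:
        for line in f:
            parts = line.split()
            # Columns: major, minor, name, then the statistics fields.
            # The tenth statistics field (index 12) is time spent doing I/O.
            if len(parts) > 12 and parts[2] == device:
                return int(parts[12])
    raise ValueError(f"device {device!r} not found in {path}")


def util_percent(ticks_before: int, ticks_after: int, interval_s: float) -> float:
    """Convert a busy-time delta in milliseconds into a utilization percentage."""
    busy_ms = ticks_after - ticks_before
    return min(100.0, busy_ms * 100.0 / (interval_s * 1000.0))


def sample_util(device: str, interval_s: float = 1.0) -> float:
    """Sample the device twice and return its utilization over the interval."""
    before = read_io_ticks(device)
    time.sleep(interval_s)
    after = read_io_ticks(device)
    return util_percent(before, after, interval_s)
```

For example, `sample_util("vda", 5.0)` would report the average utilization of a device named `vda` over five seconds. The device name is only an illustration and depends on the node.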
    