Background
Cloud Virtual Machine (CVM) is one of the most basic and widely used cloud resources. While a CVM is running, program errors, improper configuration, and other factors may cause faults such as high CPU utilization, high memory utilization, and high disk partition utilization. These faults degrade CVM performance and can even make services unavailable, causing losses for users.
To improve CVM reliability and stability, fault simulation experiments are needed to verify that the system can still operate normally when utilization of resources such as CPU, memory, and disk is excessively high, so that contingency plans can be prepared in advance.
Experiment Implementation
Step 1: Experiment Preparation
Prepare a CVM instance available for the experiment.
Go to the agent management page and install an agent on the CVM node. For specific installation steps, see Agent Management.
Step 2: Experiment Orchestration
2. Fill in the basic information of the experiment.
3. Fill in the experiment action group information, and select Compute-CVM.
4. Add experiment instances.
5. To add an experiment action, click Add Now, and configure fault action parameters.
Configure High CPU utilization fault action parameters.
Note:
CPU Utilization: The target CPU load percentage, from 0 to 100.
Duration: How long the fault action lasts. When it elapses, the agent automatically recovers from the fault.
Scheduling Priority: The nice value of the fault process, which affects its priority in CPU scheduling. A lower nice value gives the process a higher priority, so it is more likely to be allocated CPU time slices. This setting takes effect only when utilization is 100%.
Configure High memory utilization fault action parameters.
Note:
Memory Usage Rate: The target memory load percentage, from 0 to 100.
Duration: How long the fault action lasts. When it elapses, the agent automatically recovers from the fault.
Enable OOM Protection: If enabled, the fault process is less likely to be killed by the OOM killer, so business processes will be killed first.
Memory Occupation Rate: The amount by which memory usage increases per second.
Configure High disk usage fault action parameters.
Note:
Disk Directory: The directory to be filled, i.e., the directory where files are written.
File size: The size of the file to be written.
Disk Usage Rate: The target disk utilization. The current disk usage is obtained through staf commands, and the file size required to reach the specified utilization is calculated from it.
Reserved space: The amount of space that should remain free.
Duration: How long the fault action lasts. When it elapses, the agent automatically recovers from the fault.
If more than one of the file size, disk utilization, and reserved space parameters is specified, the priority is disk utilization > reserved space > file size.
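The priority rule above can be sketched as follows. This is only an illustration under stated assumptions, not the agent's actual implementation: the function name `fill_size_bytes` and its parameter names are hypothetical, all sizes are assumed to be in bytes, and utilization is assumed to mean used space divided by total space.

```python
from typing import Optional


def fill_size_bytes(total: int, used: int,
                    usage_percent: Optional[float] = None,
                    reserve: Optional[int] = None,
                    file_size: Optional[int] = None) -> int:
    """Return how many bytes to write, applying the documented priority:
    disk utilization > reserved space > file size.

    All names are hypothetical; sizes are in bytes.
    """
    free = total - used
    if usage_percent is not None:
        # Bytes needed so that used / total reaches the target percentage.
        wanted = int(total * usage_percent / 100) - used
    elif reserve is not None:
        # Fill everything except the space that must remain free.
        wanted = free - reserve
    elif file_size is not None:
        wanted = file_size
    else:
        raise ValueError("specify usage_percent, reserve, or file_size")
    # Never return a negative size or more than the free space.
    return max(0, min(wanted, free))
```

For example, on a hypothetical 100-byte disk with 40 bytes used, `fill_size_bytes(100, 40, usage_percent=80, reserve=10, file_size=5)` returns 40, because the utilization target overrides the other two parameters.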
Configure Disk IO load fault action parameters.
Note:
Disk Directory: The directory in which to generate disk IO load. The load applies to the disk on which the directory resides.
Mode: Read and write modes are both available for generating high load.
Block Size: The size of the block used for each read or write.
Number of Blocks: The number of blocks to be copied.
Duration: How long the fault action lasts. When it elapses, the agent automatically recovers from the fault.
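To illustrate how the block size and number of blocks relate to the amount of data moved, here is a minimal write-mode sketch. It assumes the total volume is simply block size × number of blocks; the function name `write_load` is hypothetical, and the source does not say which tool the agent actually uses.

```python
import os
import tempfile


def write_load(directory: str, block_size: int, count: int) -> int:
    """Write `count` blocks of `block_size` bytes to a temporary file in
    `directory` and return the total number of bytes written.

    Hypothetical helper for illustration only.
    """
    block = b"\0" * block_size
    written = 0
    # The temporary file is deleted automatically when the block exits.
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        for _ in range(count):
            written += f.write(block)
        f.flush()
        # Ask the OS to push the data to the device instead of leaving it
        # in the page cache.
        os.fsync(f.fileno())
    return written
```

Calling `write_load(path, 4096, 8)` writes 8 blocks of 4 KiB, 32768 bytes in total, to the disk that holds `path`.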
6. After configuring the action parameters, click Next. Configure Guardrail Policy and Monitoring Metrics according to your actual situation. After all configurations are completed, click Submit to create the experiment.
Step 3: Experiment Execution
1. Click Execute for the high CPU utilization action to start the experiment.
2. Observe Monitoring Metrics. The CPU load rises to the specified utilization. Execute a rollback action to recover.
3. Execute the high memory utilization action. Memory usage grows at the configured occupation rate until the specified memory utilization is reached. Execute a rollback action to restore the steady state.
Note:
Injection tools collect the memory utilization metric from /proc/meminfo, and the calculation formula is Percent = (MemTotal-MemAvailable)/MemTotal.
Tencent Cloud Observability Platform, the metric observation system provided by the cloud platform, also collects its data from /proc/meminfo. However, its algorithm excludes buffer and system cache usage, so its result differs from that of the injection tools. Its formula is Percent = (MemTotal-MemFree-Buffers-Cached-SReclaimable+Shmem)/MemTotal.
The memory information of this experiment instance is shown below, followed by the results of substituting these values into the two formulas:
[root@VM-22-12-tencentos ~]# cat /proc/meminfo
MemTotal: 1721620 kB //Total system memory (RAM) size
MemFree: 111260 kB //Unused memory size
MemAvailable: 349964 kB //Memory size available for starting a new process. Usage of system cache and buffer is considered for this value.
Buffers: 59624 kB //Memory size for file system buffer
Cached: 570612 kB //Memory size for file system cache
......
Shmem: 269980 kB //Shared memory size
......
SReclaimable: 46308 kB //Reclaimable cache size of kernel memory
Utilization calculated by the injection tool: (1721620-349964)/1721620 ≈ 79.7%
Utilization calculated by Tencent Cloud Observability Platform: (1721620-111260-59624-570612-46308+269980)/1721620 ≈ 69.9%
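The two formulas can be compared on any machine with a short script. The sketch below is an illustration only: the function names are hypothetical, and the sample values are the ones listed for this experiment instance.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into a {field: value in kB} mapping."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def injection_tool_percent(m: dict) -> float:
    # Percent = (MemTotal - MemAvailable) / MemTotal
    return (m["MemTotal"] - m["MemAvailable"]) / m["MemTotal"] * 100


def observability_percent(m: dict) -> float:
    # Percent = (MemTotal - MemFree - Buffers - Cached
    #            - SReclaimable + Shmem) / MemTotal
    used = (m["MemTotal"] - m["MemFree"] - m["Buffers"] - m["Cached"]
            - m["SReclaimable"] + m["Shmem"])
    return used / m["MemTotal"] * 100


SAMPLE = """\
MemTotal: 1721620 kB
MemFree: 111260 kB
MemAvailable: 349964 kB
Buffers: 59624 kB
Cached: 570612 kB
Shmem: 269980 kB
SReclaimable: 46308 kB
"""

m = parse_meminfo(SAMPLE)
print(f"{injection_tool_percent(m):.2f}")  # 79.67
print(f"{observability_percent(m):.2f}")   # 69.92
```

To run it against a live system, pass the contents of /proc/meminfo to `parse_meminfo` instead of `SAMPLE`.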
4. Execute the high disk utilization action, then log in to the machine and use the df command to check that the disk has reached the specified utilization. Execute a rollback action to restore the normal state.
In fault
After rollback
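The same check can be scripted rather than read from df output. The sketch below is an approximation only: the function name `disk_usage_percent` is hypothetical, and it assumes that df's usage percentage is based on used / (used + available), so the result may differ slightly from df because of rounding.

```python
import shutil


def disk_usage_percent(path: str) -> float:
    """Return the approximate usage percentage of the disk holding `path`."""
    usage = shutil.disk_usage(path)
    # `free` is the space available to the caller, so used + free can be
    # smaller than `total` on file systems that reserve blocks for root.
    denominator = usage.used + usage.free
    return usage.used / denominator * 100 if denominator else 0.0
```

Running `disk_usage_percent` on the configured disk directory during the fault and again after rollback should show the value rise to the specified utilization and then fall back.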
5. Execute the disk IO high load action, then go to the terminal and use the iostat command to observe the disk IO load.
In fault
After rollback