Question 1: Agent Resource Utilization
CVM Agent
The fault injection agent for CVM is an executable pre-installed on the CVM host in the /data/cfg/chaos-executor directory. When you select certain fault types, you are required to install the fault injection agent, and the agent program is executed whenever a fault injection is performed. The agent occupies less than 1 MB of disk space. During network fault injection, its CPU and memory usage does not exceed 1% of the system's resources. In memory and CPU stress scenarios, resource usage is roughly equal to the configured stress test target values.
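If you want to double-check this footprint on a host, a minimal sketch is shown below. The directory path is taken from this document; the process name filter is only an assumption and may need to be adjusted for your environment.
# Disk usage of the agent directory (documented above as under 1 MB)
du -sh /data/cfg/chaos-executor
# Rough CPU/memory usage of the agent while a fault is being injected.
# The "chaos" name filter is an assumption; adjust it to the actual
# executable name on your host.
ps aux | grep -i chaos | grep -v grep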
Container Agent
After the fault injection agent for containers is installed, the following resources will be created in the cluster (see the verification sketch after this list):
1. Namespace: tchaos
2. ClusterRole: chaosmonkey, with the following rules, which grant the agent operator the corresponding permissions on the Kubernetes API:
rules:
- apiGroups:
- ""
resources:
- namespaces
- nodes
verbs:
- get
- list
- apiGroups:
- ""
resources:
- pods
verbs:
- get
- list
- update
- delete
- create
- patch
- apiGroups:
- ""
resources:
- pods/exec
verbs:
- create
3. ServiceAccount: chaosmonkey. Note that it is under the tchaos namespace.
4. ClusterRoleBinding: chaosmonkey, which binds the ClusterRole to the ServiceAccount.
5. Operator: A Deployment named chaos-operator is started in the tchaos namespace with a replica count of 1. The Pod uses the chaosmonkey ServiceAccount created in the previous step, and its resource limit is 1 CPU core and 2 GB of memory. After the agent is installed, the chaos-operator keeps running and consuming cluster resources. To control costs, uninstall the agent promptly after the fault injection is completed.
6. During fault injection, the operator temporarily creates a helper pod on the target node to inject the fault. The helper pod is not intrusive to the target pod (it is not a sidecar). In addition, to reach the specified stress levels, no resource limits are imposed on the helper pod. When the fault is recovered, the temporary helper pod is automatically deleted.
7. The fault injection logs and experiment records of the helper pod are saved in the /var/log/chaos directory on the node, typically occupying less than 10 KB.
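If you want to confirm that these resources exist after installation, a minimal verification sketch is shown below. It assumes kubectl is configured for the target cluster and uses only the resource names listed in this document (the ClusterRoleBinding name comes from the manual uninstallation list in Question 3).
# Resources created when the container agent is installed
kubectl get namespace tchaos
kubectl get clusterrole chaosmonkey
kubectl get clusterrolebinding chaosmonkey
kubectl get serviceaccount chaosmonkey -n tchaos
kubectl get deployment chaos-operator -n tchaos
# Helper pod fault injection logs (run this on the node itself)
ls -lh /var/log/chaos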
Note:
The fault injection log data of the helper pod is not deleted when the agent is uninstalled. If necessary, delete the logs manually.
Question 2: Detection of Abnormal Agent Status
Example
Agent detection failed and an abnormal agent status was found. Please see the article "Agent Issues FAQ" for solutions. Specific error message: NotTriggerScaleUp: 1 occurrence in the last 10 minutes.
Solutions
Check the chaos-operator Deployment workload in the tchaos namespace and verify whether its Pod has started. If it has not, check the events and logs for abnormal messages. The table below lists event types that may prevent the Pod from starting, along with the corresponding solutions; a command sketch for inspecting these events follows the table.
| Event Type | Solution |
| --- | --- |
| OutOfMemory or OutOfCPU | Check whether the cluster has sufficient resources to run the agent. You may need to increase the cluster's resources or adjust other workloads to free up resources. |
| InsufficientStorage | Check whether the cluster has sufficient storage space to run the agent. You may need to increase storage capacity or clear out unnecessary data to free up space. |
| FailedScheduling | This may be because no node in the cluster can meet the Pod's scheduling requirements. Check the Pod's scheduling constraints, as well as the status and labels of the nodes in the cluster. |
| CrashLoopBackOff or Error | This may be caused by a program error or a configuration issue with the agent. View the Pod's logs for more detailed information and troubleshoot based on the error messages found there. |
| ImagePullBackOff | This may be because the image cannot be pulled from the image repository. Check whether your image repository address and credentials are correct, and ensure that the network connection is working properly. |
| NotTriggerScaleUp | This may be because the cluster's auto scaling policy was not triggered. Check the cluster's auto scaling configuration to ensure it can correctly trigger scale-out when needed. |
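A minimal inspection sketch is shown below, assuming kubectl access to the cluster; replace <pod-name> with the actual Pod name returned by the first command.
# Check whether the chaos-operator Pod has started
kubectl get pods -n tchaos
# Inspect events on the Deployment and its Pod for the event types above
kubectl describe deployment chaos-operator -n tchaos
kubectl describe pod <pod-name> -n tchaos
# View the Pod's logs for detailed error messages
kubectl logs deployment/chaos-operator -n tchaos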
Question 3: An Agent that Cannot Be Automatically Uninstalled Has Been Detected. You Need to Manually Uninstall it First.
Example
The agent installation failed, and an agent that cannot be automatically uninstalled has been detected. You need to manually uninstall it first. For more details, see the documentation Agent Issues FAQ.
Solutions
In this case, you need to manually delete the following Kubernetes resources; a command sketch follows the list:
clusterrole: chaosmonkey
clusterrolebinding: chaosmonkey
serviceaccount: chaosmonkey (located in the tchaos namespace)
namespace: tchaos
deployment: chaos-operator (located in the tchaos namespace)
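A minimal cleanup sketch is shown below, assuming kubectl access to the cluster. Deleting the tchaos namespace also removes the namespaced resources inside it (the serviceaccount and the deployment), so only the cluster-scoped resources need separate commands.
# Cluster-scoped resources
kubectl delete clusterrolebinding chaosmonkey
kubectl delete clusterrole chaosmonkey
# Deleting the namespace also removes the serviceaccount and the
# operator deployment inside it
kubectl delete namespace tchaos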
Note:
After manually uninstalling the agent, you do not need to install a new agent by hand. Instead, go to the Agent Management page to reinstall it.
After uninstalling the agent, ensure that your cluster is in a normal state so that the new agent can be installed smoothly. If you encounter any issues during installation, review the relevant logs and troubleshoot based on the error messages found there.