Agent FAQ

Last updated: 2024-09-26 15:34:19

    Question 1: Agent Resource Utilization

    CVM Agent

    The fault injection agent for CVM is a pre-installed executable on the CVM host (located in the /data/cfg/chaos-executor directory). When you select certain fault types, you will be required to install the fault injection agent; the program is then executed whenever a fault injection is performed. The agent occupies less than 1 MB of disk space. During network fault injection, CPU and memory usage does not exceed 1% of the system's resources. In scenarios involving memory and CPU stress, resource usage is roughly equivalent to the configured stress test target values.
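    You can verify the agent's footprint on a CVM host with standard Linux commands. A minimal sketch, assuming the directory path given above (the executable names inside it may vary by version):

    # Disk space used by the pre-installed agent (expected to be under 1 MB)
    du -sh /data/cfg/chaos-executor
    # CPU and memory usage of the agent process during an experiment
    # (the "chaos" process name is an assumption; check the actual executable name in the directory)
    top -b -n 1 | grep chaos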

    Container Agent

    After the fault injection agent for containers is installed, the following resources will be created in the cluster:
    1. Namespace: tchaos
    2. ClusterRole: chaosmonkey, with the following rules, which grant the operator the corresponding permissions on the Kubernetes API:
    rules:
    - apiGroups:
      - ""
      resources:
      - namespaces
      - nodes
      verbs:
      - get
      - list
    - apiGroups:
      - ""
      resources:
      - pods
      verbs:
      - get
      - list
      - update
      - delete
      - create
      - patch
    - apiGroups:
      - ""
      resources:
      - pods/exec
      verbs:
      - create
    3. ServiceAccount: chaosmonkey, created in the tchaos namespace.
    4. ClusterRoleBinding: binds the ClusterRole to the ServiceAccount.
    5. Operator: a deployment named chaos-operator is started in the tchaos namespace with a replica count of 1. The pod uses the chaosmonkey ServiceAccount created in the previous step, with a resource limit of 1 CPU core and 2 GB of memory. After the agent is installed, chaos-operator keeps running and consumes cluster resources, so to control costs, uninstall the agent promptly after the fault injection is completed.
    6. During fault injection, the operator temporarily creates a helper pod on the target node to inject the fault. The helper pod is not intrusive to the target pod (it is not a sidecar). To achieve the specified stress scenario, no resource limits are imposed on the helper pod. When the fault is recovered, the temporary helper pod is deleted automatically.
    7. The fault injection logs and experiment records of the helper pod are saved in the /var/log/chaos directory on the node, typically occupying less than 10 KB.
    Note:
    The fault injection log data of the helper pod is not deleted when the agent is uninstalled. If necessary, delete the logs manually.
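    To confirm that the resources listed above were created, you can query them with kubectl. A minimal sketch using the names from this list (the ClusterRoleBinding is assumed to share the chaosmonkey name; adjust if it differs in your cluster):

    kubectl get namespace tchaos
    kubectl get clusterrole chaosmonkey
    kubectl get clusterrolebinding chaosmonkey
    kubectl -n tchaos get serviceaccount chaosmonkey
    kubectl -n tchaos get deployment chaos-operator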

    Question 2: Detection of Abnormal Agent Status

    Example

    The agent health check failed, and an abnormal agent status was found. See the article "Agent Issues FAQ" for solutions. Specific error message: NotTriggerScaleUp:1 occurrences in the last 10 minutes.

    Solutions

    Check the chaos-operator deployment in the tchaos namespace and verify whether its Pod has started. If it has not, check the events and logs for abnormal messages. The following event types may prevent the Pod from starting, along with corresponding solutions (a command sketch for these checks follows the list):
    OutOfMemory or OutOfCPU: Check whether the cluster has sufficient resources to run the agent. You may need to increase the cluster's resources or adjust other workloads to free up resources.
    InsufficientStorage: Check whether the cluster has sufficient storage space to run the agent. You may need to increase storage capacity or clear out unnecessary data to free up space.
    FailedScheduling: The cluster may lack nodes that meet the Pod's scheduling requirements. Check the Pod's scheduling constraints, as well as the status and labels of the nodes in the cluster.
    CrashLoopBackOff or Error: This may be caused by a program error or a configuration issue in the agent. View the Pod's logs for more detailed information and troubleshoot based on the error messages found there.
    ImagePullBackOff: The image could not be pulled from the image repository. Check that your image repository address and credentials are correct, and ensure the network connection is functioning properly.
    NotTriggerScaleUp: The cluster's auto scaling policy was not triggered. Check the auto scaling configuration of your cluster to ensure it can correctly trigger scale-out when needed.
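    A minimal command sketch for these checks, using standard kubectl commands (replace <pod-name> with the name printed by the first command):

    # List the operator pod and its current status
    kubectl -n tchaos get pods
    # Show the events (OutOfMemory, FailedScheduling, ImagePullBackOff, ...) explaining why the pod has not started
    kubectl -n tchaos describe pod <pod-name>
    # View the pod's logs for CrashLoopBackOff or Error cases
    # (--previous shows the last crashed container, if one exists)
    kubectl -n tchaos logs <pod-name> --previous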

    Question 3: An Agent That Cannot Be Automatically Uninstalled Has Been Detected and Must Be Manually Uninstalled First

    Example

    The agent installation failed because an agent that cannot be automatically uninstalled was detected. You need to manually uninstall it first. For more details, see the documentation Agent Issues FAQ.

    Solutions

    In this case, you need to manually delete the following Kubernetes resources (a command sketch follows the list):
    clusterrole: chaosmonkey
    clusterrolebinding: chaosmonkey
    serviceaccount: chaosmonkey (located in the tchaos namespace)
    namespace: tchaos
    deployment: chaos-operator (located in the tchaos namespace)
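    A minimal kubectl sketch for this cleanup. Deleting the tchaos namespace also removes the namespaced ServiceAccount and Deployment inside it:

    # Cluster-scoped resources
    kubectl delete clusterrolebinding chaosmonkey
    kubectl delete clusterrole chaosmonkey
    # Removes the namespace together with the serviceaccount and deployment it contains
    kubectl delete namespace tchaos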
    Note:
    After manually uninstalling the agent, do not install a new one manually. Instead, go to the Agent Management page to install it.
    After uninstalling the agent, ensure that your cluster is in a normal status so that the new agent can be installed smoothly. If you encounter any issues during installation, review the relevant logs for more detailed information and troubleshoot based on the error messages found there.
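    One way to confirm the cluster is ready before reinstalling (standard kubectl commands):

    # Nodes should all be Ready
    kubectl get nodes
    # Expected to return NotFound after a successful cleanup
    kubectl get namespace tchaos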
    