FAQs: TSA - Chaotic Fault Generator

Product Feature Issues
What Is TSA - Chaotic Fault Generator (CFG)?
CFG is a product within the TSA family. It provides efficient, convenient, and secure fault simulation services, along with core features such as industry templates and monitoring guardrails. These features help you identify potential disaster recovery risks in your business, validate the effectiveness of high-availability plans, and ultimately improve the availability and resilience of your systems.
How to Use CFG?
Step 1: Create an experiment
Log in to the TSA > Chaos Engineering Console, go to the experiment management page, and click Create a New Experiment. You can either start from a template in the experiment template library (the fault action orchestration is filled in automatically, so you only need to specify the target instances) or create a blank experiment and configure the fault actions yourself. You can also pre-configure business monitoring metrics, security guardrails, and alarms.
Step 2: Execute the experiment
On the experiment details page, click Execute in the upper right corner to start the experiment process. Click the run button within each action group to execute the fault injection and recovery actions one by one. During the experiment, you can see whether each action succeeded or failed, which makes the blast radius easier to understand and control. You can also monitor business steady-state metrics in real time and Pause or Continue the experiment as needed.
Step 3: End the experiment
After all experiment actions have been executed, click End Experiment in the upper right corner of the experiment details page and fill in the Experiment Results. Then, click Download Experiment Report to generate an experiment report that comprehensively records all experiment details, making it easier to conduct fault analysis and review the experiment.
Step 4: Build a custom template library
For experiment tasks that need to be conducted frequently, you can save them as custom templates from the experiment list with a single click, making it easy to quickly reuse them in future experiments. In the Experience Library Management page, you can also deactivate or activate these templates as needed.
Note:
For more operation guides, please see Quick Start.
What Is the Template Library?
To help users quickly reuse proven experiment schemes, CFG provides experiment templates for industries such as e-commerce, gaming, and multimedia. The templates cover typical use cases, including cross-availability zone disaster recovery, hybrid cloud disaster recovery, service stress experiments, and network fault experiments. When creating a new experiment, users can browse the industry template library in the first step, under Experience Selection. After a user clicks Go to use, the experiment information and action orchestration scheme from the template are automatically populated into the creation form. Users then only need to select instance resources to create the experiment.
What Object Types Does CFG Support for Fault Injection?
CFG supports fault injection on object types such as Tencent Cloud CVM, TKE, MySQL, Redis, NAT, CLB, dedicated lines, and audio/video services to test system availability.
Does CFG Support Private Cloud Environments?
Currently, CFG supports only public cloud environments. Private cloud environments are not supported.
Action Execution Issues
How to Handle the Abnormal Instance Lock Mechanism?
Action execution failed, prompting that the current instance is being injected with faults by other actions
When faults of the same type are injected into the same instance at the same time, their effects overlap and the instance's behavior can no longer be observed accurately. To prevent this, the platform implements an instance lock mechanism: while one fault action holds the lock on an instance, other actions that compete for the same lock are rejected. Because some actions do not interfere with each other, the platform distinguishes lock types by action type instead of using a single lock per instance. For example, a CPU-related fault action and a memory-related fault action can run on the same instance simultaneously, but two actions of the same type cannot.
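The per-type locking described above can be pictured with a small sketch. This is an illustration only, not the platform's implementation: the class name InstanceLockTable, the lock type strings "cpu" and "memory", and the instance ID "ins-example" are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class InstanceLockTable:
    """Tracks which lock types are currently held on each instance (hypothetical model)."""

    # Maps instance ID -> set of lock types held by running fault actions.
    held: dict = field(default_factory=dict)

    def try_acquire(self, instance_id: str, lock_type: str) -> bool:
        """Take the lock and return True unless an action of this type is already running."""
        types = self.held.setdefault(instance_id, set())
        if lock_type in types:
            return False  # a same-type action already holds this lock
        types.add(lock_type)
        return True

    def release(self, instance_id: str, lock_type: str) -> None:
        """Release the lock once the fault action has been recovered."""
        self.held.get(instance_id, set()).discard(lock_type)


locks = InstanceLockTable()
cpu_first = locks.try_acquire("ins-example", "cpu")      # acquired
mem_first = locks.try_acquire("ins-example", "memory")   # different type, also acquired
cpu_second = locks.try_acquire("ins-example", "cpu")     # rejected: same type, same instance
locks.release("ins-example", "cpu")
cpu_retry = locks.try_acquire("ins-example", "cpu")      # acquired again after release
```

In this model, the rejected second CPU action corresponds to the prompt that the instance is being injected with faults by other actions.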
What Are Common Issues with CVM?
Network failure action execution failed, prompting the message Error: Exclusivity flag on, cannot modify.
The tc (Linux traffic control) rules issued for this fault conflict with rules that already exist on the instance, so the existing rules cannot be overwritten. Check whether a previous experiment action was not properly recovered, or modify the configuration parameters to forcibly overwrite the existing rules.
Failed to execute IO Hang action, prompting that the operating system acquisition failed
The instance's operating system does not support this action. Currently, the platform supports the following systems: CentOS 7.2 and later, Debian 8.2 and later, Ubuntu 16.04 and later, and TencentOS.
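Before running the IO Hang action, the version list above can be turned into a quick local check. The following is a minimal sketch, not a platform tool: the function names are hypothetical, the Ubuntu minimum is assumed to mean release 16.04, and TencentOS is assumed to be accepted at any version because the list gives no minimum for it.

```python
import re

# Minimum (major, minor) versions from the supported list; None means any version.
MIN_VERSIONS = {
    "centos": (7, 2),
    "debian": (8, 2),
    "ubuntu": (16, 4),   # assumption: the minimum is Ubuntu 16.04
    "tencentos": None,   # assumption: no minimum version is stated
}


def parse_version(text):
    """Return the first two numeric components of a version string as a tuple."""
    parts = [int(p) for p in re.findall(r"\d+", text)[:2]]
    while len(parts) < 2:
        parts.append(0)  # treat a missing minor version as 0
    return tuple(parts)


def is_supported(distro, version):
    """Check a distribution name and version string against MIN_VERSIONS."""
    key = distro.strip().lower()
    if key not in MIN_VERSIONS:
        return False
    minimum = MIN_VERSIONS[key]
    if minimum is None:
        return True
    return parse_version(version) >= minimum


print(is_supported("CentOS", "7.9"))
print(is_supported("Ubuntu", "14.04"))
```

The distribution name and version would typically come from /etc/os-release on the instance.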
What Are Common Issues with Redis?
Failed to execute primary-secondary switchover action, prompting that the instance has no cross-AZ replicas and the primary-secondary switchover cannot be executed
The instance has been upgraded to support cross-availability zone deployment, but it has no nodes in other availability zones, so Redis cannot perform the primary-secondary switchover. Go to the Redis instance details page and add replicas in other availability zones before you simulate the switchover.
Agent FAQ
Question 1: Agent Resource Utilization
CVM Agent
The fault injection agent for CVM is an executable pre-installed on the CVM host (located in the /data/cfg/chaos-executor directory). When you select a specific fault, you are required to install the agent first. The program is then executed each time a fault is injected. The agent occupies less than 1 MB of disk space. During network fault injection, its CPU and memory usage does not exceed 1% of the system's resources. In memory and CPU stress scenarios, resource usage is roughly equal to the configured stress test target values.
Container Agent
After the fault injection agent for containers is installed, the following resources will be created in the cluster:
1. Namespace: tchaos
2. ClusterRole: chaosmonkey, with the following rules. These rules grant the agent operator the corresponding permissions on the Kubernetes API.
rules:
- apiGroups:
  - ""
  resources:
  - namespaces
  - nodes
  verbs:
  - get
  - list
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
  - list
  - update
  - delete
  - create
  - patch
- apiGroups:
  - ""
  resources:
  - pods/exec
  verbs:
  - create
3. ServiceAccount: chaosmonkey. Note that it is under the tchaos namespace.
4. ClusterRoleBinding: Binds the ClusterRole to the ServiceAccount.
5. Operator: A deployment named chaos-operator is started in the tchaos namespace with a replica count of 1. The pod uses the chaosmonkey ServiceAccount created in step 3. Its resource limit is 1 CPU core and 2 GB of memory. After the agent is installed, the chaos-operator keeps running and consumes cluster resources. To control costs, uninstall the agent promptly after fault injection is completed.
6. During fault injection, the operator temporarily creates a helperpod on the target node to inject the fault. The helperpod is not intrusive to the target pod (it is not a sidecar). To reach the specified stress level, no resource limits are imposed on the helperpod. When the fault is recovered, the temporary helperpod is automatically deleted.
7. The fault injection logs and experiment records of the helperpod will be saved in the /var/log/chaos directory on the node, typically occupying less than 10 KB.
Note: 
The fault injection log data of the helperpod will not be deleted when the agent is uninstalled. If necessary, please delete the logs manually.
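To make the scope of the chaosmonkey ClusterRole easier to review, the rules listed in item 2 can be expressed as data and queried. This is a simplified sketch: it copies the resources and verbs from the rules above, ignores apiGroups (all three rules use the core group), and does not model wildcards or any other part of Kubernetes RBAC evaluation. The names RULES and is_allowed are hypothetical.

```python
# Resources and verbs copied from the chaosmonkey ClusterRole rules.
RULES = [
    {"resources": ["namespaces", "nodes"], "verbs": ["get", "list"]},
    {
        "resources": ["pods"],
        "verbs": ["get", "list", "update", "delete", "create", "patch"],
    },
    {"resources": ["pods/exec"], "verbs": ["create"]},
]


def is_allowed(resource, verb):
    """Return True if any rule grants the verb on the resource."""
    return any(
        resource in rule["resources"] and verb in rule["verbs"] for rule in RULES
    )


print(is_allowed("pods", "delete"))      # the operator may delete pods
print(is_allowed("nodes", "delete"))     # nodes are read-only for the operator
print(is_allowed("pods/exec", "create")) # the operator may exec into pods
```

A check like this shows at a glance that the role can read namespaces and nodes, manage pods, and exec into pods, and nothing else.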
Question 2: Detection of Abnormal Agent Status
Example
The agent failed to detect properly, and an abnormal agent status was found. Please see the article "Agent Issues FAQ" for solutions. Specific error message: NotTriggerScaleUp:1 occurrences in the last 10 minutes.
Solutions
Check the chaos-operator deployment in the tchaos namespace and verify whether its Pod has started. If it has not started, check the Pod's events and logs for abnormal messages. The following event types may prevent the Pod from starting. Each is listed with its corresponding solution:
Event Type	Solutions
OutOfMemory or OutOfCPU	Check whether the cluster has sufficient resources to run the agent. You may need to increase the cluster's resources or adjust other workloads to free up resources.
InsufficientStorage	Check whether the cluster has sufficient storage space to run the agent. You may need to increase storage capacity or clear out unnecessary data to free up storage space.
FailedScheduling	No node in the cluster may meet the Pod's scheduling requirements. Check the Pod's scheduling constraints, as well as the status and labels of the nodes in the cluster.
CrashLoopBackOff or Error	The agent may have a program error or a configuration issue. View the Pod's logs for details and troubleshoot based on the error messages found there.
ImagePullBackOff	The image could not be pulled from the image repository. Check that your image repository address and credentials are correct and that the network connection is working.
NotTriggerScaleUp	The cluster's automatic scaling policy was not triggered. Check the automatic scaling policy configuration of your cluster to ensure it can trigger scale-out when needed.
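When the error message follows the pattern shown in the example ("NotTriggerScaleUp:1 occurrences in the last 10 minutes"), the event type can be extracted and matched against the table. The sketch below assumes that message format holds in general, which the source does not confirm. The hint texts are shortened summaries of the table, and the names HINTS, parse_agent_event, and hint_for are hypothetical.

```python
import re

# Shortened summaries of the solutions listed in the table.
HINTS = {
    "OutOfMemory": "Check cluster CPU and memory capacity.",
    "OutOfCPU": "Check cluster CPU and memory capacity.",
    "InsufficientStorage": "Check cluster storage capacity.",
    "FailedScheduling": "Check Pod scheduling constraints and node status.",
    "CrashLoopBackOff": "Check the Pod logs for program or configuration errors.",
    "Error": "Check the Pod logs for program or configuration errors.",
    "ImagePullBackOff": "Check the image repository address, credentials, and network.",
    "NotTriggerScaleUp": "Check the cluster's automatic scaling policy.",
}


def parse_agent_event(message):
    """Extract (event_type, count) from a message like 'Foo:3 occurrences ...'."""
    match = re.search(r"([A-Za-z]+):(\d+) occurrences", message)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def hint_for(message):
    """Return a troubleshooting hint for the event type found in the message."""
    parsed = parse_agent_event(message)
    if parsed is None:
        return "No event type found in the message."
    return HINTS.get(parsed[0], "Unknown event type; check the Pod events and logs.")


sample = (
    "Specific error message: "
    "NotTriggerScaleUp:1 occurrences in the last 10 minutes."
)
print(parse_agent_event(sample))
print(hint_for(sample))
```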
Question 3: An Agent that Cannot Be Automatically Uninstalled Has Been Detected. You Need to Manually Uninstall it First.
Example
The agent installation failed, and a probe that cannot be automatically uninstalled has been detected. You need to manually uninstall it first. For more details, see the documentation Agent Issues FAQ.
Solutions
In this case, you need to manually delete the following Kubernetes resources:
clusterrole: chaosmonkey
clusterrolebinding: chaosmonkey
serviceaccount: chaosmonkey (located in the tchaos namespace)
namespace: tchaos
deployment: cloudchaos-operator (located in the tchaos namespace)
Note: 
After manually uninstalling the agent, you do not need to install a new one by hand. Go to the Agent Management page and install it from there.
Before reinstalling, make sure that your cluster is in a normal state so that the new agent can be installed smoothly. If you encounter issues during installation, review the relevant logs and troubleshoot based on the error messages found there.
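The manual cleanup listed above can be scripted. The following sketch only builds and prints the kubectl delete commands and does not run them. It assumes kubectl is configured for the target cluster, uses the resource names exactly as listed above, and picks a deletion order (namespaced objects first, namespace last) that is an assumption rather than something the source prescribes.

```python
NAMESPACE = "tchaos"

# (kind, name, namespaced) in an assumed deletion order: namespaced objects
# first, then cluster-scoped RBAC objects, and the namespace itself last.
RESOURCES = [
    ("deployment", "cloudchaos-operator", True),
    ("serviceaccount", "chaosmonkey", True),
    ("clusterrolebinding", "chaosmonkey", False),
    ("clusterrole", "chaosmonkey", False),
    ("namespace", NAMESPACE, False),
]


def build_delete_commands():
    """Build one kubectl delete command (as an argument list) per resource."""
    commands = []
    for kind, name, namespaced in RESOURCES:
        # --ignore-not-found lets the command succeed if the object is already gone.
        cmd = ["kubectl", "delete", kind, name, "--ignore-not-found"]
        if namespaced:
            cmd += ["-n", NAMESPACE]
        commands.append(cmd)
    return commands


commands = build_delete_commands()
for cmd in commands:
    print(" ".join(cmd))
```

Review the printed commands before running them, for example by passing each argument list to subprocess.run.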
