Background
Container nodes (such as worker nodes in a Kubernetes cluster) host container resources and are responsible for running and managing container instances. However, a container node can suffer hardware faults, resource shortages, or network faults, any of which can prevent its container instances from operating correctly.
Node fault experiments help improve the reliability and stability of container services. They let you verify whether the system keeps operating normally when a container node fails, and they uncover potential issues early so that you can optimize the system architecture and prepare emergency plans.
Experiment Execution
Step 1: Experiment Preparation
Create a container node, add instances, and deploy the test service. If a container node is already available for the experiment, proceed directly to creating the experiment.
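Before creating the experiment, it can help to record which Pods of the test service run on the target node, so that you have a baseline to compare against later. The following is a minimal sketch, not part of the console workflow: it assumes `kubectl` is installed and configured for the cluster, and the function names `pods_on_node` and `fetch_pods` are hypothetical helpers introduced only for illustration.

```python
import json
import subprocess


def pods_on_node(pod_list, node_name):
    """Return sorted (namespace, name, phase) tuples for Pods scheduled on node_name.

    pod_list is the parsed JSON object printed by
    `kubectl get pods --all-namespaces -o json`.
    """
    result = []
    for pod in pod_list.get("items", []):
        if pod.get("spec", {}).get("nodeName") == node_name:
            meta = pod.get("metadata", {})
            phase = pod.get("status", {}).get("phase", "Unknown")
            result.append((meta.get("namespace", ""), meta.get("name", ""), phase))
    return sorted(result)


def fetch_pods():
    """Assumption: kubectl is on PATH and points at the experiment cluster."""
    completed = subprocess.run(
        ["kubectl", "get", "pods", "--all-namespaces", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(completed.stdout)
```

Calling `pods_on_node(fetch_pods(), "<node-name>")` with the name of your experiment node returns the baseline list. Parsing is kept separate from the `kubectl` call so that the filtering logic can be checked against saved output.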
Step 2: Create an Experiment
2. Click Skip and create a blank experiment, and fill in the experiment details.
3. Select Container as the instance type, and select Standard Cluster Node as the instance object, then click Add Instance.
4. Click Add Now to add a fault action.
5. Select the fault action Node Operation - Node Shutdown.
6. Set the action parameters and click Confirm.
7. After configuring the action parameters, click Next. Configure Guardrail Policy and Monitoring Metrics according to your actual situation, then click Submit to complete the experiment creation.
Step 3: Execute the Experiment
1. View the node status before executing the fault.
2. Go to the experiment details page and click Go to the action group for execution.
3. Click Execute to start the experiment.
4. Click the Action Card and check the details of the action execution.
5. View the execution logs to confirm that the action was executed successfully.
6. View the node status after the fault is executed. The node is now in an abnormal status and the Pods on the cluster node are also running abnormally, which indicates that the fault injection was successful.
7. Execute the recovery actions, then view the execution logs to confirm that they were successful.
8. After the fault recovery action succeeds, view the status of the cluster node. The node is operating normally and the Pods on it are functioning properly, which indicates that the fault has been resolved.
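The node status checks before the fault, after the fault, and after recovery can also be done from the command line. The following is a minimal sketch under the assumption that `kubectl` is installed and configured for the cluster; `node_ready_status`, `describe_status`, and `fetch_node` are hypothetical helper names, not part of the product. It reads the node's `Ready` condition, which Kubernetes reports as `True`, `False`, or `Unknown`.

```python
import json
import subprocess


def node_ready_status(node):
    """Return the status of the node's Ready condition.

    node is the parsed JSON object printed by `kubectl get node <name> -o json`.
    Returns "Unknown" if the Ready condition is not present.
    """
    for condition in node.get("status", {}).get("conditions", []):
        if condition.get("type") == "Ready":
            return condition.get("status", "Unknown")
    return "Unknown"


def describe_status(ready_status):
    """Map the Ready condition to the normal/abnormal wording used in the steps."""
    return "normal" if ready_status == "True" else "abnormal"


def fetch_node(node_name):
    """Assumption: kubectl is on PATH and points at the experiment cluster."""
    completed = subprocess.run(
        ["kubectl", "get", "node", node_name, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(completed.stdout)
```

Running `describe_status(node_ready_status(fetch_node("<node-name>")))` at each checkpoint is expected to give `normal` before the fault, `abnormal` once the node has shut down, and `normal` again after recovery. The control plane may take some time to notice that a node has stopped reporting, so the abnormal status may not appear immediately after the fault action runs.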