Experiment on Container Resource Node Faults

Background
Container nodes (such as worker nodes in a Kubernetes cluster) host container resources and are responsible for running and managing container instances. However, container nodes may encounter hardware faults, resource shortages, network faults, etc., which could lead to container instances not operating correctly.
To enhance the reliability and stability of container services, node fault experiments are needed. Through these experiments, you can verify whether the system can operate normally in the event of container node faults, and uncover potential issues in advance for system architecture optimization and emergency planning.
Experiment Execution
Step 1: Experiment Preparation
Create a container node, add instances, and deploy the test service. If a container node is already available for the experiment, proceed directly to Step 2.
Go to the Agent Management page and install the agent.
Step 2: Create an Experiment
1. Log in to Tencent Smart Advisor > Chaotic Fault Generator, go to the Experiment Management page, and click Create a New Experiment.
2. Click Skip and create a blank experiment, and fill in the experiment details.
3. Select Container as the instance type, and select Standard Cluster Node as the instance object, then click Add Instance.
4. Click Add Now to add a fault action.
5. Select the fault action Node Operation - Node Shutdown.
6. Set action parameters and click Confirm.
7. After configuring the action parameters, click Next. Configure the Guardrail Policy and Monitoring Metrics according to your actual situation, then click Submit to complete the experiment creation.
Step 3: Execute the Experiment
1. View the node status before executing the fault.
2. Go to the experiment details page and click Go to the action group for execution.
3. Click Execute to start the experiment.
4. Click the Action Card to check the details of the action execution.
5. View the execution logs to confirm that the action was executed successfully.
6. View the node status after the fault is executed. The node is now in an abnormal state, which indicates that the fault injection was successful. The Pods on the cluster node are also running abnormally.
7. Execute the recovery actions, view the execution logs, and confirm that the recovery actions were successful.
8. After successful execution of the fault recovery action, view the status of the cluster node. You can see that the node is operating normally, and the Pods under the cluster node are also functioning properly, indicating that the fault has been successfully resolved.
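The node and Pod status checks in steps 1, 6, and 8 can also be done from a terminal. The following is a minimal sketch, not part of the product workflow. It assumes you have kubectl access to the cluster, which this guide does not require. The helper names and the node name `node-b` are hypothetical. It reads the Ready condition of nodes and Pods from `kubectl get ... -o json` output, so you can compare a snapshot taken before the fault with one taken after the fault and after recovery.

```python
import json
import subprocess


def ready_condition(obj: dict) -> str:
    """Return the status of the "Ready" condition ("True", "False", or "Unknown")."""
    for cond in obj.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            return cond.get("status", "Unknown")
    # No Ready condition reported yet (for example, an unscheduled Pod).
    return "Unknown"


def node_ready_status(nodes_json: dict) -> dict:
    """Map each node name to the status of its Ready condition."""
    return {
        node["metadata"]["name"]: ready_condition(node)
        for node in nodes_json.get("items", [])
    }


def pods_ready_on_node(pods_json: dict, node_name: str) -> dict:
    """Map "namespace/name" to Ready status for Pods scheduled on node_name."""
    result = {}
    for pod in pods_json.get("items", []):
        if pod.get("spec", {}).get("nodeName") != node_name:
            continue
        key = f'{pod["metadata"]["namespace"]}/{pod["metadata"]["name"]}'
        result[key] = ready_condition(pod)
    return result


def kubectl_json(*args: str) -> dict:
    """Run kubectl with JSON output and parse it (assumes kubectl is configured)."""
    completed = subprocess.run(
        ["kubectl", *args, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(completed.stdout)


def snapshot(node_name: str) -> dict:
    """Collect node and Pod readiness for one node; call before and after the fault."""
    nodes = kubectl_json("get", "nodes")
    pods = kubectl_json("get", "pods", "--all-namespaces")
    return {
        "node": node_ready_status(nodes).get(node_name, "Unknown"),
        "pods": pods_ready_on_node(pods, node_name),
    }
```

Calling `snapshot("node-b")` before the fault, after the fault, and after recovery gives three dictionaries that can be compared directly. After a node shutdown, the node's Ready status typically changes to "Unknown" or "False" only after a delay, so allow some time before taking the second snapshot.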