Standard Cluster and Serverless Cluster Super Node Faults

Last updated: 2024-09-26 15:47:38

    Background

    Super nodes in container services, compared to regular nodes, support rapid auto scaling. They consolidate what would have been multiple nodes into a single node for management, simplifying the administration of container resources. However, container nodes may encounter hardware faults, resource shortages, network faults, etc., which could lead to container instances not operating correctly.
    Through these experiments, you can verify whether the system can operate normally in the event of container node faults, and uncover potential issues in advance for system architecture optimization and emergency planning.

    Experiment Execution

    Step 1: Experiment Preparation

    Purchase standard cluster container instances and super nodes, and deploy test services.
    Purchase Serverless cluster container instances with built-in super nodes and deploy test services.

    Step 2: Create an Experiment

    1. Log in to Tencent Smart Advisor > Chaotic Fault Generator, enter the Experiment Management page, and click Create a New Experiment.
    2. Click Skip and create a blank experiment to start from a blank experiment.
    3. Fill in the basic information for the experiment, add related resource tags as needed, then click Next.
    4. Create two action groups, select Container as the Resource Type, and add Standard Cluster Super Node and Serverless Cluster Super Node Resource Objects respectively.
    5. Find the experiment instances and add them to the action groups.
    6. Click Add Now to add experiment actions, then select fault actions: Node lockdown and Node drain (Node Eviction).
    7. For global configuration, confirm the Experiment Action Group, select the Experiment Execution Method and configure the Guardrail Policies.
    8. Click Submit to complete the experiment creation.

    Step 3: Execute the Experiment

    1. Go to the experiment details and click Go to Action Group Execution.
    2. Execute the actions in sequence, as described in steps 3 and 4.
    3. Execute Node lockdown and Node lockdown Recovery.
    3.1 Execute the Node lockdown fault action, view the execution logs on the action card, and observe the node status to confirm that it is locked and no longer schedulable.
    3.2 Execute the Node lockdown fault recovery action to unlock the node, then observe its status to confirm that it is schedulable again.
    4. Execute Node Drain and Node Drain Recovery.
    4.1 View the current Pod list on the node, check the high availability policies for the services, and ensure that other nodes have sufficient capacity to restart the Pods after eviction.
    4.2 Execute the Node drain fault action, which evicts the Pods from the node. The action card lists the evicted Pods, making it easier to identify the affected services. The super node is locked and not schedulable. Meanwhile, cluster scheduling recreates the evicted Pods on other nodes to recover the services.
    4.3 Execute the Node drain fault recovery action to unlock the node and restore its schedulable status.
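The observation checks in steps 3.1 and 4.1 can be sketched in Python. This is a minimal sketch, not part of the product. It assumes that a locked node shows up as a Kubernetes Node object with `spec.unschedulable` set to true (the field that a standard cordon sets), and that you have already collected the resource requests of the Pods on the node and the free capacity of the other nodes. The function names `is_unschedulable` and `can_reschedule` and the `(CPU millicores, memory MiB)` tuples are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: (CPU in millicores, memory in MiB).
Resources = Tuple[int, int]


def is_unschedulable(node: dict) -> bool:
    """Return True if a parsed Kubernetes Node object is marked unschedulable.

    Assumption: the lockdown action is reflected in spec.unschedulable,
    which is absent or false on a schedulable node.
    """
    return bool(node.get("spec", {}).get("unschedulable", False))


def can_reschedule(pods: List[Resources], free: List[Resources]) -> bool:
    """Rough pre-drain check using a first-fit-decreasing heuristic.

    Places the largest Pods first on the first node with enough free CPU
    and memory. It ignores affinity rules, taints, and disruption budgets,
    and because it is a heuristic, a False result does not prove that no
    valid placement exists.
    """
    remaining = [list(capacity) for capacity in free]
    for cpu, mem in sorted(pods, reverse=True):
        for slot in remaining:
            if slot[0] >= cpu and slot[1] >= mem:
                slot[0] -= cpu
                slot[1] -= mem
                break
        else:
            # No remaining node could hold this Pod.
            return False
    return True
```

A node object for `is_unschedulable` could come from the JSON output of `kubectl get node <name> -o json`. Treat the result of `can_reschedule` as a quick sanity check before running the drain action; the cluster scheduler makes the actual placement decision.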
    
    