Serverless Pod Virtual Node Shutdown Faults

Last updated: 2024-09-26 15:47:38

    Background

    Both TKE Serverless clusters and standard TKE clusters provide super node capabilities to improve the efficiency of dynamic resource scaling and reduce resource costs. The cluster delivers computing resources in the form of Pods, so users do not need to manage the actual nodes running behind the Pods. Using a super node is similar to using a CVM with a very large specification, which simplifies resource management and scaling.
    The super node itself is only a logical concept. When a Pod runs on a super node, the cluster dynamically creates a temporary virtual node dedicated to that Pod, and the virtual node is terminated when the Pod is deleted. When a virtual node fault is detected, the Pod drifts for disaster recovery: the virtual node is rebuilt and the Pod is recreated. You can deploy super nodes across different availability zones to spread resources and reduce the risk of an availability zone failure. CFG provides a Serverless Pod virtual node shutdown fault scenario, which helps you verify the impact of Pod disaster recovery drift on your services after a virtual node fails. It can also simulate the impact of a single availability zone failure on Pods scheduled to super nodes, so you can verify the effectiveness of your disaster recovery design.
    Note:
    Injecting this fault scenario on Pods that are not scheduled to super nodes will fail (see the sketch after this note for one way to confirm which Pods are on super nodes).
    Kubernetes clusters have some self-healing capabilities, but during large-scale faults, for example when more than 50% of nodes or 50% of Serverless containers become abnormal at once, this self-healing ability may degrade or even stop working. This is to prevent large-scale eviction or traffic removal actions from creating greater fault risks, such as a cluster-wide avalanche effect. It is recommended that a single fault experiment cover no more than half of the Serverless Pods in the cluster.
    The fault determination time is affected by the timeout configured on the Agent within the instance (the default value is 5 minutes). If you need to promptly remove traffic from abnormal Pods, see Automatic Rebuilding and Self-Healing for the configuration.
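
    Before injecting the fault, it can help to confirm which Pods are actually scheduled to super node virtual nodes and how large a share of the cluster's Serverless Pods your targets represent. The following Python sketch uses the official Kubernetes client; the eklet instance-type label used to identify super nodes is an assumption you should verify for your cluster, and the target list is purely illustrative.

        # Minimal sketch, assuming super nodes carry the label
        # node.kubernetes.io/instance-type=eklet (verify this for your cluster).
        from kubernetes import client, config

        config.load_kube_config()                      # or load_incluster_config()
        v1 = client.CoreV1Api()

        # Names of super nodes in the cluster.
        super_nodes = {
            n.metadata.name
            for n in v1.list_node().items
            if (n.metadata.labels or {}).get("node.kubernetes.io/instance-type") == "eklet"
        }

        # Serverless Pods, i.e. Pods running on those super nodes.
        serverless_pods = [
            p for p in v1.list_pod_for_all_namespaces().items
            if p.spec.node_name in super_nodes
        ]

        # Hypothetical target list for one experiment; keep it within half of the total,
        # per the note above.
        targets = serverless_pods[:3]
        assert len(targets) <= len(serverless_pods) / 2, "targets exceed 50% of Serverless Pods"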

    Fault Parameters Description

    Duration (s): The maximum duration of the fault. The default value is 30 minutes. After fault injection succeeds, the fault automatically recovers once the configured duration elapses (even if no manual recovery action is executed).
    Whether to Auto-Rebuild Pod: Whether to enable the Pod's automatic rebuilding feature. It is enabled by default. Disabling it simulates a persistent fault that cannot self-heal. For details, see Automatic Rebuilding and Self-Healing.
    Wait Time for Pod Rebuilding After Recovery (s): After the fault is cleared, the system keeps checking the Pod's health status during this period to determine whether the fault has recovered. The default value is 8 minutes. Make sure this duration is longer than the maximum time needed to rebuild the Pod.
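
    As a quick reference, the sketch below expresses the three parameters and their documented defaults in code. The field names are hypothetical and only illustrate the relationship between the post-recovery wait time and the Pod rebuild time; they are not a CFG API schema.

        # Illustrative parameter set using the documented defaults; field names are
        # hypothetical and do not correspond to a real CFG API schema.
        fault_params = {
            "duration_s": 30 * 60,                    # maximum fault duration (default: 30 minutes)
            "auto_rebuild_pod": True,                 # default: enabled; set False to simulate a persistent fault
            "rebuild_wait_after_recovery_s": 8 * 60,  # post-recovery health check window (default: 8 minutes)
        }

        # The wait time must exceed the longest Pod rebuild time you expect to see.
        max_pod_rebuild_s = 5 * 60                    # assumption: measure this for your own workload
        assert fault_params["rebuild_wait_after_recovery_s"] > max_pod_rebuild_s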

    Experiment Execution

    Step 1: Experiment Preparation

    Run Pod instances on super nodes, for example in a Serverless cluster or on super nodes in a standard cluster.
    Go to the Agent Management page to install the fault injection agent in the cluster where the Pods are located.

    Step 2: Create an Experiment

    1. Log in to Tencent Cloud Smart Advisor > Chaotic Fault Generator, go to the Experiment Management page, and click Create a New Experiment.
    2. Click Skip and create a blank experiment, then fill in the experiment details.

    Step 3: Add Experiment Instances and Actions

    1. In the experiment object configuration section, select Standard Cluster Pod or Serverless Cluster Pod as the object type.
    2. Add an instance: select the Cluster ID and Namespace for injection, which automatically retrieves the Pods in that namespace of the cluster, and then select the corresponding Pod name.
    3. Add experiment actions. Select the Serverless Pod virtual node shutdown fault under the Pod Operation category.
    4. For detailed action parameter configuration, see Fault Parameters Description.

    Step 4: Global Configuration

    In the global configuration phase, you can configure the advance method, monitoring metrics, guardrail policies, and other settings.

    Fault Result Observation

    Click Execute and wait for the fault injection to complete. Observe the Pod status in the TKE Console > Cluster: enter the cluster's basic information page and click Pod. When the Pod's Ready condition shows ProbeTimeout, the underlying virtual node has been shut down successfully and the Agent has reported a timeout (at this point, traffic to the abnormal Pod has already been removed).
    If Auto-Rebuild Pod is set to Yes in the fault parameters, a Pod rebuild is automatically triggered after the fault. The Pod rebuilds and self-heals within a certain time, and you can observe the corresponding Pod rebuild events.
    If Auto-Rebuild Pod is set to No, the fault persists until you manually execute a rollback or the fault duration ends, at which point the fault recovers automatically. Fault recovery also involves reallocating the underlying virtual node and rebuilding the Pod.
    In addition, the Pod monitoring metrics reflect the fault process: the Pod's metrics show gaps during the fault and resume after the fault recovers (note: because the Pod is rescheduled rather than restarted, the Pod restart count does not increase).
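
    If you prefer to watch the fault from the Kubernetes API rather than the console, the sketch below polls the target Pod's Ready condition during the experiment. The namespace and Pod name are placeholders, and the ProbeTimeout reason mirrors what the console displays.

        # Observation sketch: poll the target Pod's Ready condition during the fault.
        # NAMESPACE and POD_NAME are placeholders for your own experiment target.
        import time
        from kubernetes import client, config

        config.load_kube_config()
        v1 = client.CoreV1Api()

        NAMESPACE, POD_NAME = "default", "my-serverless-pod"

        for _ in range(60):                            # poll roughly every 10 s for 10 minutes
            pod = v1.read_namespaced_pod(POD_NAME, NAMESPACE)
            ready = next((c for c in (pod.status.conditions or []) if c.type == "Ready"), None)
            if ready is not None:
                # During the fault the Ready condition is expected to turn False,
                # with a reason such as ProbeTimeout; after recovery it returns to True.
                print(ready.status, ready.reason)
            time.sleep(10)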

    FAQ

    How to Simulate a Single Availability Zone Fault of a Super Node?

    If you configured multiple availability zones when purchasing super nodes, first cordon (block scheduling to) the super nodes in the availability zone to be faulted. Then execute the virtual node shutdown fault on all Pods scheduled to those super nodes, simulating a scenario in which the entire super node becomes unavailable due to a single availability zone fault and the Pods are migrated to other availability zones, as sketched below.
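
    A minimal sketch of the first half of this procedure: it cordons the super nodes in one availability zone via the Kubernetes API (the console's node cordon action achieves the same) and lists the Pods running on them as candidates for the fault. The eklet instance-type label and the zone value are assumptions to verify for your cluster; the fault itself is still injected through CFG.

        # Sketch: cordon super nodes in one availability zone and list the Pods on them.
        # The eklet label and the zone value are assumptions; adjust for your cluster.
        from kubernetes import client, config

        config.load_kube_config()
        v1 = client.CoreV1Api()

        TARGET_ZONE = "ap-guangzhou-3"                 # placeholder availability zone

        for node in v1.list_node().items:
            labels = node.metadata.labels or {}
            if (labels.get("node.kubernetes.io/instance-type") == "eklet"
                    and labels.get("topology.kubernetes.io/zone") == TARGET_ZONE):
                # Cordon the node (equivalent to `kubectl cordon <node>`).
                v1.patch_node(node.metadata.name, {"spec": {"unschedulable": True}})
                # Pods on this node are the targets for the virtual node shutdown fault.
                pods = v1.list_pod_for_all_namespaces(
                    field_selector=f"spec.nodeName={node.metadata.name}")
                for p in pods.items:
                    print(p.metadata.namespace, p.metadata.name)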
    