Simulating DC Tunnel Disconnection Faults

Last updated: 2024-09-26 15:49:18

    Background

    Direct Connect (DC) provides a fast and secure way to connect Tencent Cloud services with your local data center. A single connection can reach Tencent Cloud computing resources across multiple regions, enabling flexible and reliable hybrid cloud deployment. A DC tunnel is a network link segment of a connection; you can create different DC tunnels and associate them with different DC gateways. In production environments, misconfigured alarms or poorly planned disaster recovery can mean that alarms are not received or disaster recovery plans are not triggered when a real fault occurs, resulting in business losses. To identify architectural risks proactively, you can use chaos engineering to verify the disaster recovery capability of your Direct Connect deployment architecture in advance.
    Simulating a DC tunnel disconnection fault serves the following purposes:
    Verifying Alarm Reachability After DC Tunnel Disconnection
    You can set alarm rules for connections, DC tunnels, and DC gateways in the Tencent Cloud observability platform. When a DC tunnel fault occurs, the corresponding alarm policy should be triggered. To verify that your alarm rules are configured effectively, simulate a DC tunnel fault with the Tunnel Disconnection Fault Simulation action and observe whether the alarm rules are triggered.
    Note:
    For direct connect alarm configuration, see DC - Configure Alarms.
    Verifying Disaster Recovery Capability of High Availability DC Deployment Architecture
    Tencent Cloud DC can keep services highly available in various fault scenarios (e.g., port abnormalities or optical module faults, network device failures, and access point IDC faults). To enhance disaster recovery capability, DC is typically deployed in a high-availability architecture. To verify that the high-availability architecture is effective and that your business recovers as expected from actual faults, use the Tunnel Disconnection Fault Simulation action.
    Note:
    For direct connect disaster recovery architecture, see DC - Network Planning.

    Practical Examples

    Verifying Alarm Reachability After DC Tunnel Disconnection

    Experiment Preparation

    An exclusive or shared DC tunnel. The tunnel must be in the connected status and its version must be 2.0 (fault simulation currently supports only Direct Connect version 2.0).
    An alarm policy for the DC tunnel disconnection fault, configured in the Tencent Cloud observability platform.
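
    The two prerequisites above can be written down as a small pre-flight check before you open the console. The following is a minimal sketch, not a Tencent Cloud API: the TunnelInfo record, its field names, the "connected" status string, and the example tunnel ID are all assumptions made for illustration, and you would fill in the values by hand or from your own tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TunnelInfo:
    """Hypothetical snapshot of a DC tunnel (not a Tencent Cloud SDK type)."""
    tunnel_id: str
    status: str             # assumed to be "connected" for a usable tunnel
    version: str            # fault simulation supports only "2.0"
    has_alarm_policy: bool  # True if a disconnection alarm policy is configured


def preparation_issues(tunnel: TunnelInfo) -> List[str]:
    """Return the unmet prerequisites; an empty list means the tunnel is ready."""
    issues = []
    if tunnel.status != "connected":
        issues.append(f"{tunnel.tunnel_id}: tunnel is not connected")
    if tunnel.version != "2.0":
        issues.append(f"{tunnel.tunnel_id}: version {tunnel.version} is not 2.0")
    if not tunnel.has_alarm_policy:
        issues.append(f"{tunnel.tunnel_id}: no alarm policy configured")
    return issues


if __name__ == "__main__":
    # "dcx-example" is a made-up identifier used only for this sketch.
    print(preparation_issues(TunnelInfo("dcx-example", "connected", "2.0", True)))
```

    An empty list means both prerequisites are met; any returned message names the item to fix before creating the experiment.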

    Experiment Steps

    Step 1: Create an experiment
    Tencent Smart Advisor-Chaotic Fault Generator (CFG) provides two DC tunnel disconnection fault actions: Exclusive direct connect tunnel disable and Shared direct connect tunnel disable. The corresponding recovery actions are Exclusive direct connect tunnel enable and Shared direct connect tunnel enable. Select the fault action that matches your tunnel type. The following steps use an exclusive DC tunnel as an example.
    1. Log in to Tencent Smart Advisor > Chaotic Fault Generator and enter the Experiment Management page. Click Create a New Experiment, select Dedicated Line - Exclusive direct connect tunnel for the object type, and then click Add Instance.
    2. After clicking Add Instance, you can filter your DC tunnels based on the search criteria.
    3. After selecting the instance, click Add experiment action.
    4. Select Exclusive direct connect tunnel disable.
    5. The fault action automatically adds the corresponding recovery action:
    If the recovery action is set to Automatic Execution, click the action and set the pre-action and post-action waiting times to control the fault duration.
    If the recovery action is set to Manual Execution, you control the timing of the fault and the recovery yourself.
    Click Next to enter Global Configuration.
    6. In Global Configuration, set the experiment execution method to Manual or Automatic (the default is Manual). Then add monitoring metrics under Add Monitoring Metrics; these refresh in real time during the experiment, with a possible delay of 1 to 2 minutes depending on the object. Click Submit to enter Environmental Check.
    7. Environmental Check does not execute the experiment. It only checks whether the status of your experiment objects meets the experiment requirements, for example, whether your DC tunnel version is 2.0.
    8. The experiment is now created. Click Experiment Details to execute it.
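
    Before clicking through the steps above, it can help to record the intended experiment as data so it can be reviewed. The sketch below pairs each tunnel type with the fault and recovery action names given in Step 1. It is only a planning aid under stated assumptions: ExperimentPlan, build_plan, and the fault_hold_seconds field are hypothetical names, and nothing here calls CFG.

```python
from dataclasses import dataclass
from typing import Optional

# Fault and recovery action names as listed in Step 1.
ACTION_PAIRS = {
    "exclusive": ("Exclusive direct connect tunnel disable",
                  "Exclusive direct connect tunnel enable"),
    "shared": ("Shared direct connect tunnel disable",
               "Shared direct connect tunnel enable"),
}


@dataclass(frozen=True)
class ExperimentPlan:
    tunnel_type: str
    fault_action: str
    recovery_action: str
    recovery_mode: str                 # "manual" or "automatic"
    fault_hold_seconds: Optional[int]  # hypothetical: planned wait before automatic recovery


def build_plan(tunnel_type: str, recovery_mode: str = "manual",
               fault_hold_seconds: Optional[int] = None) -> ExperimentPlan:
    """Validate the inputs and return the matching fault/recovery action pair."""
    if tunnel_type not in ACTION_PAIRS:
        raise ValueError(f"unknown tunnel type: {tunnel_type!r}")
    if recovery_mode not in ("manual", "automatic"):
        raise ValueError(f"unknown recovery mode: {recovery_mode!r}")
    if recovery_mode == "automatic" and (fault_hold_seconds is None
                                         or fault_hold_seconds <= 0):
        raise ValueError("automatic recovery needs a positive fault_hold_seconds")
    if recovery_mode == "manual" and fault_hold_seconds is not None:
        raise ValueError("manual recovery has no planned hold time")
    fault, recovery = ACTION_PAIRS[tunnel_type]
    return ExperimentPlan(tunnel_type, fault, recovery, recovery_mode,
                          fault_hold_seconds)
```

    For example, build_plan("exclusive") returns a manual plan that uses the exclusive disable and enable actions, matching the walkthrough above.
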
    Step 2: Execute the experiment action
    1. Click the Action Execution button.
    2. Wait for the fault action to execute successfully. Meanwhile, observe the effect of the fault in real time through the monitoring metrics (network inbound and outbound bandwidth drop to 0).
    3. Once fault injection is complete, click Execute recovery action at an appropriate time to restore the tunnel status.
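
    If you export the bandwidth metrics after the run, the outage window can be located by finding where both directions drop to 0. The sketch below assumes a hypothetical export format of (unix_seconds, inbound_bps, outbound_bps) tuples sorted by time; the function name and the format are illustrative and not part of any Tencent Cloud tool.

```python
from typing import List, Optional, Tuple

Sample = Tuple[int, float, float]  # (unix_seconds, inbound_bps, outbound_bps)


def find_outage(samples: List[Sample]) -> Optional[Tuple[int, Optional[int]]]:
    """Return (start, end) of the first window where both directions read 0.

    `end` is the timestamp of the first non-zero sample after the window, or
    None if traffic had not returned by the last sample. Returns None if no
    sample shows both directions at 0.
    """
    start = None
    for ts, inbound, outbound in samples:
        is_down = inbound == 0 and outbound == 0
        if is_down and start is None:
            start = ts
        elif not is_down and start is not None:
            return (start, ts)
    return (start, None) if start is not None else None


if __name__ == "__main__":
    series = [(0, 100.0, 100.0), (60, 0.0, 0.0), (120, 0.0, 0.0), (180, 90.0, 80.0)]
    print(find_outage(series))
```

    Because metrics lag the real fault, treat the returned timestamps as approximate bounds rather than exact fault and recovery times.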

    Step 3: Observe the results

    After the fault is injected, the DC tunnel shows a disabled status, and the tunnel detection tool reports 100% packet loss.
    To verify that alarms are delivered, check whether the TCOP alarm policy is triggered after the fault and whether the alarm clears after the fault is recovered. You can also use the monitoring metrics to observe the overall effect of fault injection and recovery.
    Note:
    A certain delay exists between monitoring metrics and actual faults.
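
    The alarm check in Step 3 amounts to two questions: did an alarm fire after the fault, and did it clear after recovery, each within a tolerated delay? The sketch below expresses that check over a hand-collected list of alarm events. The event format, the function name, and the default 300-second grace period are assumptions chosen for illustration; pick a grace period that matches your own alarm policy.

```python
from typing import Dict, Iterable, Tuple

AlarmEvent = Tuple[int, str]  # (unix_seconds, "triggered" or "recovered")


def check_alarm_delivery(events: Iterable[AlarmEvent], fault_at: int,
                         recovered_at: int,
                         grace_seconds: int = 300) -> Dict[str, bool]:
    """Report whether an alarm fired and cleared within the grace period."""
    recorded = list(events)

    def seen(kind: str, start: int) -> bool:
        # True if an event of this kind falls in [start, start + grace_seconds].
        return any(k == kind and start <= ts <= start + grace_seconds
                   for ts, k in recorded)

    return {
        "triggered": seen("triggered", fault_at),
        "recovered": seen("recovered", recovered_at),
    }
```

    A result with "triggered" set to False suggests that the alarm policy did not match the fault or that notification was delayed beyond the grace period, which is the gap this experiment is meant to expose.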

    Verifying the Disaster Recovery Capability of the High Availability DC Deployment Architecture (Taking Dual Lines with Dual Access Points as an Example)

    Experiment Preparation

    Dual lines with dual access points deployment architecture: The user's IDC connects to two Tencent Cloud access points through two connections. The local router on the IDC side establishes BGP neighbor relationships with two DSR clusters. When a fault is detected on Connection 1, the system automatically switches traffic to Connection 2 so that business continues normally. After the fault is fixed, traffic automatically switches back.
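
    The switch-over and switch-back behavior described above can be illustrated with a toy model: traffic follows the most preferred connection that is currently healthy. This is a deliberately simplified sketch and not how BGP route selection is implemented; the function name and the health dictionary are assumptions for illustration.

```python
from typing import Dict, List, Optional


def active_connection(health: Dict[str, bool],
                      preference: List[str]) -> Optional[str]:
    """Return the first healthy connection in preference order, or None."""
    for name in preference:
        if health.get(name, False):
            return name
    return None


if __name__ == "__main__":
    order = ["Connection 1", "Connection 2"]
    health = {"Connection 1": True, "Connection 2": True}
    print(active_connection(health, order))  # normal operation

    health["Connection 1"] = False           # fault injected on Connection 1
    print(active_connection(health, order))  # traffic moves to Connection 2

    health["Connection 1"] = True            # fault recovered
    print(active_connection(health, order))  # traffic switches back
```

    During the experiment, the expected observation is the same sequence: traffic on Connection 1, then on Connection 2 while the tunnel is disabled, then back on Connection 1 after recovery.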

    Experiment Steps

    1. Create an experiment, select Dedicated Line - Exclusive direct connect tunnel, and click Add Instance to filter all DC tunnels by DC ID.
    2. The remaining steps for creating the experiment are the same as in Experiment Steps under Verifying Alarm Reachability After DC Tunnel Disconnection.

    Result Observation

    The DC tunnel disconnection fault disables the BGP sub-interface of the tunnel, so a BGP connection cannot be established. After the fault, the console shows the BGP connection in a disabled status. If the tunnel is configured with BFD or NQA health detection, that detection also fails. If your DC is configured with automatic convergence rules, the tunnel traffic should automatically switch to the backup DC. If 50% of capacity is reserved in connection capacity planning, DC traffic monitoring should show the inbound and outbound traffic of the backup DC roughly doubling.
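
    The doubling follows from simple arithmetic: if two connections each carry the same load and one fails, the surviving connection carries both shares. The sketch below works this out under the assumptions that there are exactly two connections, that load is split evenly before the fault, and that figures are in Mbps; the function name is illustrative.

```python
from typing import Tuple


def traffic_after_failover(link_capacity_mbps: float,
                           traffic_per_link_mbps: float) -> Tuple[float, float]:
    """Return (traffic on the surviving link, its utilization as a fraction).

    Assumes two links that carried equal traffic before the fault.
    """
    if link_capacity_mbps <= 0:
        raise ValueError("link capacity must be positive")
    surviving_traffic = 2 * traffic_per_link_mbps
    return surviving_traffic, surviving_traffic / link_capacity_mbps


if __name__ == "__main__":
    # 1000 Mbps links, each at 50% before the fault.
    print(traffic_after_failover(1000, 500))
```

    A utilization above 1.0 means the surviving connection cannot carry the combined load, which is why reserving 50% of capacity on each connection matters for this architecture.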
    