Background
To keep your business continuously available, CVM supports cross-AZ deployment, so that your applications remain protected if a regional or availability zone fault occurs in certain scenarios.
If you are unsure how resilient your services or cloud products are, and worry that an IDC fault could make your production business inaccessible, you can run a fault simulation experiment with Tencent Smart Advisor-Chaotic Fault Generator to find and eliminate hidden risks early.
Experiment Objectives
Objective 1
Check whether the cross-AZ service architecture can continue to provide normal service when the instances in one availability zone go down.
Objective 2
Check whether the service recovery time and the recovery result meet business requirements.
Experiment Implementation
Step 1: Preliminary Preparation
Prepare several test CVM instances that closely mirror the production environment, placed in different availability zones of the same region and providing identical services.
Prepare complete logging tools.
Prepare emergency measures for unexpected situations.
Collect statistics on daily business traffic, and write scripts that simulate user requests.
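The request-simulation script from the last preparation item could look like the following minimal sketch. It is an illustration only and not part of the platform: the names `ProbeResult`, `http_probe`, `run_load`, and `success_rate` are hypothetical, and the sketch assumes the service is reachable through a plain HTTP GET.

```python
import http.client
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ProbeResult:
    timestamp: float  # wall-clock time (seconds since the epoch) when the request started
    ok: bool          # True if the request was answered successfully
    latency_s: float  # time spent waiting for the answer, in seconds


def http_probe(url: str, timeout_s: float = 2.0) -> bool:
    """Send one GET request and return True only for a 2xx response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # Covers connection errors, timeouts, and HTTP error statuses.
        return False


def run_load(probe: Callable[[], bool], total_requests: int,
             concurrency: int = 4) -> List[ProbeResult]:
    """Call `probe` total_requests times using a small thread pool."""
    def one_request(_: int) -> ProbeResult:
        started = time.time()
        t0 = time.perf_counter()
        ok = probe()
        return ProbeResult(started, ok, time.perf_counter() - t0)

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(one_request, range(total_requests)))


def success_rate(results: List[ProbeResult]) -> float:
    """Fraction of successful requests; 0.0 for an empty result list."""
    return sum(r.ok for r in results) / len(results) if results else 0.0
```

A run against a placeholder address might be `run_load(lambda: http_probe("http://clb.example.internal/health"), total_requests=200)`, where the URL is an assumption to be replaced with your own load balancer endpoint. Passing the probe as a callable keeps the load loop independent of the protocol, so the same loop can drive a different request type if your service is not HTTP.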
Step 2: Experimental Design
2. Click Skip and create a blank experiment, and fill in the experiment information.
3. Select the prepared test instances, and configure the instance shutdown action for the instances in one availability zone to simulate an instance-down fault.
After a fault action is added, the 'start up' recovery action is added automatically. In this experiment, a custom shell script action is also added to simulate start on boot: it starts the original services on the instance after startup, which makes the recovery of the instance easier to observe.
4. You can configure cloud monitoring metrics or a guardrail policy to observe the operating status of the CVM instances.
Step 3: Experiment Execution
1. Go to Experiment Details, and click Execute.
2. Execute the fault injection to shut down the instance, and monitor how the load balancer forwards traffic.
3. After the fault injection completes, execute fault recovery: click Execute on the 'start up' recovery action to restore the instance status. The platform then runs the action and verifies the recovery automatically.
Note
If the service is not configured to start on boot, manually trigger the shell script to recover the service.
4. The experiment is complete once all experiment actions have finished. Click Record Drill Conclusion in the upper-right corner to record the experiment results. Register the experiment and record any issues found during it so that it can be reviewed later.
Experiment Result Analysis
Monitoring Metrics via Platform Tools
For the target instance, fault injection brings the instance down during the execution period, and CLB monitoring detects that the instance is inaccessible.
At this point, traffic is forwarded to the CVM instance in the other availability zone, which shows a sudden increase in traffic at that time.
Once the faulty instance is repaired, that is, after startup and service restart complete, CLB monitoring detects that the instance port is healthy, and the metrics return to their steady state.
Objective Attainment:
When an availability zone goes down, CLB automatically forwards traffic to the other availability zone, so the overall service remains available.
When the availability zone recovers and the service restarts, the metrics return to their steady-state values from before fault injection, and requests are received and processed normally.
Taken together, these two results show that the cross-AZ fault experiment on CVM met expectations.
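To compare the observed recovery time against a business requirement (Objective 2), the interruption window can be estimated from timestamped request results collected during the experiment. The following is a minimal sketch under stated assumptions: the function names, the `(timestamp, ok)` sample format, and the rule that recovery means several consecutive successes are all hypothetical choices, not something the platform prescribes.

```python
from typing import Iterable, Optional, Tuple

Window = Tuple[float, Optional[float]]  # (first failure, start of stable recovery)


def outage_window(samples: Iterable[Tuple[float, bool]],
                  stable_successes: int = 3) -> Optional[Window]:
    """Find the first failure and the start of the first run of
    `stable_successes` consecutive successes that follows it.

    `samples` must be ordered by time. Returns None if no request failed,
    and (first_failure, None) if the service never stabilised again.
    """
    first_failure: Optional[float] = None
    streak = 0
    streak_start: Optional[float] = None
    for ts, ok in samples:
        if first_failure is None:
            if not ok:
                first_failure = ts
            continue
        if ok:
            if streak == 0:
                streak_start = ts
            streak += 1
            if streak >= stable_successes:
                return first_failure, streak_start
        else:
            streak = 0  # an isolated success does not count as recovery
    return None if first_failure is None else (first_failure, None)


def meets_recovery_target(window: Optional[Window], target_s: float) -> bool:
    """True if there was no outage, or it ended within target_s seconds."""
    if window is None:
        return True
    start, end = window
    return end is not None and (end - start) <= target_s
```

Requiring a short run of consecutive successes avoids declaring recovery on a single lucky request while the load balancer is still switching backends; the threshold of 3 is an arbitrary default and should be tuned to your probe interval.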
Theoretical Analysis
Qualitative Analysis: Compare the system metrics during fault injection with the steady-state metrics.
Quantitative Analysis:
System Performance Metrics = Performance metrics during the experiment / Performance metrics at steady state
System Recovery Rate = Performance metrics after the experiment and recovery action / Performance metrics at steady state
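The two ratios above can be computed directly from measured values. The sketch below is illustrative only: the function names are hypothetical, and the sample figures in the usage note (requests per second) are made-up numbers, not measurements from this experiment.

```python
def performance_ratio(during_experiment: float, steady_state: float) -> float:
    """System Performance Metrics: metric during the experiment / steady-state metric."""
    if steady_state <= 0:
        raise ValueError("steady-state metric must be positive")
    return during_experiment / steady_state


def recovery_rate(after_recovery: float, steady_state: float) -> float:
    """System Recovery Rate: metric after the recovery action / steady-state metric."""
    if steady_state <= 0:
        raise ValueError("steady-state metric must be positive")
    return after_recovery / steady_state
```

For example, with a hypothetical steady-state throughput of 1000 requests per second, 920 during fault injection, and 990 after recovery, `performance_ratio(920, 1000)` gives 0.92 and `recovery_rate(990, 1000)` gives 0.99. Values close to 1 indicate that the fault had little impact and that the system returned to its steady state.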
Analysis of Causes of System Defects:
Analyze system weaknesses.
Analyze deficiencies in fault handling.
Analyze disturbance resistance of the system.
Analyze monitoring alarm effectiveness.
Analyze dependency relations between modules.