Background
Tencent Smart Advisor-Chaotic Fault Generator provides a fault action that simulates a primary-secondary switch in TencentDB for PostgreSQL, reproducing the situation in which a primary-secondary switch occurs in PostgreSQL for some reason. The primary-secondary switch experiment lets developers test the system in a more complex and realistic environment so that potential problems and risks can be identified. Through chaos engineering experiments, developers gain a more comprehensive understanding of how the system operates and performs, and can prepare countermeasures and policies for different fault scenarios to improve system stability and availability.
Note:
After a primary-secondary switch in PostgreSQL, rebuilding and migration of the replica machine begin. The time until the replica machine is ready depends on the data size.
If an execution fails, retry it, but do not run the primary-secondary switch experiment frequently. After a switch, wait 10-20 minutes before executing a second switch.
For details, see the reference document on primary/replica instance switching in PostgreSQL.
Impact of Primary-secondary Switch
A primary-secondary switch causes momentary disconnections. Make sure that the application has a reconnection mechanism.
If a read-only instance is mounted on the primary instance, the read-only instance experiences a minute-level delay after the primary-secondary switch.
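As an illustration of the reconnection mechanism mentioned above, the following is a minimal sketch of retrying a connection with exponential backoff. The function name connect_with_retry and all of its parameters are assumptions made for this example; they are not part of TencentDB or of any PostgreSQL driver. The connect argument stands for whatever function the application already uses to open a database connection.

```python
import time


def connect_with_retry(connect, attempts=5, base_delay=0.5, max_delay=8.0,
                       retry_on=(ConnectionError, OSError), sleep=time.sleep):
    """Call connect() until it succeeds or the attempts are used up.

    connect  -- hypothetical zero-argument callable that opens a connection
    retry_on -- exception types treated as a temporary disconnection
    sleep    -- injectable so the wait can be skipped in tests
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay = min(delay * 2, max_delay)  # double the wait, capped
```

With a driver such as psycopg2, the application would probably pass the driver's own connection error type (for example psycopg2.OperationalError) in retry_on. The delays here suit the momentary disconnection described above; they are illustrative values, not recommendations from the product.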
Experiment Implementation
Step 1: Experiment Preparation
Prepare a PostgreSQL instance whose primary node and secondary node are deployed in different availability zones.
Step 2: Experiment Orchestration
2. Click Skip and create a blank experiment in the lower left corner.
3. Fill in the experiment information, select PostgreSQL as the Object Type, and click Add Instance to add instances to the experiment.
4. After selecting an instance, click Add Now in the Experiment Action module.
5. Add the Primary-secondary switch fault action, and click Next.
6. Configure the action parameters, disable Forced Switch, and click Confirm.
Action Parameters Description:
Forced Switch: When enabled, the primary-secondary switch is performed directly without checking the switch conditions; when disabled, the switch conditions must be checked before the primary-secondary switch is performed.
Conditions for primary-secondary switch: In the PostgreSQL Console, open the details page of the corresponding instance. The conditions can be viewed and edited in the availability information module.
7. Click Next to go to Global Configuration. See Quick Start for details on Global Configuration.
8. After confirming the configuration, click Submit.
9. Click Experiment Details to go to the Experiment Details page and start the experiment.
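The effect of the Forced Switch parameter described in the Action Parameters Description can be pictured with a small sketch. This only mirrors the behavior stated above; the names may_switch and check_conditions are hypothetical and do not correspond to any real product API, and the actual switch conditions are the ones configured in the console.

```python
def may_switch(forced, check_conditions):
    """Decide whether a primary-secondary switch may proceed.

    forced           -- value of the Forced Switch parameter
    check_conditions -- hypothetical callable returning True when the
                        configured switch conditions are satisfied
    """
    if forced:
        # Forced Switch enabled: conditions are not checked at all.
        return True
    # Forced Switch disabled: the switch proceeds only if the check passes.
    return bool(check_conditions())
```

Passing the check as a callable makes the difference visible: with Forced Switch enabled the check is never run, which is why the steps above disable it for a safer experiment.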
Step 3: Experiment Execution
1. Before the experiment, check the availability information of the instance and note which nodes are currently the primary node and the secondary node.
2. Because the experiment is executed manually, fault actions must also be triggered manually. Click Execute on the Action Card to start fault injection.
3. During fault injection, the basic information module of the corresponding instance in the PostgreSQL Console shows the instance state as Switching.
4. After a successful fault injection, you can click the blank area on the Action Card to check the details of the action. You can see that the primary node of the instance has been switched.
5. On the Details page of the corresponding instance in the PostgreSQL Console, you can confirm that the primary node has been switched successfully.
6. The instance deployment state can be restored to its pre-fault state through the fault recovery action. (It is advised to execute the recovery action after 10 minutes.)
7. After a successful fault recovery action, you can click the blank area on the Action Card to check the details of the action. You can see that the primary node of the instance has been switched back.
8. On the Details page of the corresponding instance in the PostgreSQL Console, you can confirm that the primary node has successfully switched back to the pre-fault primary availability zone.
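The checks made by hand in the steps above (noting the primary and secondary before the experiment, confirming the switch, waiting before recovery, and confirming the return to the pre-fault zone) can be written down as a small sketch. Everything here is an assumption for illustration: the Topology record, the function names, and the zone labels are hypothetical, and the values would have to be read manually from the console rather than from any API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Topology:
    """Availability zones noted from the instance details page."""
    primary_zone: str
    secondary_zone: str


def switched(before, after):
    """True if the primary now runs in the zone of the former secondary."""
    return after.primary_zone == before.secondary_zone


def recovered(before, after):
    """True if the primary is back in its pre-fault availability zone."""
    return after.primary_zone == before.primary_zone


def cooldown_elapsed(last_switch, now, minimum=timedelta(minutes=10)):
    """True once the advised waiting time since the last switch has passed."""
    return now - last_switch >= minimum
```

For example, with before = Topology("zone-a", "zone-b") recorded in step 1, a console reading of Topology("zone-b", "zone-a") after fault injection satisfies switched, and a later reading with the primary in "zone-a" again satisfies recovered. The default of 10 minutes follows the advice in step 6; use a longer value for the 10-20 minute interval advised between two switches.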