Overview
Messaging middleware plays a crucial role in the technical architecture of business systems. TDMQ Pulsar inherently supports multi-AZ disaster recovery. To address region-level disasters and enable quick business migration to ensure continuity, a Cross-Region Disaster Recovery solution has been introduced.
The cross-region disaster recovery solution works as follows.
Under normal circumstances, business traffic accesses the Pulsar cluster in Region A. To prepare for disaster recovery, users need to complete two main actions:
1. Establish cross-region network connectivity through Cloud Connect Network (CCN) to interconnect the VPC networks of the two regions;
2. Set up metadata synchronization between the clusters in the two regions through the Pulsar console, covering namespaces, topics, subscriptions, roles, etc.
When an exception occurs, the TDMQ for Apache Pulsar console provides a domain name resolution switch feature that redirects the domain name originally used for Region A to the disaster recovery cluster in Region B. Clients do not need to modify their access point addresses, so the Region B cluster takes over and business continuity is preserved.
After Region A recovers from the exception, users first need to decide whether to rewind the messages produced in Region B back to Region A to ensure message integrity. If rewinding is required, please contact our after-sales team for operation. Then, users can switch the access point domain name resolution back to the Region A cluster. After the switchback operation, clients can access Region A normally.
Operation Guide
Disaster Recovery Feature Launch Configuration
1. In the backup region, create a professional cluster. On the cluster purchase page, enable the [Cross-Region Disaster Recovery] switch and select the cluster to back up;
2. Through the console, configure the cluster's metadata synchronization link:
Replication link name: Define a name for the synchronization link.
Source cluster selection: Choose the Pulsar cluster to be backed up.
Destination cluster selection: Choose the created disaster recovery cluster in a different region. Only clusters with the same [Cluster IDs] will be displayed here.
Replication level: Supports two levels, cluster level and namespace level.
Cluster level: suitable for cold backup of an entire cluster.
Namespace level: suitable for scenarios where the clusters in both locations are active, with different namespaces distributed across different regions and the two regions serving as primary and backup for each other.
Cloud Connect Network Ensures Cross-Region Network Connectivity
Based on Cloud Connect Network, a network access channel is established between the production region and the backup region, so that clients in the production region can reach the backup cluster across regions during a disaster. For configuration details, please see the Cloud Connect Network Operation Guide.
In the Event of a Disaster
The user decides to switch client access to the backup region:
1. Initiate the domain name resolution switch through the console (if available);
2. If the console is unavailable, contact the after-sales architect to initiate the switch from the TDMQ service side.
After Disaster Recovery
The user decides to switch client access back to the original region cluster:
1. Determine whether message rewinding is necessary. If rewinding is needed, please contact our after-sales team for operation;
2. Initiate the domain name switchback through the console, after which client access to the original region resumes normally.
Notes
1. Supported Scope
This feature is only supported by professional clusters.
2. Message Rewinding
Deciding whether to rewind messages is a prerequisite for switching traffic back to the original region. The purpose is to avoid data loss and ensure data integrity. Be sure to make this decision before switching back the domain name.
Information to be provided by the user:
The scope of Topics to be migrated, for example: cluster ID, namespace, or a specific Topic list.
The start and end times. Messages sent to the Topic within this time range are considered data to be migrated, judged by the publishTime in the message header.
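As an illustration of the time-range criterion above, the following minimal Python sketch selects messages whose publishTime falls inside a rewind window. The `Msg` class, the function names, and the choice of inclusive bounds are all assumptions made for this example; the actual rewinding is performed by the after-sales team, and this sketch only helps reason about which messages a given window would cover.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Msg:
    """Hypothetical stand-in for a message; not a TDMQ or Pulsar client type."""
    topic: str
    publish_time_ms: int  # publishTime as epoch milliseconds
    payload: bytes


def to_ms(dt):
    """Convert a timezone-aware datetime to epoch milliseconds."""
    return int(dt.timestamp() * 1000)


def select_for_rewind(msgs, topics, start, end):
    """Return messages in the given topics with publishTime in [start, end].

    Inclusive bounds are an assumption of this sketch.
    """
    lo, hi = to_ms(start), to_ms(end)
    return [m for m in msgs if m.topic in topics and lo <= m.publish_time_ms <= hi]
```

Using timezone-aware datetimes avoids ambiguity when the two regions or the operators work in different local time zones.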
Impacts of message rewinding:
A large number of duplicate messages. The server does not track the complex state machine involved in synchronizing offsets between the source and destination clusters. It treats all migrated messages as new messages: even if the historical data already contains the same message, the two are considered different messages. If duplicate messages affect the business, it is recommended that the client implement idempotent processing.
A small number of messages may be out of order.
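A minimal Python sketch of client-side idempotent processing follows. It assumes each message carries a business key (for example an order ID) chosen by the application; since migrated messages are treated as new messages, deduplicating on a server-assigned message identifier would presumably not help. The class name and the in-memory set are illustrative assumptions; a real deployment would need a durable, shared store.

```python
class IdempotentHandler:
    """Wraps a handler so each business key is processed at most once.

    Sketch only: the seen-key set lives in memory and is lost on restart.
    """

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()

    def process(self, business_key, payload):
        """Run the handler unless this key was already handled.

        Returns True if the handler ran, False if the message was a duplicate.
        """
        if business_key in self._seen:
            return False
        self._handler(payload)
        # Record the key only after the handler succeeds, so a failed
        # attempt can be retried.
        self._seen.add(business_key)
        return True
```

For example, wrapping a handler and calling `process("order-1", data)` twice runs the handler once and returns False on the second call.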
3. About Roles
The source cluster must contain at least one Role that does not need to be bound to a namespace. This ensures that the Role and Token remain consistent with the disaster recovery cluster during synchronization.
4. Cloud Connect Network Configuration
In the Cloud Connect Network configuration, the CIDR blocks of the VPCs in the two regions must not overlap. For example: Guangzhou 10.0.0.0/16 and Shanghai 10.1.0.0/16. In this way, Cloud Connect Network can connect the two VPCs without IP conflicts.
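The non-overlap requirement can be checked before configuring Cloud Connect Network. The sketch below uses Python's standard `ipaddress` module; the function name and the region labels are illustrative assumptions.

```python
import ipaddress
from itertools import combinations


def overlapping_cidrs(cidrs):
    """Return pairs of labels whose CIDR blocks overlap.

    `cidrs` maps a label (e.g. a region name) to a CIDR string.
    """
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return [(a, b) for a, b in combinations(nets, 2) if nets[a].overlaps(nets[b])]
```

With the example values from this section, `overlapping_cidrs({"guangzhou": "10.0.0.0/16", "shanghai": "10.1.0.0/16"})` returns an empty list, meaning the two VPCs can be interconnected without IP conflicts.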
5. Time for Domain Name Switch to Take Effect
The domain name switch takes about 5 seconds to 5 minutes to take effect. This covers two parts: the domain name resolution switch itself, and clients disconnecting and reconnecting to the new cluster's Broker.
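To observe the first part (the resolution switch) from a client host, one option is to poll DNS until the access point no longer resolves to the old cluster's addresses. The sketch below is an assumption-laden illustration: the function names are hypothetical, the 300-second timeout and 5-second interval simply mirror the range stated above, and local DNS caches may delay what the host sees. It does not confirm that clients have reconnected.

```python
import socket
import time


def resolve_ips(host):
    """Resolve a host name to the set of IP addresses it currently maps to."""
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


def wait_for_switch(host, old_ips, timeout_s=300, interval_s=5,
                    resolve=resolve_ips, sleep=time.sleep):
    """Poll until `host` resolves only to addresses outside `old_ips`.

    Returns the new address set, or None if the timeout elapses first.
    `resolve` and `sleep` are injectable so the logic can be tested offline.
    """
    waited = 0
    while waited <= timeout_s:
        current = resolve(host)
        if current and current.isdisjoint(old_ips):
            return current
        sleep(interval_s)
        waited += interval_s
    return None
```

The access point host name and the old addresses must be supplied by the caller; none are assumed here.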
6. After Switch During a Disaster
In the event of a disaster, after traffic switches to the disaster recovery cluster, avoid changing the metadata in the backup cluster, such as modifying namespace attributes or creating new Topics.