Cross-Region Disaster Recovery
Message middleware is a vital component in the technical architecture of business systems. TDMQ for Apache Pulsar already supports disaster recovery across multiple availability zones. The cross-region disaster recovery solution extends this protection to region-level disasters, so that customers can quickly migrate business traffic to a cluster in another region and maintain business continuity.
The following document provides an overview of the cross-region disaster recovery solution.
Under normal circumstances, business operations in Region A access the Pulsar cluster in Region A. To prepare for disaster recovery, users need to complete two main actions:
1. Establish cross-city network connectivity using Cloud Connect Network (CCN) to enable cross-region VPC communication.
2. Synchronize metadata between the two regions via the Pulsar console, including namespaces, Topics, subscriptions, and roles.
When an exception occurs in Region A, the TDMQ for Apache Pulsar console provides a domain name resolution switch feature. This feature redirects the domain name originally used in Region A to the disaster recovery cluster in Region B. Clients therefore do not need to modify their access point addresses: they reconnect to the Region B cluster through the same domain name, which keeps the business running.
Once the exception in Region A is resolved, users need to determine whether to write back the messages generated in Region B to Region A to ensure message integrity. If a write-back is needed, please contact our after-sales team for assistance. Afterward, users can switch the access point domain name resolution from the Region B cluster back to the Region A cluster. Once the switch is completed, clients can resume normal access to Region A.
Operation Guide
Configuring Disaster Recovery Features
1. In the backup region, create a professional cluster. On the cluster purchase page, enable the Cross-region Replication switch and select the cluster to be backed up.
2. Configure the cluster metadata synchronization linkage through the console:
Replication linkage name: Define a name for the synchronization linkage.
Linkage type: Select metadata.
Source cluster selection: Choose the Pulsar cluster to be backed up.
Target cluster selection: Select the pre-created disaster recovery cluster in a different region. Only clusters with the same cluster ID will be displayed.
Replication level: Choose between cluster-level and namespace-level replication.
Cluster level: Suitable for cold backups at the cluster level.
Namespace level: Suitable for scenarios where clusters in both regions are actively used, with different namespaces distributed across regions. The regions act as primary and backup for each other.
Establishing CCN
Use Cloud Connect Network to link the production region and the backup region, creating a network access channel. This ensures that, in the event of a disaster, clients in the production region can access the backup cluster across regions. For detailed configuration steps, see CCN Operation Guide.
When Disaster Occurs
Users can decide to switch client access to the backup region:
1. If the console is available: Initiate a domain name resolution switch via the console.
2. If the console is unavailable: Contact the after-sales architect to request a switch, which will be initiated by the TDMQ service.
After Disaster Recovery
Users can decide to switch client access back to the original region cluster:
1. Evaluate whether messages need to be written back to the original region. If write-back is required, contact our after-sales team for assistance.
2. Initiate a domain name switch-back via the console to restore normal client access to the original region.
Notes
1. Supported Scope
This feature is supported only in professional clusters.
2. Message Write-Back
Message write-back is a prerequisite assessment when switching traffic back to the original region. It aims to prevent data loss and ensure data integrity. Be sure to decide whether to perform a write-back before initiating the domain name switch-back.
User-provided information:
The list of Topics to be migrated, including details such as cluster ID, namespace, or specific Topic lists.
The start and end time. Messages sent within this time range, based on the publishTime field in the message header, will be identified as data to be migrated.
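To estimate which messages a given time range would cover before requesting a write-back, a check along the following lines may help. This is a minimal sketch, not the logic the TDMQ service runs: it assumes publishTime is expressed in milliseconds since the Unix epoch and that both ends of the range are inclusive. The function name in_write_back_window and the example timestamps are hypothetical.

```python
from datetime import datetime, timezone


def in_write_back_window(publish_time_ms, start, end):
    """Return True if a publish time (assumed ms since epoch) lies in [start, end].

    Inclusive bounds are an assumption made for this sketch; confirm the
    exact boundary behavior with the after-sales team.
    """
    published = datetime.fromtimestamp(publish_time_ms / 1000, tz=timezone.utc)
    return start <= published <= end


# Hypothetical window: traffic ran on the Region B cluster from 08:00 to 10:00 UTC.
start = datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
```

Using timezone-aware datetimes avoids ambiguity when the two regions or the operator's workstation use different local time zones.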
Impacts of message write-back:
A large number of duplicate messages may occur. The server does not account for the complex state machine of offset synchronization between the source and target clusters. All migrated messages are treated as new messages: even if identical messages already exist in the historical data, they are regarded as separate messages. If duplicate messages impact your business, it is recommended to implement idempotent processing on the client side.
A small number of messages may arrive out of order.
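Client-side idempotent processing can be sketched as follows. Because written-back messages are treated as new messages, this sketch deduplicates on a business key carried by the application (for example an order ID) rather than on the broker-assigned message ID. The class IdempotentHandler and the in-memory set are illustrative assumptions; a production consumer would typically keep processed keys in a durable store shared by all consumer instances.

```python
class IdempotentHandler:
    """Run a handler at most once per business key (illustrative sketch)."""

    def __init__(self, handler):
        self._handler = handler
        # In-memory only: processed keys are lost on restart.
        self._seen = set()

    def handle(self, business_key, payload):
        """Process payload unless business_key was already handled.

        Returns True if the handler ran, False if the message was a duplicate.
        """
        if business_key in self._seen:
            return False
        self._handler(payload)
        # Record the key only after the handler succeeds, so a failed
        # message can be retried when it is redelivered.
        self._seen.add(business_key)
        return True
```

The consumer would call handle with the business key and payload of each received message and acknowledge the message in either case, since a duplicate needs no further processing.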
3. About Roles
The source cluster should have at least one Role, which does not need to be bound to a namespace. This ensures that during synchronization, the Role and Token remain consistent with the disaster recovery cluster.
4. CCN Configuration
When you configure CCN, the VPC CIDRs of the two regions should not overlap. For example, use 10.0.0.0/16 for Guangzhou and 10.1.0.0/16 for Shanghai. This ensures that CCN can link the two VPCs without IP conflicts.
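Before linking the two VPCs, the CIDR plan can be checked locally. The following is a minimal sketch using Python's standard ipaddress module; the helper name cidrs_overlap is hypothetical, and the check covers only the address ranges passed in, not any other CCN constraint.

```python
import ipaddress


def cidrs_overlap(cidr_a, cidr_b):
    """Return True if the two CIDR blocks share at least one address."""
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


# The example ranges from this section do not overlap.
print(cidrs_overlap("10.0.0.0/16", "10.1.0.0/16"))    # False
# A range nested inside the other VPC's range would conflict.
print(cidrs_overlap("10.0.0.0/16", "10.0.128.0/17"))  # True
```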
5. Domain Name Switch Effectiveness Time
The domain name switch takes approximately 5 seconds to 5 minutes to become effective. This duration includes two parts: the domain name resolution switch itself, and the time clients need to disconnect and reconnect to the new cluster’s Broker.
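To observe when the resolution switch has taken effect from a client machine, one option is to poll DNS until the access point domain no longer resolves to the original region's addresses. The sketch below is an assumption-laden illustration rather than a TDMQ tool: the function names, the default port 6650, and the 5-second polling interval are all hypothetical choices, and local DNS caching can delay what a given machine sees. It covers only the resolution part of the delay, not client reconnection.

```python
import socket
import time


def resolve(host, port=6650):
    """Return the set of IP addresses the host currently resolves to."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return {info[4][0] for info in infos}


def wait_for_switch(host, old_addresses, timeout_s=300, interval_s=5,
                    resolver=resolve, sleep=time.sleep):
    """Poll until host resolves only to addresses outside old_addresses.

    Returns the new address set, or None if the timeout elapses first.
    The resolver and sleep parameters are injectable for testing.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        current = resolver(host)
        if current and current.isdisjoint(old_addresses):
            return current
        if time.monotonic() >= deadline:
            return None
        sleep(interval_s)
```

A caller would record the addresses returned by resolve before the switch and pass them as old_addresses afterward. The default timeout of 300 seconds matches the upper bound of 5 minutes stated above.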
6. Post-Switch Actions During a Disaster
After traffic is switched to the disaster recovery cluster during a disaster, avoid making metadata changes on the backup cluster, such as modifying namespace attributes or creating Topics.