Overview
CKafka supports cross-availability zone deployment. When purchasing a CKafka instance in a region with three or more availability zones, you can select up to four availability zones for the Professional Edition and up to two for the Advanced Edition. The instance's partition replicas will be forcibly distributed across nodes in each selected availability zone. This deployment method enables your instance to continue providing services even if a single availability zone becomes unavailable.
Deployment Architecture
The cross-AZ deployment of CKafka consists of the network layer, the data layer, and the control layer.
Network Layer
CKafka exposes a VIP (virtual IP) to clients. After connecting to the VIP, clients obtain partition metadata (this metadata usually maps broker addresses one by one to different ports on the same VIP). The VIP can fail over to another AZ at any time: when an AZ becomes unavailable, the VIP automatically drifts to an available node in the same region, thereby achieving cross-AZ disaster recovery.
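As an illustration, the snippet below is a minimal sketch of fetching partition metadata through the VIP with the standard Java Kafka client. The access point `ckafka-vip.example:9092` and the topic name `test-topic` are placeholders; substitute the VIP, port, and topic shown in your CKafka console.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.serialization.StringSerializer;

public class VipMetadataProbe {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder access point; use the VIP and port shown in your CKafka console.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "ckafka-vip.example:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // partitionsFor() issues a metadata request through the VIP and returns
            // each partition's leader and replica layout.
            for (PartitionInfo p : producer.partitionsFor("test-topic")) {
                System.out.printf("partition=%d leader=%s replicas=%d in-sync=%d%n",
                        p.partition(), p.leader(), p.replicas().length, p.inSyncReplicas().length);
            }
        }
    }
}
```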
Data Layer
The data layer of CKafka uses the same distributed deployment method as native Kafka: multiple data replicas are distributed across different broker nodes, and different nodes are deployed in different AZs. For each partition, there is a leader-follower relationship among the nodes holding its replicas. When the leader encounters an exception and goes offline, the cluster control node (Controller) elects a new partition leader to handle requests for that partition.
For clients, when an AZ becomes unavailable and the leader of a topic partition is located on a broker node in that AZ, previously established connections will time out or be closed. After the partition's leader node fails, the Controller (if the Controller node itself fails, the remaining nodes elect a new Controller) elects a new leader node to provide services. The leader switch takes seconds (the exact switch time is proportional to the number of cluster nodes and the size of the metadata). Clients periodically refresh topic partition metadata and connect to the new leader node for production and consumption.
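To ride through the seconds-level leader switch, producer clients are usually configured to retry and to refresh metadata promptly. The following is a sketch of such settings with the standard Java producer; the access point is a placeholder and the values are illustrative, to be tuned to your workload.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class FailoverTolerantProducer {
    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "ckafka-vip.example:9092"); // placeholder access point
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Keep retrying across the leader switch instead of failing the first send.
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        // Total time budget for one send() including retries; should comfortably
        // cover the seconds-level leader switch described above.
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000);
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 30_000);
        // Refresh topic metadata periodically so the new leader is discovered quickly.
        props.put(ProducerConfig.METADATA_MAX_AGE_CONFIG, 60_000);

        return new KafkaProducer<>(props);
    }
}
```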
Control Layer
The control layer of CKafka adopts the same technical solution as native Kafka and depends on ZooKeeper for service discovery and cluster Controller election among broker nodes. For CKafka instances that support cross-availability zone deployment, the ZooKeeper cluster nodes (hereinafter referred to as zk nodes) are deployed in three AZs (or IDCs). Even if the zk nodes in any one AZ fail and disconnect, the entire ZooKeeper cluster can still provide services normally.
Advantages and Disadvantages of Cross-AZ Deployment
Advantage
Cross-AZ deployment significantly enhances the cluster's disaster recovery capability. When unexpected risks such as network instability or a power-off restart occur in a single AZ, clients can resume message production and consumption after a short reconnection wait.
Disadvantage
If cross-AZ deployment is adopted, since partition replicas are distributed across multiple availability zones, message replication incurs additional cross-AZ network latency compared with a single availability zone. This latency directly affects the write time observed by producer clients (when the client acks parameter is greater than 1, or equal to -1/all). Currently, the cross-AZ latency in major regions such as Guangzhou, Shanghai, and Beijing is generally 10 ms to 40 ms.
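The effect of the acks setting on this latency shows up directly in the producer configuration. The lines below are a sketch extending the `props` object from the producer sketch above; choose one of the two acks settings depending on whether write latency or durability matters more.

```java
// acks=1: the leader acknowledges after its local write; cross-AZ replication
// happens asynchronously, so produce latency stays close to the single-AZ case.
props.put(ProducerConfig.ACKS_CONFIG, "1");

// acks=all (equivalent to -1): the write is acknowledged only after
// min.insync.replicas replicas have it, so each produce request pays the
// cross-AZ round trip (roughly the 10 ms to 40 ms mentioned above).
props.put(ProducerConfig.ACKS_CONFIG, "all");

// Batching can amortize the extra latency over many messages.
props.put(ProducerConfig.LINGER_MS_CONFIG, 5);
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);
```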
Cross-AZ Deployment Scenario Analysis
Single AZ Unavailability
After a single AZ becomes unavailable, as explained in the deployment architecture above, clients will experience disconnection and reconnection, and service resumes normally after reconnection.
Because the management and control API service does not currently support cross-availability zone deployment, after one AZ becomes unavailable you may be unable to create a topic, configure an ACL policy, or view monitoring data in the console, but production and consumption of existing services will not be affected.
Network Isolation between Two AZs
If network isolation occurs between two AZs, that is, they cannot communicate with each other, a cluster split-brain event may occur: nodes in both availability zones continue to provide services, but the data written to one of the availability zones will be treated as dirty data after the cluster recovers.
Consider the following scenario: the cluster Controller node and one zk node of the ZooKeeper cluster are network-isolated from the other nodes. The other nodes will then elect a new Controller (since the majority of the zk nodes can still communicate normally, the election succeeds), but the isolated Controller still considers itself the Controller node. At this point, a split-brain situation may occur in the cluster.
In this situation, client writes need to be considered case by case. For example, suppose the cluster has 3 nodes, the client acks setting is -1 (all), and the number of replicas is 2. After the split-brain, the nodes are split 2:1. Writes to partitions whose original leader is on the isolated single-node side will report an error, while writes on the other side succeed. However, with 3 replicas and acks = -1 (all) configured, neither side can write successfully. The specific handling approach must then be determined based on the actual parameter configuration.
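When acks = -1 (all) cannot be satisfied, the producer surfaces this as a send failure. Below is a minimal sketch of handling it on the client, continuing with the `producer` and topic placeholders from the sketches above; the retry/fallback strategy in the comments is an assumption, not CKafka-specific behavior.

```java
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.NotEnoughReplicasException;
import org.apache.kafka.common.errors.TimeoutException;

ProducerRecord<String, String> record = new ProducerRecord<>("test-topic", "key", "value");
producer.send(record, (metadata, exception) -> {
    if (exception == null) {
        return; // acknowledged by at least min.insync.replicas replicas
    }
    if (exception instanceof NotEnoughReplicasException
            || exception instanceof TimeoutException) {
        // The partition cannot currently satisfy min.insync.replicas
        // (e.g. during the split-brain window); back off and retry later,
        // or buffer the message locally until the cluster recovers.
        System.err.println("Produce not acknowledged, will retry: " + exception.getMessage());
    } else {
        exception.printStackTrace();
    }
});
```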
After the cluster network recovers, clients can resume production and consumption without any action. However, since the server will normalize the data again, the data on one of the split sides will be truncated directly. With the multi-replica, cross-AZ storage method, this kind of truncation does not cause data loss.
Multi-AZ Disaster Recovery Limitations
Capacity Constraint
CKafka achieves multi-AZ disaster recovery through the multi-AZ distribution of underlying resources. When an AZ fails, CKafka switches partition leaders to broker nodes in the available AZs. The underlying resources carrying the traffic therefore become fewer, which raises capacity issues. The following is the resource availability in case of a single AZ failure (see the calculation sketch after the list):
2-AZ disaster recovery instance: usable capacity = instance capacity / 2. To ensure normal usage, customers need 100% redundant resources.
3-AZ disaster recovery instance: usable capacity = instance capacity / 3 * 2. To ensure normal usage, customers need 50% redundant resources.
4-AZ disaster recovery instance: usable capacity = instance capacity / 4 * 3. To ensure normal usage, customers need 33.3% redundant resources.
n-AZ disaster recovery instance: usable capacity = instance capacity / n * (n - 1). To ensure normal usage, customers need 1 / (n - 1) redundant resources.
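The rule in the list above can be expressed as a small calculation. The sketch below is only an illustration of that arithmetic for estimating remaining capacity after a single-AZ failure and the redundancy to reserve; the example numbers are hypothetical.

```java
public class AzCapacity {
    /** Usable capacity after losing one AZ: instance capacity * (n - 1) / n. */
    static double usableCapacityAfterAzFailure(double instanceCapacity, int azCount) {
        return instanceCapacity * (azCount - 1) / azCount;
    }

    /** Redundancy to reserve so the remaining AZs absorb the full workload: 1 / (n - 1). */
    static double requiredRedundancyRatio(int azCount) {
        return 1.0 / (azCount - 1); // 2 AZs -> 100%, 3 AZs -> 50%, 4 AZs -> 33.3%
    }

    public static void main(String[] args) {
        // Hypothetical example: a 3-AZ instance with 1200 MB/s of purchased bandwidth
        // keeps 800 MB/s after one AZ fails, so 50% redundancy is needed up front.
        System.out.println(usableCapacityAfterAzFailure(1200, 3)); // 800.0
        System.out.println(requiredRedundancyRatio(3));            // 0.5
    }
}
```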
Parameters and Configuration Limits
CKafka supports modifying the topic-level configuration `min.insync.replicas` in the console. This configuration takes effect when the client sets acks=-1. It ensures that a message is returned as successfully produced only after it has been synchronized by at least `min.insync.replicas` replicas (for example, if the topic has 3 replicas and min.insync.replicas=2, a produced message must be synchronized by at least 2 replicas to be considered successful).
Therefore, the topic must be configured with `min.insync.replicas` < number of AZs to ensure production availability when an AZ fails, and with `min.insync.replicas` < number of topic replicas to ensure production availability when a single node fails. The setting rule is: `min.insync.replicas` < number of topic replicas <= number of AZs. Recommended combinations are listed below (see the sketch after the list).
2-AZ instance, topic replica quantity = 2, min.insync.replicas = 1
3-AZ instance, topic replica quantity = 2, min.insync.replicas = 1
3-AZ instance, topic replica quantity = 3, min.insync.replicas <= 2
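As an illustration of the last combination above, the sketch below creates a topic with 3 replicas and min.insync.replicas=2 through the standard Kafka AdminClient. This assumes your instance permits topic creation via the API; CKafka topics are typically managed in the console, where the same replica count and `min.insync.replicas` values can be set. The access point and topic name are placeholders.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateDisasterRecoveryTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "ckafka-vip.example:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // 3-AZ instance: 3 replicas spread over the 3 AZs; min.insync.replicas=2
            // keeps acks=all writes available when one AZ (one replica) is lost.
            NewTopic topic = new NewTopic("orders-topic", 6, (short) 3)
                    .configs(Map.of("min.insync.replicas", "2"));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```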