Overview
CKafka supports cross-availability zone deployment. When purchasing a CKafka instance in a region with three or more availability zones, you can select up to four availability zones for the Professional Edition and up to two for the Advanced Edition. The instance's partition replicas will be forcibly distributed across nodes in each selected availability zone. This deployment method enables your instance to continue providing services even if a single availability zone becomes unavailable.
Deployment Architecture
The cross-AZ deployment of CKafka consists of the network layer, the data layer, and the control layer.
Network Layer
CKafka exposes a VIP (virtual IP) to clients. After connecting to the VIP, clients obtain partition metadata (this metadata usually maps broker addresses one by one to different ports on the same VIP). The VIP can fail over to another AZ at any time: when an AZ becomes unavailable, the VIP automatically drifts to an available node in the same region, thereby achieving cross-AZ disaster recovery.
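As an illustration, the snippet below is a minimal sketch of fetching partition metadata through the VIP with the standard Java Kafka client. The access point `ckafka-vip.example:9092` and the topic name `test-topic` are placeholders; substitute the VIP, port, and topic shown in your CKafka console.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.serialization.StringSerializer;

public class VipMetadataProbe {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder access point; use the VIP and port shown in your CKafka console.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "ckafka-vip.example:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // partitionsFor() issues a metadata request through the VIP and returns
            // each partition's leader and replica layout.
            for (PartitionInfo p : producer.partitionsFor("test-topic")) {
                System.out.printf("partition=%d leader=%s replicas=%d in-sync=%d%n",
                        p.partition(), p.leader(), p.replicas().length, p.inSyncReplicas().length);
            }
        }
    }
}
```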
Data Layer
The data layer of CKafka uses the same distributed deployment method as native Kafka: multiple data replicas are distributed across different broker nodes, and different nodes are deployed in different AZs. For each partition, there is a leader-follower relationship among the nodes holding its replicas. When the leader encounters an exception and goes offline, the cluster control node (Controller) elects a new partition leader to handle requests for that partition.
For clients, when an AZ becomes unavailable and the leader of a topic partition is located on a broker node in that AZ, previously established connections will time out or be closed. After the partition's leader node fails, the Controller (if the Controller node itself fails, the remaining nodes elect a new Controller) elects a new leader node to provide services. The leader switch takes seconds (the exact switch time is proportional to the number of cluster nodes and the size of the metadata). Clients periodically refresh topic partition metadata and connect to the new leader node for production and consumption.
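To ride through the seconds-level leader switch, producer clients are usually configured to retry and to refresh metadata promptly. The following is a sketch of such settings with the standard Java producer; the access point is a placeholder and the values are illustrative, to be tuned to your workload.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class FailoverTolerantProducer {
    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "ckafka-vip.example:9092"); // placeholder access point
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Keep retrying across the leader switch instead of failing the first send.
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        // Total time budget for one send() including retries; should comfortably
        // cover the seconds-level leader switch described above.
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000);
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 30_000);
        // Refresh topic metadata periodically so the new leader is discovered quickly.
        props.put(ProducerConfig.METADATA_MAX_AGE_CONFIG, 60_000);

        return new KafkaProducer<>(props);
    }
}
```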
Control Layer
The control layer of CKafka adopts the same technical solution as native Kafka and depends on ZooKeeper for service discovery and cluster Controller election among broker nodes. For CKafka instances that support cross-availability zone deployment, the ZooKeeper cluster nodes (hereinafter referred to as zk nodes) are deployed in three AZs (or IDCs). Even if the zk nodes in any one AZ fail and disconnect, the entire ZooKeeper cluster can still provide services normally.
Advantages and Disadvantages of Cross-AZ Deployment
Advantage
Cross-AZ deployment significantly enhances the cluster's disaster recovery capability. When unexpected risks such as network instability or a power-off restart occur in a single AZ, clients can resume message production and consumption after a short reconnection wait.
Disadvantage
If cross-AZ deployment is adopted, since partition replicas are distributed across multiple availability zones, message replication incurs additional cross-AZ network latency compared with a single availability zone. This latency directly affects the write time observed by producer clients (when the client acks parameter is greater than 1, or equal to -1/all). Currently, the cross-AZ latency in major regions such as Guangzhou, Shanghai, and Beijing is generally 10 ms to 40 ms.
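The effect of the acks setting on this latency shows up directly in the producer configuration. The lines below are a sketch extending the `props` object from the producer sketch above; choose one of the two acks settings depending on whether write latency or durability matters more.

```java
// acks=1: the leader acknowledges after its local write; cross-AZ replication
// happens asynchronously, so produce latency stays close to the single-AZ case.
props.put(ProducerConfig.ACKS_CONFIG, "1");

// acks=all (equivalent to -1): the write is acknowledged only after
// min.insync.replicas replicas have it, so each produce request pays the
// cross-AZ round trip (roughly the 10 ms to 40 ms mentioned above).
props.put(ProducerConfig.ACKS_CONFIG, "all");

// Batching can amortize the extra latency over many messages.
props.put(ProducerConfig.LINGER_MS_CONFIG, 5);
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);
```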
Cross-AZ Deployment Scenario Analysis
Single AZ Unavailability
After a single AZ becomes unavailable, as explained in the deployment architecture above, clients will experience disconnection and reconnection, and service resumes normally after reconnection.
Because the management and control API service does not currently support cross-availability zone deployment, after one AZ becomes unavailable you may be unable to create a topic, configure an ACL policy, or view monitoring data in the console, but production and consumption of existing services will not be affected.
Network Isolation between Two AZs
If network isolation occurs between two AZs, that is, they cannot communicate with each other, a cluster split-brain event may occur: nodes in both availability zones continue to provide services, but the data written to one of the availability zones will be treated as dirty data after the cluster recovers.
Consider the following scenario: the cluster Controller node and one zk node of the ZooKeeper cluster are network-isolated from the other nodes. The other nodes will then elect a new Controller (since the majority of the zk nodes can still communicate normally, the election succeeds), but the isolated Controller still considers itself the Controller node. At this point, a split-brain situation may occur in the cluster.
In this situation, client writes need to be considered case by case. For example, suppose the cluster has 3 nodes, the client acks setting is -1 (all), and the number of replicas is 2. After the split-brain, the nodes are split 2:1. Writes to partitions whose original leader is on the isolated single-node side will report an error, while writes on the other side succeed. However, with 3 replicas and acks = -1 (all) configured, neither side can write successfully. The specific handling approach must then be determined based on the actual parameter configuration.
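When acks = -1 (all) cannot be satisfied, the producer surfaces this as a send failure. Below is a minimal sketch of handling it on the client, continuing with the `producer` and topic placeholders from the sketches above; the retry/fallback strategy in the comments is an assumption, not CKafka-specific behavior.

```java
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.NotEnoughReplicasException;
import org.apache.kafka.common.errors.TimeoutException;

ProducerRecord<String, String> record = new ProducerRecord<>("test-topic", "key", "value");
producer.send(record, (metadata, exception) -> {
    if (exception == null) {
        return; // acknowledged by at least min.insync.replicas replicas
    }
    if (exception instanceof NotEnoughReplicasException
            || exception instanceof TimeoutException) {
        // The partition cannot currently satisfy min.insync.replicas
        // (e.g. during the split-brain window); back off and retry later,
        // or buffer the message locally until the cluster recovers.
        System.err.println("Produce not acknowledged, will retry: " + exception.getMessage());
    } else {
        exception.printStackTrace();
    }
});
```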
After the cluster network recovers, clients can resume production and consumption without any action. However, since the server will normalize the data again, the data on one of the split sides will be truncated directly. With the multi-replica, cross-AZ storage method, this kind of truncation does not cause data loss.
Multi-AZ Disaster Recovery Limitations
Capacity Constraint
CKafka achieves multi-AZ disaster recovery through the multi-AZ distribution of underlying resources. When an AZ fails, CKafka switches partition leaders to broker nodes in the available AZs. The underlying resources carrying the traffic therefore become fewer, which raises capacity issues. The following is the resource availability in case of a single AZ failure (see the calculation sketch after the list):
2-AZ disaster recovery instance: usable capacity = instance capacity / 2. To ensure normal usage, customers need 100% redundant resources.
3-AZ disaster recovery instance: usable capacity = instance capacity / 3 * 2. To ensure normal usage, customers need 50% redundant resources.
4-AZ disaster recovery instance: usable capacity = instance capacity / 4 * 3. To ensure normal usage, customers need 33.3% redundant resources.
n-AZ disaster recovery instance: usable capacity = instance capacity / n * (n - 1). To ensure normal usage, customers need 1 / (n - 1) redundant resources.
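The rule in the list above can be expressed as a small calculation. The sketch below is only an illustration of that arithmetic for estimating remaining capacity after a single-AZ failure and the redundancy to reserve; the example numbers are hypothetical.

```java
public class AzCapacity {
    /** Usable capacity after losing one AZ: instance capacity * (n - 1) / n. */
    static double usableCapacityAfterAzFailure(double instanceCapacity, int azCount) {
        return instanceCapacity * (azCount - 1) / azCount;
    }

    /** Redundancy to reserve so the remaining AZs absorb the full workload: 1 / (n - 1). */
    static double requiredRedundancyRatio(int azCount) {
        return 1.0 / (azCount - 1); // 2 AZs -> 100%, 3 AZs -> 50%, 4 AZs -> 33.3%
    }

    public static void main(String[] args) {
        // Hypothetical example: a 3-AZ instance with 1200 MB/s of purchased bandwidth
        // keeps 800 MB/s after one AZ fails, so 50% redundancy is needed up front.
        System.out.println(usableCapacityAfterAzFailure(1200, 3)); // 800.0
        System.out.println(requiredRedundancyRatio(3));            // 0.5
    }
}
```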
Parameters and Configuration Limits
CKafka supports modifying the topic-level configuration `min.insync.replicas` in the console. This configuration takes effect when the client sets acks=-1. It ensures that a message is returned as successfully produced only after it has been synchronized by at least `min.insync.replicas` replicas (for example, if the topic has 3 replicas and min.insync.replicas=2, a produced message must be synchronized by at least 2 replicas to be considered successful).
Therefore, the topic must be configured with `min.insync.replicas` < number of AZs to ensure production availability when an AZ fails, and with `min.insync.replicas` < number of topic replicas to ensure production availability when a single node fails. The setting rule is: `min.insync.replicas` < number of topic replicas <= number of AZs. Recommended combinations are listed below (see the sketch after the list).
2-AZ instance, topic replica quantity = 2, min.insync.replicas = 1
3-AZ instance, topic replica quantity = 2, min.insync.replicas = 1
3-AZ instance, topic replica quantity = 3, min.insync.replicas <= 2
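As an illustration of the last combination above, the sketch below creates a topic with 3 replicas and min.insync.replicas=2 through the standard Kafka AdminClient. This assumes your instance permits topic creation via the API; CKafka topics are typically managed in the console, where the same replica count and `min.insync.replicas` values can be set. The access point and topic name are placeholders.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateDisasterRecoveryTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "ckafka-vip.example:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // 3-AZ instance: 3 replicas spread over the 3 AZs; min.insync.replicas=2
            // keeps acks=all writes available when one AZ (one replica) is lost.
            NewTopic topic = new NewTopic("orders-topic", 6, (short) 3)
                    .configs(Map.of("min.insync.replicas", "2"));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```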