Background
Kafka, a high-performance, highly reliable, distributed message queue system, is widely used in large-scale internet services by companies such as Tencent, Facebook, LinkedIn, Netflix, and Airbnb. However, in large-scale distributed systems, the unpredictability, complexity, and coupling of services often lead to unforeseen fault events. When a Kafka broker node goes down, the following faults may appear:
Data loss: If messages are being written to a broker when it goes down, data loss may occur. During this period, the producer may be unable to write messages to the affected partitions and have them copied to other replicas, resulting in message loss. (Producer settings that mitigate this are sketched after this list.)
Reduced availability: A down broker no longer processes requests, which may cause producer and consumer requests to time out. If multiple broker nodes go down, the availability of the cluster will be further reduced.
Increased delay: A down broker no longer processes requests, which may increase latency for producer and consumer requests. When requests time out and are retried, they may have to wait for responses from other nodes, resulting in longer delays.
Unbalanced leader election: If a down broker was the leader of a partition, a new leader must be elected. If the down broker is later restarted and its replicas were not removed before it exited, the leader distribution across brokers may become unbalanced.
Replica synchronization delay: If a down broker hosts one or more follower replicas, replica synchronization may be delayed. If this delay is large, producers and consumers may end up writing to or reading from outdated replicas.
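As noted in the data loss and delay items above, the producer-side impact of a broker going down can be reduced with standard Apache Kafka producer settings. The following is a minimal sketch in Java, assuming a hypothetical topic name test-topic and a placeholder bootstrap address; it uses only the standard kafka-clients producer API, not anything CKafka-specific.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ReliableProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder address; replace with your CKafka instance's access point.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "ckafka-broker:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Wait for all in-sync replicas to acknowledge, so a single broker
        // going down does not silently drop acknowledged messages.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Allow the client to retry while the partition leader is re-elected.
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.toString(Integer.MAX_VALUE));
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "120000");
        // Idempotence prevents duplicates introduced by those retries.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test-topic", "key", "value"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            // Retriable errors surface here only after retries are exhausted.
                            exception.printStackTrace();
                        }
                    });
            producer.flush();
        }
    }
}
```

With acks=all, an acknowledged message has already been replicated to the in-sync replicas, so a single broker failure does not lose it, and idempotence keeps the retries from creating duplicates.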
Tencent Cloud Message Queue CKafka (Cloud Kafka)
Based on the open-source Apache Kafka engine, Tencent Cloud Message Queue CKafka (Cloud Kafka) provides a message queue service with high throughput performance and high scalability. It is fully compatible with the APIs of Apache Kafka v0.9, v0.10, v1.1, v2.4, v2.8, and v3.2. It has significant strengths in performance, scalability, business security assurance, and Ops, allowing you to enjoy powerful features at a low cost while eliminating tedious Ops work.
Fault Principle
Fault effect: For the target CKafka instance, a single broker, or all brokers in a specified availability zone, go down and offline. This results in unsynchronized replicas in the instance.
CKafka Broker down logic: When a broker goes down (a replacement broker node is not started automatically), new leader replicas for the affected topic partitions are elected from the remaining brokers in the cluster. The replicas on the failed broker become unavailable, so the affected topics have one fewer replica than expected, and the replica count does not recover on its own. For CKafka, as long as one replica remains available, production and consumption proceed normally. After the fault is recovered, the replicas on the original broker are copied back from other broker nodes.
User-side risks of a CKafka Broker node going down: When the fault occurs, the leader replica switch usually completes within seconds. During this period, users may see retry warnings. Because the switch is very fast, message loss usually does not occur (the probability is extremely low: messages are written to the page cache and flushed to disk by an asynchronous thread; if data in the page cache has not yet been flushed to disk and the followers have not yet synchronized those messages from the leader replica, some messages may be lost). The per-partition leader and in-sync replica (ISR) set can also be inspected from the client side, as shown in the sketch below.
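To observe the leader switch and the unsynchronized replicas described above from the client side, the per-partition leader and ISR can be listed with the standard Kafka Admin API. A minimal sketch, assuming a hypothetical topic name test-topic, a placeholder bootstrap address, and kafka-clients 3.1 or later (older clients use all() instead of allTopicNames()):

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class IsrCheckSketch {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        // Placeholder address; replace with your CKafka instance's access point.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "ckafka-broker:9092");

        try (Admin admin = Admin.create(props)) {
            Map<String, TopicDescription> topics =
                    admin.describeTopics(Collections.singleton("test-topic")).allTopicNames().get();
            for (TopicDescription topic : topics.values()) {
                for (TopicPartitionInfo p : topic.partitions()) {
                    boolean underReplicated = p.isr().size() < p.replicas().size();
                    System.out.printf("partition=%d leader=%s isr=%d/%d%s%n",
                            p.partition(),
                            p.leader() == null ? "none" : p.leader().idString(),
                            p.isr().size(), p.replicas().size(),
                            underReplicated ? " (under-replicated)" : "");
                }
            }
        }
    }
}
```

During the fault, a partition whose leader was on the failed broker should show a new leader and an ISR smaller than its replica set; after recovery, the ISR grows back to the full replica set.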
Fault example diagram: Consider a CKafka instance with 4 broker nodes deployed across two availability zones. After Broker A goes down and offline, the CKafka instance selects the replica of Partition 0 located on Broker C as the new leader replica to continue providing messaging services.
Must-Knows
Instance type: This action is only open for fault injection on instances of the CKafka Pro Edition type.
Instance status: The instance used for experiments must contain valid data (i.e., existing topics).
Avoid frequent experiments in a short time: Frequent experiments can result in insufficient instance health data, making it difficult to determine whether issues were caused by the experiment or by other factors. This can lead to experiment requests being rejected and experiments failing. If an experiment fails, wait 15 minutes before retrying.
Do not overload the instance's production and consumption traffic during the experiment: During the broker down fault experiment, the bandwidth utilization of the remaining brokers (excluding the one that is down) must not exceed 80%.
Experiment Preparation
Prepare a CKafka Pro Edition instance for the experiment, with a certain amount of valid data. This document uses a cross-AZ disaster recovery instance (with 4 brokers) as an example.
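If the instance does not yet contain valid data, a test topic can be created and seeded with the standard Kafka clients. A minimal sketch, assuming a hypothetical topic name chaos-test-topic, a placeholder bootstrap address, and that the instance permits topic creation through the Kafka Admin API (topics can also be created in the CKafka Console):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SeedTopicSketch {
    public static void main(String[] args) throws Exception {
        String bootstrap = "ckafka-broker:9092"; // placeholder access point

        // Create a topic with 2 replicas so a leader switch can be observed
        // when one broker goes down. Fails if the topic already exists.
        Properties adminProps = new Properties();
        adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        try (Admin admin = Admin.create(adminProps)) {
            admin.createTopics(Collections.singleton(
                    new NewTopic("chaos-test-topic", 3, (short) 2))).all().get();
        }

        // Produce a few messages so the instance contains valid data.
        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            for (int i = 0; i < 100; i++) {
                producer.send(new ProducerRecord<>("chaos-test-topic", "key-" + i, "value-" + i));
            }
            producer.flush();
        }
    }
}
```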
Step 1: Create an experiment
2. In the left sidebar, select the Experiment Management page, and click Create a New Experiment.
3. Click Skip and create a blank experiment.
4. After filling in the basic information, enter the Experiment Object Configuration. Select Middleware > CKafka for the object type, and click Add Instance. After clicking Add Instance, all CKafka instances in the target region will be listed. You can filter instances based on Instance ID, VPC ID, Subnet ID, and Tags.
5. After you select the target instance, click Add Now to add the experiment action.
6. Select the experiment action Broker down (Professional Edition), and then click Next.
7. Set action parameters. In this document, the Specified availability zone downtime mode is selected. Select the target availability zone for injection, and click Confirm.
Note:
This fault action supports the following three injection methods:
1. Random single Broker downtime: This will randomly select one node from all the broker nodes in the CKafka instance to inject the fault.
2. Random availability zone downtime (Disaster recovery instance): This injection method is only supported for disaster recovery instances. It will randomly select one availability zone from the CKafka instance’s availability zones to inject the fault. After the fault is injected, all brokers in the selected availability zone will be down and go offline. If a non-disaster recovery instance is used, no fault will be injected, and the injection will fail.
3. Specified availability zone downtime (Disaster recovery instance): When you inject a fault into a specified availability zone, the fault will only be injected successfully if there is an instance in the specified availability zone. If no instance exists in the specified availability zone, the injection will fail. If you need to specify multiple availability zones for fault injection, it is recommended to split them into separate action groups.
8. Click Next to enter the Global Configuration. For global configuration, see Quick Start.
9. After confirmation, click Submit.
10. Click Experiment Details to start the experiment.
Step 2: Execute the experiment
1. Before the experiment, observe the instance's monitoring data in the CKafka Console, focusing on two key metrics: the broker survival rate and the number of unsynchronized replicas.
Note:
CKafka Console monitoring has a certain delay; data changes can be observed only after the fault injection has succeeded. The broker survival rate is collected every 5 minutes.
2. Because the experiment is executed manually, fault actions must be triggered manually. Click Execute in the action card to start fault injection, then wait for the injection to succeed.
3. Once the fault injection succeeds, click the action card to view the corresponding execution details and confirm that the fault has been injected. At this point, go to the CKafka Console to observe the data changes. The broker survival rate metric decreases, indicating that one broker is down.
At the same time, the number of unsynchronized replicas metric shows that unsynchronized replicas now exist.
4. To execute recovery, click Execute recovery action.
5. After the recovery is successful, continue to observe the metrics.
The broker survival rate metric should recover to 100%.
The number of unsynchronized replicas should recover to 0. A client-side cross-check is sketched below.
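The same Admin API used in the earlier ISR sketch can confirm the recovery from the client side by polling until every partition's ISR matches its full replica set, which corresponds to the number of unsynchronized replicas returning to 0. A minimal sketch under the same assumptions (hypothetical topic name chaos-test-topic, placeholder bootstrap address, kafka-clients 3.1+):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.TopicPartitionInfo;

public class WaitForRecoverySketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder address; replace with your CKafka instance's access point.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "ckafka-broker:9092");

        try (Admin admin = Admin.create(props)) {
            while (true) {
                // Recovered when every partition's ISR is back to the full replica set.
                boolean allInSync = admin
                        .describeTopics(Collections.singleton("chaos-test-topic"))
                        .allTopicNames().get().values().stream()
                        .flatMap(t -> t.partitions().stream())
                        .allMatch((TopicPartitionInfo p) -> p.isr().size() == p.replicas().size());
                if (allInSync) {
                    System.out.println("All replicas are back in sync.");
                    break;
                }
                System.out.println("Still waiting for replica synchronization...");
                Thread.sleep(10_000);
            }
        }
    }
}
```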