This document describes the factors that affect the reliability of TDMQ for CKafka on the producer side, the server side (CKafka), and the consumer side, and provides corresponding solutions.
How to Handle Data Loss on the Production Side?
Causes of Data Loss
When the producer sends data to CKafka, the data may be lost due to network jitter, in which case CKafka never receives the data. Possible situations include:
The network load is high or the disk is busy, and the producer has no retry mechanism.
The disk usage exceeds the purchased specification limit. For example, if the instance's disk specification is 9000 GB and the disk is not scaled out in time after it becomes full, data cannot be written to CKafka.
Sudden or continuously growing peak traffic exceeds the purchased specification limit. For example, if the instance's peak throughput specification is 100 MB/s and the peak throughput exceeds this limit for a long time without timely scale-out, writes to CKafka will slow down. If the producer has a queue timeout mechanism, data cannot be written to CKafka.
Solution
Enable the failure retry mechanism on the producer for important business data.
For disk usage, configure a monitoring and alert policy when setting up the instance so that you can take preventive measures. When the disk is full, you can upgrade in the console in time (upgrades between non-exclusive CKafka instances are smooth and do not interrupt the service, and the disk can be upgraded separately), or reduce disk usage by shortening the message retention time.
To minimize message loss on the producer side, you can tune the buffer size through buffer.memory and batch.size (in bytes). Note that a larger buffer is not always better: if the producer goes down for some reason, the more data there is in the buffer, the more garbage needs to be collected, and the slower the recovery will be. You should also pay attention to the number of messages produced by the producer, the average message size, and so on (CKafka monitoring provides a wide range of metrics).
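For example, these buffer parameters can be set when constructing the producer. The following is a minimal sketch with the Kafka Java client; the broker address is a placeholder, and the specific values (here the client defaults) should be adjusted to your own message size and traffic:
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "ckafka-xxx:9092"); // placeholder address
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
// buffer.memory: total bytes the producer may use to buffer records waiting to be sent
props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 33554432L);
// batch.size: maximum bytes per batch per partition; larger batches raise throughput but hold more data in memory
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16384);
KafkaProducer<String, String> producer = new KafkaProducer<>(props);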
Configure the ACK on the producer: when the producer sends data to the leader, the level of data reliability can be set through the request.required.acks parameter together with min.insync.replicas.
When acks = 1 (default value), the producer can continue sending the next message after the leader in the ISR has successfully received the data. If the leader goes down, data loss may occur because the data may not yet have been synchronized to the followers.
When acks = 0, the producer sends the next message without waiting for confirmation from the broker. In such cases, the data transmission efficiency is highest, but the data reliability is lowest.
Note:
When the production side is configured with acks = 0, if the current instance is rate limited, the server will proactively close the connection with the client in order to keep providing services properly.
When acks = -1 or all, the producer needs to wait for all followers in the ISR to confirm receipt of the message before sending the next one; this provides the highest reliability. Even with this ACK configuration, it cannot be guaranteed that data will never be lost. For example, when only the leader is left in the ISR (the number of ISR members may increase or decrease under certain circumstances, and in the extreme case only the leader remains), the situation degrades to acks = 1. Therefore, you also need to use the min.insync.replicas parameter (which can be configured in the console under Topic Configuration > Enable Advanced Configuration). min.insync.replicas indicates the minimum number of replicas in the ISR; the default value is 1, and it takes effect only when acks = -1 or all.
Recommended Parameter Values
These parameter values are for reference only. The actual values depend on your business situation.
Retry mechanism: message.send.max.retries=3; retry.backoff.ms=10000;
Guaranteed high reliability: request.required.acks=-1; min.insync.replicas=2;
Guaranteed high performance: request.required.acks=0;
Balanced reliability and performance: request.required.acks=1;
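For reference, the high-reliability combination above maps roughly to the following producer settings. This is a minimal sketch assuming the new Kafka Java producer, whose equivalents of the parameters listed above are acks, retries, and retry.backoff.ms; the broker address is a placeholder:
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "ckafka-xxx:9092"); // placeholder address
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
// acks=all (equivalent to -1): wait for all ISR replicas; use "1" or "0" for the other combinations above
props.put(ProducerConfig.ACKS_CONFIG, "all");
// Retry mechanism: number of retries and the interval between retries
props.put(ProducerConfig.RETRIES_CONFIG, 3);
props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 10000);
KafkaProducer<String, String> producer = new KafkaProducer<>(props);
Note that min.insync.replicas=2 is a topic-level setting configured on the server side (for example in the console), and it only takes effect together with acks = -1 or all.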
How to Handle Data Loss on the Server (CKafka)?
Causes of Data Loss
The leader of a partition goes down before the follower replicas finish replicating its data. Even if a new leader is elected, the data that was not replicated in time is lost.
Open-source Kafka writes data to disk asynchronously: data is first stored in the PageCache. If the broker goes down due to power failure, restart, or other faults before the data has actually been flushed to disk, the data in the PageCache is lost.
Disk failure causes data that has already been written to disk to be lost.
Solution
Open-source Kafka uses multiple replicas, and the official recommendation is to rely on replication to ensure data integrity. With multiple replicas, data is lost only when multiple replicas on multiple brokers fail at the same time, so reliability is far higher than with a single replica. Therefore, TDMQ for CKafka recommends using two replicas for topics, and three replicas can also be configured.
The CKafka service configures more reasonable values for the log.flush.interval.messages and log.flush.interval.ms parameters that control how data is flushed to disk.
CKafka performs special processing on disk, and data reliability is not affected when part of the disk is damaged.
Recommended Parameter Values
Whether a replica that is not fully in sync can be elected as leader: unclean.leader.election.enable=false // disabled
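To illustrate the server-side recommendations, the replica count and the parameters above can also be expressed as topic-level settings when a topic is created. The following is a minimal sketch with the Kafka Java AdminClient, assuming topic creation through the Admin API is allowed for your instance (in practice these options can also be set in the console); the address, topic name, and partition count are placeholders, and exception handling is omitted:
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

Properties props = new Properties();
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "ckafka-xxx:9092"); // placeholder address
try (AdminClient admin = AdminClient.create(props)) {
    // 3 partitions, 3 replicas (2 replicas is the recommended minimum for topics)
    NewTopic topic = new NewTopic("reliability-demo", 3, (short) 3);
    Map<String, String> configs = new HashMap<>();
    configs.put("min.insync.replicas", "2");                // works together with acks=-1/all on the producer
    configs.put("unclean.leader.election.enable", "false"); // never elect an out-of-sync replica as leader
    topic.configs(configs);
    admin.createTopics(Collections.singletonList(topic)).all().get(); // blocks until the topic is created
}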
How to Handle Data Loss on the Consumer Side?
Causes of Data Loss
The offset is committed before the data is actually consumed. If the consumer crashes during processing but the offset has already been committed, the consumer misses that data, and you need to reset the offset of the consumer group to retrieve it.
The consumption rate lags behind the production rate for too long while the message retention time is too short, so messages are deleted upon expiration before they can be consumed.
Solution
Reasonably configure the parameter auto.commit.enable. When it is set to true, offsets are committed automatically. It is recommended to commit offsets at scheduled intervals to avoid overly frequent offset commits; see the sketch below.
Monitor the consumption status and adjust the data retention time appropriately. Monitor the current consumption offset and the number of unconsumed messages, and configure alarms to prevent messages from being deleted upon expiration because consumption is too slow.
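A common pattern for the first point is to disable automatic commit and commit offsets only after the messages have actually been processed, so a crash at most causes re-consumption rather than loss. The following is a minimal sketch with the Kafka Java consumer (in the new Java client the parameter corresponding to auto.commit.enable is enable.auto.commit); the address, group ID, and topic are placeholders:
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "ckafka-xxx:9092"); // placeholder address
props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");               // placeholder consumer group
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // turn off automatic offset commit

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("demo-topic"));       // placeholder topic
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            // process(record): handle the message first ...
        }
        consumer.commitSync(); // ... then commit; committing once per poll batch avoids overly frequent commits
    }
}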
Troubleshooting Plan for Data Loss
Print the Partition and Offset Locally for Troubleshooting
The code for printing this information is as follows:
// Send the message and block until the broker returns the result
Future<RecordMetadata> future = producer.send(new ProducerRecord<>(topic, messageKey, messageStr));
RecordMetadata recordMetadata = future.get();
// On success, the returned metadata contains the partition and offset of the stored message
log.info("partition: {}", recordMetadata.partition());
log.info("offset: {}", recordMetadata.offset());
If the partition and offset can be printed, the sent messages have been correctly saved on the server. You can then use the message query tool to query the messages at the corresponding positions.
If the partition and offset cannot be printed, the message has not been saved by the server and the client needs to retry; a callback-based variant is sketched below.
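As a variant of the snippet above, the same check can be done asynchronously with a send callback, so that failures are logged and can be resent by the business layer. This sketch assumes the same producer, topic, messageKey, messageStr, and log objects as in the snippet above:
producer.send(new ProducerRecord<>(topic, messageKey, messageStr), (metadata, exception) -> {
    if (exception != null) {
        // The server did not save the message; record it so that the business layer can retry
        log.error("send failed, retry needed", exception);
    } else {
        log.info("partition: {}, offset: {}", metadata.partition(), metadata.offset());
    }
});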