This document describes the factors that affect the reliability of CKafka from the perspectives of the producer, the server (CKafka), and the consumer, respectively, and provides corresponding solutions.
What should I do if data gets lost on the producer?
Causes of data loss
When the producer sends data to CKafka, the data may get lost due to network jitter, in which case CKafka never receives it. Other possible causes are as follows:
The network load is high or the disk is busy, and the producer does not have a retry mechanism.
The purchased disk capacity is exceeded. For example, if the disk capacity of an instance is 9,000 GB and it is not expanded promptly after being used up, data cannot be written to CKafka.
Sudden or continuously increasing peak traffic exceeds the purchased peak throughput. For example, if the peak throughput of the instance is 100 MB/sec and it is not scaled up promptly after the limit has been exceeded for a long period of time, writes to CKafka will slow down. In this case, if the producer has a queuing timeout mechanism in place, data cannot be written to CKafka.
Solutions
Enable the retry mechanism on the producer for important data.
When the disk capacity is used up, upgrade the instance promptly in the console. Upgrading CKafka instances of the Standard Edition does not interrupt the service, and the disk capacity can be expanded separately. You can also shorten the message retention period to reduce disk usage.
To minimize message loss on the producer, you can fine-tune the buffer size by using the buffer.memory and batch.size (in bytes) parameters. A larger buffer is not necessarily better: if the producer fails for any reason, more data in the buffer means more garbage to collect, which slows down recovery. Pay close attention to the number of messages produced and the average message size (through the rich set of monitoring metrics available in CKafka). A configuration sketch is provided below.
Configure acknowledgment (ACK) for the producer.
When the producer sends data to the leader, it can set the data reliability level by using the request.required.acks and min.insync.replicas parameters.
When acks = 1 (default value), once the leader in the ISR has successfully received the message sent by the producer, the next message can be sent. If the leader goes down, any data not yet synced to its followers will be lost.
When acks = 0, the producer sends the next message without waiting for acknowledgment from the broker. In this case, data transfer efficiency is the highest, but data reliability is the lowest.
Note:
When the producer is configured with acks = 0, if the current instance is throttled, the server will proactively close the connection to the client so that it can continue to provide services normally.
When acks = -1 or acks = all, the producer needs to wait for the acknowledgment of message receipt from all the followers in the ISR before sending the next message, which ensures the highest reliability.
Even if acks is configured as above, there is no guarantee that data will never get lost. For example, when there is only one leader in the ISR (the number of members in the ISR may increase or decrease in certain circumstances, and in some cases only the leader is left), acks = -1 effectively degenerates to acks = 1. Therefore, you also need to configure the min.insync.replicas parameter in the CKafka console by enabling the advanced configuration in Topic Management > Edit Topic. This parameter specifies the minimum number of replicas in the ISR, and its default value is 1. It only takes effect when acks = -1 or acks = all.
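The following is a minimal configuration sketch, assuming the Kafka Java producer client; the broker address and the specific values are illustrative only, and min.insync.replicas is set at the topic level in the CKafka console rather than on the producer:
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

Properties props = new Properties();
props.put("bootstrap.servers", "ckafka-xxx:9092");  // placeholder address
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
// Wait for acknowledgment from all replicas in the ISR (highest reliability)
props.put("acks", "all");
// Retry failed sends instead of dropping them
props.put("retries", 3);
props.put("retry.backoff.ms", 10000);
// Producer buffer tuning; a larger buffer is not necessarily better
props.put("buffer.memory", 33554432);  // 32 MB, the client default
props.put("batch.size", 16384);        // 16 KB, the client default
KafkaProducer<String, String> producer = new KafkaProducer<>(props);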
Recommended parameter values
These parameter values are for reference only; the appropriate values depend on the actual conditions of your business. A sketch of handling send acknowledgments in code follows this list.
Retry mechanism: message.send.max.retries=3;retry.backoff.ms=10000;
Guarantee of high reliability: request.required.acks=-1;min.insync.replicas=2;
Guarantee of high performance: request.required.acks=0;
Reliability + performance: request.required.acks=1;
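As a rough illustration (not CKafka-specific code, and assuming the Kafka Java producer client), the per-message send result can also be checked through a callback, so that sends that still fail after the configured retries are noticed rather than silently lost; topic, messageKey, and messageStr are placeholders, as in the troubleshooting snippet at the end of this page:
producer.send(new ProducerRecord<>(topic, messageKey, messageStr), (metadata, exception) -> {
    if (exception != null) {
        // The send ultimately failed (after any configured retries); log and handle it
        log.error("send failed", exception);
    } else {
        // The broker acknowledged the message according to the configured acks level
        log.info("partition: {}, offset: {}", metadata.partition(), metadata.offset());
    }
});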
What should I do if data gets lost on the broker (CKafka)?
Causes of data loss
The partition's leader goes down before the followers finish replicating the data. Even if a new leader is elected, the data will be lost because it has not been replicated yet.
Open-source Kafka stores data to disk asynchronously: data is first written to PageCache and persisted later. If the broker disconnects, restarts, or fails, the data still in PageCache will be lost because it has not yet been persisted to disk.
Stored data may get lost due to disk failures.
Solutions
Open-source Kafka uses multiple replicas to ensure data integrity. Data is lost only if multiple replicas and brokers fail at the same time, so reliability is much higher than in the single-replica case. Therefore, CKafka requires at least two replicas for a topic and supports configuring three replicas.
CKafka performs data flushing by configuring more reasonable parameters, such as log.flush.interval.messages and log.flush.interval.ms.
In CKafka, the disk is specially designed to ensure that data reliability will not be compromised even if the disk is partially damaged.
Recommended parameter values
Whether a replica that is not in sync status can be elected as a leader: unclean.leader.election.enable=false // Disabled
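For reference, the flush and leader-election parameters mentioned above correspond to the following broker-side settings in open-source Kafka. On CKafka they are managed on the server side, and the flush values shown here are illustrative only:
# Do not allow an out-of-sync replica to be elected as the leader
unclean.leader.election.enable=false
# Flush thresholds (illustrative values; tuned by CKafka on the server side)
log.flush.interval.messages=10000
log.flush.interval.ms=1000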
What should I do if data gets lost on the consumer?
Causes of data loss
The offset is committed before the data is actually consumed. If the consumer goes down during processing but the offset has already been updated, that data entry is skipped, and the consumer group has to reset the offset to retrieve it.
The consumption speed falls far behind the production speed while the message retention period is too short, so messages are deleted upon expiration before they are consumed.
Solutions
Configure the auto.commit.enable parameter (enable.auto.commit in the Java client) appropriately. When it is set to true, offsets are committed automatically. We recommend using the scheduled commit feature to avoid committing offsets too frequently; a consumer sketch follows this list.
Monitor the consumer and correctly adjust the data retention period. Monitor the consumption offset and the number of unconsumed messages, and configure an alarm to prevent messages from being deleted upon expiration due to slow consumption.
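Below is a rough sketch, assuming the Kafka Java consumer client; the broker address, group ID, topic name, and process() method are placeholders. Disabling auto commit and committing only after records have been processed avoids the commit-before-consumption scenario described above:
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "ckafka-xxx:9092");  // placeholder address
props.put("group.id", "my-group");                  // placeholder group ID
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("enable.auto.commit", "false");           // commit manually after processing

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("my-topic"));  // placeholder topic
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        process(record);  // your business logic
    }
    // Commit only after the polled records have been processed
    consumer.commitSync();
}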
Troubleshooting data loss
Printing partition and offset information locally for troubleshooting
Below is the code for printing information:
// Send the message and block until the broker acknowledges it; future.get() throws an exception if the send fails
Future<RecordMetadata> future = producer.send(new ProducerRecord<>(topic, messageKey, messageStr));
RecordMetadata recordMetadata = future.get();
log.info("partition: {}", recordMetadata.partition());
log.info("offset: {}", recordMetadata.offset());
If the partition and offset can be printed out, the message that was sent has been correctly saved on the server. At this point, you can use the message query tool to query the message at the relevant offset.
If the partition and offset information cannot be printed out, the message has not been saved on the server, and the client needs to retry.