Overview
Data compression can reduce network I/O traffic and disk usage. This document describes the message formats supported for data compression and how to configure data compression based on your needs.
Message Format
Currently, CKafka supports two message format versions: v1 and v2 (introduced in Kafka 0.11.0.0). CKafka is compatible with Kafka 0.9, 0.10, 1.1, 2.4, and 2.8.
Different versions require different configurations, as described below:
Message format conversion exists mainly for compatibility with consumer programs on legacy versions. A CKafka cluster usually holds messages in multiple format versions (v1 and v2).
To convert messages from the new format to the legacy format, the broker decompresses and recompresses them.
Message format conversion greatly affects performance because it requires extra compression and decompression operations. It also prevents CKafka from using its zero-copy feature. Therefore, use the same message format throughout.
Zero-copy: This feature avoids costly data copies between kernel space and user space when data is transferred between disks and the network, which enables fast data transfer.
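To make the zero-copy idea concrete, the following is a minimal sketch using the standard Java FileChannel.transferTo call, which asks the operating system to move bytes from a file to another channel without staging them in an application buffer where the platform supports it. The class name ZeroCopyTransfer and the method transfer are hypothetical names chosen for illustration, and the sketch assumes a blocking target channel; it is not CKafka's actual broker code.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for illustration only; not part of CKafka or Kafka.
final class ZeroCopyTransfer {

    /**
     * Sends the whole file at `source` to `target` with FileChannel.transferTo.
     * No byte[] buffer is allocated in this code; the OS may move the data
     * directly (for example via sendfile) when the platform supports it.
     * Assumes `target` is a blocking channel. Returns the bytes transferred.
     */
    static long transfer(Path source, WritableByteChannel target) throws IOException {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ)) {
            long size = in.size();
            long position = 0;
            // transferTo may move fewer bytes than requested, so loop until done.
            while (position < size) {
                position += in.transferTo(position, size - position, target);
            }
            return position;
        }
    }
}
```

When a message has to be decompressed and recompressed for format conversion, its bytes must be loaded into application memory, so a direct transfer of this kind is no longer possible.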
Compression Algorithm Comparison
Snappy is the officially recommended compression algorithm. The analysis behind this recommendation is as follows:
The performance of a compression algorithm is evaluated mainly based on two metrics: compression ratio and compression/decompression throughput.
Versions earlier than CKafka 2.1.0 support three compression algorithms: Gzip, Snappy, and LZ4.
In actual use with CKafka, the three algorithms compare as follows on these performance metrics:
Compression ratio: LZ4 > Gzip > Snappy
Throughput: LZ4 > Snappy > Gzip
Their physical resource usage compares as follows:
Bandwidth: As Snappy has the lowest compression ratio, its network bandwidth usage is the highest.
CPU: CPU usage is similar for each compression algorithm. Snappy uses more CPU resources during compression, while Gzip uses more CPU resources during decompression.
Therefore, the recommended order of the three compression algorithms under normal circumstances is LZ4 > Gzip > Snappy.
This recommended order has been well tested in most cases in the production environment. However, in extreme cases, LZ4 will increase the CPU load.
The analysis shows that LZ4 performs differently depending on the source data. Therefore, we recommend that you use the more stable Snappy compression algorithm if you are more concerned about the CPU usage.
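Because results depend on the source data, it can help to measure both metrics on a sample of your own messages. The sketch below computes compression ratio and compression throughput using only the JDK's built-in GZIPOutputStream; Snappy and LZ4 need third-party libraries and are not shown. The class CompressionMetrics, its method names, the sample payload, and the definition of ratio as original size divided by compressed size are all assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical measurement helper; names are illustrative, not a CKafka API.
final class CompressionMetrics {

    /** Compresses the input with the JDK's Gzip implementation. */
    static byte[] gzip(byte[] input) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(input);
        }
        return buffer.toByteArray();
    }

    /** Assumed definition: original size / compressed size (higher is better). */
    static double ratio(int originalBytes, int compressedBytes) {
        return (double) originalBytes / compressedBytes;
    }

    /** Throughput in MB/s (1 MB = 1,000,000 bytes) for the given elapsed time. */
    static double throughputMBps(long bytes, long elapsedNanos) {
        return (bytes / 1_000_000.0) / (elapsedNanos / 1_000_000_000.0);
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample payload; replace it with real messages from your workload.
        StringBuilder sample = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            sample.append("{\"user\":\"demo\",\"action\":\"click\"}\n");
        }
        byte[] payload = sample.toString().getBytes(StandardCharsets.UTF_8);

        long start = System.nanoTime();
        byte[] compressed = gzip(payload);
        long elapsed = System.nanoTime() - start;

        System.out.printf("ratio=%.1f, throughput=%.1f MB/s%n",
                ratio(payload.length, compressed.length),
                throughputMBps(payload.length, elapsed));
    }
}
```

A single timed run like this is only a rough indication; repeated runs on representative data give more reliable numbers.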
Configuring Data Compression
A producer can use the following method to configure data compression:
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("acks", "all");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
// After the producer is started, all its produced message sets will be compressed,
// which can greatly reduce the network transmission bandwidth and disk usage of the Kafka broker.
// Note that different versions have different configurations. Currently, versions 0.9 or earlier
// do not support compression. Versions 0.10 or later do not support Gzip compression.
// The codec name is written without surrounding spaces.
props.put("compression.type", "lz4");
Producer<String, String> producer = new KafkaProducer<>(props);
In most cases, after receiving a message from the producer, the broker will retain it as-is without making any modifications.
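One way to catch a mistyped or unsupported codec before the producer starts is to validate the value in application code. The sketch below is a hypothetical helper (CompressionConfig and withCompression are illustrative names, not a Kafka or CKafka API). It assumes that only snappy and lz4 should be accepted, since this document states that Gzip is not supported by default; adjust the allowed set if Gzip has been enabled for your instance.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Properties;
import java.util.Set;

// Hypothetical helper for illustration; not part of the Kafka client library.
final class CompressionConfig {

    // Assumption: codecs usable without extra steps, per this document.
    private static final Set<String> ALLOWED =
            Collections.unmodifiableSet(new HashSet<>(Arrays.asList("snappy", "lz4")));

    /**
     * Returns a copy of `base` with "compression.type" set to the normalized
     * codec name. Throws IllegalArgumentException for codecs outside ALLOWED.
     */
    static Properties withCompression(Properties base, String codec) {
        String normalized = codec == null ? "" : codec.trim().toLowerCase(Locale.ROOT);
        if (!ALLOWED.contains(normalized)) {
            throw new IllegalArgumentException("Unsupported compression codec: " + codec);
        }
        Properties props = new Properties();
        props.putAll(base);                        // leave the caller's object untouched
        props.put("compression.type", normalized);
        return props;
    }
}
```

The returned Properties object can then be passed to the KafkaProducer constructor in the same way as props in the producer configuration.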
Note
When data is sent to CKafka, compression.codec cannot be set.
Gzip compression is not supported by default. To use it, submit a ticket.
As Gzip compression causes high CPU consumption, all messages will become InValid if it is used.
If the program cannot run properly when LZ4 compression is used, possible causes include:
The message format is incorrect. The default message version of CKafka is v0.10.2. You need to use the message format v1.
The way to set the message format version varies by client SDK. You can look up the setting method in the open-source community (such as the description for the C/C++ client).