Overview
Data compression reduces network I/O volume and disk usage. This document describes the message formats that support data compression and how to configure compression based on your requirements.
Message Format
CKafka currently supports two message formats: V1 and V2 (introduced in Apache Kafka 0.11.0.0). CKafka is compatible with versions 0.9, 0.10, 1.1, 2.4, 2.8, and 3.2.
Different versions correspond to different configurations, as follows:
Message format conversion exists mainly for compatibility with older consumer programs; a CKafka cluster usually stores multiple message formats (V1/V2) at the same time.
The broker converts new-format messages to the legacy format, a process that involves decompressing and recompressing the messages.
Message format conversion has a significant performance impact: besides the extra compression and decompression work, it also costs CKafka its zero-copy optimization. Therefore, it is essential to keep message formats unified.
Zero-copy: when data is transmitted between disk and network, expensive copies between kernel space and user space are avoided, enabling fast data transfer.
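The zero-copy path described above is what Kafka brokers use when serving unconverted messages. In Java it is exposed through `FileChannel.transferTo`, which delegates to the OS's `sendfile`-style mechanism where available. The sketch below (a standalone illustration, not CKafka code) copies a file channel-to-channel without staging the bytes in user space:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopyDemo {
    // transferTo lets the kernel move bytes directly from the source channel
    // to the destination, avoiding a round trip through a user-space buffer.
    public static long copy(Path src, Path dst) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            long transferred = 0;
            long size = in.size();
            // transferTo may move fewer bytes than requested, so loop until done.
            while (transferred < size) {
                transferred += in.transferTo(transferred, size - transferred, out);
            }
            return transferred;
        }
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("zc-src", ".bin");
        Path dst = Files.createTempFile("zc-dst", ".bin");
        Files.write(src, "hello zero-copy".getBytes());
        System.out.println(copy(src, dst)); // prints 15 (bytes transferred)
        Files.deleteIfExists(src);
        Files.deleteIfExists(dst);
    }
}
```

Format conversion breaks this path because the broker must pull the data into user space to decompress and recompress it, which is why keeping formats unified matters for throughput.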
Comparison of Compression Algorithms
The officially recommended compression algorithm is Snappy. The analysis is as follows:
The pros and cons of a compression algorithm are evaluated mainly by two metrics: compression ratio and compression/decompression throughput.
CKafka versions prior to 2.1.0 support three compression algorithms: GZIP, Snappy, and LZ4.
In actual use of CKafka, the performance of the three algorithms compares as follows:
Compression ratio: LZ4 > GZIP > Snappy
Throughput: LZ4 > Snappy > GZIP
Physical resource usage is as follows:
Bandwidth: Snappy has the lowest compression ratio, so it uses the most network bandwidth.
CPU: Snappy uses more CPU during compression, while GZIP uses more CPU during decompression.
Therefore, under normal circumstances, the recommended order of the three compression algorithms is: LZ4 > GZIP > Snappy.
Long-term testing in production networks shows that this model holds in most cases. However, in some extreme cases the LZ4 algorithm causes increased CPU load.
Analysis shows that compression performance varies with the business's source data. Therefore, users who are sensitive to CPU usage are advised to adopt the more stable Snappy algorithm.
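The two metrics above (compression ratio and throughput) can be measured directly against your own payloads. A minimal sketch for compression ratio is shown below, using GZIP because it ships with the JDK; Snappy and LZ4 require third-party libraries, but the measurement approach is the same:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class RatioDemo {
    // Compression ratio = original size / compressed size (higher is better).
    public static double gzipRatio(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return (double) data.length / bos.size();
    }

    public static void main(String[] args) throws IOException {
        // Highly repetitive data compresses very well; random or already-compressed
        // data may yield a ratio near (or below) 1, which is why results depend
        // on the business's source data.
        byte[] repetitive = "the same message payload ".repeat(1000).getBytes();
        System.out.println(gzipRatio(repetitive) > 1.0); // prints true
    }
}
```

Running such a measurement on representative production payloads, rather than relying on general rankings, is the safest way to choose between Snappy and LZ4 for a CPU-sensitive workload.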
Notes:
CKafka does not recommend using the Gzip compression algorithm. Enabling Gzip consumes additional CPU on the CKafka server. Based on performance testing data, if you enable Gzip, it is recommended to reserve about 75% of bandwidth as a buffer (this ratio is for reference only; in actual use, judge based on your specific monitoring data).
For example, for an instance with a bandwidth of 40 MB/s, after enabling Gzip compression, it is recommended to increase the bandwidth to 40 / (1 - 75%) = 160 MB/s.
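The headroom arithmetic above can be expressed as a small helper. This is only a sketch of the example calculation; the 75% reserve ratio is the reference value from the note, not a fixed constant:

```java
public class GzipBandwidth {
    // Bandwidth needed so that the current usage fits within (1 - reserveRatio)
    // of the instance's capacity.
    public static double requiredBandwidth(double currentMBps, double reserveRatio) {
        return currentMBps / (1 - reserveRatio);
    }

    public static void main(String[] args) {
        // The document's example: 40 MB/s with a 75% reserve buffer.
        System.out.println(requiredBandwidth(40, 0.75)); // prints 160.0
    }
}
```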
Configure Data Compression
Producers can configure data compression as follows:
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("acks", "all");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
// After the producer starts, every message batch it produces is compressed, which greatly
// reduces network transmission bandwidth and disk usage on the Kafka broker.
// Note that different versions support different configurations: compression cannot be used
// in version 0.9 and earlier, and the Gzip format is not supported by default in version 1.1 and earlier.
props.put("compression.type", "lz4");
Producer<String, String> producer = new KafkaProducer<>(props);
In most cases, after receiving a message from the producer, the broker stores it as-is without any modification.
Notes and Cautions
When sending data to CKafka, do not set compression.codec.
Version 1.1 and earlier do not support the Gzip compression format by default; if you need it, submit a ticket to apply. Gzip compression consumes a lot of CPU, and in these versions using Gzip causes all messages to be treated as invalid. Gzip compression is not recommended for CKafka.
After Gzip is enabled, CPU usage becomes high and can become the bottleneck for bandwidth usage. If Gzip is enabled, it is recommended to increase the producer's linger.ms and batch.size settings.
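The tuning above batches more records per compression call so Gzip's CPU cost is amortized. A sketch of such a producer configuration is below; the specific values are illustrative assumptions, not tested recommendations, so adjust them against your own monitoring data:

```java
import java.util.Properties;

public class GzipTuning {
    // Producer properties for a Gzip workload. Larger batches amortize Gzip's
    // per-batch CPU cost; a nonzero linger.ms lets batches fill before sending.
    public static Properties gzipProducerProps() {
        Properties props = new Properties();
        props.put("compression.type", "gzip");
        props.put("batch.size", "65536"); // example value; the Kafka default is 16384 bytes
        props.put("linger.ms", "50");     // example value; the Kafka default is 0 ms
        return props;
    }
}
```

The trade-off of raising linger.ms is added end-to-end latency: the producer waits up to that long for a batch to fill before sending it.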
If the program cannot run normally when using LZ4 compression, a possible cause is an incorrect message format. Check the CKafka version and confirm that the message format in use is correct.
The SDK configuration methods of different CKafka clients vary. You can consult the open-source community documentation (for example, the instructions for the C/C++ client) to set the message format version.