Note
This feature is applicable to Doris 1.2.2 and later versions.
Users often set inappropriate buckets, causing various problems. Here, we provide a method to set the number of buckets automatically.
Implementation
Previously, it was necessary to manually set the number of buckets when you are creating them. The auto bucketing calculation feature allows Doris to dynamically calculate the number of buckets, always keeping the bucket count within a suitable range, so users no longer need to fret over the minutiae of bucket numbers. For easier explanation of this feature, this section will split buckets into two periods: initial and subsequent buckets. The terms 'initial' and 'subsequent' are used here merely for the sake of clarity in describing this feature; Doris buckets themselves do not have such a distinction. BUCKET_DESC is very simple, but it needs to specify the number of buckets. In the auto bucketing calculation feature, the syntax of BUCKET_DESC directly changes the number of buckets to "Auto", and adds a Properties configuration:
DISTRIBUTED BY HASH(site) BUCKETS 20
DISTRIBUTED BY HASH(site) BUCKETS AUTO properties("estimate_partition_size" = "100G")
The additional configuration parameter estimate_partition_size represents the data volume of a single partition. This parameter is optional, and if the value is not given, Doris will set the default value of estimate_partition_size as 10 GB. We already know from the above that a bucket is a Tablet at the physical level. For the best performance, it is recommended that the size of the Tablet be between 1 GB and 10 GB. So how does auto bucketing calculation ensure the Tablet size is within this range? The main principles are as follows:
If the total data volume is relatively small, do not set the number of buckets too high.
If the total data volume is large, the number of buckets should be related to the total number of disk blocks, fully utilizing the capacity of each BE machine and each disk.
Initial Bucket Calculation
Starting from these principles, it is easy to understand the detailed logic of the auto bucketing calculation feature. Let's first look at the initial bucketing:
1. Firstly, a bucket number N is derived from the data volume. The value of estimate_partition_size divided by 5 (as the data compression ratio is 5 to 1 when stored in Doris in text format) yields the following results:
(, 100 MB), then set N=1.
[100 MB, 1 GB), then set N=2.
[1 GB, ), then one bucket per GB.
2. Calculate the number of buckets M based on the number of BE nodes and each BE node's disk capacity. Take 1 per BE node, and take every 50 GB of disk capacity as 1. Thus, the calculation rule for M is: M = number of BE nodes × (disk block size/50 GB) × number of disk blocks. For example, if there are 3 BEs, each with four 500 GB disks, M = 3 × (500 GB / 50 GB) × 4 = 120.
3. Final bucket count calculation logic is: first calculate an intermediate value x = min(M, N, 128). If x < N and x < the number of BE nodes, the final number of buckets y equals to the number of BE nodes; otherwise, the number of final buckets is x.
The pseudocode representation of the above process is:
int N = calculate N value;
int M = calculate M value;
int y = BE node number;
int x = min(M, N, 128);
if (x < N && x < y) {
return y;
}
return x;
With the above calculations, some examples are further introduced to better understand this logic:
case1:
Data volume 100 MB, 10 BE machines, 2TB *3 disk blocks
Data volume N = 1
BE disk M = 10* (2 TB / 50 GB) * 3 = 1230
x = min(M, N, 128) = 1
Finally: 1
case2:
Data volume 1GB, 3 BE machines, 500GB *2 disk blocks
Data volume N = 2
BE disk M = 3* (500 GB/50 GB) * 2 = 60
x = min(M, N, 128) = 2
Finally: 2
case3:
Data volume 100GB, 3 BE machines, 500GB *2 disk blocks
Data volume N = 20
BE disk M = 3* (500 GB/50 GB) * 2 = 60
x = min(M, N, 128) = 20
Final: 20
case4:
Data volume 500GB, 3 BE machines, 1TB*1 disk
Data volume N = 100
BE disk M = 3* (1 TB /50 GB) * 1 = 60
x = min(M, N, 128) = 63
Final: 63
case5:
Data volume 500 GB, 10 BE machines, 2TB*3 disks
Data volume N = 100
BE disk M = 10* (2 TB / 50 GB) * 3 = 1230
x = min(M, N, 128) = 100
Final: 100
case 6:
Data volume 1TB, 10 BE machines, 2TB*3 disks
Data volume N = 205
BE disk M = 10* (2 TB / 50 GB) * 3 = 1230
x = min(M, N, 128) = 128
Final: 128
case 7:
Data volume 500GB, 1 BE machine, 100TB*1 disk
Data volume N = 100
BE disk M = 1* (100 TB / 50 GB) * 1 = 2048
x = min(M, N, 128) = 100
Final: 100
case 8:
Data volume 1 TB, 200 BE machines, 4 TB*7 disks
Data volume N = 205
BE disk M = 200* (4TB / 50GB) * 7 = 114800
x = min(M, N, 128) = 128
Final: 200
As you can see, detailed logic matches the principles.
Subsequent Bucketing Calculation
The above is about the calculation logic of initial bucketing, and the subsequent number of buckets can be calculated based on the existing partition data. The number of subsequent buckets will be critiqued according to the EMA (short-term exponential moving average) value of the data volume of the most recent 7 partitions, as the estimate_partition_size. There are two ways to calculate the buckets. Let's partition by day. If we count backwards from the first day, the size of the partition is S7, the second day preceding it is S6, and so forth until S1.
1. If the data for each partition within 7 days is strictly increasing daily, the trend value will be used at this point. There are 6 delta values that are:
S7 - S6 = delta1,
S6 - S5 = delta2,
...
S2 - S1 = delta6
From this we get ema(delta) value: So, today's estimate_partition_size = S7 + ema(delta)
2. In the case other than the first, the EMA of the last few days will be taken directly at this point:
Today's estimate_partition_size = EMA(S1, ..., S7)
According to the above calculations, the number of initial and subsequent buckets can be calculated. Unlike before, where a fixed number of buckets had to be set. The number of buckets in the previous partition may not be the same as the number of buckets in the following partition due to changes in business data. This is transparent for users who do not have to worry about how many buckets each partition has. This auto-calculated feature makes the number of buckets more reasonable.
Best Practice
Creating Table Using Auto-Bucketing
CREATE TABLE IF NOT EXISTS example_auto_bucket_tbl
(
user_id VARCHAR(128) NOT NULL COMMENT "User ID",
date DATE NOT NULL COMMENT "Data injection date and time",
data varchar(512) NOT NULL
)
ENGINE=OLAP
DUPLICATE KEY(user_id)
PARTITION BY RANGE(date)()
DISTRIBUTED BY HASH(user_id) BUCKETS AUTO
PROPERTIES
(
"replication_num" = "1",
"dynamic_partition.enable" = "true",
"dynamic_partition.time_unit" = "MONTH",
"dynamic_partition.start" = "-5",
"dynamic_partition.end" = "3",
"dynamic_partition.prefix" = "p_",
"dynamic_partition.create_history_partition" = "true",
"dynamic_partition.history_partition_num" = "3",
"estimate_partition_size" = "1G"
);
As you can see, after you create the table, the Doris system automatically sets the number of buckets for each partition to 2.
Next, we import a certain amount of data into the table, and then adjust the dynamic_partition.end parameter, allowing Doris to automatically create new partitions. From the bucketing situation of the new partition, we can see that the number of buckets in the new partition has increased to 7.
Disabling Auto Bucketing Feature
Temporary Disablement: Disabling Dynamic Partition
ALTER TABLE example_auto_bucket_tbl SET
(
"dynamic_partition.enable" = "false"
);
Permanent disablement: After you disable dynamic partition, adjusting the number of buckets of the table is disabled.
alter table example_auto_bucket_tbl modify DISTRIBUTION DISTRIBUTED BY HASH(user_id
) BUCKETS 16;
Note the effects of both operations:
After temporary disablement, if the dynamic partition feature is enabled again, the number of buckets in the newly created dynamic partition is a result calculated by Doris based on the data volume of the recent partitions.
After you disable the dynamic partition, if the number of buckets of the table is modified, it will not be possible to enable the auto bucketing feature in the future.
Description
It only takes effect on the partitions created by the dynamic partition.
Once auto bucketing is enabled, the schema seen in the 'show create table' is also BUCKETS AUTO. If you want to view the exact number of buckets, you can view through the 'show partitions from ${table};'.
Auto bucketing feature cannot be enabled on existing tables. The auto bucketing feature can only be enabled during table creation.
Auto bucketing will not affect any existing partitions. Auto bucketing is just used when you are creating a new partition in a dynamic partition, based on the data from the historical partitions, disk, and the number of Doris BE nodes. By using a specific algorithm, it determines the number of buckets of the new partition, but will not affect any existing partitions.
Was this page helpful?