tencent cloud

Feedback

Auto Bucketing

Last updated: 2024-06-27 11:09:39
    Note
    This feature is applicable to Doris 1.2.2 and later versions.
    Users often set inappropriate buckets, causing various problems. Here, we provide a method to set the number of buckets automatically.

    Implementation

    Previously, it was necessary to manually set the number of buckets when you are creating them. The auto bucketing calculation feature allows Doris to dynamically calculate the number of buckets, always keeping the bucket count within a suitable range, so users no longer need to fret over the minutiae of bucket numbers. For easier explanation of this feature, this section will split buckets into two periods: initial and subsequent buckets. The terms 'initial' and 'subsequent' are used here merely for the sake of clarity in describing this feature; Doris buckets themselves do not have such a distinction. BUCKET_DESC is very simple, but it needs to specify the number of buckets. In the auto bucketing calculation feature, the syntax of BUCKET_DESC directly changes the number of buckets to "Auto", and adds a Properties configuration:
    -- Create syntax specifying the number of buckets in the old version
    DISTRIBUTED BY HASH(site) BUCKETS 20
    
    -- Create syntax using auto bucketing calculation in the new version
    DISTRIBUTED BY HASH(site) BUCKETS AUTO properties("estimate_partition_size" = "100G")
    The additional configuration parameter estimate_partition_size represents the data volume of a single partition. This parameter is optional, and if the value is not given, Doris will set the default value of estimate_partition_size as 10 GB. We already know from the above that a bucket is a Tablet at the physical level. For the best performance, it is recommended that the size of the Tablet be between 1 GB and 10 GB. So how does auto bucketing calculation ensure the Tablet size is within this range? The main principles are as follows:
    If the total data volume is relatively small, do not set the number of buckets too high.
    If the total data volume is large, the number of buckets should be related to the total number of disk blocks, fully utilizing the capacity of each BE machine and each disk.

    Initial Bucket Calculation

    Starting from these principles, it is easy to understand the detailed logic of the auto bucketing calculation feature. Let's first look at the initial bucketing:
    1. Firstly, a bucket number N is derived from the data volume. The value of estimate_partition_size divided by 5 (as the data compression ratio is 5 to 1 when stored in Doris in text format) yields the following results:
    (, 100 MB), then set N=1.
    [100 MB, 1 GB), then set N=2.
    [1 GB, ), then one bucket per GB.
    2. Calculate the number of buckets M based on the number of BE nodes and each BE node's disk capacity. Take 1 per BE node, and take every 50 GB of disk capacity as 1. Thus, the calculation rule for M is: M = number of BE nodes × (disk block size/50 GB) × number of disk blocks. For example, if there are 3 BEs, each with four 500 GB disks, M = 3 × (500 GB / 50 GB) × 4 = 120.
    3. Final bucket count calculation logic is: first calculate an intermediate value x = min(M, N, 128). If x < N and x < the number of BE nodes, the final number of buckets y equals to the number of BE nodes; otherwise, the number of final buckets is x.
    The pseudocode representation of the above process is:
    int N = calculate N value;
    int M = calculate M value;
    
    int y = BE node number;
    int x = min(M, N, 128);
    
    if (x < N && x < y) {
    return y;
    }
    return x;
    With the above calculations, some examples are further introduced to better understand this logic:
    case1:
    Data volume 100 MB, 10 BE machines, 2TB *3 disk blocks
    Data volume N = 1
    BE disk M = 10* (2 TB / 50 GB) * 3 = 1230
    x = min(M, N, 128) = 1
    Finally: 1
    
    case2:
    Data volume 1GB, 3 BE machines, 500GB *2 disk blocks
    Data volume N = 2
    BE disk M = 3* (500 GB/50 GB) * 2 = 60
    x = min(M, N, 128) = 2
    Finally: 2
    
    case3:
    Data volume 100GB, 3 BE machines, 500GB *2 disk blocks
    Data volume N = 20
    BE disk M = 3* (500 GB/50 GB) * 2 = 60
    x = min(M, N, 128) = 20
    Final: 20
    
    case4:
    Data volume 500GB, 3 BE machines, 1TB*1 disk
    Data volume N = 100
    BE disk M = 3* (1 TB /50 GB) * 1 = 60
    x = min(M, N, 128) = 63
    Final: 63
    
    case5:
    Data volume 500 GB, 10 BE machines, 2TB*3 disks
    Data volume N = 100
    BE disk M = 10* (2 TB / 50 GB) * 3 = 1230
    x = min(M, N, 128) = 100
    Final: 100
    
    case 6:
    Data volume 1TB, 10 BE machines, 2TB*3 disks
    Data volume N = 205
    BE disk M = 10* (2 TB / 50 GB) * 3 = 1230
    x = min(M, N, 128) = 128
    Final: 128
    
    case 7:
    Data volume 500GB, 1 BE machine, 100TB*1 disk
    Data volume N = 100
    BE disk M = 1* (100 TB / 50 GB) * 1 = 2048
    x = min(M, N, 128) = 100
    Final: 100
    
    case 8:
    Data volume 1 TB, 200 BE machines, 4 TB*7 disks
    Data volume N = 205
    BE disk M = 200* (4TB / 50GB) * 7 = 114800
    x = min(M, N, 128) = 128
    Final: 200
    As you can see, detailed logic matches the principles.

    Subsequent Bucketing Calculation

    The above is about the calculation logic of initial bucketing, and the subsequent number of buckets can be calculated based on the existing partition data. The number of subsequent buckets will be critiqued according to the EMA (short-term exponential moving average) value of the data volume of the most recent 7 partitions, as the estimate_partition_size. There are two ways to calculate the buckets. Let's partition by day. If we count backwards from the first day, the size of the partition is S7, the second day preceding it is S6, and so forth until S1.
    1. If the data for each partition within 7 days is strictly increasing daily, the trend value will be used at this point. There are 6 delta values that are:
    S7 - S6 = delta1,
    S6 - S5 = delta2,
    ...
    S2 - S1 = delta6
    From this we get ema(delta) value: So, today's estimate_partition_size = S7 + ema(delta)
    2. In the case other than the first, the EMA of the last few days will be taken directly at this point:
    Today's estimate_partition_size = EMA(S1, ..., S7)
    According to the above calculations, the number of initial and subsequent buckets can be calculated. Unlike before, where a fixed number of buckets had to be set. The number of buckets in the previous partition may not be the same as the number of buckets in the following partition due to changes in business data. This is transparent for users who do not have to worry about how many buckets each partition has. This auto-calculated feature makes the number of buckets more reasonable.

    Best Practice

    Creating Table Using Auto-Bucketing

    CREATE TABLE IF NOT EXISTS example_auto_bucket_tbl ( user_id VARCHAR(128) NOT NULL COMMENT "User ID", date DATE NOT NULL COMMENT "Data injection date and time", data varchar(512) NOT NULL ) ENGINE=OLAP DUPLICATE KEY(user_id) PARTITION BY RANGE(date)() DISTRIBUTED BY HASH(user_id) BUCKETS AUTO PROPERTIES ( "replication_num" = "1", # Enable dynamic partition "dynamic_partition.enable" = "true", # Dynamic partition by month "dynamic_partition.time_unit" = "MONTH", # Keep 5 past partitions "dynamic_partition.start" = "-5", # Create 10 future partitions in advance "dynamic_partition.end" = "3", # Prefix for partition naming "dynamic_partition.prefix" = "p_", # Enable the creation of historical partitions "dynamic_partition.create_history_partition" = "true", # Create 3 historical partitions "dynamic_partition.history_partition_num" = "3", # Estimated data volume of the partition "estimate_partition_size" = "1G" );
    As you can see, after you create the table, the Doris system automatically sets the number of buckets for each partition to 2.
    enter image description here
    
    
    enter image description here
    
    
    Next, we import a certain amount of data into the table, and then adjust the dynamic_partition.end parameter, allowing Doris to automatically create new partitions. From the bucketing situation of the new partition, we can see that the number of buckets in the new partition has increased to 7.
    
    enter image description here
    
    

    Disabling Auto Bucketing Feature

    Temporary Disablement: Disabling Dynamic Partition

    ALTER TABLE example_auto_bucket_tbl SET
    (
    # Disabling Dynamic Partition
    "dynamic_partition.enable" = "false"
    );

    Permanent disablement: After you disable dynamic partition, adjusting the number of buckets of the table is disabled.

    # After you disable the dynamic partition, adjust the number of buckets of the table.
    alter table example_auto_bucket_tbl modify DISTRIBUTION DISTRIBUTED BY HASH(user_id) BUCKETS 16;
    Note the effects of both operations:
    After temporary disablement, if the dynamic partition feature is enabled again, the number of buckets in the newly created dynamic partition is a result calculated by Doris based on the data volume of the recent partitions.
    After you disable the dynamic partition, if the number of buckets of the table is modified, it will not be possible to enable the auto bucketing feature in the future.

    Description

    It only takes effect on the partitions created by the dynamic partition.
    Once auto bucketing is enabled, the schema seen in the 'show create table' is also BUCKETS AUTO. If you want to view the exact number of buckets, you can view through the 'show partitions from ${table};'.
    Auto bucketing feature cannot be enabled on existing tables. The auto bucketing feature can only be enabled during table creation.
    Auto bucketing will not affect any existing partitions. Auto bucketing is just used when you are creating a new partition in a dynamic partition, based on the data from the historical partitions, disk, and the number of Doris BE nodes. By using a specific algorithm, it determines the number of buckets of the new partition, but will not affect any existing partitions.
    Contact Us

    Contact our sales team or business advisors to help your business.

    Technical Support

    Open a ticket if you're looking for further assistance. Our Ticket is 7x24 avaliable.

    7x24 Phone Support