Auto Bucketing

Tencent Cloud TCHouse-D

Product Introduction

Purchase Guide

Configuration Adjustment Billing Instructions

Getting Started

Using Tencent Cloud TCHouse-D Through the Console

Using Tencent Cloud TCHouse-D Through a Client

Operation Guide

Cluster Operation

Creating and Terminating

Scale-Out and Scale-In

Configuration Adjustment

Monitoring and Alarm Configuration

Cluster Monitoring

FE Monitoring Metric

BE Monitoring Metric

Business Monitoring Metrics

Alarm Configuration

Account Privilege Management

Privilege Management Through a Client

Data Management

Data Backup

Data Recovery

Cross-Cluster DataMigration

Database Auditing

Cold-Hot Data Tiering

Query Management

Query Analysis

Slow Query Analysis

Modify Configurations

Modify Configurations Through Console

Modify Configurations Through Client

Node Management

Log Analysis

SQL Studio

Enabling Resource Isolation

Development Guide

Design of Data Table

Data Table and Data Model

Data Partition and Bucket

Data Distribution and Replica

Index, Sorted Column and Prefix Index

Importing Data

Import Overview

Import Methods

Stream Load (local file)

Broker Load (HDFS Data)

S3 Load ( Cloud Object Storage, COS)

Spark Load

Routine Load (Kafka Data)

Flink Connector (Real-time or Batch Data with Flink)

INSERT INTO

Synchronizing Data with JDBC

Synchronizing Data Through External Tables

Real-time or Batch Writing of MySQL Data

Importing Data from Kafka

Importing with DataX

Importing Doris from Logstash

Data Transformation, Column Mapping and Filtering

Importing Strict Mode

Importing Data in JSON Format

Exporting Data

Exporting via the EXPORT Statement

Using Mysqldump Tool to Export Table Structure or Data

Basic Feature

Table Structure Change

Two Tables of Atomic Swap (Swap)

Query Optimization

Caching Table or Partition to Memory

Colocation Join

Bucket Shuffle Join

Runtime Filter

Query Profiling and Optimization

Query Cache Configuration

Ecological Expansion Feature

API Documentation

Making API Requests

Cluster Operation APIs

ModifyClusterConfigs

ModifyInstanceKeyValConfigs

ModifyNodeStatus

RestartClusterForConfigs

RestartClusterForNode

DescribeClusterConfigsHistory

ModifyInstance

DescribeCreateTablesDDL

DescribeInstanceOperationHistory

Database and Table APIs

ExecuteParametrizedQuery

Cluster Information Viewing APIs

DescribeClusterConfigs

DescribeAreaRegion

DescribeInstancesHealthState

DescribeSpec

DescribeInstance

DescribeInstanceState

DescribeInstances

DescribeInstanceNodesInfo

DescribeInstanceUsedSubnets

DescribeSqlApis

DescribeInstanceNodes

DescribeInstanceNodesRole

Hot-Cold Data Layering APIs

CheckCoolDownWorkingVariableConfigCorrect

DescribeCoolDownBackends

DescribeCoolDownPolicies

DescribeCoolDownTableData

Database and Operation Audit APIs

DescribeBackUpJobDetail

DescribeBackUpTables

DescribeBackUpTaskDetail

DescribeRestoreTaskDetail

DescribeBackUpSchedules

RecoverBackUpJob

DescribeInstanceOperations

DescribeDatabaseAuditDownload

DescribeDatabaseAuditRecords

DescribeSlowQueryRecords

DescribeSlowQueryRecordsDownload

DescribeQueryAnalyse

User and Permission APIs

ActionAlterUser

ModifyUserPrivilegesV3

DescribeUserPolicy

ModifyDatabaseTableAccess

Resource Group Management APIs

DescribeUserBindWorkloadGroup

CreateWorkloadGroup

DeleteWorkloadGroup

DescribeWorkloadGroup

ModifyWorkloadGroup

ModifyUserBindWorkloadGroup

ModifyWorkloadGroupStatus

Data Types

Error Codes

Cloud Ecosystem

Granting CAM Policies to Sub-accounts

Query Acceleration for Tencent Cloud DLC

Practical Tutorial

Basic Feature Usage

Advanced Features Usage

Resource Specification Selection and Optimization Suggestions

Naming Specifications and Limits to the Database and Data Table

Table Design and Data Import

Query Optimization

Suggested Usage to Avoid

Accessing TCHouse-D via JDBC over the Public Network

Performance Testing

TPC-H Performance Testing

Test Scheme Introduction

Test Results for Reference

SSB Performance Testing

Test Scheme Introduction

Test Results for Reference

TPC-DS 性能测试

Testing Scheme Introduction

Test Results for Reference

FAQs

Common Operational Issues

Common Errors

Glossary

Product Policy

Service Level Agreement

Data Processing And Security Agreement

DocumentationTencent Cloud TCHouse-DDevelopment GuideBasic FeatureAuto Bucketing

Auto Bucketing

Download PDF

Last updated: 2024-06-27 11:09:39

Auto Bucketing

Last updated: 2024-06-27 11:09:39

Download PDF

Note
This feature is applicable to Doris 1.2.2 and later versions.
Users often set inappropriate buckets, causing various problems. Here, we provide a method to set the number of buckets automatically.
Implementation
Previously, it was necessary to manually set the number of buckets when you are creating them. The auto bucketing calculation feature allows Doris to dynamically calculate the number of buckets, always keeping the bucket count within a suitable range, so users no longer need to fret over the minutiae of bucket numbers. For easier explanation of this feature, this section will split buckets into two periods: initial and subsequent buckets. The terms 'initial' and 'subsequent' are used here merely for the sake of clarity in describing this feature; Doris buckets themselves do not have such a distinction. BUCKET_DESC is very simple, but it needs to specify the number of buckets. In the auto bucketing calculation feature, the syntax of BUCKET_DESC directly changes the number of buckets to "Auto", and adds a Properties configuration:
-- Create syntax specifying the number of buckets in the old version
DISTRIBUTED BY HASH(site) BUCKETS 20
﻿
-- Create syntax using auto bucketing calculation in the new version
DISTRIBUTED BY HASH(site) BUCKETS AUTO properties("estimate_partition_size" = "100G")
The additional configuration parameter estimate_partition_size represents the data volume of a single partition. This parameter is optional, and if the value is not given, Doris will set the default value of estimate_partition_size as 10 GB. We already know from the above that a bucket is a Tablet at the physical level. For the best performance, it is recommended that the size of the Tablet be between 1 GB and 10 GB. So how does auto bucketing calculation ensure the Tablet size is within this range? The main principles are as follows:
If the total data volume is relatively small, do not set the number of buckets too high.
If the total data volume is large, the number of buckets should be related to the total number of disk blocks, fully utilizing the capacity of each BE machine and each disk.
Initial Bucket Calculation
Starting from these principles, it is easy to understand the detailed logic of the auto bucketing calculation feature. Let's first look at the initial bucketing:
1. Firstly, a bucket number N is derived from the data volume. The value of estimate_partition_size divided by 5 (as the data compression ratio is 5 to 1 when stored in Doris in text format) yields the following results:
(, 100 MB), then set N=1.
[100 MB, 1 GB), then set N=2.
[1 GB, ), then one bucket per GB.
2. Calculate the number of buckets M based on the number of BE nodes and each BE node's disk capacity. Take 1 per BE node, and take every 50 GB of disk capacity as 1. Thus, the calculation rule for M is: M = number of BE nodes × (disk block size/50 GB) × number of disk blocks. For example, if there are 3 BEs, each with four 500 GB disks, M = 3 × (500 GB / 50 GB) × 4 = 120.
3. Final bucket count calculation logic is: first calculate an intermediate value x = min(M, N, 128). If x < N and x < the number of BE nodes, the final number of buckets y equals to the number of BE nodes; otherwise, the number of final buckets is x.
The pseudocode representation of the above process is:
int N = calculate N value;
int M = calculate M value;
﻿
int y = BE node number;
int x = min(M, N, 128);
﻿
if (x < N && x < y) {  
    return y;
}
return x;
With the above calculations, some examples are further introduced to better understand this logic:
case1:
Data volume 100 MB, 10 BE machines, 2TB *3 disk blocks
Data volume N = 1
BE disk M = 10* (2 TB / 50 GB) * 3 = 1230
x = min(M, N, 128) =  1
Finally: 1
﻿
case2:
Data volume 1GB, 3 BE machines, 500GB *2 disk blocks
Data volume N = 2
BE disk M = 3* (500 GB/50 GB) * 2 = 60
x = min(M, N, 128) =  2
Finally: 2
﻿
case3:
Data volume 100GB, 3 BE machines, 500GB *2 disk blocks
Data volume N = 20
BE disk M = 3* (500 GB/50 GB) * 2 = 60
x = min(M, N, 128) =  20
Final: 20
﻿
case4:
Data volume 500GB, 3 BE machines, 1TB*1 disk
Data volume N = 100
BE disk M = 3* (1 TB /50 GB) * 1 = 60
x = min(M, N, 128) =  63
Final: 63
﻿
case5:
Data volume 500 GB, 10 BE machines, 2TB*3 disks
Data volume N = 100
BE disk M = 10* (2 TB / 50 GB) * 3 = 1230
x = min(M, N, 128) =  100
Final: 100
﻿
case 6:
Data volume 1TB, 10 BE machines, 2TB*3 disks
Data volume N = 205
BE disk M = 10* (2 TB / 50 GB) * 3 = 1230
x = min(M, N, 128) =  128
Final: 128
﻿
case 7:
Data volume 500GB, 1 BE machine, 100TB*1 disk
Data volume N = 100
BE disk M = 1* (100 TB / 50 GB) * 1 = 2048
x = min(M, N, 128) =  100
Final: 100
﻿
case 8:
Data volume 1 TB, 200 BE machines, 4 TB*7 disks
Data volume N = 205
BE disk M = 200* (4TB / 50GB) * 7 = 114800
x = min(M, N, 128) =  128
Final: 200
As you can see, detailed logic matches the principles.
Subsequent Bucketing Calculation
The above is about the calculation logic of initial bucketing, and the subsequent number of buckets can be calculated based on the existing partition data. The number of subsequent buckets will be critiqued according to the EMA (short-term exponential moving average) value of the data volume of the most recent 7 partitions, as the estimate_partition_size. There are two ways to calculate the buckets. Let's partition by day. If we count backwards from the first day, the size of the partition is S7, the second day preceding it is S6, and so forth until S1.
1. If the data for each partition within 7 days is strictly increasing daily, the trend value will be used at this point. There are 6 delta values that are:
S7 - S6 = delta1,
S6 - S5 = delta2,
...
S2 - S1 = delta6
From this we get ema(delta) value: So, today's estimate_partition_size = S7 + ema(delta)
2. In the case other than the first, the EMA of the last few days will be taken directly at this point:
  Today's estimate_partition_size = EMA(S1, ..., S7)
According to the above calculations, the number of initial and subsequent buckets can be calculated. Unlike before, where a fixed number of buckets had to be set. The number of buckets in the previous partition may not be the same as the number of buckets in the following partition due to changes in business data. This is transparent for users who do not have to worry about how many buckets each partition has. This auto-calculated feature makes the number of buckets more reasonable.
Best Practice
Creating Table Using Auto-Bucketing
CREATE TABLE IF NOT EXISTS example_auto_bucket_tbl
(
    user_id VARCHAR(128) NOT NULL COMMENT "User ID",
    date DATE NOT NULL COMMENT "Data injection date and time",
    data varchar(512) NOT NULL
)
ENGINE=OLAP
DUPLICATE KEY(user_id)
PARTITION BY RANGE(date)()
DISTRIBUTED BY HASH(user_id) BUCKETS AUTO
PROPERTIES
(
    "replication_num" = "1",
    # Enable dynamic partition 
    "dynamic_partition.enable" = "true",
    # Dynamic partition by month 
    "dynamic_partition.time_unit" = "MONTH",
    # Keep 5 past partitions 
    "dynamic_partition.start" = "-5",
    # Create 10 future partitions in advance 
    "dynamic_partition.end" = "3",
    # Prefix for partition naming
    "dynamic_partition.prefix" = "p_",
    # Enable the creation of historical partitions 
    "dynamic_partition.create_history_partition" = "true",
    # Create 3 historical partitions 
    "dynamic_partition.history_partition_num" = "3",
    # Estimated data volume of the partition
    "estimate_partition_size" = "1G"
);
As you can see, after you create the table, the Doris system automatically sets the number of buckets for each partition to 2.
﻿
﻿
﻿
﻿
Next, we import a certain amount of data into the table, and then adjust the dynamic_partition.end parameter, allowing Doris to automatically create new partitions. From the bucketing situation of the new partition, we can see that the number of buckets in the new partition has increased to 7.
﻿
﻿
﻿
Disabling Auto Bucketing Feature
Temporary Disablement: Disabling Dynamic Partition
ALTER TABLE example_auto_bucket_tbl SET
(
  # Disabling Dynamic Partition
    "dynamic_partition.enable" = "false"
);
Permanent disablement: After you disable dynamic partition, adjusting the number of buckets of the table is disabled.
# After you disable the dynamic partition, adjust the number of buckets of the table.
alter table example_auto_bucket_tbl modify DISTRIBUTION DISTRIBUTED BY HASH(user_id) BUCKETS 16;
Note the effects of both operations:
After temporary disablement, if the dynamic partition feature is enabled again, the number of buckets in the newly created dynamic partition is a result calculated by Doris based on the data volume of the recent partitions.
After you disable the dynamic partition, if the number of buckets of the table is modified, it will not be possible to enable the auto bucketing feature in the future.
Description
It only takes effect on the partitions created by the dynamic partition.
Once auto bucketing is enabled, the schema seen in the 'show create table' is also BUCKETS AUTO. If you want to view the exact number of buckets, you can view through the 'show partitions from ${table};'.
Auto bucketing feature cannot be enabled on existing tables. The auto bucketing feature can only be enabled during table creation.
Auto bucketing will not affect any existing partitions. Auto bucketing is just used when you are creating a new partition in a dynamic partition, based on the data from the historical partitions, disk, and the number of Doris BE nodes. By using a specific algorithm, it determines the number of buckets of the new partition, but will not affect any existing partitions.

Was this page helpful?

You can also Contact Sales or Submit a Ticket for help.

Yes

tencent cloud

New User Offers

Next-Generation CDN：EdgeOne

Elasticsearch Service Special Offers

Free Tier

Tencent Cloud Startup Program

Special Offers

Lighthouse Special Offers

Cloud Object Storage Special Offers

Featured Products

New Products

Education

Tencent Cloud Online Education Solutions

Gaming

Gaming Solution

Game Media Solutions

Financial Services

Financial Services Solution

Audio & Video

Audio/Video Solution

LVB Recording Solution

Interactive Classroom Solution

Interactive Live Streaming Solution

Audio Chat Social Networking Solution

Real Estate

Tencent Cloud LinkBase(Weiling)

E-commerce

E-commerce retail solutions

Compute

Cloud Virtual Machine

Auto Scaling

Batch Compute

CVM Dedicated Host

Database

TencentDB for MySQL

TencentDB for Redis®

TencentDB for CTSDB

TDSQL for MySQL

Data Transfer Service

TencentDB for MongoDB

TencentDB for PostgreSQL

TencentDB for SQL Server

TencentDB for TcaplusDB

Video Service

Cloud Streaming Services

Video on Demand

Media Processing Service

Cloud Application Rendering

Cloud Contact Center

Game Multimedia Engine

Chat

Real-time Communication

Tencent Effect SDK

AI and Machine Learning

Image Creation Large Model

Face Fusion

eKYC

Optical Character Recognition

Video Creation Large Model

Industry Applications

Tencent HealthCare Omics Platform

Container and Middleware

TDMQ for CKafka

Serverless Cloud Function

Tencent Kubernetes Engine

Tencent Kubernetes Engine for Serverless

Networking

Cloud Load Balancer

Virtual Private Cloud

Direct Connect

Cloud Connect Network

NAT Gateway

VPN Connection

Bandwidth Package

Anycast Internet Acceleration

Elastic Network Interface

Flow Logs

Global Application Acceleration Platform

Security

Captcha