Create a table
Choosing a Data Model
The Doris data model is currently divided into three categories: AGGREGATE KEY, UNIQUE KEY, and DUPLICATE KEY. In all three models, data is sorted by KEY.
1. AGGREGATE KEY
When AGGREGATE KEYs are the same, the new and old records are aggregated. Currently supported aggregate functions include SUM, MIN, MAX, and REPLACE.
The AGGREGATE KEY model can aggregate data in advance and is suitable for reporting and multidimensional analysis business.
CREATE TABLE site_visit
(
siteid INT,
city SMALLINT,
username VARCHAR(32),
pv BIGINT SUM DEFAULT '0'
)
AGGREGATE KEY (siteid, city, username)
DISTRIBUTED BY HASH (siteid) BUCKETS 10;
2. UNIQUE KEY
When UNIQUE KEYs are the same, the new record overwrites the old record. Currently, the implementation of UNIQUE KEYs is the same as the REPLACE aggregation method of AGGREGATE KEYs, and the two are essentially the same. It is suitable for analytical businesses requiring updates.
CREATE TABLE sales_order
(
orderid BIGINT,
status TINYINT,
username VARCHAR(32),
amount BIGINT DEFAULT '0'
)
UNIQUE KEY(orderid)
DISTRIBUTED BY HASH(orderid) BUCKETS 10;
3. DUPLICATE KEY
Only the sort column is specified, and the same rows will not be merged. It is suitable for analytical businesses where advanced data aggregation is not required.
CREATE TABLE session_data
(
visitorid SMALLINT,
sessionid BIGINT,
visittime DATETIME,
city CHAR(20),
province CHAR(20),
ip varchar(32),
brower CHAR(20),
url VARCHAR(1024)
)
DUPLICATE KEY(visitorid, sessionid)
DISTRIBUTED BY HASH(sessionid, visitorid) BUCKETS 10;
Large Wide Tables and Star Schema
When the business side creates a table, in order to adapt to the front-end business, they often do not distinguish between dimension information and metric information, but define the Schema as a large wide table. For Doris, these large, such large wide tables often do not perform as well as one expects.
There are many fields in the Schema, and there may be more key columns in the aggregation model, which will increase the columns to be ordered during the import process.
The update of dimension information will be reflected in the entire table, and its update frequency directly affects query efficiency.
During use, users should use Star Schema to distinguish between dimension tables and metric tables whenever possible. Frequently updated dimension tables can also be placed in MySQL external tables. If there are only a few updates, they can be placed directly in Doris. When storing dimension tables in Doris, you can set more replicas of the dimension tables to improve the performance of Join.
Partitioning and Bucketing
Doris supports two-level partition storage. The first level is partition, which currently supports two types: RANGE partition and LIST partition. The second level is HASH bucket.
1. Partition
Partitions are used to divide data into different intervals, which can be logically understood as dividing a base table into multiple sub-tables. Data can be easily managed by partition, for example, deleting data is faster with partition.
1.1 RANGE Partition
In business, most users choose to partition by time.
1.2 LIST Partition
In business, users can choose cities or other enumeration values for partitioning.
2. HASH Bucket
Data is divided into different buckets by hash value.
It is recommended to use columns with high discrimination for bucketing to avoid data skew.
To facilitate data recovery, it is recommended to restrict the size of a single bucket to 10 GB. You should consider the number of buckets when creating a table or adding partitions. Different numbers of buckets can be designated to different partitions.
Sparse Indexes and Bloom Filters
Doris stores data in order and creates a sparse index for the data based on the order of the data. The index granularity is block (1,024 rows).
For the sparse index, a prefix of fixed length in schema is chosen as the index content. Currently in Doris, a prefix of 36 bytes is chosen as the index.
When creating a table, it is recommended to place the commonly used filter fields in the query in front of Schema. The more distinctive and frequently used query fields are, the earlier they are placed.
There is one thing needing attention, that is, the varchar type field. A varchar type field can only be the last field of a sparse index. The index is truncated at varchar, so if varchar appears first, the index may be less than 36 bytes long. For details, see Data Model, ROLLUP and Prefix Index. In addition to sparse indexes, Doris also provides bloomfilter indexes, which have a significant filtering effect on columns with high discrimination. Considering that varchar cannot be placed in a sparse index, you can create a bloomfilter index.
Materialized Views (Rollup)
Rollup can essentially be understood as a materialized index of the Base Table. When creating a Rollup, you can select certain columns in the Base Table as Schema. The order of fields in Schema can also be different from that in the Base Table.
A Rollup can be created in the following situations:
1. The data in the Base Table is not highly aggregated.
This is usually because that there are highly discriminate fields in the Base Table. At this time, you can consider selecting certain columns and creating a Rollup.
For example, for the site_visit
table:
site_visit(siteid, city, username, pv)
SiteID may result in low data aggregation. If the business side often needs to count PVs by cities, a Rollup with only city and PV can be created:
ALTER TABLE site_visit ADD ROLLUP rollup_city(city, pv);
2. The prefix index in the Base Table cannot be hit.
This is generally because that the Base Table creation method cannot cover all query modes. At this time, you can consider adjusting the column order and creating a Rollup.
For example, for the session_data table:
session_data(visitorid, sessionid, visittime, city, province, ip, brower, url)
If you want to analyze visits by browser and province in addition to visitorid, you can create a separate Rollup.
ALTER TABLE session_data ADD ROLLUP rollup_brower(brower,province,ip,url) DUPLICATE KEY(brower,province);
Schema Change
In Doris, there are currently three ways to perform Schema Changes: Sorted Schema Change, Direct Schema Change, and Linked Schema Change.
1. Sorted Schema Change
The sorting method of the column has been changed, and the data needs to be re-sorted. For example, when a sorted column is deleted, the fields need to be re-sorted.
ALTER TABLE site_visit DROP COLUMN city;
2. Direct Schema Change: No re-sorting is required, but a data transformation is required. For example, modify the column type, add a column to a sparse index, and so on.
ALTER TABLE site_visit MODIFY COLUMN username varchar(64);
3. Linked Schema Change: No need to convert data, it can be completed directly. For example, add a column.
ALTER TABLE site_visit ADD COLUMN click bigint SUM default '0';
It is recommended to consider the Schema when creating a table, so that the process of Schema Change can be accelerated.
Was this page helpful?