Data Table and Data Model
Last updated: 2024-06-27 10:54:06
This document describes the tables and data models of Doris from a logical perspective, so that you can make better use of Doris in different business scenarios.

Data Table

In Doris, data is logically described in the form of tables. A table is a collection of homogeneous data with the same schema. A table consists of rows and columns: a row is a single record of user data, and a column describes a field of a record and can adopt different data types (such as integer, string, and boolean) according to the actual situation.
In the OLAP scenario, columns can be divided into two categories: Key and Value. Key columns represent dimensions, and Value columns represent metrics.

Data Model

The data model of Doris is mainly divided into three categories: Aggregate, Unique and Duplicate.

Aggregate Model

We illustrate what the Aggregate model is and how to use it correctly with practical examples.

Example 1: Importing Data Aggregation

Assume that the business has the following data table schema:
| ColumnName | Type | AggregationType | Comment |
| --- | --- | --- | --- |
| user_id | LARGEINT | - | User ID |
| date | DATE | - | Data import date |
| city | VARCHAR(20) | - | User's city |
| age | SMALLINT | - | User's age |
| sex | TINYINT | - | User's gender |
| last_visit_date | DATETIME | REPLACE | User's last access time |
| cost | BIGINT | SUM | Total user consumption |
| max_dwell_time | INT | MAX | User's maximum stay time |
| min_dwell_time | INT | MIN | User's minimum stay time |
The corresponding CREATE TABLE statement would be as follows (omitting the Partition and Distribution information):
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
    `user_id` LARGEINT NOT NULL COMMENT "user id",
    `date` DATE NOT NULL COMMENT "data import date",
    `city` VARCHAR(20) COMMENT "user's city",
    `age` SMALLINT COMMENT "user's age",
    `sex` TINYINT COMMENT "user's gender",
    `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "last visit time of user",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "total consumption of user",
    `max_dwell_time` INT MAX DEFAULT "0" COMMENT "user's maximum stay time",
    `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "user's minimum stay time"
)
AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES (
    "replication_allocation" = "tag.location.default: 1"
);
As you can see, this is a typical fact table of user information and visit behaviors. In star schemas, user information and visit behaviors are usually stored in dimension tables and fact tables, respectively. Here, for the convenience of explanation, we store the two kinds of information in one single table.
The columns in the table are divided into Key (dimension) columns and Value (metric) columns according to whether an AggregationType is set. Those without an AggregationType, such as user_id, date, city, age, and sex, are Key columns, while those with an AggregationType are Value columns. When data is imported, rows with identical Key columns are aggregated into one row, and their Value columns are aggregated according to the set AggregationType. The AggregationType currently supports the following four aggregation methods:
1. SUM: total sum, the values of multiple rows are accumulated.
2. REPLACE: Replacement. The newly imported value will replace the previous value.
3. MAX: Retain maximum value.
4. MIN: Retain minimum value.
Assume we have the following imported data (original data):
| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
Assume this is a table that records the user behaviors when visiting a certain commodity page. We will explain the first row of data as follows:
| Data | Description |
| --- | --- |
| 10000 | User ID, each user's unique identifier |
| 2017-10-01 | Data import date, accurate to the day |
| Beijing | User's city |
| 20 | User's age |
| 0 | User's gender: 0 for male, 1 for female |
| 2017-10-01 06:00:00 | The time when the user visited this page, accurate to the second |
| 20 | Consumption generated by this visit of the user |
| 10 | Dwell time of this visit on the page |
| 10 | Dwell time of this visit on the page (redundant, kept for MIN aggregation) |
After these data are correctly imported into Doris, what is finally stored in Doris is as follows:
| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
As can be seen, the data of user 10000 have been aggregated into one row, while the data of other users remain unchanged. The aggregated data of user 10000 can be explained as follows (the first five columns remain unchanged; the changes start with column 6, last_visit_date):
2017-10-01 07:00:00: The last_visit_date column is aggregated by REPLACE, so 2017-10-01 07:00:00 has replaced 2017-10-01 06:00:00.
Note
When using REPLACE to aggregate data within the same import batch, the order of replacement is uncertain. That means, in this case, the value eventually saved in Doris could be 2017-10-01 06:00:00. However, for different import batches, it is certain that data from the new batch will replace data from the old batch.
35: The cost column is aggregated by SUM, so the updated value 35 is the result of 20 + 15.
10: The max_dwell_time column is aggregated by MAX, so 10 is saved as it is the maximum between 10 and 2.
2: The min_dwell_time column is aggregated by MIN, so 2 is saved as it is the minimum between 10 and 2.
After aggregation, Doris will only store the aggregated data. In other words, the detailed raw data will no longer be available.
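For instance, a query against the table now returns only the aggregated row for user 10000 (a sketch; the result shown follows the stored data above):
-- The two raw rows of user 10000 are no longer retrievable;
-- only the aggregated row is returned.
SELECT * FROM example_db.example_tbl WHERE user_id = 10000;
-- 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2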

Example 2: Keep Detailed Data

Here is a modified version of the table schema in Example 1:
| ColumnName | Type | AggregationType | Comment |
| --- | --- | --- | --- |
| user_id | LARGEINT | - | User ID |
| date | DATE | - | Data import date |
| timestamp | DATETIME | - | Data import time, accurate to the second |
| city | VARCHAR(20) | - | User's city |
| age | SMALLINT | - | User's age |
| sex | TINYINT | - | User's gender |
| last_visit_date | DATETIME | REPLACE | User's last access time |
| cost | BIGINT | SUM | Total user consumption |
| max_dwell_time | INT | MAX | User's maximum stay time |
| min_dwell_time | INT | MIN | User's minimum stay time |
A new column, timestamp, has been added to record the data import time, accurate to the second, and the AGGREGATE KEY is set to AGGREGATE KEY (user_id, date, timestamp, city, age, sex).
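The corresponding CREATE TABLE statement would then look like the following sketch (identical to Example 1 except for the timestamp column and the key definition):
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
    `user_id` LARGEINT NOT NULL COMMENT "user id",
    `date` DATE NOT NULL COMMENT "data import date",
    `timestamp` DATETIME NOT NULL COMMENT "data import time, accurate to the second",
    `city` VARCHAR(20) COMMENT "user's city",
    `age` SMALLINT COMMENT "user's age",
    `sex` TINYINT COMMENT "user's gender",
    `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "last visit time of user",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "total consumption of user",
    `max_dwell_time` INT MAX DEFAULT "0" COMMENT "user's maximum stay time",
    `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "user's minimum stay time"
)
AGGREGATE KEY(`user_id`, `date`, `timestamp`, `city`, `age`, `sex`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES (
    "replication_allocation" = "tag.location.default: 1"
);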
The imported data are as follows:
| user_id | date | timestamp | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | 2017-10-01 08:00:05 | Beijing | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
| 10000 | 2017-10-01 | 2017-10-01 09:00:05 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
| 10001 | 2017-10-01 | 2017-10-01 18:12:10 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | 2017-10-02 13:10:00 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | 2017-10-02 13:15:00 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | 2017-10-01 12:12:48 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | 2017-10-03 12:38:20 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
After these data are correctly imported into Doris, what is finally stored in Doris is as follows:
| user_id | date | timestamp | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | 2017-10-01 08:00:05 | Beijing | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
| 10000 | 2017-10-01 | 2017-10-01 09:00:05 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
| 10001 | 2017-10-01 | 2017-10-01 18:12:10 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | 2017-10-02 13:10:00 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | 2017-10-02 13:15:00 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | 2017-10-01 12:12:48 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | 2017-10-03 12:38:20 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
We can see that the stored data are exactly the same as the imported data, without any aggregation. This is because the newly added timestamp column makes the Keys of the rows differ from one another. That is to say, as long as the Keys of the rows in the imported data are not exactly identical, Doris can retain complete detailed data even in the Aggregate model.

Example 3: Aggregate Imported Data and the Existing Data

Based on Example 1, suppose that you have the following data stored in Doris:
| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
We will import another batch of data:
| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 11:22:00 | 44 | 19 | 19 |
| 10005 | 2017-10-03 | Changsha | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 |
After these data are correctly imported into Doris, what is finally stored in Doris is as follows:
| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 11:22:00 | 55 | 19 | 6 |
| 10005 | 2017-10-03 | Changsha | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 |
As you can see, the existing data and the newly imported data of user 10004 have been aggregated, and the new data of user 10005 have been added.
Data aggregation occurs in the following three stages in Doris:
1. The ETL stage of each batch of import data. At this stage, the batch of import data will be aggregated internally.
2. The data compaction stage of the underlying BE. At this stage, BE will aggregate data from different batches that have been imported.
3. Data query stage. The data involved in the query is aggregated accordingly.
Data may be aggregated to varying degrees at different times. For example, when a batch of data has just been imported, it may not yet be aggregated with the existing data. However, users can only query aggregated data; the varying degrees of internal aggregation are transparent to users, who should always assume that what they see is the final, aggregated result.

Unique Model

In some multi-dimensional analysis scenarios, what users care more about is how to ensure the uniqueness of the Key, that is, how to obtain a primary key uniqueness constraint. Therefore, we introduced the Unique data model. This model is essentially a special case of the Aggregate model and a simplified way of representing the table schema. For example:
| ColumnName | Type | IsKey | Comment |
| --- | --- | --- | --- |
| user_id | BIGINT | Yes | User ID |
| username | VARCHAR(50) | Yes | User nickname |
| city | VARCHAR(20) | No | User's city |
| age | SMALLINT | No | User's age |
| sex | TINYINT | No | User's gender |
| phone | LARGEINT | No | User phone |
| address | VARCHAR(500) | No | User address |
| register_time | DATETIME | No | User registration time |
This is a typical table of basic user information. This kind of data has no aggregation requirement; we only need to ensure the uniqueness of the primary key (here, the primary key is user_id + username). The CREATE TABLE statement is as follows:
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
    `user_id` LARGEINT NOT NULL COMMENT "user id",
    `username` VARCHAR(50) NOT NULL COMMENT "user nickname",
    `city` VARCHAR(20) COMMENT "user's city",
    `age` SMALLINT COMMENT "user's age",
    `sex` TINYINT COMMENT "user's gender",
    `phone` LARGEINT COMMENT "user's phone number",
    `address` VARCHAR(500) COMMENT "user's address",
    `register_time` DATETIME COMMENT "user registration time"
)
UNIQUE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES (
    "replication_allocation" = "tag.location.default: 1"
);
The schema of this table is exactly the same as that of the table described by the following Aggregate model:
| ColumnName | Type | AggregationType | Comment |
| --- | --- | --- | --- |
| user_id | BIGINT | - | User ID |
| username | VARCHAR(50) | - | User nickname |
| city | VARCHAR(20) | REPLACE | User's city |
| age | SMALLINT | REPLACE | User's age |
| sex | TINYINT | REPLACE | User's gender |
| phone | LARGEINT | REPLACE | User phone |
| address | VARCHAR(500) | REPLACE | User address |
| register_time | DATETIME | REPLACE | User registration time |
And the corresponding CREATE TABLE statement:
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
    `user_id` LARGEINT NOT NULL COMMENT "user id",
    `username` VARCHAR(50) NOT NULL COMMENT "user nickname",
    `city` VARCHAR(20) REPLACE COMMENT "user's city",
    `age` SMALLINT REPLACE COMMENT "user's age",
    `sex` TINYINT REPLACE COMMENT "user's gender",
    `phone` LARGEINT REPLACE COMMENT "user's phone number",
    `address` VARCHAR(500) REPLACE COMMENT "user's address",
    `register_time` DATETIME REPLACE COMMENT "user registration time"
)
AGGREGATE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES (
    "replication_allocation" = "tag.location.default: 1"
);
That is, the Unique model can be entirely replaced by the REPLACE method of the Aggregate model; their internal implementation and data storage are exactly the same, so we will not give further examples here.
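As a quick illustration of the REPLACE semantics on the Unique table above (a minimal sketch with hypothetical values):
INSERT INTO example_db.example_tbl (user_id, username, city, age, sex)
VALUES (10000, "alice", "Beijing", 20, 0);
INSERT INTO example_db.example_tbl (user_id, username, city, age, sex)
VALUES (10000, "alice", "Shanghai", 21, 0);
-- Only one row remains for the key (10000, "alice"): the values of the
-- later batch (city = "Shanghai", age = 21) have replaced the earlier ones.
SELECT * FROM example_db.example_tbl WHERE user_id = 10000;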

Duplicate Model

In some multi-dimensional analysis scenarios, the data has neither primary keys nor aggregation requirements. Therefore, we introduce the Duplicate data model to meet such needs. Here is an example.
| ColumnName | Type | SortKey | Comment |
| --- | --- | --- | --- |
| timestamp | DATETIME | Yes | Log time |
| type | INT | Yes | Log type |
| error_code | INT | Yes | Error code |
| error_msg | VARCHAR(1024) | No | Detailed error information |
| op_id | BIGINT | No | Person in charge ID |
| op_time | DATETIME | No | Processing time |
The table creation statement is as follows:
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
    `timestamp` DATETIME NOT NULL COMMENT "log time",
    `type` INT NOT NULL COMMENT "log type",
    `error_code` INT COMMENT "error code",
    `error_msg` VARCHAR(1024) COMMENT "detailed error info",
    `op_id` BIGINT COMMENT "person in charge id",
    `op_time` DATETIME COMMENT "processing time"
)
DUPLICATE KEY(`timestamp`, `type`)
DISTRIBUTED BY HASH(`type`) BUCKETS 1
PROPERTIES (
    "replication_allocation" = "tag.location.default: 1"
);
This data model differs from the Aggregate and Unique models: data is stored exactly as it appears in the imported file, without any aggregation. Even if two rows of data are completely identical, both will be retained. The DUPLICATE KEY specified in the CREATE TABLE statement is only used to indicate the columns by which the data is sorted; we recommend using the first 2-4 columns for it. This data model is suitable for storing raw data that has no aggregation requirements or primary key uniqueness constraints.
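As an illustration against the log table above (a hedged sketch with hypothetical values):
INSERT INTO example_db.example_tbl VALUES
("2017-10-01 08:00:05", 1, 404, "page not found", 101, "2017-10-01 08:30:00"),
("2017-10-01 08:00:05", 1, 404, "page not found", 101, "2017-10-01 08:30:00");
-- Both identical rows are retained, so this returns 2.
SELECT COUNT(*) FROM example_db.example_tbl;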

Limitations of the Aggregate Model

Aggregate Model & Unique Model

Here we introduce the limitations of the Aggregate model (including the Unique model). In the Aggregate model, what the model presents is the finally aggregated data. That is, any data that has not yet been aggregated (for example, the data of two different import batches) must be presented in a consistent way. Assume the table structure is as follows:
| ColumnName | Type | AggregationType | Comment |
| --- | --- | --- | --- |
| user_id | LARGEINT | - | User ID |
| date | DATE | - | Data import date |
| cost | BIGINT | SUM | Total user consumption |
Assume there are the following two batches of data that have been successfully imported into the storage engine:
Batch 1
| user_id | date | cost |
| --- | --- | --- |
| 10001 | 2017-11-20 | 50 |
| 10002 | 2017-11-21 | 39 |
Batch 2
| user_id | date | cost |
| --- | --- | --- |
| 10001 | 2017-11-20 | 1 |
| 10001 | 2017-11-21 | 5 |
| 10003 | 2017-11-22 | 22 |
As you can see, the data of user 10001 in the two different import batches have not yet been aggregated. However, to ensure that users can only query the following aggregated data:
| user_id | date | cost |
| --- | --- | --- |
| 10001 | 2017-11-20 | 51 |
| 10001 | 2017-11-21 | 5 |
| 10002 | 2017-11-21 | 39 |
| 10003 | 2017-11-22 | 22 |
we have added an aggregation operator to the query engine to guarantee the consistency of data presentation.
Moreover, when executing on a Value column an aggregate query whose function is inconsistent with the column's aggregation type, pay close attention to the semantics. For example, if we execute the following query in the above example:
SELECT MIN(cost) FROM table;
The result will be 5, not 1.
At the same time, this guarantee of consistency will significantly reduce efficiency in some queries.
Let's use the most basic count(*) query as an example:
SELECT COUNT(*) FROM table;
In other databases, such queries return results quickly. This is because, in actual implementation, a database can get the result by counting rows and saving the statistics at import time, or by scanning only a single column during the query, both with very little overhead. However, in Doris's Aggregate model, the overhead of this kind of query is very high.
Let's take our previous data as an example:
Batch 1
| user_id | date | cost |
| --- | --- | --- |
| 10001 | 2017-11-20 | 50 |
| 10002 | 2017-11-21 | 39 |
Batch 2
| user_id | date | cost |
| --- | --- | --- |
| 10001 | 2017-11-20 | 1 |
| 10001 | 2017-11-21 | 5 |
| 10003 | 2017-11-22 | 22 |
Since the final aggregated result is:
| user_id | date | cost |
| --- | --- | --- |
| 10001 | 2017-11-20 | 51 |
| 10001 | 2017-11-21 | 5 |
| 10002 | 2017-11-21 | 39 |
| 10003 | 2017-11-22 | 22 |
the correct result of select count(*) from table; should be 4. But if the model only scans the user_id column and performs aggregation upon query, the final result will be 3 (10001, 10002, 10003); and if it does not perform aggregation, the final result will be 5 (a total of five rows in the two batches). Both results are incorrect.
In order to get the correct result, we must read both the user_id and date columns and perform aggregation during the query to return the correct result of 4. That is to say, in a count(*) query, Doris must scan all AGGREGATE KEY columns (in this case, user_id and date) and aggregate them to obtain semantically correct results. If there are many aggregate key columns, count(*) queries could involve scanning large amounts of data.
Therefore, if you need to perform frequent count(*) queries, we recommend simulating count(*) by adding a column whose value is always 1 and whose aggregation type is SUM. In this way, the table schema in the previous example would be modified as follows:
| ColumnName | Type | AggregationType | Comment |
| --- | --- | --- | --- |
| user_id | BIGINT | - | User ID |
| date | DATE | - | Data import date |
| cost | BIGINT | SUM | Total user consumption |
| count | BIGINT | SUM | For count calculation |
Add a count column whose value is always 1 in the imported data. Then the result of select count(*) from table; is equivalent to that of select sum(count) from table;, and the latter is far more efficient than the former. However, this approach has a limitation in use: users need to ensure that they will not repeatedly import rows with the same AGGREGATE KEY columns. Otherwise, select sum(count) from table; can only represent the number of originally imported rows, not the semantics of select count(*) from table;, and its value will erroneously increase.
Another way is to change the aggregation type of the count column above to REPLACE while still keeping its value constant at 1. Then the results of select sum(count) from table; and select count(*) from table; will be consistent, and this approach imposes no restriction on importing duplicate rows.
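A minimal sketch of the modified CREATE TABLE (the table name example_tbl_cnt is hypothetical; for the REPLACE variant, change the count column's aggregation type from SUM to REPLACE):
CREATE TABLE IF NOT EXISTS example_db.example_tbl_cnt
(
    `user_id` BIGINT NOT NULL COMMENT "user id",
    `date` DATE NOT NULL COMMENT "data import date",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "total consumption of user",
    `count` BIGINT SUM DEFAULT "1" COMMENT "for count calculation, always import 1"
)
AGGREGATE KEY(`user_id`, `date`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES (
    "replication_allocation" = "tag.location.default: 1"
);
-- select sum(`count`) from example_tbl_cnt; is then equivalent to
-- select count(*) from example_tbl_cnt; but far cheaper to evaluate.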

Duplicate Model

The Duplicate model is not subject to the same limitation as the Aggregate model because it does not involve aggregation semantics. On any column, it can return semantically correct results for count(*) queries.

Best Practice

Because the data model is determined at table creation time and cannot be modified afterward, choosing a suitable data model is very important.

Choosing a Data Model

There are currently three Doris data models: AGGREGATE KEY, UNIQUE KEY, and DUPLICATE KEY. In all three models, data is sorted by the KEY columns.

Aggregate Model

The Aggregate model can significantly reduce the amount of data scanned and the computation load of aggregate queries through pre-aggregation, which makes it very suitable for reporting scenarios with fixed query patterns. However, this model is not friendly to count(*) queries. Also, because the aggregation method on the Value columns is fixed, semantic correctness needs to be considered when conducting other types of aggregate queries on them. When the Aggregate Key is the same, new and old records are aggregated; the currently supported aggregation functions are SUM, MIN, MAX, and REPLACE.
CREATE TABLE site_visit
(
    siteid INT,
    city SMALLINT,
    username VARCHAR(32),
    pv BIGINT SUM DEFAULT '0'
)
AGGREGATE KEY(siteid, city, username)
DISTRIBUTED BY HASH(siteid) BUCKETS 10;

Unique Model

The Unique model is designed for scenarios that require a unique primary key constraint. When the Unique Key is the same, the new record replaces the old one, thereby ensuring the uniqueness constraint of the primary key. It is suitable for analytical businesses with update requirements. The current implementation of the Unique Key is the same as the REPLACE aggregation method of the Aggregate Key; they are essentially identical. However, the Unique model cannot take advantage of the query benefits brought by pre-aggregation, such as ROLLUP (since its essence is REPLACE, there are no aggregation methods such as SUM).
Note
The Unique model only supports whole-row updates. If you need both a unique primary key constraint and updates to specific columns only (for instance, merging columns from multiple source tables into one Doris table), consider the Aggregate model and set the aggregation type of the non-primary-key columns to REPLACE_IF_NOT_NULL, as sketched after the example below. For specific usage, see Data Model.
CREATE TABLE sales_order
(
    orderid BIGINT,
    status TINYINT,
    username VARCHAR(32),
    amount BIGINT DEFAULT '0'
)
UNIQUE KEY(orderid)
DISTRIBUTED BY HASH(orderid) BUCKETS 10;
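For the partial-column-update case mentioned in the note above, a hedged sketch using REPLACE_IF_NOT_NULL on an Aggregate table (the table and columns are hypothetical):
CREATE TABLE user_profile
(
    userid BIGINT,
    city VARCHAR(20) REPLACE_IF_NOT_NULL NULL,
    phone LARGEINT REPLACE_IF_NOT_NULL NULL
)
AGGREGATE KEY(userid)
DISTRIBUTED BY HASH(userid) BUCKETS 10;
-- An import batch that supplies only phone (leaving city NULL) updates
-- phone while keeping the previously stored city value.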

Duplicate Model

The Duplicate model does not merge identical rows and is suitable for ad-hoc queries in any dimension. Although it cannot take advantage of pre-aggregation, it is not subject to the constraints of the Aggregate model and can exploit the benefits of columnar storage (column pruning, vectorized execution, etc.).
CREATE TABLE session_data
(
    visitorid SMALLINT,
    sessionid BIGINT,
    visittime DATETIME,
    city CHAR(20),
    province CHAR(20),
    ip VARCHAR(32),
    browser CHAR(20),
    url VARCHAR(1024)
)
DUPLICATE KEY(visitorid, sessionid)
DISTRIBUTED BY HASH(sessionid, visitorid) BUCKETS 10;

Large Wide Tables and Star Schema

When business parties create tables, they often do not distinguish between dimension information and metric information, but define the schema as one large wide table to suit front-end businesses. For Doris, the performance of such large wide tables is usually unsatisfactory:
1. There are many fields in the schema, and the Aggregate model may contain more key columns, which increases the number of columns to be sorted during import.
2. An update of dimension information is reflected in the whole table, and its update frequency directly affects query efficiency.
We advise users to adopt the Star Schema to separate dimension tables and fact tables. Frequently updated dimension tables can be placed in MySQL as external tables, while dimension tables with few updates can be placed directly in Doris. When a dimension table is stored in Doris, more replicas can be set for it to improve Join performance.
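For example, a small dimension table stored in Doris can be given more replicas to speed up Joins; a hedged sketch (the table and replica count are illustrative, and the replica count must not exceed the number of available BE nodes):
CREATE TABLE dim_city
(
    city_id INT,
    city_name VARCHAR(64)
)
UNIQUE KEY(city_id)
DISTRIBUTED BY HASH(city_id) BUCKETS 1
PROPERTIES (
    "replication_allocation" = "tag.location.default: 3"
);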