Example 1: importing and aggregating data. Assume that the business has the following data table schema:

| ColumnName | Type | AggregationType | Comment |
| --- | --- | --- | --- |
| user_id | LARGEINT | - | User ID |
| date | DATE | - | Data import date |
| city | VARCHAR(20) | - | User's city |
| age | SMALLINT | - | User's age |
| sex | TINYINT | - | User's gender |
| last_visit_date | DATETIME | REPLACE | User's last access time |
| cost | BIGINT | SUM | Total user consumption |
| max_dwell_time | INT | MAX | User's maximum stay time |
| min_dwell_time | INT | MIN | User's minimum stay time |
The corresponding CREATE TABLE statement is as follows:

```sql
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
    `user_id` LARGEINT NOT NULL COMMENT "user id",
    `date` DATE NOT NULL COMMENT "data import date",
    `city` VARCHAR(20) COMMENT "user's city",
    `age` SMALLINT COMMENT "user's age",
    `sex` TINYINT COMMENT "user's gender",
    `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "last visit time of user",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "total consumption of user",
    `max_dwell_time` INT MAX DEFAULT "0" COMMENT "user's maximum stay time",
    `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "user's minimum stay time"
)
AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES ("replication_allocation" = "tag.location.default: 1");
```
Columns in the table are divided into Key (dimension) columns and Value (metric) columns according to whether an `AggregationType` is set. Those without a set `AggregationType`, such as `user_id`, `date`, `age`, and `sex`, are referred to as Keys, while those with a set `AggregationType` are referred to as Values.

When we import data, rows with the same Key columns will be aggregated into one row, and their values in the Value columns will be aggregated according to the set `AggregationType`. The `AggregationType` currently supports the following four ways of aggregation:

- SUM: accumulate the values of multiple rows.
- REPLACE: the newly imported value replaces the previously imported one.
- MAX: keep the maximum value.
- MIN: keep the minimum value.

Suppose we have the following imported data (raw data):
| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
Taking the first row as an example, the fields are as follows:

| Data | Description |
| --- | --- |
| 10000 | User ID, each user's unique identification |
| 2017-10-01 | Data import date, accurate to the day |
| Beijing | User's city |
| 20 | User's age |
| 0 | Gender: male (1 for female) |
| 2017-10-01 06:00:00 | Time when the user visited this page, accurate to the second |
| 20 | Consumption generated by this visit |
| 10 | Dwell time of this visit on the page |
| 10 | Dwell time of this visit on the page (redundant, for MIN aggregation) |
After this batch of data is correctly imported into Doris, it will be stored as follows:

| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
As you can see, user 10000 ends up with only one aggregated row, while the data of the other users remain the same as the raw data. For the aggregated row of user 10000, the first five columns remain unchanged, and the changes start from the sixth column, `last_visit_date`:

- `2017-10-01 07:00:00`: the `last_visit_date` column is aggregated by REPLACE, so `2017-10-01 07:00:00` has replaced `2017-10-01 06:00:00`. (For data in the same import batch, the order of replacement is not guaranteed; across different import batches, however, it is certain that data from the new batch will replace those from the old batch.)
- `35`: the `cost` column is aggregated by SUM, so the value 35 is the result of 20 + 15.
- `10`: the `max_dwell_time` column is aggregated by MAX, so 10 is saved as the maximum of 10 and 2.
- `2`: the `min_dwell_time` column is aggregated by MIN, so 2 is saved as the minimum of 10 and 2.
column is aggregated by MIN, so 2 is saved as it is the minimum between 10 and 2.ColumnName | Type | AggregationType | Comment |
| ColumnName | Type | AggregationType | Comment |
| --- | --- | --- | --- |
| user_id | LARGEINT | - | User ID |
| date | DATE | - | Data import date |
| timestamp | DATETIME | - | Data import time, accurate to the second |
| city | VARCHAR(20) | - | User's city |
| age | SMALLINT | - | User's age |
| sex | TINYINT | - | User's gender |
| last_visit_date | DATETIME | REPLACE | User's last access time |
| cost | BIGINT | SUM | Total user consumption |
| max_dwell_time | INT | MAX | User's maximum stay time |
| min_dwell_time | INT | MIN | User's minimum stay time |
A new column `timestamp` has been added to record the data import time, accurate to the second, and the AGGREGATE KEY has been changed to `AGGREGATE KEY(user_id, date, timestamp, city, age, sex)`.
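The corresponding DDL change could look like the following sketch (the table name `example_tbl2` is hypothetical; everything else mirrors the Example 1 statement):

```sql
CREATE TABLE IF NOT EXISTS example_db.example_tbl2
(
    `user_id` LARGEINT NOT NULL COMMENT "user id",
    `date` DATE NOT NULL COMMENT "data import date",
    `timestamp` DATETIME NOT NULL COMMENT "data import time, accurate to the second",
    `city` VARCHAR(20) COMMENT "user's city",
    `age` SMALLINT COMMENT "user's age",
    `sex` TINYINT COMMENT "user's gender",
    `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "last visit time of user",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "total consumption of user",
    `max_dwell_time` INT MAX DEFAULT "0" COMMENT "user's maximum stay time",
    `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "user's minimum stay time"
)
AGGREGATE KEY(`user_id`, `date`, `timestamp`, `city`, `age`, `sex`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES ("replication_allocation" = "tag.location.default: 1");
```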
The imported data are as follows:

| user_id | date | timestamp | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | 2017-10-01 08:00:05 | Beijing | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
| 10000 | 2017-10-01 | 2017-10-01 09:00:05 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
| 10001 | 2017-10-01 | 2017-10-01 18:12:10 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | 2017-10-02 13:10:00 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | 2017-10-02 13:15:00 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | 2017-10-01 12:12:48 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | 2017-10-03 12:38:20 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
After this batch of data is correctly imported into Doris, it will be stored as follows:

| user_id | date | timestamp | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | 2017-10-01 08:00:05 | Beijing | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
| 10000 | 2017-10-01 | 2017-10-01 09:00:05 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
| 10001 | 2017-10-01 | 2017-10-01 18:12:10 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | 2017-10-02 13:10:00 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | 2017-10-02 13:15:00 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | 2017-10-01 12:12:48 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | 2017-10-03 12:38:20 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
As you can see, the stored data are exactly the same as the imported data, without any aggregation. This is because the newly added `timestamp` column makes the Keys of all rows different. That is to say, as long as the Keys of the imported rows are not exactly the same, Doris can preserve complete detailed data even in the Aggregate Model.
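With detail rows preserved this way, an aggregated view can still be derived at query time. A minimal sketch, assuming the `example_tbl2` schema sketched above (MAX over `last_visit_date` is used here as a stand-in for "latest visit"):

```sql
-- Roll the detailed rows back up to one row per (user_id, date) at query time.
SELECT
    user_id,
    `date`,
    MAX(last_visit_date) AS last_visit_date,
    SUM(cost)            AS cost,
    MAX(max_dwell_time)  AS max_dwell_time,
    MIN(min_dwell_time)  AS min_dwell_time
FROM example_db.example_tbl2
GROUP BY user_id, `date`;
```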
Example 3: aggregating imported data with existing data. Based on the schema of Example 1, suppose the table already contains the following data:

| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
Now we import a new batch of data:

| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 11:22:00 | 44 | 19 | 19 |
| 10005 | 2017-10-03 | Changsha | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 |
After this new batch is correctly imported into Doris, the stored data become:

| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 11:22:00 | 55 | 19 | 6 |
| 10005 | 2017-10-03 | Changsha | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 |

As you can see, the existing data of user 10004 on 2017-10-03 have been aggregated with the newly imported row (cost: 11 + 44 = 55; max_dwell_time: max(6, 19) = 19; min_dwell_time: min(6, 19) = 6), and the new data of user 10005 have been added.
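A quick sketch of this cross-batch behavior, assuming the Example 1 `example_tbl` already holding the pre-existing data shown above (each INSERT forms its own import batch, so the new batch aggregates deterministically with what is already stored):

```sql
-- New batch: aggregates with the existing rows of user 10004 and adds user 10005.
INSERT INTO example_db.example_tbl VALUES
    (10004, '2017-10-03', 'Shenzhen', 35, 0, '2017-10-03 11:22:00', 44, 19, 19),
    (10005, '2017-10-03', 'Changsha', 29, 1, '2017-10-03 18:11:02', 3, 1, 1);

-- user 10004 on 2017-10-03 now reads: cost = 55, max_dwell_time = 19,
-- min_dwell_time = 6; last_visit_date = '2017-10-03 11:22:00', because the
-- new batch REPLACEs the old one.
SELECT * FROM example_db.example_tbl WHERE `date` = '2017-10-03';
```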
Unique Model: in some multi-dimensional analysis scenarios, users care less about aggregation and more about guaranteeing the uniqueness of the Key, that is, obtaining a primary key uniqueness constraint. Doris provides the Unique data model for this. Take the following typical user information table as an example; it has no aggregation requirement and only needs to guarantee the uniqueness of the primary key (`user_id` + `username`):

| ColumnName | Type | IsKey | Comment |
| --- | --- | --- | --- |
| user_id | BIGINT | Yes | User ID |
| username | VARCHAR(50) | Yes | User nickname |
| city | VARCHAR(20) | No | User's city |
| age | SMALLINT | No | User's age |
| sex | TINYINT | No | User's gender |
| phone | LARGEINT | No | User phone |
| address | VARCHAR(500) | No | User address |
| register_time | DATETIME | No | User registration time |
The corresponding CREATE TABLE statement is as follows:

```sql
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
    `user_id` BIGINT NOT NULL COMMENT "user id",
    `username` VARCHAR(50) NOT NULL COMMENT "user nickname",
    `city` VARCHAR(20) COMMENT "user's city",
    `age` SMALLINT COMMENT "user's age",
    `sex` TINYINT COMMENT "user's gender",
    `phone` LARGEINT COMMENT "user's phone number",
    `address` VARCHAR(500) COMMENT "user's address",
    `register_time` DATETIME COMMENT "user registration time"
)
UNIQUE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES ("replication_allocation" = "tag.location.default: 1");
```
This is exactly the same table schema as the following Aggregate Model:

| ColumnName | Type | AggregationType | Comment |
| --- | --- | --- | --- |
| user_id | BIGINT | - | User ID |
| username | VARCHAR(50) | - | User nickname |
| city | VARCHAR(20) | REPLACE | User's city |
| age | SMALLINT | REPLACE | User's age |
| sex | TINYINT | REPLACE | User's gender |
| phone | LARGEINT | REPLACE | User phone |
| address | VARCHAR(500) | REPLACE | User address |
| register_time | DATETIME | REPLACE | User registration time |
And the corresponding CREATE TABLE statement:

```sql
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
    `user_id` BIGINT NOT NULL COMMENT "user id",
    `username` VARCHAR(50) NOT NULL COMMENT "user nickname",
    `city` VARCHAR(20) REPLACE COMMENT "user's city",
    `age` SMALLINT REPLACE COMMENT "user's age",
    `sex` TINYINT REPLACE COMMENT "user's gender",
    `phone` LARGEINT REPLACE COMMENT "user's phone number",
    `address` VARCHAR(500) REPLACE COMMENT "user's address",
    `register_time` DATETIME REPLACE COMMENT "user registration time"
)
AGGREGATE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES ("replication_allocation" = "tag.location.default: 1");
```

That is, the Unique Model can be completely replaced by REPLACE in the Aggregate Model; its internal implementation and data storage are exactly the same.
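A minimal sketch of the uniqueness guarantee, assuming the Unique table above (the values are illustrative): a row imported in a later batch with the same key overwrites the earlier one.

```sql
-- First batch.
INSERT INTO example_db.example_tbl VALUES
    (10001, 'user1', 'Beijing', 20, 0, 13812345678, 'address1', '2017-10-01 08:00:00');

-- Second batch, same key (user_id, username): overwrites the first row.
INSERT INTO example_db.example_tbl VALUES
    (10001, 'user1', 'Shanghai', 21, 0, 13812345678, 'address2', '2017-10-01 09:00:00');

-- Returns a single row carrying the values from the second batch.
SELECT * FROM example_db.example_tbl WHERE user_id = 10001;
```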
Duplicate Model: in some scenarios, the data have neither a primary key nor aggregation requirements. The Duplicate data model serves this need. Take the following log table as an example:

| ColumnName | Type | SortKey | Comment |
| --- | --- | --- | --- |
| timestamp | DATETIME | Yes | Log time |
| type | INT | Yes | Log type |
| error_code | INT | Yes | Error code |
| error_msg | VARCHAR(1024) | No | Detailed error information |
| op_id | BIGINT | No | Person in charge ID |
| op_time | DATETIME | No | Processing time |
The corresponding CREATE TABLE statement is as follows:

```sql
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
    `timestamp` DATETIME NOT NULL COMMENT "log time",
    `type` INT NOT NULL COMMENT "log type",
    `error_code` INT COMMENT "error code",
    `error_msg` VARCHAR(1024) COMMENT "detailed error info",
    `op_id` BIGINT COMMENT "person in charge id",
    `op_time` DATETIME COMMENT "processing time"
)
DUPLICATE KEY(`timestamp`, `type`)
DISTRIBUTED BY HASH(`type`) BUCKETS 1
PROPERTIES ("replication_allocation" = "tag.location.default: 1");
```
Different from the Aggregate and Unique Models, the Duplicate Model stores the data exactly as imported, without any aggregation; even two completely identical rows are both retained. The DUPLICATE KEY specified in the CREATE TABLE statement only indicates which columns the data are ordered by, and we recommend choosing the first 2-4 columns for it. This data model is suitable for storing raw data without aggregation requirements or primary key uniqueness constraints.
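Since the sort key only determines storage order, queries filtering on the leading sort columns can locate data efficiently. A minimal sketch against the log table above, with hypothetical filter values:

```sql
-- Range filter on the leading sort columns (timestamp, type) of the log table.
SELECT `timestamp`, `type`, error_code, error_msg
FROM example_db.example_tbl
WHERE `timestamp` >= '2017-10-01 00:00:00'
  AND `timestamp` <  '2017-10-02 00:00:00'
  AND `type` = 1;
```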
Limitations of the Aggregate Model: what the Aggregate Model presents to the outside world is the finally aggregated data, which means any data not yet physically aggregated (for example, data from two different import batches) must still be presented consistently. Take the following table as an example:

| ColumnName | Type | AggregationType | Comment |
| --- | --- | --- | --- |
| user_id | LARGEINT | - | User ID |
| date | DATE | - | Data import date |
| cost | BIGINT | SUM | Total user consumption |
Suppose there are two batches of imported data. Batch 1:

| user_id | date | cost |
| --- | --- | --- |
| 10001 | 2017-11-20 | 50 |
| 10002 | 2017-11-21 | 39 |
Batch 2:

| user_id | date | cost |
| --- | --- | --- |
| 10001 | 2017-11-20 | 1 |
| 10001 | 2017-11-21 | 5 |
| 10003 | 2017-11-22 | 22 |
After these two batches are imported, the data finally stored in Doris are as follows:

| user_id | date | cost |
| --- | --- | --- |
| 10001 | 2017-11-20 | 51 |
| 10001 | 2017-11-21 | 5 |
| 10002 | 2017-11-21 | 39 |
| 10003 | 2017-11-22 | 22 |
As you can see, the two rows of user 10001 on 2017-11-20 from the two batches have been aggregated into one. However, until that aggregation physically completes, rows from different batches may still be stored separately. To guarantee that users can only ever query the aggregated data, Doris adds an aggregation operator at the query engine layer, which ensures external consistency. For example, `SELECT MIN(cost) FROM table;` must return 5 (the minimum over the aggregated data), not 1 (the minimum over the raw, un-aggregated rows).

`SELECT COUNT(*) FROM table;` runs into the same issue and deserves special attention. Consider the same two batches again:
Batch 1:

| user_id | date | cost |
| --- | --- | --- |
| 10001 | 2017-11-20 | 50 |
| 10002 | 2017-11-21 | 39 |
Batch 2:

| user_id | date | cost |
| --- | --- | --- |
| 10001 | 2017-11-20 | 1 |
| 10001 | 2017-11-21 | 5 |
| 10003 | 2017-11-22 | 22 |
Final aggregated data:

| user_id | date | cost |
| --- | --- | --- |
| 10001 | 2017-11-20 | 51 |
| 10001 | 2017-11-21 | 5 |
| 10002 | 2017-11-21 | 39 |
| 10003 | 2017-11-22 | 22 |
The correct result of `select count(*) from table;` is 4. But if the model only scans the `user_id` column and aggregates upon query, the result will be 3 (10001, 10002, 10003); and if it does not aggregate at all, the result will be 5 (five rows in total across the two batches). Both results are incorrect. The correct approach is to read both the `user_id` and `date` columns and aggregate when querying, which returns 4. That is to say, in a `count(*)` query, Doris must scan all the AGGREGATE KEY columns (here, `user_id` and `date`) and aggregate them to get the semantically correct result. When there are many aggregate key columns, `count(*)` queries can therefore involve scanning large amounts of data.

Therefore, when frequent `count(*)` queries are expected, we recommend simulating `count(*)` by adding a column whose value is always 1 and whose aggregation type is SUM. The table in the example above would become:
| ColumnName | Type | AggregateType | Comment |
| --- | --- | --- | --- |
| user_id | BIGINT | - | User ID |
| date | DATE | - | Data import date |
| cost | BIGINT | SUM | Total user consumption |
| count | BIGINT | SUM | For count calculation |
This way, `select count(*) from table;` is equivalent to `select sum(count) from table;`, and the latter is far more efficient. However, this approach comes with a usage restriction: users must ensure that they never import rows with the same AGGREGATE KEY columns. Otherwise, `select sum(count) from table;` only reflects the number of rows in the original imports, not the semantics of `select count(*) from table;`, and its value will be erroneously inflated.

Another approach is to set the aggregation type of the `count` column above to REPLACE and keep its value constant at 1. Then the results of `select sum(count) from table;` and `select count(*) from table;` will be consistent, and this approach places no restriction on importing duplicate rows.
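A minimal sketch of the first approach (the table name `example_tbl_cnt` is hypothetical; the `count` column is always loaded with the value 1):

```sql
CREATE TABLE IF NOT EXISTS example_db.example_tbl_cnt
(
    `user_id` BIGINT COMMENT "user id",
    `date` DATE COMMENT "data import date",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "total consumption of user",
    `count` BIGINT SUM DEFAULT "1" COMMENT "for count(*) calculation, always 1"
)
AGGREGATE KEY(`user_id`, `date`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES ("replication_allocation" = "tag.location.default: 1");

-- Efficient equivalent of SELECT COUNT(*), valid as long as no two imported
-- rows ever share the same (user_id, date) key.
SELECT SUM(`count`) FROM example_db.example_tbl_cnt;
```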
Finally, here is an example table for each of the three models. An Aggregate Model table for site visits:

```sql
CREATE TABLE site_visit
(
    siteid   INT,
    city     SMALLINT,
    username VARCHAR(32),
    pv       BIGINT SUM DEFAULT '0'
)
AGGREGATE KEY(siteid, city, username)
DISTRIBUTED BY HASH(siteid) BUCKETS 10;
```
A Unique Model table for orders:

```sql
CREATE TABLE sales_order
(
    orderid  BIGINT,
    status   TINYINT,
    username VARCHAR(32),
    amount   BIGINT DEFAULT '0'
)
UNIQUE KEY(orderid)
DISTRIBUTED BY HASH(orderid) BUCKETS 10;
```
A Duplicate Model table for session details:

```sql
CREATE TABLE session_data
(
    visitorid SMALLINT,
    sessionid BIGINT,
    visittime DATETIME,
    city      CHAR(20),
    province  CHAR(20),
    ip        VARCHAR(32),
    browser   CHAR(20),
    url       VARCHAR(1024)
)
DUPLICATE KEY(visitorid, sessionid)
DISTRIBUTED BY HASH(sessionid, visitorid) BUCKETS 10;
```