Example 1: importing and aggregating data. Assume that the business has the following data table schema:

| ColumnName | Type | AggregationType | Comment |
| --- | --- | --- | --- |
| user_id | LARGEINT | - | User ID |
| date | DATE | - | Data import date |
| city | VARCHAR(20) | - | User's city |
| age | SMALLINT | - | User's age |
| sex | TINYINT | - | User's gender |
| last_visit_date | DATETIME | REPLACE | User's last access time |
| cost | BIGINT | SUM | Total user consumption |
| max_dwell_time | INT | MAX | User's maximum stay time |
| min_dwell_time | INT | MIN | User's minimum stay time |
The corresponding CREATE TABLE statement is as follows:

```sql
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
    `user_id` LARGEINT NOT NULL COMMENT "user id",
    `date` DATE NOT NULL COMMENT "data import date",
    `city` VARCHAR(20) COMMENT "user's city",
    `age` SMALLINT COMMENT "user's age",
    `sex` TINYINT COMMENT "user's gender",
    `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "last visit time of user",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "total consumption of user",
    `max_dwell_time` INT MAX DEFAULT "0" COMMENT "user's maximum stay time",
    `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "user's minimum stay time"
)
AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES ("replication_allocation" = "tag.location.default: 1");
```
Columns in the table are divided into Key (dimension) columns and Value (metric) columns according to whether an `AggregationType` is set. Those without a set `AggregationType`, such as `user_id`, `date`, `age`, and `sex`, are referred to as Keys, while those with a set `AggregationType` are referred to as Values.

When we import data, rows with the same Key columns will be aggregated into one row, and their values in the Value columns will be aggregated according to the set `AggregationType`. The `AggregationType` currently supports the following four ways of aggregation:

- SUM: accumulate the values of multiple rows.
- REPLACE: the newly imported value replaces the previously imported one.
- MAX: keep the maximum value.
- MIN: keep the minimum value.

Suppose we have the following imported data (raw data):
| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
Taking the first row as an example, the fields are as follows:

| Data | Description |
| --- | --- |
| 10000 | User ID, each user's unique identification |
| 2017-10-01 | Data import date, accurate to the day |
| Beijing | User's city |
| 20 | User's age |
| 0 | Gender: male (1 for female) |
| 2017-10-01 06:00:00 | Time when the user visited this page, accurate to the second |
| 20 | Consumption generated by this visit |
| 10 | Dwell time of this visit on the page |
| 10 | Dwell time of this visit on the page (redundant, for MIN aggregation) |
After this batch of data is correctly imported into Doris, it will be stored as follows:

| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
As you can see, user 10000 ends up with only one aggregated row, while the data of the other users remain the same as the raw data. For the aggregated row of user 10000, the first five columns remain unchanged, and the changes start from the sixth column, `last_visit_date`:

- `2017-10-01 07:00:00`: the `last_visit_date` column is aggregated by REPLACE, so `2017-10-01 07:00:00` has replaced `2017-10-01 06:00:00`. (For data in the same import batch, the order of replacement is not guaranteed; across different import batches, however, it is certain that data from the new batch will replace those from the old batch.)
- `35`: the `cost` column is aggregated by SUM, so the value 35 is the result of 20 + 15.
- `10`: the `max_dwell_time` column is aggregated by MAX, so 10 is saved as the maximum of 10 and 2.
- `2`: the `min_dwell_time` column is aggregated by MIN, so 2 is saved as the minimum of 10 and 2.
column is aggregated by MIN, so 2 is saved as it is the minimum between 10 and 2.ColumnName | Type | AggregationType | Comment |
| ColumnName | Type | AggregationType | Comment |
| --- | --- | --- | --- |
| user_id | LARGEINT | - | User ID |
| date | DATE | - | Data import date |
| timestamp | DATETIME | - | Data import time, accurate to the second |
| city | VARCHAR(20) | - | User's city |
| age | SMALLINT | - | User's age |
| sex | TINYINT | - | User's gender |
| last_visit_date | DATETIME | REPLACE | User's last access time |
| cost | BIGINT | SUM | Total user consumption |
| max_dwell_time | INT | MAX | User's maximum stay time |
| min_dwell_time | INT | MIN | User's minimum stay time |
A new column `timestamp` has been added to record the data import time, accurate to the second, and the AGGREGATE KEY has been changed to `AGGREGATE KEY(user_id, date, timestamp, city, age, sex)`.
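The corresponding DDL change could look like the following sketch (the table name `example_tbl2` is hypothetical; everything else mirrors the Example 1 statement):

```sql
CREATE TABLE IF NOT EXISTS example_db.example_tbl2
(
    `user_id` LARGEINT NOT NULL COMMENT "user id",
    `date` DATE NOT NULL COMMENT "data import date",
    `timestamp` DATETIME NOT NULL COMMENT "data import time, accurate to the second",
    `city` VARCHAR(20) COMMENT "user's city",
    `age` SMALLINT COMMENT "user's age",
    `sex` TINYINT COMMENT "user's gender",
    `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "last visit time of user",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "total consumption of user",
    `max_dwell_time` INT MAX DEFAULT "0" COMMENT "user's maximum stay time",
    `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "user's minimum stay time"
)
AGGREGATE KEY(`user_id`, `date`, `timestamp`, `city`, `age`, `sex`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES ("replication_allocation" = "tag.location.default: 1");
```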
The imported data are as follows:

| user_id | date | timestamp | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | 2017-10-01 08:00:05 | Beijing | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
| 10000 | 2017-10-01 | 2017-10-01 09:00:05 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
| 10001 | 2017-10-01 | 2017-10-01 18:12:10 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | 2017-10-02 13:10:00 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | 2017-10-02 13:15:00 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | 2017-10-01 12:12:48 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | 2017-10-03 12:38:20 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
After this batch of data is correctly imported into Doris, it will be stored as follows:

| user_id | date | timestamp | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | 2017-10-01 08:00:05 | Beijing | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
| 10000 | 2017-10-01 | 2017-10-01 09:00:05 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
| 10001 | 2017-10-01 | 2017-10-01 18:12:10 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | 2017-10-02 13:10:00 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | 2017-10-02 13:15:00 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | 2017-10-01 12:12:48 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | 2017-10-03 12:38:20 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
As you can see, the stored data are exactly the same as the imported data, without any aggregation. This is because the newly added `timestamp` column makes the Keys of all rows different. That is to say, as long as the Keys of the imported rows are not exactly the same, Doris can preserve complete detailed data even in the Aggregate Model.
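With detail rows preserved this way, an aggregated view can still be derived at query time. A minimal sketch, assuming the `example_tbl2` schema sketched above (MAX over `last_visit_date` is used here as a stand-in for "latest visit"):

```sql
-- Roll the detailed rows back up to one row per (user_id, date) at query time.
SELECT
    user_id,
    `date`,
    MAX(last_visit_date) AS last_visit_date,
    SUM(cost)            AS cost,
    MAX(max_dwell_time)  AS max_dwell_time,
    MIN(min_dwell_time)  AS min_dwell_time
FROM example_db.example_tbl2
GROUP BY user_id, `date`;
```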
Example 3: aggregating imported data with existing data. Based on the schema of Example 1, suppose the table already contains the following data:

| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
Now we import a new batch of data:

| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 11:22:00 | 44 | 19 | 19 |
| 10005 | 2017-10-03 | Changsha | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 |
After this new batch is correctly imported into Doris, the stored data become:

| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 11:22:00 | 55 | 19 | 6 |
| 10005 | 2017-10-03 | Changsha | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 |

As you can see, the existing data of user 10004 on 2017-10-03 have been aggregated with the newly imported row (cost: 11 + 44 = 55; max_dwell_time: max(6, 19) = 19; min_dwell_time: min(6, 19) = 6), and the new data of user 10005 have been added.
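A quick sketch of this cross-batch behavior, assuming the Example 1 `example_tbl` already holding the pre-existing data shown above (each INSERT forms its own import batch, so the new batch aggregates deterministically with what is already stored):

```sql
-- New batch: aggregates with the existing rows of user 10004 and adds user 10005.
INSERT INTO example_db.example_tbl VALUES
    (10004, '2017-10-03', 'Shenzhen', 35, 0, '2017-10-03 11:22:00', 44, 19, 19),
    (10005, '2017-10-03', 'Changsha', 29, 1, '2017-10-03 18:11:02', 3, 1, 1);

-- user 10004 on 2017-10-03 now reads: cost = 55, max_dwell_time = 19,
-- min_dwell_time = 6; last_visit_date = '2017-10-03 11:22:00', because the
-- new batch REPLACEs the old one.
SELECT * FROM example_db.example_tbl WHERE `date` = '2017-10-03';
```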
Unique Model: in some multi-dimensional analysis scenarios, users care less about aggregation and more about guaranteeing the uniqueness of the Key, that is, obtaining a primary key uniqueness constraint. Doris provides the Unique data model for this. Take the following typical user information table as an example; it has no aggregation requirement and only needs to guarantee the uniqueness of the primary key (`user_id` + `username`):

| ColumnName | Type | IsKey | Comment |
| --- | --- | --- | --- |
| user_id | BIGINT | Yes | User ID |
| username | VARCHAR(50) | Yes | User nickname |
| city | VARCHAR(20) | No | User's city |
| age | SMALLINT | No | User's age |
| sex | TINYINT | No | User's gender |
| phone | LARGEINT | No | User phone |
| address | VARCHAR(500) | No | User address |
| register_time | DATETIME | No | User registration time |
The corresponding CREATE TABLE statement is as follows:

```sql
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
    `user_id` BIGINT NOT NULL COMMENT "user id",
    `username` VARCHAR(50) NOT NULL COMMENT "user nickname",
    `city` VARCHAR(20) COMMENT "user's city",
    `age` SMALLINT COMMENT "user's age",
    `sex` TINYINT COMMENT "user's gender",
    `phone` LARGEINT COMMENT "user's phone number",
    `address` VARCHAR(500) COMMENT "user's address",
    `register_time` DATETIME COMMENT "user registration time"
)
UNIQUE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES ("replication_allocation" = "tag.location.default: 1");
```
This is exactly the same table schema as the following Aggregate Model:

| ColumnName | Type | AggregationType | Comment |
| --- | --- | --- | --- |
| user_id | BIGINT | - | User ID |
| username | VARCHAR(50) | - | User nickname |
| city | VARCHAR(20) | REPLACE | User's city |
| age | SMALLINT | REPLACE | User's age |
| sex | TINYINT | REPLACE | User's gender |
| phone | LARGEINT | REPLACE | User phone |
| address | VARCHAR(500) | REPLACE | User address |
| register_time | DATETIME | REPLACE | User registration time |
And the corresponding CREATE TABLE statement:

```sql
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
    `user_id` BIGINT NOT NULL COMMENT "user id",
    `username` VARCHAR(50) NOT NULL COMMENT "user nickname",
    `city` VARCHAR(20) REPLACE COMMENT "user's city",
    `age` SMALLINT REPLACE COMMENT "user's age",
    `sex` TINYINT REPLACE COMMENT "user's gender",
    `phone` LARGEINT REPLACE COMMENT "user's phone number",
    `address` VARCHAR(500) REPLACE COMMENT "user's address",
    `register_time` DATETIME REPLACE COMMENT "user registration time"
)
AGGREGATE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES ("replication_allocation" = "tag.location.default: 1");
```

That is, the Unique Model can be completely replaced by REPLACE in the Aggregate Model; its internal implementation and data storage are exactly the same.
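A minimal sketch of the uniqueness guarantee, assuming the Unique table above (the values are illustrative): a row imported in a later batch with the same key overwrites the earlier one.

```sql
-- First batch.
INSERT INTO example_db.example_tbl VALUES
    (10001, 'user1', 'Beijing', 20, 0, 13812345678, 'address1', '2017-10-01 08:00:00');

-- Second batch, same key (user_id, username): overwrites the first row.
INSERT INTO example_db.example_tbl VALUES
    (10001, 'user1', 'Shanghai', 21, 0, 13812345678, 'address2', '2017-10-01 09:00:00');

-- Returns a single row carrying the values from the second batch.
SELECT * FROM example_db.example_tbl WHERE user_id = 10001;
```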
Duplicate Model: in some scenarios, the data have neither a primary key nor aggregation requirements. The Duplicate data model serves this need. Take the following log table as an example:

| ColumnName | Type | SortKey | Comment |
| --- | --- | --- | --- |
| timestamp | DATETIME | Yes | Log time |
| type | INT | Yes | Log type |
| error_code | INT | Yes | Error code |
| error_msg | VARCHAR(1024) | No | Detailed error information |
| op_id | BIGINT | No | Person in charge ID |
| op_time | DATETIME | No | Processing time |
The corresponding CREATE TABLE statement is as follows:

```sql
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
    `timestamp` DATETIME NOT NULL COMMENT "log time",
    `type` INT NOT NULL COMMENT "log type",
    `error_code` INT COMMENT "error code",
    `error_msg` VARCHAR(1024) COMMENT "detailed error info",
    `op_id` BIGINT COMMENT "person in charge id",
    `op_time` DATETIME COMMENT "processing time"
)
DUPLICATE KEY(`timestamp`, `type`)
DISTRIBUTED BY HASH(`type`) BUCKETS 1
PROPERTIES ("replication_allocation" = "tag.location.default: 1");
```
Different from the Aggregate and Unique Models, the Duplicate Model stores the data exactly as imported, without any aggregation; even two completely identical rows are both retained. The DUPLICATE KEY specified in the CREATE TABLE statement only indicates which columns the data are ordered by, and we recommend choosing the first 2-4 columns for it. This data model is suitable for storing raw data without aggregation requirements or primary key uniqueness constraints.
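Since the sort key only determines storage order, queries filtering on the leading sort columns can locate data efficiently. A minimal sketch against the log table above, with hypothetical filter values:

```sql
-- Range filter on the leading sort columns (timestamp, type) of the log table.
SELECT `timestamp`, `type`, error_code, error_msg
FROM example_db.example_tbl
WHERE `timestamp` >= '2017-10-01 00:00:00'
  AND `timestamp` <  '2017-10-02 00:00:00'
  AND `type` = 1;
```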
Limitations of the Aggregate Model: what the Aggregate Model presents to the outside world is the finally aggregated data, which means any data not yet physically aggregated (for example, data from two different import batches) must still be presented consistently. Take the following table as an example:

| ColumnName | Type | AggregationType | Comment |
| --- | --- | --- | --- |
| user_id | LARGEINT | - | User ID |
| date | DATE | - | Data import date |
| cost | BIGINT | SUM | Total user consumption |
Suppose there are two batches of imported data. Batch 1:

| user_id | date | cost |
| --- | --- | --- |
| 10001 | 2017-11-20 | 50 |
| 10002 | 2017-11-21 | 39 |
Batch 2:

| user_id | date | cost |
| --- | --- | --- |
| 10001 | 2017-11-20 | 1 |
| 10001 | 2017-11-21 | 5 |
| 10003 | 2017-11-22 | 22 |
After these two batches are imported, the data finally stored in Doris are as follows:

| user_id | date | cost |
| --- | --- | --- |
| 10001 | 2017-11-20 | 51 |
| 10001 | 2017-11-21 | 5 |
| 10002 | 2017-11-21 | 39 |
| 10003 | 2017-11-22 | 22 |
As you can see, the two rows of user 10001 on 2017-11-20 from the two batches have been aggregated into one. However, until that aggregation physically completes, rows from different batches may still be stored separately. To guarantee that users can only ever query the aggregated data, Doris adds an aggregation operator at the query engine layer, which ensures external consistency. For example, `SELECT MIN(cost) FROM table;` must return 5 (the minimum over the aggregated data), not 1 (the minimum over the raw, un-aggregated rows).

`SELECT COUNT(*) FROM table;` runs into the same issue and deserves special attention. Consider the same two batches again:
Batch 1:

| user_id | date | cost |
| --- | --- | --- |
| 10001 | 2017-11-20 | 50 |
| 10002 | 2017-11-21 | 39 |
Batch 2:

| user_id | date | cost |
| --- | --- | --- |
| 10001 | 2017-11-20 | 1 |
| 10001 | 2017-11-21 | 5 |
| 10003 | 2017-11-22 | 22 |
Final aggregated data:

| user_id | date | cost |
| --- | --- | --- |
| 10001 | 2017-11-20 | 51 |
| 10001 | 2017-11-21 | 5 |
| 10002 | 2017-11-21 | 39 |
| 10003 | 2017-11-22 | 22 |
The correct result of `select count(*) from table;` is 4. But if the model only scans the `user_id` column and aggregates upon query, the result will be 3 (10001, 10002, 10003); and if it does not aggregate at all, the result will be 5 (five rows in total across the two batches). Both results are incorrect. The correct approach is to read both the `user_id` and `date` columns and aggregate when querying, which returns 4. That is to say, in a `count(*)` query, Doris must scan all the AGGREGATE KEY columns (here, `user_id` and `date`) and aggregate them to get the semantically correct result. When there are many aggregate key columns, `count(*)` queries can therefore involve scanning large amounts of data.

Therefore, when frequent `count(*)` queries are expected, we recommend simulating `count(*)` by adding a column whose value is always 1 and whose aggregation type is SUM. The table in the example above would become:
| ColumnName | Type | AggregateType | Comment |
| --- | --- | --- | --- |
| user_id | BIGINT | - | User ID |
| date | DATE | - | Data import date |
| cost | BIGINT | SUM | Total user consumption |
| count | BIGINT | SUM | For count calculation |
This way, `select count(*) from table;` is equivalent to `select sum(count) from table;`, and the latter is far more efficient. However, this approach comes with a usage restriction: users must ensure that they never import rows with the same AGGREGATE KEY columns. Otherwise, `select sum(count) from table;` only reflects the number of rows in the original imports, not the semantics of `select count(*) from table;`, and its value will be erroneously inflated.

Another approach is to set the aggregation type of the `count` column above to REPLACE and keep its value constant at 1. Then the results of `select sum(count) from table;` and `select count(*) from table;` will be consistent, and this approach places no restriction on importing duplicate rows.
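A minimal sketch of the first approach (the table name `example_tbl_cnt` is hypothetical; the `count` column is always loaded with the value 1):

```sql
CREATE TABLE IF NOT EXISTS example_db.example_tbl_cnt
(
    `user_id` BIGINT COMMENT "user id",
    `date` DATE COMMENT "data import date",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "total consumption of user",
    `count` BIGINT SUM DEFAULT "1" COMMENT "for count(*) calculation, always 1"
)
AGGREGATE KEY(`user_id`, `date`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES ("replication_allocation" = "tag.location.default: 1");

-- Efficient equivalent of SELECT COUNT(*), valid as long as no two imported
-- rows ever share the same (user_id, date) key.
SELECT SUM(`count`) FROM example_db.example_tbl_cnt;
```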
Finally, here is an example table for each of the three models. An Aggregate Model table for site visits:

```sql
CREATE TABLE site_visit
(
    siteid   INT,
    city     SMALLINT,
    username VARCHAR(32),
    pv       BIGINT SUM DEFAULT '0'
)
AGGREGATE KEY(siteid, city, username)
DISTRIBUTED BY HASH(siteid) BUCKETS 10;
```
A Unique Model table for orders:

```sql
CREATE TABLE sales_order
(
    orderid  BIGINT,
    status   TINYINT,
    username VARCHAR(32),
    amount   BIGINT DEFAULT '0'
)
UNIQUE KEY(orderid)
DISTRIBUTED BY HASH(orderid) BUCKETS 10;
```
A Duplicate Model table for session details:

```sql
CREATE TABLE session_data
(
    visitorid SMALLINT,
    sessionid BIGINT,
    visittime DATETIME,
    city      CHAR(20),
    province  CHAR(20),
    ip        VARCHAR(32),
    browser   CHAR(20),
    url       VARCHAR(1024)
)
DUPLICATE KEY(visitorid, sessionid)
DISTRIBUTED BY HASH(sessionid, visitorid) BUCKETS 10;
```