Data Table and Data Model

Last updated: 2024-06-27 10:54:06
    This document describes the tables and data models of Doris from a logical perspective, so that you can make better use of Doris in different business scenarios.

    Data Table

    In Doris, data is logically described in the form of tables. A table is a collection of homogeneous data that share the same schema. A table consists of rows and columns: a row is a single record of user data, and columns describe the different fields of a record, each with an appropriate data type (such as integer, string, or boolean).
    In OLAP scenarios, columns can be divided into two categories: Key and Value. Key columns are dimension columns, and Value columns are metric columns.

    Data Model

    The data model of Doris is mainly divided into three categories: Aggregate, Unique and Duplicate.

    Aggregate Model

    We will illustrate what the Aggregate model is and how to use it correctly with practical examples.

    Example 1: Importing Data Aggregation

    Assume that the business has the following data table schema:
    | ColumnName | Type | AggregationType | Comment |
    | --- | --- | --- | --- |
    | user_id | LARGEINT | - | User ID |
    | date | DATE | - | Data import date |
    | city | VARCHAR(20) | - | User's city |
    | age | SMALLINT | - | User's age |
    | sex | TINYINT | - | User's gender |
    | last_visit_date | DATETIME | REPLACE | User's last access time |
    | cost | BIGINT | SUM | Total user consumption |
    | max_dwell_time | INT | MAX | User's maximum stay time |
    | min_dwell_time | INT | MIN | User's minimum stay time |
    The corresponding CREATE TABLE statement would be as follows (omitting the Partition and Distribution information):
    CREATE TABLE IF NOT EXISTS example_db.example_tbl
    (
    `user_id` LARGEINT NOT NULL COMMENT "user id",
    `date` DATE NOT NULL COMMENT "data import date and time",
    `city` VARCHAR(20) COMMENT "user's city",
    `age` SMALLINT COMMENT "user's age",
    `sex` TINYINT COMMENT "user's gender",
    `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "last visit time of user",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "total consumption of user",
    `max_dwell_time` INT MAX DEFAULT "0" COMMENT "user's maximum stay time",
    `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "user's minimum stay time"
    )
    AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)
    DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
    PROPERTIES (
    "replication_allocation" = "tag.location.default: 1"
    );
    As you can see, this is a typical fact table of user information and visit behaviors. In a star schema, user information and visit behaviors are usually stored in dimension tables and fact tables, respectively. Here, for ease of explanation, we store both kinds of information in a single table.
    The columns in the table are divided into Key (dimension) columns and Value (metric) columns according to whether an AggregationType is set. Columns without an AggregationType, such as user_id, date, age, and sex, are Keys, while columns with an AggregationType set are Values. When data is imported, rows with identical Key columns are aggregated into one row, and their Value columns are aggregated according to the configured AggregationType. Four aggregation types are currently supported:
    1. SUM: Sum. The values of multiple rows are accumulated.
    2. REPLACE: Replacement. The newly imported value replaces the previous value.
    3. MAX: Retains the maximum value.
    4. MIN: Retains the minimum value.
    Assume we have the following imported data (original data):
    | user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
    | 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
    | 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
    | 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
    | 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
    | 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
    | 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
    Assume this is a table that records user behaviors when visiting a certain commodity page. Taking the first row of data as an example:
    | Data | Description |
    | --- | --- |
    | 10000 | User ID, each user's unique identifier |
    | 2017-10-01 | Data storage date, accurate to the day |
    | Beijing | User's city |
    | 20 | User's age |
    | 0 | Gender: male (1 for female) |
    | 2017-10-01 06:00:00 | Time the user visited this page, accurate to the second |
    | 20 | Consumption generated by this visit |
    | 10 | Dwell time of this visit on the page |
    | 10 | Dwell time of this visit on the page (redundant, used for MIN) |
    When these data are correctly imported into Doris, the data finally stored in Doris will be as follows:
    | user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
    | 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
    | 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
    | 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
    | 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
    | 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
    As can be seen, after aggregation the data of user 10000 have been merged into a single row, while the data of other users remain unchanged. The aggregated data of user 10000 can be explained as follows (the first five columns remain unchanged, starting from column 6, last_visit_date):
    2017-10-01 07:00:00: The last_visit_date column is aggregated by REPLACE, so 2017-10-01 07:00:00 has replaced 2017-10-01 06:00:00.
    Note
    When data from the same import batch are aggregated by REPLACE, the order of replacement is not guaranteed. In this case, the value eventually saved in Doris could also be 2017-10-01 06:00:00. However, across different import batches, it is guaranteed that data from the new batch replaces data from the old batch.
    35: The cost column is aggregated by SUM, so the updated value 35 is the result of 20 + 15.
    10: The max_dwell_time column is aggregated by MAX, so 10 is saved as it is the maximum between 10 and 2.
    2: The min_dwell_time column is aggregated by MIN, so 2 is saved as it is the minimum between 10 and 2.
    After aggregation, Doris only stores the aggregated data. In other words, the detailed raw data is no longer available.
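    To make this concrete, here is a minimal sketch of how the aggregation can be observed (assuming the example_db.example_tbl table above; INSERT is used here for brevity, whereas production imports typically use Stream Load or Broker Load):
    -- Load the two rows for user 10000 in a single batch (values from the example above)
    INSERT INTO example_db.example_tbl VALUES
    (10000, "2017-10-01", "Beijing", 20, 0, "2017-10-01 06:00:00", 20, 10, 10),
    (10000, "2017-10-01", "Beijing", 20, 0, "2017-10-01 07:00:00", 15, 2, 2);
    -- The query returns a single aggregated row:
    -- cost = 20 + 15 = 35 (SUM), max_dwell_time = 10 (MAX), min_dwell_time = 2 (MIN);
    -- last_visit_date uses REPLACE, so within one batch the surviving value is not guaranteed
    SELECT * FROM example_db.example_tbl WHERE user_id = 10000;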

    Example 2: Keep Detailed Data

    Here is a modified version of the table schema in Example 1:
    | ColumnName | Type | AggregationType | Comment |
    | --- | --- | --- | --- |
    | user_id | LARGEINT | - | User ID |
    | date | DATE | - | Data import date |
    | timestamp | DATETIME | - | Data import time, accurate to the second |
    | city | VARCHAR(20) | - | User's city |
    | age | SMALLINT | - | User's age |
    | sex | TINYINT | - | User's gender |
    | last_visit_date | DATETIME | REPLACE | User's last access time |
    | cost | BIGINT | SUM | Total user consumption |
    | max_dwell_time | INT | MAX | User's maximum stay time |
    | min_dwell_time | INT | MIN | User's minimum stay time |
    A timestamp column has been added to record the data import time, accurate to the second. Also, the AGGREGATE KEY is changed to AGGREGATE KEY(user_id, date, timestamp, city, age, sex).
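    For reference, a sketch of the corresponding CREATE TABLE statement (everything other than the new column and the AGGREGATE KEY is assumed unchanged from Example 1):
    CREATE TABLE IF NOT EXISTS example_db.example_tbl
    (
    `user_id` LARGEINT NOT NULL COMMENT "user id",
    `date` DATE NOT NULL COMMENT "data import date",
    `timestamp` DATETIME NOT NULL COMMENT "data import time, accurate to the second",
    `city` VARCHAR(20) COMMENT "user's city",
    `age` SMALLINT COMMENT "user's age",
    `sex` TINYINT COMMENT "user's gender",
    `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "last visit time of user",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "total consumption of user",
    `max_dwell_time` INT MAX DEFAULT "0" COMMENT "user's maximum stay time",
    `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "user's minimum stay time"
    )
    AGGREGATE KEY(`user_id`, `date`, `timestamp`, `city`, `age`, `sex`)
    DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
    PROPERTIES (
    "replication_allocation" = "tag.location.default: 1"
    );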
    The imported data are as follows:
    | user_id | date | timestamp | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | 10000 | 2017-10-01 | 2017-10-01 08:00:05 | Beijing | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
    | 10000 | 2017-10-01 | 2017-10-01 09:00:05 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
    | 10001 | 2017-10-01 | 2017-10-01 18:12:10 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
    | 10002 | 2017-10-02 | 2017-10-02 13:10:00 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
    | 10003 | 2017-10-02 | 2017-10-02 13:15:00 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
    | 10004 | 2017-10-01 | 2017-10-01 12:12:48 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
    | 10004 | 2017-10-03 | 2017-10-03 12:38:20 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
    When these data are correctly imported into Doris, the data finally stored in Doris will be as follows:
    | user_id | date | timestamp | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | 10000 | 2017-10-01 | 2017-10-01 08:00:05 | Beijing | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
    | 10000 | 2017-10-01 | 2017-10-01 09:00:05 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
    | 10001 | 2017-10-01 | 2017-10-01 18:12:10 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
    | 10002 | 2017-10-02 | 2017-10-02 13:10:00 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
    | 10003 | 2017-10-02 | 2017-10-02 13:15:00 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
    | 10004 | 2017-10-01 | 2017-10-01 12:12:48 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
    | 10004 | 2017-10-03 | 2017-10-03 12:38:20 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
    We can see that the stored data are exactly the same as the imported data, without any aggregation. This is because the newly added timestamp column makes the Keys of the rows differ. In other words, as long as the Keys of the imported rows are not exactly identical, Doris can retain complete detailed data even in the Aggregate model.
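    Because the detail rows are preserved, aggregation can still be performed at query time. For example, a sketch that rolls the detailed rows back up to per-user, per-day granularity (assuming the schema above):
    -- Query-time aggregation over the preserved detail rows
    SELECT user_id, `date`, SUM(cost) AS cost, MAX(max_dwell_time) AS max_dwell_time
    FROM example_db.example_tbl
    GROUP BY user_id, `date`;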

    Example 3: Aggregate Imported Data and the Existing Data

    Based on Example 1, suppose that you have the following data stored in Doris:
    | user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
    | 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
    | 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
    | 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
    | 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
    | 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
    We will import another batch of data:
    | user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 11:22:00 | 44 | 19 | 19 |
    | 10005 | 2017-10-03 | Changsha | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 |
    When these data are correctly imported into Doris, the data finally stored in Doris will be as follows:
    | user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
    | 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
    | 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
    | 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
    | 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
    | 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 11:22:00 | 55 | 19 | 6 |
    | 10005 | 2017-10-03 | Changsha | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 |
    As you can see, the existing data and the newly imported data of user 10004 have been aggregated, and the new data of user 10005 have been added.
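    For illustration, this second batch could be loaded with a simple INSERT sketch (hypothetical; real batches are usually loaded via Stream Load or Broker Load):
    INSERT INTO example_db.example_tbl VALUES
    (10004, "2017-10-03", "Shenzhen", 35, 0, "2017-10-03 11:22:00", 44, 19, 19),
    (10005, "2017-10-03", "Changsha", 29, 1, "2017-10-03 18:11:02", 3, 1, 1);
    -- For user 10004 on 2017-10-03: cost = 11 + 44 = 55, max_dwell_time = MAX(6, 19) = 19,
    -- min_dwell_time = MIN(6, 19) = 6, and last_visit_date takes the new batch's value,
    -- because data from a new batch always replaces data from an old batch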
    Data aggregation occurs at the following three stages in Doris:
    1. The ETL stage of each import batch. At this stage, the imported batch is aggregated internally.
    2. The data compaction stage of the underlying BE. At this stage, the BE aggregates data from different imported batches.
    3. The data query stage. The data involved in a query is aggregated accordingly.
    Data may be aggregated to varying degrees at different times. For example, a batch of data that has just been imported may not yet be aggregated with the existing data. However, users can only query aggregated data: what users see is always the aggregated result, and they should never assume that unaggregated data is visible.
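    For example, whether or not compaction has physically merged the two batches above, a query like the following sketch (against the hypothetical example_tbl) always returns the single aggregated row for the key:
    -- Returns one row per distinct AGGREGATE KEY combination, regardless of how many
    -- unmerged physical batches still exist on the BE nodes
    SELECT * FROM example_db.example_tbl WHERE user_id = 10004 AND `date` = "2017-10-03";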

    Unique Model

    In some multi-dimensional analysis scenarios, users care more about how to ensure the uniqueness of the Key, that is, how to obtain a primary key uniqueness constraint. Therefore, the Unique data model was introduced. This model is essentially a special case of the Aggregate model and a simplified way of representing the table schema. For example:
    | ColumnName | Type | IsKey | Comment |
    | --- | --- | --- | --- |
    | user_id | BIGINT | Yes | User ID |
    | username | VARCHAR(50) | Yes | User nickname |
    | city | VARCHAR(20) | No | User's city |
    | age | SMALLINT | No | User's age |
    | sex | TINYINT | No | User's gender |
    | phone | LARGEINT | No | User's phone number |
    | address | VARCHAR(500) | No | User's address |
    | register_time | DATETIME | No | User registration time |
    This is a typical user profile table. This kind of data requires no aggregation; we only need to guarantee the uniqueness of the primary key (here, user_id + username). Therefore, the CREATE TABLE statement is as follows:
    CREATE TABLE IF NOT EXISTS example_db.example_tbl
    (
    `user_id` LARGEINT NOT NULL COMMENT "user id",
    `username` VARCHAR(50) NOT NULL COMMENT "User Nickname",
    `city` VARCHAR(20) COMMENT "user's city",
    `age` SMALLINT COMMENT "user's age",
    `sex` TINYINT COMMENT "user's gender",
    `phone` LARGEINT COMMENT "User's Phone Number",
    `address` VARCHAR(500) COMMENT "User's Address",
    `register_time` DATETIME COMMENT "User Registration Time"
    )
    UNIQUE KEY(`user_id`, `username`)
    DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
    PROPERTIES (
    "replication_allocation" = "tag.location.default: 1"
    );
    This table structure is completely equivalent to the table described by the following Aggregate model:
    | ColumnName | Type | AggregationType | Comment |
    | --- | --- | --- | --- |
    | user_id | BIGINT | - | User ID |
    | username | VARCHAR(50) | - | User nickname |
    | city | VARCHAR(20) | REPLACE | User's city |
    | age | SMALLINT | REPLACE | User's age |
    | sex | TINYINT | REPLACE | User's gender |
    | phone | LARGEINT | REPLACE | User's phone number |
    | address | VARCHAR(500) | REPLACE | User's address |
    | register_time | DATETIME | REPLACE | User registration time |
    And the corresponding CREATE TABLE statement:
    CREATE TABLE IF NOT EXISTS example_db.example_tbl
    (
    `user_id` LARGEINT NOT NULL COMMENT "user id",
    `username` VARCHAR(50) NOT NULL COMMENT "User Nickname",
    `city` VARCHAR(20) REPLACE COMMENT "City where the user is located",
    `age` SMALLINT REPLACE COMMENT "User Age",
    `sex` TINYINT REPLACE COMMENT "User Gender",
    `phone` LARGEINT REPLACE COMMENT "User's Phone Number",
    `address` VARCHAR(500) REPLACE COMMENT "User's Address",
    `register_time` DATETIME REPLACE COMMENT "User Registration Time"
    )
    AGGREGATE KEY(`user_id`, `username`)
    DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
    PROPERTIES (
    "replication_allocation" = "tag.location.default: 1"
    );
    In other words, the Unique model can be entirely implemented with the REPLACE aggregation of the Aggregate model; the internal implementation and data storage are exactly the same. We will not give further examples here.

    Duplicate Model

    In some multi-dimensional analysis scenarios, the data has neither primary keys nor aggregation requirements. For these cases, the Duplicate data model was introduced. Here is an example.
    | ColumnName | Type | SortKey | Comment |
    | --- | --- | --- | --- |
    | timestamp | DATETIME | Yes | Log time |
    | type | INT | Yes | Log type |
    | error_code | INT | Yes | Error code |
    | error_msg | VARCHAR(1024) | No | Detailed error information |
    | op_id | BIGINT | No | Person in charge ID |
    | op_time | DATETIME | No | Processing time |
    The table creation statement is as follows:
    CREATE TABLE IF NOT EXISTS example_db.example_tbl
    (
    `timestamp` DATETIME NOT NULL COMMENT "Log time",
    `type` INT NOT NULL COMMENT "Log type",
    `error_code` INT COMMENT "Error code",
    `error_msg` VARCHAR(1024) COMMENT "Detailed error info",
    `op_id` BIGINT COMMENT "Person in charge ID",
    `op_time` DATETIME COMMENT "Processing time"
    )
    DUPLICATE KEY(`timestamp`, `type`)
    DISTRIBUTED BY HASH(`type`) BUCKETS 1
    PROPERTIES (
    "replication_allocation" = "tag.location.default: 1"
    );
    This data model differs from the Aggregate and Unique models: data is stored exactly as it appears in the imported file, without any aggregation. Even two completely identical rows are both retained. The DUPLICATE KEY specified in the CREATE TABLE statement only indicates which columns the data should be sorted by; we recommend choosing the first 2-4 columns for it. This model is suitable for storing raw data that has neither aggregation requirements nor a primary key uniqueness constraint.

    Limitations of the Aggregate Model

    Aggregate Model & Unique Model

    Here we introduce the limitations of the Aggregate model (including the Unique model). The Aggregate model presents the finally aggregated data; that is, for data that has not yet been aggregated (for example, data from two different import batches), consistency of presentation must be guaranteed in some way. Assume the table structure is as follows:
    | ColumnName | Type | AggregationType | Comment |
    | --- | --- | --- | --- |
    | user_id | LARGEINT | - | User ID |
    | date | DATE | - | Data import date |
    | cost | BIGINT | SUM | Total user consumption |
    Assume there are the following two batches of data that have been successfully imported into the storage engine:
    Batch 1
    | user_id | date | cost |
    | --- | --- | --- |
    | 10001 | 2017-11-20 | 50 |
    | 10002 | 2017-11-21 | 39 |
    Batch 2
    | user_id | date | cost |
    | --- | --- | --- |
    | 10001 | 2017-11-20 | 1 |
    | 10001 | 2017-11-21 | 5 |
    | 10003 | 2017-11-22 | 22 |
    As you can see, the data of user 10001 in the two import batches have not yet been aggregated. However, users must only be able to query the following aggregated data:
    | user_id | date | cost |
    | --- | --- | --- |
    | 10001 | 2017-11-20 | 51 |
    | 10001 | 2017-11-21 | 5 |
    | 10002 | 2017-11-21 | 39 |
    | 10003 | 2017-11-22 | 22 |
    Therefore, we have added an aggregation operator to the query engine to guarantee this presentation consistency.
    Moreover, on aggregate (Value) columns, pay attention to the semantics when executing aggregate queries whose function is inconsistent with the column's aggregation type. For example, if we execute the following query on the example above:
    SELECT MIN(cost) FROM table;
    The result is 5, not 1, because MIN runs over the aggregated cost values (51, 5, 39, 22) rather than over the raw imported values.
    At the same time, this consistency guarantee significantly reduces efficiency for some queries.
    Let's use the most basic count(*) query as an example:
    SELECT COUNT(*) FROM table;
    In other databases, such queries return results quickly: the implementation can obtain the row count by keeping statistics at import time, or by scanning only a single column at query time, both with very little overhead. In Doris's Aggregate model, however, the cost of this kind of query is very high.
    Let's take our previous data as an example:
    Batch 1
    | user_id | date | cost |
    | --- | --- | --- |
    | 10001 | 2017-11-20 | 50 |
    | 10002 | 2017-11-21 | 39 |
    Batch 2
    | user_id | date | cost |
    | --- | --- | --- |
    | 10001 | 2017-11-20 | 1 |
    | 10001 | 2017-11-21 | 5 |
    | 10003 | 2017-11-22 | 22 |
    Since the final aggregated result is:
    | user_id | date | cost |
    | --- | --- | --- |
    | 10001 | 2017-11-20 | 51 |
    | 10001 | 2017-11-21 | 5 |
    | 10002 | 2017-11-21 | 39 |
    | 10003 | 2017-11-22 | 22 |
    The correct result of select count(*) from table; should be 4. But if only the user_id column is scanned and aggregated at query time, the result is 3 (10001, 10002, 10003); and if no aggregation is performed at all, the result is 5 (five rows in total across the two batches). Both results are incorrect.
    To get the correct result, both the user_id and date columns must be read and aggregated at query time to return the correct value of 4. That is, in a count(*) query, Doris must scan all AGGREGATE KEY columns (here, user_id and date) and aggregate them to obtain the semantically correct result. If there are many aggregate key columns, count(*) queries can involve scanning large amounts of data.
    Therefore, when frequent count(*) queries are needed, we recommend simulating count(*) by adding a column whose value is always 1 and whose aggregation type is SUM. The table schema in the previous example would then be modified as follows:
    | ColumnName | Type | AggregationType | Comment |
    | --- | --- | --- | --- |
    | user_id | BIGINT | - | User ID |
    | date | DATE | - | Data import date |
    | cost | BIGINT | SUM | Total user consumption |
    | count | BIGINT | SUM | For count(*) calculation |
    Add a count column whose value is always 1 in the imported data. The result of select count(*) from table; is then equivalent to select sum(count) from table;, and the latter is far more efficient than the former. However, this approach has a usage restriction: users must ensure they do not import rows with duplicate AGGREGATE KEY columns. Otherwise, select sum(count) from table; can only express the number of originally imported rows, not the semantics of select count(*) from table;, and its value will be erroneously inflated.
    Alternatively, change the aggregation type of the count column to REPLACE and keep its value constant at 1. Then select sum(count) from table; and select count(*) from table; produce consistent results, and this approach imposes no restriction on importing duplicate rows.
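    A minimal sketch of such a table (the table name user_cost_tbl and the simplified distribution settings are hypothetical):
    CREATE TABLE IF NOT EXISTS example_db.user_cost_tbl
    (
    `user_id` BIGINT NOT NULL COMMENT "user id",
    `date` DATE NOT NULL COMMENT "data import date",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "total consumption of user",
    `count` BIGINT SUM DEFAULT "1" COMMENT "for count(*) calculation, always import 1"
    )
    AGGREGATE KEY(`user_id`, `date`)
    DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
    PROPERTIES (
    "replication_allocation" = "tag.location.default: 1"
    );
    -- select sum(`count`) from user_cost_tbl; is then equivalent to select count(*),
    -- provided no two imported rows share the same AGGREGATE KEY; declaring `count`
    -- as BIGINT REPLACE with a constant value of 1 lifts that restriction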

    Duplicate Model

    The Duplicate model does not have this limitation of the Aggregate model because it involves no aggregation semantics: a count(*) query can return the semantically correct result by scanning any single column.

    Best Practice

    The data model is determined when the table is created and cannot be modified afterwards, so choosing a suitable data model is very important.

    Choosing a Data Model

    There are currently three Doris data models: AGGREGATE KEY, UNIQUE KEY, and DUPLICATE KEY. In all three models, data is sorted by the KEY columns.

    Aggregate Model

    The Aggregate model can significantly reduce the amount of data scanned and the computation required by aggregation queries through pre-aggregation, making it well suited to report queries with fixed patterns. However, this model is unfriendly to count(*) queries. Also, because the aggregation method on the Value columns is fixed, semantic correctness must be considered when running other types of aggregation queries. When the Aggregate Key is identical, new and old records are aggregated; the currently supported aggregation functions are SUM, MIN, MAX, and REPLACE.
    CREATE TABLE site_visit
    (
    siteid INT,
    city SMALLINT,
    username VARCHAR(32),
    pv BIGINT SUM DEFAULT '0'
    )
    AGGREGATE KEY (siteid, city, username)
    DISTRIBUTED BY HASH (siteid) BUCKETS 10;
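    A typical fixed-pattern report query that benefits from this pre-aggregation might look like the following sketch:
    -- PV per site and city; the stored data is already partially aggregated,
    -- so far fewer rows are scanned than in the raw visit log
    SELECT siteid, city, SUM(pv) AS pv
    FROM site_visit
    GROUP BY siteid, city;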

    Unique Model

    The Unique model targets scenarios requiring a unique primary key constraint. When the Unique Key is identical, the new record replaces the old one, guaranteeing primary key uniqueness. It is suitable for analytics businesses with update requirements. The current implementation of the Unique Key is identical to the REPLACE aggregation of the Aggregate Key; they are essentially the same. However, it cannot benefit from the query advantages of pre-aggregation such as ROLLUP (because its essence is REPLACE, there is no aggregation such as SUM).
    Note
    The Unique model only supports whole-row updates. If you need both a unique primary key constraint and updates to specific columns only (for instance, merging columns from multiple source tables into one Doris table), consider the Aggregate model with the aggregation type of non-primary-key columns set to REPLACE_IF_NOT_NULL, as sketched after the example below. For specific usage, see Data Model.
    CREATE TABLE sales_order
    (
    orderid BIGINT,
    status TINYINT,
    username VARCHAR(32),
    amount BIGINT DEFAULT '0'
    )
    UNIQUE KEY(orderid)
    DISTRIBUTED BY HASH(orderid) BUCKETS 10;
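    As the note above mentions, partial-column updates can be approximated with the Aggregate model by declaring non-primary-key columns as REPLACE_IF_NOT_NULL, so that NULL values in an import leave the existing values untouched. A minimal sketch (the table order_merge and its columns are hypothetical):
    CREATE TABLE order_merge
    (
    orderid BIGINT,
    status TINYINT REPLACE_IF_NOT_NULL,
    username VARCHAR(32) REPLACE_IF_NOT_NULL,
    amount BIGINT REPLACE_IF_NOT_NULL
    )
    AGGREGATE KEY(orderid)
    DISTRIBUTED BY HASH(orderid) BUCKETS 10;
    -- An import that supplies only (orderid, status), with the other columns NULL,
    -- updates status while keeping the existing username and amount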

    Duplicate Model

    The Duplicate model does not merge identical rows and is suitable for ad-hoc queries over any dimensions. Although it cannot take advantage of pre-aggregation, it is not subject to the constraints of the Aggregate model and can exploit the benefits of columnar storage (column pruning, vectorized execution, etc.).
    CREATE TABLE session_data
    (
    visitorid SMALLINT,
    sessionid BIGINT,
    visittime DATETIME,
    city CHAR(20),
    province CHAR(20),
    ip VARCHAR(32),
    browser CHAR(20),
    url VARCHAR(1024)
    )
    DUPLICATE KEY(visitorid, sessionid)
    DISTRIBUTED BY HASH(sessionid, visitorid) BUCKETS 10;
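    Ad-hoc queries over any dimension then run directly over the detail rows, for example (a sketch):
    -- Sessions per city on one day; columnar storage means only the referenced
    -- columns (city, visittime) are read
    SELECT city, COUNT(*) AS sessions
    FROM session_data
    WHERE visittime >= "2017-10-01 00:00:00" AND visittime < "2017-10-02 00:00:00"
    GROUP BY city;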

    Large Wide Tables and Star Schema

    When creating tables, business teams often do not distinguish between dimension information and metric information, and define the schema as one large wide table to suit front-end business needs. For Doris, the performance of such large wide tables is often unsatisfactory:
    The schema contains many fields, and the aggregate model may contain even more key columns, which increases the number of columns that must be sorted during import.
    Updates to dimension information are reflected across the whole table, and their frequency directly affects query efficiency.
    Users are therefore advised to use a star schema that separates dimension tables from metric (fact) tables. Frequently updated dimension tables can be placed in MySQL as external tables; if there are only a few updates, they can be placed directly in Doris. When a dimension table is stored in Doris, more replicas can be set for it to improve join performance.
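    For example, a small dimension table stored in Doris could be created with more replicas than usual (a sketch; the table dim_city is hypothetical, and the replica count assumes at least three BE nodes):
    CREATE TABLE dim_city
    (
    city_id INT,
    city_name VARCHAR(64),
    province VARCHAR(64)
    )
    DUPLICATE KEY(city_id)
    DISTRIBUTED BY HASH(city_id) BUCKETS 1
    PROPERTIES (
    "replication_allocation" = "tag.location.default: 3"
    );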