tencent cloud

Feedback

Materialized View

Last updated: 2024-06-27 11:10:47
    Materialized View is a special table in Doris that contains a precalculated dataset, based on well-defined SELECT statements. The purpose of materialized views is to meet users' needs for both arbitrary dimensional analysis of detailed data and quick analysis of fixed dimensions.

    Applicable Scenario

    Analytical demand covers the queries of both detailed data and fixed-dimension data.
    Queries only involve a small part of the columns or rows in the table.
    Queries include some time-consuming processing operations, such as aggregate operation that takes a long time.
    Queries should match different prefix indexes.

    Advantage

    For those queries that often repetitively use the same subquery results, the performance is greatly improved.
    Doris automatically maintains the data of materialized view. Whether it is new data import or deletion operations, data consistency between the base table and materialized view can be ensured. No extra manual maintenance is required.
    During queries, the most optimized materialized view is automatically selected, and data is directly read from it.
    Note
    Automatically maintaining materialized view data incurs some maintenance overhead, which will be elaborated in the limitations of materialized views in the later sections.

    Materialized View VS Rollup

    Before there is materialized view function, users usually used the Rollup function to improve query efficiency by pre-aggregation. However, Rollup has its limitations; it cannot perform pre-aggregate based on the detail model. Materialized view not only covers the features of Rollup, but also supports a richer set of aggregate functions. Hence, materialized view is actually a superset of Rollup. That is, features supported by the earlier ALTER TABLE ADD ROLLUP syntax can now all be implemented through CREATE MATERIALIZED VIEW.
    Materialized View and Rollup support different aggregate functions, and materialized view supports more aggregate functions (replace aggregate mode is not listed as it is not used in queries)
    Aggregate Function
    SUM
    MAX
    SUM
    COUNT
    BITMAP_UNION
    HLL_UNION
    Rollup
    true
    true
    true
    false
    false
    false
    Materialized View
    true
    true
    true
    true
    true
    true

    Using Materialized View

    Doris system provides a whole set of DDL syntaxes for materialized view, including creation, viewing, and deletion. DDL syntax is consistent with PostgreSQL and Oracle.
    All the columns in the query or sub-query must be in the materialized view, i.e., all the columns involved in the query (including the select list and the query condition columns in where) exist in the columns of the materialized view. Otherwise, the query can only hit the Base table.
    Conditions on columns that can use prefix index should be =?<>=?<=?>=?in?between and these conditions should be in parallel and connected with and, for or,?!= the conditions cannot hit.

    Creating Materialized View

    Firstly, you need to decide what kind of materialized view to create based on the characteristics of your query statement. It doesn't mean it's the best that your materialized view definition is exactly the same as one of your query statements. There are two principles:
    1. Abstract from the query statement common grouping and aggregate method as the definition of the materialized view.
    2. There's no need to create a materialized view for combinations of all dimension.
    About the first point, a materialized view is most effective if it can be abstracted and multiple queries can match this view. Because the maintenance of the materialized view itself also consumes resources. If the materialized view fits only a specific query and other queries cannot use this view, it will lead to low cost-effectiveness of this view. It takes up the storage resources of the cluster but cannot serve more queries. Therefore, users need to combine their query statements and data dimension information to abstract some definitions of materialized views.
    The second point is that not all dimension analyses will be covered in actual analysis queries. So you just need to create materialized views for commonly used dimension combinations, to achieve a balance in space and time. A materialized view can be created through the command below. The creation of a materialized view is an asynchronous operation, which means after the user successfully submits the creation task, Doris will calculate the existing data in the background until the creation is successful. The specific syntax can be viewed by the command below:
    HELP CREATE MATERIALIZED VIEW

    Aggregate function is supported

    Currently, the aggregate functions supported by the materialized view creation statement are:
    SUM,MIN,MAX (Version 0.12).
    COUNT,BITMAP_UNION,HLL_UNION (Version 0.13).
    The form of BITMAP_UNION must be: BITMAP_UNION(TO_BITMAP(COLUMN)) where the column type can only be an integer (largeint is not supported), or BITMAP_UNION(COLUMN) with the base table as the AGG model.
    The HLL_UNION must be in the form of: HLL_UNION(HLL_HASH(COLUMN)) where the column type cannot be DECIMAL, or HLL_UNION(COLUMN) with the base table as the AGG model.

    Updating Policy

    To ensure data consistency between the materialized view table and the Base table, Doris will sync the base table operations such as importing, deleting etc. to the materialized view table. In addition, the update efficiency is improved by incremental updates and atomicity is ensured through the transaction method. If user inserts data into the base table through the INSERT command, then this data will be synchronously inserted into the materialized view. The INSERT command will return successfully only after the data has been successfully written into both the base table and the materialized view table.

    Query Auto-Match

    After the materialized view is created successfully, the user's query does not need to change, that is, it is still the base table that is queried. Doris will automatically select the best materialized view based on the current query statement, read data from the materialized view and calculate. Users can check whether the current query has used the materialized view through the EXPLAIN command. The match relationship between the aggregate in the materialized view and the aggregate in the query:
    Aggregate in the Materialized View
    Aggregate in the Query
    sum
    sum
    min
    min
    max
    max
    count
    count
    bitmap_union
    bitmap_union,bitmap_union_count, count(distinct)
    hll_union
    hll_raw_agg, hll_union_agg, ndv, approx_count_distinct
    For bitmap and HLL, the aggregate functions in queries will be rewritten based on the structure of the materialized view after matching a materialized view. See Best Practice 2 for more details.

    Querying Materialized View

    Check which materialized views the current table has, and what their table structures are by the following command:
    MySQL [test]> desc mv_test all;
    +-----------+---------------+-----------------+----------+------+-------+---------+--------------+
    | IndexName | IndexKeysType | Field | Type | Null | Key | Default | Extra |
    +-----------+---------------+-----------------+----------+------+-------+---------+--------------+
    | mv_test | DUP_KEYS | k1 | INT | Yes | true | NULL | |
    | | | k2 | BIGINT | Yes | true | NULL | |
    | | | k3 | LARGEINT | Yes | true | NULL | |
    | | | k4 | SMALLINT | Yes | false | NULL | NONE |
    | | | | | | | | |
    | mv_2 | AGG_KEYS | k2 | BIGINT | Yes | true | NULL | |
    | | | k4 | SMALLINT | Yes | false | NULL | MIN |
    | | | k1 | INT | Yes | false | NULL | MAX |
    | | | | | | | | |
    | mv_3 | AGG_KEYS | k1 | INT | Yes | true | NULL | |
    | | | to_bitmap(`k2`) | BITMAP | No | false | | BITMAP_UNION |
    | | | | | | | | |
    | mv_1 | AGG_KEYS | k4 | SMALLINT | Yes | true | NULL | |
    | | | k1 | BIGINT | Yes | false | NULL | SUM |
    | | | k3 | LARGEINT | Yes | false | NULL | SUM |
    | | | k2 | BIGINT | Yes | false | NULL | MIN |
    +-----------+---------------+-----------------+----------+------+-------+---------+--------------+
    You can see that there are three materialized views in the current mv_test table: mv_1, mv_2 and mv_3, as well as their table structure.

    Dropping Materialized View

    If users no longer need the materialized view, they can delete the materialized view by the following command:
    DROP MATERIALIZED VIEW
    The specific syntax can be viewed by the following command:
    HELP DROP MATERIALIZED VIEW

    Viewing the Created Materialized View

    The specific syntax for users to view the created materialized views by the command:
    SHOW CREATE MATERIALIZED VIEW

    Best Practice of Materialized View

    Best Practice 1

    The use of materialized views is generally divided into the following steps:
    1. Creating a materialized view.
    2. Asynchronously check whether the materialized view has been built.
    3. Query and automatically match the materialized view.
    Step 1: Create a materialized view. Suppose a user has a detailed sales record table, which stores the transaction ID, salesperson, selling store, sales time, and transaction amount for each transaction. The command to create the table is:
    create table sales_records(record_id int, seller_id int, store_id int, sale_date date, sale_amt bigint) distributed by hash(record_id) properties("replication_num" = "1");
    The structure of the sales_records table is as follows:
    MySQL [test]> desc sales_records;
    +-----------+--------+------+-------+---------+-------+
    | Field | Type | Null | Key | Default | Extra |
    +-----------+--------+------+-------+---------+-------+
    | record_id | INT | Yes | true | NULL | |
    | seller_id | INT | Yes | true | NULL | |
    | store_id | INT | Yes | true | NULL | |
    | sale_date | DATE | Yes | false | NULL | NONE |
    | sale_amt | BIGINT | Yes | false | NULL | NONE |
    +-----------+--------+------+-------+---------+-------+
    If the user frequently queries the sales volume of different stores, they can create a materialized view for this sales_records table, which groups by the selling store and sums the sales of the same store. The command to create the materialized view is as follows:
    MySQL [test]> create materialized view store_amt as select store_id, sum(sale_amt) from sales_records group by store_id;
    If the back-end returns the following image, it means that the task to create the materialized view has been successfully submitted.
    Query OK, 0 rows affected (0.012 sec)
    Step 2: Check if the materialized view has been built. Because the creation of the materialized view is an asynchronous operation, the user needs to check whether the materialized view has been built after submitting the creation task by asynchronous command. The command is as follows:
    SHOW ALTER TABLE ROLLUP FROM db_name; (Version 0.12)
    SHOW ALTER TABLE MATERIALIZED VIEW FROM db_name; (Version 0.13)
    In this command, db_name is a parameter, and you need to replace it with your actual database name. The command displays all the tasks of creating materialized views in this database. The result is as follows:
    +-------+---------------+---------------------+---------------------+---------------+-----------------+----------+---------------+-----------+-------------------------------------------------------------------------------------------------------------------------+----------+---------+
    | JobId | TableName | CreateTime | FinishedTime | BaseIndexName | RollupIndexName | RollupId | TransactionId | State | Msg | Progress | Timeout |
    +-------+---------------+---------------------+---------------------+---------------+-----------------+----------+---------------+-----------+-------------------------------------------------------------------------------------------------------------------------+----------+---------+
    | 22036 | sales_records | 2020-07-30 20:04:28 | 2020-07-30 20:04:57 | sales_records | store_amt | 22037 | 5008 | FINISHED | | NULL | 86400 |
    +-------+---------------+---------------------+---------------------+---------------+-----------------+----------+---------------+-----------+-------------------------------------------------------------------------------------------------------------------------+----------+---------+
    The TableName indicates which table the data in the materialized view comes from, and the RollupIndexName refers to the name of the materialized view. The most important metric is the State. When the State of the task to create the materialized view has become FINISHED, it means that the materialized view has been successfully created. This means that the view may be automatically matched when querying. Step 3: Querying. Once the materialized view is created, when users query the sales volume of different stores, they will directly read the aggregated data from the recently created materialized view store_amt. This achieves the effect of improving the efficiency of queries. Users still specify the sales_records table for their queries, for example:
    SELECT store_id, sum(sale_amt) FROM sales_records GROUP BY store_id;
    The above query can automatically match store_amt. Users can check whether the current query has been matched to the appropriate materialized view with the following command.
    EXPLAIN SELECT store_id, sum(sale_amt) FROM sales_records GROUP BY store_id;
    +-----------------------------------------------------------------------------+
    | Explain String |
    +-----------------------------------------------------------------------------+
    | PLAN FRAGMENT 0 |
    | OUTPUT EXPRS:<slot 2> `store_id` | <slot 3> sum(`sale_amt`) |
    | PARTITION: UNPARTITIONED |
    | |
    | RESULT SINK |
    | |
    | 4:EXCHANGE |
    | |
    | PLAN FRAGMENT 1 |
    | OUTPUT EXPRS: |
    | PARTITION: HASH_PARTITIONED: <slot 2> `store_id` |
    | |
    | STREAM DATA SINK |
    | EXCHANGE ID: 04 |
    | UNPARTITIONED |
    | |
    | 3:AGGREGATE (merge finalize) |
    | | output: sum(<slot 3> sum(`sale_amt`)) |
    | | group by: <slot 2> `store_id` |
    | | |
    | 2:EXCHANGE |
    | |
    | PLAN FRAGMENT 2 |
    | OUTPUT EXPRS: |
    | PARTITION: RANDOM |
    | |
    | STREAM DATA SINK |
    | EXCHANGE ID: 02 |
    | HASH_PARTITIONED: <slot 2> `store_id` |
    | |
    | 1:AGGREGATE (update serialize) |
    | | STREAMING |
    | | output: sum(`sale_amt`) |
    | | group by: `store_id` |
    | | |
    | 0:OlapScanNode |
    | TABLE: sales_records |
    | PREAGGREGATION: ON |
    | partitions=1/1 |
    | rollup: store_amt |
    | tabletRatio=10/10 |
    | tabletList=22038,22040,22042,22044,22046,22048,22050,22052,22054,22056 |
    | cardinality=0 |
    | avgRowSize=0.0 |
    | numNodes=1 |
    +-----------------------------------------------------------------------------+
    45 rows in set (0.006 sec)
    The most important thing is the rollup attribute in OlapScanNode. You can see that the rollup of the current query shows store_amt. In other words, the query has correctly matched the materialized view store_amt and directly reads data from it.

    Best Practice 2: PV, UV

    Business scenario: Compute the UV, PV of ads.
    If the user's original ad click data is stored in Doris, then the query for the PV and UV of the ad can be sped up by creating a bitmap_union materialized view. First, create a table to store ad click data details, including each click event, what ad was clicked, what channel was clicked, and who the user is.
    MySQL [test]> create table advertiser_view_record(time date, advertiser varchar(10), channel varchar(10), user_id int) distributed by hash(time) properties("replication_num" = "1");
    Query O
    K, 0 rows affected (0.014 sec)
    The structure of the original ad click data table is:
    MySQL [test]> desc advertiser_view_record;
    +------------+-------------+------+-------+---------+-------+
    | Field | Type | Null | Key | Default | Extra |
    +------------+-------------+------+-------+---------+-------+
    | time | DATE | Yes | true | NULL | |
    | advertiser | VARCHAR(10) | Yes | true | NULL | |
    | channel | VARCHAR(10) | Yes | false | NULL | NONE |
    | user_id | INT | Yes | false | NULL | NONE |
    +------------+-------------+------+-------+---------+-------+
    4 rows in set (0.001 sec)
    1. Create a materialized view. Because the user wants to query is the UV value of the ad, which is to de-duplicate users of the same ad, then the query is generally:
    SELECT advertiser, channel, count(distinct user_id) FROM advertiser_view_record GROUP BY advertiser, channel;
    For this UV-seeking scenario, we can create a materialized view with bitmap_union to achieve a pre-accurate deduplication effect. In Doris, the result of count(distinct) Aggregate is completely consistent with the result of bitmap_union_count Aggregate. The bitmap_union_count is equal to the count of the bitmap_union result, so if is involved in the query, count(distinct) , by creating a materialized view with bitmap_union Aggregate can speed up the query. For this case, one can create a materialized view that groups by advertising and channel, and user_id for precise deduplication.
    MySQL [test]> create materialized view advertiser_uv as select advertiser, channel, bitmap_union(to_bitmap(user_id)) from advertiser_view_record group by advertiser, channel;
    Query OK, 0 rows affected (0.012 sec)
    Note
    Because the user_id itself is an INT type, it needs to be converted to a bitmap type by the function to_bitmap in Doris before it can be bitmap_union aggregate.
    After creation, the table structure of the ad click detail table and the materialized view table is as follows:
    MySQL [test]> desc advertiser_view_record all;
    +------------------------+---------------+----------------------+-------------+------+-------+---------+--------------+
    | IndexName | IndexKeysType | Field | Type | Null | Key | Default | Extra |
    +------------------------+---------------+----------------------+-------------+------+-------+---------+--------------+
    | advertiser_view_record | DUP_KEYS | time | DATE | Yes | true | NULL | |
    | | | advertiser | VARCHAR(10) | Yes | true | NULL | |
    | | | channel | VARCHAR(10) | Yes | false | NULL | NONE |
    | | | user_id | INT | Yes | false | NULL | NONE |
    | | | | | | | | |
    | advertiser_uv | AGG_KEYS | advertiser | VARCHAR(10) | Yes | true | NULL | |
    | | | channel | VARCHAR(10) | Yes | true | NULL | |
    | | | to_bitmap(`user_id`) | BITMAP | No | false | | BITMAP_UNION |
    +------------------------+---------------+----------------------+-------------+------+-------+---------+--------------+
    2. Auto-match query. After the materialized view table is created, when querying the ad UV, Doris will automatically query from the just created materialized view advertiser_uv. For example, the original query statement is as follows:
    SELECT advertiser, channel, count(distinct user_id) FROM advertiser_view_record GROUP BY advertiser, channel;
    After selecting materialized view, the actual query will be converted to:
    SELECT advertiser, channel, bitmap_union_count(to_bitmap(user_id)) FROM advertiser_uv GROUP BY advertiser, channel;
    You can use the EXPLAIN command to check whether Doris matches the materialized view:
    MySQL [test]> explain SELECT advertiser, channel, count(distinct user_id) FROM advertiser_view_record GROUP BY advertiser, channel;
    +-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | Explain String |
    +-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | PLAN FRAGMENT 0 |
    | OUTPUT EXPRS:<slot 7> `advertiser` | <slot 8> `channel` | <slot 9> bitmap_union_count(`default_cluster:test`.`advertiser_view_record`.`mv_bitmap_union_user_id`) |
    | PARTITION: UNPARTITIONED |
    | |
    | RESULT SINK |
    | |
    | 4:EXCHANGE |
    | |
    | PLAN FRAGMENT 1 |
    | OUTPUT EXPRS: |
    | PARTITION: HASH_PARTITIONED: <slot 4> `advertiser`, <slot 5> `channel` |
    | |
    | STREAM DATA SINK |
    | EXCHANGE ID: 04 |
    | UNPARTITIONED |
    | |
    | 3:AGGREGATE (merge finalize) |
    | | output: bitmap_union_count(<slot 6> bitmap_union_count(`default_cluster:test`.`advertiser_view_record`.`mv_bitmap_union_user_id`)) |
    | | group by: <slot 4> `advertiser`, <slot 5> `channel` |
    | | |
    | 2:EXCHANGE |
    | |
    | PLAN FRAGMENT 2 |
    | OUTPUT EXPRS: |
    | PARTITION: RANDOM |
    | |
    | STREAM DATA SINK |
    | EXCHANGE ID: 02 |
    | HASH_PARTITIONED: <slot 4> `advertiser`, <slot 5> `channel` |
    | |
    | 1:AGGREGATE (update serialize) |
    | | STREAMING |
    | | output: bitmap_union_count(`default_cluster:test`.`advertiser_view_record`.`mv_bitmap_union_user_id`) |
    | | group by: `advertiser`, `channel` |
    | | |
    | 0:OlapScanNode |
    | TABLE: advertiser_view_record |
    | PREAGGREGATION: ON |
    | partitions=1/1 |
    | rollup: advertiser_uv |
    | tabletRatio=10/10 |
    | tabletList=22084,22086,22088,22090,22092,22094,22096,22098,22100,22102 |
    | cardinality=0 |
    | avgRowSize=0.0 |
    | numNodes=1 |
    +-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    45 rows in set (0.030 sec)
    In the result of EXPLAIN, you can first see that the rollup attribute value of OlapScanNode is advertiser_uv. That is, the query directly scans the data of the materialized view, indicating a successful match. Secondly, for the user_id field, count(distinct) is rewritten as bitmap_union_count(to_bitmap). It achieves the effect of accurate de-duplication through the bitmap.

    Best Practice 3

    Business Scenario: More abundant prefix index matches.
    The user's original table has three columns (k1, k2, k3). Where, k1, k2 are prefix index columns. Now, if the user's query conditions include where k1=1 and k2=2, the query can be accelerated through the index. However, in some cases, user filtering conditions can't match the prefix index, for example, where k3=3. Hence, it will not be able to improve query speed through the index. Creating a materialized view with k3 as the first column can solve this problem.
    1. Creating a materialized view.
    CREATE MATERIALIZED VIEW mv_1 as SELECT k3, k2, k1 FROM tableA ORDER BY k3;
    Completed using the syntax above, the materialized view retains the complete detail data, and the prefix index of the materialized view is the k3 column. The table structure is as follows:
    MySQL [test]> desc tableA all;
    +-----------+---------------+-------+------+------+-------+---------+-------+
    | IndexName | IndexKeysType | Field | Type | Null | Key | Default | Extra |
    +-----------+---------------+-------+------+------+-------+---------+-------+
    | tableA | DUP_KEYS | k1 | INT | Yes | true | NULL | |
    | | | k2 | INT | Yes | true | NULL | |
    | | | k3 | INT | Yes | true | NULL | |
    | | | | | | | | |
    | mv_1 | DUP_KEYS | k3 | INT | Yes | true | NULL | |
    | | | k2 | INT | Yes | false | NULL | NONE |
    | | | k1 | INT | Yes | false | NULL | NONE |
    +-----------+---------------+-------+------+------+-------+---------+-------+
    2. Query matching. Now, when a user's query contains a filter condition of the k3 column, for example:
    select k1, k2, k3 from table A where k3=3;
    Then, the query will directly read data from the just created mv_1 materialized view. The materialized view has a prefix index for k3 and the query efficiency will be improved.

    Limitations of Materialized View

    1. The parameters of the aggregate function of the materialized view do not support expression, they only support a single column, such as: sum(a+b) is not supported.
    2. If the column of the delete statement does not exist in the materialized view, the delete operation cannot be performed. If you must delete the data, you need to first delete the materialized view and then delete the data.
    3. If there are too many materialized views on a single table, it can affect the efficiency of imports: When importing data, the data of the materialized view and the base table are updated synchronously. If a table has more than 10 materialized views, it may cause slow import speed, just like importing data of 10 tables at the same time in a single import.
    4. The same column cannot appear in the same materialized view with different aggregate functions, for example: select sum(a), min(a) from table is not supported.
    5. A materialized view for the Unique Key data model can only change the order of columns and cannot perform Aggregate functions. Therefore, it is not possible to perform coarse-grained aggregate operations on the data by creating a materialized view on the Unique Key model.

    Abnormal Error

    1. DATA_QUALITY_ERR: "The data quality does not satisfy, please check your data" The creation of the materialized view failed due to data quality issues.
    Note
    bitmap types only supports positive integers, and if there are negative numbers in the original data, the creation of the materialized view may fail.
    Contact Us

    Contact our sales team or business advisors to help your business.

    Technical Support

    Open a ticket if you're looking for further assistance. Our Ticket is 7x24 avaliable.

    7x24 Phone Support