tencent cloud

Feedback

Runtime Filter

Last updated: 2024-06-27 11:12:16
    Runtime Filter is a new feature officially added in the Doris 0.15 version. It aims to dynamically generate filter conditions for some Join queries at runtime in order to reduce the volume of data scanned, avoid unnecessary I/O and network transmission, thereby speeding up the query. Its design, implementation and effect can be referred to ISSUE 6116.

    Definitions

    FE: Frontend, the frontend node of Doris, responsible for Metadata management and request access.
    BE: Backend, the backend node of Doris, responsible for query execution and data storage.
    Left table: In a Join query, it's the table on the left. Perfom probe operation. The order can be adjusted by Join Reorder.
    Right table: In a Join query, it's the table on the right. Perform build operation. The order can be adjusted by Join Reorder.
    Fragment: The FE will convert the execution of specific SQL statements into corresponding Fragments and distribute them to BE for execution. BE executes the corresponding Fragment and aggregates the results back to FE.
    Join on clause: In A join B on A.a=B.b, A.a=B.b generates join conjuncts based on that at query planning time. It contains expr for join Build and Probe, where the Build expr is called src expr in Runtime Filter and the Probe expr is called target expr in Runtime Filter.

    Principles

    Runtime Filter is generated at query planning time, built in HashJoinNode, and applied in ScanNode. For example, there is a Join query between tables T1 and T2. The way os join is HashJoin. T1 is a fact table with 100,000 rows of data, and T2 is a dimension table with 2,000 rows of data. The actual situation of Doris join is:
    | > HashJoinNode <
    | | |
    | | 100000 | 2000
    | | |
    | OlapScanNode OlapScanNode
    | ^ ^
    | | 100000 | 2000
    | T1 T2
    |
    It's obvious that scanning data for T2 is much faster than that for T1. If we wait for a while before scanning T1, and let T2 deliver the scanned data records to HashJoinNode, HashJoinNode will calculate a filter condition based on the data of T2, such as the maximum and minimum values of T2 data, or construct a Bloom Filter, then send this filter condition to ScanNode that is waiting to scan T1. The latter applies this filter condition, and hands the filtered data over to HashJoinNode, reducing the number of times the hash table is probed and the network overhead. This filter condition is the Runtime Filter, with the effect as follows:
    | > HashJoinNode <
    | | |
    | | 6000 | 2000
    | | |
    | OlapScanNode OlapScanNode
    | ^ ^
    | | 100000 | 2000
    | T1 T2
    |
    If you can push the filter condition (Runtime Filter) down to the Storage engine, in some cases you can use index to directly reduce the amount of data scanned, which greatly reduces the time for scanning. The effect is as follows:
    | > HashJoinNode <
    | | |
    | | 6000 | 2000
    | | |
    | OlapScanNode OlapScanNode
    | ^ ^
    | | 6000 | 2000
    | T1 T2
    |
    It can be seen that unlike predicate push down and partition pruning, Runtime Filter is a filter condition that is dynamically generated at runtime. That is, the join on clause is parsed at query runtime to determine the filter expression, and the expression is broadcast to the ScanNode that is reading the left table, thereby reducing the volume of data scanned, reducing the number of times to probe the hash table, and avoiding unnecessary I/O and network transmission. Runtime Filter is mainly used to optimize joins for large tables. If the data volume of the left table is too small, or the data amount of the right table is too large, then Runtime Filter may not achieve the expected effect.

    Use Method

    Runtime Filter Query Options

    For query option information related to Runtime Filter, see the following section:
    The first query option is to adjust the type of Runtime Filter used. In most cases, you only need to adjust this one option, and keep the other options at defaults. runtime_filter_type: It includes Bloom Filter, MinMax Filter, IN predicate, and by default it conservatively only uses IN predicate. In some cases using Bloom Filter, MinMax Filter, IN predicate together results in higher performance.
    Other query options are usually further adjusted only in certain specific scenarios to achieve the best effect. They are normally optimized only after performance testing, for resource-intensive, long-running and high-frequency queries.
    runtime_filter_mode: It is used to adjust the push-down strategy of Runtime Filter, includes three strategies, namely OFF, LOCAL, GLOBAL, and by default the Setting is GLOBAL strategy
    runtime_filter_wait_time_ms: The time that the ScanNode of the left table waits for each Runtime Filter, and by default it is 1000ms.
    runtime_filters_max_num: The maximum number of Bloom Filters that each query can apply in Runtime Filter, and by default it is 10.
    runtime_bloom_filter_min_size: The minimum length of Bloom Filter in Runtime Filter, and by default it is 1048576(1M).
    runtime_bloom_filter_max_size: The maximum length of Bloom Filter in Runtime Filter, and by default it is 16777216(16M).
    runtime_bloom_filter_size: The default length of Bloom Filter in Runtime Filter, and by default it is 2097152(2M).
    runtime_filter_max_in_num: If the number of rows in the right table of join is greater than this value, we will not generate IN predicate, and by default it is 1024.

    1. runtime_filter_type

    Type of Runtime Filter used. Type: Numbers (1, 2, 4) or corresponding mnemonic character strings (IN, BLOOM_FILTER, MIN_MAX), default 1 (IN predicate), use comma to separate when using multiple, note that quotes are needed, or any number of types numbers are added, for example:
    set runtime_filter_type="BLOOM_FILTER,IN,MIN_MAX";
    which is equivalent to:
    set runtime_filter_type=7;
    Note for Use
    Bloom Filter: There is a certain error rate, which results in less data being filtered than expected, but it will not cause the final result to be inaccurate. In most cases, Bloom Filter can improve performance or has no significant impact on performance, but in some cases it will cause performance to decrease.
    The overhead for the construction and application of Bloom Filter is high, so when the filtration rate is low, or when the volume of data in the left table is small, Bloom Filter may cause performance to decrease.
    Currently, only the Key column of the left table can apply Bloom Filter to push down to the Storage engine, and the test results show that if Bloom Filter does not push down to the Storage engine, it often results in performance decrease.
    At present, Bloom Filter only has a short-circuit logic when using the expression filter on the ScanNode. That is, when the false positive rate is too high, the Bloom Filter is not used continuously. But after the Bloom Filter was pushed down to the Storage engine, there was no short-circuit logic, so when the filtration rate was low, it might cause performance to decrease.
    MinMax Filter: It includes maximum and minimum values, and consequently filters out data less than the minimum value and greater than the maximum value. The filtering effect of MinMax Filter is related to the type of Key column in the join on clause and the data distribution of the left and right tables.
    When the type of Key column in the join on clause is int/bigint/double, etc., in extreme cases, if the minimum and maximum values of the left and right tables are the same, there is no effect. Conversely, if the maximum value of the right table is less than the minimum value of the left table or the minimum value of the right table is larger than the maximum value of the left table, the effect is best.
    When the type of Key column in the join on clause is varchar, etc., application of MinMax Filter often leads to a decrease in performance.
    IN predicate: The IN predicate is built based on all values of the Key column in the join on clause on the right table, and the built IN predicate is filtered on the left table. Compared with the construction and application expenses of Bloom Filter, the overhead is lower and the performance is often higher when the volume of data on the right table is small.
    By default, it will only be pushed down when the number of data rows on the right table is less than 1024 (can be adjusted through session variable runtime_filter_max_in_num).
    At present, the Merge method is not implemented in IN predicate, that is, it cannot be pushed down across Fragments, so when it needs to be pushed down to the ScanNode of the left table in shuffle join, if a Bloom Filter is not generated, we will convert IN predicate to a Bloom Filter to handle the push down across Fragments. Therefore, even if only IN predicate is selected as the type, a Bloom Filter might actually be applied;

    2. runtime_filter_mode

    Used to control the transmission range of Runtime Filter among instances. Type: Number (0, 1, 2) or the corresponding mnemonic string (OFF, LOCAL, GLOBAL), default: 2 (GLOBAL).
    Usage Notes LOCAL: It's relatively conservative. The constructed Runtime Filter can only be used within the same Fragment on the same instance (the smallest unit for query execution), i.e., the Runtime Filter producer (HashJoinNode constructing the Filter) and consumer (ScanNode using the RuntimeFilter) are in the same Fragment, such as a common scenario of broadcast join. GLOBAL: It's relatively aggressive. Apart from satisfying the scenarios of LOCAL strategy, the Runtime Filter can also be used in different Fragments on different instances after Merge through network transmission, such as when Runtime Filter producer and consumer are in different Fragments, i.e., in shuffle join.
    In most cases, the GLOBAL strategy can optimize the query in a wider range of scenarios, but in some shuffle joins, the overhead of generating and merging Runtime Filters exceeds the performance benefits achieved by the query. The strategy can be considered to be changed to LOCAL if needed. If join queries in the cluster do not improve performance due to Runtime Filter, you can switch the Setting to OFF to completely disable this feature. Reasons and strategies for merging Runtime Filters during the construction and application of Runtime Filter in different Fragments can be referred to in ISSUE 6116.

    3. runtime_filter_wait_time_ms

    The time taken to wait for Runtime Filter. Type: integer, the default is 1000, unit: ms.
    Usage Notes With Runtime Filter enabled, ScanNode on the left table will wait for a duration before scanning data for each Runtime Filter allocated to it, i.e., if ScanNode is allocated with 3 Runtime Filters, it will wait for 3000ms at most. Because the construction and Merge of Runtime Filter take time, ScanNode will try to push down the Runtime Filters that have arrived during the waiting time to the Storage engine. If the waiting time is exceeded, ScanNode will start scanning data directly using the arrived Runtime Filters. If the Runtime Filter arrives after the ScanNode starts scanning, ScanNode won't push this Runtime Filter down to the Storage engine. Instead, it will use expressions to filter data that has already been scanned from the Storage engine based on this Runtime Filter. Data previously scanned won't apply this Runtime Filter, then the scale of intermediate data will be larger than the optimal solution, which, however, can prevent severe fragmentation. If the cluster is busy and there are many resource-intensive or long-duration queries in the cluster, you can consider extending the waiting time to avoid missing optimization opportunities for complex queries. If the cluster load is light and on the cluster there are many small queries that only take a few seconds, you can consider shortening the waiting time to avoid adding 1s delay to each query.

    4. runtime_filters_max_num

    The maximum number of Bloom Filters in each query's generated Runtime Filter. Type: Integer, default: 10.
    Usage Notes Currently, only the number of Bloom Filters is restricted, because compared with MinMax Filter and IN predicate, Bloom Filter has a higher cost of construction and application. If the generated Bloom Filters exceed the maximum allowed number, we will retain those Bloom Filters with high selectivity, because high selectivity means there will be more rows can be filtered. This Setting can prevent potential issues caused by excessive memory overhead due to Bloom Filters.
    Selectivity = (HashJoinNode Cardinality / HashJoinNode Left Child Cardinality)
    -- Since the Cardinality obtained by the FE is currently inaccurate, the selectivity calculated by the Bloom Filter here is not accurate. Therefore, only some of the Bloom Filters are randomly retained at last.
    This query option only needs to be adjusted when optimizing some time-consuming queries involving large table joins.

    5. Bloom Filter length-related parameters

    This includes runtime_bloom_filter_min_size, runtime_bloom_filter_max_size, runtime_bloom_filter_size, used to determine the size (in bytes) of the data structure used by Runtime Filter's Bloom Filter. Type: Integer.
    Usage Notes Since it is required that the Bloom Filters built by each HashJoinNode have the same length in order to be merged, currently the length of the Bloom Filter is calculated during the FE query planning. If the data row count (Cardinality) in the right table of join statistics can be obtained, an attempt will be made to estimate the optimal size of the Bloom Filter based on the Cardinality, and round this to the nearest power of 2 (the log value to base 2). If the Cardinality of the right table cannot be obtained, the default Bloom Filter length runtime_bloom_filter_size will be used. runtime_bloom_filter_min_size and runtime_bloom_filter_max_size are used to limit the minimum and maximum lengths of the Bloom Filter used. A larger Bloom Filter is more effective when dealing with inputs with high cardinality, but it needs to use more memory. Suppose there is a need to filter high cardinality columns (such as columns with millions of distinct values) in the query, you might consider increasing the runtime_bloom_filter_size value to do some benchmarking tests. This is conducive to making the Bloom Filter filtering more precise, thereby achieving the expected performance improvement. The effectiveness of the Bloom Filter depends on the data distribution of the query, so typically, only the Bloom Filter length would need to be adjusted for certain specific queries, rather than making global changes. Generally, this query option only needs to be adjusted when optimizing some time-consuming queries involving large table joins.

    Check the Runtime Filter Generated by the Query

    The explain command can display the join on clause information used by each Fragment in the displayed query plan, as well as the notes on the generation and use of Runtime Filters by the Fragment, thus confirming whether the Runtime Filter application has been applied to the expected join on clause.
    The notes contained in the Fragment that generates the Runtime Filter, for example runtime filters: filter_id[type] <- table.column.
    The notes contained in the Fragment that uses the Runtime Filter, for example runtime filters: filter_id[type] -> table.column.
    In the following example, the query uses a Runtime Filter with an ID of RF000.
    CREATE TABLE test (t1 INT) DISTRIBUTED BY HASH (t1) BUCKETS 2 PROPERTIES("replication_num" = "1");
    INSERT INTO test VALUES (1), (2), (3), (4);
    
    CREATE TABLE test2 (t2 INT) DISTRIBUTED BY HASH (t2) BUCKETS 2 PROPERTIES("replication_num" = "1");
    INSERT INTO test2 VALUES (3), (4), (5);
    
    EXPLAIN SELECT t1 FROM test JOIN test2 where test.t1 = test2.t2;
    +-------------------------------------------------------------------+
    | Explain String |
    +-------------------------------------------------------------------+
    | PLAN FRAGMENT 0 |
    | OUTPUT EXPRS:t1 |
    | |
    | 4:EXCHANGE |
    | |
    | PLAN FRAGMENT 1 |
    | OUTPUT EXPRS: |
    | PARTITION: HASH_PARTITIONED: default_cluster:ssb.test.t1 |
    | |
    | 2:HASH JOIN |
    | | join op: INNER JOIN (BUCKET_SHUFFLE) |
    | | equal join conjunct: test.t1 = test2.t2 |
    | | runtime filters: RF000[in] <- test2.t2 |
    | | |
    | |----3:EXCHANGE |
    | | |
    | 0:OlapScanNode |
    | TABLE: test |
    | runtime filters: RF000[in] -> test.t1 |
    | |
    | PLAN FRAGMENT 2 |
    | OUTPUT EXPRS: |
    | PARTITION: HASH_PARTITIONED: default_cluster:ssb.test2.t2 |
    | |
    | 1:OlapScanNode |
    | TABLE: test2 |
    +-------------------------------------------------------------------+
    -- The runtime filters line above shows that 2:HASH JOIN of PLAN FRAGMENT 1 generated an IN predicate with an ID of RF000,
    -- where the key values of test2.t2 are known only at runtime,
    -- using this IN predicate in 0:OlapScanNode to filter out unnecessary data when reading test.t1.
    
    SELECT t1 FROM test JOIN test2 where test.t1 = test2.t2;
    -- Returns 2 rows of results [3, 4];
    
    -- Through the profile of the query (set enable_profile=true;), you can check the detailed information about the internal workings of the query,
    -- including whether each Runtime Filter has been pushed down, waiting duration, and the total time taken for OLAP_SCAN_NODE from preparing to receiving the Runtime Filter.
    RuntimeFilter:in:
    - HasPushDownToEngine: true
    - AWaitTimeCost: 0ns
    - EffectTimeCost: 2.76ms
    
    -- Furthermore, in the OLAP_SCAN_NODE of the profile, you can also check the filtering effect and timed taken after pushing down the Runtime Filter.
    - RowsVectorPredFiltered: 9.320008M (9320008)
    - VectorPredEvalTime: 364.39ms

    Rules for Planning Runtime Filters

    1. Only supports creating Runtime Filters for the equality conditions in the join on clause, excluding Null-safe conditions, because this could potentially filter out the null values in the left table of the join.
    2. Does not support pushing down Runtime Filters to left outer, full outer, or anti join of the left table.
    3. Does not support the case where src expr or target expr is a constant.
    4. Does not support the case where src expr is equivalent to target expr.
    5. Does not support the case where the type of src expr is HLL or BITMAP.
    6. Currently, it only supports pushing down Runtime Filters to OlapScanNode.
    7. It does not support the case where the target expr contains NULL-checking expressions, e.g., COALESCE/IFNULL/CASE, because pushing down this Runtime Filter to the left table of the outer join could possibly lead to incorrect results when the join on clause of other joins at the upper layer of the outer join contains NULL-checking expressions and generates runtime filters.
    8. It does not support the case where the column (slot) in target expr cannot find an equivalent column in the original table.
    9. It does not support column transmission, which includes the following two situations:
    Firstly, for instance, when join on clause contains A.k = B.k and B.k = C.k, currently, C.k can only push down to B.k, but not to A.k.
    Secondly, for instance, when join on clause contains A.a + B.b = C.c, if A.a can be transmitted to B.a, meaning A.a and B.a are equivalent columns, B.a can replace A.a, and then try to push Runtime Filter down to B (if A.a and B.a are not equivalent columns, it cannot be pushed down to B because target expr must have an affinity with only one join left table).
    10. The types of target expr and src expr must be equivalent because Bloom Filter is based on hash, if the types are not equivalent, it will try to convert the type of target expr to the type of src expr.
    11. It does not support pushing down Runtime Filters generated by PlanNode.Conjuncts. Unlike eqJoinConjuncts and otherJoinConjuncts of HashJoinNode, Runtime Filters generated by PlanNode.Conjuncts could potentially yield incorrect results in the test, for instance, when the IN subquery is converted to join, the automatically generated join on clause will be stored in PlanNode.Conjuncts. At this point, the application of Runtime Filters may cause some rows to be missing in the results.
    Contact Us

    Contact our sales team or business advisors to help your business.

    Technical Support

    Open a ticket if you're looking for further assistance. Our Ticket is 7x24 avaliable.

    7x24 Phone Support