Query Performance Optimization Guide

Last updated: 2024-07-31 18:03:07

    Foreword

    To enhance task execution efficiency, the DLC engine employs numerous optimization measures during computation, such as data governance, Iceberg indexing, and caching. Used properly, these can not only reduce unnecessary scan costs but also improve efficiency severalfold, or even by dozens of times. Some optimization ideas are provided below at different levels.

    Optimize SQL Statements

    Scenario: The SQL statement itself is unreasonable, leading to poor execution efficiency.

    Optimize JOIN Statements

    When a query involves JOINs with multiple tables, the Presto engine prioritizes completing the JOIN operation for the table on the right side of the query. Generally, completing the JOIN for the smaller table first, then joining the result set with the larger table, leads to higher execution efficiency. Therefore, the order of JOINs directly affects the query's performance. DLC Presto automatically collects statistics for inner tables and uses CBO to reorder the tables in the query.
    For external tables, users can collect statistics with the ANALYZE statement or manually specify the JOIN order. If specifying manually, order the tables by size, placing the smaller table on the right and the larger table on the left; for tables where A > B > C, for example: SELECT * FROM A JOIN B JOIN C. Note that this does not guarantee higher efficiency in all scenarios, since the outcome actually depends on the size of the data sets produced by the JOINs.
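    To see why the size of the build side matters, here is a minimal sketch (plain Python, not DLC internals) of a hash join: the engine builds a hash table on one side and probes it with the other, so building on the smaller table keeps memory usage low and probing cheap. The table names and sizes below are made up for illustration.

```python
def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Join two lists of dicts; the hash table is built on build_rows."""
    index = {}
    for row in build_rows:
        index.setdefault(row[build_key], []).append(row)
    joined = []
    for row in probe_rows:
        for match in index.get(row[probe_key], []):
            joined.append({**match, **row})
    return joined

small = [{"id": i, "name": f"cat{i}"} for i in range(10)]         # small table
large = [{"cat_id": i % 10, "value": i} for i in range(100_000)]  # large table

# Build the hash table on the small side (10 entries), probe with the large side.
result = hash_join(small, large, "id", "cat_id")
print(len(result))  # 100000 joined rows
```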

    Optimize GROUP BY Statements

    Arranging the order of fields in the GROUP BY clause can improve performance to a certain extent. Sort the grouping fields by cardinality from highest to lowest, for example:
    -- Efficient: the high-cardinality column id comes first
    SELECT id, gender, COUNT(*) FROM table_name GROUP BY id, gender;
    -- Inefficient: the low-cardinality column gender comes first
    SELECT id, gender, COUNT(*) FROM table_name GROUP BY gender, id;
    Another optimization is to replace the grouping fields with ordinal numbers where possible. These numbers refer to the positions of the columns after the SELECT keyword; for example, the SQL above can be rewritten as:
    SELECT id, gender, COUNT(*) FROM table_name GROUP BY 1, 2;
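    As a quick sanity check (using Python's built-in sqlite3 rather than DLC), grouping by ordinal positions returns the same groups as naming the columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (id INTEGER, gender TEXT)")
conn.executemany(
    "INSERT INTO table_name VALUES (?, ?)",
    [(1, "F"), (1, "F"), (2, "M"), (3, "F")],
)

by_name = conn.execute(
    "SELECT id, gender, COUNT(*) FROM table_name GROUP BY id, gender ORDER BY id"
).fetchall()
by_ordinal = conn.execute(
    "SELECT id, gender, COUNT(*) FROM table_name GROUP BY 1, 2 ORDER BY 1"
).fetchall()

print(by_name == by_ordinal)  # True: identical groups either way
```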

    Use Approximate Aggregate Functions

    For query scenarios that can tolerate a small amount of error, using these approximate aggregate functions can significantly improve query performance.
    For example, Presto can use the APPROX_DISTINCT() function instead of COUNT(DISTINCT x); the corresponding function in Spark is APPROX_COUNT_DISTINCT. The trade-off is that these approximate aggregate functions carry a standard error of about 2.3%.
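    For intuition, here is a toy sketch of approximate distinct counting using a k-minimum-values estimator. This is not Presto's implementation (APPROX_DISTINCT is based on HyperLogLog), and the 2.3% figure above does not apply to this sketch; it only shows how a small fixed-size summary can estimate cardinality.

```python
import hashlib

def kmv_estimate(values, k=1024):
    """Estimate the number of distinct values from the k smallest hashes."""
    hashes = set()
    for v in values:
        h = hashlib.md5(str(v).encode()).digest()
        hashes.add(int.from_bytes(h[:8], "big") / 2**64)  # map hash into [0, 1)
    smallest = sorted(hashes)[:k]
    if len(smallest) < k:
        return len(smallest)             # saw fewer than k distinct values: exact
    return int((k - 1) / smallest[-1])   # k-th smallest hash is roughly k / n

data = [i % 50_000 for i in range(200_000)]  # 50,000 distinct values, repeated 4x
estimate = kmv_estimate(data)
print(estimate)  # close to 50000, within a few percent
```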

    Use REGEXP_LIKE instead of multiple LIKE statements

    When there are multiple LIKE statements in SQL, you can often use regular expressions to replace multiple LIKEs, which can significantly improve execution efficiency. For example:
    SELECT COUNT(*) FROM table_name WHERE field_name LIKE '%guangzhou%' OR field_name LIKE '%beijing%' OR field_name LIKE '%chengdu%' OR field_name LIKE '%shanghai%'
    Can be optimized to:
    SELECT COUNT(*) FROM table_name WHERE regexp_like(field_name, 'guangzhou|beijing|chengdu|shanghai')
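    The same idea can be illustrated with Python's re module (not DLC): one compiled alternation pattern replaces several separate substring checks and produces identical matches on sample rows.

```python
import re

cities = re.compile(r"guangzhou|beijing|chengdu|shanghai")

rows = ["flight to beijing", "hotel in chengdu", "visit xi'an", "shanghai port"]

# Equivalent of: field LIKE '%guangzhou%' OR field LIKE '%beijing%' OR ...
multi_like = sum(
    1 for r in rows
    if any(c in r for c in ("guangzhou", "beijing", "chengdu", "shanghai"))
)
# One pass with a single compiled alternation.
one_regex = sum(1 for r in rows if cities.search(r))

print(multi_like, one_regex)  # 3 3
```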

    Data Governance

    Data Governance Use cases

    Scenario: Real-time writing. Flink CDC real-time writing usually adopts the upsert method, which generates a large number of small files during the writing process. When small files accumulate to a certain extent, it can cause data queries to slow down, or even result in timeout failures.
    You can check the number of table files and snapshot information in the following way.
    SELECT COUNT(*) FROM [catalog_name.][db_name.]table_name$files;
    SELECT COUNT(*) FROM [catalog_name.][db_name.]table_name$snapshots;
    For example:
    SELECT COUNT(*) FROM `DataLakeCatalog`.`db1`.`tb1$files`;
    SELECT COUNT(*) FROM `DataLakeCatalog`.`db1`.`tb1$snapshots`;
    When the number of table files and snapshots is excessive, refer to the document Enable data optimization to activate the data governance feature.

    Data Governance Effectiveness

    After enabling data governance, query efficiency improves significantly. For example, the table below compares query time before and after merging files. The experiment used a 16CU Presto engine on a table with 14MB of data spread across 2,921 files, averaging about 0.6KB per file.

    | Executed Statement | Files Merged | Number of Files | Number of Records | Query Time | Effect |
    | --- | --- | --- | --- | --- | --- |
    | SELECT count(*) FROM tb | No | 2,921 | 7,895 | 32s | - |
    | SELECT count(*) FROM tb | Yes | 1 | 7,895 | 2s | 93% faster |

    Partition

    Partitioning classifies and stores related data based on column values with different characteristics, such as time and region. This significantly reduces scan volume and improves query efficiency. For more details on DLC external table partitioning, refer to Quick Start with Partition Table. The table below compares query time and scan volume between the partitioned and unpartitioned versions of a single table with 66.6GB of data, 1.4 billion records, and ORC format; dt is the partition field, with 1,837 partitions.

    | Query Statement | Unpartitioned: Time | Unpartitioned: Scan Volume | Partitioned: Time | Partitioned: Scan Volume | Time Comparison | Scan Volume Comparison |
    | --- | --- | --- | --- | --- | --- | --- |
    | SELECT count(*) FROM tb WHERE dt='2001-01-08' | 2.6s | 235.9MB | 480ms | 16.5KB | 81% faster | 99.9% less |
    | SELECT count(*) FROM tb WHERE dt<'2022-01-08' AND dt>'2001-07-08' | 3.8s | 401.6MB | 2.2s | 2.8MB | 42% faster | 99.3% less |
    As the table above shows, partitioning can effectively reduce query latency and scan volume, but excessive partitioning can backfire, as the table below shows.

    | Query Statement | Unpartitioned: Time | Unpartitioned: Scan Volume | Partitioned: Time | Partitioned: Scan Volume | Time Comparison | Scan Volume Comparison |
    | --- | --- | --- | --- | --- | --- | --- |
    | SELECT count(*) FROM tb | 4s | 24MB | 15s | 34.5MB | 73% slower | 30% more |
    It is recommended to filter partitions in your SQL statements using the WHERE keyword.
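    Conceptually, partition pruning works because a filter on the partition column only touches the matching partitions. The sketch below (plain Python with hypothetical row counts, not DLC internals) illustrates the difference in scanned rows:

```python
from collections import defaultdict

# Rows keyed by the partition column dt; each day holds 1,000 rows (made up).
partitions = defaultdict(list)
for day in range(1, 31):
    partitions[f"2001-01-{day:02d}"] = [b"row-bytes"] * 1000

def scan(parts, predicate=None):
    """Return (rows read, partitions touched) for an optional partition filter."""
    touched = [dt for dt in parts if predicate is None or predicate(dt)]
    rows = sum(len(parts[dt]) for dt in touched)
    return rows, len(touched)

full_rows, full_parts = scan(partitions)                          # no filter
pruned_rows, pruned_parts = scan(partitions, lambda dt: dt == "2001-01-08")

print(full_rows, pruned_rows)  # 30000 1000
```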

    Cache

    With today's trend toward distributed computing and compute-storage separation, accessing metadata and large volumes of data over the network is constrained by network I/O. DLC significantly reduces response latency by enabling the following caching technologies by default, with no intervention required on your part.
    Alluxio: a data orchestration technology. It provides a cache that moves data from the storage layer closer to data-driven applications, making it faster to access. Alluxio's memory-first tiered architecture allows data access that is orders of magnitude faster than existing solutions.
    RaptorX: a hierarchical caching layer for Presto. It sits on top of storage and provides sub-second latency, aiming to offer a unified, cost-effective, fast, and scalable solution for OLAP and interactive use cases.
    Result Cache: caches the results of identical repeated queries, greatly improving speed and efficiency.
    The DLC Presto engine supports tiered caching with RaptorX and Alluxio by default, effectively reducing latency for similar tasks executed within a short period. Both the Spark and Presto engines support the result cache.
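    As a rough illustration of the result cache idea, the sketch below memoizes results for identical SQL text with a short retention window; the 30-second TTL and the class shape are illustrative choices, not DLC's internal design.

```python
import time

class ResultCache:
    def __init__(self, ttl_seconds=30.0):
        self.ttl = ttl_seconds
        self._store = {}  # sql text -> (timestamp, result)

    def run(self, sql, execute):
        """Return a cached result for identical SQL, else execute and cache."""
        hit = self._store.get(sql)
        if hit is not None and time.monotonic() - hit[0] < self.ttl:
            return hit[1], True                      # cache hit: no recompute
        result = execute(sql)
        self._store[sql] = (time.monotonic(), result)
        return result, False                         # cache miss: computed

cache = ResultCache()
slow_query = lambda sql: sum(range(1_000_000))       # stand-in for real work

r1, hit1 = cache.run("SELECT count(*) FROM tb", slow_query)
r2, hit2 = cache.run("SELECT count(*) FROM tb", slow_query)
print(hit1, hit2)  # False True
```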
    The following chart shows a TPCH benchmark on 1TB of Parquet files, run on a 16CU Presto engine. Since the test focuses on the caching feature, it primarily selects IO-heavy SQLs from TPCH; the tables involved mainly include lineitem, orders, and customer, and the SQLs are Q1, Q4, Q6, Q12, Q14, Q15, Q17, Q19, and Q20. The horizontal axis represents the SQL statement and the vertical axis the running time (in seconds).
    
    
    
    It is important to note that the DLC Presto engine loads the cache dynamically based on data access frequency. Cache hits therefore cannot occur during the engine's first task execution after startup, so initial performance is still limited by network IO; this limitation fades significantly as the number of executions increases. The table below compares three consecutive runs of the same query on a 16CU Presto cluster.

    | Query Statement | Query | Time Consumption | Data Scan Volume |
    | --- | --- | --- | --- |
    | SELECT * FROM table_name WHERE udid='xxx'; | First Query | 3.2s | 40.66MB |
    | | Second Query | 2.5s | 40.66MB |
    | | Third Query | 1.6s | 40.66MB |
    You can view the cache hit ratio of executed SQL tasks in the 'Data Exploration' feature of the DLC Console.
    
    

    Index

    Compared with external tables, creating internal tables with indexes significantly reduces both query time and scan volume. For more detailed information about creating tables, refer to Data Table Management.
    After creating the table and before inserting data, set the sort order for the fields your business queries most frequently, using WRITE ORDERED BY:
    alter table `DataLakeCatalog`.`dbname`.`tablename` WRITE ORDERED BY udid;
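    The benefit of sorting by udid can be sketched as min/max pruning: when data files hold sorted, non-overlapping value ranges, the engine can skip every file whose range cannot contain the filter value. The file layout below is hypothetical, not Iceberg's actual metadata.

```python
# Pretend each data file holds a contiguous, sorted range of udid values.
files = [{"min": lo, "max": lo + 999, "rows": 1000}
         for lo in range(0, 100_000, 1000)]  # 100 files, 1000 rows each

def files_to_scan(files, udid):
    """Keep only files whose min/max range could contain the value."""
    return [f for f in files if f["min"] <= udid <= f["max"]]

target = 42_500
hit_files = files_to_scan(files, target)
scanned_sorted = sum(f["rows"] for f in hit_files)

# With an unsorted layout the value could be in any file, so all are read.
scanned_unsorted = sum(f["rows"] for f in files)

print(len(hit_files), scanned_sorted, scanned_unsorted)  # 1 1000 100000
```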
    The table below compares query performance on an external table and an internal table (with an index) in a 16CU Presto cluster.

    | Table Type | Query | Time Consumption | Data Scan Volume |
    | --- | --- | --- | --- |
    | External Table | First Query | 16.5s | 2.42GB |
    | | Second Query | 15.3s | 2.42GB |
    | | Third Query | 14.3s | 2.42GB |
    | Internal Table (Index) | First Query | 3.2s | 40.66MB |
    | | Second Query | 2.5s | 40.66MB |
    | | Third Query | 1.6s | 40.66MB |
    As the table shows, compared with external tables, creating internal tables with indexes significantly reduces both time and scan volume. Moreover, thanks to cache acceleration, execution time continues to drop as the number of executions increases.

    Synchronous Query and Asynchronous Query

    DLC is specially optimized for BI scenarios. You can set synchronous or asynchronous mode (Presto engine only) through the engine parameter dlc.query.execution.mode. The values are described below.
    async (default): in this mode, tasks complete the full query computation and save the results to COS before returning them, so users can download the query results after the query completes.
    sync: in this mode, full computation is not required; as soon as partial results are available, the engine returns them directly to the user without saving them to COS. Users thus get lower query latency and shorter time consumption, but the results are kept in the system for only 30s. This mode is recommended when you do not need the complete query results from COS but want lower latency and time consumption, such as during the query exploration phase or for BI result presentation.
    Configuration method: after selecting the data engine, click Add under Advanced settings to configure the parameter.

    
    
    

    Resource Bottleneck

    To assess whether resources have become a bottleneck, DLC provides monitoring of engine resources such as CPU, memory, cloud disks, and network. You can adjust resource specifications to match your business scale; for details, refer to the Adjustment Configuration Fee Explanation. To view engine resource usage:
    1. Open the Data Engine page from the left navigation.
    2. Click the Monitoring button to the right of the target engine.
    3. You are taken to TCOP, where all monitoring metrics are displayed. For detailed operations and monitoring metrics, refer to Data Engine Monitoring. You can also configure alarms for each metric; for a detailed introduction, refer to Monitoring and Alarm Configuration.
    
    
    
    
    
    

    Other Factors

    Adaptive Shuffle

    To enhance stability, DLC enables Adaptive Shuffle by default. Adaptive Shuffle supports regular shuffles even when local disk space is limited, while ensuring stability in scenarios with large shuffles and data skew.

    Cluster cold start

    DLC supports automatic or manual suspension of a cluster, and no charges are incurred while it is suspended. As a result, "Queuing" may be displayed the first time a task is executed after the cluster starts, because resources are being brought up during the cold start. If you submit tasks frequently, it is recommended to purchase a monthly or yearly subscription cluster, which has no cold start and can execute tasks immediately at any time.