For the read-only analysis engine, you should collect basic statistics of table objects to achieve good SQL execution results. The system can more precisely recommend a reasonable SQL execution plan for you only after the table statistics are collected.
Statistics Overview
The statistics is the data characteristics and distribution information about database objects. The database can use the statistics to calculate a more accurate planned cost and optimize the plan selection. The types of statistics include the number of rows in the table, the number of distinct values (NDV) in the column, and the column histogram information. Collecting the statistics involves scanning and obtaining the database objects and storing the collected statistics in the data dictionary of the system. Additionally, the system maintains a local cache of the statistics to enhance the access efficiency of the optimizer. While collecting and using statistics, you should pay attention to the following two points:
The accuracy of statistics is vital for decision making of the optimizer. Therefore, regularly updating and maintaining the statistics is key to maintaining the database performance.
The collection of statistics may have a certain impact on the database performance, for it involves scanning and analyzing database objects. Therefore, you should weigh the performance and the statistics accuracy when collecting statistics.
In summary, the statistics play a crucial role in the database management system. They provide critical information about database objects and assist the optimizer in making more advisable decisions, so as to enhance the query performance and the overall efficiency of the database system.
Note:
Starting from version 2.2410.1.0, random sampling is supported. This feature allows random sampling based on the sampling rate specified by the TableScan operator. It also supports block-level sampling and enables statistics collection according to the configured sampling ratio, significantly reducing the overhead caused by full table scans during the statistics collection process.
Querying Table Statistics Information
SHOW TABLE STATS [ShowLikeOrWhere];
Description of the query result field is as follows:
|
db_name | Database name. |
table_name | Table name. |
partition_name | Partition name. |
update_time | Collection update time. |
modify_count | Number of rows in the table involving modification. |
row_count | Total number of rows in the table. |
sample_size | Number of rows sampled for collecting basic statistics information. |
The queried statistics information for t1 in the test database is shown in the following example:
SHOW TABLE STATS WHERE db_name='test' AND table_name='t1';
Statistics Collection
The read-only analysis engine currently supports two different methods for collecting statistics:
Automatic collection
Starting from version 2.2410.1.0, the read-only analysis engine supports dynamic sampling and automatic statistics collection during the process of loading data into columnar storage. Users do not need to manually collect statistics, as the system automatically generates the relevant statistics for databases and tables.
Note:
The automatically collected statistics information does not include histogram information of tables. If histogram information is required, it should be collected manually.
Manual collection
If you need to manually obtain the latest statistics for a specific table, you can log in to the read-only analysis engine instance using the MySQL client and execute the following statement to collect statistics for the specified table object:
ANALYZE TABLE <table name> [ ALL COLUMNS| COLUMNS IdentList [WITH NUM BUCKETS]];
ALL COLUMNS specifies that statistics information should be collected for all columns according to the specified method, such as collecting histogram information for all columns.
If ALL COLUMNS is not specified, the default collection policy is used, which collects basic statistics information for columns without histogram information.
COLUMNS IdentList represents the relevant statistics information to be collected for specified columns, for example, COLUMN c1,c2,c3.
NUM BUCKETS specifies the number of buckets for collecting equi-height histograms or TOPN information, with a range of [1, 1024]. If the specified number of buckets is 1, it means that no equi-height histogram or TOPN information is collected.
Use Case
Use the default policy to collect statistics information for t1 in the test database:
Specify the collection of histogram statistics information for all columns of t1 in the test database:
ANALYZE TABLE t1 ALL COLUMNS WITH 256 BUCKETS;
Specify the collection of histogram statistics information for columns c1 and c2 of t1 in the test database:
ANALYZE TABLE t1 COLUMNS c1,c2 WITH 256 BUCKETS;
Manually Specifying the Database-Level Statistics Information Collection
When there is a large number of tables, statistics information collection can be performed directly for databases. The collection policy is the same as that at the table level, and the principle is to collect information for all tables under the database as thoroughly as possible. If the collection for a particular table fails, it will not affect the collection for other tables.
ANALYZE DATABASE DBNameList [ ALL COLUMNS| COLUMNS IdentList [WITH NUM BUCKETS]];