Problem Description
High CPU utilization in TDSQL-C for MySQL clusters often leads to system anomalies such as slow responses, failure to obtain connections, and timeouts. Large numbers of timeout retries are often the main cause of performance "avalanches". High CPU utilization is usually caused by abnormal SQL statements; a large number of lock conflicts, lock waits, or uncommitted transactions can also cause it.
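As a rough illustration of how uncommitted transactions might be spotted, the sketch below filters rows shaped like MySQL's information_schema.INNODB_TRX view (columns trx_id, trx_started, trx_mysql_thread_id) and flags those that have stayed open longer than a threshold. The function name, the 60-second threshold, and the sample rows are hypothetical, and whether your cluster exposes this view in the same form is an assumption to verify.

```python
from datetime import datetime, timedelta

def long_running_transactions(trx_rows, now, max_age_seconds=60):
    """Return transactions open longer than max_age_seconds, oldest first.

    trx_rows: list of dicts with the keys trx_id, trx_started (datetime),
    and trx_mysql_thread_id, as they might be fetched from
    information_schema.INNODB_TRX (assumed shape).
    """
    threshold = timedelta(seconds=max_age_seconds)
    flagged = [row for row in trx_rows if now - row["trx_started"] > threshold]
    return sorted(flagged, key=lambda row: row["trx_started"])

# Made-up sample data for illustration only.
now = datetime(2024, 1, 1, 12, 0, 0)
rows = [
    {"trx_id": "101", "trx_started": datetime(2024, 1, 1, 11, 59, 50), "trx_mysql_thread_id": 7},
    {"trx_id": "102", "trx_started": datetime(2024, 1, 1, 11, 40, 0), "trx_mysql_thread_id": 8},
    {"trx_id": "103", "trx_started": datetime(2024, 1, 1, 11, 55, 0), "trx_mysql_thread_id": 9},
]
stale = long_running_transactions(rows, now)
print([row["trx_id"] for row in stale])
```

Transactions flagged this way are only candidates; a long-running batch job can be legitimate, so check the owning session before taking action.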
When the database executes a query or modification statement, the CPU first requests the data blocks from memory:
If the target data is in memory, the CPU executes the computation task and returns the result, which may involve CPU-intensive operations such as sorting.
If the target data is not in memory, the database reads the data from the disk.
The two data acquisition processes above are called logical read and physical read, respectively. Therefore, poorly performing SQL statements can easily cause the database to generate a lot of logical reads during the execution, resulting in high CPU utilization. They may also make the database generate a lot of physical reads, resulting in high IOPS and I/O latency.
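To make the distinction between logical and physical reads concrete, here is a minimal sketch that derives a buffer pool hit ratio from two standard MySQL InnoDB status counters: Innodb_buffer_pool_read_requests (logical read requests) and Innodb_buffer_pool_reads (reads that had to go to disk). The function name and the sample values are hypothetical, and it is an assumption that your cluster reports these counters through SHOW GLOBAL STATUS in the same way.

```python
def buffer_pool_read_stats(status):
    """Summarize logical vs. physical reads from status counters.

    status: dict mapping status variable names to values, for example
    built from the rows returned by SHOW GLOBAL STATUS (assumed source).
    """
    logical = int(status["Innodb_buffer_pool_read_requests"])
    physical = int(status["Innodb_buffer_pool_reads"])
    # Share of logical reads served from memory without touching the disk.
    hit_ratio = 1.0 if logical == 0 else 1.0 - physical / logical
    return {
        "logical_reads": logical,
        "physical_reads": physical,
        "hit_ratio": hit_ratio,
    }

# Made-up counter values for illustration only.
sample = {
    "Innodb_buffer_pool_read_requests": "1000000",
    "Innodb_buffer_pool_reads": "2500",
}
stats = buffer_pool_read_stats(sample)
print(stats)
```

A high hit ratio combined with high CPU utilization points toward excessive logical reads, while a falling hit ratio tends to accompany rising IOPS and I/O latency.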
Solutions
DBbrain provides three major features to identify and optimize the abnormal SQL statements that cause high CPU utilization:
Exception diagnosis: It supports 24/7 exception detection and diagnosis and provides real-time optimization suggestions.
Slow SQL analysis: It analyzes slow SQL statements of the current instance and provides corresponding optimization suggestions.
Audit log analysis: It performs in-depth analysis on SQL statements and provides optimization suggestions based on TencentDB audit data (full SQL).
Method 1 (recommended): Use the "exception diagnosis" feature to troubleshoot database exceptions.
The exception diagnosis feature offers proactive fault localization and optimization, requiring no database operation and maintenance experience. It addresses not only exceptions of high CPU utilization but also nearly all frequent exceptions and failures in both read/write instances and read-only instances in a cluster.
The steps are as follows:
1. Log in to the DBbrain console, select Performance Optimization from the left navigation pane, and then click the Exception Diagnosis tab on the top.
2. Select (enter or search for) an instance ID in the top-left corner to switch to the target instance.
3. On this page, select Real-Time or Historical and specify the time to be queried. If there are any failures within this time frame, an overview of the information can be viewed in the "Diagnosis Prompt" on the right.
4. Click View Details in "Real-Time/Historical Diagnosis", or click a diagnosis item in the "Diagnosis Prompt" column, to enter the diagnosis details page, which contains the following sections:
Event overview: Includes the diagnosis item name, time range, risk level, duration, and overview.
Description: Includes symptom snapshots and performance trends of the exception event or health check event.
Intelligent Analysis: Analyzes the root cause of the performance exception to help you locate the specific operation.
Expert Suggestion: Provides optimization suggestions, including but not limited to SQL optimization (index and rewrite), resource configuration optimization, and parameter fine-tuning.
5. Click the Optimization Suggestions tab to view the optimization suggestions provided by DBbrain for the failure, such as optimization suggestions for SQL statements in this case.
Method 2: Use the "slow SQL analysis" feature to troubleshoot SQL statements that lead to high CPU utilization.
1. Log in to the DBbrain console, select Diagnostic Optimization from the left navigation pane, and click the Slow SQL Analysis tab on top.
2. Select (enter or search for) an instance ID in the top-left corner to switch to the target instance.
3. On the page, select the time period you wish to query. If there are slow SQL statements during this period, the SQL statistics section will display them in a bar chart, showing the times and quantities of slow SQL occurrences.
Click a bar in the chart. The list below then displays all related slow SQL information (aggregated SQL templates), and the right side displays the execution time distribution of the SQL statements during that period.
4. You can identify and filter SQL statement execution data in the SQL statement list in the following way:
4.1 Sort the SQL statements by average duration (or maximum duration). Examine the top SQL statements in terms of duration. We do not recommend you sort the statements by total duration, as the data may be affected by a high number of executions.
4.2 Then, check the numbers of returned rows and scanned rows.
If there is an SQL statement with the same "number of returned rows" and "number of scanned rows", it is very likely that the full table has been queried and returned.
If there are several SQL statements with a large number of scanned rows but no or few returned rows, it means that the system generated a lot of logical and physical reads. If the volume of the data to be queried is too high and memory is insufficient, the request will generate many physical I/O requests and consume lots of I/O resources. Too many logical reads will occupy too many CPU resources, resulting in high CPU utilization.
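The filtering rules in step 4 can be sketched as a small script, for example when working with slow SQL data exported from the console. This is only an illustration: the field names (template, avg_duration_s, rows_scanned, rows_returned), the thresholds, and the sample rows are hypothetical and do not reflect an actual DBbrain export format.

```python
def triage_slow_sql(templates, scan_ratio_threshold=100, min_full_scan_rows=1000):
    """Rank SQL templates by average duration and annotate scan patterns."""
    # Sort by average duration, not total duration, so that frequently
    # executed but fast statements do not dominate the list.
    ranked = sorted(templates, key=lambda t: t["avg_duration_s"], reverse=True)
    findings = []
    for t in ranked:
        scanned = t["rows_scanned"]
        returned = t["rows_returned"]
        if scanned == returned and scanned >= min_full_scan_rows:
            note = "possible full-table query and return"
        elif scanned >= scan_ratio_threshold * max(returned, 1):
            note = "many rows scanned for few returned; check indexes"
        else:
            note = "no obvious scan issue"
        findings.append((t["template"], note))
    return findings

# Made-up sample data for illustration only.
templates = [
    {"template": "SELECT * FROM orders", "avg_duration_s": 2.1,
     "rows_scanned": 500000, "rows_returned": 500000},
    {"template": "SELECT id FROM users WHERE email = ?", "avg_duration_s": 4.8,
     "rows_scanned": 1200000, "rows_returned": 1},
    {"template": "SELECT name FROM items WHERE category = ?", "avg_duration_s": 1.2,
     "rows_scanned": 40, "rows_returned": 10},
]
findings = triage_slow_sql(templates)
for template, note in findings:
    print(f"{template}: {note}")
```

The two thresholds are arbitrary starting points; tune them to the table sizes in your workload.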
5. Click an SQL statement to view its details, resource consumption, and optimization suggestions.
Analysis page: You can view the complete SQL template, SQL samples, and optimization suggestions and descriptions. You can optimize SQL based on the expert recommendations provided by DBbrain to improve SQL performance and reduce execution time.
Statistics page: Based on the total execution time proportion, total lock wait time proportion, total rows scanned proportion, and total rows returned proportion in the statistics report, you can analyze the specific causes of the slow SQL occurrence and perform corresponding optimization.
Details page: You can view the user source, IP source, database, and other detailed information for this type of SQL.
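The proportions shown on the statistics page can be approximated from raw per-template totals. The following sketch computes each template's share of total execution time, lock wait time, rows scanned, and rows returned. All field names and sample values are hypothetical; how DBbrain itself computes these proportions is not documented here.

```python
METRICS = ("exec_time_s", "lock_wait_s", "rows_scanned", "rows_returned")

def share_of_totals(templates, metrics=METRICS):
    """Return, per template, its fraction of the total for each metric."""
    totals = {m: sum(t[m] for t in templates) for m in metrics}
    shares = {}
    for t in templates:
        shares[t["template"]] = {
            # Guard against division by zero when a metric sums to zero.
            m: (t[m] / totals[m] if totals[m] else 0.0)
            for m in metrics
        }
    return shares

# Made-up sample data for illustration only.
sample_templates = [
    {"template": "A", "exec_time_s": 30, "lock_wait_s": 1,
     "rows_scanned": 900, "rows_returned": 10},
    {"template": "B", "exec_time_s": 10, "lock_wait_s": 3,
     "rows_scanned": 100, "rows_returned": 90},
]
shares = share_of_totals(sample_templates)
print(shares["A"])
```

In this made-up data, template A accounts for most of the execution time and scanned rows but few returned rows, which suggests an indexing problem, while template B's larger lock wait share suggests contention instead.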