DLC offers agile and efficient serverless data lake analysis and computing services. Because DLC is a distributed computing platform, its query performance is influenced by various internal and external factors, such as engine CU size, the number of tasks submitted simultaneously and waiting in the queue, SQL query structure, and Spark parameter settings. DLC Insight Management provides a visual and intuitive interface that helps you quickly understand current query performance, identify factors that may be affecting performance, and receive performance optimization suggestions.
DLC offers the Insight Management feature, which includes Task Insights and Engine Usage Insights, to help users adjust resources or optimize task logic.
Applicable Business Scenarios:
1. You need comprehensive insight into the overall performance of the Spark engine. Examples include intuitive visualization and analysis of metrics such as resource preemption across tasks running under the engine, resource usage within the engine, execution duration, data scan size, and data shuffle size.
2. You need convenient self-service troubleshooting and analysis of task performance. Examples include filtering and sorting large numbers of tasks by duration to quickly identify problematic large tasks, and pinpointing why Spark tasks are slow or have failed (for instance, resource preemption, shuffle anomalies, or insufficient disk space), with clear diagnostics.
Note:
Supported engines and task types: Currently, only Spark-type insights are supported under SuperSQL (both SQL and batch job engines are supported).
Task Insights
Task Insights provide a task-based perspective, helping users quickly identify completed tasks and receive optimization analysis and recommendations for performance improvements.
Directions
Log in to the Data Lake Compute DLC console, select the Insight Management feature, and then switch to the Task Insights page.
Insight Overview
Daily-level statistics provide an overview of the distribution and trends of tasks that require optimization, offering a more intuitive understanding of the tasks for each day.
Task Insights
The Task Insights feature supports the analysis of summary metrics for each executed task and identifies potential optimization issues.
After a task is completed, users can simply select the task they want to analyze and click Task Insights in the operation bar to view the insights.
Based on the actual execution of the current task, DLC Task Insights will provide optimization recommendations by combining data analysis and algorithmic rules.
Engine Usage Insights
When cluster resources are tight, tasks may be submitted to the engine but remain queued within it without the user being aware of it. Users may then keep submitting tasks, which worsens the congestion.
Engine Usage Insights provides a unified display of all tasks under an engine, from submission to execution, offering a comprehensive view of task distribution. This helps users quickly analyze the overall usage of the engine.
Note: Real-time data for engine execution time and data scan volume is only available for the SparkSQL engine. For the Spark job engine, queuing time and execution time within the engine are only available after the insights are completed.
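The relationship between submission, queuing, and execution described above can be illustrated with a minimal sketch. It assumes each task has three timestamps (submitted, started, finished). The `TaskTimeline` record and the helper names are hypothetical and are not part of any DLC API. The console derives these values for you.

```python
from dataclasses import dataclass


@dataclass
class TaskTimeline:
    """Hypothetical record of one task's timestamps, in seconds."""
    task_id: str
    submitted_at: float
    started_at: float
    finished_at: float


def queue_time(task: TaskTimeline) -> float:
    """Time the task waited in the engine before it began executing."""
    return task.started_at - task.submitted_at


def execution_time(task: TaskTimeline) -> float:
    """Time the task spent executing once it had started."""
    return task.finished_at - task.started_at


def queued_at(tasks: list[TaskTimeline], instant: float) -> list[str]:
    """IDs of tasks that were submitted but not yet started at `instant`."""
    return [t.task_id for t in tasks
            if t.submitted_at <= instant < t.started_at]


# Illustrative data only: three tasks submitted close together.
timeline = [
    TaskTimeline("a", submitted_at=0, started_at=5, finished_at=65),
    TaskTimeline("b", submitted_at=10, started_at=70, finished_at=100),
    TaskTimeline("c", submitted_at=20, started_at=105, finished_at=130),
]
```

In this made-up data, tasks `b` and `c` are both still queued at second 30 even though they have already been submitted, which is the kind of hidden congestion the engine-level view is meant to reveal.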
Log in to the Data Lake Compute DLC console, select the Insight Management feature, switch to the Engine Usage Insight page, and select the data engine you want to view.
How to Enable the Insights Feature
Upgrade Existing Spark Engines to SuperSQL Engine Kernel Image
For newly purchased engines, or engines purchased after July 18, 2024, the insights feature is enabled automatically and you can skip this step.
Go to the SuperSQL engine list page, select the engine name for which you want to enable insights, navigate to Kernel Management, and click Version Upgrade (the default is to upgrade to the latest kernel).
Overview of Key Insight Metrics
Metric | Description |
Resource preemption | The SQL task execution delay is greater than 1 minute from the stage submission time, or the delay exceeds 20% of the total runtime. |
Shuffle exception | The error stack information related to shuffle issues appears during stage execution. |
Slow task | The duration of a task within a stage is more than twice the average duration of other tasks in the same stage. |
Data skew | The shuffle data of a task is more than twice the average shuffle data size of other tasks. |
Insufficient disk or memory | The error stack during stage execution contains information about OOM, insufficient disk space, or COS bandwidth limitation errors. |
Engine execution time | Reflects the time taken for the first task to be executed in the Spark engine (the time when the task first begins to occupy the CPU and starts execution). |
CU consumption | Reflects the actual resource consumption of tasks. This is calculated by summing the runtime of all Spark task executors. Since each Spark task is executed in parallel across multiple CUs, the CU consumption time is accumulated sequentially, which results in a total time greater than the engine's execution time. |
Data scan size | The total size of input bytes aggregated across each Spark stage. |
Total output size | The total size of output bytes aggregated across each Spark stage. |
Affected data size and row count | Indicates the size of the data and the number of rows impacted when a table is updated. |
Parallel tasks | Displays the parallel execution of tasks, making it easier to analyze affected tasks (up to 200 tasks). |
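The threshold rules in the table above can be expressed as a short sketch. The thresholds (1 minute, 20%, twice the average) come from the table. Everything else is an assumption made for illustration: the `TaskStats` record, the function names, and the choice to compare each task against the mean of the other tasks in the same list are hypothetical. This is not DLC's actual implementation.

```python
from dataclasses import dataclass
from statistics import mean

# Thresholds taken from the metric descriptions in the table.
PREEMPTION_DELAY_SECONDS = 60   # delay greater than 1 minute
PREEMPTION_DELAY_RATIO = 0.20   # or delay above 20% of the total runtime
SLOW_TASK_FACTOR = 2.0          # more than twice the average duration
SKEW_FACTOR = 2.0               # more than twice the average shuffle size


@dataclass
class TaskStats:
    """Hypothetical per-task record for one stage; not a DLC data structure."""
    task_id: str
    duration_s: float
    shuffle_bytes: int


def is_resource_preempted(delay_s: float, total_runtime_s: float) -> bool:
    """Execution delay after stage submission exceeds 1 minute or 20% of runtime."""
    return (delay_s > PREEMPTION_DELAY_SECONDS
            or delay_s > PREEMPTION_DELAY_RATIO * total_runtime_s)


def _exceeds_peer_average(values: list, index: int, factor: float) -> bool:
    """True if values[index] is more than `factor` times the mean of the others."""
    others = values[:index] + values[index + 1:]
    if not others:
        return False  # a single task has no peers to compare against
    return values[index] > factor * mean(others)


def slow_tasks(tasks: list[TaskStats]) -> list[str]:
    """IDs of tasks whose duration is more than twice the average of the others."""
    durations = [t.duration_s for t in tasks]
    return [t.task_id for i, t in enumerate(tasks)
            if _exceeds_peer_average(durations, i, SLOW_TASK_FACTOR)]


def skewed_tasks(tasks: list[TaskStats]) -> list[str]:
    """IDs of tasks whose shuffle data is more than twice the average of the others."""
    sizes = [t.shuffle_bytes for t in tasks]
    return [t.task_id for i, t in enumerate(tasks)
            if _exceeds_peer_average(sizes, i, SKEW_FACTOR)]


def cu_consumption_seconds(executor_runtimes_s: list[float]) -> float:
    """CU consumption as the sum of all executor runtimes."""
    return sum(executor_runtimes_s)


# Illustrative data only: one stage with three tasks.
stage = [
    TaskStats("t1", duration_s=10, shuffle_bytes=100),
    TaskStats("t2", duration_s=12, shuffle_bytes=120),
    TaskStats("t3", duration_s=50, shuffle_bytes=1000),
]
```

With this made-up stage, `t3` is flagged as both slow and skewed. Three executors that run in parallel for 120, 90, and 150 seconds add up to 360 CU-seconds, which is why CU consumption can exceed the engine execution time.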