Metric Name | Metric Definition |
Engine execution time | Reflects when the first task began executing on the Spark engine (the moment a task first preempted the CPU for execution). |
CU consumption | Reflects the actual resource consumption of a task, calculated by aggregating the runtime of all Spark task executors. Because a Spark task runs in parallel across multiple CUs and each CU's runtime is accumulated serially, CU consumption is greater than the engine execution time. |
Data scan size | Sums the input bytes across all Spark stages. |
Total output size | Sums the output bytes across all Spark stages. |
Data shuffle size | Sums the shuffle read bytes across all Spark stages. |
Number of output files | (This metric requires the Spark engine kernel to be upgraded to a version after November 16, 2024.) The total number of files written by tasks through statements such as INSERT. |
Number of output small files | (This metric requires the Spark engine kernel to be upgraded to a version after November 16, 2024.) Small files are defined as output files smaller than 4 MB (controlled by the parameter spark.dlc.monitorFileSizeThreshold; default 4 MB, configurable at the engine or task level). This metric counts the total number of small files written by tasks through statements such as INSERT. |
Parallel task | Displays tasks that execute in parallel, making it easier to analyze affected tasks (up to 200 entries). |
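As a hedged illustration of why CU consumption exceeds engine execution time: the numbers and variable names below are invented for the sketch, and the wall-clock time is approximated as the longest executor runtime.

```python
# Illustrative numbers only; real values come from the task's metrics page.
# CU consumption serially accumulates every executor's runtime, while the
# engine execution time is closer to the wall-clock duration of the job.
executor_runtimes_sec = [120, 118, 125, 122]  # four hypothetical parallel CUs

cu_consumption_sec = sum(executor_runtimes_sec)    # accumulated across CUs
engine_execution_sec = max(executor_runtimes_sec)  # rough wall-clock proxy

# With parallel execution, accumulated CU time always exceeds wall time.
assert cu_consumption_sec > engine_execution_sec
```

Here four executors each run about two minutes in parallel, so the job finishes in roughly two minutes of wall time while consuming about eight CU-minutes.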
Insight Type | Algorithm Description (Continuously Improving and Adding New Algorithms) |
Resource preemption | A SQL task is delayed for more than 1 minute after its stage is submitted, or the delay exceeds 20% of the total runtime (the threshold formula adjusts dynamically based on task runtime and data volume). |
Shuffle exception | Stage execution encounters shuffle-related error stack information. |
Slow task | Task duration in a stage is greater than twice the average duration of other tasks in the same stage (the threshold formula dynamically adjusts based on task runtime and data volume). |
Data skew | Task shuffle data is greater than twice the average shuffle data size of other tasks (the threshold formula dynamically adjusts based on task runtime and data volume). |
Disk or memory insufficiency | The error stack during stage execution contains OOM, insufficient disk space, or COS bandwidth limitation errors related to insufficient disk or memory. |
Excessive small file output | (This insight type requires the Spark engine kernel to be upgraded to a version after November 16, 2024.) See the metric Number of output small files above; excessive small file output is flagged if either of the following conditions is met: 1) for partitioned tables, any single partition outputs more than 200 small files; for non-partitioned tables, the total number of small files exceeds 200; 2) the table, partitioned or not, outputs more than 3,000 files with an average file size below 4 MB. |
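The two small-file conditions can be sketched as a simple check. This is a minimal sketch, not the engine's actual implementation: the function name and the data shape are assumptions, and the 4 MB constant mirrors the documented default of spark.dlc.monitorFileSizeThreshold.

```python
SMALL_FILE_BYTES = 4 * 1024 * 1024  # documented default threshold (4 MB)

def excessive_small_files(partition_file_sizes):
    """Hypothetical check mirroring the two documented conditions.

    partition_file_sizes maps a partition key (use a single key such as
    None for a non-partitioned table) to a list of output file sizes in bytes.
    """
    all_sizes = [s for sizes in partition_file_sizes.values() for s in sizes]
    # Condition 1: more than 200 small files in any partition
    # (for a non-partitioned table, this is the total small-file count).
    for sizes in partition_file_sizes.values():
        if sum(1 for s in sizes if s < SMALL_FILE_BYTES) > 200:
            return True
    # Condition 2: more than 3,000 output files with an average size under 4 MB.
    if len(all_sizes) > 3000 and sum(all_sizes) / len(all_sizes) < SMALL_FILE_BYTES:
        return True
    return False

# A partition with 201 one-byte files trips condition 1:
assert excessive_small_files({"p1": [1] * 201})
# 100 files of 10 MB each trip neither condition:
assert not excessive_small_files({None: [10 * 1024 * 1024] * 100})
```

Lowering spark.dlc.monitorFileSizeThreshold (at the engine or task level) would shrink SMALL_FILE_BYTES here and make condition 1 less likely to fire.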