Overview
For a streaming job that is running or has finished running, you can view its monitoring information in the following two ways.
Viewing in the console
Log in to the Stream Compute Service console, click the name of the target job, and then click the Monitoring tab to view key job metrics such as incoming data records, outgoing data records, operator computing time, CPU utilization, and heap memory utilization. Beta features: in major regions such as Beijing, Guangzhou, and Shanghai, finer-grained metrics broken down by Job, TaskManager, or Task are also available on the Monitoring page.
Viewing on Tencent Cloud Observability Platform (TCOP)
On the job list page in the console, click Tencent Cloud Observability Platform on the right to go to the TCOP console, where you can view detailed monitoring metrics and configure job-specific alarm policies.
Note
Stream Compute Service also supports monitoring Flink metrics with Prometheus, which allows you to save, analyze, and display various job metrics.
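As a rough sketch of what such a setup involves, the options below are open-source Flink's standard Prometheus PushGateway reporter settings. The host, port, and job name are placeholders, and the exact parameters accepted by Stream Compute Service may differ from these defaults.

```yaml
# flink-conf.yaml style options for Flink's Prometheus PushGateway reporter.
# Host, port, and job name are placeholders; replace them with your own PushGateway.
metrics.reporters: promgateway
metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
metrics.reporter.promgateway.host: 10.0.0.1
metrics.reporter.promgateway.port: 9091
metrics.reporter.promgateway.jobName: my_flink_job
metrics.reporter.promgateway.randomJobNameSuffix: true
metrics.reporter.promgateway.deleteOnShutdown: false
metrics.reporter.promgateway.interval: 60 SECONDS
```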
Illustrations of the Stream Compute Service console
On the Jobs page in the console, you can view the running status of your jobs.
Take the job high_cpu in the figure above as an example. Click the job name to go to its details page.
On the Overview page of the Monitoring tab, select a time range.
The preset ranges are the last hour, the last day, and the last 7 days; you can also set a custom time range.
Two sampling granularities are available: 1 minute and 5 minutes. The 5-minute granularity produces smoother curves.
Metrics available on the overview page
The overview page provides the most critical runtime metrics of the job, such as incoming data records, outgoing data records, operator computing time, sink watermark delay (relative to the current timestamp), job restarts, TaskManager CPU utilization, TaskManager heap memory utilization, and TaskManager old-generation GC time and count, helping you quickly identify common job exceptions.
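As background for the sink watermark delay metric, the sketch below shows how watermarks are commonly generated in a JAR job; SensorEvent and its fields are hypothetical. The delay reported in the console is roughly the current timestamp minus the watermark that has reached the sink, so sparse or late events make it grow.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.time.Duration;

public class WatermarkSketch {
    // Hypothetical event type used only for this illustration.
    public static class SensorEvent {
        public long eventTime; // event timestamp in milliseconds
        public double value;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<SensorEvent> events = env.fromElements(new SensorEvent(), new SensorEvent());

        // Watermarks trail the largest observed event time by 5 seconds. The sink
        // watermark delay shown in the console grows when events are late or sparse,
        // because the watermark at the sink falls further behind the current time.
        events
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<SensorEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                    .withTimestampAssigner((event, recordTs) -> event.eventTime))
            .print();

        env.execute("watermark-sketch");
    }
}
```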
Checkpoint metrics (in beta)
Note
Checkpoint metrics are in beta testing and available only for Guangzhou, Beijing, and Shanghai. For other regions, please stay tuned.
After checkpointing is enabled for a Flink job, its runtime state is periodically saved to checkpoints so that the job can be recovered when necessary. The following metrics are displayed on this page; a minimal sketch of enabling checkpointing in a JAR job is shown after the list.
Last checkpoint size (Bytes): The size of the last checkpoint.
Checkpoint time (ms): The time taken to make the last checkpoint.
Total checkpoint failures: The total number of checkpoint failures.
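For reference, the following is a minimal sketch of enabling checkpointing in a Flink JAR job with the DataStream API. The interval, pause, timeout, and failure-tolerance values are arbitrary examples, not recommendations.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take an exactly-once checkpoint every 60 seconds (example interval).
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig checkpointConfig = env.getCheckpointConfig();
        // Leave at least 30 seconds between the end of one checkpoint and the start of the next.
        checkpointConfig.setMinPauseBetweenCheckpoints(30_000L);
        // Declare a checkpoint failed if it does not complete within 10 minutes.
        checkpointConfig.setCheckpointTimeout(600_000L);
        // Tolerate a few checkpoint failures before failing the whole job.
        checkpointConfig.setTolerableCheckpointFailureNumber(3);

        env.fromElements(1, 2, 3).print();
        env.execute("checkpoint-sketch");
    }
}
```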
JobManager metrics (in beta)
Note
JobManager metrics are in beta testing and available only for Guangzhou, Beijing, and Shanghai. For other regions, please stay tuned.
When a Flink job is started, only one JobManager (JM) will be used. Therefore, the metrics displayed here are those of this JobManager.
JM CPU Load (%): Status.JVM.CPU.Load of the JobManager, representing the CPU utilization of the JVM.
JM Heap Memory (Bytes): The heap memory usage of the JobManager.
JM GC Count: Status.JVM.GarbageCollector.<GarbageCollector>.Count of the JobManager, representing the GC count of the JobManager.
JM GC Time (ms): Status.JVM.GarbageCollector.<GarbageCollector>.Time of the JobManager, representing the GC time of the JobManager.
TaskManager metrics (in beta)
Note
TaskManager metrics are in beta testing and available only for Guangzhou, Beijing, and Shanghai. For other regions, please stay tuned.
When a Flink job is started, one or more TaskManagers will be used, depending on the specified parallelism. All TaskManagers will be displayed in the list. You can select a TaskManager to view its metrics. Available TaskManager metrics include the following:
CPU Load (%): Status.JVM.CPU.Load of the TaskManager, representing the CPU utilization of the JVM.
Heap Memory (Bytes): The heap memory usage of the TaskManager.
GC Count: Status.JVM.GarbageCollector.<GarbageCollector>.Count of the TaskManager, representing the GC count of the TaskManager.
GC Time (ms): Status.JVM.GarbageCollector.<GarbageCollector>.Time of the TaskManager, representing the GC time of the TaskManager.
Pod Memory (Bytes): The memory usage of the Pod where the TaskManager resides. This metric covers the whole Pod, including the JVM heap memory, non-heap and direct memory, JVM overhead, and the memory of other auxiliary services in the Pod. If the value of this metric is too large, the Pod risks being killed due to OOM. A memory configuration sketch illustrating the breakdown follows this list.
Pod CPU (%): The CPU utilization of the Pod where the TaskManager resides. This metric covers the whole Pod, including the CPU usage of the JVM and of other auxiliary services in the Pod.
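To relate the Heap Memory and Pod Memory metrics, the following is an illustrative sketch of how open-source Flink splits the total TaskManager process memory using its taskmanager.memory.* options. The values are examples only, not recommendations; Stream Compute Service normally derives the actual sizes from the resources allocated to the job.

```yaml
# Illustrative Flink TaskManager memory options (example values only).
# Total memory of the TaskManager process -- roughly what the Pod must provide.
taskmanager.memory.process.size: 4096m
# Fractions carved out of the total for network buffers and JVM overhead;
# the remainder is split mainly between the JVM heap and off-heap managed memory.
taskmanager.memory.network.fraction: 0.1
taskmanager.memory.jvm-overhead.fraction: 0.1
taskmanager.memory.managed.fraction: 0.4
```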
Task metrics (in beta)
Note
Task metrics are in beta testing and available only for Guangzhou, Beijing, and Shanghai. For other regions, please stay tuned.
The execution graph of a Flink job contains one or more tasks. You can select a task in the graph to view its metrics.
OutPoolUsage: The usage of the task's output buffer pool, as a percentage. When this metric stays near 100%, the task is backpressured by its downstream tasks; this can be resolved by, for example, increasing the operator parallelism (see the sketch after this list).
OutputQueueLength: The number of queued output buffers.
InPoolUsage: The usage of the task's input buffer pool, as a percentage. When this metric stays near 100%, the task cannot keep up with its input and backpressures its upstream tasks; this can likewise be resolved by, for example, increasing the operator parallelism.
InputQueueLength: The number of queued input buffers.
CurrentInputWatermark: The last watermark the task has received (the minimum across its inputs).
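When InPoolUsage or OutPoolUsage stays near 100%, one common remedy noted above is to raise the parallelism of the slow operator. The following is a minimal DataStream sketch of doing this in a JAR job; the socket source and the uppercasing operator are hypothetical stand-ins, and in practice the parallelism can typically also be adjusted in the job's resource configuration in the console.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Default parallelism for all operators of this job.
        env.setParallelism(2);

        // Hypothetical source used only for illustration.
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        lines
            // Suppose this operator's InPoolUsage stays near 100%: give it a higher
            // parallelism than the rest of the job so it can keep up with its input.
            .map(line -> line.toUpperCase())
            .setParallelism(4)
            .print();

        env.execute("parallelism-sketch");
    }
}
```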