Overview
A TaskManager of a Flink job is a JVM process with its own heap memory. Storing the runtime state of Flink operators, as well as other operations, can consume too much of this heap memory.
When the JVM heap memory is nearly exhausted, a full GC (a memory reclamation mechanism) is triggered to free space. If each full GC reclaims only a small amount of memory and the heap cannot be freed in time, the JVM triggers full GCs frequently and continuously. These collections consume a large amount of CPU time and can stall or fail the job's execution threads, and this event is then triggered.
Note
This feature is in beta testing, so custom rules are not yet supported. Custom rules will be supported in the future.
Trigger conditions
The system checks the full GC time of all TaskManagers of a Flink job every 5 minutes.
If the full GC time of a TaskManager grows by more than 30% of a detection period (that is, more than 1.5 minutes of full GC within 5 minutes), the job is considered to have a severe full GC problem, and this event is triggered.
Note
To avoid frequent alarms, this event is pushed at most once per hour for each running instance ID of each job.
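For illustration only, the threshold above can be read as the ratio of full GC time accumulated during a detection window to the length of that window. The sketch below is not the platform's actual detection logic; it simply samples the JVM's accumulated garbage collection time through the standard GarbageCollectorMXBean API and applies the same 30%-of-5-minutes check, assuming the reported collection time approximates the full GC time.

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class FullGcRatioCheck {

        public static void main(String[] args) throws InterruptedException {
            final long windowMillis = 5 * 60 * 1000L; // 5-minute detection period
            final double threshold = 0.30;            // 30% of the period, i.e. 1.5 minutes

            long before = totalGcTimeMillis();
            Thread.sleep(windowMillis);
            long after = totalGcTimeMillis();

            // Ratio of GC time accumulated during the window to the window length.
            double ratio = (double) (after - before) / windowMillis;
            if (ratio > threshold) {
                System.out.println("Severe full GC suspected: GC time ratio = " + ratio);
            }
        }

        // Sums the accumulated collection time reported by all collector beans.
        // A real check would count only the old-generation ("full") collector.
        private static long totalGcTimeMillis() {
            long total = 0;
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                total += Math.max(0, gc.getCollectionTime());
            }
            return total;
        }
    }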
Alarms
Suggestions
If you receive a push notification of this event, we recommend the following:
- Configure more resources for the job as instructed in Configuring Job Resources. For example, increase the TaskManager spec (a larger heap can hold more state data) or increase the operator parallelism (each TaskManager then processes less data and uses less memory).
- Adjust advanced Flink parameters as instructed in Advanced Job Parameters. For example, set taskmanager.memory.managed.size to a smaller value to leave more heap memory available (see the sketch after this list). Make such adjustments only under the guidance of an expert who fully understands Flink's memory allocation mechanisms; otherwise, they may cause other issues.
- If OutOfMemoryError: Java heap space or similar keywords are found in the job crash logs, enable the feature described in Collecting Pod Crash Events and set -XX:+HeapDumpOnOutOfMemoryError as described there, so that a local heap dump can be captured for analysis when the job crashes with an OOM error.
- If OutOfMemoryError: Java heap space is not found in the logs and the job is running properly, we recommend that you configure alarms for the job and add the job failure event to the alarm rules of Stream Compute Service to receive job failure event pushes in time.
- If the problem persists after trying all of the above, submit a ticket to contact the technicians for help.
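For reference, the sketch below shows the parameters mentioned above written as standard Flink configuration keys. The value 512mb is only a placeholder, not a recommendation, and env.java.opts.taskmanager is an assumption about how the JVM option is passed; in Stream Compute Service these settings may instead be applied through the console as described in Advanced Job Parameters and Collecting Pod Crash Events.

    # Shrink managed (off-heap) memory so that more of the TaskManager
    # process memory is left to the JVM heap. 512mb is a placeholder value.
    taskmanager.memory.managed.size: 512mb

    # Ask the JVM to write a heap dump when an OutOfMemoryError occurs,
    # so the dump can be analyzed after an OOM crash.
    env.java.opts.taskmanager: -XX:+HeapDumpOnOutOfMemoryError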