Overview
The JobManager of a Flink job schedules and manages the entire job. It runs as a JVM process with its own heap memory. For a source connector built on the FLIP-27 interface, the split enumerator records shard (split) information in this heap. If there are too many shards, heap usage can grow excessively and affect the stability of the whole job.
When JVM heap memory is nearly exhausted, a full GC (a memory reclamation mechanism) is triggered to free space. If each full GC reclaims only a small amount of memory and the heap cannot be released in time, the JVM will trigger full GCs frequently and continuously. These collections consume a large amount of CPU time and can stall the job's execution threads, at which point this event is triggered.
Note
This feature is in beta, so custom rules are not yet supported. This capability will be added in the future.
Trigger conditions
The system detects the full GC time of the JobManager of a Flink job every 5 minutes.
If the increase in the JobManager's full GC time accounts for more than 30% of a detection period (that is, full GC takes more than 1.5 minutes within 5 minutes), the job is considered to have a severe full GC problem, and this event is triggered.
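The threshold arithmetic above can be sketched as follows. The 5-minute detection period and the 30% ratio come from this page; the function name and sample values are illustrative:

```python
DETECTION_PERIOD_S = 5 * 60   # the system samples full GC time every 5 minutes
THRESHOLD_RATIO = 0.30        # event fires above 30% of the detection period

def full_gc_event_triggered(gc_time_increase_s: float) -> bool:
    """Return True if the increase in the JobManager's full GC time
    within one detection period exceeds the alert threshold."""
    return gc_time_increase_s / DETECTION_PERIOD_S > THRESHOLD_RATIO

# 1.5 minutes (90 s) of full GC inside a 5-minute window is exactly the
# 30% boundary; anything beyond it triggers the event.
full_gc_event_triggered(100)  # 100 s is about 33% of the period -> True
full_gc_event_triggered(60)   # 60 s is 20% of the period -> False
```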
Note
To avoid frequent alarms, at most one push of this event is triggered per hour for each running instance ID of each job.
Alarm configuration
Suggestions
If you receive a push notification of this event, we recommend configuring more resources for the job as instructed in Configuring Job Resources. For example, you can increase the JobManager specification to enlarge the maximum available JobManager heap space and accommodate more state data. If you use MySQL CDC, we recommend increasing the size of each shard by setting the WITH parameter scan.incremental.snapshot.chunk.size to a larger value, so that the JobManager heap is not exhausted by too many shards. If OutOfMemoryError: Java heap space does not appear in the logs and the job runs properly, we recommend configuring alarms for the job and adding the job failure event to the alarm rules of Stream Compute Service, so that you receive job failure pushes in time. If the problem persists after all of the above methods are used, submit a ticket to contact technical support.
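As a minimal sketch of the chunk-size suggestion above, the snippet below assembles a MySQL CDC table DDL with a larger scan.incremental.snapshot.chunk.size. The table name, connection options, and the value 40960 are all illustrative, not taken from this page; the connector default for this option is 8096 rows in recent Flink CDC versions. A larger chunk size means fewer shards, and therefore less shard metadata held in the JobManager heap:

```python
# Illustrative: a larger chunk size than the Flink CDC default (8096 rows).
chunk_size = 40960

# Hypothetical Flink SQL DDL; host, database, table, and credentials
# are placeholders, not values from this document.
ddl = f"""
CREATE TABLE orders (
    order_id BIGINT,
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'connector' = 'mysql-cdc',
    'hostname' = 'mysql.example.internal',
    'port' = '3306',
    'username' = 'flink_user',
    'password' = 'flink_pass',
    'database-name' = 'shop',
    'table-name' = 'orders',
    'scan.incremental.snapshot.chunk.size' = '{chunk_size}'
)
"""
```

You would submit this DDL through your SQL job as usual; the only change relative to a default configuration is the explicit chunk-size option in the WITH clause.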