Overview
A job failure event in Stream Compute Service indicates that the status of a Flink job has changed from RUNNING to FAILED or RESTARTING, which may interrupt data processing, delay downstream output, and cause other issues.
Conditions
Trigger
1. The status of a Flink job changes from RUNNING to FAILED or RESTARTING. The Flink JobManager then recovers the job within about 10 seconds, and the running instance ID remains unchanged after recovery.
2. A Flink job is restarted too many times or too frequently, exceeding the limit defined in Restart Policies (the threshold is generally controlled by restart-strategy.fixed-delay.attempts and defaults to 5; we recommend increasing it in a production environment, as shown in the sketch after this list). In this case, both the JobManager and the TaskManagers exit, and the system tries to recover the job from the last successful checkpoint within about 2 minutes; the running instance ID is increased by 1 after recovery.
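To raise the restart threshold in a DataStream job, you can set a fixed-delay restart strategy in code. The following is a minimal Java sketch assuming a job built with Flink's StreamExecutionEnvironment; the attempt count of 10 and the 10-second delay are illustrative values, not Stream Compute Service defaults:

import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartStrategyExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Enable checkpointing (every 60 s here) so that a failed job can be
        // recovered from the last successful checkpoint.
        env.enableCheckpointing(60_000);

        // Fixed-delay restart strategy: allow up to 10 restart attempts with a
        // 10-second pause between attempts. These map to the configuration keys
        // restart-strategy.fixed-delay.attempts and restart-strategy.fixed-delay.delay.
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
                10,
                Time.of(10, TimeUnit.SECONDS)));

        // ... build sources, transformations, and sinks here, then call:
        // env.execute("my-job");
    }
}

The same settings can also be supplied through the restart-strategy.* configuration keys instead of in code, which avoids recompiling the job when tuning the threshold.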
Clearing
After Flink or the Stream Compute Service system recovers the job back to RUNNING, a job failure recovery event is generated, marking the end of this event.
Alarms
Suggestions
You can search for exception logs under the instance ID of the job that generated the event, as instructed in Diagnosis with Logs. Generally, the error messages around the keywords from RUNNING to FAILED contain the direct cause of the job failure. We recommend analyzing the issue based on these error messages together with the JobManager and TaskManager logs. If the cause still cannot be found with the above diagnosis, check whether resource overuse exists, as instructed in Viewing Monitoring Information. Focus on critical metrics such as TaskManager CPU usage, heap memory usage, full GC count, and full GC time to check whether any exceptions exist.
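If you can run code inside (or attach to) the TaskManager JVM, the heap and GC signals named above can also be read through the standard JVM MXBeans. This is a minimal sketch using only the java.lang.management API, not a Stream Compute Service API, and the collector names vary by JVM and GC configuration:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class JvmMetricsProbe {
    public static void main(String[] args) {
        // Heap usage: sustained usage close to the maximum often precedes
        // OutOfMemoryError-driven job failures. getMax() may be -1 if undefined.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("Heap used: %d / %d bytes%n", heap.getUsed(), heap.getMax());

        // Cumulative GC count and time per collector; a fast-growing count or time
        // for the old-generation collector corresponds to the full GC count/time metrics.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("GC %s: count=%d, time=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}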