Overview
Various events can occur while a job runs, such as job start and stop, job running failures, checkpoint failures, and other exceptions. The Stream Compute Service console provides an events page where you can view and subscribe to critical events.
On the events page, you can select an event type and filter events further by running instance ID and time range. Click Reset filter to clear the filters, restore the defaults, and pull the latest events.
Note
To avoid too many events being displayed at once, the time range of a filter can span at most 7 days and must fall within the past 90 days.
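The two limits in this note can be checked before a filter is submitted. The following is a minimal sketch only: the function name `is_valid_filter_range` and the exact boundary handling (both limits treated as inclusive) are assumptions for illustration, not part of the console or any Stream Compute Service API.

```python
from datetime import datetime, timedelta

MAX_SPAN = timedelta(days=7)       # longest range a single filter may cover
MAX_LOOKBACK = timedelta(days=90)  # how far back events can be filtered


def is_valid_filter_range(start, end, now):
    """Return True if [start, end] respects both limits (assumed inclusive)."""
    if start > end:
        return False
    if end - start > MAX_SPAN:
        return False
    # The whole range must lie within the past 90 days.
    return now - start <= MAX_LOOKBACK
```

For example, a range covering the last 7 days passes the check, while a range of 8 days, or one that starts 95 days ago, does not.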
Event types
Job start and stop
When you click Publish draft on the Development & Testing page of a job, or when the system detects that the job has exited due to a crash, the system tries to start the job and automatically creates a new instance ID for this run. A new start event then appears on the events page. When you stop or restart the job, or when it crashes and exits, a stop event with the same instance ID is generated. The job start time and end time refer to the time when the corresponding internal process of the job completes, not the time when you perform the operation in the UI.
For example, the information in the figure below indicates that the instance was started at 2021-11-10 16:49:30 and stopped at 2021-11-10 16:55:52, either by you or by the system.
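Because a start event and its stop event share one instance ID, the two can be paired to see how long each run lasted. The sketch below assumes a hypothetical record shape (a dict with `type`, `instance_id`, and `time` keys) and made-up sample data; the actual event format of the service is not described on this page.

```python
from datetime import datetime, timedelta


def run_intervals(events):
    """Pair start and stop events by instance ID.

    Returns {instance_id: (start_time, stop_time)}; stop_time is None
    while the instance has no stop event yet.
    """
    intervals = {}
    for event in events:
        start, stop = intervals.get(event["instance_id"], (None, None))
        if event["type"] == "start":
            start = event["time"]
        elif event["type"] == "stop":
            stop = event["time"]
        intervals[event["instance_id"]] = (start, stop)
    return intervals


# Hypothetical sample data: one finished run and one run still in progress.
sample_events = [
    {"type": "start", "instance_id": "inst-1", "time": datetime(2022, 1, 1, 10, 0, 0)},
    {"type": "stop", "instance_id": "inst-1", "time": datetime(2022, 1, 1, 10, 6, 0)},
    {"type": "start", "instance_id": "inst-2", "time": datetime(2022, 1, 1, 10, 6, 30)},
]
intervals = run_intervals(sample_events)
```

In this sample, `inst-1` has both a start and a stop time, while `inst-2` has only a start time because it is still running.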
Job running failure and recovery
When a job is restarted while it is running (its status changes from RUNNING to RESTARTING or FAILED), a job running failure event is generated. When the job returns to RUNNING, a failed job recovery event is generated.
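The status changes described above can be pictured as a small state machine. This is an illustrative sketch only: the function `derive_events` is hypothetical, and it assumes that a recovery event is produced only after a preceding failure event and that a move from RESTARTING to FAILED does not produce a second failure event.

```python
FAILURE_STATES = {"RESTARTING", "FAILED"}


def derive_events(statuses):
    """Map a sequence of job statuses to failure and recovery events."""
    events = []
    previous = None
    failing = False  # True between a failure event and the next recovery
    for status in statuses:
        if previous == "RUNNING" and status in FAILURE_STATES:
            events.append("job running failure")
            failing = True
        elif failing and status == "RUNNING":
            events.append("failed job recovery")
            failing = False
        previous = status
    return events
```

A job that goes from RUNNING to RESTARTING and back to RUNNING therefore yields one failure event followed by one recovery event.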
You can select Operation > Solution to view causes of and solutions to the event. You can also configure alarms for job running failure events.
Checkpointing failure and recovery
If checkpointing is enabled for a job, and a checkpoint fails to be taken, a checkpoint failure event will be generated. If the checkpoint succeeds later, a failed checkpoint recovery event will be generated.
You can select Operation > Solution to view causes of and solutions to the event. You can also configure alarms for checkpoint failure events.
Job exception events (in beta)
The Stream Compute Service backend continuously monitors and analyzes running jobs. When a job encounters a severe exception (such as an excessively long TaskManager full GC, excessively high CPU load, or an abnormal Pod exit), the corresponding event is pushed to you so that you can determine whether the job is running properly.
Note
To avoid bothering you unnecessarily, at most one job exception event (other than the abnormal Pod exit) will be pushed per hour.
This feature is in beta testing. It supports detecting severe problems only, and thresholds cannot be adjusted. It will be further improved and upgraded. Please stay tuned.
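The push limit in the note above can be thought of as a simple throttle. The sketch below is not the service's implementation: the class `ExceptionEventThrottle` and the event type strings are hypothetical, and it assumes that the limit applies per job and that the hour is a sliding window measured from the last pushed event.

```python
from datetime import datetime, timedelta


class ExceptionEventThrottle:
    """Allow at most one throttled exception event per job per window."""

    def __init__(self, window=timedelta(hours=1)):
        self.window = window
        self.last_pushed = {}  # job_id -> time of the last throttled push

    def should_push(self, job_id, event_type, now):
        # Abnormal Pod exits are exempt from the limit.
        if event_type == "abnormal_pod_exit":
            return True
        last = self.last_pushed.get(job_id)
        if last is not None and now - last < self.window:
            return False  # another event was already pushed within the window
        self.last_pushed[job_id] = now
        return True
```

Under these assumptions, a second high CPU load or full GC event within the same hour is suppressed, while an abnormal Pod exit is always pushed.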