Viewing Critical Events

Overview
Various events may occur while a job is running, such as job start and stop, job running failure, checkpointing failure, and other exceptions. The Stream Compute Service console provides an events page where you can view and subscribe to critical events.
On the events page, you can select an event type and further filter events by running instance ID and time range. Click Reset filter to clear all filters, restore the defaults, and pull the latest events.
Note
To avoid returning too many events, the time range of a filter is limited to at most 7 days, and it must fall within the past 90 days.
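The time range limit in the note can be sketched as a small check. This is only an illustration: the function name, the constants, and the rule that the range may not end in the future are assumptions, not part of the console or its API.

```python
from datetime import datetime, timedelta

MAX_SPAN = timedelta(days=7)       # longest range one filter may cover
MAX_LOOKBACK = timedelta(days=90)  # how far back events can be filtered


def is_valid_filter_range(start, end, now):
    """Return True if [start, end] fits the limits described in the note."""
    # Assumption: a range must be ordered and must not end in the future.
    if start > end or end > now:
        return False
    if end - start > MAX_SPAN:
        return False
    return now - start <= MAX_LOOKBACK


now = datetime(2024, 1, 31, 12, 0, 0)
print(is_valid_filter_range(now - timedelta(days=7), now, now))  # 7-day range
print(is_valid_filter_range(now - timedelta(days=8), now, now))  # 8-day range
```

A range that is too wide can be split into consecutive windows of up to 7 days and queried one window at a time.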
Event types
Job start and stop
When you click Publish draft on the Development & Testing page of a job, or when the system detects that the job has exited due to a crash, the system tries to start the job and automatically creates a new instance ID for this run. A start event then appears on the events page. When you stop or restart the job, or when it crashes and exits, a stop event with the same instance ID is generated. The job start time and end time are the points at which the job's internal process completes, not the points at which you perform the operation in the UI.
For example, the information in the figure below indicates that the instance was started on 2021-11-10 16:49:30 and stopped on 2021-11-10 16:55:52, either by you or by the system.
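Because a start event and its stop event share one instance ID, the two can be paired to work out how long a run lasted. The sketch below assumes events are available as dictionaries with hypothetical fields instance_id, type, and time; the console does not necessarily expose them in this form, and the timestamps are illustrative.

```python
from datetime import datetime, timedelta


def run_durations(events):
    """Map each instance ID to its run duration, for runs with both events."""
    starts, durations = {}, {}
    for ev in sorted(events, key=lambda e: e["time"]):
        if ev["type"] == "start":
            starts[ev["instance_id"]] = ev["time"]
        elif ev["type"] == "stop" and ev["instance_id"] in starts:
            durations[ev["instance_id"]] = ev["time"] - starts[ev["instance_id"]]
    return durations


# Illustrative data only; "inst-1" is a made-up instance ID.
events = [
    {"instance_id": "inst-1", "type": "start", "time": datetime(2021, 11, 10, 16, 49, 30)},
    {"instance_id": "inst-1", "type": "stop", "time": datetime(2021, 11, 10, 16, 55, 52)},
]
print(run_durations(events))
```

An instance that has a start event but no stop event is simply left out of the result, which matches a run that is still in progress.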
Job running failure and recovery
When a running job restarts or fails (its status changes from RUNNING to RESTARTING or FAILED), a job running failure event is generated. When the job returns to the RUNNING status, a failed job recovery event is generated.
You can select Operation > Solution to view the causes of and solutions to the event. You can also configure alarms for job running failure events.
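The status transitions described in this section can be sketched as a function that turns a sequence of observed job statuses into failure and recovery events. The event names and the function are hypothetical; they only mirror the rule stated above and do not reflect how the service is implemented.

```python
def events_from_statuses(statuses):
    """Derive failure and recovery events from consecutive job statuses."""
    events = []
    failed = False  # True between a failure event and the next recovery
    for prev, curr in zip(statuses, statuses[1:]):
        if prev == "RUNNING" and curr in ("RESTARTING", "FAILED"):
            events.append("job_running_failure")
            failed = True
        elif curr == "RUNNING" and failed:
            events.append("failed_job_recovery")
            failed = False
    return events


print(events_from_statuses(["RUNNING", "RESTARTING", "RUNNING"]))
```

A recovery event is emitted only after a preceding failure, so a job that simply stays in RUNNING produces no events.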
Checkpointing failure and recovery
If checkpointing is enabled for a job and a checkpoint fails to be taken, a checkpoint failure event is generated. If a later checkpoint succeeds, a failed checkpoint recovery event is generated.
You can select Operation > Solution to view the causes of and solutions to the event. You can also configure alarms for checkpoint failure events.
Job exception events (in beta)
The Stream Compute Service backend continuously monitors and analyzes running jobs. When a job encounters a severe exception (such as an overly long TaskManager full GC, excessively high CPU load, or an abnormal Pod exit), the corresponding event is pushed to you for reference, so that you can determine whether the job is running properly.
Note
To avoid unnecessary notifications, at most one job exception event is pushed per hour. Abnormal Pod exit events are not subject to this limit.
This feature is in beta. It detects only severe problems, and its thresholds cannot be adjusted. It will be further improved and upgraded.
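The push limit in the note can be illustrated with a short sketch. It assumes the one-hour limit is measured from the last pushed event; the service may instead count by clock hour. The event kind names are hypothetical.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=1)
EXEMPT = {"abnormal_pod_exit"}  # assumed name for the exempt event kind


def select_pushed(events):
    """Given (time, kind) pairs sorted by time, return those that are pushed."""
    pushed = []
    last_push = None  # time of the last rate-limited event that was pushed
    for time, kind in events:
        if kind in EXEMPT:
            pushed.append((time, kind))  # exempt events are always pushed
        elif last_push is None or time - last_push >= WINDOW:
            pushed.append((time, kind))
            last_push = time
    return pushed


t0 = datetime(2024, 1, 1, 9, 0, 0)
sample = [
    (t0, "full_gc_too_long"),
    (t0 + timedelta(minutes=10), "cpu_load_too_high"),
    (t0 + timedelta(minutes=20), "abnormal_pod_exit"),
    (t0 + timedelta(minutes=61), "cpu_load_too_high"),
]
for time, kind in select_pushed(sample):
    print(time, kind)
```

In this sample the second event is suppressed because it falls within an hour of the first, while the abnormal Pod exit is pushed regardless.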
