Overview
A job failure event in Stream Compute Service indicates that the status of a Flink job has changed from RUNNING to FAILED or RESTARTING, which may interrupt data processing, delay downstream output, and cause other issues.
Conditions
Trigger
1. The status of a Flink job changes from RUNNING to FAILED or RESTARTING. The Flink JobManager then recovers the job within about 10 seconds, and the running instance ID remains unchanged after recovery.
2. A Flink job is restarted too many times or too frequently, exceeding the limit defined in Restart Policies (the threshold is generally controlled by restart-strategy.fixed-delay.attempts and defaults to 5; we recommend increasing it in a production environment, as shown in the sketch after this list). In this case, both the JobManager and the TaskManagers exit, and the system tries to recover the job from the last successful checkpoint within about 2 minutes; the running instance ID is increased by 1 after recovery.
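To raise the restart threshold in a DataStream job, you can set a fixed-delay restart strategy in code. The following is a minimal Java sketch assuming a job built with Flink's StreamExecutionEnvironment; the attempt count of 10 and the 10-second delay are illustrative values, not Stream Compute Service defaults:

import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartStrategyExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Enable checkpointing (every 60 s here) so that a failed job can be
        // recovered from the last successful checkpoint.
        env.enableCheckpointing(60_000);

        // Fixed-delay restart strategy: allow up to 10 restart attempts with a
        // 10-second pause between attempts. These map to the configuration keys
        // restart-strategy.fixed-delay.attempts and restart-strategy.fixed-delay.delay.
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
                10,
                Time.of(10, TimeUnit.SECONDS)));

        // ... build sources, transformations, and sinks here, then call:
        // env.execute("my-job");
    }
}

The same settings can also be supplied through the restart-strategy.* configuration keys instead of in code, which avoids recompiling the job when tuning the threshold.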
Clearing
After Flink or the Stream Compute Service system recovers the job back to RUNNING, a job failure recovery event is generated, marking the end of this event.
Alarms
Suggestions
You can search for exception logs under the instance ID of the job that generated the event, as instructed in Diagnosis with Logs. Generally, the error messages around the keywords from RUNNING to FAILED contain the direct cause of the job failure. We recommend analyzing the issue based on these error messages together with the JobManager and TaskManager logs. If the cause still cannot be found with the above diagnosis, check whether resource overuse exists, as instructed in Viewing Monitoring Information. Focus on critical metrics such as TaskManager CPU usage, heap memory usage, full GC count, and full GC time to check whether any exceptions exist.
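If you can run code inside (or attach to) the TaskManager JVM, the heap and GC signals named above can also be read through the standard JVM MXBeans. This is a minimal sketch using only the java.lang.management API, not a Stream Compute Service API, and the collector names vary by JVM and GC configuration:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class JvmMetricsProbe {
    public static void main(String[] args) {
        // Heap usage: sustained usage close to the maximum often precedes
        // OutOfMemoryError-driven job failures. getMax() may be -1 if undefined.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("Heap used: %d / %d bytes%n", heap.getUsed(), heap.getMax());

        // Cumulative GC count and time per collector; a fast-growing count or time
        // for the old-generation collector corresponds to the full GC count/time metrics.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("GC %s: count=%d, time=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}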