
Diagnosis with Logs
Last updated: 2023-11-07 17:52:44

Overview

In the Stream Compute Service console, two categories of logs are available: start logs and running logs.
Start logs: When an SQL, JAR, or other type of job is submitted to a cluster, the startup process first generates a Flink execution graph. Logs generated during this process are called start logs. If a job fails to start, a yellow triangle with an exclamation mark (⚠️) appears next to its name in the console; hover over it to view details. You can also read the log context of the errors on the logs page.
Running logs: After the execution graph of a job is generated, its JobManager and TaskManagers will be started, and the execution graph will be submitted to the cluster for execution. From this point, the job status becomes "running", and logs printed by the JobManager and TaskManagers are called running logs.

Keywords of common exceptions

Job failure causes

You can search for the phrase from RUNNING to FAILED to identify the direct cause of a job crash. The information following Caused by in the stack trace gives the failure details.
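If the running logs have been saved to a local text file, the same search can be approximated offline. The following is a minimal sketch, not part of the console: the function name find_failure_causes is hypothetical, and it assumes that each new log entry starts with a yyyy-MM-dd timestamp, so that a stack trace ends where the next timestamped line begins.

```python
import re

TRANSITION = "from RUNNING to FAILED"
# Assumption: every new log entry starts with a date such as "2023-11-07 ".
NEW_ENTRY = re.compile(r"^\d{4}-\d{2}-\d{2} ")


def find_failure_causes(log_text):
    """Collect the 'Caused by' lines of each stack trace that follows
    a 'from RUNNING to FAILED' state transition."""
    causes = []
    in_trace = False
    for line in log_text.splitlines():
        if TRANSITION in line:
            # The stack trace starts on the lines after the transition.
            in_trace = True
            continue
        if in_trace:
            if NEW_ENTRY.match(line):
                # A new timestamped entry ends the stack trace.
                in_trace = False
            elif line.startswith("Caused by"):
                causes.append(line.strip())
    return causes
```

The last element of the returned list is usually the innermost cause of the most recent failure, which is often the most useful line to read first.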

OOM

If java.lang.OutOfMemoryError appears, an out-of-memory (OOM) error has probably occurred in the heap memory. In this case, increase the operator parallelism (CUs) of the job and optimize its memory usage to avoid OOM.

JVM exit and other fatal errors

The following keywords are generally followed by a process exit code and can help identify fatal JVM or Akka errors that cause a JVM to be forcibly shut down.
exit code OR shutting down JVM OR fatal OR kill OR killing
For example, the fatal error of ZooKeeper connection loss shown in the figure below matches the keyword fatal.
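For logs saved locally, the OR query above can be expressed as a single regular expression. This is a sketch under assumptions: the helper name fatal_lines is hypothetical, and the whole-word, case-insensitive matching is a guess at how keyword search behaves, not a description of the console's actual search rules.

```python
import re

# Keywords taken from the query above.
FATAL_KEYWORDS = ["exit code", "shutting down JVM", "fatal", "kill", "killing"]

# Assumption: match whole words only and ignore case.
FATAL_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in FATAL_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def fatal_lines(log_text):
    """Return the log lines that contain at least one fatal-error keyword."""
    return [line for line in log_text.splitlines() if FATAL_PATTERN.search(line)]
```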

Checkpoint failure (timeout)

The following keywords indicate that a checkpoint has failed. Analyze the issue based on the specific cause. For example, declined represents a checkpoint failure due to resource unavailability (the job is not running), the existence of FINISHED operators, checkpoint timeout, incomplete checkpoint files, or other reasons.
Checkpoint was declined
Checkpoint was canceled
Checkpoint expired
job has failed
Task has failed
Failure to finalize
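To see which kind of checkpoint failure dominates in a saved log, the keywords above can be counted one by one. The sketch below is illustrative only: count_checkpoint_failures is a hypothetical helper, and the case-insensitive substring matching is an assumption.

```python
from collections import Counter

# Keywords taken from the list above.
CHECKPOINT_KEYWORDS = [
    "Checkpoint was declined",
    "Checkpoint was canceled",
    "Checkpoint expired",
    "job has failed",
    "Task has failed",
    "Failure to finalize",
]


def count_checkpoint_failures(log_text):
    """Count how many log lines contain each checkpoint failure keyword."""
    counts = Counter()
    for line in log_text.splitlines():
        lowered = line.lower()
        for keyword in CHECKPOINT_KEYWORDS:
            # Assumption: matching ignores case.
            if keyword.lower() in lowered:
                counts[keyword] += 1
    return counts
```

Calling count_checkpoint_failures(text).most_common() lists the keywords from most to least frequent, which helps decide which cause to investigate first.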

Timeout/Failure

The following keywords indicate that access to an external system (such as MySQL or Kafka) may have timed out due to a network failure or other reasons. The search results may include a large amount of configuration content, so check whether each hit actually represents an error. For example, for Kafka, Timeout expired while fetching topic metadata indicates an initialization timeout; for MySQL, Communications link failure indicates a disconnection (possibly a client timeout caused by a long period without data inflow).
java.util.concurrent.TimeoutException
timeout
failure
timed out
failed
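Because these keywords also match configuration entries, it can help to drop such lines before reading the hits. The sketch below works on a locally saved log and rests on one loudly stated assumption: that configuration dumps print one "key = value" pair per line. The helper name timeout_candidates and the CONFIG_LINE heuristic are hypothetical and may need adjusting to the actual log layout.

```python
import re

# Keywords taken from the list above, matched case-insensitively (assumption).
TIMEOUT_PATTERN = re.compile(
    r"java\.util\.concurrent\.TimeoutException|timeout|failure|timed out|failed",
    re.IGNORECASE,
)

# Assumption: a configuration entry is a line consisting only of "key = value".
CONFIG_LINE = re.compile(r"^\s*[\w.\-]+\s*=\s*\S*\s*$")


def timeout_candidates(log_text):
    """Return keyword hits that do not look like plain configuration entries."""
    return [
        line
        for line in log_text.splitlines()
        if TIMEOUT_PATTERN.search(line) and not CONFIG_LINE.match(line)
    ]
```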

Exception

Exception indicates that an exception may have occurred. For example, the start logs of a Flink job in the following figure indicate that the job failed to be submitted due to an exception. Searching for Exception displays the specific exceptions that follow Caused by in the stack traces at all levels.
Note
Not all logs containing Exception can be found by search due to keyword segmentation rules.

WARN and ERROR logs

In general, you can search for all logs containing WARN or ERROR. Because this may return many results, filter the information as needed. For example, some logs contain the words WARN or ERROR in their content but do not represent errors.

Ignorable errors

The following common errors in the Stream Compute Service logs do not affect the running of jobs and can be skipped during troubleshooting:
WARN org.apache.flink.core.plugin.PluginConfig - The plugins directory [plugins] does not exist.

WARN org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-00000000.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.

ERROR org.apache.flink.shaded.curator4.org.apache.curator.ConnectionState - Authentication failed

WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

WARN org.apache.flink.kubernetes.utils.KubernetesInitializerUtils - Ship directory /data/workspace/.../shipFiles is not exists. Ignoring it.

WARN org.apache.flink.configuration.GlobalConfiguration - Error while trying to split key and value in configuration file /opt/flink-1.11.0/conf/flink-conf.yaml

WARN org.apache.flink.shaded.curator4.org.apache.curator.utils.ZKPaths - The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.

WARNING: Unable to load JDK7 types (annotations, java.nio.file.Path): no Java7 support added
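When reviewing a saved log, the ignorable messages listed above can be filtered out so that the remaining WARN and ERROR lines stand out. This is a minimal sketch: drop_ignorable is a hypothetical helper, and the fragments are substrings copied from the messages above rather than an official or exhaustive list.

```python
# Substrings copied from the ignorable messages listed above.
IGNORABLE_FRAGMENTS = [
    "The plugins directory [plugins] does not exist.",
    "SASL configuration failed",
    "ConnectionState - Authentication failed",
    "Unable to load native-hadoop library",
    "KubernetesInitializerUtils - Ship directory",
    "Error while trying to split key and value in configuration file",
    "doesn't support Container nodes",
    "Unable to load JDK7 types",
]


def drop_ignorable(lines):
    """Return only the lines that match none of the ignorable fragments."""
    return [
        line
        for line in lines
        if not any(fragment in line for fragment in IGNORABLE_FRAGMENTS)
    ]
```

For example, drop_ignorable(text.splitlines()) keeps the original line order, so the surrounding context of each remaining message is still easy to locate in the full log.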
