Overview
In the Stream Compute Service console, two categories of logs are available: start logs and running logs.
Start logs: When a SQL, JAR, or other type of job is submitted to a cluster, the startup process first generates a Flink execution graph, and the logs generated during this process are called start logs. When a job fails to start, a yellow triangle with an exclamation mark (⚠️) appears next to its name in the console; hover over it to view details. You can also read the log context of the errors on the logs page.
Running logs: After the execution graph of a job is generated, its JobManager and TaskManagers will be started, and the execution graph will be submitted to the cluster for execution. From this point, the job status becomes "running", and logs printed by the JobManager and TaskManagers are called running logs.
Keywords of common exceptions
Job failure causes
Search for from RUNNING to FAILED to identify the direct cause of a job crash. The information following Caused by in the stack trace gives the failure details.
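If you have exported the running logs as text, the same lookup can be sketched offline. The following is a minimal Python sketch, not a console feature: the function name failure_causes and the window parameter (how many lines after the state change to inspect) are assumptions made for illustration.

```python
def failure_causes(log_text, window=200):
    """Return the 'Caused by' lines that follow the first
    'from RUNNING to FAILED' state change in the log text.

    `window` is a hypothetical limit on how many following lines to inspect.
    """
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if "from RUNNING to FAILED" in line:
            following = lines[i + 1 : i + 1 + window]
            # Keep only the nested-cause lines of the stack trace.
            return [l.strip() for l in following
                    if l.lstrip().startswith("Caused by")]
    return []  # no RUNNING -> FAILED transition found
```

The last entry in the returned list is usually the innermost cause, which is often the most useful one to read first.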
OOM
If java.lang.OutOfMemoryError appears, the heap memory has probably run out (OOM). In this case, increase the operator parallelism (CUs) of the job and optimize its memory usage to avoid OOM.
JVM exit and other fatal errors
The following keywords are generally followed by a process exit code and can help identify fatal JVM or Akka errors that cause a JVM to be forcibly shut down.
exit code OR shutting down JVM OR fatal OR kill OR killing
For example, the fatal error caused by a ZooKeeper connection loss shown in the figure below matches the keyword fatal.
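For logs saved to a file, the keyword query above can be approximated with a regular expression. This is a minimal Python sketch under two assumptions that are not stated in the original: matching is case-insensitive, and keywords match only as whole words. The names FATAL_PATTERN and fatal_lines are illustrative.

```python
import re

# Same keywords as the console query; whole-word, case-insensitive matching
# is an assumption about how the search should behave.
FATAL_PATTERN = re.compile(
    r"\b(?:exit code|shutting down JVM|fatal|kill|killing)\b",
    re.IGNORECASE,
)

def fatal_lines(log_text):
    """Return (line_number, line) pairs for lines that match a fatal keyword."""
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if FATAL_PATTERN.search(line)
    ]
```

Whole-word matching keeps unrelated words that merely contain kill from showing up in the results.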
Checkpoint failure (timeout)
The following keywords indicate a checkpoint failure. Analyze the issue based on the specific cause. For example, declined indicates a checkpoint failure due to resource unavailability (the job is not running), the existence of FINISHED operators, a checkpoint timeout, incomplete checkpoint files, or other reasons.
Checkpoint was declined
Checkpoint was canceled
Checkpoint expired
job has failed
Task has failed
Failure to finalize
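To see which kind of checkpoint failure dominates in an exported log, you can count the hits per keyword. The following is a minimal Python sketch: the names CHECKPOINT_KEYWORDS and count_checkpoint_failures are illustrative, and case-insensitive matching is an assumption.

```python
from collections import Counter

# The keyword list from this section.
CHECKPOINT_KEYWORDS = [
    "Checkpoint was declined",
    "Checkpoint was canceled",
    "Checkpoint expired",
    "job has failed",
    "Task has failed",
    "Failure to finalize",
]

def count_checkpoint_failures(log_text):
    """Count how many log lines contain each checkpoint failure keyword."""
    counts = Counter()
    for line in log_text.splitlines():
        lowered = line.lower()
        for keyword in CHECKPOINT_KEYWORDS:
            if keyword.lower() in lowered:  # assumed case-insensitive match
                counts[keyword] += 1
    return counts
```

A high count for Checkpoint expired suggests starting with timeouts, while declined points to the causes listed above.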
Timeout/Failure
The following keywords indicate that access to an external system (such as MySQL or Kafka) may have timed out due to a network failure or other reasons. The search results may include a large amount of configuration content, so check whether each hit actually represents an error. For example, Timeout expired while fetching topic metadata for Kafka indicates an initialization timeout, and Communications link failure for MySQL indicates a disconnection (possibly a client timeout after a long period with no incoming data).
java.util.concurrent.TimeoutException
timeout
failure
timed out
failed
Exception
Exception indicates that an exception may have occurred. For example, the start logs of a Flink job in the following figure indicate that the job failed to be submitted due to an exception. Searching for Exception displays the specific exceptions following Caused by at every level of the stack traces.
Note
Because of keyword segmentation rules, a search may not find every log that contains Exception.
WARN and ERROR logs
In general, you can search for all logs containing WARN or ERROR, which may return many results. Filter the results as needed. For example, some logs contain the strings WARN or ERROR in their message text and do not represent errors.
Ignorable errors
The following common errors in the Stream Compute Service logs do not affect the running of jobs and can be skipped during troubleshooting:
WARN org.apache.flink.core.plugin.PluginConfig - The plugins directory [plugins] does not exist.
WARN org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-00000000.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
ERROR org.apache.flink.shaded.curator4.org.apache.curator.ConnectionState - Authentication failed
WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARN org.apache.flink.kubernetes.utils.KubernetesInitializerUtils - Ship directory /data/workspace/.../shipFiles is not exists. Ignoring it.
WARN org.apache.flink.configuration.GlobalConfiguration - Error while trying to split key and value in configuration file /opt/flink-1.11.0/conf/flink-conf.yaml
WARN org.apache.flink.shaded.curator4.org.apache.curator.utils.ZKPaths - The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.
WARNING: Unable to load JDK7 types (annotations, java.nio.file.Path): no Java7 support added
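When reviewing exported logs, the ignorable entries above can be filtered out before the remaining WARN and ERROR lines are read. The following is a minimal Python sketch: the names IGNORABLE_FRAGMENTS and actionable_warnings are illustrative, and the fragments are substrings taken from the messages listed above, not an official allowlist.

```python
import re

# Substrings copied from the ignorable messages listed in this section.
IGNORABLE_FRAGMENTS = [
    "The plugins directory [plugins] does not exist.",
    "SASL configuration failed",
    "ConnectionState - Authentication failed",
    "Unable to load native-hadoop library",
    "is not exists. Ignoring it.",
    "Error while trying to split key and value in configuration file",
    "doesn't support Container nodes",
    "Unable to load JDK7 types",
]

LEVEL_PATTERN = re.compile(r"\b(?:WARN|WARNING|ERROR)\b")

def actionable_warnings(log_text):
    """Return WARN/WARNING/ERROR lines that are not known to be ignorable."""
    result = []
    for line in log_text.splitlines():
        if not LEVEL_PATTERN.search(line):
            continue  # not a warning or error line
        if any(fragment in line for fragment in IGNORABLE_FRAGMENTS):
            continue  # known harmless message
        result.append(line)
    return result
```

This filter matches only on substrings, so a line that merely mentions WARN or ERROR in its message text is still returned and needs a manual check.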