Category | Event Name | Description | Recommendations and Measures | Default Value | Severity | Disabling Allowed | Enabled by Default |
Node | The CPU utilization exceeds the threshold continuously | The server CPU utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=85, t=1,800 | Severe | Yes | Yes |
| The average CPU utilization exceeds the threshold | The average server CPU utilization in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Expand the node capacity or upgrade the node | m=85, t=1,800 | Moderate | Yes | No |
| The average CPU iowait utilization exceeds the threshold | The average CPU iowait utilization of the server in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Manually troubleshoot the issue | m=60, t=1,800 | Severe | Yes | Yes |
| The 1-minute CPU load exceeds the threshold continuously | The 1-minute CPU load has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=8, t=1,800 | Moderate | Yes | No |
| The 5-minute CPU load exceeds the threshold continuously | The 5-minute CPU load has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=8, t=1,800 | Severe | Yes | No |
| The memory utilization exceeds the threshold continuously | The memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=85, t=1,800 | Severe | Yes | Yes |
| The swap space exceeds the threshold continuously | The server swap memory has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=0.1, t=300 | Moderate | Yes | No |
| The total number of system processes exceeds the threshold continuously | The total number of system processes has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Manually troubleshoot the issue | m=10,000, t=1,800 | Severe | Yes | Yes |
| The average total number of fork subprocesses exceeds the threshold | The average total number of fork subprocesses in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Manually troubleshoot the issue | m=5,000, t=1,800 | Moderate | Yes | No |
| A process OOM error occurred | An OOM error occurred in the process | Adjust the process heap memory size | - | Severe | No | Yes |
| A disk I/O error occurred (this event is not supported currently) | A disk I/O error occurred | Replace the disk | - | Fatal | Yes | Yes |
| The average disk utilization exceeds the threshold continuously | The average disk space utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=85, t=1,800 | Severe | Yes | Yes |
| The average disk I/O utilization exceeds the threshold continuously | The average disk I/O utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=85, t=1,800 | Severe | Yes | Yes |
| The node file handle utilization exceeds the threshold continuously | The node file handle utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Manually troubleshoot the issue | m=85, t=1,800 | Moderate | Yes | No |
| The number of TCP connections to the node exceeds the threshold continuously | The number of TCP connections to the node has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Check whether there are connection leaks | m=10,000, t=1,800 | Moderate | Yes | No |
| The configured node memory utilization exceeds the threshold | The memory utilization configured for all roles on the node exceeds the node's physical memory threshold | Adjust the allocated node process heap memory | 90% | Severe | Yes | No |
| The node process is unavailable | The node service process is unavailable | View the service logs to find out why the service process failed to start | - | Moderate | Yes | Yes |
| The node heartbeat is missing | The node heartbeat failed to be reported regularly | Manually troubleshoot the issue | - | Fatal | No | Yes |
| The hostname is incorrect | The node's hostname is incorrect | Manually troubleshoot the issue | - | Fatal | No | Yes |
| Failed to ping the metadatabase | The TencentDB instance heartbeat failed to be reported regularly | - | - | - | - | - |
| The utilization of a single disk exceeds the threshold continuously | The single disk space utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=0.85, t=1,800 | Severe | Yes | Yes |
| The I/O utilization of a single disk exceeds the threshold continuously | The single disk I/O device utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=0.85, t=1,800 | Severe | Yes | Yes |
| The single disk inodes utilization exceeds the threshold continuously | The single disk inodes utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=0.85, t=1,800 | Severe | Yes | Yes |
| The difference between the UTC time and NTP time of the server exceeds the threshold | The difference between the UTC time and NTP time of the server exceeds the threshold (in ms) | 1. Make sure that the NTP daemon is running. 2. Make sure that the network communication with the NTP server is normal (see the NTP check example after the table) | Difference=30000 | Severe | Yes | Yes |
| Automatic node replenishment | If automatic node replenishment is enabled, when any exceptions in task and router nodes are detected, the system automatically purchases nodes of the same model to replace the affected nodes. | 1. If the replenishment is successful, no more attention is required. 2. If the replenishment fails, manually terminate the affected nodes in the console and purchase new nodes to replace them. | - | Moderate | Yes | Yes |
| Node failure | Faulty nodes exist in a cluster | Handle the faulty nodes promptly | - | Severe | No | Yes |
HDFS | The total number of HDFS files exceeds the threshold continuously | The total number of files in the cluster has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Increase the NameNode memory | m=50,000,000, t=1,800 | Severe | Yes | No |
| The average total number of HDFS files exceeds the threshold | The average total number of files in the cluster in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Increase the NameNode memory | m=50,000,000, t=1,800 | Severe | Yes | No |
| The total number of HDFS blocks exceeds the threshold continuously | The total number of blocks in the cluster has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Increase the NameNode memory or the block size | m=50,000,000, t=1,800 | Severe | Yes | No |
| The average total number of HDFS blocks exceeds the threshold | The average total number of HDFS blocks in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Increase the NameNode memory or the block size | m=50,000,000, t=1,800 | Severe | Yes | No |
| The number of HDFS data nodes marked as dead exceeds the threshold continuously | The number of data nodes marked as dead has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Manually troubleshoot the issue | m=1, t=1,800 | Moderate | Yes | No |
| The HDFS storage space utilization exceeds the threshold continuously | The HDFS storage space utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Clear files in HDFS or expand the cluster capacity | m=85, t=1,800 | Severe | Yes | Yes |
| The average HDFS storage space utilization exceeds the threshold | The average HDFS storage space utilization in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Clear files in HDFS or expand the cluster capacity | m=85, t=1,800 | Severe | Yes | No |
| Active/Standby NameNodes were switched | Active/Standby NameNodes were switched | Locate the cause of NameNode switch | - | Severe | Yes | Yes |
| The NameNode RPC request processing latency exceeds the threshold continuously | The RPC request processing latency has been greater than or equal to m milliseconds for t (300 ≤ t ≤ 2,592,000) seconds continuously | Manually troubleshoot the issue | m=300, t=300 | Severe | Yes | No |
| The number of current NameNode connections exceeds the threshold continuously | The number of current NameNode connections has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Manually troubleshoot the issue | m=2,000, t=1,800 | Moderate | Yes | No |
| A full GC event occurred on a NameNode | A full GC event occurred on a NameNode | Fine-tune the parameter settings | - | Severe | Yes | Yes |
| The NameNode JVM memory utilization exceeds the threshold continuously | The NameNode JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the NameNode heap memory size | m=85, t=1,800 | Severe | Yes | Yes |
| The DataNode RPC request processing latency exceeds the threshold continuously | The RPC request processing latency has been greater than or equal to m milliseconds for t (300 ≤ t ≤ 2,592,000) seconds continuously | Manually troubleshoot the issue | m=300, t=300 | Moderate | Yes | No |
| The number of current DataNode connections exceeds the threshold continuously | The number of current DataNode connections has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Manually troubleshoot the issue | m=2,000, t=1,800 | Moderate | Yes | No |
| A full GC event occurred on a DataNode | A full GC event occurred on a DataNode | Fine-tune the parameter settings | - | Moderate | Yes | No |
| The DataNode JVM memory utilization exceeds the threshold continuously | The DataNode JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the DataNode heap memory size | m=85, t=1,800 | Moderate | Yes | Yes |
| Both NameNodes of HDFS are in Standby service status | Both NameNode roles are in Standby status at the same time | Manually troubleshoot the issue | - | Severe | Yes | Yes |
| The number of HDFS missing blocks exceeds the threshold | The number of missing blocks in the cluster has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | We recommend that you check for HDFS data block corruption and run the hadoop fsck / command to inspect the HDFS file distribution (see the fsck example after the table) | m=1, t=1,800 | Severe | Yes | Yes |
| The HDFS NameNode entered the safe mode | The NameNode entered the safe mode (for 300 seconds continuously) | We recommend that you check for HDFS data block corruption and run the hadoop fsck / command to inspect the HDFS file distribution (see the fsck example after the table) | - | Severe | Yes | Yes |
YARN | The number of currently missing NodeManagers in the cluster exceeds the threshold continuously | The number of currently missing NodeManagers in the cluster has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Check the NodeManager process status and check whether the network connection is smooth | m=1, t=1,800 | Moderate | Yes | No |
| The number of pending containers exceeds the threshold continuously | The number of pending containers has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Reasonably specify resources that can be used by YARN jobs | m=90, t=1,800 | Moderate | Yes | No |
| The cluster memory utilization exceeds the threshold continuously | The memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Scale out the cluster | m=85, t=1,800 | Severe | Yes | Yes |
| The average cluster memory utilization exceeds the threshold | The average memory utilization in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Scale out the cluster | m=85, t=1,800 | Severe | Yes | No |
| The cluster CPU utilization exceeds the threshold continuously | The CPU utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Scale out the cluster | m=85, t=1,800 | Severe | Yes | Yes |
| The average cluster CPU utilization exceeds the threshold | The average CPU utilization in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Scale out the cluster | m=85, t=1,800 | Severe | Yes | No |
| The number of available CPU cores in each queue is below the threshold continuously | The number of available CPU cores in each queue has been less than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Assign more resources to the queue | m=1, t=1,800 | Moderate | Yes | No |
| The available memory in each queue is below the threshold continuously | The available memory in each queue has been less than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Assign more resources to the queue | m=1,024, t=1,800 | Moderate | Yes | No |
| Active/Standby ResourceManagers were switched | Active/Standby ResourceManagers were switched | Check the ResourceManager process status and view the standby ResourceManager logs to locate the cause of active/standby switch | - | Severe | Yes | Yes |
| A full GC event occurred in a ResourceManager | A full GC event occurred in a ResourceManager | Fine-tune the parameter settings | - | Severe | Yes | Yes |
| The ResourceManager JVM memory utilization exceeds the threshold continuously | The ResourceManager JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the ResourceManager heap memory size | m=85, t=1,800 | Severe | Yes | Yes |
| A full GC event occurred in a NodeManager | A full GC event occurred in a NodeManager | Fine-tune the parameter settings | - | Moderate | Yes | No |
| The available memory in NodeManager is below the threshold continuously | The available memory in a single NodeManager has been less than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the NodeManager heap memory size | m=1, t=1,800 | Moderate | Yes | No |
| The NodeManager JVM memory utilization exceeds the threshold continuously | The NodeManager JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the NodeManager heap memory size | m=85, t=1,800 | Moderate | Yes | No |
HBase | The number of regions in RIT status in the cluster exceeds the threshold continuously | The number of regions in RIT status in the cluster has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | If the HBase version is below 2.0, run hbase hbck -fixAssignments (see the hbck example after the table) | m=1, t=60 | Severe | Yes | Yes |
| The number of dead RegionServers exceeds the threshold continuously | The number of dead RegionServers has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Manually troubleshoot the issue | m=1, t=300 | Moderate | Yes | Yes |
| The average number of regions in each RegionServer in the cluster exceeds the threshold continuously | The average number of regions in each RegionServer in the cluster has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=300, t=1,800 | Moderate | Yes | Yes |
| A full GC event occurred on HMaster | A full GC event occurred on HMaster | Fine-tune the parameter settings | m=5, t=300 | Moderate | Yes | Yes |
| The HMaster JVM memory utilization exceeds the threshold continuously | The HMaster JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the HMaster heap memory size | m=85, t=1,800 | Severe | Yes | Yes |
| The number of current HMaster connections exceeds the threshold continuously | The number of current HMaster connections has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Manually troubleshoot the issue | m=1,000, t=1,800 | Moderate | Yes | No |
| A full GC event occurred in RegionServer | A full GC event occurred in RegionServer | Fine-tune the parameter settings | m=5, t=300 | Severe | Yes | No |
| The RegionServer JVM memory utilization exceeds the threshold continuously | The RegionServer JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the RegionServer heap memory size | m=85, t=1,800 | Moderate | Yes | No |
| The number of current RPC connections to RegionServer exceeds the threshold continuously | The number of current RPC connections to RegionServer has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Manually troubleshoot the issue | m=1,000, t=1,800 | Moderate | Yes | No |
| The number of RegionServer StoreFiles exceeds the threshold continuously | The number of RegionServer StoreFiles has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Run a major compaction (see the compaction example after the table) | m=50,000, t=1,800 | Moderate | Yes | No |
| A full GC event occurred in HBase Thrift | A full GC event occurred in HBase Thrift | Fine-tune the parameter settings | m=5, t=300 | Severe | Yes | No |
| The HBase Thrift JVM memory utilization exceeds the threshold continuously | The HBase Thrift JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the HBase Thrift heap memory size | m=85, t=1,800 | Moderate | Yes | No |
| Both HMasters of HBase are in Standby service status | Both HMaster roles are in Standby status at the same time | Manually troubleshoot the issue | - | Severe | Yes | Yes |
Hive | A full GC event occurred in HiveServer2 | A full GC event occurred in HiveServer2 | Fine-tune the parameter settings | m=5, t=300 | Severe | Yes | Yes |
| The HiveServer2 JVM memory utilization exceeds the threshold continuously | The HiveServer2 JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the HiveServer2 heap memory size | m=85, t=1,800 | Severe | Yes | Yes |
| A full GC event occurred in HiveMetaStore | A full GC event occurred in HiveMetaStore | Fine-tune the parameter settings | m=5, t=300 | Moderate | Yes | Yes |
| A full GC event occurred in HiveWebHcat | A full GC event occurred in HiveWebHcat | Fine-tune the parameter settings | m=5, t=300 | Moderate | Yes | Yes |
ZooKeeper | The number of ZooKeeper connections exceeds the threshold continuously | The number of ZooKeeper connections has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Manually troubleshoot the issue | m=65,535, t=1,800 | Moderate | Yes | No |
| The number of ZNodes exceeds the threshold continuously | The number of ZNodes has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Manually troubleshoot the issue | m=2,000, t=1,800 | Moderate | Yes | No |
Impala | The ImpalaCatalog JVM memory utilization exceeds the threshold continuously | The ImpalaCatalog JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Adjust the ImpalaCatalog heap memory size | m=0.85, t=1,800 | Moderate | Yes | No |
| The Impala daemon JVM memory utilization exceeds the threshold continuously | The Impala daemon JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Adjust the Impala daemon heap memory size | m=0.85, t=1,800 | Moderate | Yes | No |
| The number of Impala Beeswax API client connections exceeds the threshold | The number of Impala Beeswax API client connections has been greater than or equal to m | Adjust the value of fe_service_threads in the impalad.flags configuration in the console (see the impalad flag example after the table) | m=64, t=120 | Severe | Yes | Yes |
| The number of Impala HiveServer2 client connections exceeds the threshold | The number of Impala HiveServer2 client connections has been greater than or equal to m | Adjust the value of fe_service_threads in the impalad.flags configuration in the console (see the impalad flag example after the table) | m=64, t=120 | Severe | Yes | Yes |
| The query execution duration exceeds the threshold | The query execution duration exceeds m seconds | Manually troubleshoot the issue | - | Severe | Yes | No |
| The total number of failed queries exceeds the threshold | The total number of failed queries has been greater than or equal to m for t seconds (300 ≤ t ≤ 604,800) | Manually troubleshoot the issue | m=1, t=300 | Severe | Yes | No |
| The total number of committed queries exceeds the threshold | The total number of committed queries has been greater than or equal to m for t seconds (300 ≤ t ≤ 604,800) | Manually troubleshoot the issue | m=1, t=300 | Severe | Yes | No |
| The query execution failure rate exceeds the threshold | The query execution failure rate has been greater than or equal to m for t seconds (300 ≤ t ≤ 604,800) | Manually troubleshoot the issue | m=1, t=300 | Severe | Yes | No |
PrestoSQL | The current number of failed PrestoSQL nodes exceeds the threshold continuously | The current number of failed PrestoSQL nodes has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Manually troubleshoot the issue | m=1, t=1,800 | Severe | Yes | Yes |
| The number of queuing resources in the current PrestoSQL resource group exceeds the threshold continuously | The number of queuing tasks in the PrestoSQL resource group has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Fine-tune the parameter settings | m=5,000, t=1,800 | Severe | Yes | Yes |
| The number of failed PrestoSQL queries exceeds the threshold | The number of failed PrestoSQL queries is greater than or equal to m | Manually troubleshoot the issue | m=1, t=1,800 | Severe | Yes | No |
| A full GC event occurred in a PrestoSQLCoordinator | A full GC event occurred in a PrestoSQLCoordinator | Fine-tune the parameter settings | - | Moderate | Yes | No |
| The PrestoSQLCoordinator JVM memory utilization exceeds the threshold continuously | The PrestoSQLCoordinator JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Adjust the PrestoSQLCoordinator heap memory size | m=0.85, t=1,800 | Severe | Yes | Yes |
| A full GC event occurred on a PrestoSQL worker | A full GC event occurred on a PrestoSQL worker | Fine-tune the parameter settings | - | Moderate | Yes | No |
| The PrestoSQLWorker JVM memory utilization exceeds the threshold continuously | The PrestoSQLWorker JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Adjust the PrestoSQLWorker heap memory size | m=0.85, t=1,800 | Severe | Yes | No |
Presto | The current number of failed Presto nodes exceeds the threshold continuously | The current number of failed Presto nodes has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Manually troubleshoot the issue | m=1, t=1,800 | Severe | Yes | Yes |
| The number of queuing resources in the current Presto resource group exceeds the threshold continuously | The number of queuing tasks in the Presto resource group has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Fine-tune the parameter settings | m=5,000, t=1,800 | Severe | Yes | Yes |
| The number of failed Presto queries exceeds the threshold | The number of failed Presto queries is greater than or equal to m | Manually troubleshoot the issue | m=1, t=1,800 | Severe | Yes | No |
| A full GC event occurred on a Presto coordinator | A full GC event occurred on a Presto coordinator | Fine-tune the parameter settings | - | Moderate | Yes | No |
| The Presto coordinator JVM memory utilization exceeds the threshold continuously | The Presto coordinator JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Adjust the Presto coordinator heap memory size | m=0.85, t=1,800 | Moderate | Yes | Yes |
| A full GC event occurred on a Presto worker | A full GC event occurred on a Presto worker | Fine-tune the parameter settings | - | Moderate | Yes | No |
| The Presto worker JVM memory utilization exceeds the threshold continuously | The Presto worker JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Adjust the Presto worker heap memory size | m=0.85, t=1,800 | Severe | Yes | No |
Alluxio | The current total number of Alluxio workers is below the threshold continuously | The current total number of Alluxio workers has been less than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Manually troubleshoot the issue | m=1, t=1,800 | Severe | Yes | No |
| The utilization of the capacity on all tiers of the current Alluxio worker exceeds the threshold | The utilization of the capacity on all tiers of the current Alluxio worker has been greater than or equal to the threshold for t (300 ≤ t ≤ 604,800) seconds continuously | Fine-tune the parameter settings | m=0.85, t=1,800 | Severe | Yes | No |
| A full GC event occurred on an Alluxio master | A full GC event occurred on an Alluxio master | Manually troubleshoot the issue | - | Moderate | Yes | No |
| The Alluxio master JVM memory utilization exceeds the threshold continuously | The Alluxio master JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Adjust the Alluxio master heap memory size | m=0.85, t=1,800 | Severe | Yes | Yes |
| A full GC event occurred on an Alluxio worker | A full GC event occurred on an Alluxio worker | Manually troubleshoot the issue | - | Moderate | Yes | No |
| The Alluxio worker JVM memory utilization exceeds the threshold continuously | The Alluxio worker JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Adjust the Alluxio worker heap memory size | m=0.85, t=1,800 | Severe | Yes | Yes |
Kudu | The cluster replica skew exceeds the threshold | The cluster replica skew has been greater than or equal to the threshold for t (300 ≤ t ≤ 3,600) seconds continuously | Run the rebalance command to balance the replicas (see the rebalance example after the table) | m=100, t=300 | Moderate | Yes | Yes |
| The hybrid clock error exceeds the threshold | The hybrid clock error has been greater than or equal to the threshold for t (300 ≤ t ≤ 3,600) seconds continuously | Make sure that the NTP daemon is running and the network communication with the NTP server is normal | m=5,000,000, t=300 | Moderate | Yes | Yes |
| The number of running tablets exceeds the threshold | The number of running tablets has been greater than or equal to m for t (300 ≤ t ≤ 3,600) seconds continuously | Too many tablets on a node can affect the performance. We recommend you clear unnecessary tables and partitions or expand the capacity as needed. | m=1,000, t=300 | Moderate | Yes | Yes |
| The number of failed tablets exceeds the threshold | The number of failed tablets has been greater than or equal to m for t (300 ≤ t ≤ 3,600) seconds continuously | Check whether any disk is unavailable or data file is corrupted | m=1, t=300 | Moderate | Yes | Yes |
| The number of failed data directories exceeds the threshold | The number of failed data directories has been greater than or equal to m for t (300 ≤ t ≤ 3,600) seconds continuously | Check whether the path configured in the fs_data_dirs parameter is available | m=1, t=300 | Severe | Yes | Yes |
| The number of full data directories exceeds the threshold | The number of full data directories has been greater than or equal to m for t (120 ≤ t ≤ 3,600) seconds continuously | Clear unnecessary data files or expand the capacity as needed | m=1, t=120 | Severe | Yes | Yes |
| The number of write requests rejected due to queue overloading exceeds the threshold | The number of write requests rejected due to queue overloading has been greater than or equal to m for t (300 ≤ t ≤ 3,600) seconds continuously | Check for write hotspots and check whether the number of worker threads is too small | m=10, t=300 | Moderate | Yes | No |
| The number of expired scanners exceeds the threshold | The number of expired scanners has been greater than or equal to m for t (300 ≤ t ≤ 3,600) seconds continuously | Be sure to call the method for closing a scanner after reading data | m=100, t=300 | Moderate | Yes | Yes |
| The number of error logs exceeds the threshold | The number of error logs has been greater than or equal to m for t (300 ≤ t ≤ 3,600) seconds continuously | Manually troubleshoot the issue | m=10, t=300 | Moderate | Yes | Yes |
| The number of RPC requests that timed out while waiting in the queue exceeds the threshold | The number of RPC requests that timed out while waiting in the queue has been greater than or equal to m for t (300 ≤ t ≤ 3,600) seconds continuously | Check whether the system load is too high | m=100, t=300 | Moderate | Yes | Yes |
Kerberos | The Kerberos response time exceeds the threshold | The Kerberos response time has been greater than or equal to m (ms) for t (300 ≤ t ≤ 604,800) seconds continuously | Manually troubleshoot the issue | m=100, t=1,800 | Severe | Yes | Yes |
Cluster | The auto scaling policy has failed | 1. The scale-out rule failed due to insufficient subnet EIPs associated with the cluster. 2. The scale-out rule failed due to insufficient expansion resource inventory of the preset specifications. 3. The scale-out rule failed due to insufficient account balance. 4. An internal error occurred. | 1. Switch to another subnet in the same VPC. 2. Switch to specifications of resources that are sufficient or submit a ticket to contact developers. 3. Top up the account to ensure that the account balance is sufficient. 4. Submit a ticket to contact developers. | - | Severe | No | Yes |
| The execution of the auto scaling policy has timed out | 1. Scaling cannot be performed temporarily as the cluster is in the cooldown period. 2. Scaling is not triggered because the retry period upon expiration is too short. 3. The cluster in the current status cannot be scaled out. | 1. Adjust the cooldown period for the scaling rule. 2. Extend the retry period upon expiration. 3. Try again later or submit a ticket to contact developers. | - | Severe | No | Yes |
| The auto scaling policy is not triggered | 1. The scale-out rule cannot be triggered because no expansion resource specification is set. 2. The scale-out rule cannot be triggered because the maximum number of nodes for elastic resources is reached. 3. The scale-in rule cannot be triggered because the minimum number of nodes for elastic resources is reached. 4. The time range for scaling has expired. 5. The scale-in rule cannot be triggered because there are no elastic resources in the cluster. | 1. Set at least one elastic resource specification for the rule. 2. Modify the maximum number of nodes to continue scaling out if the upper limit is reached. 3. Modify the minimum number of nodes to continue scaling in if the lower limit is reached. 4. Modify the effective time range of the rule if you want to continue using the rule for auto scaling. 5. Execute the scale-in rule after adding elastic resources. | - | Moderate | Yes | Yes |
| Auto scaling partially succeeded | 1. Only partial resources were supplemented because the resource inventory was less than the required quantity for scale-out. 2. Only partial resources were supplemented because the required quantity for scale-out exceeded the actual quantity of resources delivered. 3. The scale-out rule was partially successful because the maximum number of nodes for elastic resources was reached. 4. The scale-in rule was partially successful because the minimum number of nodes for elastic resources was reached. 5. The resource supplement failed due to insufficient subnet EIPs associated with the cluster. 6. The resource supplement failed due to insufficient expansion resource inventory of the preset specifications. 7. The resource supplement failed due to insufficient account balance. | 1. Use the available resources for manual scaling to supplement the resources for auto scaling. 2. Use the available resources for manual scaling to supplement the resources for auto scaling. 3. Modify the maximum number of nodes to continue scaling out if the upper limit is reached. 4. Modify the minimum number of nodes to continue scaling in if the lower limit is reached. 5. Switch to another subnet in the same VPC. 6. Switch to specifications of resources that are sufficient or submit a ticket to contact developers. 7. Top up the account to ensure that the account balance is sufficient. | - | Moderate | Yes | Yes |
| The node process is unavailable | The node process is unavailable | Manually troubleshoot the issue | - | Moderate | No | Yes |
| The process is killed by OOMKiller | The process was killed by OOMKiller due to an OOM error | Adjust the process heap memory size | - | Severe | No | Yes |
| A JVM or OLD exception occurred | A JVM or OLD exception occurred | Manually troubleshoot the issue | 1. The OLD utilization reaches 80% for 5 consecutive minutes, or 2. The JVM memory utilization reaches 90% | Severe | Yes | Yes |
| Timeout of service role health status occurred | The service role health status has timed out for t seconds (180 ≤ t ≤ 604,800) | The service role health status has timed out. To resolve this issue, check the logs of the corresponding service role and take the necessary actions. | t=300 | Moderate | Yes | No |
| A service role health status exception occurred | The service role health status has been abnormal for t seconds (180 ≤ t ≤ 604,800) | The service role health status is unavailable. To resolve this issue, check the logs of the corresponding service role and take the necessary actions. | t=300 | Severe | Yes | Yes |
| Auto scaling failed | This alert indicates that the auto scaling process has failed (either completely or partially) | Manually troubleshoot the issue | - | Severe | No | Yes |
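For the server NTP-offset event (and the Kudu hybrid clock error event), a quick way to confirm that the NTP daemon is running and can reach its servers is sketched below. Which daemon is installed (ntpd or chrony) depends on the node image, so treat the exact service names as assumptions.

```
# ntpd-based nodes: check the daemon and its peer offsets
systemctl status ntpd
ntpq -p

# chrony-based nodes: check the daemon and the current offset
systemctl status chronyd
chronyc tracking
```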
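For the HDFS missing-block and safe-mode events, the fsck check quoted in the table can be run from any node with an HDFS client; the extra options below are standard fsck/dfsadmin options shown only for illustration.

```
# Overall file system health report (command quoted in the table)
hadoop fsck /

# List only the corrupt file blocks
hdfs fsck / -list-corruptfileblocks

# After the blocks are repaired, check (and, if necessary, leave) safe mode
hdfs dfsadmin -safemode get
hdfs dfsadmin -safemode leave
```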
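For regions stuck in RIT on HBase versions below 2.0, a minimal repair sketch using the command quoted in the table:

```
# Inspect the cluster and region states first
echo "status 'detailed'" | hbase shell

# Try to fix region assignments (applies to HBase versions below 2.0)
hbase hbck -fixAssignments
```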
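For the RegionServer StoreFile event, a major compaction can be triggered from the HBase shell; 'your_table' below is a placeholder table name.

```
# Trigger a major compaction for one table (placeholder table name)
echo "major_compact 'your_table'" | hbase shell
```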
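For the Impala Beeswax and HiveServer2 connection events, the client-serving thread pool is sized by the impalad startup flag shown below; the value 128 is only an example, and the flag should be changed through the console's impalad configuration rather than edited on the nodes directly.

```
# Example impalad startup flag (set via the console's impalad flags configuration)
--fe_service_threads=128
```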
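For the Kudu replica-skew event, the rebalance command referenced in the table is run against the cluster's master addresses; the addresses below are placeholders.

```
# Check the cluster health and replica distribution first
kudu cluster ksck master-1:7051,master-2:7051,master-3:7051

# Rebalance tablet replicas across tablet servers
kudu cluster rebalance master-1:7051,master-2:7051,master-3:7051
```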