Metric Dimension | Metric Name | Unit | Metric Description |
Node resource | CPU usage percentage | Percentage | Ratio of the node's current CPU usage to the Pod CPU request value on the node (the CPU specification selected during instance creation). |
Node resource | File system read rate | MiBytes/s | Volume of data read from the node data disk per second. |
Node resource | File system write rate | MiBytes/s | Volume of data written to the node data disk per second. |
Node resource | Memory usage percentage | Percentage | Ratio of the node's current memory usage to the Pod memory request value on the node (the memory specification selected during instance creation). |
Node resource | Total memory usage | MiBytes | Memory usage of the node. |
Node resource | Network receiving rate | MiBytes/s | Data receiving rate of the node NIC. |
Node resource | Network sending rate | MiBytes/s | Data sending rate of the node NIC. |
Application metric | Number of database keys | Count | Number of keys on the node. The data is obtained from etcd metrics. Calculation formula: etcd_debugging_mvcc_keys_total{job="$job"}. |
Application metric | Database MVCC writes | Times | Number of data writes on the node. Calculation formula: etcd_mvcc_put_total{job="$job"}. |
Application metric | Database size | MiBytes | Database size calculated on the node. Calculation formula: etcd_debugging_mvcc_db_total_size_in_bytes{job="$job"}. |
Application metric | Consensus proposal application rate | Time/s | Rate at which committed consensus proposals are applied. The gap between committed and applied proposals is usually small (within a few thousand even under high load). If the gap keeps growing, the etcd server is overloaded, which may be caused by expensive queries (such as large-range queries or large txn operations). Calculation formula: rate(etcd_server_proposals_applied_total{job="$job"}[5m]). |
Application metric | Consensus proposal commit rate | Time/s | Rate at which consensus proposals are committed. The underlying counter normally increases over time. If a single member lags far behind the leader for a long time, the member is running slowly or is unhealthy. Calculation formula: rate(etcd_server_proposals_committed_total{job="$job"}[5m]). |
Application metric | Total number of queued consensus proposals | Count | An increase in this value indicates that the client load is high or that the member cannot commit proposals. Calculation formula: etcd_server_proposals_pending{job="$job"}. |
Application metric | Growth rate of failed consensus proposals | Time/s | Failed proposals are usually related to one of two issues: temporary failures during a leader election, or longer-term failures caused by a loss of quorum in the cluster. Calculation formula: rate(etcd_server_proposals_failed_total{job="$job"}[5m]). |
Instance-level metric | Cluster leader existence | Boolean value | If there is no leader, the instance is unavailable. Calculation formula: max(etcd_server_has_leader{job="$job"}). |
Instance-level metric | Total number of leader switches | Times | Frequent leader changes greatly affect etcd performance and may be caused by network connection issues or high load on the etcd cluster. Calculation formula: max(etcd_server_leader_changes_seen_total{job="$job"}). Note: This metric is a cumulative count since the cluster was successfully created. It is independent of the alarm period. |
Instance API | gRPC call rate | Time/s | Rate of gRPC calls, broken down by method. Calculation formula: sum(rate(grpc_server_handled_total{job="$job"}[1m])) by (job,grpc_method,instance). |
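
The calculation formulas in the table are PromQL expressions in which `$job` is a placeholder for the job label of the instance. As a minimal sketch, assuming the metrics are also exposed through a Prometheus-compatible server, the following Python snippet substitutes a concrete job name into a few of the formulas and builds a URL for the standard Prometheus instant-query endpoint (`/api/v1/query`). The dictionary keys, the helper names, the job name `etcd-demo`, and the base URL are hypothetical and chosen only for illustration.

```python
from urllib.parse import urlencode

# Formulas copied from the table above; "$job" is the placeholder used there.
# The dictionary keys are hypothetical short names, not official metric names.
FORMULAS = {
    "pending_proposals": 'etcd_server_proposals_pending{job="$job"}',
    "failed_proposal_rate": 'rate(etcd_server_proposals_failed_total{job="$job"}[5m])',
    "has_leader": 'max(etcd_server_has_leader{job="$job"})',
}


def render_query(metric: str, job: str) -> str:
    """Replace the $job placeholder with a concrete job label value."""
    return FORMULAS[metric].replace("$job", job)


def instant_query_url(base_url: str, metric: str, job: str) -> str:
    """Build a URL for the Prometheus instant-query endpoint (/api/v1/query)."""
    params = urlencode({"query": render_query(metric, job)})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


if __name__ == "__main__":
    # Hypothetical server address and job name.
    print(render_query("has_leader", "etcd-demo"))
    print(instant_query_url("http://prometheus.example:9090", "has_leader", "etcd-demo"))
```

The printed URL can be fetched with any HTTP client. Whether a given deployment exposes such an endpoint, and under which address, is an assumption that depends on the environment.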