| Rule Name | Rule Expression | Duration | Description |
| --- | --- | --- | --- |
| Error with client access to APIServer | `(sum(rate(rest_client_requests_total{code=~"5.."}[5m])) by (instance, job, cluster_id) / sum(rate(rest_client_requests_total[5m])) by (instance, job, cluster_id)) > 0.01` | 15m | The error rate of client requests to the APIServer is above 1% |
| Imminent expiration of the client certificate for APIServer access | `apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on(job) histogram_quantile(0.01, sum by (cluster_id, job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 86400` | None | The client certificate used to access the APIServer will expire within 24 hours |
| Aggregated API error | `sum by(cluster_id, name, namespace) (increase(aggregator_unavailable_apiservice_count[5m])) > 2` | None | An aggregated API reported errors in the last 5 minutes |
| Low aggregated API availability | `(1 - max by(name, namespace, cluster_id)(avg_over_time(aggregator_unavailable_apiservice[5m]))) * 100 < 90` | 5m | The availability of an aggregated API service was below 90% in the last 5 minutes |
| APIServer fault | `absent(sum(up{job="apiserver"}) by (cluster_id) > 0)` | 5m | The APIServer has disappeared from the scrape targets |
| Scheduler fault | `absent(sum(up{job="kube-scheduler"}) by (cluster_id) > 0)` | 15m | The scheduler has disappeared from the scrape targets |
| Controller manager fault | `absent(sum(up{job="kube-controller-manager"}) by (cluster_id) > 0)` | 15m | The controller manager has disappeared from the scrape targets |
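
Each row in these tables maps onto a standard Prometheus alerting rule: the rule expression becomes `expr` and the "Duration" column becomes the `for` clause. Below is a minimal sketch using the APIServer client error-rate rule from the first row; the group name, alert name, severity label, and annotation text are illustrative and may differ from what the managed service actually uses.

```yaml
groups:
  - name: apiserver.rules                      # illustrative group name
    rules:
      - alert: APIServerClientErrorRateHigh    # illustrative alert name
        expr: |
          (
            sum(rate(rest_client_requests_total{code=~"5.."}[5m])) by (instance, job, cluster_id)
            /
            sum(rate(rest_client_requests_total[5m])) by (instance, job, cluster_id)
          ) > 0.01
        for: 15m                               # the "Duration" column
        labels:
          severity: warning                    # assumed label; adjust to your routing
        annotations:
          description: "Client error rate to the APIServer on {{ $labels.instance }} has been above 1% for 15 minutes."
```
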
| Rule Name | Rule Expression | Duration | Description |
| --- | --- | --- | --- |
| Exceptional node status | `kube_node_status_condition{job=~".*kube-state-metrics",condition="Ready",status="true"} == 0` | 15m | The node has been in a `NotReady` state for over 15 minutes |
| Unreachable node | `kube_node_spec_taint{job=~".*kube-state-metrics",key="node.kubernetes.io/unreachable",effect="NoSchedule"} == 1` | 15m | The node is unreachable, and its workloads will be rescheduled |
| Too many Pods running on node | `count by(cluster_id, node) ((kube_pod_status_phase{job=~".*kube-state-metrics",phase="Running"} == 1) * on(instance,pod,namespace,cluster_id) group_left(node) topk by(instance,pod,namespace,cluster_id) (1, kube_pod_info{job=~".*kube-state-metrics"})) / max by(cluster_id, node) (kube_node_status_capacity_pods{job=~".*kube-state-metrics"} != 1) > 0.95` | 15m | The number of Pods running on the node is close to the upper limit |
| Node status fluctuation | `sum(changes(kube_node_status_condition{status="true",condition="Ready"}[15m])) by (cluster_id, node) > 2` | 15m | The node status fluctuates between normal and exceptional |
| Imminent expiration of the kubelet client certificate | `kubelet_certificate_manager_client_ttl_seconds < 86400` | None | The kubelet client certificate will expire within 24 hours |
| Imminent expiration of the kubelet server certificate | `kubelet_certificate_manager_server_ttl_seconds < 86400` | None | The kubelet server certificate will expire within 24 hours |
| Kubelet client certificate renewal error | `increase(kubelet_certificate_manager_client_expiration_renew_errors[5m]) > 0` | 15m | An error occurred while renewing the kubelet client certificate |
| Kubelet server certificate renewal error | `increase(kubelet_server_expiration_renew_errors[5m]) > 0` | 15m | An error occurred while renewing the kubelet server certificate |
| Time-consuming PLEG operations | `histogram_quantile(0.99, sum(rate(kubelet_pleg_relist_duration_seconds_bucket[5m])) by (cluster_id, instance, le) * on(instance, cluster_id) group_left(node) kubelet_node_name{job="kubelet"}) >= 10` | 5m | The 99th percentile of PLEG relist duration exceeds 10 seconds |
| Time-consuming Pod start | `histogram_quantile(0.99, sum(rate(kubelet_pod_worker_duration_seconds_bucket{job="kubelet"}[5m])) by (cluster_id, instance, le)) * on(cluster_id, instance) group_left(node) kubelet_node_name{job="kubelet"} > 60` | 15m | The 99th percentile of Pod start duration exceeds 60 seconds |
| Kubelet fault | `absent(sum(up{job="kubelet"}) by (cluster_id) > 0)` | 15m | The kubelet has disappeared from the scrape targets |
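
The certificate-expiration rules above have no duration: they fire as soon as the certificate TTL drops below 86400 seconds (24 hours). A minimal sketch of the kubelet client-certificate rule, again with illustrative group and alert names and an assumed severity label:

```yaml
groups:
  - name: kubelet.rules                              # illustrative group name
    rules:
      - alert: KubeletClientCertificateExpiringSoon  # illustrative alert name
        # 86400 s = 24 h; no "for" clause, matching the "None" duration in the table
        expr: kubelet_certificate_manager_client_ttl_seconds < 86400
        labels:
          severity: warning                          # assumed label
        annotations:
          description: "The kubelet client certificate on {{ $labels.instance }} expires in less than 24 hours."
```

A rule file like this can be syntax-checked offline with `promtool check rules <file>` before it is loaded.
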
| Rule Name | Rule Expression | Duration | Description |
| --- | --- | --- | --- |
| Cluster CPU resource overload | `sum by (cluster_id) (max by (cluster_id, namespace, pod, container) (kube_pod_container_resource_requests_cpu_cores{job=~".*kube-state-metrics"}) * on(cluster_id, namespace, pod) group_left() max by (cluster_id, namespace, pod) (kube_pod_status_phase{phase=~"Pending\|Running"} == 1)) / sum by (cluster_id) (kube_node_status_allocatable_cpu_cores) > (count by (cluster_id) (kube_node_status_allocatable_cpu_cores) - 1) / count by (cluster_id) (kube_node_status_allocatable_cpu_cores)` | 5m | Pods in the cluster have requested more CPU than would remain available after losing one node, so the cluster can no longer tolerate a node failure |
| Cluster memory resource overload | `sum by (cluster_id) (max by (cluster_id, namespace, pod, container) (kube_pod_container_resource_requests_memory_bytes{job=~".*kube-state-metrics"}) * on(cluster_id, namespace, pod) group_left() max by (cluster_id, namespace, pod) (kube_pod_status_phase{phase=~"Pending\|Running"} == 1)) / sum by (cluster_id) (kube_node_status_allocatable_memory_bytes) > (count by (cluster_id) (kube_node_status_allocatable_memory_bytes) - 1) / count by (cluster_id) (kube_node_status_allocatable_memory_bytes)` | 5m | Pods in the cluster have requested more memory than would remain available after losing one node, so the cluster can no longer tolerate a node failure |
| Cluster CPU quota overload | `sum by (cluster_id) (kube_resourcequota{job=~".*kube-state-metrics", type="hard", resource="cpu"}) / sum by (cluster_id) (kube_node_status_allocatable_cpu_cores) > 1.5` | 5m | The CPU quota in the cluster exceeds 150% of the total allocatable CPU cores |
| Cluster memory quota overload | `sum by (cluster_id) (kube_resourcequota{job=~".*kube-state-metrics", type="hard", resource="memory"}) / sum by (cluster_id) (kube_node_status_allocatable_memory_bytes) > 1.5` | 5m | The memory quota in the cluster exceeds 150% of the total allocatable memory |
| Imminent runout of quota resources | `sum by (cluster_id, namespace, resource) (kube_resourcequota{job=~".*kube-state-metrics", type="used"}) / sum by (cluster_id, namespace, resource) (kube_resourcequota{job=~".*kube-state-metrics", type="hard"} > 0) >= 0.9` | 15m | The quota resource utilization exceeds 90% |
| High proportion of throttled CPU periods | `sum(increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (cluster_id, container, pod, namespace) / sum(increase(container_cpu_cfs_periods_total[5m])) by (cluster_id, container, pod, namespace) > (25 / 100)` | 15m | More than 25% of the container's CPU periods were throttled by the CFS quota |
| High Pod CPU utilization | `sum(rate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!="", container!="POD"}[1m])) by (cluster_id, namespace, pod, container) / sum(kube_pod_container_resource_limits_cpu_cores) by (cluster_id, namespace, pod, container) > 0.75` | 15m | The Pod CPU utilization exceeds 75% of its limit |
| High Pod memory utilization | `sum(rate(container_memory_working_set_bytes{job="kubelet", metrics_path="/metrics/cadvisor", image!="", container!="POD"}[1m])) by (cluster_id, namespace, pod, container) / sum(kube_pod_container_resource_limits_memory_bytes) by (cluster_id, namespace, pod, container) > 0.75` | 15m | The Pod memory utilization exceeds 75% of its limit |
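
The two overload rules compare the CPU or memory requested by `Pending` and `Running` Pods against (N - 1) / N of the cluster's allocatable capacity, where N is the number of nodes. In a 4-node cluster, for example, the alert fires once requests exceed 3/4 of allocatable capacity, which is the point at which the workload would no longer fit after losing a single node. The sketch below packages the CPU variant as a rule file; the group name, alert name, and labels are illustrative.

```yaml
groups:
  - name: cluster-resources.rules          # illustrative group name
    rules:
      - alert: ClusterCPUOvercommitted     # illustrative alert name
        expr: |
          sum by (cluster_id) (
            max by (cluster_id, namespace, pod, container) (kube_pod_container_resource_requests_cpu_cores{job=~".*kube-state-metrics"})
            * on(cluster_id, namespace, pod) group_left()
            max by (cluster_id, namespace, pod) (kube_pod_status_phase{phase=~"Pending|Running"} == 1)
          )
          /
          sum by (cluster_id) (kube_node_status_allocatable_cpu_cores)
          >
          (count by (cluster_id) (kube_node_status_allocatable_cpu_cores) - 1)
          /
          count by (cluster_id) (kube_node_status_allocatable_cpu_cores)
        for: 5m
        labels:
          severity: warning                  # assumed label
        annotations:
          description: "CPU requests in cluster {{ $labels.cluster_id }} exceed the capacity that would remain after losing one node."
```
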
| Rule Name | Rule Expression | Duration | Description |
| --- | --- | --- | --- |
| Frequent Pod restarts | `increase(kube_pod_container_status_restarts_total{job=~".*kube-state-metrics"}[5m]) > 0` | 15m | The Pod has been restarting frequently over the last 15 minutes |
| Exceptional Pod status | `sum by (namespace, pod, cluster_id) (max by(namespace, pod, cluster_id) (kube_pod_status_phase{job=~".*kube-state-metrics", phase=~"Pending\|Unknown"}) * on(namespace, pod, cluster_id) group_left(owner_kind) topk by(namespace, pod) (1, max by(namespace, pod, owner_kind, cluster_id) (kube_pod_owner{owner_kind!="Job"}))) > 0` | 15m | The Pod has been in a `NotReady` state for over 15 minutes |
| Exceptional container status | `sum by (namespace, pod, container, cluster_id) (kube_pod_container_status_waiting_reason{job=~".*kube-state-metrics"}) > 0` | 1h | The container has been in a `Waiting` state for a long period of time |
| Deployment version mismatch | `kube_deployment_status_observed_generation{job=~".*kube-state-metrics"} != kube_deployment_metadata_generation{job=~".*kube-state-metrics"}` | 15m | The observed Deployment generation differs from the desired one, which indicates that the Deployment change hasn't taken effect |
| Deployment replica quantity mismatch | `(kube_deployment_spec_replicas{job=~".*kube-state-metrics"} != kube_deployment_status_replicas_available{job=~".*kube-state-metrics"}) and (changes(kube_deployment_status_replicas_updated{job=~".*kube-state-metrics"}[5m]) == 0)` | 15m | The actual number of replicas differs from the configured number of replicas |
| StatefulSet version mismatch | `kube_statefulset_status_observed_generation{job=~".*kube-state-metrics"} != kube_statefulset_metadata_generation{job=~".*kube-state-metrics"}` | 15m | The observed StatefulSet generation differs from the desired one, which indicates that the StatefulSet change hasn't taken effect |
| StatefulSet replica quantity mismatch | `(kube_statefulset_status_replicas_ready{job=~".*kube-state-metrics"} != kube_statefulset_status_replicas{job=~".*kube-state-metrics"}) and (changes(kube_statefulset_status_replicas_updated{job=~".*kube-state-metrics"}[5m]) == 0)` | 15m | The actual number of replicas differs from the configured number of replicas |
| Ineffective StatefulSet update | `(max without (revision) (kube_statefulset_status_current_revision{job=~".*kube-state-metrics"} unless kube_statefulset_status_update_revision{job=~".*kube-state-metrics"}) * (kube_statefulset_replicas{job=~".*kube-state-metrics"} != kube_statefulset_status_replicas_updated{job=~".*kube-state-metrics"})) and (changes(kube_statefulset_status_replicas_updated{job=~".*kube-state-metrics"}[5m]) == 0)` | 15m | The StatefulSet update hasn't been rolled out to some Pods |
| Frozen DaemonSet change | `((kube_daemonset_status_current_number_scheduled{job=~".*kube-state-metrics"} != kube_daemonset_status_desired_number_scheduled{job=~".*kube-state-metrics"}) or (kube_daemonset_status_number_misscheduled{job=~".*kube-state-metrics"} != 0) or (kube_daemonset_updated_number_scheduled{job=~".*kube-state-metrics"} != kube_daemonset_status_desired_number_scheduled{job=~".*kube-state-metrics"}) or (kube_daemonset_status_number_available{job=~".*kube-state-metrics"} != kube_daemonset_status_desired_number_scheduled{job=~".*kube-state-metrics"})) and (changes(kube_daemonset_updated_number_scheduled{job=~".*kube-state-metrics"}[5m]) == 0)` | 15m | The DaemonSet rollout has been stuck for more than 15 minutes |
| DaemonSet not scheduled on some nodes | `kube_daemonset_status_desired_number_scheduled{job=~".*kube-state-metrics"} - kube_daemonset_status_current_number_scheduled{job=~".*kube-state-metrics"} > 0` | 10m | The DaemonSet is not scheduled on some nodes |
| Faulty scheduling of DaemonSet on some nodes | `kube_daemonset_status_number_misscheduled{job=~".*kube-state-metrics"} > 0` | 15m | The DaemonSet is running on nodes where it is not supposed to be scheduled |
| Excessively long Job execution | `kube_job_spec_completions{job=~".*kube-state-metrics"} - kube_job_status_succeeded{job=~".*kube-state-metrics"} > 0` | 12h | The Job has not completed within 12 hours |
| Job execution failure | `kube_job_failed{job=~".*kube-state-metrics"} > 0` | 15m | The Job failed to complete |
| Mismatch between replica quantity and HPA | `(kube_hpa_status_desired_replicas{job=~".*kube-state-metrics"} != kube_hpa_status_current_replicas{job=~".*kube-state-metrics"}) and changes(kube_hpa_status_current_replicas[15m]) == 0` | 15m | The actual number of replicas differs from the number desired by the HPA |
| Number of replicas reaching maximum value in HPA | `kube_hpa_status_current_replicas{job=~".*kube-state-metrics"} == kube_hpa_spec_max_replicas{job=~".*kube-state-metrics"}` | 15m | The actual number of replicas has reached the maximum value configured in the HPA |
| Exceptional PersistentVolume status | `kube_persistentvolume_status_phase{phase=~"Failed\|Pending",job=~".*kube-state-metrics"} > 0` | 15m | The PersistentVolume is in the `Failed` or `Pending` status |
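
Because these expressions are built on kube-state-metrics series, the resulting alerts carry labels such as `namespace`, `pod`, and `container` that can be templated into notifications. A minimal sketch of the Pod-restart rule with illustrative names and an assumed severity label:

```yaml
groups:
  - name: workload.rules                     # illustrative group name
    rules:
      - alert: PodRestartingFrequently       # illustrative alert name
        expr: increase(kube_pod_container_status_restarts_total{job=~".*kube-state-metrics"}[5m]) > 0
        for: 15m
        labels:
          severity: warning                  # assumed label
        annotations:
          description: >-
            Container {{ $labels.container }} in Pod {{ $labels.namespace }}/{{ $labels.pod }}
            restarted {{ $value }} time(s) in the last 5 minutes.
```
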
| Rule Name | Rule Expression | Duration | Description |
| --- | --- | --- | --- |
| Imminent runout of filesystem space | `(node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 15 and predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0)` | 1h | The filesystem is predicted to run out of space within 4 hours |
| High filesystem space utilization | `(node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 5 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0)` | 1h | The available filesystem space is below 5% |
| Imminent runout of filesystem inodes | `(node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 20 and predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0)` | 1h | The filesystem is predicted to run out of inodes within 4 hours |
| High filesystem inode utilization | `(node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 3 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0)` | 1h | The proportion of available inodes is below 3% |
| Unstable network interface status | `changes(node_network_up{job="node-exporter",device!~"veth.+"}[2m])` | 2m | The network interface status is unstable and frequently flaps between "up" and "down" |
| Network interface data reception error | `increase(node_network_receive_errs_total[2m]) > 10` | 1h | Errors occurred while the network interface was receiving data |
| Network interface data sending error | `increase(node_network_transmit_errs_total[2m]) > 10` | 1h | Errors occurred while the network interface was sending data |
| Unsynced server clock | `min_over_time(node_timex_sync_status[5m]) == 0` | 10m | The server clock has not been synchronized recently. Please check whether NTP is correctly configured |
| Server clock skew | `(node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)` | 10m | The server clock is skewed by more than 0.05 seconds. Please check whether NTP is correctly configured |
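
The two "imminent runout" rules combine a static threshold with `predict_linear`, which fits a linear trend over the last 6 hours and extrapolates it 4 hours (`4*60*60` seconds) ahead; the alert fires only if the resource is already below the threshold and the trend reaches zero within that window. A minimal sketch of the filesystem-space variant, with illustrative names and an assumed severity label:

```yaml
groups:
  - name: node-exporter.rules               # illustrative group name
    rules:
      - alert: FilesystemSpaceFillingUp     # illustrative alert name
        # Fires only if free space is already below 15% and the 6-hour trend,
        # extrapolated 4 hours ahead, predicts it dropping below zero.
        expr: |
          (
            node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 15
            and
            predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0
            and
            node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
          )
        for: 1h
        labels:
          severity: warning                 # assumed label
        annotations:
          description: "Filesystem {{ $labels.device }} on {{ $labels.instance }} is predicted to run out of space within 4 hours."
```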