

Rule Type Description (old)

Last updated: 2024-08-07 22:05:33
    TMP provides preset alert templates for TKE clusters covering the master components, kubelet, resource usage, workloads, and nodes.

    Kubernetes Master Component

    The following alert rules are provided for non-managed clusters:

    Rule name: Error with client access to APIServer
    Rule expression: (sum(rate(rest_client_requests_total{code=~"5.."}[5m])) by (instance, job, cluster_id) / sum(rate(rest_client_requests_total[5m])) by (instance, job, cluster_id)) > 0.01
    Duration: 15m
    Description: The error rate of client access to the APIServer is above 1%.

    Rule name: Imminent expiration of the client certificate for APIServer access
    Rule expression: apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on(job) histogram_quantile(0.01, sum by (cluster_id, job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 86400
    Duration: None
    Description: The client certificate for APIServer access will expire within 24 hours.

    Rule name: Aggregated API error
    Rule expression: sum by(cluster_id, name, namespace) (increase(aggregator_unavailable_apiservice_count[5m])) > 2
    Duration: None
    Description: The aggregated API reported errors in the last 5 minutes.

    Rule name: Low aggregated API availability
    Rule expression: (1 - max by(name, namespace, cluster_id)(avg_over_time(aggregator_unavailable_apiservice[5m]))) * 100 < 90
    Duration: 5m
    Description: The availability of the aggregated API service in the last 5 minutes was below 90%.

    Rule name: APIServer fault
    Rule expression: absent(sum(up{job="apiserver"}) by (cluster_id) > 0)
    Duration: 5m
    Description: The APIServer has disappeared from the collection targets.

    Rule name: Scheduler fault
    Rule expression: absent(sum(up{job="kube-scheduler"}) by (cluster_id) > 0)
    Duration: 15m
    Description: The scheduler has disappeared from the collection targets.

    Rule name: Controller manager fault
    Rule expression: absent(sum(up{job="kube-controller-manager"}) by (cluster_id) > 0)
    Duration: 15m
    Description: The controller manager has disappeared from the collection targets.
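
    For reference, the sketch below shows how one rule from this table ("APIServer fault") could be written as a standard Prometheus alerting rule. It is a minimal illustration assuming the template is delivered as an ordinary Prometheus rule group; the group name, severity label, and annotation text are placeholders, not the exact fields TMP generates.

```yaml
groups:
  - name: kubernetes-master        # illustrative group name
    rules:
      - alert: APIServerFault      # "APIServer fault" in the table above
        expr: absent(sum(up{job="apiserver"}) by (cluster_id) > 0)
        for: 5m                    # matches the Duration field
        labels:
          severity: critical       # placeholder severity
        annotations:
          description: The APIServer has disappeared from the collection targets.
```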

    Kubelet

    Rule name: Exceptional node status
    Rule expression: kube_node_status_condition{job=~".*kube-state-metrics",condition="Ready",status="true"} == 0
    Duration: 15m
    Description: The node status has been exceptional for over 15 minutes.

    Rule name: Unreachable node
    Rule expression: kube_node_spec_taint{job=~".*kube-state-metrics",key="node.kubernetes.io/unreachable",effect="NoSchedule"} == 1
    Duration: 15m
    Description: The node is unreachable, and its workloads will be rescheduled.

    Rule name: Too many Pods running on node
    Rule expression: count by(cluster_id, node) ((kube_pod_status_phase{job=~".*kube-state-metrics",phase="Running"} == 1) * on(instance,pod,namespace,cluster_id) group_left(node) topk by(instance,pod,namespace,cluster_id) (1, kube_pod_info{job=~".*kube-state-metrics"})) / max by(cluster_id, node) (kube_node_status_capacity_pods{job=~".*kube-state-metrics"} != 1) > 0.95
    Duration: 15m
    Description: The number of Pods running on the node is close to the upper limit (above 95% of capacity).

    Rule name: Node status fluctuation
    Rule expression: sum(changes(kube_node_status_condition{status="true",condition="Ready"}[15m])) by (cluster_id, node) > 2
    Duration: 15m
    Description: The node status fluctuates between normal and exceptional.

    Rule name: Imminent expiration of the kubelet client certificate
    Rule expression: kubelet_certificate_manager_client_ttl_seconds < 86400
    Duration: None
    Description: The kubelet client certificate will expire within 24 hours.

    Rule name: Imminent expiration of the kubelet server certificate
    Rule expression: kubelet_certificate_manager_server_ttl_seconds < 86400
    Duration: None
    Description: The kubelet server certificate will expire within 24 hours.

    Rule name: Kubelet client certificate renewal error
    Rule expression: increase(kubelet_certificate_manager_client_expiration_renew_errors[5m]) > 0
    Duration: 15m
    Description: An error occurred while renewing the kubelet client certificate.

    Rule name: Kubelet server certificate renewal error
    Rule expression: increase(kubelet_server_expiration_renew_errors[5m]) > 0
    Duration: 15m
    Description: An error occurred while renewing the kubelet server certificate.

    Rule name: Time-consuming PLEG
    Rule expression: histogram_quantile(0.99, sum(rate(kubelet_pleg_relist_duration_seconds_bucket[5m])) by (cluster_id, instance, le) * on(instance, cluster_id) group_left(node) kubelet_node_name{job="kubelet"}) >= 10
    Duration: 5m
    Description: The 99th percentile of PLEG relist duration exceeds 10 seconds.

    Rule name: Time-consuming Pod start
    Rule expression: histogram_quantile(0.99, sum(rate(kubelet_pod_worker_duration_seconds_bucket{job="kubelet"}[5m])) by (cluster_id, instance, le)) * on(cluster_id, instance) group_left(node) kubelet_node_name{job="kubelet"} > 60
    Duration: 15m
    Description: The 99th percentile of Pod start duration exceeds 60 seconds.

    Rule name: Kubelet fault
    Rule expression: absent(sum(up{job="kubelet"}) by (cluster_id) > 0)
    Duration: 15m
    Description: The kubelet has disappeared from the collection targets.
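
    The certificate rules above compare a TTL metric against 86400 seconds, i.e. 24 hours. Below is a minimal sketch of the kubelet client certificate rule in standard Prometheus rule syntax; the group name, severity label, and annotation wording are assumptions for illustration only.

```yaml
groups:
  - name: kubelet                                    # illustrative group name
    rules:
      - alert: KubeletClientCertificateExpiration    # "Imminent expiration of the kubelet client certificate"
        # 86400 seconds = 24 hours; no "for:" clause, matching the Duration value "None"
        expr: kubelet_certificate_manager_client_ttl_seconds < 86400
        labels:
          severity: warning                          # placeholder severity
        annotations:
          description: The kubelet client certificate will expire within 24 hours.
```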

    Kubernetes Resource Use

    Rule name: Cluster CPU resource overload
    Rule expression: sum by (cluster_id) (max by (cluster_id, namespace, pod, container) (kube_pod_container_resource_requests_cpu_cores{job=~".*kube-state-metrics"}) * on(cluster_id, namespace, pod) group_left() max by (cluster_id, namespace, pod) (kube_pod_status_phase{phase=~"Pending|Running"} == 1)) / sum by (cluster_id) (kube_node_status_allocatable_cpu_cores) > (count by (cluster_id) (kube_node_status_allocatable_cpu_cores) - 1) / count by (cluster_id) (kube_node_status_allocatable_cpu_cores)
    Duration: 5m
    Description: Pods in the cluster have requested too many CPU cores, and the cluster can no longer tolerate a node failure.

    Rule name: Cluster memory resource overload
    Rule expression: sum by (cluster_id) (max by (cluster_id, namespace, pod, container) (kube_pod_container_resource_requests_memory_bytes{job=~".*kube-state-metrics"}) * on(cluster_id, namespace, pod) group_left() max by (cluster_id, namespace, pod) (kube_pod_status_phase{phase=~"Pending|Running"} == 1)) / sum by (cluster_id) (kube_node_status_allocatable_memory_bytes) > (count by (cluster_id) (kube_node_status_allocatable_memory_bytes) - 1) / count by (cluster_id) (kube_node_status_allocatable_memory_bytes)
    Duration: 5m
    Description: Pods in the cluster have requested too much memory, and the cluster can no longer tolerate a node failure.

    Rule name: Cluster CPU quota overload
    Rule expression: sum by (cluster_id) (kube_resourcequota{job=~".*kube-state-metrics", type="hard", resource="cpu"}) / sum by (cluster_id) (kube_node_status_allocatable_cpu_cores) > 1.5
    Duration: 5m
    Description: The total CPU quota in the cluster exceeds 1.5 times the allocatable CPU cores.

    Rule name: Cluster memory quota overload
    Rule expression: sum by (cluster_id) (kube_resourcequota{job=~".*kube-state-metrics", type="hard", resource="memory"}) / sum by (cluster_id) (kube_node_status_allocatable_memory_bytes) > 1.5
    Duration: 5m
    Description: The total memory quota in the cluster exceeds 1.5 times the allocatable memory.

    Rule name: Imminent runout of quota resources
    Rule expression: sum by (cluster_id, namespace, resource) (kube_resourcequota{job=~".*kube-state-metrics", type="used"}) / sum by (cluster_id, namespace, resource) (kube_resourcequota{job=~".*kube-state-metrics", type="hard"} > 0) >= 0.9
    Duration: 15m
    Description: The quota resource utilization exceeds 90%.

    Rule name: High proportion of restricted CPU execution cycles
    Rule expression: sum(increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (cluster_id, container, pod, namespace) / sum(increase(container_cpu_cfs_periods_total[5m])) by (cluster_id, container, pod, namespace) > (25 / 100)
    Duration: 15m
    Description: More than 25% of the container's CPU execution cycles were throttled.

    Rule name: High Pod CPU utilization
    Rule expression: sum(rate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!="", container!="POD"}[1m])) by (cluster_id, namespace, pod, container) / sum(kube_pod_container_resource_limits_cpu_cores) by (cluster_id, namespace, pod, container) > 0.75
    Duration: 15m
    Description: The Pod CPU utilization exceeds 75% of its limit.

    Rule name: High Pod memory utilization
    Rule expression: sum(rate(container_memory_working_set_bytes{job="kubelet", metrics_path="/metrics/cadvisor", image!="", container!="POD"}[1m])) by (cluster_id, namespace, pod, container) / sum(kube_pod_container_resource_limits_memory_bytes) by (cluster_id, namespace, pod, container) > 0.75
    Duration: 15m
    Description: The Pod memory utilization exceeds 75% of its limit.
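
    The two overload rules above fire when the CPU or memory requested by Pods exceeds (N - 1)/N of the cluster's allocatable capacity, where N is the number of nodes; with 4 nodes, for example, the threshold is 3/4 = 75%, beyond which the workload could not be rescheduled if one node failed. The sketch below restates the CPU rule in standard Prometheus rule syntax, with the expression split across lines for readability; the group name and severity label are illustrative assumptions.

```yaml
groups:
  - name: kubernetes-resources                       # illustrative group name
    rules:
      - alert: ClusterCPUResourceOverload            # "Cluster CPU resource overload"
        # Fires when requested CPU / allocatable CPU > (N - 1) / N, where N is the
        # number of nodes. With 4 nodes the threshold is 3/4 = 75%: above that,
        # the remaining Pods could not be rescheduled if one node failed.
        expr: |
          sum by (cluster_id) (
            max by (cluster_id, namespace, pod, container) (kube_pod_container_resource_requests_cpu_cores{job=~".*kube-state-metrics"})
            * on(cluster_id, namespace, pod) group_left()
            max by (cluster_id, namespace, pod) (kube_pod_status_phase{phase=~"Pending|Running"} == 1)
          )
          / sum by (cluster_id) (kube_node_status_allocatable_cpu_cores)
          > (count by (cluster_id) (kube_node_status_allocatable_cpu_cores) - 1)
            / count by (cluster_id) (kube_node_status_allocatable_cpu_cores)
        for: 5m
        labels:
          severity: warning                          # placeholder severity
```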

    Kubernetes Workload

    Rule name: Frequent Pod restarts
    Rule expression: increase(kube_pod_container_status_restarts_total{job=~".*kube-state-metrics"}[5m]) > 0
    Duration: 15m
    Description: The Pod was restarted repeatedly in the last 5 minutes.

    Rule name: Exceptional Pod status
    Rule expression: sum by (namespace, pod, cluster_id) (max by(namespace, pod, cluster_id) (kube_pod_status_phase{job=~".*kube-state-metrics", phase=~"Pending|Unknown"}) * on(namespace, pod, cluster_id) group_left(owner_kind) topk by(namespace, pod) (1, max by(namespace, pod, owner_kind, cluster_id) (kube_pod_owner{owner_kind!="Job"}))) > 0
    Duration: 15m
    Description: The Pod has been in the `NotReady` status for over 15 minutes.

    Rule name: Exceptional container status
    Rule expression: sum by (namespace, pod, container, cluster_id) (kube_pod_container_status_waiting_reason{job=~".*kube-state-metrics"}) > 0
    Duration: 1h
    Description: The container has been in the `Waiting` status for over 1 hour.

    Rule name: Deployment version mismatch
    Rule expression: kube_deployment_status_observed_generation{job=~".*kube-state-metrics"} != kube_deployment_metadata_generation{job=~".*kube-state-metrics"}
    Duration: 15m
    Description: The Deployment version differs from the set version, which indicates that the Deployment change hasn't taken effect.

    Rule name: Deployment replica quantity mismatch
    Rule expression: (kube_deployment_spec_replicas{job=~".*kube-state-metrics"} != kube_deployment_status_replicas_available{job=~".*kube-state-metrics"}) and (changes(kube_deployment_status_replicas_updated{job=~".*kube-state-metrics"}[5m]) == 0)
    Duration: 15m
    Description: The actual number of replicas differs from the set number of replicas.

    Rule name: StatefulSet version mismatch
    Rule expression: kube_statefulset_status_observed_generation{job=~".*kube-state-metrics"} != kube_statefulset_metadata_generation{job=~".*kube-state-metrics"}
    Duration: 15m
    Description: The StatefulSet version differs from the set version, which indicates that the StatefulSet change hasn't taken effect.

    Rule name: StatefulSet replica quantity mismatch
    Rule expression: (kube_statefulset_status_replicas_ready{job=~".*kube-state-metrics"} != kube_statefulset_status_replicas{job=~".*kube-state-metrics"}) and (changes(kube_statefulset_status_replicas_updated{job=~".*kube-state-metrics"}[5m]) == 0)
    Duration: 15m
    Description: The actual number of replicas differs from the set number of replicas.

    Rule name: Ineffective StatefulSet update
    Rule expression: (max without(revision) (kube_statefulset_status_current_revision{job=~".*kube-state-metrics"} unless kube_statefulset_status_update_revision{job=~".*kube-state-metrics"}) * (kube_statefulset_replicas{job=~".*kube-state-metrics"} != kube_statefulset_status_replicas_updated{job=~".*kube-state-metrics"})) and (changes(kube_statefulset_status_replicas_updated{job=~".*kube-state-metrics"}[5m]) == 0)
    Duration: 15m
    Description: The StatefulSet update hasn't been applied to some Pods.

    Rule name: Frozen DaemonSet change
    Rule expression: ((kube_daemonset_status_current_number_scheduled{job=~".*kube-state-metrics"} != kube_daemonset_status_desired_number_scheduled{job=~".*kube-state-metrics"}) or (kube_daemonset_status_number_misscheduled{job=~".*kube-state-metrics"} != 0) or (kube_daemonset_updated_number_scheduled{job=~".*kube-state-metrics"} != kube_daemonset_status_desired_number_scheduled{job=~".*kube-state-metrics"}) or (kube_daemonset_status_number_available{job=~".*kube-state-metrics"} != kube_daemonset_status_desired_number_scheduled{job=~".*kube-state-metrics"})) and (changes(kube_daemonset_updated_number_scheduled{job=~".*kube-state-metrics"}[5m]) == 0)
    Duration: 15m
    Description: The DaemonSet change has been stuck for more than 15 minutes.

    Rule name: DaemonSet not scheduled on some nodes
    Rule expression: kube_daemonset_status_desired_number_scheduled{job=~".*kube-state-metrics"} - kube_daemonset_status_current_number_scheduled{job=~".*kube-state-metrics"} > 0
    Duration: 10m
    Description: The DaemonSet is not scheduled on some nodes.

    Rule name: Faulty scheduling of DaemonSet on some nodes
    Rule expression: kube_daemonset_status_number_misscheduled{job=~".*kube-state-metrics"} > 0
    Duration: 15m
    Description: The DaemonSet is incorrectly scheduled to some nodes.

    Rule name: Excessive Job execution
    Rule expression: kube_job_spec_completions{job=~".*kube-state-metrics"} - kube_job_status_succeeded{job=~".*kube-state-metrics"} > 0
    Duration: 12h
    Description: The execution duration of the Job exceeds 12 hours.

    Rule name: Job execution failure
    Rule expression: kube_job_failed{job=~".*kube-state-metrics"} > 0
    Duration: 15m
    Description: The Job execution failed.

    Rule name: Mismatch between replica quantity and HPA
    Rule expression: (kube_hpa_status_desired_replicas{job=~".*kube-state-metrics"} != kube_hpa_status_current_replicas{job=~".*kube-state-metrics"}) and changes(kube_hpa_status_current_replicas[15m]) == 0
    Duration: 15m
    Description: The actual number of replicas differs from the number set in HPA.

    Rule name: Number of replicas reaching maximum value in HPA
    Rule expression: kube_hpa_status_current_replicas{job=~".*kube-state-metrics"} == kube_hpa_spec_max_replicas{job=~".*kube-state-metrics"}
    Duration: 15m
    Description: The actual number of replicas has reached the maximum value configured in HPA.

    Rule name: Exceptional PersistentVolume status
    Rule expression: kube_persistentvolume_status_phase{phase=~"Failed|Pending",job=~".*kube-state-metrics"} > 0
    Duration: 15m
    Description: The PersistentVolume is in the `Failed` or `Pending` status.
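
    As with the other tables, each workload rule can be expressed as a standard Prometheus alerting rule. The sketch below uses the "Frequent Pod restarts" rule as an example and adds a templated annotation so a firing alert names the affected Pod; the group name, severity label, and annotation are assumptions for illustration only.

```yaml
groups:
  - name: kubernetes-workload                        # illustrative group name
    rules:
      - alert: FrequentPodRestarts                   # "Frequent Pod restarts"
        expr: increase(kube_pod_container_status_restarts_total{job=~".*kube-state-metrics"}[5m]) > 0
        for: 15m                                     # restarts must keep occurring for 15 minutes
        labels:
          severity: warning                          # placeholder severity
        annotations:
          description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarted in the last 5 minutes."
```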

    Kubernetes Node

    Rule name: Imminent runout of filesystem space
    Rule expression: (node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 15 and predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0)
    Duration: 1h
    Description: It is estimated that the filesystem space will be used up within 4 hours.

    Rule name: High filesystem space utilization
    Rule expression: (node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 5 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0)
    Duration: 1h
    Description: The available filesystem space is below 5%.

    Rule name: Imminent runout of filesystem inodes
    Rule expression: (node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 20 and predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0)
    Duration: 1h
    Description: It is estimated that the filesystem inodes will be used up within 4 hours.

    Rule name: High filesystem inode utilization
    Rule expression: (node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 3 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0)
    Duration: 1h
    Description: The proportion of available inodes is below 3%.

    Rule name: Unstable network interface status
    Rule expression: changes(node_network_up{job="node-exporter",device!~"veth.+"}[2m]) > 2
    Duration: 2m
    Description: The network interface status is unstable and frequently changes between "up" and "down".

    Rule name: Network interface data reception error
    Rule expression: increase(node_network_receive_errs_total[2m]) > 10
    Duration: 1h
    Description: Errors occurred while the network interface was receiving data.

    Rule name: Network interface data sending error
    Rule expression: increase(node_network_transmit_errs_total[2m]) > 10
    Duration: 1h
    Description: Errors occurred while the network interface was sending data.

    Rule name: Unsynced server clock
    Rule expression: min_over_time(node_timex_sync_status[5m]) == 0
    Duration: 10m
    Description: The server time has not been synced recently. Please check whether NTP is correctly configured.

    Rule name: Server clock skew
    Rule expression: (node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)
    Duration: 10m
    Description: The server clock offset exceeds 0.05 seconds and is not converging. Please check whether NTP is correctly configured.
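
    The two "imminent runout" rules above rely on predict_linear(): a linear trend is fitted over the last 6 hours of samples and extrapolated 4*60*60 seconds (4 hours) ahead, and the alert fires only if the projection drops below zero while free space is already under the static threshold. A minimal sketch of the filesystem-space rule in standard Prometheus rule syntax follows; the group name and severity label are illustrative assumptions.

```yaml
groups:
  - name: node-exporter                              # illustrative group name
    rules:
      - alert: FilesystemSpaceRunningOut             # "Imminent runout of filesystem space"
        # predict_linear() fits a linear trend over the last 6h of free-space samples
        # and extrapolates 4*60*60 seconds (4 hours) ahead; a negative projection
        # combined with less than 15% free space triggers the alert.
        expr: |
          (
            node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 15
            and predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0
            and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
          )
        for: 1h
        labels:
          severity: warning                          # placeholder severity
```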
    