TencentCloud Managed Service for Prometheus
Rule Type Description (old)
Last updated: 2024-01-29 16:01:55
TMP provides preset alert templates for TKE clusters, covering master components, the kubelet, resource usage, workloads, and nodes.

Kubernetes master component

The following alert rules are provided for non-managed clusters:
Rule Name
Rule Expression
Duration
Description
Error with client access to APIServer
(sum(rate(rest_client_requests_total{code=~"5.."}[5m])) by (instance, job, cluster_id) / sum(rate(rest_client_requests_total[5m])) by (instance, job, cluster_id))> 0.01
15m
The error rate of client access to the APIServer is above 1%
Imminent expiration of the client certificate for APIServer access
apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on(job) histogram_quantile(0.01, sum by (cluster_id, job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 86400
None
The client certificate for APIServer access will expire in 24 hours
Aggregated API error
sum by(cluster_id, name, namespace) (increase(aggregator_unavailable_apiservice_count[5m])) > 2
None
The aggregated API reported errors in the last 5 minutes
Low aggregated API availability
(1 - max by(name, namespace, cluster_id)(avg_over_time(aggregator_unavailable_apiservice[5m]))) * 100 < 90
5m
The availability of the aggregated API service in the last 5 minutes was below 90%
APIServer fault
absent(sum(up{job="apiserver"}) by (cluster_id) > 0)
5m
APIServer disappeared from the collection targets
Scheduler fault
absent(sum(up{job="kube-scheduler"}) by (cluster_id) > 0)
15m
The scheduler disappeared from the collection targets
Controller manager fault
absent(sum(up{job="kube-controller-manager"}) by (cluster_id) > 0)
15m
The controller manager disappeared from the collection targets
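
Each entry above maps onto a standard Prometheus alerting rule: the rule expression becomes `expr`, the duration becomes `for`, and the description becomes an annotation. The following sketch uses the "APIServer fault" entry; the group name, severity label, and annotation wording are illustrative rather than the exact preset shipped with TMP.

groups:
- name: kubernetes-master             # illustrative group name
  rules:
  - alert: APIServerDown              # "APIServer fault" in the table above
    # absent() fires when the inner query returns no series,
    # i.e. no apiserver target in any cluster is reporting up.
    expr: absent(sum(up{job="apiserver"}) by (cluster_id) > 0)
    for: 5m                           # the "Duration" column
    labels:
      severity: critical              # illustrative
    annotations:
      description: APIServer disappeared from the collection targets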

Kubelet

Rule Name
Rule Expression
Duration
Description
Exceptional node status
kube_node_status_condition{job=~".*kube-state-metrics",condition="Ready",status="true"} == 0
15m
The node status is exceptional for over 15 minutes
Unreachable node
kube_node_spec_taint{job=~".*kube-state-metrics",key="node.kubernetes.io/unreachable",effect="NoSchedule"} == 1
15m
The node is unreachable, and its workloads will be rescheduled
Too many Pods running on node
count by(cluster_id, node) ((kube_pod_status_phase{job=~".*kube-state-metrics",phase="Running"} == 1) * on(instance,pod,namespace,cluster_id) group_left(node) topk by(instance,pod,namespace,cluster_id) (1, kube_pod_info{job=~".*kube-state-metrics"}))/max by(cluster_id, node) (kube_node_status_capacity_pods{job=~".*kube-state-metrics"} != 1) > 0.95
15m
The number of Pods running on the node is close to the upper limit
Node status fluctuation
sum(changes(kube_node_status_condition{status="true",condition="Ready"}[15m])) by (cluster_id, node) > 2
15m
The node status fluctuates between normal and exceptional
Imminent expiration of the kubelet client certificate
kubelet_certificate_manager_client_ttl_seconds < 86400
None
The kubelet client certificate will expire in 24 hours
Imminent expiration of the kubelet server certificate
kubelet_certificate_manager_server_ttl_seconds < 86400
None
The kubelet server certificate will expire in 24 hours
Kubelet client certificate renewal error
increase(kubelet_certificate_manager_client_expiration_renew_errors[5m]) > 0
15m
An error occurred while renewing the kubelet client certificate
Kubelet server certificate renewal error
increase(kubelet_server_expiration_renew_errors[5m]) > 0
15m
An error occurred while renewing the kubelet server certificate
Time-Consuming PLEG
histogram_quantile(0.99, sum(rate(kubelet_pleg_relist_duration_seconds_bucket[5m])) by (cluster_id, instance, le) * on(instance, cluster_id) group_left(node) kubelet_node_name{job="kubelet"}) >= 10
5m
The 99th percentile of PLEG operation duration exceeds 10 seconds
Time-Consuming Pod start
histogram_quantile(0.99, sum(rate(kubelet_pod_worker_duration_seconds_bucket{job="kubelet"}[5m])) by (cluster_id, instance, le)) * on(cluster_id, instance) group_left(node) kubelet_node_name{job="kubelet"} > 60
15m
The 99th percentile of Pod start duration exceeds 60 seconds
Kubelet fault
absent(sum(up{job="kubelet"}) by (cluster_id) > 0)
15m
Kubelet disappeared from the collection targets
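
A duration of "None" in these tables can be read as a rule without a `for` clause, that is, the alert fires as soon as the expression returns a result. The sketch below shows the kubelet client certificate rule in that form; the rule group, alert name, and labels are illustrative.

groups:
- name: kubelet                                # illustrative group name
  rules:
  - alert: KubeletClientCertExpiringSoon       # "Imminent expiration of the kubelet client certificate"
    # 86400 seconds = 24 hours; matches any kubelet client certificate
    # whose remaining TTL is below one day.
    expr: kubelet_certificate_manager_client_ttl_seconds < 86400
    # No "for" clause, matching the "None" duration in the table.
    labels:
      severity: warning                        # illustrative
    annotations:
      description: The kubelet client certificate will expire in 24 hours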

Kubernetes Resource Use

Rule Name
Rule Expression
Duration
Description
Cluster CPU resource overload
sum by (cluster_id) (max by (cluster_id, namespace, pod, container) (kube_pod_container_resource_requests_cpu_cores{job=~".*kube-state-metrics"}) * on(cluster_id, namespace, pod) group_left() max by (cluster_id, namespace, pod) (kube_pod_status_phase{phase=~"Pending|Running"} == 1))/sum by (cluster_id) (kube_node_status_allocatable_cpu_cores)>(count by (cluster_id) (kube_node_status_allocatable_cpu_cores)-1) / count by (cluster_id) (kube_node_status_allocatable_cpu_cores)
5m
Pods in the cluster have requested so much CPU that the cluster can no longer tolerate the failure of a single node
Cluster memory resource overload
sum by (cluster_id) (max by (cluster_id, namespace, pod, container) (kube_pod_container_resource_requests_memory_bytes{job=~".*kube-state-metrics"}) * on(cluster_id, namespace, pod) group_left() max by (cluster_id, namespace, pod) (kube_pod_status_phase{phase=~"Pending|Running"} == 1))/sum by (cluster_id) (kube_node_status_allocatable_memory_bytes) > (count by (cluster_id) (kube_node_status_allocatable_memory_bytes)-1) / count by (cluster_id) (kube_node_status_allocatable_memory_bytes)
5m
Pods in the cluster have requested so much memory that the cluster can no longer tolerate the failure of a single node
Cluster CPU quota overload
sum by (cluster_id) (kube_resourcequota{job=~".*kube-state-metrics", type="hard", resource="cpu"})/sum by (cluster_id) (kube_node_status_allocatable_cpu_cores) > 1.5
5m
The CPU quota in the cluster exceeds 1.5 times the total allocatable CPU cores
Cluster memory quota overload
sum by (cluster_id) (kube_resourcequota{job=~".*kube-state-metrics", type="hard", resource="memory"}) / sum by (cluster_id) (kube_node_status_allocatable_memory_bytes) > 1.5
5m
The memory quota in the cluster exceeds 1.5 times the total allocatable memory
Imminent runout of quota resources
sum by (cluster_id, namespace, resource) (kube_resourcequota{job=~".*kube-state-metrics", type="used"}) / sum by (cluster_id, namespace, resource) (kube_resourcequota{job=~".*kube-state-metrics", type="hard"} > 0) >= 0.9
15m
The quota resource utilization exceeds 90%
High proportion of restricted CPU execution cycles
sum(increase(container_cpu_cfs_throttled_periods_total{container!="", }[5m])) by (cluster_id, container, pod, namespace) /sum(increase(container_cpu_cfs_periods_total{}[5m])) by (cluster_id, container, pod, namespace) > ( 25 / 100 )
15m
The proportion of restricted (throttled) CPU execution cycles exceeds 25%
High Pod CPU utilization
sum(rate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!="", container!="POD"}[1m])) by (cluster_id, namespace, pod, container) / sum(kube_pod_container_resource_limits_cpu_cores) by (cluster_id, namespace, pod, container) > 0.75
15m
The Pod CPU utilization exceeds 75% of the CPU limit
High Pod memory utilization
sum(rate(container_memory_working_set_bytes{job="kubelet", metrics_path="/metrics/cadvisor", image!="", container!="POD"}[1m])) by (cluster_id, namespace, pod, container) /sum(kube_pod_container_resource_limits_memory_bytes) by (cluster_id, namespace, pod, container) > 0.75
15m
The Pod memory utilization exceeds 75% of the memory limit
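
The two overload rules at the top of this table compare the CPU and memory requested by Pods against (N-1)/N of the cluster's allocatable resources, where N is the node count; with 4 nodes, for example, the alert fires once requests exceed 3/4 = 75% of allocatable capacity, because a single node failure could then no longer be absorbed. The utilization rules at the bottom compare usage against the container limits, as in the illustrative sketch below (group name and labels are not part of the preset).

groups:
- name: kubernetes-resources          # illustrative group name
  rules:
  - alert: PodCpuUtilizationHigh      # "High Pod CPU utilization"
    # Measured CPU usage divided by the container's CPU limit, per container.
    expr: |
      sum(rate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!="", container!="POD"}[1m])) by (cluster_id, namespace, pod, container)
        / sum(kube_pod_container_resource_limits_cpu_cores) by (cluster_id, namespace, pod, container) > 0.75
    for: 15m
    labels:
      severity: warning               # illustrative
    annotations:
      description: The Pod CPU utilization exceeds 75% of the CPU limit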

Kubernetes Workload

Rule Name
Rule Expression
Duration
Description
Frequent Pod restarts
increase(kube_pod_container_status_restarts_total{job=~".*kube-state-metrics"}[5m]) > 0
15m
The Pod was frequently restarted in the last 5 minutes
Exceptional Pod status
sum by (namespace, pod, cluster_id) (max by(namespace, pod, cluster_id) (kube_pod_status_phase{job=~".*kube-state-metrics", phase=~"Pending|Unknown"}) * on(namespace, pod, cluster_id) group_left(owner_kind) topk by(namespace, pod) (1, max by(namespace, pod, owner_kind, cluster_id) (kube_pod_owner{owner_kind!="Job"}))) > 0
15m
The Pod is in the `NotReady` status for over 15 minutes
Exceptional container status
sum by (namespace, pod, container, cluster_id) (kube_pod_container_status_waiting_reason{job=~".*kube-state-metrics"}) > 0
1h
The container is in the `Waiting` status for a long period of time
Deployment version mismatch
kube_deployment_status_observed_generation{job=~".*kube-state-metrics"} !=kube_deployment_metadata_generation{job=~".*kube-state-metrics"}
15m
The Deployment version is different from the set version, which indicates that the Deployment change hasn't taken effect
Deployment replica quantity mismatch
(kube_deployment_spec_replicas{job=~".*kube-state-metrics"} != kube_deployment_status_replicas_available{job=~".*kube-state-metrics"}) and (changes(kube_deployment_status_replicas_updated{job=~".*kube-state-metrics"}[5m]) == 0)
15m
The actual number of replicas is different from the set number of replicas
StatefulSet version mismatch
kube_statefulset_status_observed_generation{job=~".*kube-state-metrics"} != kube_statefulset_metadata_generation{job=~".*kube-state-metrics"}
15m
The StatefulSet version is different from the set version, which indicates that the StatefulSet change hasn't taken effect
StatefulSet replica quantity mismatch
(kube_statefulset_status_replicas_ready{job=~".*kube-state-metrics"} != kube_statefulset_status_replicas{job=~".*kube-state-metrics"}) and ( changes(kube_statefulset_status_replicas_updated{job=~".*kube-state-metrics"}[5m]) == 0)
15m
The actual number of replicas is different from the set number of replicas
Ineffective StatefulSet update
(max without(revision) (kube_statefulset_status_current_revision{job=~".*kube-state-metrics"} unless kube_statefulset_status_update_revision{job=~".*kube-state-metrics"}) * (kube_statefulset_replicas{job=~".*kube-state-metrics"} != kube_statefulset_status_replicas_updated{job=~".*kube-state-metrics"})) and (changes(kube_statefulset_status_replicas_updated{job=~".*kube-state-metrics"}[5m]) == 0)
15m
The StatefulSet hasn't been updated on some Pods
Frozen DaemonSet change
((kube_daemonset_status_current_number_scheduled{job=~".*kube-state-metrics"}!=kube_daemonset_status_desired_number_scheduled{job=~".*kube-state-metrics"}) or (kube_daemonset_status_number_misscheduled{job=~".*kube-state-metrics"}!=0) or (kube_daemonset_updated_number_scheduled{job=~".*kube-state-metrics"}!=kube_daemonset_status_desired_number_scheduled{job=~".*kube-state-metrics"}) or (kube_daemonset_status_number_available{job=~".*kube-state-metrics"}!=kube_daemonset_status_desired_number_scheduled{job=~".*kube-state-metrics"})) and (changes(kube_daemonset_updated_number_scheduled{job=~".*kube-state-metrics"}[5m])==0)
15m
The DaemonSet rollout has been stuck for more than 15 minutes
DaemonSet not scheduled on some nodes
kube_daemonset_status_desired_number_scheduled{job=~".*kube-state-metrics"} - kube_daemonset_status_current_number_scheduled{job=~".*kube-state-metrics"} > 0
10m
The DaemonSet is not scheduled on some nodes
Faulty scheduling of DaemonSet on some nodes
kube_daemonset_status_number_misscheduled{job=~".*kube-state-metrics"} > 0
15m
The DaemonSet is incorrectly scheduled to some nodes
Excessive Job execution
kube_job_spec_completions{job=~".*kube-state-metrics"} - kube_job_status_succeeded{job=~".*kube-state-metrics"} > 0
12h
The Job has been running for more than 12 hours without completing
Job execution failure
kube_job_failed{job=~".*kube-state-metrics"} > 0
15m
Job execution failed
Mismatch between replica quantity and HPA
(kube_hpa_status_desired_replicas{job=~".*kube-state-metrics"} != kube_hpa_status_current_replicas{job=~".*kube-state-metrics"}) and changes(kube_hpa_status_current_replicas[15m]) == 0
15m
The actual number of replicas is different from that set in HPA
Number of replicas reaching maximum value in HPA
kube_hpa_status_current_replicas{job=~".*kube-state-metrics"} == kube_hpa_spec_max_replicas{job=~".*kube-state-metrics"}
15m
The actual number of replicas reaches the maximum value configured in HPA
Exceptional PersistentVolume status
kube_persistentvolume_status_phase{phase=~"Failed|Pending",job=~".*kube-state-metrics"} > 0
15m
The PersistentVolume is in the `Failed` or `Pending` status
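
The "version mismatch" rules above compare Kubernetes generations: `metadata_generation` increments whenever the object's spec changes, while `observed_generation` records the last generation the controller has processed, so a lasting difference means the change has not been rolled out. A sketch of the Deployment rule in rule-file form (group name and labels are illustrative):

groups:
- name: kubernetes-workloads                   # illustrative group name
  rules:
  - alert: DeploymentGenerationMismatch        # "Deployment version mismatch"
    # observed_generation lags metadata_generation while a spec change
    # has not yet been acted on by the Deployment controller.
    expr: |
      kube_deployment_status_observed_generation{job=~".*kube-state-metrics"}
        != kube_deployment_metadata_generation{job=~".*kube-state-metrics"}
    for: 15m
    labels:
      severity: warning                        # illustrative
    annotations:
      description: The Deployment change hasn't taken effect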

Kubernetes Node

Rule Name
Rule Expression
Duration
Description
Imminent runout of filesystem space
(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}/node_filesystem_size_bytes{job="node-exporter",fstype!=""}*100<15 and predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h],4*60*60)<0 and node_filesystem_readonly{job="node-exporter",fstype!=""}==0)
1h
It is estimated that the filesystem space will be used up in 4 hours
High filesystem space utilization
(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}/node_filesystem_size_bytes{job="node-exporter",fstype!=""}*100<5 and node_filesystem_readonly{job="node-exporter",fstype!=""}==0)
1h
The available filesystem space is below 5%
Imminent runout of filesystem inodes
(node_filesystem_files_free{job="node-exporter",fstype!=""}/node_filesystem_files{job="node-exporter",fstype!=""}*100<20 and predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h],4*60*60)<0 and node_filesystem_readonly{job="node-exporter",fstype!=""}==0)
1h
It is estimated that the filesystem inodes will be used up in 4 hours
High filesystem inode utilization
(node_filesystem_files_free{job="node-exporter",fstype!=""}/node_filesystem_files{job="node-exporter",fstype!=""}*100<3 and node_filesystem_readonly{job="node-exporter",fstype!=""}==0)
1h
The proportion of available inodes is below 3%
Unstable network interface status
changes(node_network_up{job="node-exporter",device!~"veth.+"}[2m]) > 2
2m
The network interface status is unstable and frequently changes between "up" and "down"
Network interface data reception error
increase(node_network_receive_errs_total[2m]) > 10
1h
An error occurred while the network interface received data
Network interface data sending error
increase(node_network_transmit_errs_total[2m]) > 10
1h
An error occurred while the network interface sent data
Unsynced server clock
min_over_time(node_timex_sync_status[5m]) == 0
10m
The server time has not been synced recently. Please check whether NTP is correctly configured
Server clock skew
(node_timex_offset_seconds>0.05 and deriv(node_timex_offset_seconds[5m])>=0) or (node_timex_offset_seconds<-0.05 and deriv(node_timex_offset_seconds[5m])<=0)
10m
The server clock is out of sync by more than 0.05 seconds. Please check whether NTP is correctly configured
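
The two "imminent runout" rules rely on `predict_linear`, which fits a linear trend to the last 6 hours of samples and extrapolates it 4 hours (4*60*60 seconds) ahead; the alert fires only when free space or inodes are already low, the extrapolation reaches zero, and the filesystem is writable. An illustrative sketch of the filesystem space rule (group name and labels are not part of the preset):

groups:
- name: node-exporter                          # illustrative group name
  rules:
  - alert: NodeFilesystemSpaceFillingUp        # "Imminent runout of filesystem space"
    # Free space already below 15%, the 6-hour trend predicts exhaustion
    # within 4 hours, and the filesystem is not read-only.
    expr: |
      (
        node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 15
        and predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0
        and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
      )
    for: 1h
    labels:
      severity: warning                        # illustrative
    annotations:
      description: It is estimated that the filesystem space will be used up in 4 hours

If you adapt any of these expressions into your own rule files, `promtool check rules <file>` validates the syntax before the rules are loaded.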
