TencentCloud Managed Service for Prometheus
Rule Type Description (old)
Last updated: 2024-01-29 16:01:55
TMP provides preset alert templates for TKE clusters, covering master components, the kubelet, resource usage, workloads, and nodes.

Kubernetes master component

The following alert rules are provided for non-managed clusters:
Rule Name
Rule Expression
Duration
Description
Error with client access to APIServer
(sum(rate(rest_client_requests_total{code=~"5.."}[5m])) by (instance, job, cluster_id) / sum(rate(rest_client_requests_total[5m])) by (instance, job, cluster_id))> 0.01
15m
The error rate of client access to the APIServer is above 1%
Imminent expiration of the client certificate for APIServer access
apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on(job) histogram_quantile(0.01, sum by (cluster_id, job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 86400
None
The client certificate for APIServer access will expire in 24 hours
Aggregated API error
sum by(cluster_id, name, namespace) (increase(aggregator_unavailable_apiservice_count[5m])) > 2
None
The aggregated API reported errors in the last 5 minutes
Low aggregated API availability
(1 - max by(name, namespace, cluster_id)(avg_over_time(aggregator_unavailable_apiservice[5m]))) * 100 < 90
5m
The availability of the aggregated API service in the last 5 minutes was below 90%
APIServer fault
absent(sum(up{job="apiserver"}) by (cluster_id) > 0)
5m
APIServer disappeared from the collection targets
Scheduler fault
absent(sum(up{job="kube-scheduler"}) by (cluster_id) > 0)
15m
The scheduler disappeared from the collection targets
Controller manager fault
absent(sum(up{job="kube-controller-manager"}) by (cluster_id) > 0)
15m
The controller manager disappeared from the collection targets
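
Each entry above maps onto a standard Prometheus alerting rule: the rule expression becomes `expr`, the duration becomes `for`, and the description becomes an annotation. The following sketch uses the "APIServer fault" entry; the group name, severity label, and annotation wording are illustrative rather than the exact preset shipped with TMP.

groups:
- name: kubernetes-master             # illustrative group name
  rules:
  - alert: APIServerDown              # "APIServer fault" in the table above
    # absent() fires when the inner query returns no series,
    # i.e. no apiserver target in any cluster is reporting up.
    expr: absent(sum(up{job="apiserver"}) by (cluster_id) > 0)
    for: 5m                           # the "Duration" column
    labels:
      severity: critical              # illustrative
    annotations:
      description: APIServer disappeared from the collection targets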

Kubelet

Rule Name
Rule Expression
Duration
Description
Exceptional node status
kube_node_status_condition{job=~".*kube-state-metrics",condition="Ready",status="true"} == 0
15m
The node status is exceptional for over 15 minutes
Unreachable node
kube_node_spec_taint{job=~".*kube-state-metrics",key="node.kubernetes.io/unreachable",effect="NoSchedule"} == 1
15m
The node is unreachable, and its workloads will be rescheduled
Too many Pods running on node
count by(cluster_id, node) ((kube_pod_status_phase{job=~".*kube-state-metrics",phase="Running"} == 1) * on(instance,pod,namespace,cluster_id) group_left(node) topk by(instance,pod,namespace,cluster_id) (1, kube_pod_info{job=~".*kube-state-metrics"}))/max by(cluster_id, node) (kube_node_status_capacity_pods{job=~".*kube-state-metrics"} != 1) > 0.95
15m
The number of Pods running on the node is close to the upper limit
Node status fluctuation
sum(changes(kube_node_status_condition{status="true",condition="Ready"}[15m])) by (cluster_id, node) > 2
15m
The node status fluctuates between normal and exceptional
Imminent expiration of the kubelet client certificate
kubelet_certificate_manager_client_ttl_seconds < 86400
None
The kubelet client certificate will expire in 24 hours
Imminent expiration of the kubelet server certificate
kubelet_certificate_manager_server_ttl_seconds < 86400
None
The kubelet server certificate will expire in 24 hours
Kubelet client certificate renewal error
increase(kubelet_certificate_manager_client_expiration_renew_errors[5m]) > 0
15m
An error occurred while renewing the kubelet client certificate
Kubelet server certificate renewal error
increase(kubelet_server_expiration_renew_errors[5m]) > 0
15m
An error occurred while renewing the kubelet server certificate
Time-Consuming PLEG
histogram_quantile(0.99, sum(rate(kubelet_pleg_relist_duration_seconds_bucket[5m])) by (cluster_id, instance, le) * on(instance, cluster_id) group_left(node) kubelet_node_name{job="kubelet"}) >= 10
5m
The 99th percentile of PLEG operation duration exceeds 10 seconds
Time-Consuming Pod start
histogram_quantile(0.99, sum(rate(kubelet_pod_worker_duration_seconds_bucket{job="kubelet"}[5m])) by (cluster_id, instance, le)) * on(cluster_id, instance) group_left(node) kubelet_node_name{job="kubelet"} > 60
15m
The 99th percentile of Pod start duration exceeds 60 seconds
Kubelet fault
absent(sum(up{job="kubelet"}) by (cluster_id) > 0)
15m
Kubelet disappeared from the collection targets
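
A duration of "None" in these tables can be read as a rule without a `for` clause, that is, the alert fires as soon as the expression returns a result. The sketch below shows the kubelet client certificate rule in that form; the rule group, alert name, and labels are illustrative.

groups:
- name: kubelet                                # illustrative group name
  rules:
  - alert: KubeletClientCertExpiringSoon       # "Imminent expiration of the kubelet client certificate"
    # 86400 seconds = 24 hours; matches any kubelet client certificate
    # whose remaining TTL is below one day.
    expr: kubelet_certificate_manager_client_ttl_seconds < 86400
    # No "for" clause, matching the "None" duration in the table.
    labels:
      severity: warning                        # illustrative
    annotations:
      description: The kubelet client certificate will expire in 24 hours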

Kubernetes Resource Use

Rule Name
Rule Expression
Duration
Description
Cluster CPU resource overload
sum by (cluster_id) (max by (cluster_id, namespace, pod, container) (kube_pod_container_resource_requests_cpu_cores{job=~".*kube-state-metrics"}) * on(cluster_id, namespace, pod) group_left() max by (cluster_id, namespace, pod) (kube_pod_status_phase{phase=~"Pending|Running"} == 1))/sum by (cluster_id) (kube_node_status_allocatable_cpu_cores)>(count by (cluster_id) (kube_node_status_allocatable_cpu_cores)-1) / count by (cluster_id) (kube_node_status_allocatable_cpu_cores)
5m
Pods in the cluster have requested so much CPU that the cluster can no longer tolerate the failure of a single node
Cluster memory resource overload
sum by (cluster_id) (max by (cluster_id, namespace, pod, container) (kube_pod_container_resource_requests_memory_bytes{job=~".*kube-state-metrics"}) * on(cluster_id, namespace, pod) group_left() max by (cluster_id, namespace, pod) (kube_pod_status_phase{phase=~"Pending|Running"} == 1))/sum by (cluster_id) (kube_node_status_allocatable_memory_bytes) > (count by (cluster_id) (kube_node_status_allocatable_memory_bytes)-1) / count by (cluster_id) (kube_node_status_allocatable_memory_bytes)
5m
Pods in the cluster have requested so much memory that the cluster can no longer tolerate the failure of a single node
Cluster CPU quota overload
sum by (cluster_id) (kube_resourcequota{job=~".*kube-state-metrics", type="hard", resource="cpu"})/sum by (cluster_id) (kube_node_status_allocatable_cpu_cores) > 1.5
5m
The CPU quota in the cluster exceeds 1.5 times the total allocatable CPU cores
Cluster memory quota overload
sum by (cluster_id) (kube_resourcequota{job=~".*kube-state-metrics", type="hard", resource="memory"}) / sum by (cluster_id) (kube_node_status_allocatable_memory_bytes) > 1.5
5m
The memory quota in the cluster exceeds 1.5 times the total allocatable memory
Imminent runout of quota resources
sum by (cluster_id, namespace, resource) (kube_resourcequota{job=~".*kube-state-metrics", type="used"}) / sum by (cluster_id, namespace, resource) (kube_resourcequota{job=~".*kube-state-metrics", type="hard"} > 0) >= 0.9
15m
The quota resource utilization exceeds 90%
High proportion of restricted CPU execution cycles
sum(increase(container_cpu_cfs_throttled_periods_total{container!="", }[5m])) by (cluster_id, container, pod, namespace) /sum(increase(container_cpu_cfs_periods_total{}[5m])) by (cluster_id, container, pod, namespace) > ( 25 / 100 )
15m
The proportion of restricted (throttled) CPU execution cycles exceeds 25%
High Pod CPU utilization
sum(rate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!="", container!="POD"}[1m])) by (cluster_id, namespace, pod, container) / sum(kube_pod_container_resource_limits_cpu_cores) by (cluster_id, namespace, pod, container) > 0.75
15m
The Pod CPU utilization exceeds 75% of the CPU limit
High Pod memory utilization
sum(rate(container_memory_working_set_bytes{job="kubelet", metrics_path="/metrics/cadvisor", image!="", container!="POD"}[1m])) by (cluster_id, namespace, pod, container) /sum(kube_pod_container_resource_limits_memory_bytes) by (cluster_id, namespace, pod, container) > 0.75
15m
The Pod memory utilization exceeds 75% of the memory limit
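
The two overload rules at the top of this table compare the CPU and memory requested by Pods against (N-1)/N of the cluster's allocatable resources, where N is the node count; with 4 nodes, for example, the alert fires once requests exceed 3/4 = 75% of allocatable capacity, because a single node failure could then no longer be absorbed. The utilization rules at the bottom compare usage against the container limits, as in the illustrative sketch below (group name and labels are not part of the preset).

groups:
- name: kubernetes-resources          # illustrative group name
  rules:
  - alert: PodCpuUtilizationHigh      # "High Pod CPU utilization"
    # Measured CPU usage divided by the container's CPU limit, per container.
    expr: |
      sum(rate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!="", container!="POD"}[1m])) by (cluster_id, namespace, pod, container)
        / sum(kube_pod_container_resource_limits_cpu_cores) by (cluster_id, namespace, pod, container) > 0.75
    for: 15m
    labels:
      severity: warning               # illustrative
    annotations:
      description: The Pod CPU utilization exceeds 75% of the CPU limit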

Kubernetes Workload

Rule Name
Rule Expression
Duration
Description
Frequent Pod restarts
increase(kube_pod_container_status_restarts_total{job=~".*kube-state-metrics"}[5m]) > 0
15m
The Pod was frequently restarted in the last 5 minutes
Exceptional Pod status
sum by (namespace, pod, cluster_id) (max by(namespace, pod, cluster_id) (kube_pod_status_phase{job=~".*kube-state-metrics", phase=~"Pending|Unknown"}) * on(namespace, pod, cluster_id) group_left(owner_kind) topk by(namespace, pod) (1, max by(namespace, pod, owner_kind, cluster_id) (kube_pod_owner{owner_kind!="Job"}))) > 0
15m
The Pod is in the `NotReady` status for over 15 minutes
Exceptional container status
sum by (namespace, pod, container, cluster_id) (kube_pod_container_status_waiting_reason{job=~".*kube-state-metrics"}) > 0
1h
The container is in the `Waiting` status for a long period of time
Deployment version mismatch
kube_deployment_status_observed_generation{job=~".*kube-state-metrics"} !=kube_deployment_metadata_generation{job=~".*kube-state-metrics"}
15m
The Deployment version is different from the set version, which indicates that the Deployment change hasn't taken effect
Deployment replica quantity mismatch
(kube_deployment_spec_replicas{job=~".*kube-state-metrics"} != kube_deployment_status_replicas_available{job=~".*kube-state-metrics"}) and (changes(kube_deployment_status_replicas_updated{job=~".*kube-state-metrics"}[5m]) == 0)
15m
The actual number of replicas is different from the set number of replicas
StatefulSet version mismatch
kube_statefulset_status_observed_generation{job=~".*kube-state-metrics"} != kube_statefulset_metadata_generation{job=~".*kube-state-metrics"}
15m
The StatefulSet version is different from the set version, which indicates that the StatefulSet change hasn't taken effect
StatefulSet replica quantity mismatch
(kube_statefulset_status_replicas_ready{job=~".*kube-state-metrics"} != kube_statefulset_status_replicas{job=~".*kube-state-metrics"}) and ( changes(kube_statefulset_status_replicas_updated{job=~".*kube-state-metrics"}[5m]) == 0)
15m
The actual number of replicas is different from the set number of replicas
Ineffective StatefulSet update
(max without(revision) (kube_statefulset_status_current_revision{job=~".*kube-state-metrics"} unless kube_statefulset_status_update_revision{job=~".*kube-state-metrics"}) * (kube_statefulset_replicas{job=~".*kube-state-metrics"} != kube_statefulset_status_replicas_updated{job=~".*kube-state-metrics"})) and (changes(kube_statefulset_status_replicas_updated{job=~".*kube-state-metrics"}[5m]) == 0)
15m
The StatefulSet hasn't been updated on some Pods
Frozen DaemonSet change
((kube_daemonset_status_current_number_scheduled{job=~".*kube-state-metrics"}!=kube_daemonset_status_desired_number_scheduled{job=~".*kube-state-metrics"}) or (kube_daemonset_status_number_misscheduled{job=~".*kube-state-metrics"}!=0) or (kube_daemonset_updated_number_scheduled{job=~".*kube-state-metrics"}!=kube_daemonset_status_desired_number_scheduled{job=~".*kube-state-metrics"}) or (kube_daemonset_status_number_available{job=~".*kube-state-metrics"}!=kube_daemonset_status_desired_number_scheduled{job=~".*kube-state-metrics"})) and (changes(kube_daemonset_updated_number_scheduled{job=~".*kube-state-metrics"}[5m])==0)
15m
The DaemonSet rollout has been stuck for more than 15 minutes
DaemonSet not scheduled on some nodes
kube_daemonset_status_desired_number_scheduled{job=~".*kube-state-metrics"} - kube_daemonset_status_current_number_scheduled{job=~".*kube-state-metrics"} > 0
10m
The DaemonSet is not scheduled on some nodes
Faulty scheduling of DaemonSet on some nodes
kube_daemonset_status_number_misscheduled{job=~".*kube-state-metrics"} > 0
15m
The DaemonSet is incorrectly scheduled to some nodes
Excessive Job execution
kube_job_spec_completions{job=~".*kube-state-metrics"} - kube_job_status_succeeded{job=~".*kube-state-metrics"} > 0
12h
The Job has been running for more than 12 hours without completing
Job execution failure
kube_job_failed{job=~".*kube-state-metrics"} > 0
15m
Job execution failed
Mismatch between replica quantity and HPA
(kube_hpa_status_desired_replicas{job=~".*kube-state-metrics"} != kube_hpa_status_current_replicas{job=~".*kube-state-metrics"}) and changes(kube_hpa_status_current_replicas[15m]) == 0
15m
The actual number of replicas is different from that set in HPA
Number of replicas reaching maximum value in HPA
kube_hpa_status_current_replicas{job=~".*kube-state-metrics"} == kube_hpa_spec_max_replicas{job=~".*kube-state-metrics"}
15m
The actual number of replicas reaches the maximum value configured in HPA
Exceptional PersistentVolume status
kube_persistentvolume_status_phase{phase=~"Failed|Pending",job=~".*kube-state-metrics"} > 0
15m
The PersistentVolume is in the `Failed` or `Pending` status
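
The "version mismatch" rules above compare Kubernetes generations: `metadata_generation` increments whenever the object's spec changes, while `observed_generation` records the last generation the controller has processed, so a lasting difference means the change has not been rolled out. A sketch of the Deployment rule in rule-file form (group name and labels are illustrative):

groups:
- name: kubernetes-workloads                   # illustrative group name
  rules:
  - alert: DeploymentGenerationMismatch        # "Deployment version mismatch"
    # observed_generation lags metadata_generation while a spec change
    # has not yet been acted on by the Deployment controller.
    expr: |
      kube_deployment_status_observed_generation{job=~".*kube-state-metrics"}
        != kube_deployment_metadata_generation{job=~".*kube-state-metrics"}
    for: 15m
    labels:
      severity: warning                        # illustrative
    annotations:
      description: The Deployment change hasn't taken effect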

Kubernetes Node

Rule Name
Rule Expression
Duration
Description
Imminent runout of filesystem space
(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}/node_filesystem_size_bytes{job="node-exporter",fstype!=""}*100<15 and predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h],4*60*60)<0 and node_filesystem_readonly{job="node-exporter",fstype!=""}==0)
1h
It is estimated that the filesystem space will be used up in 4 hours
High filesystem space utilization
(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}/node_filesystem_size_bytes{job="node-exporter",fstype!=""}*100<5 and node_filesystem_readonly{job="node-exporter",fstype!=""}==0)
1h
The available filesystem space is below 5%
Imminent runout of filesystem inodes
(node_filesystem_files_free{job="node-exporter",fstype!=""}/node_filesystem_files{job="node-exporter",fstype!=""}*100<20 and predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h],4*60*60)<0 and node_filesystem_readonly{job="node-exporter",fstype!=""}==0)
1h
It is estimated that the filesystem inodes will be used up in 4 hours
High filesystem inode utilization
(node_filesystem_files_free{job="node-exporter",fstype!=""}/node_filesystem_files{job="node-exporter",fstype!=""}*100<3 and node_filesystem_readonly{job="node-exporter",fstype!=""}==0)
1h
The proportion of available inodes is below 3%
Unstable network interface status
changes(node_network_up{job="node-exporter",device!~"veth.+"}[2m]) > 2
2m
The network interface status is unstable and frequently changes between "up" and "down"
Network interface data reception error
increase(node_network_receive_errs_total[2m]) > 10
1h
An error occurred while the network interface received data
Network interface data sending error
increase(node_network_transmit_errs_total[2m]) > 10
1h
An error occurred while the network interface sent data
Unsynced server clock
min_over_time(node_timex_sync_status[5m]) == 0
10m
The server time has not been synced recently. Please check whether NTP is correctly configured
Server clock skew
(node_timex_offset_seconds>0.05 and deriv(node_timex_offset_seconds[5m])>=0) or (node_timex_offset_seconds<-0.05 and deriv(node_timex_offset_seconds[5m])<=0)
10m
The server clock is out of sync by more than 0.05 seconds. Please check whether NTP is correctly configured
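
The two "imminent runout" rules rely on `predict_linear`, which fits a linear trend to the last 6 hours of samples and extrapolates it 4 hours (4*60*60 seconds) ahead; the alert fires only when free space or inodes are already low, the extrapolation reaches zero, and the filesystem is writable. An illustrative sketch of the filesystem space rule (group name and labels are not part of the preset):

groups:
- name: node-exporter                          # illustrative group name
  rules:
  - alert: NodeFilesystemSpaceFillingUp        # "Imminent runout of filesystem space"
    # Free space already below 15%, the 6-hour trend predicts exhaustion
    # within 4 hours, and the filesystem is not read-only.
    expr: |
      (
        node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 15
        and predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0
        and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
      )
    for: 1h
    labels:
      severity: warning                        # illustrative
    annotations:
      description: It is estimated that the filesystem space will be used up in 4 hours

If you adapt any of these expressions into your own rule files, `promtool check rules <file>` validates the syntax before the rules are loaded.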
