Description of tke-monitor-agent

Last updated: 2024-02-01 10:07:57

Overview

Tencent Cloud upgraded the basic monitoring architecture to improve the stability of the TKE basic monitoring and alarming feature. After the upgrade, a DaemonSet named tke-monitor-agent is deployed in the kube-system namespace of the cluster, and the K8s resource objects required for authentication and authorization are created: a ClusterRole, a ServiceAccount, and a ClusterRoleBinding. These resource objects are all named tke-monitor-agent.

Strengths

This add-on collects the monitoring data of containers, Pods, nodes, and community add-ons. The collected data is used for the basic monitoring metrics display, metric alarming, and metric-based HPA service in the console. Deploying this add-on resolves the problem of monitoring data being unavailable when the basic monitoring service is unstable, giving you more stable monitoring, alarming, and HPA services.

Impact

Deploying this add-on does not affect the normal running of the cluster.
If node resources are allocated unreasonably, the node load is too heavy, or node resources are insufficient, the Pods of the tke-monitor-agent DaemonSet may enter the Pending, Evicted, OOMKilled, CrashLoopBackOff, or ContainerCreating status after the basic monitoring add-on is deployed. The statuses are described below:
Pending: The node does not have enough resources for the Pod to be scheduled. You can get the Pod scheduled to the node by setting the resource requests of the tke-monitor-agent DaemonSet to 0. For more information, see Pod Remains in Pending.
Evicted: This status may be caused by insufficient node resources or a heavy load on the node. To find the cause, run kubectl describe pod -n kube-system <podName> and check the descriptions in the Message and Events fields.
CrashLoopBackOff or OOMKilled: Run kubectl describe pod -n kube-system <podName> to check whether an OOM error occurred. If so, increase the memory limit, which can't exceed 100 MB. If the error persists after the limit is set to 100 MB, submit a ticket for assistance.
ContainerCreating: Run kubectl describe pod -n kube-system <podName> and check the Events field. If it shows Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "<pod name >": Error response from daemon: Failed to set projid for /data/docker/overlay2/xxx-init: no space left on device, the container data disk is full. Free up space on the data disk to recover.
Note:
If the problem persists, submit a ticket for assistance.
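The two resource adjustments described above (setting requests to 0 for a Pending Pod, and raising the memory limit up to the 100 MB cap for an OOMKilled Pod) can be sketched as a single patch fragment. This is only an illustration under stated assumptions: the file name and the container name are hypothetical, and the actual DaemonSet manifest is not shown in this document, so check the real container name with kubectl describe before using anything like it.

```yaml
# resources-patch.yaml (hypothetical file name)
# Sketch of a strategic merge patch for the tke-monitor-agent DaemonSet.
spec:
  template:
    spec:
      containers:
        - name: tke-monitor-agent   # assumption: the real container name may differ
          resources:
            requests:
              cpu: "0"              # requests of 0 let the Pod be scheduled on a full node
              memory: "0"
            limits:
              memory: 100M          # 100 MB, the stated maximum (Kubernetes reads M as 10^6 bytes, Mi as 2^20)
```

One way to apply such a fragment, assuming a kubectl version that supports the --patch-file flag, is kubectl patch daemonset tke-monitor-agent -n kube-system --patch-file resources-patch.yaml. Editing the DaemonSet with kubectl edit achieves the same result.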
The quantity of resources consumed by each Pod managed by the tke-monitor-agent DaemonSet is positively correlated with the number of Pods and containers running on the node. Below is a sample stress test, which shows low MEM and CPU usage:
Data volume: 220 Pods are deployed on a node, and each Pod contains three containers.
Resources consumed:

| MEM (peak) | CPU (peak) |
| --- | --- |
| About 40 MiB | 0.01C |
The stress test result of the CPU usage is as shown below:


The stress test result of the memory usage is as shown below:



Component Permission Description

Permission Description

The permissions of this component are the minimum required for its current features to operate.

Permission Scenarios

| Feature | Involved Object | Involved Operation Permission |
| --- | --- | --- |
| Gathering the number of Pods and related information in the cluster | ReplicaSets, Deployments, and Pods | list/watch |
| Obtaining cAdvisor metrics by accessing the /metrics endpoint on the kubelet of the node | nodes, nodes/proxy, and nodes/metrics | list/watch/get |
| Delivering metric data with cluster-monitor | services | list/watch |
| Reporting metrics to HPA-Metrics-Server | custommetrics | update |

Permission Definition

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: tke-monitor-agent
rules:
  - apiGroups: ["apps"]
    resources: ["replicasets"]
    verbs: ["list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["list", "watch"]
  - apiGroups: [""]
    resources: ["nodes", "nodes/proxy", "nodes/metrics"]
    verbs: ["list", "watch", "get"]
  - apiGroups: [""]
    resources: ["services"]
    verbs: ["list", "watch"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["list", "watch"]
  - apiGroups: ["monitor.tencent.io"]
    resources: ["custommetrics"]
    verbs: ["update"]
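
The Overview states that a ServiceAccount and a ClusterRoleBinding, both named tke-monitor-agent, are created alongside this ClusterRole, but their manifests are not shown here. The following is a minimal sketch of how such objects are typically wired together in Kubernetes, not the manifests TKE actually ships. The kube-system namespace of the ServiceAccount is an assumption based on where the DaemonSet runs.

```yaml
# Sketch only: illustrates how the ClusterRole above could be bound to a ServiceAccount.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: tke-monitor-agent
  namespace: kube-system          # assumption: same namespace as the DaemonSet
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: tke-monitor-agent
roleRef:                          # the ClusterRole whose rules are granted
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: tke-monitor-agent
subjects:                         # the identity the DaemonSet Pods would run as
  - kind: ServiceAccount
    name: tke-monitor-agent
    namespace: kube-system        # assumption, see above
```

If the objects in your cluster follow this layout, a command such as kubectl auth can-i list pods --as=system:serviceaccount:kube-system:tke-monitor-agent can be used to confirm that a granted permission is in effect.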

