Overview

Last updated: 2024-04-24 15:55:36

Component Overview

Kubernetes schedules Pods based on their resource Requests: a node's schedulable capacity is occupied by the full Request amount of each Pod and cannot be reclaimed while the Pod runs. The native node dedicated scheduler is a scheduling plugin developed by Tencent Kubernetes Engine (TKE) on top of the native Kubernetes kube-scheduler Extender mechanism. It can virtually amplify a node's capacity, resolving the problem of node resources being fully requested while actual utilization stays low.
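For reference, the sketch below shows how a scheduler Extender is generally wired into kube-scheduler through its configuration. This is a minimal illustration only: the endpoint URL, verbs, and weight are assumptions, and TKE's actual crane-scheduler-policy may use a different format.

# Minimal, illustrative Extender wiring for kube-scheduler.
# The URL, verbs, and weight below are assumptions for illustration;
# this is not TKE's actual crane-scheduler-policy.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
extenders:
- urlPrefix: "http://crane-scheduler.kube-system.svc:8080/scheduler"  # hypothetical endpoint
  filterVerb: "filter"          # filter out nodes above the scheduling watermark
  prioritizeVerb: "prioritize"  # prefer nodes with lower actual load
  weight: 1
  enableHTTPS: false
  nodeCacheCapable: true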

Kubernetes Objects Deployed in the Cluster

| Kubernetes Object Name | Type | Requested Resources | Namespace |
| --- | --- | --- | --- |
| crane-scheduler-controller | Deployment | 200m CPU and 200Mi memory per instance; 1 instance | kube-system |
| crane-descheduler | Deployment | 200m CPU and 200Mi memory per instance; 1 instance | kube-system |
| crane-scheduler | Deployment | 200m CPU and 200Mi memory per instance; 3 instances | kube-system |
| crane-scheduler-controller | Service | - | kube-system |
| crane-scheduler | Service | - | kube-system |
| crane-scheduler | ClusterRole | - | - |
| crane-descheduler | ClusterRole | - | - |
| crane-scheduler | ClusterRoleBinding | - | - |
| crane-descheduler | ClusterRoleBinding | - | - |
| crane-scheduler-policy | ConfigMap | - | kube-system |
| crane-descheduler-policy | ConfigMap | - | kube-system |
| ClusterNodeResourcePolicy | CRD | - | - |
| CraneSchedulerConfiguration | CRD | - | - |
| NodeResourcePolicy | CRD | - | - |
| crane-scheduler-controller-mutating-webhook | MutatingWebhookConfiguration | - | - |

Application Scenarios

Scenario 1: Resolving the issue of high node box rate but low utilization

Note:
The fundamental concepts are as follows:
Box rate: the ratio of the sum of the Requests of all Pods on a node to the node's actual specification.
Utilization: the ratio of the total actual resource usage of all Pods on a node to the node's actual specification.
The native Kubernetes scheduler schedules based on Pod Requests. Therefore, even if actual usage on a node is low, new Pods cannot be scheduled once the sum of the Requests of all Pods on the node approaches the node's actual specification, which wastes substantial resources. Moreover, to keep their services stable, businesses tend to request more resources than they need, that is, set a large Request, so node resources are occupied and cannot be freed. As a result, the node's box rate is high while its actual utilization is low.
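As a hypothetical illustration: on an 8-core node, if the CPU Requests of all Pods total 7.5 cores while their actual usage is only 2 cores, the box rate is 7.5/8 ≈ 94% but the utilization is 2/8 = 25%; the node looks full to the scheduler even though most of its CPU sits idle.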
In this case, you can use the native node dedicated scheduler to virtually amplify the node's CPU and memory specifications and enlarge its schedulable resources, so that more Pods can be scheduled onto the node.

Scenario 2: Setting the watermark of the nodes

Setting node watermarks means setting target utilization rates for a node to keep it stable:
Scheduling-time watermark control: sets the target resource utilization of native nodes during scheduling to guarantee stability. When Pods are scheduled, nodes whose utilization is above this watermark are not selected. In addition, among the nodes that meet the watermark requirement, nodes with a lower actual load are preferred, which balances the utilization distribution across the cluster's nodes.
Runtime watermark control: sets the target resource utilization of native nodes at runtime to guarantee stability. At runtime, nodes whose utilization exceeds this watermark can trigger evictions. Because eviction is a high-risk operation, pay attention to the notes below.
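For example, with a hypothetical scheduling watermark of 60% CPU, a node currently running at 70% CPU utilization is filtered out during scheduling even if its box rate leaves room for more Requests; with a hypothetical runtime watermark of 80%, a node sustained above 80% becomes a candidate for eviction. The concrete watermark values here are illustrative, not defaults.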

Notes

1. To avoid draining important Pods, this feature does not evict Pods by default. For Pods that can be safely drained, you must explicitly mark the workload the Pod belongs to. For example, a StatefulSet, Deployment, or other object can be annotated as drainable (see the sketch after this list):
descheduler.alpha.kubernetes.io/evictable: 'true'
2. It is recommended that you enable event persistence for the cluster so that you can better monitor component exceptions and troubleshoot issues. When a Pod is evicted, corresponding events are generated; you can check whether a Pod is being repeatedly evicted based on the Descheduled event.
3. Eviction imposes requirements on nodes: the cluster must have 3 or more low-load native nodes, where low-load means the node's load is below its runtime watermark.
4. After nodes are filtered, the workloads on a node are drained. This requires that the workload's ready replica count is greater than or equal to 2 and is at least half of the replicas declared in the workload's spec.
5. At the Pod level, if a Pod's load exceeds the node's eviction watermark, the Pod is not evicted, to prevent overloading other nodes by relocating it there.
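For note 1, the following is a minimal sketch of marking a workload as drainable. The Deployment name, labels, and image are placeholders; only the annotation itself comes from this document, and it is placed on the workload object as note 1 describes.

# Illustrative sketch: a Deployment explicitly marked as safe to drain.
# Name, labels, and image are placeholders; the annotation is the one from note 1.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app                # placeholder
  annotations:
    descheduler.alpha.kubernetes.io/evictable: 'true'  # allow the descheduler to evict this workload's Pods
spec:
  replicas: 3                   # keep enough ready replicas (see note 4)
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
      - name: app
        image: nginx:1.25       # placeholder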

Scenario 3: Pods under specified Namespace shall be allocated only to native nodes upon the subsequent scheduling

Native nodes are a new node type launched by the Tencent Cloud TKE team. Built on the technical expertise Tencent Cloud has accumulated from operating tens of millions of container cores, they provide cloud-native, highly stable, and fast-responding Kubernetes node management capabilities. Native nodes support amplifiable node specifications and Request recommendation, so we recommend scheduling your workloads to them to take full advantage of these capabilities. When enabling the native node scheduler, you can select Namespaces; Pods under the specified Namespaces will then be scheduled only to native nodes.
Note:
If native node resources are insufficient at that point, Pods will become Pending.

Limits

This feature is supported only on native nodes. For more information, see Native Node Overview.
Make sure that the Kubernetes version of the cluster is v1.22.5-tke.8, v1.20.6-tke.24, v1.18.4-tke.28, v1.16.3-tke.30, or later. To upgrade a cluster, see Upgrading a Cluster.

Risk Control

After this component is uninstalled, only the scheduling logic of the native node dedicated scheduler is removed; the scheduling capability of the native kube-scheduler is unaffected. Pods already scheduled to native nodes are not affected, because they have already been placed. However, if the kubelet on a native node restarts, Pods may be evicted, because the sum of the Requests of the Pods on the node may exceed the node's real specification.
If the amplification coefficient is adjusted downwards, existing Pods on native nodes are not affected, because they have already been placed. However, if the kubelet on a native node restarts, Pods may be evicted, because the sum of the Requests of the Pods on the node may exceed the node's specification after the reduced amplification.
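As a hypothetical illustration: a 4-core native node amplified by a coefficient of 1.5 is scheduled as if it had 6 cores. If the Pods on it already request 5 cores and the coefficient is then lowered to 1.0, a kubelet restart may evict Pods until the remaining Requests fit within the real 4 cores.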
The node resources that users see in the Kubernetes cluster will differ from the resources of the corresponding CVM nodes.
Excessive load and instability issues may arise later.
After node specifications are amplified, the node's kubelet and resource QoS-related modules may be affected. For example, kubelet CPU pinning: when a 4-core node is scheduled as an 8-core node, the CPU core binding of Pods may be affected.

Component Permission Description

Crane Scheduler Permission

Permission Description

The permissions of this component are the minimal set required for its features to operate.

Permission Scenarios

| Feature | Involved Object | Involved Operation Permission |
| --- | --- | --- |
| Track node updates and changes, and obtain node utilization. | nodes | get/watch/list |
| Track Pod updates and changes, and determine node scheduling priority based on recent Pod scheduling in the cluster. | pods/namespaces | get/watch/list |
| Write node utilization onto node resources, decoupling the scheduling logic from the query logic. | nodes/status | patch |
| Support multiple replicas to ensure component availability. | leases | create/get/update |
| Track ConfigMap updates and changes to implement scheduling specified Pods to native nodes. | configmaps | get/list/watch |

Permission Definition

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: crane-scheduler
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - nodes
  - namespaces
  verbs:
  - list
  - watch
  - get
- apiGroups:
  - ""
  resources:
  - nodes/status
  verbs:
  - patch
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - extensions
  - apps
  resources:
  - deployments/scale
  verbs:
  - get
  - update
- apiGroups:
  - coordination.k8s.io
  resources:
  - leases
  verbs:
  - create
  - get
  - update
- apiGroups:
  - "scheduling.crane.io"
  resources:
  - clusternoderesourcepolicies
  - noderesourcepolicies
  - craneschedulerconfigurations
  verbs:
  - get
  - list
  - watch
  - update
  - create
  - patch

Crane Descheduler Permission

Permission Description

The permissions of this component are the minimal set required for its features to operate.

Permission Scenarios

| Feature | Involved Object | Involved Operation Permission |
| --- | --- | --- |
| Track node updates and changes, and obtain node utilization. | nodes | get/watch/list |
| Track Pod updates and changes, and determine which Pods to evict first based on Pod information in the cluster. | pods | get/watch/list |
| Drain Pods. | pods/eviction | create |
| Determine whether the ready replicas of the workload a Pod belongs to reach half or more of the required total, to decide whether the Pod can be drained. | replicasets/deployments/statefulsets/statefulsetpluses/jobs | get |
| Report events when draining Pods. | events | create |

Permission Definition

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: crane-descheduler
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "watch", "list"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list"]
- apiGroups: [""]
  resources: ["nodes/status"]
  verbs: ["patch"]
- apiGroups: [""]
  resources: ["pods/eviction"]
  verbs: ["create"]
- apiGroups: ["*"]
  resources: ["replicasets"]
  verbs: ["get"]
- apiGroups: ["*"]
  resources: ["deployments"]
  verbs: ["get"]
- apiGroups: ["apps"]
  resources: ["statefulsets"]
  verbs: ["get"]
- apiGroups: ["platform.stke"]
  resources: ["statefulsetpluses"]
  verbs: ["get"]
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create"]
- apiGroups: ["*"]
  resources: ["jobs"]
  verbs: ["get"]
- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"]
  # The source snippet is truncated here; verbs assumed to match the
  # crane-scheduler leases rule above.
  verbs: ["create", "get", "update"]
